Observability & Replay Module

Overview

Runtime visibility and debugging for agent execution

The Observability Module records what the agent did, why it did it, what tools it touched, how long each step took, how much it cost, and exactly where it failed. It turns the agent from a black box into an inspectable system with full trace replay capabilities.

Traces

Complete execution records

Replay

Step-by-step debugging

Cost

Per-run breakdown

Anomalies

Automated detection

Core Features

Execution Tracing

Capture complete traces of agent execution with all spans, inputs, and outputs.

• LLM call tracking (model, tokens, cost)
• Tool call recording (args, results, duration)
• Decision point capture
• Error and retry tracking
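
The items above imply what a recorded span has to carry. The following is a minimal sketch of such a record, not the module's actual schema: the class name `Span` and every field name (`kind`, `duration_ms`, `cost_usd`, `attributes`, `retries`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """Hypothetical span record; all field names are illustrative assumptions."""
    span_id: str
    kind: str                 # e.g. "llm_call", "tool_call", "decision"
    name: str
    duration_ms: float = 0.0
    cost_usd: float = 0.0
    attributes: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None   # set when the step failed
    retries: int = 0              # number of retries before the final outcome


# An LLM call carries model, token counts, and cost.
llm_span = Span(
    span_id="s1", kind="llm_call", name="plan",
    duration_ms=820.0, cost_usd=0.0042,
    attributes={"model": "example-model", "input_tokens": 512, "output_tokens": 128},
)

# A tool call carries args, result, and duration; this one failed after two retries.
tool_span = Span(
    span_id="s2", kind="tool_call", name="search",
    duration_ms=140.0,
    attributes={"args": {"q": "status"}, "result": None},
    error="timeout", retries=2,
)


def total_cost(spans: list[Span]) -> float:
    """Sum the cost of all spans in a trace."""
    return sum(s.cost_usd for s in spans)
```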

Step-by-Step Replay

Replay past executions to debug failures or understand behavior.

• Interactive step mode
• Modified replay (test fixes)
• Span context inspection
• Redacted sensitive data
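
How the module redacts data during replay is not described here. One common approach is to mask values by key name before a span's inputs and outputs are shown. The sketch below illustrates that idea only; the key list `SENSITIVE_KEYS` and the `[REDACTED]` mask are assumptions.

```python
from typing import Any

# Assumed list of key names to mask; a real deployment would configure this.
SENSITIVE_KEYS = {"password", "api_key", "authorization", "token"}


def redact(value: Any, mask: str = "[REDACTED]") -> Any:
    """Return a copy of value with sensitive dict entries replaced by mask."""
    if isinstance(value, dict):
        return {
            key: mask
            if isinstance(key, str) and key.lower() in SENSITIVE_KEYS
            else redact(item, mask)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, mask) for item in value]
    return value  # scalars pass through unchanged


span_input = {"args": {"api_key": "abc123", "q": "status"}, "headers": [{"Token": "t"}]}
safe_input = redact(span_input)
```

The original object is left untouched, so the unredacted trace can still be stored under stricter access controls.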

Failure Clustering

Automatically group similar failures to identify patterns.

• Error type clustering
• Impact analysis
• Related trace linking
• Trend detection
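
The clustering method used by the module is not specified. As a rough illustration of error type clustering, failures can be grouped by a signature built from the error type and a normalized message. The record fields (`trace_id`, `error_type`, `message`) and the normalization rule below are assumptions.

```python
import re
from collections import defaultdict


def signature(error_type: str, message: str) -> tuple[str, str]:
    """Build a cluster key: digits are replaced so '30s' and '45s' match."""
    normalized = re.sub(r"\d+", "<n>", message.lower())
    return (error_type, normalized)


def cluster_failures(failures: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group trace IDs by failure signature."""
    clusters: dict[tuple[str, str], list[str]] = defaultdict(list)
    for failure in failures:
        key = signature(failure["error_type"], failure["message"])
        clusters[key].append(failure["trace_id"])
    return dict(clusters)


failures = [
    {"trace_id": "t1", "error_type": "TimeoutError", "message": "Tool search timed out after 30s"},
    {"trace_id": "t2", "error_type": "TimeoutError", "message": "Tool search timed out after 45s"},
    {"trace_id": "t3", "error_type": "ValueError", "message": "Bad JSON at position 12"},
]
clusters = cluster_failures(failures)
```

The size of each cluster gives a simple impact measure, and the stored trace IDs provide the links back to related traces.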

Cost Analysis

Track costs by model, tool, agent, and time period.

• Per-model cost breakdown
• Tool cost analysis
• Trend visualization
• Cost estimation for workflows
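
A per-model or per-tool breakdown amounts to summing span costs under a grouping key. The sketch below shows that aggregation on in-memory span dictionaries; the field names `model`, `tool`, and `cost_usd` are assumptions, not the module's schema.

```python
from collections import defaultdict


def cost_by(spans: list[dict], key: str) -> dict[str, float]:
    """Sum cost_usd per value of key; spans missing the key go under 'unknown'."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(key, "unknown")] += span["cost_usd"]
    return dict(totals)


spans = [
    {"model": "model-a", "cost_usd": 0.25},
    {"model": "model-a", "cost_usd": 0.5},
    {"model": "model-b", "cost_usd": 0.125},
    {"tool": "search", "cost_usd": 0.0625},
]
by_model = cost_by(spans, "model")
by_tool = cost_by(spans, "tool")
```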

Trace Comparison

Compare two traces via POST /api/v1/observability/compare to understand what changed between executions:

Execution Flow

Compare step sequences and branching decisions

Outputs

Identify where results diverged

Performance

Duration and cost differences
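
To make the execution flow comparison concrete, the sketch below diffs two step sequences locally with Python's `difflib`. It illustrates the idea only: it is not the algorithm behind the compare endpoint, and the step names are made up.

```python
from difflib import SequenceMatcher


def compare_steps(steps_a: list[str], steps_b: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """Return the non-matching regions between two step sequences.

    Each entry is (operation, steps only in A, steps only in B), where
    operation is 'insert', 'delete', or 'replace'.
    """
    matcher = SequenceMatcher(a=steps_a, b=steps_b, autojunk=False)
    return [
        (tag, steps_a[i1:i2], steps_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


baseline = ["plan", "search", "summarize"]
candidate = ["plan", "search", "retry_search", "summarize"]
differences = compare_steps(baseline, candidate)
```

Here the second run contains one extra `retry_search` step, which is the point where the two executions diverged.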

Anomaly Detection

Automatic detection of unusual patterns in agent behavior:

Cost Spike

Sudden increase in execution costs

Auto-alert

Latency Increase

Response time degradation

Auto-alert

Error Rate

Unusual failure patterns

Auto-alert
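
The detection method behind these alerts is not documented here. One simple way to flag a cost spike or latency increase is to compare the latest value with a recent baseline using a z-score. The sketch below shows that technique under assumed inputs; the function name `is_spike` and the threshold of 3 standard deviations are illustrative choices.

```python
from statistics import mean, stdev


def is_spike(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag latest if it sits more than threshold standard deviations above the baseline."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return latest > baseline  # flat baseline: any increase is unusual
    return (latest - baseline) / spread > threshold


recent_costs = [0.10, 0.11, 0.09, 0.10, 0.12]  # cost per run, in dollars
spike = is_spike(recent_costs, 0.50)
normal = is_spike(recent_costs, 0.11)
```

The same check works for latency by passing durations instead of costs. An error-rate check would need a different statistic, since rates are bounded between 0 and 1.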

API Endpoints

Trace Management

POST /api/v1/observability/traces

Start a new trace for an agent execution.

Request:

{
  "agent_id": "agent-uuid",
  "workflow_id": "optional-workflow-uuid",
  "trace_type": "workflow",
  "environment": "production"
}
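
A minimal sketch of building this request with Python's standard library. The host in `BASE_URL` is a placeholder, and authentication is omitted because it is not described here; both would need to be adapted to the actual deployment.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "https://agents.example.com"  # placeholder host (assumption)


def build_start_trace_request(
    agent_id: str,
    workflow_id: Optional[str] = None,
    trace_type: str = "workflow",
    environment: str = "production",
) -> Request:
    """Build (but do not send) the POST request that starts a trace."""
    payload = {
        "agent_id": agent_id,
        "trace_type": trace_type,
        "environment": environment,
    }
    if workflow_id is not None:  # workflow_id is optional in the request body
        payload["workflow_id"] = workflow_id
    return Request(
        f"{BASE_URL}/api/v1/observability/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_start_trace_request("agent-uuid", workflow_id="optional-workflow-uuid")
```

Passing `request` to `urllib.request.urlopen` would send it. The shape of the response is not documented in this section, so it is not parsed here.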

GET /api/v1/traces

List traces (alias endpoint).

GET /api/v1/observability/traces

Search traces with filtering.

Query Parameters:

agent_id               Filter by agent
status                 running | completed | failed
from_date / to_date    Time range filter
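
The filters can be combined into a query string. The sketch below builds the search URL and validates `status` against the three documented values. The date format is not specified here, so plain ISO dates are used as an assumption, and the host is a placeholder.

```python
from typing import Optional
from urllib.parse import urlencode

VALID_STATUSES = {"running", "completed", "failed"}


def build_trace_search_url(
    base_url: str,
    agent_id: Optional[str] = None,
    status: Optional[str] = None,
    from_date: Optional[str] = None,  # format assumed: ISO 8601 date
    to_date: Optional[str] = None,
) -> str:
    """Build the GET URL for searching traces; unset filters are left out."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    filters = {
        "agent_id": agent_id,
        "status": status,
        "from_date": from_date,
        "to_date": to_date,
    }
    query = urlencode({name: value for name, value in filters.items() if value is not None})
    url = f"{base_url}/api/v1/observability/traces"
    return f"{url}?{query}" if query else url


url = build_trace_search_url(
    "https://agents.example.com", agent_id="agent-uuid", status="failed", from_date="2024-01-01"
)
```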

GET /api/v1/observability/traces/:id

Get trace details by ID.

PUT /api/v1/observability/traces/:id

End a trace.

Spans

POST /api/v1/observability/traces/:trace_id/spans

Record a span within a trace.

PUT /api/v1/observability/spans/:id

End a span.

PUT /api/v1/observability/spans/:id/fail

Mark a span as failed.

GET /api/v1/observability/traces/:trace_id/spans/:span_id/context

Get span context for a specific trace and span.
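
The record, end, and fail endpoints map naturally onto a context manager: open the span on entry, end it on a clean exit, and mark it failed if the wrapped code raises. The sketch below assumes several things that are not documented here: the request bodies (`name`, `kind`, `error`), an `id` field in the create response, and a caller-supplied `send(method, path, body)` transport. A fake transport is used so the example runs without a server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

Send = Callable[[str, str, dict], dict]  # hypothetical transport signature


@contextmanager
def traced_span(send: Send, trace_id: str, name: str, kind: str) -> Iterator[str]:
    """Record a span, then end it or mark it failed depending on the outcome."""
    created = send(
        "POST",
        f"/api/v1/observability/traces/{trace_id}/spans",
        {"name": name, "kind": kind},  # assumed request fields
    )
    span_id = created["id"]  # assumed response field
    try:
        yield span_id
    except Exception as exc:
        send("PUT", f"/api/v1/observability/spans/{span_id}/fail", {"error": str(exc)})
        raise  # the caller still sees the original exception
    else:
        send("PUT", f"/api/v1/observability/spans/{span_id}", {})


# Fake transport that records calls instead of making HTTP requests.
calls: list[tuple[str, str, dict]] = []


def fake_send(method: str, path: str, body: dict) -> dict:
    calls.append((method, path, body))
    return {"id": "span-1"}


with traced_span(fake_send, "trace-1", "search", "tool_call") as span_id:
    pass  # the tool call would run here
```

Wrapping every LLM and tool call this way keeps the span lifecycle in one place, so a failed step is never left open.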

Search

GET /api/v1/observability/search

Search traces with query parameters.

GET /api/v1/observability/search/content

Search traces by content.

Replay

POST /api/v1/observability/traces/:id/replay

Replay a trace for debugging.

Request:

{
  "step_mode": true,
  "stop_on_error": true,
  "redact_sensitive": true
}

POST /api/v1/observability/replay/:replay_id/step

Advance a replay session by one step.
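
Starting a replay in step mode and then advancing it can be wrapped in a generator. The response shapes are not documented here, so the sketch assumes that the start call returns a `replay_id` field and that each step response carries a `done` flag. The `send(method, path, body)` transport is also hypothetical and is supplied by the caller.

```python
from typing import Callable, Iterator

Send = Callable[[str, str, dict], dict]  # hypothetical transport signature


def replay_steps(send: Send, trace_id: str, max_steps: int = 1000) -> Iterator[dict]:
    """Start a step-mode replay and yield each step response until it finishes."""
    session = send(
        "POST",
        f"/api/v1/observability/traces/{trace_id}/replay",
        {"step_mode": True, "stop_on_error": True, "redact_sensitive": True},
    )
    replay_id = session["replay_id"]  # assumed response field
    for _ in range(max_steps):  # upper bound guards against a session that never ends
        step = send("POST", f"/api/v1/observability/replay/{replay_id}/step", {})
        yield step
        if step.get("done"):  # assumed completion flag
            return


# Scripted fake transport: one start response, then three step responses.
scripted = iter([{"replay_id": "r1"}, {"done": False}, {"done": False}, {"done": True}])
paths: list[str] = []


def fake_send(method: str, path: str, body: dict) -> dict:
    paths.append(path)
    return next(scripted)


steps = list(replay_steps(fake_send, "t1"))
```

Because the generator yields after every step, a caller can inspect each response and stop early simply by breaking out of the loop.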

POST /api/v1/observability/traces/:id/replay-modified

Replay a trace with modifications to test fixes.

POST /api/v1/observability/compare

Compare two traces to understand what changed between executions.

Cost Analysis

GET /api/v1/observability/costs/breakdown/:agent_id

Get cost breakdown by model and tool for an agent.

GET /api/v1/observability/costs/by-model

Get aggregated costs grouped by model.

GET /api/v1/observability/costs/by-tool

Get aggregated costs grouped by tool.

GET /api/v1/observability/costs/trend/:agent_id

Get cost trend over time for an agent.

POST /api/v1/observability/costs/estimate

Estimate cost for a workflow before execution.

Failure Clustering

GET /api/v1/observability/failures/clusters

List all failure clusters.

GET /api/v1/observability/failures/clusters/:id

Get details for a specific failure cluster.

GET /api/v1/observability/failures/clusters/:cluster_id/traces

Get traces associated with a failure cluster.

Anomaly Detection

POST /api/v1/observability/anomalies/detect/:agent_id

Detect anomalies for a specific agent.

GET /api/v1/observability/anomalies/active

List currently active anomalies across all agents.

PUT /api/v1/observability/anomalies/:id/acknowledge

Acknowledge an anomaly.

Best Practices

Tracing

• Always start a trace at the beginning of the workflow
• Record all LLM and tool calls as spans
• Tag traces with relevant metadata

Analysis

• Review failure clusters weekly
• Set up anomaly alerts for production
• Use trace comparison (POST /api/v1/observability/compare) to validate changes
