Observability & Replay Module
Overview
Runtime visibility and debugging for agent execution
The Observability Module records what the agent did, why it did it, what tools it touched, how long each step took, how much it cost, and exactly where it failed. It turns the agent from a black box into an inspectable system with full trace replay capabilities.
Core Features
Execution Tracing
Capture complete traces of agent execution with all spans, inputs, and outputs.
- LLM call tracking (model, tokens, cost)
- Tool call recording (args, results, duration)
- Decision point capture
- Error and retry tracking
Step-by-Step Replay
Replay past executions to debug failures or understand behavior.
- Interactive step mode
- Modified replay (test fixes)
- Span context inspection
- Sensitive data redaction
Failure Clustering
Automatically group similar failures to identify patterns.
- Error type clustering
- Impact analysis
- Related trace linking
- Trend detection
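One way to picture error type clustering is to group failures by a normalized error signature. The sketch below is illustrative only: the `signature` normalization rule and the `(trace_id, error_message)` input shape are assumptions, not the module's actual algorithm.

```python
import re
from collections import defaultdict


def signature(message: str) -> str:
    """Normalize an error message so variable parts do not split clusters."""
    # Assumed rule: lowercase the text and replace every run of digits.
    return re.sub(r"\d+", "<n>", message.lower()).strip()


def cluster_failures(failures):
    """Group (trace_id, error_message) pairs by normalized signature."""
    clusters = defaultdict(list)
    for trace_id, message in failures:
        clusters[signature(message)].append(trace_id)
    return dict(clusters)


failures = [
    ("t1", "Timeout after 30s calling tool search"),
    ("t2", "Timeout after 45s calling tool search"),
    ("t3", "Rate limit exceeded"),
]
clusters = cluster_failures(failures)
```

Here the two timeout failures land in one cluster even though their durations differ, which is the kind of grouping that makes related traces easy to link.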
Cost Analysis
Track costs by model, tool, agent, and time period.
- Per-model cost breakdown
- Tool cost analysis
- Trend visualization
- Cost estimation for workflows
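A per-model cost breakdown can be thought of as a sum over LLM-call spans. The following is a minimal sketch under stated assumptions: the span fields (`model`, `tokens`) and the prices in `PRICE_PER_1K` are hypothetical and are not taken from the module.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices in USD; real prices depend on the provider.
PRICE_PER_1K = {"model-a": 0.5, "model-b": 2.0}


def cost_by_model(spans):
    """Sum the estimated cost per model across LLM-call spans."""
    totals = defaultdict(float)
    for span in spans:
        price = PRICE_PER_1K[span["model"]]
        totals[span["model"]] += span["tokens"] / 1000 * price
    return dict(totals)


spans = [
    {"model": "model-a", "tokens": 2000},
    {"model": "model-b", "tokens": 500},
    {"model": "model-a", "tokens": 1000},
]
breakdown = cost_by_model(spans)
```

The same aggregation keyed by tool name, agent ID, or day would give the other breakdowns listed above.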
Trace Comparison
Compare two traces via POST /api/v1/observability/compare to understand what changed between executions:
- Compare step sequences and branching decisions
- Identify where results diverged
- Compare duration and cost differences
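To illustrate what identifying a divergence point means, the sketch below compares two step sequences locally. It is not the implementation behind the compare endpoint; the step names and the `first_divergence` helper are hypothetical.

```python
def first_divergence(steps_a, steps_b):
    """Return the index of the first step that differs, or None if identical."""
    for i, (a, b) in enumerate(zip(steps_a, steps_b)):
        if a != b:
            return i
    # One trace is a prefix of the other: they diverge where the shorter ends.
    if len(steps_a) != len(steps_b):
        return min(len(steps_a), len(steps_b))
    return None


baseline = ["plan", "search", "summarize", "respond"]
candidate = ["plan", "search", "retry_search", "summarize", "respond"]
index = first_divergence(baseline, candidate)
```

In this example the traces agree on the first two steps and diverge at index 2, where the candidate run inserted a retry.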
Anomaly Detection
Automatic detection of unusual patterns in agent behavior:
- Sudden increase in execution costs
- Response time degradation
- Unusual failure patterns
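As a rough illustration of detecting a sudden cost increase, the sketch below flags a value that sits far above the recent mean. The z-score rule and the threshold of 3.0 are assumptions chosen for the example; the module's detection method is not described here.

```python
from statistics import mean, stdev


def is_cost_spike(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu
    return (latest - mu) / sigma > threshold


daily_costs = [10.0, 11.0, 9.5, 10.5, 10.0]
spike = is_cost_spike(daily_costs, 25.0)
normal = is_cost_spike(daily_costs, 10.8)
```

The same check could be applied to response times to catch latency degradation.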
API Endpoints
Trace Management
/api/v1/observability/traces
Start a new trace for an agent execution.
{
  "agent_id": "agent-uuid",
  "workflow_id": "optional-workflow-uuid",
  "trace_type": "workflow",
  "environment": "production"
}
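The request body above could be sent from Python roughly as follows. This sketch only builds the request and does not send it. The host in `BASE_URL` is hypothetical, and the `POST` method is an assumption, since the HTTP method for this endpoint is not stated here.

```python
import json
import urllib.request

BASE_URL = "https://example.invalid"  # hypothetical host


def build_start_trace_request(agent_id, environment="production", workflow_id=None):
    """Build (but do not send) a request that starts a trace."""
    payload = {
        "agent_id": agent_id,
        "trace_type": "workflow",
        "environment": environment,
    }
    if workflow_id is not None:
        payload["workflow_id"] = workflow_id  # optional field
    return urllib.request.Request(
        BASE_URL + "/api/v1/observability/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed; the method is not documented above
    )


request = build_start_trace_request("agent-uuid")
```

Passing the result to `urllib.request.urlopen` would send it; any authentication headers the deployment requires would need to be added first.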
/api/v1/traces
List traces (alias endpoint).
/api/v1/observability/traces
Search traces with filtering.
/api/v1/observability/traces/:id
Get trace details by ID.
/api/v1/observability/traces/:id
End a trace.
Spans
/api/v1/observability/traces/:trace_id/spans
Record a span within a trace.
/api/v1/observability/spans/:id
End a span.
/api/v1/observability/spans/:id/fail
Mark a span as failed.
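A client can make sure every span is either ended or marked as failed by wrapping the work in a context manager. The sketch below assumes a caller-supplied `send(path, body)` transport and an `error` field in the failure body; both are hypothetical, and only the two paths come from the endpoints above.

```python
from contextlib import contextmanager


@contextmanager
def span(send, span_id):
    """End the span on success, or mark it failed if the wrapped block raises."""
    try:
        yield
    except Exception as exc:
        # Assumed body shape: the failure payload is not documented above.
        send(f"/api/v1/observability/spans/{span_id}/fail", {"error": str(exc)})
        raise
    else:
        send(f"/api/v1/observability/spans/{span_id}", {})


# Stub transport that records calls so the sketch runs without a server.
calls = []


def fake_send(path, body):
    calls.append((path, body))


with span(fake_send, "s1"):
    pass  # successful work

try:
    with span(fake_send, "s2"):
        raise RuntimeError("tool timed out")
except RuntimeError:
    pass
```

The exception is re-raised after the span is marked as failed, so the caller's own error handling is unaffected.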
/api/v1/observability/traces/:trace_id/spans/:span_id/context
Get span context for a specific trace and span.
Search
/api/v1/observability/search
Search traces with query parameters.
/api/v1/observability/search/content
Search traces by content.
Replay
/api/v1/observability/traces/:id/replay
Replay a trace for debugging.
{
  "step_mode": true,
  "stop_on_error": true,
  "redact_sensitive": true
}
/api/v1/observability/replay/:replay_id/step
Advance a replay session by one step.
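A step-mode replay session might be driven by a loop like the one below. This is a sketch under several assumptions: `post(path, body)` is a caller-supplied transport, and the `replay_id` and `done` response fields are hypothetical names, since the response format is not documented here.

```python
def run_replay(post, trace_id, max_steps=100):
    """Start a step-mode replay and advance it until done or max_steps is reached."""
    options = {"step_mode": True, "stop_on_error": True, "redact_sensitive": True}
    session = post(f"/api/v1/observability/traces/{trace_id}/replay", options)
    replay_id = session["replay_id"]  # assumed response field
    steps = []
    for _ in range(max_steps):
        result = post(f"/api/v1/observability/replay/{replay_id}/step", {})
        steps.append(result)
        if result.get("done"):  # assumed response field
            break
    return steps


def make_fake_post(total_steps):
    """Stub transport so the sketch runs without a server."""
    state = {"served": 0}

    def fake_post(path, body):
        if path.endswith("/replay"):
            return {"replay_id": "r1"}
        state["served"] += 1
        return {"step": state["served"], "done": state["served"] >= total_steps}

    return fake_post


steps = run_replay(make_fake_post(3), "trace-1")
```

The `max_steps` cap keeps the loop bounded if the server never reports completion.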
/api/v1/observability/traces/:id/replay-modified
Replay a trace with modifications to test fixes.
/api/v1/observability/compare
Compare two traces to understand what changed between executions.
Cost Analysis
/api/v1/observability/costs/breakdown/:agent_id
Get cost breakdown by model and tool for an agent.
/api/v1/observability/costs/by-model
Get aggregated costs grouped by model.
/api/v1/observability/costs/by-tool
Get aggregated costs grouped by tool.
/api/v1/observability/costs/trend/:agent_id
Get cost trend over time for an agent.
/api/v1/observability/costs/estimate
Estimate cost for a workflow before execution.
Failure Clustering
/api/v1/observability/failures/clusters
List all failure clusters.
/api/v1/observability/failures/clusters/:id
Get details for a specific failure cluster.
/api/v1/observability/failures/clusters/:cluster_id/traces
Get traces associated with a failure cluster.
Anomaly Detection
/api/v1/observability/anomalies/detect/:agent_id
Detect anomalies for a specific agent.
/api/v1/observability/anomalies/active
List currently active anomalies across all agents.
/api/v1/observability/anomalies/:id/acknowledge
Acknowledge an anomaly.
Best Practices
Tracing
- Always start a trace at the beginning of a workflow
- Record all LLM and tool calls as spans
- Tag traces with relevant metadata
Analysis
- Review failure clusters weekly
- Set up anomaly alerts for production
- Use trace comparison (diff) to validate changes