Task Monitoring & Observability
Self-Healing Task Execution
Event-sourced timelines, intelligent retry, and proactive alerting
The Task Monitoring system replaces the "tasks disappear into failed status" pattern with fully observable, self-healing workflows. Every task execution is recorded on an event-sourced timeline, retried intelligently with exponential backoff and circuit breaker awareness, and escalated through automatic alerting when it exhausts all recovery options.
Key Capabilities
- ◆ Error Classification — Automatically distinguishes retryable errors (HTTP 5xx, timeouts, connection failures) from permanent ones (HTTP 4xx, missing callbacks)
- ◆ Exponential Backoff with Jitter — 1s base, 30s max, 25% jitter factor to prevent thundering herd
- ◆ Circuit Breaker Awareness — Skips retries when provider circuit breakers are open, saving time and resources
- ◆ Event-Sourced Timeline — Every attempt, error, state change, and recovery action is recorded as an append-only event
- ◆ Dead Letter Queue — Exhausted tasks are stored for manual review with full context, never silently lost
- ◆ Proactive Alerting — Alert severity is classified as critical ($100+), high (payment/withdrawal), or medium (all other tasks)
- ◆ Real-Time PubSub — Live dashboard updates via task:{task_id} and user_dashboard:{user_id} topics
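The retry behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the function names, the doubling factor, the symmetric ±25% jitter, and the choice to clamp the jittered delay to 30s are all assumptions; only the 1s base, 30s max, 25% jitter factor, and the retryable/permanent split come from the list.

```python
import random
from typing import Optional

BASE_DELAY_S = 1.0   # 1s base (from the capability list)
MAX_DELAY_S = 30.0   # 30s max (from the capability list)
JITTER = 0.25        # 25% jitter factor (from the capability list)


def is_retryable(status: Optional[int] = None, kind: Optional[str] = None) -> bool:
    """Classify an error as retryable (True) or permanent (False).

    `kind` labels are hypothetical names for the non-HTTP error cases.
    """
    if kind in ("timeout", "connection_failure"):
        return True
    if kind == "missing_callback":
        return False
    if status is not None:
        return 500 <= status <= 599  # 5xx retryable, 4xx (and others) permanent
    return False  # assumption: unknown errors are treated as permanent


def backoff_delay(attempt: int, rng: random.Random) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    # Assumption: the delay doubles on each attempt.
    raw = BASE_DELAY_S * 2 ** (attempt - 1)
    # Assumption: jitter is symmetric, +/-25% around the raw delay.
    jittered = raw * (1 + rng.uniform(-JITTER, JITTER))
    # Clamp last so the 30s ceiling always holds.
    return min(MAX_DELAY_S, jittered)


def next_retry_delay(
    attempt: int,
    status: Optional[int] = None,
    kind: Optional[str] = None,
    breaker_open: bool = False,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return the delay before the next retry, or None if no retry should run."""
    # Skip the retry entirely when the provider's circuit breaker is open
    # or when the error is classified as permanent.
    if breaker_open or not is_retryable(status, kind):
        return None
    return backoff_delay(attempt, rng or random.Random())
```

Under these assumptions, `next_retry_delay(1, status=503)` yields a delay between 0.75s and 1.25s, while a 404 or an open circuit breaker yields `None`, which a caller could treat as the signal to move the task to the Dead Letter Queue.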
Task Timeline Events
Every task execution generates a rich event timeline. Events are append-only and include severity levels for filtering and alerting.
| Event Type | Severity | Description |
|---|---|---|
| task_created | info | Task record created |
| task_started | info | Execution began with retry configuration |
| attempt_started | info | Individual attempt began |
| attempt_success | info | Attempt completed successfully |
| attempt_failed | error | Attempt failed with classified error |
| task_retried | info | Retry scheduled with backoff |
| circuit_breaker_opened | critical | Circuit breaker tripped for provider |
| task_failed | error | All retries exhausted |
| moved_to_dlq | warning | Task sent to Dead Letter Queue |
| task_completed | info | Task completed successfully |
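To make the append-only model concrete, here is a small in-memory sketch of a timeline with severity filtering. The `TimelineEvent` and `Timeline` classes and their field names are hypothetical; only the event types and severities are taken from the table above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Event type -> severity, copied from the table above.
SEVERITY = {
    "task_created": "info",
    "task_started": "info",
    "attempt_started": "info",
    "attempt_success": "info",
    "attempt_failed": "error",
    "task_retried": "info",
    "circuit_breaker_opened": "critical",
    "task_failed": "error",
    "moved_to_dlq": "warning",
    "task_completed": "info",
}


@dataclass(frozen=True)
class TimelineEvent:
    """One immutable timeline entry (field names are assumptions)."""
    task_id: str
    event_type: str
    severity: str
    recorded_at: datetime
    data: dict[str, Any] = field(default_factory=dict)


class Timeline:
    """Append-only event log for a single task (hypothetical in-memory store)."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self._events: list[TimelineEvent] = []

    def append(self, event_type: str, **data: Any) -> TimelineEvent:
        # Events are only ever added; nothing is updated or deleted.
        event = TimelineEvent(
            task_id=self.task_id,
            event_type=event_type,
            severity=SEVERITY[event_type],  # KeyError for unknown event types
            recorded_at=datetime.now(timezone.utc),
            data=data,
        )
        self._events.append(event)
        return event

    def events(self, severity: str | None = None) -> tuple[TimelineEvent, ...]:
        """Return events in insertion order, optionally filtered by severity."""
        return tuple(
            e for e in self._events if severity is None or e.severity == severity
        )
```

A caller might record `attempt_failed` with `error="HTTP 503"` and later call `events(severity="error")` to pull only the failures for alerting.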
Monitoring API Endpoints
GET /api/v1/tasks/:id/timeline
Auth
Get the full event-sourced timeline for a task, including all attempts, errors, and state changes.
GET /api/v1/tasks/:id/attempts
Auth
List all execution attempts with timing, error details, and circuit breaker state.
POST /api/v1/tasks/:id/retry
Auth
Manually retry a failed or cancelled task through the full DLQ → TaskRouter pipeline.
GET /api/v1/tasks/monitoring/health
Auth
Monitoring health dashboard: queue depth, task counts, failure rate, avg execution time, and active circuit breakers.
GET /api/v1/failed-tasks
Auth
List all tasks in the Dead Letter Queue, filtered by status and agent.
POST /api/v1/failed-tasks/:id/retry
Auth
Re-enqueue a failed task from the DLQ for re-execution through the standard pipeline.
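As a rough illustration of calling these endpoints, the sketch below builds the requests with Python's standard library. The host `api.example.com` is a placeholder, and bearer-token authentication is an assumption: the reference only marks each endpoint as requiring auth, without naming the scheme. Response formats are not documented here, so the sketch stops at building the request.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not a real endpoint


def build_request(method: str, path: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for one of the monitoring endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={
            # Assumption: bearer-token auth; the actual scheme is not stated.
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


def _id(value: str) -> str:
    # Percent-encode the identifier so it is safe inside a URL path segment.
    return urllib.parse.quote(value, safe="")


def timeline_request(task_id: str, token: str) -> urllib.request.Request:
    return build_request("GET", f"/api/v1/tasks/{_id(task_id)}/timeline", token)


def attempts_request(task_id: str, token: str) -> urllib.request.Request:
    return build_request("GET", f"/api/v1/tasks/{_id(task_id)}/attempts", token)


def retry_request(task_id: str, token: str) -> urllib.request.Request:
    return build_request("POST", f"/api/v1/tasks/{_id(task_id)}/retry", token)


def health_request(token: str) -> urllib.request.Request:
    return build_request("GET", "/api/v1/tasks/monitoring/health", token)


def failed_tasks_request(token: str) -> urllib.request.Request:
    return build_request("GET", "/api/v1/failed-tasks", token)


def dlq_retry_request(failed_task_id: str, token: str) -> urllib.request.Request:
    return build_request(
        "POST", f"/api/v1/failed-tasks/{_id(failed_task_id)}/retry", token
    )
```

Each returned object can be passed to `urllib.request.urlopen` to send it. Keeping request construction separate from sending makes the paths and methods easy to check without network access.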
Dashboard Live UI
The Tasks dashboard provides real-time monitoring with:
- ◆ Active Tasks Panel — Live list of pending/in-progress tasks with progress bars and cancel buttons
- ◆ Task Detail Panel — Tabbed view with Overview, Timeline (event stream), Attempts (per-attempt breakdown), Input/Output, and Payment details
- ◆ Real-Time Updates — PubSub subscriptions push updates for task_started, task_completed, task_failed, and task_cancelled without page refresh
- ◆ Manual Retry — Retry failed tasks directly from the detail panel via the DLQ pipeline
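The real-time updates rely on topic-based publish/subscribe. The sketch below mimics that pattern with a tiny in-memory bus to show how the `task:{task_id}` and `user_dashboard:{user_id}` topic names would route events. The `PubSub` class and the payload shape are hypothetical stand-ins; the platform's actual PubSub transport and message format are not described in this document.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[str, Dict], None]


def task_topic(task_id: str) -> str:
    """Topic carrying updates for a single task."""
    return f"task:{task_id}"


def user_dashboard_topic(user_id: str) -> str:
    """Topic carrying updates for everything on one user's dashboard."""
    return f"user_dashboard:{user_id}"


class PubSub:
    """Minimal in-memory topic bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def broadcast(self, topic: str, event: str, payload: Dict) -> None:
        # Deliver only to handlers subscribed to this exact topic.
        for handler in list(self._subscribers.get(topic, ())):
            handler(event, payload)
```

A detail panel would subscribe to `task_topic(task_id)` for one task, while the Active Tasks panel would subscribe to `user_dashboard_topic(user_id)` and receive events such as `task_started` or `task_failed` for every task the user owns.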
Telemetry Events
Task execution emits telemetry for external monitoring integration:
[:agent_marketplace, :task, :started]
Task execution started
[:agent_marketplace, :task, :completed]
Task completed (includes duration_ms and attempt count)
[:agent_marketplace, :task, :failed]
Task failed after all retries (includes reason and retry_count)
[:agent_marketplace, :task, :retry]
Task retry triggered (includes attempt number and error)
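An external monitoring integration would typically fold these events into counters and timings. The sketch below shows one way to do that in Python, representing each event name as a tuple such as `("agent_marketplace", "task", "completed")`. The `TaskMetrics` class is hypothetical, and the split between measurements and metadata is an assumption: the source lists which fields each event includes but not where they are carried.

```python
from collections import Counter
from typing import Dict, List, Tuple

PREFIX = ("agent_marketplace", "task")


class TaskMetrics:
    """Aggregates task telemetry events into simple metrics (illustrative)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.durations_ms: List[float] = []

    def handle(self, event: Tuple[str, ...], measurements: Dict, metadata: Dict) -> None:
        # Ignore anything that is not an agent_marketplace task event.
        if len(event) != 3 or event[:2] != PREFIX:
            return
        kind = event[2]  # "started", "completed", "failed", or "retry"
        self.counts[kind] += 1
        # Assumption: duration_ms arrives in the measurements map.
        if kind == "completed" and "duration_ms" in measurements:
            self.durations_ms.append(measurements["duration_ms"])

    def failure_rate(self) -> float:
        """Failed tasks as a fraction of all finished tasks."""
        finished = self.counts["completed"] + self.counts["failed"]
        return self.counts["failed"] / finished if finished else 0.0
```

Feeding one `completed` and one `failed` event into `handle` gives a `failure_rate()` of 0.5; `retry` events are counted separately and do not affect the rate.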