# Task Monitoring & Observability

## Self-Healing Task Execution

Event-sourced timelines, intelligent retry, and proactive alerting

The Task Monitoring system transforms task execution from "tasks disappear into failed status" to fully observable, self-healing workflows. Every task execution is tracked with event-sourced timelines, intelligent retry with exponential backoff, circuit breaker awareness, and automatic alerting when tasks exhaust all recovery options.

Event-sourced timeline · Smart retry with backoff · Dead Letter Queue (DLQ) · Severity-based alerts

### Key Capabilities

- **Error Classification**: Automatically distinguishes retryable errors (HTTP 5xx, timeouts, connection failures) from permanent ones (HTTP 4xx, missing callbacks)
- **Exponential Backoff with Jitter**: 1s base, 30s max, 25% jitter factor to prevent thundering herd
- **Circuit Breaker Awareness**: Skips retries when provider circuit breakers are open, saving time and resources
- **Event-Sourced Timeline**: Every attempt, error, state change, and recovery action is recorded as an append-only event
- **Dead Letter Queue**: Exhausted tasks are stored for manual review with full context, never silently lost
- **Proactive Alerting**: Severity classified as critical ($100+), high (payment/withdrawal), or medium for all others
- **Real-Time PubSub**: Live dashboard updates via `task:{task_id}` and `user_dashboard:{user_id}` topics
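The classification, backoff, and severity rules listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the system's implementation. The function names, the `error_kind` strings, the doubling multiplier, and the symmetric application of jitter are all assumptions. Only the 1s base, 30s cap, 25% jitter factor, and the severity thresholds come from the text. Ranking "critical" ahead of "high" when both apply is also an assumption.

```python
import random

BASE_DELAY_S = 1.0    # from the text: 1s base
MAX_DELAY_S = 30.0    # from the text: 30s max
JITTER_FACTOR = 0.25  # from the text: 25% jitter factor


def is_retryable(status_code=None, error_kind=None):
    """Return True for transient errors, False for permanent ones.

    The error_kind labels are hypothetical names for the categories
    mentioned in the text.
    """
    if error_kind in {"timeout", "connection_failure"}:
        return True
    if error_kind == "missing_callback":
        return False
    if status_code is not None:
        if 500 <= status_code <= 599:
            return True
        if 400 <= status_code <= 499:
            return False
    # Assumption: anything unrecognised is treated as permanent.
    return False


def backoff_delay(attempt, rng=random):
    """Delay in seconds before retry number `attempt` (1-indexed).

    Assumes the delay doubles per attempt, is capped at MAX_DELAY_S,
    and is then shifted by up to +/-25% so that many failing tasks
    do not retry at the same instant.
    """
    exponent = min(max(attempt - 1, 0), 30)  # clamp to avoid huge powers
    capped = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** exponent)
    jitter = capped * JITTER_FACTOR * rng.uniform(-1.0, 1.0)
    return capped + jitter


def alert_severity(amount_usd, task_type):
    """Map an exhausted task to an alert severity."""
    if amount_usd >= 100:
        return "critical"
    if task_type in {"payment", "withdrawal"}:
        return "high"
    return "medium"
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests. Under these assumptions the jittered delay can exceed the 30s cap by up to 25%. Apply the cap after the jitter instead if 30s must be a hard ceiling.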

## Task Timeline Events

Every task execution generates a rich event timeline. Events are append-only and include severity levels for filtering and alerting.

| Event Type | Severity | Description |
| --- | --- | --- |
| task_created | info | Task record created |
| task_started | info | Execution began with retry configuration |
| attempt_started | info | Individual attempt began |
| attempt_success | info | Attempt completed successfully |
| attempt_failed | error | Attempt failed with classified error |
| task_retried | info | Retry scheduled with backoff |
| circuit_breaker_opened | critical | Circuit breaker tripped for provider |
| task_failed | error | All retries exhausted |
| moved_to_dlq | warning | Task sent to Dead Letter Queue |
| task_completed | info | Task completed successfully |
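As a rough illustration of how an append-only timeline with severity filtering might look, here is a small in-memory model. The event types and severities are copied from the table above. The class names, the field names, and the ordering `info < warning < error < critical` are assumptions made for this sketch, and the real system presumably persists events rather than holding them in a list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed ordering, used only for "at least this severe" filtering.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}

# Event type -> severity, taken from the table above.
EVENT_SEVERITY = {
    "task_created": "info",
    "task_started": "info",
    "attempt_started": "info",
    "attempt_success": "info",
    "attempt_failed": "error",
    "task_retried": "info",
    "circuit_breaker_opened": "critical",
    "task_failed": "error",
    "moved_to_dlq": "warning",
    "task_completed": "info",
}


@dataclass(frozen=True)
class TimelineEvent:
    task_id: str
    event_type: str
    severity: str
    recorded_at: datetime
    details: dict = field(default_factory=dict)


class Timeline:
    """Append-only event log: events are added, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, task_id, event_type, **details):
        event = TimelineEvent(
            task_id=task_id,
            event_type=event_type,
            severity=EVENT_SEVERITY[event_type],  # KeyError on unknown types
            recorded_at=datetime.now(timezone.utc),
            details=details,
        )
        self._events.append(event)
        return event

    def events(self, min_severity="info"):
        """Return events at or above min_severity, in recorded order."""
        floor = SEVERITY_RANK[min_severity]
        return [e for e in self._events if SEVERITY_RANK[e.severity] >= floor]
```

For example, calling `timeline.events("warning")` after a failed run would return only the `attempt_failed`, `task_failed`, `moved_to_dlq`, and `circuit_breaker_opened` entries, which is the kind of filter an alerting rule might use.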

## Monitoring API Endpoints

| Endpoint | Auth | Description |
| --- | --- | --- |
| `GET /api/v1/tasks/:id/timeline` | Required | Get the full event-sourced timeline for a task, including all attempts, errors, and state changes. |
| `GET /api/v1/tasks/:id/attempts` | Required | List all execution attempts with timing, error details, and circuit breaker state. |
| `POST /api/v1/tasks/:id/retry` | Required | Manually retry a failed or cancelled task through the full DLQ → TaskRouter pipeline. |
| `GET /api/v1/tasks/monitoring/health` | Required | Monitoring health dashboard: queue depth, task counts, failure rate, average execution time, and active circuit breakers. |
| `GET /api/v1/failed-tasks` | Required | List all tasks in the Dead Letter Queue, filtered by status and agent. |
| `POST /api/v1/failed-tasks/:id/retry` | Required | Re-enqueue a failed task from the DLQ for re-execution through the standard pipeline. |
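A client for these endpoints might look like the following sketch, which uses only the Python standard library. The paths come from the endpoint list above. The host `api.example.com`, the Bearer-token `Authorization` header, and the JSON response format are assumptions, since this page does not specify the authentication scheme or the response schema.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_request(method, path, token, base_url=BASE_URL):
    """Build an authenticated request without sending it."""
    headers = {
        "Authorization": f"Bearer {token}",  # assumed auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(base_url + path, method=method, headers=headers)


def timeline_request(task_id, token):
    return build_request("GET", f"/api/v1/tasks/{task_id}/timeline", token)


def attempts_request(task_id, token):
    return build_request("GET", f"/api/v1/tasks/{task_id}/attempts", token)


def retry_request(task_id, token):
    return build_request("POST", f"/api/v1/tasks/{task_id}/retry", token)


def send(request, timeout=10):
    """Send a request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from `send` lets the URL, method, and headers be checked in tests without any network access.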

## Dashboard Live UI

The Tasks dashboard provides real-time monitoring with:

- **Active Tasks Panel**: Live list of pending/in-progress tasks with progress bars and cancel buttons
- **Task Detail Panel**: Tabbed view with Overview, Timeline (event stream), Attempts (per-attempt breakdown), Input/Output, and Payment details
- **Real-Time Updates**: PubSub subscriptions push updates for `task_started`, `task_completed`, `task_failed`, and `task_cancelled` without page refresh
- **Manual Retry**: Retry failed tasks directly from the detail panel via the DLQ pipeline

## Recovery & Fallback Chains

When automatic retries cannot recover a task, the Recovery system surfaces predefined strategies by error type and lets you compose multi-step fallback chains that escalate through alternative approaches.

| Endpoint | Auth | Description |
| --- | --- | --- |
| `GET /api/v1/recovery/strategies/:error_type` | Required | Look up recommended recovery strategies for a specific error type. Common types: `timeout`, `auth_failure`, `rate_limit`, `network_error`, `validation_error`. |
| `POST /api/v1/recovery/fallback-chain` | Required | Create a fallback chain: an ordered sequence of recovery steps executed automatically when an attempt fails. Each step picks a strategy (retry, switch agent, escalate, etc.) and the chain proceeds until one succeeds or the chain is exhausted. |
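The "proceed until one step succeeds or the chain is exhausted" behaviour can be modelled locally as shown below. This sketch only illustrates the control flow. The function name and the step representation (a label paired with a callable) are assumptions, and it does not reflect the request body that `POST /api/v1/recovery/fallback-chain` expects, which this page does not document.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_fallback_chain(steps: List[Step]) -> Tuple[Optional[str], List[str]]:
    """Run steps in order and stop at the first one that succeeds.

    Returns (name_of_successful_step, names_tried). The first element
    is None when every step fails, meaning the chain is exhausted.
    """
    tried: List[str] = []
    for name, action in steps:
        tried.append(name)
        try:
            if action():
                return name, tried
        except Exception:
            # Assumption: a step that raises counts as a failed step,
            # and the chain moves on to the next one.
            continue
    return None, tried
```

For example, a chain of `retry`, `switch_agent`, `escalate` where only the second step succeeds returns `("switch_agent", ["retry", "switch_agent"])`, and the `escalate` step is never run.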

## Telemetry Events

Task execution emits telemetry for external monitoring integration:

| Event | Description |
| --- | --- |
| `[:agrenting, :task, :started]` | Task execution started |
| `[:agrenting, :task, :completed]` | Task completed (includes `duration_ms` and attempt count) |
| `[:agrenting, :task, :failed]` | Task failed after all retries (includes reason and `retry_count`) |
| `[:agrenting, :task, :retry]` | Task retry triggered (includes attempt number and error) |