Evals-as-a-Service Module
Overview
Continuous evaluation and quality assurance for agent workflows
The Evals-as-a-Service Module determines whether agent workflows are improving over time, whether policy changes reduce harmful behavior, and whether new versions should be allowed into production. It combines automated scoring with human review for comprehensive quality control.
Core Features
Test Suite Management
Organize test cases by workflow, capability, or use case.
- Versioned test suites
- Priority levels (critical, high, medium, low)
- Auto-generated tests from traces
- Flaky test detection
Automated Scoring
Multi-dimensional scoring with configurable criteria.
- Correctness scoring
- Format validation
- Latency thresholds
- Custom scoring rules
Human Review
Structured human review for nuanced evaluation.
- Rating scales (1-5)
- Structured feedback forms
- Label and flag system
- Review assignments
Release Gates
Block deployments when quality thresholds aren't met.
- Pass rate thresholds
- Critical test requirements
- Blocking vs. advisory gates
- CI/CD integration
Risk-Weighted Scoring
Test cases can be assigned priority levels that weight their impact on overall scores; a higher-priority failure lowers the final pass rate more than a lower-priority one.
Critical Failures
Any critical priority test failure is flagged separately for immediate attention, regardless of overall pass rate.
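The weighting described above can be sketched in a few lines. The specific weight values below are illustrative assumptions, not the module's actual configuration:

```python
# Illustrative priority weights (assumed values, not the module's defaults).
PRIORITY_WEIGHTS = {"critical": 4.0, "high": 2.0, "medium": 1.0, "low": 0.5}

def score_run(results):
    """Compute a risk-weighted pass rate for one eval run.

    results: list of (priority, passed) tuples, one per test case.
    Critical failures are flagged separately, regardless of pass rate.
    """
    total = sum(PRIORITY_WEIGHTS[p] for p, _ in results)
    earned = sum(PRIORITY_WEIGHTS[p] for p, ok in results if ok)
    critical_failed = any(p == "critical" and not ok for p, ok in results)
    return {
        "pass_rate": earned / total if total else 1.0,
        "critical_failure": critical_failed,
    }
```

With these weights, one failed high-priority case costs as much as two medium ones, and a single critical failure is surfaced even when the overall pass rate clears its threshold.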
Benchmark & Comparison
Compare agent versions systematically to measure improvement:
- Compare baseline and candidate versions side by side
- Auto-generated recommendation based on results
- Statistical confidence in comparison results
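One simple way to attach statistical confidence to a baseline-vs-candidate comparison is a two-proportion z-test on pass rates. This is a sketch of the idea, not necessarily the test the module uses; the 1.96 cutoff corresponds to roughly 95% confidence:

```python
import math

def compare_versions(base_pass, base_total, cand_pass, cand_total):
    """Two-proportion z-test: is the candidate's pass rate a real change?

    Returns both pass rates, the z statistic, and a coarse recommendation.
    """
    p1 = base_pass / base_total
    p2 = cand_pass / cand_total
    pooled = (base_pass + cand_pass) / (base_total + cand_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cand_total))
    z = (p2 - p1) / se if se else 0.0
    if z > 1.96:
        recommendation = "promote"
    elif z < -1.96:
        recommendation = "investigate regression"
    else:
        recommendation = "no clear difference"
    return {"baseline": p1, "candidate": p2, "z": z, "recommendation": recommendation}
```

For example, 95/100 passing against a baseline of 80/100 is a statistically clear improvement, while identical pass rates yield "no clear difference" rather than a false signal.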
Shadow Testing
Test new versions safely by running them alongside production with real traffic:
1. Configure Traffic Split: define what percentage of traffic to route to the candidate version (e.g., 10%).
2. Collect Metrics: compare latency, accuracy, cost, and success rate between versions.
3. Analyze & Decide: get automated recommendations on whether to promote, extend, or roll back.
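The traffic-split step is typically implemented with deterministic bucketing, so the same request always lands on the same version across retries. A minimal sketch (the function name and hashing scheme are illustrative, not the module's internals):

```python
import hashlib

def route(request_id: str, traffic_percentage: int = 10) -> str:
    """Route roughly traffic_percentage% of requests to the candidate.

    Hashing the request ID into 100 buckets keeps the split deterministic:
    a given request is always served by the same version.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < traffic_percentage else "baseline"
```

Deterministic routing also makes the collected metrics comparable: each request's latency, cost, and outcome can be attributed to exactly one version.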
Auto-Generated Tests
Automatically create test cases from real production traces:
Generate regression tests from production failures to prevent similar issues.
API Endpoints
Test Suites
GET /api/v1/evals/suites
List all test suites.
POST /api/v1/evals/suites
Create a new test suite.
{
  "name": "Customer Support Evals",
  "description": "Test suite for support agent",
  "suite_type": "workflow",
  "thresholds": {
    "pass_rate": 0.85,
    "latency_ms": 5000
  }
}
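Creating a suite from code is a plain authenticated POST of the payload above. A sketch using only the standard library; the host and Bearer-token auth scheme are assumptions about your deployment:

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host; adjust for your deployment

def create_suite_request(name, description, suite_type, thresholds, token):
    """Build (but do not send) the POST request for /api/v1/evals/suites."""
    body = json.dumps({
        "name": name,
        "description": description,
        "suite_type": suite_type,
        "thresholds": thresholds,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/evals/suites",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Send it with `urllib.request.urlopen(req)` (or your HTTP client of choice) and read back the created suite's ID from the response.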
GET /api/v1/evals/suites/:id
Get suite details.
PUT /api/v1/evals/suites/:id
Update a test suite.
DELETE /api/v1/evals/suites/:id
Delete a test suite.
Test Cases
GET /api/v1/evals/suites/:suite_id/cases
List test cases in a suite.
POST /api/v1/evals/suites/:suite_id/cases
Add a test case to a suite.
{
  "name": "Handle refund request",
  "input": { "query": "I want a refund for order #123" },
  "expected_output": { "action": "process_refund" },
  "priority": "high",
  "validation_rules": {
    "required_fields": ["action", "order_id"]
  }
}
GET /api/v1/evals/cases/:id
Get test case details.
PUT /api/v1/evals/cases/:id
Update a test case.
DELETE /api/v1/evals/cases/:id
Delete a test case.
Eval Runs
GET /api/v1/evals/suites/:suite_id/runs
List eval runs for a suite.
POST /api/v1/evals/suites/:suite_id/runs
Create a new eval run for a suite.
GET /api/v1/evals/runs/:id
Get run details.
POST /api/v1/evals/runs/:id/start
Start an eval run.
POST /api/v1/evals/runs/:id/complete
Submit results and complete an eval run.
POST /api/v1/evals/runs/:id/score
Score an eval run.
Human Reviews
GET /api/v1/evals/runs/:eval_run_id/reviews
List human reviews for an eval run.
POST /api/v1/evals/runs/:eval_run_id/reviews/:test_case_id
Submit a human review for a test case.
POST /api/v1/evals/reviews/:id
Complete a human review.
Benchmarks
GET /api/v1/evals/suites/:suite_id/benchmarks
List benchmarks for a suite.
POST /api/v1/evals/suites/:suite_id/benchmarks
Create a benchmark for a suite.
GET /api/v1/evals/benchmarks/:id
Get benchmark details.
POST /api/v1/evals/benchmarks/:id/compare
Compare a benchmark between versions.
Release Gates
GET /api/v1/evals/suites/:suite_id/gates
List release gates for a suite.
POST /api/v1/evals/suites/:suite_id/gates
Create a release gate with thresholds.
{
  "name": "Production Gate",
  "gate_type": "blocking",
  "thresholds": {
    "pass_rate": 0.90,
    "latency_ms": 3000,
    "no_critical_failures": true
  }
}
POST /api/v1/evals/gates/:id/evaluate
Evaluate a release gate against recent run results.
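The evaluation itself is a threshold check against the gate's configuration. A sketch of the logic, assuming field names that mirror the example payloads in this section (the actual evaluation may check more dimensions):

```python
def evaluate_gate(gate: dict, run: dict) -> dict:
    """Check one run's metrics against a gate's thresholds.

    A blocking gate halts the deployment on failure; an advisory
    gate reports the same failures without blocking.
    """
    t = gate["thresholds"]
    failures = []
    if run["pass_rate"] < t.get("pass_rate", 0.0):
        failures.append("pass_rate")
    if run["latency_ms"] > t.get("latency_ms", float("inf")):
        failures.append("latency_ms")
    if t.get("no_critical_failures") and run.get("critical_failures", 0) > 0:
        failures.append("critical_failures")
    passed = not failures
    blocked = (not passed) and gate["gate_type"] == "blocking"
    return {"passed": passed, "blocked": blocked, "failures": failures}
```

In CI/CD, a `blocked: true` result would fail the pipeline stage, while an advisory gate's failures would only be logged or posted to the release review.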
Auto-Generation
POST /api/v1/evals/generate/from-trace/:trace_id
Generate test cases from a production trace.
POST /api/v1/evals/generate/from-failures
Generate regression tests from production failures.
GET /api/v1/evals/suites/:suite_id/suggestions
Get improvement suggestions for a suite.
Shadow Testing
POST /api/v1/evals/shadow-tests/:suite_id/start
Start a shadow test for safe version testing.
{
  "traffic_percentage": 10,
  "baseline_version": "v1.0.0",
  "candidate_version": "v1.1.0",
  "duration_hours": 24
}
POST /api/v1/evals/shadow-tests/:id/stop
Stop a running shadow test.
GET /api/v1/evals/shadow-tests/:id/analyze
Get shadow test analysis and recommendation.
Best Practices
Test Design
- Start with critical path tests
- Auto-generate tests from failures
- Mark edge cases as high priority
Release Process
- Run shadow tests before production
- Set blocking gates for critical metrics
- Include human review for subjective tests