Evaluations | Agrenting Developer Docs

Evals-as-a-Service Module

Overview

Continuous evaluation and quality assurance for agent workflows

The Evals-as-a-Service Module determines whether agent workflows are improving over time, whether policy changes reduce harmful behavior, and whether new versions should be allowed into production. It combines automated scoring with human review for comprehensive quality control.

  • Suites: test case management
  • Scoring: automated evaluation
  • Gates: release controls
  • Shadow: safe testing

Core Features

Test Suite Management

Organize test cases by workflow, capability, or use case.

  • Versioned test suites
  • Priority levels (critical, high, medium, low)
  • Auto-generated tests from traces
  • Flaky test detection

Automated Scoring

Multi-dimensional scoring with configurable criteria.

  • Correctness scoring
  • Format validation
  • Latency thresholds
  • Custom scoring rules

Human Review

Structured human review for nuanced evaluation.

  • Rating scales (1-5)
  • Structured feedback forms
  • Label and flag system
  • Review assignments

Release Gates

Block deployments when quality thresholds aren't met.

  • Pass rate thresholds
  • Critical test requirements
  • Blocking vs advisory gates
  • CI/CD integration

Risk-Weighted Scoring

Test cases can be assigned priority levels that weight their impact on overall scores. Higher priority failures have greater impact on the final pass rate.

  • critical: 4.0x weight
  • high: 2.0x weight
  • medium: 1.0x weight
  • low: 0.5x weight
  • cosmetic: 0.25x weight
Formula:
risk_weighted_pass_rate = sum(passed_case_weights) / sum(all_case_weights)
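
The formula above can be sketched in a few lines (the weight mapping mirrors the table; the function and argument names are illustrative, not part of the platform API):

```python
# Priority weights as listed above; names are illustrative.
PRIORITY_WEIGHTS = {
    "critical": 4.0,
    "high": 2.0,
    "medium": 1.0,
    "low": 0.5,
    "cosmetic": 0.25,
}

def risk_weighted_pass_rate(cases):
    """cases: iterable of (priority, passed) pairs.

    Returns sum(passed_case_weights) / sum(all_case_weights).
    """
    total = sum(PRIORITY_WEIGHTS[p] for p, _ in cases)
    passed = sum(PRIORITY_WEIGHTS[p] for p, ok in cases if ok)
    return passed / total if total else 0.0
```

Note how a single failing critical case pulls the rate down far more than a failing cosmetic one: one critical failure among three otherwise-passing medium cases yields 3/7 ≈ 0.43, not 0.75.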

Critical Failures

Any critical priority test failure is flagged separately for immediate attention, regardless of overall pass rate.

Benchmark & Comparison

Compare agent versions systematically to measure improvement:

Version Comparison

Compare baseline vs candidate versions side-by-side

Verdict Generation

Auto-generated recommendation based on results

Confidence Scoring

Statistical confidence in comparison results

Verdict Outcomes:

  • candidate_better
  • baseline_better
  • no_significant_diff
  • inconclusive
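
One way such a verdict could be derived is sketched below. The outcome names match the list above, but the decision rule and the threshold values (`min_confidence`, `min_delta`) are assumptions for illustration, not the platform's actual logic:

```python
def verdict(baseline_rate, candidate_rate, confidence,
            min_confidence=0.95, min_delta=0.02):
    """Illustrative verdict rule: require statistical confidence first,
    then a minimum pass-rate delta to call a winner."""
    if confidence < min_confidence:
        return "inconclusive"
    delta = candidate_rate - baseline_rate
    if abs(delta) < min_delta:
        return "no_significant_diff"
    return "candidate_better" if delta > 0 else "baseline_better"
```

The ordering matters: a low-confidence comparison is reported as inconclusive even when the raw delta looks large.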

Shadow Testing

Test new versions safely by running them alongside production with real traffic:

1. Configure Traffic Split: define what percentage of traffic to route to the candidate version (e.g., 10%).

2. Collect Metrics: compare latency, accuracy, cost, and success rate between versions.

3. Analyze & Decide: get automated recommendations on whether to promote, extend, or roll back.
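
The traffic-split step could be implemented as deterministic hash-based routing, sketched below. Hash bucketing is an assumption about how the split might work; the platform's actual routing mechanism is not specified here:

```python
import hashlib

def route_to_candidate(request_id: str, traffic_percentage: int) -> bool:
    """Deterministically route ~traffic_percentage% of requests to the
    candidate version. The same request ID always routes the same way,
    which keeps baseline/candidate comparisons stable across retries."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < traffic_percentage
```

Deterministic routing (rather than random sampling per call) means a given user or request stays pinned to one version for the duration of the shadow test.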

Auto-Generated Tests

Automatically create test cases from real production traces:

From Failed Traces

Generate regression tests from production failures to prevent similar issues.

POST /api/v1/evals/generate/from-failures

API Endpoints

Test Suites

GET /api/v1/evals/suites

List all test suites.

POST /api/v1/evals/suites

Create a new test suite.

Request:

```json
{
  "name": "Customer Support Evals",
  "description": "Test suite for support agent",
  "suite_type": "workflow",
  "thresholds": {
    "pass_rate": 0.85,
    "latency_ms": 5000
  }
}
```
GET /api/v1/evals/suites/:id

Get suite details.

PUT /api/v1/evals/suites/:id

Update a test suite.

DELETE /api/v1/evals/suites/:id

Delete a test suite.

Test Cases

GET /api/v1/evals/suites/:suite_id/cases

List test cases in a suite.

POST /api/v1/evals/suites/:suite_id/cases

Add a test case to a suite.

Request:

```json
{
  "name": "Handle refund request",
  "input": { "query": "I want a refund for order #123" },
  "expected_output": { "action": "process_refund" },
  "priority": "high",
  "validation_rules": {
    "required_fields": ["action", "order_id"]
  }
}
```
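
A `required_fields` rule like the one in the request above could be checked against an agent's output as follows. This is a minimal sketch of the rule's likely semantics; the function name and return shape are illustrative:

```python
def check_required_fields(output: dict, validation_rules: dict) -> list:
    """Return the names of required fields missing from the agent's output.
    An empty list means the rule passed."""
    required = validation_rules.get("required_fields", [])
    return [field for field in required if field not in output]
```

For the example test case, an output of `{"action": "process_refund"}` would fail the rule because `order_id` is missing.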
GET /api/v1/evals/cases/:id

Get test case details.

PUT /api/v1/evals/cases/:id

Update a test case.

DELETE /api/v1/evals/cases/:id

Delete a test case.

Eval Runs

GET /api/v1/evals/suites/:suite_id/runs

List eval runs for a suite.

POST /api/v1/evals/suites/:suite_id/runs

Create a new eval run for a suite.

GET /api/v1/evals/runs/:id

Get run details.

PUT /api/v1/evals/runs/:id/start

Start an eval run.

PUT /api/v1/evals/runs/:id/complete

Submit results and complete an eval run.

POST /api/v1/evals/runs/:id/score

Score an eval run.

Human Reviews

GET /api/v1/evals/runs/:eval_run_id/reviews

List human reviews for an eval run.

POST /api/v1/evals/runs/:eval_run_id/reviews/:test_case_id

Submit a human review for a test case.

PUT /api/v1/evals/reviews/:id

Complete a human review.

Benchmarks

GET /api/v1/evals/suites/:suite_id/benchmarks

List benchmarks for a suite.

POST /api/v1/evals/suites/:suite_id/benchmarks

Create a benchmark for a suite.

GET /api/v1/evals/benchmarks/:id

Get benchmark details.

POST /api/v1/evals/benchmarks/:id/compare

Compare a benchmark between versions.

Release Gates

GET /api/v1/evals/suites/:suite_id/gates

List release gates for a suite.

POST /api/v1/evals/suites/:suite_id/gates

Create a release gate with thresholds.

Request:

```json
{
  "name": "Production Gate",
  "gate_type": "blocking",
  "thresholds": {
    "pass_rate": 0.90,
    "latency_ms": 3000,
    "no_critical_failures": true
  }
}
```
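
Evaluating a gate against a run's results might look like the sketch below. The threshold field names mirror the request body above, but the evaluation logic and the run-result shape are assumptions for illustration:

```python
def evaluate_gate(thresholds: dict, run: dict) -> dict:
    """Check a run's metrics against gate thresholds.
    Returns whether the gate passed and a list of human-readable reasons."""
    failures = []
    if run["pass_rate"] < thresholds.get("pass_rate", 0.0):
        failures.append(
            f"pass_rate {run['pass_rate']:.2f} < {thresholds['pass_rate']:.2f}")
    if "latency_ms" in thresholds and run["latency_ms"] > thresholds["latency_ms"]:
        failures.append(
            f"latency {run['latency_ms']}ms > {thresholds['latency_ms']}ms")
    if thresholds.get("no_critical_failures") and run.get("critical_failures", 0) > 0:
        failures.append(f"{run['critical_failures']} critical failure(s)")
    return {"passed": not failures, "failures": failures}
```

A blocking gate would halt the deployment when `passed` is false; an advisory gate would surface the failure reasons without blocking.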
POST /api/v1/evals/gates/:id/evaluate

Evaluate a release gate against recent run results.

Auto-Generation

POST /api/v1/evals/generate/from-trace/:trace_id

Generate test cases from a production trace.

POST /api/v1/evals/generate/from-failures

Generate regression tests from production failures.

GET /api/v1/evals/suites/:suite_id/suggestions

Get improvement suggestions for a suite.

Shadow Testing

POST /api/v1/evals/shadow-tests/:suite_id/start

Start a shadow test for safe version testing.

Request:

```json
{
  "traffic_percentage": 10,
  "baseline_version": "v1.0.0",
  "candidate_version": "v1.1.0",
  "duration_hours": 24
}
```
PUT /api/v1/evals/shadow-tests/:id/stop

Stop a running shadow test.

GET /api/v1/evals/shadow-tests/:id/analyze

Get shadow test analysis and recommendation.

Best Practices

Test Design

  • Start with critical path tests
  • Auto-generate tests from failures
  • Mark edge cases as high priority

Release Process

  • Run shadow tests before production
  • Set blocking gates for critical metrics
  • Include human review for subjective tests