
Running Evaluations

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-grader timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "assertions": [
        { "text": "clear structure", "passed": true }
      ],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
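As a sketch of such bottleneck analysis, per-grader totals can be aggregated from the results JSONL in a few lines of TypeScript. Only the scores[] shape shown above is assumed; reading the actual results file is omitted and the JSONL is passed in as a string:

```typescript
// Sum duration_ms per grader name across a results JSONL string.
// Assumes only the scores[] shape shown above.
type GraderScore = { name: string; duration_ms: number };
type ResultLine = { scores: GraderScore[] };

function graderTotals(jsonl: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const result: ResultLine = JSON.parse(line);
    for (const s of result.scores) {
      totals.set(s.name, (totals.get(s.name) ?? 0) + s.duration_ms);
    }
  }
  return totals;
}

// Two result lines that both ran the same grader:
const sample = [
  JSON.stringify({ scores: [{ name: "format_structure", duration_ms: 9103 }] }),
  JSON.stringify({ scores: [{ name: "format_structure", duration_ms: 4897 }] }),
].join("\n");

console.log(graderTotals(sample).get("format_structure")); // 14000
```

Sorting the resulting map by total duration surfaces the slowest graders first.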

Run against a different target than the one specified in the eval file:

Terminal window
agentv eval --target azure-base evals/**/*.yaml

Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):

Terminal window
agentv pipeline run evals/my-eval.yaml --experiment with_skills
agentv pipeline run evals/my-eval.yaml --experiment without_skills

The experiment label is written to manifest.json and propagated to each entry in index.jsonl by pipeline bench. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
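A dashboard-style comparison can be sketched in a few lines of TypeScript. The experiment and score field names here are assumptions based on the description above, not the exact index.jsonl schema:

```typescript
// Mean score per experiment label, as a dashboard might compute it.
// Field names are assumed from the description above.
type IndexEntry = { experiment: string; score: number };

function meanByExperiment(entries: IndexEntry[]): Map<string, number> {
  const acc = new Map<string, { total: number; n: number }>();
  for (const e of entries) {
    const a = acc.get(e.experiment) ?? { total: 0, n: 0 };
    a.total += e.score;
    a.n += 1;
    acc.set(e.experiment, a);
  }
  return new Map(
    Array.from(acc, ([label, a]) => [label, a.total / a.n] as const),
  );
}

const entries: IndexEntry[] = [
  { experiment: "with_skills", score: 1.0 },
  { experiment: "with_skills", score: 0.5 },
  { experiment: "without_skills", score: 0.6 },
];
console.log(meanByExperiment(entries).get("with_skills")); // 0.75
```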

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml
Write results to a custom output path:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

By default, AgentV writes a per-run workspace with index.jsonl as the canonical manifest for result-oriented workflows. For full-fidelity span inspection, export OTLP JSON explicitly.

Terminal window
# Summary-level inspection from the run manifest
agentv trace stats .agentv/results/runs/<timestamp>/index.jsonl
# Full-fidelity OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Inspect the OTLP trace export
agentv trace show traces/eval.otlp.json --tree

index.jsonl contains aggregate metrics such as score, latency, cost, token usage, and summary trace counters. --otel-file writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.

Stream traces directly to an observability backend during evaluation using --export-otel:

Terminal window
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns

Set up your environment:

Terminal window
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project # associates traces with a Braintrust project

Run an eval with traces sent to Braintrust:

Terminal window
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

The following environment variables control project association (at least one is required):

Variable               Format                   Example
BRAINTRUST_PROJECT     Project name             my-evals
BRAINTRUST_PROJECT_ID  Project UUID             proj_abc123
BRAINTRUST_PARENT      Raw x-bt-parent header   project_name:my-evals

Each eval test case produces a trace with:

  • Root span (agentv.eval) — test ID, target, score, duration
  • LLM call spans (chat <model>) — model name, token usage (input/output/cached)
  • Tool call spans (execute_tool <name>) — tool name, arguments, results (with --otel-capture-content)
  • Turn spans (agentv.turn.N) — groups messages by conversation turn (with --otel-group-turns)
  • Evaluator events — per-grader scores attached to the root span

For Langfuse, set your API keys and use the langfuse preset:

Terminal window
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content

For backends not covered by presets, configure via environment variables:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
agentv eval evals/my-eval.yaml --export-otel

Control workspaces with a workspace mode and finish policies instead of multiple conflicting boolean flags:

Terminal window
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled    # pooled | temp | static
  path: null      # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true # set false to skip all hooks
  after_each:
    reset: fast   # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) uses path / --workspace-path. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
  • Static mode is incompatible with isolation: per_test.
  • hooks.enabled: false skips all lifecycle hooks (setup, teardown, reset).
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
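The filtering step can be pictured as follows. This is a sketch: the executionStatus value comes from the description above, the result shape is simplified, and status names other than execution_error are illustrative:

```typescript
// Split previous results into error cases to re-run and results to carry over.
type PrevResult = { testId: string; executionStatus: string };

function splitForRetry(prev: PrevResult[]) {
  const retry = prev.filter((r) => r.executionStatus === "execution_error");
  const keep = prev.filter((r) => r.executionStatus !== "execution_error");
  return { retry, keep };
}

const prev: PrevResult[] = [
  { testId: "case-1", executionStatus: "completed" }, // status name illustrative
  { testId: "case-2", executionStatus: "execution_error" },
];
const { retry, keep } = splitForRetry(prev);
console.log(retry.map((r) => r.testId)); // only case-2 is re-run
```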

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false  # never halt on errors (default)
  # fail_on_error: true # halt on first execution error

Value   Behavior
true    Halt immediately on first execution error
false   Continue despite errors (default)

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.

CLI flag:

Terminal window
agentv eval evals/ --threshold 0.8

YAML config:

execution:
  threshold: 0.8

The CLI --threshold flag overrides the YAML value. The threshold is a number between 0 and 1. Mean score is computed from quality results only (execution errors are excluded).

When active, a summary line is printed after the eval results:

Suite score: 0.85 (threshold: 0.80) — PASS

The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as <failure> in JUnit output. When no threshold is set, JUnit defaults to 0.5.
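The mean-score gate can be sketched like this. Field names are simplified; only the exclusion of execution errors and the >= comparison are taken from the text:

```typescript
// Suite passes when the mean of quality scores meets the threshold;
// execution errors are excluded from the mean.
type Result = { score: number; executionStatus: string };

function suitePasses(results: Result[], threshold: number): boolean {
  const quality = results.filter((r) => r.executionStatus !== "execution_error");
  const mean = quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
  return mean >= threshold;
}

const results: Result[] = [
  { score: 0.9, executionStatus: "completed" }, // status name illustrative
  { score: 0.8, executionStatus: "completed" },
  { score: 0.0, executionStatus: "execution_error" }, // excluded from the mean
];
console.log(suitePasses(results, 0.8)); // true (mean of the two quality scores)
```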

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run a code-grader assertion in isolation without executing a full eval suite:

Terminal window
agentv eval assert <name> --agent-output <text> --agent-input <text>

The command discovers the assertion script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}, then passes the input via stdin and prints the result JSON to stdout.

Terminal window
# Run an assertion with inline arguments
agentv eval assert rouge-score \
--agent-output "The fox jumps over the lazy dog" \
--agent-input "Summarise the article"
# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json

The --file option reads a JSON file with { "output": "...", "input": "..." } fields.

Exit codes: 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use: the EVAL.yaml transpiler emits assertion instructions for code graders so external grading agents can execute them directly.
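A hypothetical grader script (say, .agentv/graders/keyword-check.ts) might look like the sketch below. The payload and result shapes are assumptions based on the stdin/stdout contract described above, and the rubric is a toy:

```typescript
// Toy grader: score 1 when the agent output contains a required keyword.
// Payload/result shapes are assumed from the contract described above.
type Payload = { output: string; input: string };
type GradeResult = { score: number };

function grade(payload: Payload): GradeResult {
  return { score: payload.output.includes("fox") ? 1 : 0 };
}

const payload: Payload = {
  output: "The fox jumps over the lazy dog",
  input: "Summarise the article",
};
const result = grade(payload);
console.log(JSON.stringify(result)); // {"score":1}
// A real grader would read the payload from stdin, print the result JSON to
// stdout, and exit 0 when score >= 0.5 (1 otherwise), per the documented exit codes.
```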

Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:

Terminal window
# Import a Claude Code session
agentv import claude --discover latest
# Run evaluators against the imported transcript
agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl

See the Import tool docs for all providers and options.

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).

Condition                  Interactive (TTY)             Non-interactive (CI)
Version satisfies range    Runs silently                 Runs silently
Version below range        Warns + prompts to continue   Warns to stderr, continues
--strict flag + mismatch   Warns + exits 1               Warns + exits 1
No required_version set    Runs silently                 Runs silently
Malformed semver range     Error + exits 1               Error + exits 1
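As a simplified illustration of how such a check behaves, here is a toy comparator for the >=X.Y.Z form only. AgentV itself accepts full npm semver ranges, which this sketch does not implement:

```typescript
// Toy comparator for ">=X.Y.Z" ranges only; not a full semver implementation.
function satisfiesGte(version: string, range: string): boolean {
  const min = range.replace(/^>=/, "").split(".").map(Number);
  const cur = version.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    // First differing component decides the comparison.
    if (cur[i] !== min[i]) return cur[i] > min[i];
  }
  return true; // equal versions satisfy >=
}

console.log(satisfiesGte("2.13.1", ">=2.12.0")); // true
console.log(satisfiesGte("2.11.9", ">=2.12.0")); // false
```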

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

Field            CLI equivalent     Type     Default  Description
verbose          --verbose          boolean  false    Enable verbose logging
keep_workspaces  --keep-workspaces  boolean  false    Always keep temp workspaces after eval
otel_file        --otel-file        string   none     Write OTLP JSON trace to file
The agentv.config.ts equivalent:

import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.
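A sketch of what the substitution could look like, assuming (based on the example above) that the rule is the ISO timestamp with ':' and '.' replaced by '-':

```typescript
// Replace {timestamp} with a filesystem-safe ISO timestamp.
// The exact replacement rule is an assumption matching the example above.
function expandTimestamp(template: string, now: Date): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return template.replace("{timestamp}", stamp);
}

console.log(
  expandTimestamp(
    ".agentv/results/otel-{timestamp}.json",
    new Date("2026-03-05T14:30:00.000Z"),
  ),
); // .agentv/results/otel-2026-03-05T14-30-00-000Z.json
```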

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

Terminal window
# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.