Compare

The compare command computes deltas between two evaluation runs for A/B testing.

Run two evaluations and compare them:

agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl

Pass --threshold to set the minimum delta that counts as a significant change:

agentv compare before.jsonl after.jsonl --threshold 0.1

The comparison shows:

  • Wins — cases where scores improved
  • Losses — cases where scores regressed
  • Ties — cases with no significant change
  • Mean delta — average score change across all cases

This helps identify whether changes to your agent or prompts improved or regressed performance.
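
To make the wins, losses, ties, and mean delta concrete, here is a minimal Python sketch of how such a summary could be computed from per-case scores. It is an illustration, not agentv's implementation: the function name `compare_scores`, the plain `{case_id: score}` dictionaries, and the rule that a change must exceed the threshold in absolute value to count as a win or loss are all assumptions. It does not read the JSONL files, whose schema is not described here.

```python
from statistics import mean


def compare_scores(before, after, threshold=0.0):
    """Summarize per-case score changes between two runs.

    `before` and `after` map a case id to a numeric score. This layout is
    hypothetical and chosen for illustration only.
    """
    wins, losses, ties, deltas = [], [], [], []

    # Only cases present in both runs can be compared.
    for case_id in sorted(before.keys() & after.keys()):
        delta = after[case_id] - before[case_id]
        deltas.append(delta)
        if delta > threshold:
            wins.append(case_id)      # improved by more than the threshold
        elif delta < -threshold:
            losses.append(case_id)    # regressed by more than the threshold
        else:
            ties.append(case_id)      # change too small to count (assumed rule)

    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "mean_delta": mean(deltas) if deltas else 0.0,
    }


if __name__ == "__main__":
    before = {"a": 0.5, "b": 0.8, "c": 0.6}
    after = {"a": 0.9, "b": 0.5, "c": 0.625}
    print(compare_scores(before, after, threshold=0.1))
```

In this sketch the mean delta is taken over every shared case, including ties, so a large threshold changes the win/loss/tie counts but not the average.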