
Introduction

AgentV is a CLI-first AI agent evaluation framework. It evaluates your agents locally against YAML specifications, with multi-objective scoring (correctness, latency, cost, safety). Evaluations combine deterministic code judges with customizable LLM judges, and everything is version-controlled in Git.

Best for: Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.

  • No cloud dependency — everything runs locally
  • No server — just install and run
  • Version-controlled — YAML evaluation files live in Git alongside your code
  • CI/CD ready — run evaluations in your pipeline without external API calls (see the workflow sketch after this list)
  • Multiple evaluator types — code validators, LLM judges, custom Python/TypeScript
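
Because everything runs locally, a pipeline job only needs Node and the repository itself. Below is a minimal CI sketch, assuming the package is published on npm as `agentv` and exposes a `run` subcommand; both names are illustrative and not confirmed by this page.

```yaml
# Hypothetical GitHub Actions workflow. The npm package name "agentv" and the
# "agentv run" subcommand are assumptions for illustration only.
name: agent-evals
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install --global agentv   # assumed package name
      - run: agentv run evals/             # assumed command; point it at your eval files
```
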
| Feature | AgentV | LangWatch | LangSmith | LangFuse |
| --- | --- | --- | --- | --- |
| Setup | npm install | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | Yes | No | Limited | Limited |
| CI/CD ready | Yes | Requires API calls | Requires API calls | Requires API calls |
| Version control | Yes (YAML in Git) | No | No | No |
| Evaluators | Code + LLM + Custom | LLM only | LLM + Code | LLM only |

Evaluation files (.yaml or .jsonl) define test cases with expected outcomes. Targets specify which agent or provider to evaluate. Judges (code or LLM) score the results, which are written as JSONL or YAML for analysis and comparison. A sketch of an eval file follows the list below.

  • Eval files — YAML or JSONL definitions of test cases
  • Eval cases — Individual test cases with input messages and expected outcomes
  • Targets — The agent or LLM provider being evaluated
  • Evaluators — Code judges (Python/TypeScript) or LLM judges that score responses
  • Rubrics — Structured criteria with weights for grading
  • Results — JSONL output with scores, reasoning, and execution traces
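
To make these pieces concrete, here is a rough sketch of what an eval file could look like. The key names below are assumptions for illustration, not AgentV's documented schema.

```yaml
# Illustrative eval file; key names are assumptions, not the framework's documented schema.
name: refund-policy
target: azure-openai              # the agent or provider being evaluated
cases:
  - id: refund-window
    input:
      - role: user
        content: "Can I return a laptop after 45 days?"
    expected: "Returns are accepted within 30 days of purchase only."
    evaluators:
      - type: llm-judge
        rubric:
          - criterion: "States the 30-day limit"
            weight: 0.7
          - criterion: "Tone is polite and concise"
            weight: 0.3
      - type: code                  # deterministic check, e.g. a small Python script
        script: ./judges/check_refund.py
```
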
  • Multi-objective scoring: Correctness, latency, cost, safety in one run
  • Multiple evaluator types: Code validators, LLM judges, custom Python/TypeScript
  • Built-in targets: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
  • Structured evaluation: Rubric-based grading with weights and requirements
  • Batch evaluation: Run hundreds of test cases in parallel
  • Export: JSON, JSONL, YAML formats
  • Compare results: Compute deltas between evaluation runs for A/B testing
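
For a sense of what the output looks like, a single result record (rendered here as YAML rather than a JSONL line; field names and values are again illustrative) might carry the per-objective scores that the compare step diffs between runs:

```yaml
# Illustrative result record; field names and values are assumptions for this sketch.
case_id: refund-window
target: azure-openai
scores:
  correctness: 0.9
  latency_ms: 1240
  cost_usd: 0.0031
  safety: 1.0
reasoning: "Mentions the 30-day limit and stays concise."
trace:
  - step: model_call
    duration_ms: 1180
```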