# CLI-first AI agent evaluation
No server. No signup. No overhead.
Evaluate your AI agents locally with multi-objective scoring from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.
- **Local Execution:** No cloud dependency. All data stays on your machine.
- **Multi-Objective Scoring:** Correctness, latency, cost, and safety in one run.
- **Code + LLM Judges:** Deterministic code validators and customizable LLM judges.
- **LLM & Agent Targets:** Direct LLM providers plus Claude Code, Codex, Pi, Copilot, OpenCode.
- **Rubric Grading:** Structured criteria with weights and auto-generation.
- **A/B Comparison:** Compare evaluation runs with statistical deltas.
## Quick Start

### 1. Install

```bash
npm install -g agentv
```
### 2. Initialize

```bash
agentv init
```
### 3. Configure

Copy `.env.example` to `.env` and add your API keys.
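For example (the exact variable names depend on what your `.env.example` lists and which providers you target; the keys below are illustrative):

```bash
cp .env.example .env
# Edit .env and fill in the provider keys your targets need, e.g.:
# OPENAI_API_KEY=...
# ANTHROPIC_API_KEY=...
```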
### 4. Create an eval

Save an eval spec as YAML, for example `./evals/example.yaml`:
```yaml
description: Math evaluation
execution:
  target: default
evalcases:
  - id: addition
    expected_outcome: Correctly calculates 15 + 27 = 42
    input_messages:
      - role: user
        content: What is 15 + 27?
```
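An eval file can hold more than one case under `evalcases`; a minimal sketch reusing only the fields shown above (the second case is purely illustrative):

```yaml
evalcases:
  - id: addition
    expected_outcome: Correctly calculates 15 + 27 = 42
    input_messages:
      - role: user
        content: What is 15 + 27?
  - id: subtraction
    expected_outcome: Correctly calculates 50 - 8 = 42
    input_messages:
      - role: user
        content: What is 50 - 8?
```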
### 5. Run

```bash
agentv eval ./evals/example.yaml
```

## How AgentV Compares
| Feature | AgentV | LangWatch | LangSmith | LangFuse |
|---|---|---|---|---|
| Setup | npm install | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | ✓ | ✗ | Limited | Limited |
| CI/CD ready | ✓ | Requires API calls | Requires API calls | Requires API calls |
| Version control | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
| Evaluators | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
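Because specs are YAML files in Git and runs are plain CLI commands, dropping AgentV into CI is straightforward. A minimal GitHub Actions sketch; the workflow layout is standard Actions syntax, while the secret name and eval path are illustrative assumptions, not part of AgentV itself:

```yaml
# .github/workflows/evals.yml (illustrative; adjust paths and secrets to your setup)
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g agentv
      - run: agentv eval ./evals/example.yaml
        env:
          # Provide whichever provider keys your eval targets need (see .env.example)
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```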