
CLI-first AI agent evaluation

No server. No signup. No overhead.

Evaluate your AI agents locally with multi-objective scoring from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.

Local Execution

No cloud dependency. All data stays on your machine.

Multi-Objective Scoring

Correctness, latency, cost, and safety in one run.

Code + LLM Judges

Deterministic code validators and customizable LLM judges; a brief sketch follows this list.

LLM & Agent Targets

Direct LLM providers plus Claude Code, Codex, Pi, Copilot, OpenCode.

Rubric Grading

Structured criteria with weights and automatic rubric generation.

A/B Comparison

Compare evaluation runs with statistical deltas.
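To give a feel for how code and LLM judges could pair with multi-objective scoring in a spec, here is a purely illustrative fragment. The keys below (judges, type, weight, rubric, budgets) are hypothetical and not taken from AgentV's documented schema; run agentv init and check the docs for the real syntax.

# Hypothetical sketch only -- these keys are NOT AgentV's documented schema
judges:
  - type: code        # deterministic check, e.g. exact match on the expected sum
    weight: 0.6
  - type: llm         # model-graded rubric for tone and safety
    weight: 0.4
    rubric: States the correct sum and contains no unsafe content.
budgets:
  max_latency_ms: 5000   # illustrative latency objective
  max_cost_usd: 0.01     # illustrative cost objective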

Quick Start

1. Install

   npm install -g agentv

2. Initialize

   agentv init

3. Configure

   Copy .env.example to .env and add your API keys (a sketch of the file follows these steps).

4. Create an eval

   description: Math evaluation
   execution:
     target: default

   evalcases:
     - id: addition
       expected_outcome: Correctly calculates 15 + 27 = 42
       input_messages:
         - role: user
           content: What is 15 + 27?

5. Run

   agentv eval ./evals/example.yaml
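The variable names below are only a guess at what step 3 might need for common providers; the authoritative list is whatever ships in the generated .env.example.

# Contents of .env -- variable names are assumptions, copy the real ones from .env.example
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...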

How AgentV Compares

| Feature | AgentV | LangWatch | LangSmith | LangFuse |
| --- | --- | --- | --- | --- |
| Setup | npm install | Cloud account + API key | Cloud account + API key | Cloud account + API key |
| Server | None (local) | Managed cloud | Managed cloud | Managed cloud |
| Privacy | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
| CLI-first | ✓ | Limited | Limited | |
| CI/CD ready | ✓ | Requires API calls | Requires API calls | Requires API calls |
| Version control | ✓ (YAML in Git) | | | |
| Evaluators | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
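To make the CI/CD row concrete, here is a minimal GitHub Actions sketch. The workflow layout and the OPENAI_API_KEY secret name are assumptions; only npm install -g agentv and agentv eval ./evals/example.yaml come from the Quick Start above, and depending on how AgentV reads keys you may need to write a .env file in CI instead of passing environment variables.

name: evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g agentv
      # Secret name is an assumption; use whatever your configuration from step 3 expects
      - run: agentv eval ./evals/example.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}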