# Skill Evals (evals.json)
## Overview

Agent Skills is an open standard for describing AI agent capabilities. Its evals.json format defines simple test cases for skills — a prompt, expected output, and natural-language assertions.
AgentV natively supports evals.json. You can run Agent Skills evals directly:
```sh
agentv eval evals.json --target claude
```

When you need AgentV’s power features (deterministic evaluators, composite scoring, multi-turn conversations, workspace isolation), you can graduate to EVAL.yaml.
## Quick start

Create evals.json:
```json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}
```

Run it:
```sh
agentv eval evals.json --target claude
```

The `--target` flag selects the agent harness. The agent evaluates itself — skills load naturally via progressive disclosure.
## Field mapping

When AgentV loads evals.json, it promotes fields to its internal representation:
| evals.json | EVAL.yaml equivalent | Notes |
|---|---|---|
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
| `assertions[]` | `assertions[]` | Each string becomes `{type: llm-grader, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to evals.json, copied into workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |
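The promotion described in the table can be sketched in Python. This is an illustrative model of the mapping, not AgentV's actual implementation; `promote_case` and its output shape are assumptions drawn from the table above.

```python
def promote_case(case: dict, skill_name: str) -> dict:
    """Sketch of the evals.json -> internal-representation mapping
    (hypothetical helper, not AgentV's real code)."""
    return {
        "id": str(case["id"]),  # numeric id becomes a string
        "input": [{"role": "user", "content": case["prompt"]}],
        "expected_output": case.get("expected_output"),
        "criteria": case.get("expected_output"),  # doubles as evaluation criteria
        "assertions": [
            {"type": "llm-grader", "prompt": text}
            for text in case.get("assertions", [])
        ],
        "file_paths": case.get("files", []),
        "metadata": {"skill_name": skill_name},
    }

case = {
    "id": 1,
    "prompt": "Analyze the sales data",
    "expected_output": "November is the top month.",
    "assertions": ["Output identifies November as the highest revenue month"],
}
promoted = promote_case(case, "csv-analyzer")
```

Note how every natural-language assertion becomes an `llm-grader` entry by default; graduating to EVAL.yaml is what lets you swap some of these for deterministic checks.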
## Files support

The `files[]` field lists files that the agent needs during evaluation. Paths are relative to the evals.json location:
```json
{
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze the sales data",
      "files": ["evals/files/sales.csv", "evals/files/config.json"]
    }
  ]
}
```

AgentV resolves these paths and copies the files into the workspace before the agent runs. If a file is missing, the test case fails with a `file_copy_error`.
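The resolve-and-copy step can be modeled roughly as follows. This is a hypothetical sketch of the behavior described above (paths resolved relative to evals.json, missing files treated as an error), not AgentV's source:

```python
import shutil
import tempfile
from pathlib import Path

def copy_eval_files(evals_json: Path, files: list[str], workspace: Path) -> None:
    """Resolve each entry relative to the evals.json location and copy it
    into the workspace; a missing file aborts the test case (AgentV
    reports this as file_copy_error). Illustrative only."""
    for rel in files:
        src = evals_json.parent / rel
        if not src.is_file():
            raise FileNotFoundError(f"file_copy_error: {src}")
        dest = workspace / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

# Demo with a throwaway directory layout
root = Path(tempfile.mkdtemp())
(root / "evals" / "files").mkdir(parents=True)
(root / "evals" / "files" / "sales.csv").write_text("month,revenue\nNov,22500\n")
workspace = root / "workspace"
workspace.mkdir()
copy_eval_files(root / "evals.json", ["evals/files/sales.csv"], workspace)
```

Keeping the relative layout inside the workspace means prompts that mention paths like `evals/files/sales.csv` still resolve for the agent.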
## Offline grading (no API keys)

Grade existing agent sessions offline using `agentv import` to convert transcripts, then run deterministic evaluators:
```sh
# Import a Claude Code session transcript
agentv import claude --discover latest

# Run deterministic evaluators against the imported transcript
agentv eval evals.json --target copilot-log
```

If you’re using the agentv-bench skill bundle, validate your evals before running:
```sh
cd plugins/agentv-dev/skills/agentv-bench
python scripts/quick_validate.py --eval evals/evals.json
```

The rest of the bundle follows the same pattern:

- `scripts/run_eval.py` runs evals via `claude -p`
- `scripts/run_loop.py` iterates eval rounds automatically
- `scripts/aggregate_benchmark.py` and `scripts/generate_report.py` read AgentV artifacts
- `scripts/improve_description.py` proposes description experiments from observed failures
## Benchmark output

Generate an Agent Skills-compatible benchmark.json alongside the standard result JSONL:
```sh
agentv eval evals.json --target claude --benchmark-json benchmark.json
```

The benchmark uses AgentV’s pass threshold (score >= 0.8) to map continuous scores to the binary pass/fail that Agent Skills `pass_rate` expects:
```json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.83, "stddev": 0.06},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400}
    }
  }
}
```
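The threshold mapping can be sketched in a few lines. This is an illustrative model of the score-to-pass_rate conversion (function name and the choice of population stddev are assumptions, not AgentV's actual code):

```python
from statistics import mean, pstdev

PASS_THRESHOLD = 0.8  # AgentV's pass threshold, per the docs above

def pass_rate_summary(scores: list[float]) -> dict:
    """Map each continuous score to binary pass/fail at the threshold,
    then report the mean and stddev that benchmark.json expects.
    Sketch only; AgentV's real aggregation may differ in detail."""
    passes = [1.0 if s >= PASS_THRESHOLD else 0.0 for s in scores]
    return {"mean": round(mean(passes), 2), "stddev": round(pstdev(passes), 2)}

print(pass_rate_summary([0.95, 0.85, 0.7, 0.9]))  # → {'mean': 0.75, 'stddev': 0.43}
```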
## Converting to EVAL.yaml

When you’re ready to graduate, convert your evals.json to EVAL.yaml:
```sh
# Output to stdout
agentv convert evals.json

# Write to file
agentv convert evals.json -o eval.yaml
```

The generated YAML includes comments about available AgentV features you can use:
```yaml
# Converted from Agent Skills evals.json
# AgentV features you can add:
# - type: is_json, contains, regex for deterministic evaluators
# - type: code-grader for custom scoring scripts
# - Multi-turn conversations via input message arrays
# - Composite evaluators with weighted scoring
# - Workspace isolation with repos and hooks

tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input:
      - role: user
        content: "Find the top 3 months by revenue."
    # Promoted from evals.json assertions[]
    # Replace with type: is_json, contains, or regex for deterministic checks
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Output identifies November as the highest revenue month"
```

Inside the agentv-bench bundle, use agentv convert directly:
```sh
agentv convert evals/evals.json --out EVAL.yaml
```
## When to stay with evals.json

Use evals.json when:
- You’re building a skill and want quick feedback loops
- Your assertions are natural-language (“output includes a chart”, “response is polite”)
- You want compatibility with other Agent Skills tooling
- Tests don’t need workspace isolation or deterministic checks
## When to graduate to EVAL.yaml

Switch to EVAL.yaml when you need:
- Deterministic evaluators: `contains`, `regex`, `equals`, `is-json` — faster and cheaper than LLM graders
- Composite scoring: Weighted evaluators with custom aggregation
- Multi-turn conversations: Multi-message input sequences
- Workspace isolation: Sandboxed file systems per test case
- Tool trajectory evaluation: Assert on the sequence of tool calls
- Matrix evaluation: Test across multiple targets simultaneously
## Side-by-side comparison

The same eval expressed in both formats:
### evals.json

```json
{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}
```
### EVAL.yaml equivalent

```yaml
tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    expected_output: |
      An empathetic response that offers to track the order and provides next steps.
    assertions:
      - name: acknowledges-frustration
        type: llm-grader
        prompt: Response acknowledges the customer's frustration
      - name: looks-up-order
        type: contains
        value: "12345"
      - name: has-next-steps
        type: llm-grader
        prompt: Response provides clear next steps
```

Notice how the EVAL.yaml version can mix llm-grader (for subjective checks) with contains (for deterministic checks) — the order number check is now instant and free.