# Skill Evals (evals.json)
## Overview

Agent Skills is an open standard for describing AI agent capabilities. Its evals.json format defines simple test cases for skills — a prompt, expected output, and natural-language assertions.
AgentV natively supports evals.json. You can run Agent Skills evals directly:
```sh
agentv eval evals.json --target claude
```

When you need AgentV’s power features (deterministic evaluators, composite scoring, multi-turn conversations, workspace isolation), you can graduate to EVAL.yaml.
## Quick start

Create evals.json:
```json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}
```

Run it:
```sh
agentv eval evals.json --target claude
```

The `--target` flag selects the agent harness. The agent evaluates itself — skills load naturally via progressive disclosure.
## Field mapping

When AgentV loads evals.json, it promotes fields to its internal representation:
| evals.json | EVAL.yaml equivalent | Notes |
|---|---|---|
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
| `assertions[]` | `assertions[]` | Each string becomes `{type: llm-grader, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to evals.json, copied into workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |
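The promotion described in the table can be sketched in Python. This is an illustrative model of the mapping, not AgentV's actual implementation; `promote_case` and its output shape are assumptions drawn from the table above.

```python
def promote_case(case: dict, skill_name: str) -> dict:
    """Sketch of the evals.json -> internal-representation mapping
    (hypothetical helper, not AgentV's real code)."""
    return {
        "id": str(case["id"]),  # numeric id becomes a string
        "input": [{"role": "user", "content": case["prompt"]}],
        "expected_output": case.get("expected_output"),
        "criteria": case.get("expected_output"),  # doubles as evaluation criteria
        "assertions": [
            {"type": "llm-grader", "prompt": text}
            for text in case.get("assertions", [])
        ],
        "file_paths": case.get("files", []),
        "metadata": {"skill_name": skill_name},
    }

case = {
    "id": 1,
    "prompt": "Analyze the sales data",
    "expected_output": "November is the top month.",
    "assertions": ["Output identifies November as the highest revenue month"],
}
promoted = promote_case(case, "csv-analyzer")
```

Note how every natural-language assertion becomes an `llm-grader` entry by default; graduating to EVAL.yaml is what lets you swap some of these for deterministic checks.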
## Files support

The `files[]` field lists files that the agent needs during evaluation. Paths are relative to the evals.json location:
```json
{
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze the sales data",
      "files": ["evals/files/sales.csv", "evals/files/config.json"]
    }
  ]
}
```

AgentV resolves these paths and copies the files into the workspace before the agent runs. If a file is missing, the test case fails with a `file_copy_error`.
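The resolve-and-copy step can be modeled roughly as follows. This is a hypothetical sketch of the behavior described above (paths resolved relative to evals.json, missing files treated as an error), not AgentV's source:

```python
import shutil
import tempfile
from pathlib import Path

def copy_eval_files(evals_json: Path, files: list[str], workspace: Path) -> None:
    """Resolve each entry relative to the evals.json location and copy it
    into the workspace; a missing file aborts the test case (AgentV
    reports this as file_copy_error). Illustrative only."""
    for rel in files:
        src = evals_json.parent / rel
        if not src.is_file():
            raise FileNotFoundError(f"file_copy_error: {src}")
        dest = workspace / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

# Demo with a throwaway directory layout
root = Path(tempfile.mkdtemp())
(root / "evals" / "files").mkdir(parents=True)
(root / "evals" / "files" / "sales.csv").write_text("month,revenue\nNov,22500\n")
workspace = root / "workspace"
workspace.mkdir()
copy_eval_files(root / "evals.json", ["evals/files/sales.csv"], workspace)
```

Keeping the relative layout inside the workspace means prompts that mention paths like `evals/files/sales.csv` still resolve for the agent.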
## Offline grading (no API keys)

Grade existing agent sessions offline using `agentv import` to convert transcripts, then run deterministic evaluators:
```sh
# Import a Claude Code session transcript
agentv import claude --discover latest

# Run deterministic evaluators against the imported transcript
agentv eval evals.json --target copilot-log
```

If you’re using the agentv-bench skill bundle, validate your evals before running:
```sh
cd plugins/agentv-dev/skills/agentv-bench
python scripts/quick_validate.py --eval evals/evals.json
```

The rest of the bundle follows the same pattern:

- `scripts/run_eval.py` runs evals via `claude -p`
- `scripts/run_loop.py` iterates eval rounds automatically
- `scripts/aggregate_benchmark.py` and `scripts/generate_report.py` read AgentV artifacts
- `scripts/improve_description.py` proposes description experiments from observed failures
## Benchmark output

Generate an Agent Skills-compatible benchmark.json alongside the standard result JSONL:
```sh
agentv eval evals.json --target claude --benchmark-json benchmark.json
```

The benchmark uses AgentV’s pass threshold (score >= 0.8) to map continuous scores to the binary pass/fail that Agent Skills `pass_rate` expects:
```json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.83, "stddev": 0.06},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400}
    }
  }
}
```
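The threshold mapping can be sketched in a few lines. This is an illustrative model of the score-to-pass_rate conversion (function name and the choice of population stddev are assumptions, not AgentV's actual code):

```python
from statistics import mean, pstdev

PASS_THRESHOLD = 0.8  # AgentV's pass threshold, per the docs above

def pass_rate_summary(scores: list[float]) -> dict:
    """Map each continuous score to binary pass/fail at the threshold,
    then report the mean and stddev that benchmark.json expects.
    Sketch only; AgentV's real aggregation may differ in detail."""
    passes = [1.0 if s >= PASS_THRESHOLD else 0.0 for s in scores]
    return {"mean": round(mean(passes), 2), "stddev": round(pstdev(passes), 2)}

print(pass_rate_summary([0.95, 0.85, 0.7, 0.9]))  # → {'mean': 0.75, 'stddev': 0.43}
```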
## Converting to EVAL.yaml

When you’re ready to graduate, convert your evals.json to EVAL.yaml:
```sh
# Output to stdout
agentv convert evals.json

# Write to file
agentv convert evals.json -o eval.yaml
```

The generated YAML includes comments about available AgentV features you can use:
```yaml
# Converted from Agent Skills evals.json
# AgentV features you can add:
# - type: is_json, contains, regex for deterministic evaluators
# - type: code-grader for custom scoring scripts
# - Multi-turn conversations via input message arrays
# - Composite evaluators with weighted scoring
# - Workspace isolation with repos and hooks

tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input:
      - role: user
        content: "Find the top 3 months by revenue."
    # Promoted from evals.json assertions[]
    # Replace with type: is_json, contains, or regex for deterministic checks
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Output identifies November as the highest revenue month"
```

Inside the agentv-bench bundle, use agentv convert directly:
```sh
agentv convert evals/evals.json --out EVAL.yaml
```
## When to stay with evals.json

Use evals.json when:
- You’re building a skill and want quick feedback loops
- Your assertions are natural-language (“output includes a chart”, “response is polite”)
- You want compatibility with other Agent Skills tooling
- Tests don’t need workspace isolation or deterministic checks
## When to graduate to EVAL.yaml

Switch to EVAL.yaml when you need:
- Deterministic evaluators: `contains`, `regex`, `equals`, `is-json` — faster and cheaper than LLM graders
- Composite scoring: Weighted evaluators with custom aggregation
- Multi-turn conversations: Multi-message input sequences
- Workspace isolation: Sandboxed file systems per test case
- Tool trajectory evaluation: Assert on the sequence of tool calls
- Matrix evaluation: Test across multiple targets simultaneously
## Side-by-side comparison

The same eval expressed in both formats:
### evals.json

```json
{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}
```
### EVAL.yaml equivalent

```yaml
tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    expected_output: |
      An empathetic response that offers to track the order and provides next steps.
    assertions:
      - name: acknowledges-frustration
        type: llm-grader
        prompt: Response acknowledges the customer's frustration
      - name: looks-up-order
        type: contains
        value: "12345"
      - name: has-next-steps
        type: llm-grader
        prompt: Response provides clear next steps
```

Notice how the EVAL.yaml version can mix llm-grader (for subjective checks) with contains (for deterministic checks) — the order number check is now instant and free.