
Custom Evaluators

AgentV supports multiple evaluator types that can be combined for comprehensive evaluation.

| Type | Description | Use Case |
| --- | --- | --- |
| `code_judge` | Deterministic script (Python/TS/any) | Exact matching, format validation, programmatic checks |
| `llm_judge` | LLM-based evaluation with custom prompt | Semantic evaluation, nuance, subjective quality |
| Rubrics | Structured criteria in the eval case | Multi-criterion grading with weights |

Evaluators are configured at the execution level, either top-level (applies to all cases) or per-case.

Top-level evaluators apply to every case in the file:

```yaml
description: My evaluation
execution:
  evaluators:
    - name: correctness
      type: llm_judge
      prompt: ./judges/correctness.md
evalcases:
  - id: test-1
    # Uses the top-level evaluator
    ...
```

Per-case evaluators are declared under the case's own execution block:

```yaml
evalcases:
  - id: test-1
    expected_outcome: Returns valid JSON
    input_messages:
      - role: user
        content: Generate a JSON config
    execution:
      evaluators:
        - name: json_check
          type: code_judge
          script: ./validators/check_json.py
```
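
The exact input/output contract for code judges is not shown on this page, so the following is only a minimal sketch of what `./validators/check_json.py` could look like, assuming (not confirmed here) that the candidate output arrives on stdin and a zero exit code means pass:

```python
#!/usr/bin/env python3
"""Sketch of a code judge (./validators/check_json.py).

Assumed contract (not from this page): candidate output on stdin,
exit code 0 = pass, non-zero = fail.
"""
import json
import sys


def main() -> int:
    raw = sys.stdin.read()
    try:
        json.loads(raw)  # passes only if the output parses as JSON
    except json.JSONDecodeError as err:
        print(f"invalid JSON: {err}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because it is a plain script, you can pipe sample output through it by hand before wiring it into an evaluation.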

Use multiple evaluators on the same case for comprehensive scoring:

```yaml
evalcases:
  - id: code-generation
    expected_outcome: Generates correct Python code
    input_messages:
      - role: user
        content: Write a sorting function
    rubrics:
      - Code is syntactically valid
      - Handles edge cases (empty list, single element)
      - Uses appropriate algorithm
    execution:
      evaluators:
        - name: syntax_check
          type: code_judge
          script: ./validators/check_syntax.py
        - name: quality_review
          type: llm_judge
          prompt: ./judges/code_quality.md
```
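
Under the same assumed stdin/exit-code contract (again an assumption, not confirmed by this page), `./validators/check_syntax.py` could be a thin wrapper around Python's `ast` module:

```python
#!/usr/bin/env python3
"""Sketch of ./validators/check_syntax.py under the assumed contract:
candidate code on stdin, exit code 0 = pass."""
import ast
import sys


def main() -> int:
    source = sys.stdin.read()
    try:
        ast.parse(source)  # is the candidate output valid Python syntax?
    except SyntaxError as err:
        print(f"syntax error: {err}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```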

Each evaluator produces its own score. Results appear in `evaluator_results[]` in the output JSONL.
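
The JSONL schema beyond the `evaluator_results[]` field is not documented here, but as a rough sketch, assuming each entry carries `name` and `score` fields (and a hypothetical `results.jsonl` output path), per-evaluator averages could be pulled out like this:

```python
# Hypothetical post-processing of the output JSONL.
# Field names other than evaluator_results are assumptions.
import json
from collections import defaultdict

totals: dict[str, list[float]] = defaultdict(list)

with open("results.jsonl") as fh:
    for line in fh:
        record = json.loads(line)
        for result in record.get("evaluator_results", []):
            totals[result["name"]].append(result["score"])

for name, scores in totals.items():
    print(f"{name}: mean score {sum(scores) / len(scores):.2f}")
```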

  • Use code judges for deterministic checks — exact value matching, format validation, schema compliance
  • Use LLM judges for semantic evaluation — meaning, quality, helpfulness
  • Use rubrics for structured multi-criteria grading — when you need weighted, itemized scoring
  • Combine evaluator types for comprehensive coverage
  • Test code judges locally before running full evaluations (see the sketch after this list)
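
For that last point, a quick local smoke test might pipe a known-good and a known-bad sample through the judge, again assuming the hypothetical stdin/exit-code contract from the sketches above:

```python
# Hypothetical local smoke test for a code judge before a full evaluation run.
import subprocess


def run_judge(sample: str) -> int:
    """Feed a sample to the judge script and return its exit code."""
    return subprocess.run(
        ["python", "validators/check_json.py"],
        input=sample,
        capture_output=True,
        text=True,
    ).returncode


assert run_judge('{"ok": true}') == 0      # valid JSON should pass
assert run_judge("not json at all") != 0   # invalid JSON should fail
print("check_json.py behaves as expected")
```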