
# Rubrics

Rubrics define structured evaluation criteria directly in your eval cases. They support binary checklist grading and score-range analytic grading.

In the simplest form, each string in the list becomes a required criterion:

```yaml
evalcases:
  - id: quicksort-explain
    expected_outcome: Explain how quicksort works
    input_messages:
      - role: user
        content: Explain quicksort algorithm
    rubrics:
      - Mentions divide-and-conquer approach
      - Explains partition step
      - States time complexity
```
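Based on the description above and the defaults in the field table below, each shorthand string is roughly equivalent to a full rubric object. The exact expansion, including how the id is generated, is an assumption:

```yaml
rubrics:
  - id: mentions-divide-and-conquer  # hypothetical auto-generated id
    expected_outcome: Mentions divide-and-conquer approach
    weight: 1.0
    required: true  # per the prose: each string becomes a required criterion
```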

For fine-grained control, use rubric objects with weights and requirements:

```yaml
rubrics:
  - id: core-concept
    expected_outcome: Explains divide-and-conquer
    weight: 2.0
    required: true
  - id: partition
    expected_outcome: Describes partition step
    weight: 1.5
  - id: complexity
    expected_outcome: States O(n log n) average time
    weight: 1.0
```
| Field | Default | Description |
| --- | --- | --- |
| `id` | Auto-generated | Unique identifier for the criterion |
| `expected_outcome` | — | Description of what to check |
| `weight` | `1.0` | Relative importance for scoring |
| `required` | `false` | If true, failing this criterion fails the entire eval |
| `required_min_score` | — | Minimum score threshold (score-range mode) |
| `score_ranges` | — | Score range definitions (analytic mode) |
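A minimal Python sketch of how binary checklist grading might combine these fields. The function name and the convention that a failed required criterion yields a score of 0.0 are assumptions; the weighting logic follows the formula given later in this page:

```python
def grade_binary(criteria, judge_results):
    """Combine per-criterion pass/fail verdicts into a weighted score.

    criteria: list of dicts with 'id', optional 'weight' (default 1.0),
              and optional 'required' (default False)
    judge_results: dict mapping criterion id -> bool (satisfied or not)
    """
    total = sum(c.get("weight", 1.0) for c in criteria)
    satisfied = 0.0
    for c in criteria:
        passed = judge_results[c["id"]]
        if c.get("required", False) and not passed:
            return 0.0  # assumed: a failed required criterion fails the eval outright
        if passed:
            satisfied += c.get("weight", 1.0)
    return satisfied / total

criteria = [
    {"id": "core-concept", "weight": 2.0, "required": True},
    {"id": "partition", "weight": 1.5},
    {"id": "complexity", "weight": 1.0},
]
# Missing only the 1.0-weight criterion: 3.5 / 4.5 ≈ 0.778
print(grade_binary(criteria, {"core-concept": True, "partition": True, "complexity": False}))
```

Note how the 2.0 weight makes `core-concept` count twice as much as `complexity`, while `required: true` makes it a hard gate regardless of weight.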

For quality gradients instead of binary pass/fail, use score ranges:

```yaml
rubrics:
  - id: accuracy
    expected_outcome: Provides correct answer
    weight: 2.0
    score_ranges:
      0: Completely wrong
      3: Partially correct with major errors
      5: Mostly correct with minor issues
      7: Correct with minor omissions
      10: Perfectly accurate and complete
```

Criteria are scored in one of two modes. In binary checklist mode, each criterion is either satisfied or not, and the score is the weighted fraction of satisfied criteria. In score-range (analytic) mode, each criterion is scored 0–10 by the LLM judge with granular feedback, then normalized before weighting:

```
score = sum(satisfied_weights) / sum(total_weights)               # binary mode
score = sum(criterion_score / 10 * weight) / sum(total_weights)   # score-range mode
```
The aggregate score maps to a verdict:

| Verdict | Score |
| --- | --- |
| `pass` | ≥ 0.8 |
| `borderline` | ≥ 0.6 |
| `fail` | < 0.6 |
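The score-range formula and verdict thresholds can be sketched in Python. Only the formula and the three thresholds come from this page; the function names and the example criteria are illustrative assumptions:

```python
def grade_score_ranges(criteria, judge_scores):
    """Score-range (analytic) grading: each criterion gets 0-10 from the judge,
    normalized to 0-1 and combined by weight."""
    total = sum(c.get("weight", 1.0) for c in criteria)
    weighted = sum(
        judge_scores[c["id"]] / 10 * c.get("weight", 1.0) for c in criteria
    )
    return weighted / total

def verdict(score):
    """Map a normalized score to the documented verdict thresholds."""
    if score >= 0.8:
        return "pass"
    if score >= 0.6:
        return "borderline"
    return "fail"

criteria = [{"id": "accuracy", "weight": 2.0}, {"id": "clarity", "weight": 1.0}]
s = grade_score_ranges(criteria, {"accuracy": 10, "clarity": 7})
print(round(s, 3), verdict(s))  # 0.9 pass
```

Here the weight-2.0 `accuracy` criterion pulls the aggregate up: (10/10 × 2.0 + 7/10 × 1.0) / 3.0 = 0.9, which clears the 0.8 pass threshold.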

Generate rubrics from expected outcomes:

```sh
agentv generate rubrics evals/my-eval.yaml
```

This analyzes each eval case’s expected_outcome and creates structured rubric criteria.

Rubrics work alongside code and LLM judges:

```yaml
evalcases:
  - id: code-quality
    expected_outcome: Generates correct, clean Python code
    input_messages:
      - role: user
        content: Write a fibonacci function
    rubrics:
      - Returns correct values for n=0,1,2,10
      - Uses meaningful variable names
      - Includes docstring
    execution:
      evaluators:
        - name: syntax_check
          type: code_judge
          script: ./validators/check_python.py
```