# Rubrics
Rubrics define structured evaluation criteria directly in your eval cases. They support binary checklist grading and score-range analytic grading.
## Basic Usage

The simplest form — each string becomes a required criterion:
```yaml
evalcases:
  - id: quicksort-explain
    expected_outcome: Explain how quicksort works
    input_messages:
      - role: user
        content: Explain quicksort algorithm
    rubrics:
      - Mentions divide-and-conquer approach
      - Explains partition step
      - States time complexity
```

## Checklist Mode
For fine-grained control, use rubric objects with weights and requirements:
```yaml
rubrics:
  - id: core-concept
    expected_outcome: Explains divide-and-conquer
    weight: 2.0
    required: true
  - id: partition
    expected_outcome: Describes partition step
    weight: 1.5
  - id: complexity
    expected_outcome: States O(n log n) average time
    weight: 1.0
```

## Rubric Object Fields
| Field | Default | Description |
|---|---|---|
| `id` | Auto-generated | Unique identifier for the criterion |
| `expected_outcome` | — | Description of what to check |
| `weight` | `1.0` | Relative importance for scoring |
| `required` | `false` | If `true`, failing this criterion fails the entire eval |
| `required_min_score` | — | Minimum score threshold (score-range mode) |
| `score_ranges` | — | Score range definitions (analytic mode) |
## Score-Range Mode (Analytic)

For quality gradients instead of binary pass/fail, use score ranges:
```yaml
rubrics:
  - id: accuracy
    expected_outcome: Provides correct answer
    weight: 2.0
    score_ranges:
      0: Completely wrong
      3: Partially correct with major errors
      5: Mostly correct with minor issues
      7: Correct with minor omissions
      10: Perfectly accurate and complete
```

Each criterion is scored 0–10 by the LLM judge with granular feedback.
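As a rough illustration of how anchored ranges can be interpreted (assumed behavior, not the exact agentv internals), a raw 0–10 judge score maps back to the label of the highest anchor it reaches:

```python
# Hypothetical sketch: map a raw judge score to its score-range label.
SCORE_RANGES = {
    0: "Completely wrong",
    3: "Partially correct with major errors",
    5: "Mostly correct with minor issues",
    7: "Correct with minor omissions",
    10: "Perfectly accurate and complete",
}

def range_label(score):
    """Return the label of the highest anchor not exceeding the score."""
    anchor = max(k for k in SCORE_RANGES if k <= score)
    return SCORE_RANGES[anchor]

print(range_label(8))  # -> "Correct with minor omissions"
```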
## Scoring

### Checklist Mode

```
score = sum(satisfied_weights) / sum(total_weights)
```

### Score-Range Mode

```
score = sum(criterion_score / 10 * weight) / sum(total_weights)
```

### Verdicts
Section titled “Verdicts”| Verdict | Score |
|---|---|
pass | ≥ 0.8 |
borderline | ≥ 0.6 |
fail | < 0.6 |
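The two scoring formulas and the verdict thresholds above can be sketched in Python (hypothetical helper names; agentv's actual implementation may differ):

```python
def checklist_score(results):
    """results: list of (weight, satisfied) pairs, one per criterion."""
    total = sum(w for w, _ in results)
    return sum(w for w, ok in results if ok) / total

def score_range_score(results):
    """results: list of (weight, judge_score) pairs, judge_score in 0-10."""
    total = sum(w for w, _ in results)
    return sum(s / 10 * w for w, s in results) / total

def verdict(score):
    if score >= 0.8:
        return "pass"
    if score >= 0.6:
        return "borderline"
    return "fail"

# Checklist example from above: core-concept (2.0) and partition (1.5)
# satisfied, complexity (1.0) missed -> 3.5 / 4.5 ~= 0.78 -> borderline
print(verdict(checklist_score([(2.0, True), (1.5, True), (1.0, False)])))
```

Note that a `required: true` criterion short-circuits this arithmetic: failing it fails the eval regardless of the weighted score.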
## Auto-Generate Rubrics

Generate rubrics from expected outcomes:

```sh
agentv generate rubrics evals/my-eval.yaml
```

This analyzes each eval case’s `expected_outcome` and creates structured rubric criteria.
## Combining with Other Evaluators

Rubrics work alongside code and LLM judges:

```yaml
evalcases:
  - id: code-quality
    expected_outcome: Generates correct, clean Python code
    input_messages:
      - role: user
        content: Write a fibonacci function
    rubrics:
      - Returns correct values for n=0,1,2,10
      - Uses meaningful variable names
      - Includes docstring
    execution:
      evaluators:
        - name: syntax_check
          type: code_judge
          script: ./validators/check_python.py
```