
# LLM Judges

LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.

Reference an LLM judge in your eval file:

```yaml
execution:
  evaluators:
    - name: semantic_check
      type: llm_judge
      prompt: ./judges/correctness.md
```

The prompt file defines the evaluation criteria and scoring guidelines. It can be a static Markdown template or a TypeScript/JavaScript dynamic template.

Write the evaluation instructions as Markdown. Template variables in double curly braces are interpolated when the prompt is rendered:

judges/correctness.md

```md
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Expected Outcome:** {{expected_outcome}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{candidate_answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```
The following variables are available in judge prompts:

| Variable | Source |
| --- | --- |
| `question` | First user message content |
| `expected_outcome` | Eval case `expected_outcome` field |
| `reference_answer` | Last expected message content |
| `candidate_answer` | Last candidate response content |
| `sidecar` | Eval case sidecar metadata |
| `rubrics` | Eval case rubrics (if defined) |

For dynamic prompt generation, use a TypeScript or JavaScript file that default-exports a function. It receives the template variables and returns the rendered prompt string:

judges/correctness.ts

```ts
export default function ({ question, expected_outcome, candidate_answer }) {
  return `Evaluate whether this response correctly answers the question.
Question: ${question}
Expected: ${expected_outcome}
Response: ${candidate_answer}
Score 1.0 if correct, 0.0 if incorrect.`;
}
```
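The function presumably receives the same set of variables listed in the table above, so a dynamic template can also draw on `rubrics` or `sidecar`. A minimal sketch, assuming `rubrics` arrives as an array of strings (the file name and the rubric shape are illustrative assumptions, not prescribed by AgentV):

```ts
// judges/rubric_check.ts (hypothetical example)
export default function ({ question, candidate_answer, rubrics }) {
  // Turn each rubric item into a bullet the judge must check explicitly.
  const rubricList = (rubrics ?? [])
    .map((r: string) => `- ${r}`)
    .join("\n");

  return `Check the response against each rubric item.
Question: ${question}
Response: ${candidate_answer}
Rubric items:
${rubricList}
Score the fraction of rubric items that are satisfied, from 0.0 to 1.0.`;
}
```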
When an eval case runs, the judge flow is:

1. AgentV renders the prompt template with variables from the eval case.
2. The rendered prompt is sent to the judge target (configured in targets.yaml).
3. The LLM returns a structured evaluation with score, hits, misses, and reasoning.
4. Results are recorded in the output JSONL.
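For orientation, each recorded evaluation roughly corresponds to a record like the one below. The field names come from the list above; the types shown are assumptions:

```ts
// Approximate shape of one recorded judge result; field types are assumptions.
interface JudgeEvaluation {
  score: number;     // 0.0 to 1.0, per the scoring guidelines in the prompt
  hits: string[];    // criteria the candidate answer satisfied
  misses: string[];  // criteria the candidate answer missed
  reasoning: string; // the judge's explanation for the score
}
```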