Code Judges
Code judges are scripts that evaluate agent responses deterministically. Write them in any language, such as Python or TypeScript, or supply any other executable.
Contract
Code judges communicate via JSON on stdin and stdout.

Input (stdin):

    {
      "question": "What is 15 + 27?",
      "expected_outcome": "Correctly calculates 15 + 27 = 42",
      "candidate_answer": "The answer is 42.",
      "reference_answer": "42",
      "sidecar": {}
    }

Output (stdout):

    {
      "score": 1.0,
      "hits": ["Answer contains correct value (42)"],
      "misses": [],
      "reasoning": "Passed 1 check(s)"
    }

| Output Field | Type | Description |
|---|---|---|
| score | number | 0.0 to 1.0 |
| hits | string[] | Criteria that passed |
| misses | string[] | Criteria that failed |
| reasoning | string | Explanation of the score |
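
Because `score` is a number between 0.0 and 1.0, a judge can presumably award partial credit by running several checks and reporting the fraction that passed. The following is a minimal sketch of that idea, not part of the documented API: the `judge` function name, the two checks, and the `sample` payload are all hypothetical, and it assumes the runner accepts fractional scores.

```python
import json


def judge(payload: dict) -> dict:
    """Score a candidate answer with several independent checks (illustrative only)."""
    candidate = str(payload.get("candidate_answer", ""))
    reference = str(payload.get("reference_answer", ""))

    # Hypothetical checks as (description, passed) pairs.
    checks = [
        ("Answer is non-empty", bool(candidate.strip())),
        ("Answer contains the reference value", bool(reference) and reference in candidate),
    ]

    hits = [name for name, ok in checks if ok]
    misses = [name for name, ok in checks if not ok]

    return {
        "score": len(hits) / len(checks),  # fraction of checks that passed
        "hits": hits,
        "misses": misses,
        "reasoning": f"Passed {len(hits)} of {len(checks)} check(s)",
    }


# In a real judge script, the payload would come from json.load(sys.stdin);
# a hard-coded sample is used here so the sketch runs on its own.
sample = {"candidate_answer": "The answer is 42.", "reference_answer": "42"}
print(json.dumps(judge(sample)))
```

Keeping the scoring logic in a pure function that takes and returns dictionaries makes it easy to unit test separately from the stdin/stdout plumbing.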
Python Example

    import json
    import sys

    # Read the evaluation payload from stdin.
    data = json.load(sys.stdin)
    candidate_answer = data.get("candidate_answer", "")

    hits = []
    misses = []

    # Deterministic check: the expected value must appear in the answer.
    if "42" in candidate_answer:
        hits.append("Answer contains correct value (42)")
    else:
        misses.append("Answer does not contain expected value (42)")

    score = 1.0 if hits else 0.0

    # Write the result to stdout as a single JSON object.
    print(json.dumps({
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Passed {len(hits)} check(s)"
    }))

TypeScript Example

    import { readFileSync } from "fs";

    // Read the evaluation payload from stdin.
    const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
    const candidateAnswer: string = data.candidate_answer ?? "";

    const hits: string[] = [];
    const misses: string[] = [];

    // Deterministic check: the expected value must appear in the answer.
    if (candidateAnswer.includes("42")) {
      hits.push("Answer contains correct value (42)");
    } else {
      misses.push("Answer does not contain expected value (42)");
    }

    // Write the result to stdout as a single JSON object.
    console.log(JSON.stringify({
      score: hits.length > 0 ? 1.0 : 0.0,
      hits,
      misses,
      reasoning: `Passed ${hits.length} check(s)`,
    }));

Referencing in Eval Files

    execution:
      evaluators:
        - name: my_validator
          type: code_judge
          script: ./validators/check_answer.py

Testing Locally
Test a code judge by piping JSON to stdin:

    echo '{"question":"What is 2+2?","expected_outcome":"4","candidate_answer":"4","reference_answer":"4","sidecar":{}}' | python validators/check_answer.py
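
For repeated local testing, the same piping step can be scripted. The sketch below is one possible approach, not an official tool: `run_judge` and `validate_output` are hypothetical helper names, and the validation assumes that all four output fields are required, which the contract table suggests but does not state outright.

```python
import json
import subprocess


def validate_output(result: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the output looks valid."""
    problems = []

    score = result.get("score")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("score must be a number")
    elif not 0.0 <= score <= 1.0:
        problems.append("score must be between 0.0 and 1.0")

    for field in ("hits", "misses"):
        value = result.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")

    if not isinstance(result.get("reasoning"), str):
        problems.append("reasoning must be a string")

    return problems


def run_judge(command: list[str], payload: dict) -> dict:
    """Pipe the payload to the judge's stdin and parse its stdout as JSON."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the judge exits non-zero
    )
    return json.loads(completed.stdout)
```

A call such as `run_judge(["python", "validators/check_answer.py"], payload)` followed by `validate_output(...)` would then report any fields that are missing or have the wrong type.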