
Build LLM-as-judge as binary true/false, one judge per failure mode — and validate against human labels

By Hamel Husain · Independent ML consultant and Berkeley PhD researcher · 2026-04-28 · podcast · Evals as error analysis, the benevolent dictator, LLM judges

Tier A · TL;DR
Build LLM-as-judge as binary true/false, one judge per failure mode — and validate against human labels

Claim

LLM-as-judge evals should output a binary pass/fail per failure mode, not a Likert score. Build 4–7 narrow judges in total, not dozens: most failures are fixed by prompt edits and never need a permanent eval. Always validate each judge against human-labelled data with a full confusion matrix, not a single agreement number.
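A minimal sketch of this shape, assuming a generic `llm(prompt) -> str` callable; the failure modes and prompt wording here are hypothetical illustrations, not the podcast's actual evals:

```python
# One narrow binary judge per failure mode. Each judge answers a single
# yes/no question about one failure, never a 1-5 quality score.
# FAILURE_MODES and the prompt wording are hypothetical examples.

FAILURE_MODES = {
    "hallucinated_citation": (
        "Does the answer cite a source that is not present in the context? "
        "Reply with exactly PASS (no hallucinated citation) or FAIL."
    ),
    "ignored_user_constraint": (
        "Does the answer violate an explicit constraint stated by the user? "
        "Reply with exactly PASS (no violation) or FAIL."
    ),
}

def judge_output(llm, answer: str, context: str) -> dict[str, bool]:
    """Run every narrow judge; each verdict is a hard pass/fail."""
    verdicts = {}
    for mode, question in FAILURE_MODES.items():
        raw = llm(f"{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}")
        verdicts[mode] = raw.strip().upper().startswith("PASS")
    return verdicts
```

Because each verdict is boolean, every judge gets its own falsifiable accuracy number against human labels, and a failing judge can be fixed or retired individually.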

Mechanism

A 1-to-5 Likert is "a weasel way of not making a decision" — it produces averages that look reasonable while masking the cases where the judge is wrong. A binary judge forces a real decision and yields a falsifiable accuracy number. A confusion matrix surfaces the off-diagonal cells where a judge that says "pass" 90% of the time can hide near-total failure on the long tail. Without the matrix, teams trust scaffolding that carries no real signal.
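The off-diagonal effect can be shown with a toy validation run (synthetic labels; the helper name `confusion_matrix` is illustrative):

```python
# Sketch: validating a binary LLM judge against human labels with a
# confusion matrix, showing why a single agreement % misleads.
from collections import Counter

def confusion_matrix(human, judge):
    """Count (human, judge) label pairs for a binary pass/fail judge."""
    cells = Counter(zip(human, judge))
    return {
        "TP": cells[("fail", "fail")],  # judge catches a real failure
        "TN": cells[("pass", "pass")],
        "FN": cells[("fail", "pass")],  # off-diagonal: judge misses a failure
        "FP": cells[("pass", "fail")],  # off-diagonal: judge flags a good output
    }

# A judge that almost always says "pass" looks accurate overall...
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 90 + ["pass"] * 9 + ["fail"]  # misses 9 of 10 real failures

m = confusion_matrix(human, judge)
agreement = (m["TP"] + m["TN"]) / len(human)
failure_recall = m["TP"] / (m["TP"] + m["FN"])

print(f"agreement: {agreement:.0%}")            # 91% — looks fine
print(f"failure recall: {failure_recall:.0%}")  # 10% — near-total miss on the long tail
```

The headline agreement number (91%) and the off-diagonal cells (9 missed failures out of 10) describe the same judge; only the matrix exposes the problem.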

Conditions

Holds when:

- The judge detects a specific, nameable failure mode (detection-style: "did the output do X, yes or no?").

Fails when:

- The task is preference- or ranking-style ("which of these outputs is best?"), where pairwise comparison or graded scoring is needed.

Evidence

"1-2-3-4-5 is a weasel way of not making a decision."

"When people lose trust in your evals, they lose trust in you."

Operating rule Hamel and Shreya teach: most products end up with 4–7 LLM judges in total. Always inspect the off-diagonal cells of the confusion matrix; a raw agreement % is misleading on long-tail errors.

— Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

Signals

Counter-evidence

For ranking problems (which of these outputs is best?), pairwise comparison or graded scoring may be needed. The binary rule is conditional on detection-style judges, not preference-style ones.

Cross-references
