a builder's codex

Evals are data analysis — single judge, binary rubrics, error analysis first

Convergence

Three operators (Hamel/Shreya counted as a unit, plus Cat Wu and Reganti at the edges) converge on the same eval-shop pattern: evals are systematic data analysis on traces, not test suites. One trusted-taste judge owns the quality call. Rubrics are binary, with one judge per failure mode. Error analysis comes before metric-building. Open-coding on traces comes before automation.
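To make "error analysis before metric-building" concrete, here is a minimal sketch of tallying hand-written trace notes into failure modes. Everything in it is hypothetical: the trace IDs, the notes, the category names, and the helper `failure_mode_counts` are illustrations, not anything the operators published.

```python
from collections import Counter

# Hypothetical open codes: one free-text note per reviewed trace,
# or None when the reviewer saw nothing wrong.
open_codes = [
    ("t1", "invented a refund policy"),
    ("t2", None),
    ("t3", "ignored the user's date constraint"),
    ("t4", "invented a shipping fee"),
    ("t5", "ignored the user's budget"),
    ("t6", None),
]

# Hypothetical axial codes: a reviewer groups similar notes
# into named failure modes by hand.
axial_map = {
    "invented a refund policy": "hallucinated_policy",
    "invented a shipping fee": "hallucinated_policy",
    "ignored the user's date constraint": "dropped_constraint",
    "ignored the user's budget": "dropped_constraint",
}


def failure_mode_counts(codes, mapping):
    """Count failing traces per failure mode; unmapped notes stay visible."""
    counts = Counter()
    for _trace_id, note in codes:
        if note is not None:
            counts[mapping.get(note, "uncategorized")] += 1
    return counts


counts = failure_mode_counts(open_codes, axial_map)
for mode, n in counts.most_common():
    print(f"{mode}: {n} of {len(open_codes)} traces")
```

The counts only rank which failure modes deserve a judge. No metric or automated check exists yet at this stage.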

Operators

Variation

Implication

Stand up an eval shop with one trusted-taste owner. Sample 100+ traces. Open-code first, axial-code second, automate third. Build each LLM-as-judge as a binary true/false check on one pesky failure mode. Validate every judge against human labels. The production threshold for automation is 100%, not 95%; anything below that stays a human-in-the-loop decision.
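As a rough sketch of "validate every judge against human labels", the snippet below compares one binary judge's verdicts with human verdicts on the same traces. The label lists and the `judge_agreement` helper are assumptions made for illustration; the note itself does not prescribe which statistics to report.

```python
def judge_agreement(human, judge):
    """Compare binary judge verdicts with human labels on the same traces.

    True means "this failure mode is present". Both lists are hypothetical
    inputs, aligned by trace.
    """
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)

    return {
        "agreement": (tp + tn) / len(pairs),
        # Rates are None when the human labels contain no such cases.
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
        # Indices to re-read by hand before trusting the judge.
        "disagreements": [i for i, (h, j) in enumerate(pairs) if h != j],
    }


# Hypothetical labels for eight traces.
human_labels = [True, True, False, False, True, False, False, False]
judge_labels = [True, False, False, False, True, True, False, False]

report = judge_agreement(human_labels, judge_labels)
print(report["agreement"])      # 0.75
print(report["disagreements"])  # [1, 5]
```

Reporting true-positive and true-negative rates separately, rather than raw agreement alone, keeps a judge from looking good simply because most traces pass.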

Sources
