a builder's codex

Appoint one trusted-taste expert as the eval benevolent dictator — committees stall the loop

By Hamel Husain · Independent ML consultant and Berkeley PhD researcher · 2026-04-28 · podcast · Evals as error analysis, the benevolent dictator, LLM judges

Tier B · TL;DR
One trusted arbiter decides what counts as a failure, and the rubric tightens through real reviews instead of stalling in committee debate.

Claim

For LLM eval work, appoint a single person whose taste the team trusts as the benevolent dictator on what counts as a failure. A committee that gets bogged down debating the rubric never ships an eval; a single trusted arbiter ships one, and the rubric tightens through use.

Mechanism

Open coding is judgment work. A committee needs consensus on the definition of "wrong" before any traces get coded, and that conversation rarely converges before the team loses momentum. A single arbiter encodes a coherent taste and writes it down. The rubric then evolves through actual reviews, not through abstract debate. Domain experts — often the product manager — make the best arbiters because they hold both the user perspective and the product reality.
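The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the podcast: the names `Rubric`, `Review`, `record`, and `failure_counts` are hypothetical, and the rule that every failure must cite a written-down category is an assumption made here to show how a rubric can grow through actual reviews.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rubric:
    """Failure categories the arbiter has written down so far (hypothetical structure)."""
    arbiter: str
    categories: Dict[str, str] = field(default_factory=dict)  # name -> definition

    def define(self, name: str, definition: str) -> None:
        # The arbiter alone adds or rewrites a definition; no vote is taken.
        self.categories[name] = definition


@dataclass
class Review:
    """One open-coding judgment on one trace."""
    trace_id: str
    passed: bool
    note: str                       # free-text open-coding note
    category: Optional[str] = None  # required when passed is False


def record(rubric: Rubric, reviews: List[Review], review: Review) -> None:
    """Append one judgment; a failure must cite a category the rubric defines."""
    if not review.passed and review.category not in rubric.categories:
        raise ValueError(f"undefined failure category: {review.category!r}")
    reviews.append(review)


def failure_counts(reviews: List[Review]) -> Counter:
    """Tally failures per category so the most common ones surface first."""
    return Counter(r.category for r in reviews if not r.passed)
```

In use, the arbiter reads a trace, writes a note, and either reuses an existing category or calls `rubric.define(...)` with a new one before recording the failure. The rubric is then just the accumulated set of definitions, shaped by the traces that were actually reviewed.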

Conditions

Holds when:

Fails when:

Evidence

"When you're doing this open coding, a lot of teams get bogged down in having a committee... You can appoint one person whose taste that you trust."

Hamel and Shreya prefer the product manager for the role because PMs hold both user and product context.

— Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

Signals

Counter-evidence

For high-stakes regulated domains, single-person arbitration may not be acceptable. Multiple-arbiter calibration (with disagreement tracked and resolved) is the safer path there.
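One way to track disagreement between two arbiters is sketched below. It is an assumption-laden illustration: the source does not name a statistic, and Cohen's kappa is swapped in here as one common chance-corrected agreement measure for two raters. The function names and the pass/fail label format are hypothetical.

```python
from typing import Dict, List


def disagreements(a: Dict[str, bool], b: Dict[str, bool]) -> List[str]:
    """Trace ids both arbiters labeled (True = pass) but labeled differently."""
    shared = sorted(set(a) & set(b))
    return [t for t in shared if a[t] != b[t]]


def cohen_kappa(a: List[bool], b: List[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of traces rater A passed
    p_b = sum(b) / n  # share of traces rater B passed
    # Agreement expected by chance if the raters labeled independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave every trace the same single label; kappa is
        # undefined here, and treating it as full agreement is a convention
        # chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

The list returned by `disagreements` is the queue the arbiters would sit down and resolve together; kappa only summarizes how far apart they are before that conversation.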

Cross-references
