Claim
Run trace review as a two-stage pipeline: a human samples 100+ traces and writes a free-form note on the first thing wrong with each (open coding); then an LLM groups those notes into failure-mode buckets (axial coding). An LLM cannot do the open-coding pass for you because it lacks product context; humans cannot scale the categorisation pass.
Mechanism
Open coding captures domain-specific failures that an LLM judge would miss, because the LLM has no privileged access to product reality (e.g., "we don't actually offer virtual tours", a hallucination invisible without product context). Axial coding is pure clustering, which LLMs do reliably. The split assigns each task to the actor that can do it, and the resulting pivot table converts qualitative review into quantitative priority.
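A minimal sketch of the two-stage pipeline's data flow. All names are illustrative, not from the source: `open_codes` stands in for the human reviewer's free-form notes, and the hard-coded `labelled` list stands in for whatever LLM call performs the axial-coding pass (hard-coded here so the sketch stays runnable).

```python
from collections import Counter

def pivot(labelled_notes):
    """Axial-coding output -> failure-mode pivot table (category -> count)."""
    return Counter(label for _, label in labelled_notes)

# Stage 1 (human): one free-form note per trace, first thing wrong only.
open_codes = [
    "claims we offer virtual tours (we don't)",
    "asked for move-in date twice",
    "claims we offer virtual tours for unit B",
    "ignored user's stated budget",
]

# Stage 2 (an LLM in practice): each note mapped to a failure-mode bucket.
labelled = [
    (open_codes[0], "hallucinated feature"),
    (open_codes[1], "repeated question"),
    (open_codes[2], "hallucinated feature"),
    (open_codes[3], "ignored constraint"),
]

counts = pivot(labelled)
# Sorting by count is what turns qualitative review into a priority list.
priorities = counts.most_common()
```

Engineering priority then falls out of `priorities` directly: the most frequent bucket is worked on first, regardless of which complaint was loudest.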
Conditions
Holds when:
- A domain expert can read traces and recognise wrong outputs.
- The traces are recent enough that current product context applies.
- The team can tolerate the upfront human time for the first pass.
Fails when:
- Reviewers are not domain experts and their notes mislead the categorisation.
- The traces span product changes such that "wrong" varies by date.
- The team cuts the human step to save time and the LLM hallucinates the categories.
Evidence
"When you're doing this open coding... appoint one person whose taste that you trust."
"I would bet money... if I put that into ChatGPT and asked, 'Is there an error?' it would say, 'No, did a great job.'"
Stopping rule: theoretical saturation, not a fixed count. Stop once additional traces (typically after reviewing 15–60) stop yielding new categories.
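The saturation rule can be operationalised as "no new category in the last K traces". A sketch under that assumption; `saturated` and the choice of `window` are illustrative, not from the source.

```python
def saturated(categories_per_trace, window=20):
    """True once the last `window` reviewed traces introduced no new category.

    `categories_per_trace` is a list, in review order, of the category
    labels assigned to each trace.
    """
    seen = set()
    last_new = -1  # index of the most recent trace that added a category
    for i, cats in enumerate(categories_per_trace):
        for c in cats:
            if c not in seen:
                seen.add(c)
                last_new = i
    return len(categories_per_trace) - 1 - last_new >= window
```

In practice the check runs after each batch of reviewed traces; review continues while it returns False.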
— Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28
Signals
- Failure-mode pivot table covers >80% of observed problems with a small set of clusters.
- Engineering work is prioritised by category counts, not by squeaky-wheel reports.
- The same pipeline runs weekly without each cycle starting from scratch.
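The first signal is mechanically checkable from the pivot table. A sketch, assuming the table is a category-to-count mapping; `top_k_coverage` is an illustrative name, not from the source.

```python
def top_k_coverage(counts, k):
    """Fraction of all observed problems covered by the k largest clusters."""
    total = sum(counts.values())
    top = sum(c for _, c in sorted(counts.items(), key=lambda kv: -kv[1])[:k])
    return top / total

# e.g. three clusters, one dominant: the top cluster alone covers 80%.
coverage = top_k_coverage({"hallucinated feature": 8,
                           "repeated question": 1,
                           "ignored constraint": 1}, k=1)
```

If a small `k` cannot reach the >80% threshold, the categories are likely too fragmented and the axial-coding pass needs another iteration.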
Counter-evidence
Coding-agent teams (Claude Code, Codex) operate with much lighter eval discipline because the developer is also the user; the dogfood loop closes inside one head. That pattern does not generalise to products where the buyer is not the builder.
Cross-references
- Evals are systematic data analysis on your LLM application — start with error analysis, not tests — why this pipeline matters
- Appoint one trusted-taste expert as the eval benevolent dictator — committees stall the loop — who runs the human step
- Build LLM-as-judge as binary true/false, one judge per pesky failure mode — and validate against human labels — what the categories become