Your AI Evals Are Measuring the Wrong Thing
Somewhere between the research paper and the production deploy, something breaks. A team spends weeks tuning a summarization feature, hits 87% accuracy on their holdout set, ships it — and watches 60% of users immediately edit or delete the AI-generated output. The model was "good." The feature wasn't.
This pattern is common enough to have a name in ML circles: the eval gap. The metrics used to validate AI systems in research settings are genuinely useful for comparing models, but they're a poor proxy for the thing PMs actually care about: does this feature work well enough, reliably enough, that real users will trust it and change their behavior because of it?
If you're shipping your first AI feature — or trying to figure out why the one you shipped isn't getting used — the problem is probably not your model. It's your measurement framework.
Why Research Metrics Don't Predict Product Success
Accuracy, F1, BLEU, ROUGE — these metrics were designed to solve a specific problem: comparing models against each other on standardized benchmarks. They're excellent at that job. What they're not designed to do is tell you whether a user will accept an AI suggestion in a real workflow, under real conditions, with real ambiguity in the input.
Three structural reasons explain the gap:
1. Benchmark data is cleaner than production data. Test sets are typically curated, labeled by humans with clear instructions, and drawn from a distribution the model has implicitly been optimized against. Production inputs are messier — typos, ambiguous context, edge cases that never appeared in training. A model that scores 91% on a clean benchmark might degrade to 70% reliability on the long tail of real user queries. That 21-point drop is invisible until you ship.
2. Metrics measure outputs, not outcomes. BLEU score tells you how similar a generated translation is to a reference translation. It says nothing about whether the translation is actually useful to the person reading it, or whether they'd trust it enough to send to a client. Research from Google on neural machine translation found that BLEU improvements didn't reliably correlate with human preference ratings — a gap that has only widened as LLMs produce fluent but factually wrong text.
3. They assume a single "correct" answer. F1 score makes sense when there's a ground truth label. But most AI features in B2B SaaS involve tasks where multiple outputs are acceptable — drafting an email, summarizing a support ticket, suggesting a next step. The question isn't "is this output correct?" It's "is this output good enough that the user will use it rather than ignore it?"
The enterprise AI failure rate makes this concrete: according to Gartner, only about 20% of AI projects that reach pilot stage actually scale to production. The gap between "technically works" and "ships successfully" is where most AI features go to die — and traditional evals don't help you see it coming.
The Three Eval Layers PMs Actually Need
Instead of a single metric, think in three layers. Each answers a different question, and you need signal from all three before you can make a confident ship decision.
Layer 1: Technical Evals — Does It Hallucinate Under Real Conditions?
This is the closest layer to traditional ML metrics, but with a production-relevant twist. You're not asking "what's the accuracy on a holdout set?" You're asking: what does this feature do when it encounters inputs it wasn't designed for?
The specific technical evals that matter for shipped AI features:
- Hallucination rate on real user queries: Sample 50–100 actual inputs from your user base (or expected user base) and manually review outputs for factual errors, invented citations, or made-up data. This is a manual process, and it should be.
- Refusal and failure rate: How often does the model decline to answer, return an error, or produce a clearly nonsensical output? Even a 5% failure rate in a high-frequency workflow is a serious UX problem.
- Latency distribution, not just average: P95 latency matters more than mean latency. If 1 in 20 requests takes 12 seconds, users will notice — even if your average is 2 seconds.
- Behavior on adversarial or off-label inputs: What happens when a user asks something outside the intended scope? Does it fail gracefully or confidently return garbage?
How to sample real user queries before you have real users
If you're pre-launch, you can approximate real user inputs through a few methods:
- Guerrilla testing: Give 10–15 people in your target persona 15 minutes to use the feature and log every query they submit. You'll surface edge cases faster than you'd expect.
- Analogous product mining: Look at public forums, Reddit threads, or support tickets for competing products to understand what inputs users actually generate.
- Synthetic adversarial sets: Ask someone on your team to spend an hour deliberately trying to break the feature — edge cases, ambiguous inputs, out-of-scope requests. This surfaces failure modes that clean test sets miss entirely.
Layer 2: User Evals — Do People Actually Trust It?
This is the layer most PMs skip, because it requires qualitative work that feels less rigorous than a number. It's actually the most predictive layer for whether a feature gets adopted.
The core question is: when the AI produces an output, what does the user do next?
Three behavioral signals to instrument from day one:
- Acceptance rate: What percentage of AI suggestions are used without modification? This is the single most useful leading indicator of feature trust. Cursor's acceptance rate metric for code completions is a good example — teams track it weekly and use drops as an early warning signal.
- Edit distance on accepted suggestions: Even accepted suggestions tell you something. If users are accepting but heavily rewriting, the feature is saving some effort but not as much as it could. This helps you understand whether you have a trust problem or a quality problem.
- Regeneration and rejection rate: How often do users click "try again" or explicitly dismiss the output? A high rejection rate on a feature with good benchmark scores is a clear signal that your test distribution doesn't match your production distribution.
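All three signals reduce to simple aggregates over instrumented user actions. A minimal sketch, assuming a made-up event shape (the field names and the `difflib`-based edit similarity are illustrative choices, not a prescribed schema):

```python
# Sketch: deriving acceptance, edit, and rejection signals from
# hypothetical suggestion events logged by your instrumentation.
from difflib import SequenceMatcher

events = [
    # (suggested_text, final_text, action)
    ("Send the Q3 report by Friday.", "Send the Q3 report by Friday.", "accepted"),
    ("Ping legal about the NDA.", "Email legal to countersign the NDA.", "accepted"),
    ("Schedule a sync.", "", "dismissed"),
    ("Draft follow-up to Acme.", "", "regenerated"),
]

accepted = [e for e in events if e[2] == "accepted"]
acceptance_rate = len(accepted) / len(events)
rejection_rate = sum(e[2] in ("dismissed", "regenerated") for e in events) / len(events)

# Edit similarity on accepted suggestions: 1.0 means used verbatim,
# lower values mean the user rewrote more of the suggestion.
similarities = [SequenceMatcher(None, s, f).ratio() for s, f, _ in accepted]
avg_similarity = sum(similarities) / len(similarities)

print(f"acceptance: {acceptance_rate:.0%}, rejection: {rejection_rate:.0%}")
print(f"avg edit similarity on accepted: {avg_similarity:.2f}")
```

The second accepted event is the interesting one: it counts toward acceptance but its low similarity score reveals heavy rewriting, which is the quality-versus-trust distinction the text describes.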
Layer 3: Business Evals — Does It Change What You Care About?
This layer connects AI feature performance to the outcomes your company actually tracks. It's easy to skip because the feedback loop is slow — you might not see business impact for 30–60 days after a feature ships. But skipping it means you can't make a credible case for continued investment, and you can't catch cases where a feature is being used but isn't actually helping.
The relevant questions here depend on your product, but common ones for B2B SaaS:
- Does the AI feature reduce time-to-complete on the core workflow it's embedded in? (Measure task completion time before and after for the same user cohort.)
- Does it correlate with retention? Do users who engage with the AI feature churn at lower rates than those who don't?
- Does it reduce support volume for the workflows it touches?
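The retention question above is just a cohort comparison. A minimal sketch with invented user records, and worth repeating: this shows correlation only, not that the feature caused the retention difference:

```python
# Sketch: comparing churn between users who engaged with the AI feature
# and those who didn't. The user records are hypothetical.
users = [
    # (used_ai_feature, churned)
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, False), (False, True), (False, False),
]

def churn_rate(cohort):
    # Booleans sum as 0/1, so this counts churned users in the cohort.
    return sum(churned for _, churned in cohort) / len(cohort)

ai_users = [u for u in users if u[0]]
non_ai_users = [u for u in users if not u[0]]

print(f"churn (AI users):     {churn_rate(ai_users):.0%}")
print(f"churn (non-AI users): {churn_rate(non_ai_users):.0%}")
```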
What business evals give you
- Defensible ROI data for continued investment
- Early signal if a feature is used but not valuable
- Alignment between AI team work and company-level goals
What they can't tell you
- Why a metric moved (correlation, not causation)
- Whether the model itself needs improvement vs. the UX
- Fast feedback — cycles are 30–90 days minimum
Building a Lightweight Eval Suite Without a Data Science Team
The good news: you don't need a dedicated ML evaluation team to run a useful eval suite. You need a process and a few hours a week.
Here's a minimal viable eval workflow that doesn't require specialized tooling:
Weekly spot check (2 hours)
Sample 25–30 real user inputs from the past week (pull from logs, or use a simple logging layer if you don't have one yet). Manually review the AI outputs against a simple rubric: accurate, acceptable, needs editing, wrong, harmful. Tally the distribution. Track it week over week in a spreadsheet. The trend matters more than any single week's numbers.
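The weekly tally is spreadsheet-simple, but it can also be a five-line script. A sketch with made-up review labels matching the rubric above:

```python
# Sketch: tallying one week of manual rubric reviews. The review
# labels come from the rubric in the text; the data is invented.
from collections import Counter

this_week = ["accurate", "accurate", "acceptable", "needs_editing",
             "accurate", "wrong", "acceptable", "accurate"]

tally = Counter(this_week)
total = len(this_week)
for label in ["accurate", "acceptable", "needs_editing", "wrong", "harmful"]:
    # Counter returns 0 for labels that never appeared this week
    print(f"{label:>14}: {tally[label]:>2} ({tally[label] / total:.0%})")
```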
Monthly behavioral review (2 hours)
Pull acceptance rate, edit rate, and rejection rate from your instrumentation. Compare to the prior month. If acceptance rate drops more than 5 percentage points, treat it as a regression signal and dig into which input types are causing it.
Rubric design (one-time, 1–2 hours)
Before you start spot-checking, align with your team on what "good" looks like. Write down 5–7 criteria specific to your feature — not generic quality criteria, but the things that matter for your use case. For a contract summarization feature, that might be: correctly identifies parties, correctly identifies key dates, doesn't invent clauses that aren't present, summary is under 150 words. Specific rubrics produce consistent reviews; vague ones don't.
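Encoding the rubric as explicit pass/fail checks is one way to keep reviewers consistent. A sketch using the contract-summarization criteria above; the output record shape and check logic are illustrative assumptions:

```python
# Sketch: the contract-summarization rubric from the text expressed as
# explicit checks. Field names ("summary", "expected_parties", etc.)
# are hypothetical.
def review(output: dict) -> dict:
    summary = output["summary"]
    return {
        "identifies_parties": all(p in summary for p in output["expected_parties"]),
        "identifies_key_dates": all(d in summary for d in output["expected_dates"]),
        "under_150_words": len(summary.split()) <= 150,
        # "no invented clauses" still needs a human eye; None flags it for review
        "no_invented_clauses": None,
    }

result = review({
    "summary": "Acme and Beta Corp agree to a 12-month term starting 2025-01-01.",
    "expected_parties": ["Acme", "Beta Corp"],
    "expected_dates": ["2025-01-01"],
})
print(result)
```

The `None` entry is deliberate: some criteria can be checked mechanically, but hallucination-style criteria still require the manual review the text calls for.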
When to Ship Without Perfect Evals
The honest answer is that you will almost always ship before your evals are complete, because perfect evals don't exist and waiting for them is a different kind of failure. The question is: what minimum signal do you need before each stage of rollout?
A reasonable threshold structure:
Before limited beta (5–10% of users or a waitlist)
You need: technical evals on at least 50 real-input samples showing hallucination rate below your threshold (define this in advance — for high-stakes domains, that might be <2%; for low-stakes, <10% might be acceptable), and at least one round of manual review showing no harmful or embarrassing outputs. You do not need business impact data at this stage.
Before broad rollout (50–100% of users)
You need: 2–4 weeks of behavioral data from beta showing acceptance rate above a baseline you set in advance, and no sustained regression in your weekly spot checks. You also need a rollback plan — a feature flag that lets you dial back to 0% in under an hour if something goes wrong.
Before removing the safety net (e.g., removing "AI-generated" disclaimers or human review)
You need: business eval data showing the feature is driving the outcome it was designed for, and acceptance rate that's been stable for at least 30 days.
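Writing the gates down as code forces you to pick thresholds in advance instead of negotiating them in the ship meeting. A sketch of the first two gates, using the low-stakes thresholds suggested above (the function names and exact cutoffs are assumptions to adapt):

```python
# Sketch: the rollout gates from the text as explicit checks.
# Threshold values here follow the low-stakes suggestions in the text;
# set your own, in advance, for your feature.
def ready_for_beta(hallucination_rate: float, samples_reviewed: int,
                   harmful_outputs: int) -> bool:
    return (samples_reviewed >= 50
            and hallucination_rate < 0.10  # use < 0.02 for high-stakes domains
            and harmful_outputs == 0)

def ready_for_broad_rollout(acceptance_rate: float, baseline: float,
                            weeks_of_beta_data: int,
                            has_rollback_flag: bool) -> bool:
    return (weeks_of_beta_data >= 2
            and acceptance_rate >= baseline
            and has_rollback_flag)

print(ready_for_beta(hallucination_rate=0.06, samples_reviewed=60,
                     harmful_outputs=0))
print(ready_for_broad_rollout(acceptance_rate=0.42, baseline=0.35,
                              weeks_of_beta_data=3, has_rollback_flag=True))
```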
The eval gap between research metrics and production success isn't a technical problem — it's a prioritization problem. Teams measure what's easy to measure (benchmark accuracy) rather than what's predictive (user acceptance, hallucination rate on real inputs, behavioral change). The shift required isn't adding more rigor; it's redirecting the rigor you already have toward the questions that actually determine whether your feature ships successfully and stays shipped.
The PM who can walk into a ship/no-ship meeting with acceptance rate trends, a hallucination rate from real user inputs, and a clear rubric for what "good enough" looks like will make better decisions than one armed with an F1 score — and will have a much easier time explaining those decisions to the rest of the team.