AI Demos Lie. You Need AI R&D.

You can’t know in advance how much AI can do for your feature, or how reliably. So try before you develop — and make the try an eval, not a demo.

TL;DR

You don’t know what AI can do until you measure it. A model’s capability on your task — correctness, reliability, the cost of making it good enough — can’t be known in advance. Correctness is a distribution — one run samples it once. And the cost that matters isn’t the per-call price on the provider’s page; it’s price × volume × the technique you’ll turn out to need — a demo shows only the first factor.
So try before you develop. The trying is the R&D. A phase before the build that settles the decisions that matter — feasible or not, which model, which technique, how much autonomy — with evidence, inside a budget set for finding out.
A demo is a try that lies. It proves the feature can work once; production runs it millions of times. The only try that counts as evidence is an eval — a repeatable check, against real cases, gated on a number fixed up front.
Let the eval drive. Spend complexity, money, and risk only when the evidence forces you to.

1. What “AI R&D” means

AI R&D is finding out, with evidence, whether — and how — a model can do your feature’s job reliably enough, within budget, before you commit to building the feature. It exists because of one fact about foundation models: what one can do for your specific task can’t be known in advance — not from a benchmark, not from the marketing page, not from a demo. You find out by measuring, or you find out in production.

Two clarifications, because the phrase gets read the wrong way in two directions.

First, this is not R&D of AI — the work frontier labs do to invent models. You train nothing; you adapt a model that already exists. The context is AI engineering, not machine learning, and the discipline is the same whether you call GPT-4o-mini through an API or run an open model on your own hardware.

Second, R&D is a phase, not a permanent state — but it’s the decisions that end, not the measurement:

R&D         → build and measure the capability: the model, prompt, technique
              that must clear the bar
Development → wrap the settled capability in a product: integration,
              error handling, fallbacks, rollout
The eval    → survives both: offline as the regression gate for any later
              change, online as production monitoring (Section 4)

When the measurement catches the world moving — a model deprecation, a new failure pattern in traffic — the affected decision reopens and the loop runs again, on just that decision. Decisions before the build; measurement forever.

Skipping this phase is the default: build what works in a demo, ship it, and find out in production that “works” was never defined. So start with why the default try — the demo — doesn’t count.

2. The demo is a try that lies

The industry base rate is brutal — MIT’s 2025 State of AI in Business found 95% of enterprise generative AI pilots deliver no measurable business impact; Gartner predicts more than 40% of agentic AI projects will be cancelled by the end of 2027. The reported causes are varied — integration, data readiness, ROI, org change — so read those as a base rate, not a verdict. This post is about one culprit that is inside your control, and it shows up as the same pattern every time: the demo passes. The product fails.

2.1 Why the demo passes and the product doesn’t

Start with the property everyone names first: AI features are non-deterministic. The same input can produce a different output. A traditional feature, given the same input twice, answers the same twice — that’s what makes “it works” in a demo a meaningful claim. An AI feature makes no such promise. And non-determinism is only the most visible of several properties that each break a demo differently.

A demo tests one run. Production runs the feature millions of times, against inputs nobody hand-picked, on a schedule nobody controls.

Demo       → can it work once?
Production → how often does it work — and is that often enough?

Almost nothing about passing the first question tells you anything about the second.

That is the precise sense in which a demo lies. It’s honest about what happened — the feature did work, once. It lies as evidence: about what one run means for the next million.
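The gap between the two questions is plain arithmetic. A minimal sketch, assuming every run succeeds independently with the same probability (a simplification; the function names and numbers are illustrative):

```python
def chance_demo_looks_green(p_success: float, demo_runs: int) -> float:
    """Probability that every one of `demo_runs` independent runs succeeds."""
    return p_success ** demo_runs


def expected_failures(p_success: float, production_calls: int) -> float:
    """Expected number of failed calls at production volume."""
    return (1.0 - p_success) * production_calls


# A feature that is right only half the time still passes a three-run demo
# one time in eight.
print(chance_demo_looks_green(0.5, 3))  # 0.125

# A feature that is right 99% of the time still fails roughly 10,000 times
# per million calls.
print(expected_failures(0.99, 1_000_000))
```

A green demo is weak evidence in both directions: bad features pass it often, and good-looking rates still hide thousands of failures at volume.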

2.2 Five failures a demo never catches

A demo is built, by construction, to avoid the cases that break things. What slips through:

Fail on repeat. Run the exact same input again. It breaks the second time, or the tenth, or the ten-thousandth. Nothing about run #1 predicted this.
Fail on model upgrade. Your provider silently swaps the model behind the API, or deprecates the one you tuned against. Nothing in your code changed; everything in your output did.
Fail on cost. It works. It’s also $4 a call when your unit economics need it to be four cents. A demo never has a P&L attached.
Fail on data growth. The demo ran on ten curated examples. Production sees a million, and the long tail of weird, real inputs you never imagined.
Fail silently. The most dangerous one. No error is thrown. The model gives a confident, fluent, completely wrong answer, and nothing in the system flags it. A demo’s happy-path input was never going to trigger it.

Notice these don’t share a single root:

#1 is non-determinism.
#2 is model drift you don’t control.
#3 is unit economics a demo never carries.
#4 is an open input space no demo samples.
#5 is the absence of any signal that a fluent answer is wrong.

“I tried it and it worked” can’t see any of them — and because the causes differ, so do the fixes.

A try that counts has to see all five. Building it takes two steps: fix the target before you start (Section 3), then run the loop that does the deciding (Section 4).

3. Before the try: set the target, not the solution

Every AI feature starts as a requirement with a gap underneath it — the distance between what the model does out of the box and what the feature needs. Closing the gap is a short list of decisions: which model, which technique, how much autonomy. The discipline of this section: don’t make those choices from preference. Fix the target, list the options — and don’t choose yet. The evidence does the choosing.

3.1 Gather context from stakeholders

Before anything technical:

What’s the real job this feature does?
Who is affected if it’s wrong?
What’s the thing that, if it breaks, breaks trust — not just a metric?

You can’t set a bar you haven’t heard the stakes for.

3.2 Identify the gap

List every place the model, used as-is, falls short. Two kinds of gap:

Structural — knowable from facts before you run anything: the model has never seen your private data; the context window can’t hold your history; the frontier model’s latency breaks your budget.
Behavioural — only shows up when you try: name where it will fail, specifically — not “it might hallucinate” but “it will misclassify intent X as intent Y when the user phrases it like this.”
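A structural gap of the context-window kind can be checked with arithmetic before any model call. A rough sketch, with every number and name an illustrative assumption (real token counts come from the tokenizer of the model you’re using):

```python
from dataclasses import dataclass


@dataclass
class StructuralCheck:
    context_window_tokens: int   # from the model's documentation
    reserved_for_output: int     # tokens kept free for the reply
    prompt_overhead_tokens: int  # system prompt, instructions, tool schemas


def history_fits(check: StructuralCheck, sessions: int,
                 avg_tokens_per_session: int) -> bool:
    """True if the full history fits in the window alongside prompt and reply."""
    available = (check.context_window_tokens
                 - check.reserved_for_output
                 - check.prompt_overhead_tokens)
    return sessions * avg_tokens_per_session <= available


# Illustrative numbers only: a 128k-token window, sessions of about 3k tokens.
check = StructuralCheck(context_window_tokens=128_000,
                        reserved_for_output=4_000,
                        prompt_overhead_tokens=2_000)
print(history_fits(check, sessions=200, avg_tokens_per_session=3_000))  # False
print(history_fits(check, sessions=30, avg_tokens_per_session=3_000))   # True
```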

Informal trying is exactly the right tool for the behavioural kind — poke at the model, watch it break, write the breakage down. Just know what you’re producing: hypotheses for the eval to test, not evidence that anything works. A demo’s sin isn’t existing; it’s being promoted from hypothesis to proof.

3.3 The options: technique, model, autonomy

There is never one way to close a gap. Three choices define every solution — technique, model, autonomy — and the first has a natural order. Not by price: an agent’s per-call bill can exceed a fine-tuned small model’s, and a basic agent can take less engineering than a production retrieval pipeline. What orders them is commitment — how much you must build before the eval can judge the technique, and how much you throw away if it says no. Prompt: edit a string. RAG: rebuild an index. Agents: re-architect behaviour. Fine-tuning: recollect data and retrain.

prompt & context engineering → RAG → agents → fine-tuning
   least commitment ─────────────────────► most commitment

One more thing the order is not: a list of substitutes. Each technique closes a different kind of gap:

RAG → a knowledge gap: the model hasn’t seen your data.
Agents → a multi-step-action gap: the task needs tools and sequences.
Fine-tuning → a behaviour gap: style, format, an embedded skill.

And they compose — an agent can contain RAG as a tool. The gaps you named in 3.2 nominate the candidate techniques; the order says which candidate to try first when more than one could close the same gap.
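The nomination step can be sketched as a lookup plus a sort. This is an illustration, not a rule engine: the gap names, the `CLOSES` table and `candidates_for` are made up for the example, and listing prompt and context engineering as a candidate for every gap is this sketch’s assumption about where the order starts.

```python
# Commitment order from the text: least committed first.
COMMITMENT_ORDER = ["prompt", "rag", "agents", "fine-tuning"]

# Which techniques can plausibly close which kind of gap.
# "prompt" appears everywhere as the default first try (an assumption here).
CLOSES = {
    "knowledge": ["prompt", "rag"],
    "multi-step-action": ["prompt", "agents"],
    "behaviour": ["prompt", "fine-tuning"],
}


def candidates_for(gaps: list[str]) -> list[str]:
    """Techniques nominated by the named gaps, sorted least committed first."""
    nominated = {technique for gap in gaps for technique in CLOSES[gap]}
    return [t for t in COMMITMENT_ORDER if t in nominated]


print(candidates_for(["behaviour", "knowledge"]))
# ['prompt', 'rag', 'fine-tuning']
```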

The other two choices have no order. You set them at the start and can revisit them at any point:

Model — paid frontier models (easy to call, expensive per call) versus open, smaller, or encoder models (cheap per call, more work to make good enough).
Autonomy — how much you hand to the model unsupervised, versus keeping a human or a deterministic check in the loop.

The order is a default, not a law — evidence can reorder it for your case. What’s non-negotiable is the discipline: start with the first candidate technique in the order, the cheapest model, and as much supervision as the job allows. Escalate one choice at a time — the next technique, a bigger model, more autonomy — and only when the evidence proves what you have can’t clear the bar.

The failure mode to watch for is over-escalation — reaching for agents or fine-tuning because they’re more impressive, when a sharper prompt and a real eval would have done the job for a tenth of the cost. Teams burn budget not by under-building but by building more machine than the problem needed.

Escalating never makes complexity go away, either — it moves it somewhere harder to see. Prompt to RAG: the complexity leaves a prompt you can read and reappears in a retrieval pipeline you have to measure. RAG to fine-tuning: it moves into training data and the evaluation of that data. Not a reason never to escalate — the reason the eval matters more with each step. The more committed the technique, the more of the complexity lives somewhere you can’t eyeball.

3.4 Set the bar — where the business and the engineering meet

The single most important decision in the sequence. “Good enough” has to become a number — a pass rate, a precision/recall target, a tolerance for a specific error — not a feeling anyone has after watching a demo.

Setting the number is a risk-tolerance decision: how often can this feature be wrong before it costs a customer, a regulator’s attention, a headline? The tolerance belongs to the business — no engineer can tell you how much risk your brand can absorb. The metric that expresses it gets shaped with engineering, because a bar nobody can measure is a wish. The business owns the appetite; the number is written jointly; engineering builds against it.

Correctness isn’t the only dimension. Cost per call and latency are part of the bar — budgets, in numbers, set before the build. Right-but-$4-a-call misses the bar as surely as wrong; naming the budgets is what lets the eval reject a technique on price or speed, not just accuracy.
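Written down, a bar with budgets is a small record plus a comparison. A sketch with hypothetical names (`Bar`, `Measured`, `misses`, `monthly_cost_usd`) and illustrative numbers; the p95 latency metric is an assumption, so use whichever percentile your feature actually cares about:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    min_pass_rate: float          # correctness: fraction of eval cases that must pass
    max_cost_per_call_usd: float  # cost budget per call
    max_p95_latency_s: float      # latency budget


@dataclass(frozen=True)
class Measured:
    pass_rate: float
    cost_per_call_usd: float
    p95_latency_s: float


def misses(bar: Bar, m: Measured) -> list[str]:
    """Names of every budget the measurement misses; empty means the bar is cleared."""
    failed = []
    if m.pass_rate < bar.min_pass_rate:
        failed.append("pass_rate")
    if m.cost_per_call_usd > bar.max_cost_per_call_usd:
        failed.append("cost")
    if m.p95_latency_s > bar.max_p95_latency_s:
        failed.append("latency")
    return failed


def monthly_cost_usd(price_per_model_call: float, model_calls_per_request: int,
                     requests_per_month: int) -> float:
    """Price x technique multiplier x volume."""
    return price_per_model_call * model_calls_per_request * requests_per_month


bar = Bar(min_pass_rate=0.98, max_cost_per_call_usd=0.04, max_p95_latency_s=2.0)
right_but_expensive = Measured(pass_rate=0.995, cost_per_call_usd=4.00,
                               p95_latency_s=1.2)
print(misses(bar, right_but_expensive))  # ['cost']
```

The `model_calls_per_request` factor is where the technique shows up in the bill: a multi-step technique that makes five model calls per request costs five times the per-call price.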

One more number: the budget for the try itself — how much time and money the business will spend on finding out. R&D without a stop condition isn’t R&D; it’s a hobby. This budget is what makes the loop in Section 4 finite, and its third outcome (Section 5) an honest answer instead of an admission of defeat.

Skip this step and you get the conversation every postmortem has: “well, it seemed to work.” Seemed, to whom, how many times, against what?

3.5 Commit to evidence before build

Two things must exist before the feature gets built, not after:

A golden dataset — real cases and hard cases, gathered from the people who’ll be affected, each paired with the answer you’d actually want.
A repeatable check that runs the feature against that dataset and produces the number from 3.4.
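One possible shape for the golden dataset is a JSONL file, one case per line, validated on load so an unusable case fails loudly before any eval runs. The field names, the `real`/`hard` tag and the sample rows below are assumptions for illustration, not a prescribed schema:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    expected: str   # the answer you would actually want
    kind: str       # "real" or "hard"
    source: str     # who contributed the case


def load_golden_set(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line and reject unusable cases early."""
    cases, seen = [], set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        case = GoldenCase(**json.loads(line))
        if not case.expected:
            raise ValueError(f"{case.case_id}: no expected answer")
        if case.kind not in {"real", "hard"}:
            raise ValueError(f"{case.case_id}: kind must be 'real' or 'hard'")
        if case.case_id in seen:
            raise ValueError(f"{case.case_id}: duplicate id")
        seen.add(case.case_id)
        cases.append(case)
    return cases


# Made-up sample rows.
SAMPLE = """
{"case_id": "c1", "input_text": "I want my money back", "expected": "refund", "kind": "real", "source": "support"}
{"case_id": "c2", "input_text": "nice weather today", "expected": "none", "kind": "hard", "source": "support"}
"""
golden = load_golden_set(SAMPLE)
print(len(golden))  # 2
```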

How much to invest scales with the stakes, and the bar does the scaling: a low-stakes internal tool with a human checking every output earns a small golden set and a cheap try; a customer-facing feature acting unsupervised does not. Every AI feature starts as R&D; not every R&D phase is the same size.

By the end of Section 3 you have the options (candidate techniques in order of commitment, a model choice, an autonomy setting), a bar (the number), a stop condition (the try budget) — and nothing chosen. The choosing is done by evidence. That’s Section 4.

4. Eval-driven development: how the try runs

The eval is not a one-time test before shipping. During R&D it’s the engine the phase runs on — what the research literature calls evaluation-driven development: a governing function, not a terminal checkpoint. It’s also what keeps spend under control. Without it you’re flying blind, in Chip Huyen’s phrase, and flying blind is expensive in both directions: shipping something broken, or over-building something that didn’t need it.

4.1 The loop

make the least committed move the options allow
→ run the eval against the golden dataset
→ read the score against the bar from 3.4

pass  → decision settled; freeze it, hand it to development —
        shipping happens there
fail  → smallest change the evidence points to: sharpen the prompt,
        change model or autonomy, or escalate one technique — then run again
spent → techniques or try budget exhausted: "not feasible at this bar" —
        the third outcome, Section 5

That’s the whole mechanism. The eval is the referee: a failed score is the only valid reason to spend more, and the try budget caps the total. The eval decides where the next dollar goes; the budget decides whether there is one.
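The loop is small enough to sketch directly. This is one way to write it, not the only one: `rd_loop`, the move names and the stand-in scores are hypothetical, and a real `run_eval` would execute the eval against the golden dataset and report what the run cost.

```python
from typing import Callable, Iterable, Optional


def rd_loop(moves: Iterable[str],
            run_eval: Callable[[str], tuple[float, float]],
            bar: float,
            try_budget_usd: float) -> tuple[str, Optional[str]]:
    """Walk the moves, least committed first.

    `run_eval(move)` returns (pass_rate, cost_of_this_eval_run_usd).
    Returns ("pass", move), or ("spent", None) when moves or budget run out.
    """
    spent = 0.0
    for move in moves:
        if spent >= try_budget_usd:
            break                      # try budget exhausted before this move
        pass_rate, run_cost = run_eval(move)
        spent += run_cost
        if pass_rate >= bar:
            return ("pass", move)      # decision settled; hand it to development
        # fail: fall through to the next smallest change
    return ("spent", None)             # "not feasible at this bar"


# Stand-in scores for illustration only.
fake_scores = {"prompt v1": 0.50, "prompt v2": 1.00}
verdict = rd_loop(
    moves=["prompt v1", "prompt v2", "bigger model"],
    run_eval=lambda move: (fake_scores.get(move, 0.0), 25.0),
    bar=0.99,
    try_budget_usd=100.0,
)
print(verdict)  # ('pass', 'prompt v2')
```

The order of `moves` carries the discipline: the caller lists the least committed change first, and the loop never skips ahead.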

4.2 What an eval is built from

Four parts: dataset → feature → scorer → score.

The dataset is the golden set from 3.5: real and hard cases, each with a known right answer.
The feature is the system under test — model, prompt, context, tools, whatever you’ve built so far — run against every item.
The scorer judges each output:

Scorer	Reach for it when	Cost
Rule-based	correctness is checkable by code — a regex, a schema, an exact value	cheapest, fastest; use whenever you can
LLM-as-judge	correctness is semantic but a clear rubric captures it	cheap — but validate it against human judgment before trusting it
Human	genuinely subjective, high-stakes, or validating the judge	slowest, most expensive, sometimes the only honest option

The score is the aggregate — the pass rate you compare against the bar.
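The four parts map onto a few lines of code. A minimal sketch with rule-based scorers; `toy_feature` is a stand-in for the system under test (a real feature would call the model), and the dataset rows are made up:

```python
import re
from typing import Callable

Case = tuple[str, str]  # (input, expected answer)


def exact_match(output: str, expected: str) -> bool:
    """Rule-based scorer: normalised exact value."""
    return output.strip().lower() == expected.strip().lower()


def matches_pattern(pattern: str) -> Callable[[str, str], bool]:
    """Rule-based scorer factory: the output must fully match a regex."""
    compiled = re.compile(pattern)
    return lambda output, _expected: compiled.fullmatch(output.strip()) is not None


def run_eval(dataset: list[Case],
             feature: Callable[[str], str],
             scorer: Callable[[str, str], bool]) -> float:
    """dataset -> feature -> scorer -> score (the pass rate)."""
    if not dataset:
        raise ValueError("empty dataset")
    passed = sum(scorer(feature(inp), expected) for inp, expected in dataset)
    return passed / len(dataset)


def toy_feature(text: str) -> str:
    # Stand-in for the system under test.
    return "refund" if "money back" in text else "none"


dataset = [
    ("I want my money back", "refund"),
    ("nice weather today", "none"),
    ("please reimburse me", "refund"),   # the toy feature misses this one
    ("cancel my order", "cancel"),       # and this one
]
print(run_eval(dataset, toy_feature, exact_match))  # 0.5
```

Swapping the scorer changes what “pass” means without touching the dataset or the feature, which is why the three are kept as separate arguments.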

4.3 Offline and online — both

Offline — run before shipping, against the golden dataset: the gate for “are we ready.” It outlives R&D as the regression gate: every change on your side — a prompt tweak, a dependency bump, a model version you chose to adopt — re-runs it. And it re-runs on a schedule too, because the one change that never announces itself is your provider swapping the model behind the API.
Online — run after shipping, against real traffic, because no golden set anticipates the real world. This is the eval as production monitoring — and where the golden set earns its next version: failures caught online go back into the dataset, so the next offline run is harder to pass than the last.
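Both roles can be sketched in a few lines. The names below (`regression_gate`, `fold_in_online_failures`, the `max_drop` tolerance) are hypothetical, and representing the golden set as a dict from input to expected answer is a simplification:

```python
def regression_gate(pass_rate: float, bar: float, baseline: float,
                    max_drop: float = 0.01) -> tuple[bool, str]:
    """Block a change if the score misses the bar or drops against the last accepted run."""
    if pass_rate < bar:
        return False, f"below bar: {pass_rate:.3f} < {bar:.3f}"
    if baseline - pass_rate > max_drop:
        return False, f"regressed: {baseline:.3f} -> {pass_rate:.3f}"
    return True, "ok"


def fold_in_online_failures(golden: dict[str, str],
                            failures: list[tuple[str, str]]) -> int:
    """Add (input, corrected answer) pairs caught in production; return how many were new."""
    added = 0
    for text, corrected in failures:
        if text not in golden:
            golden[text] = corrected
            added += 1
    return added


print(regression_gate(0.97, bar=0.98, baseline=0.99))
# (False, 'below bar: 0.970 < 0.980')

golden = {"I want my money back": "refund"}
new_cases = fold_in_online_failures(golden, [("please reimburse me", "refund")])
print(new_cases, len(golden))  # 1 2
```

The same gate can run from CI on every change and from a scheduler with no change at all; the scheduled run is the one that notices a silently swapped model.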

Online isn’t optional polish — it’s the fix for a trap built into offline eval. Goodhart’s Law: when a measure becomes a target, it stops being a good measure. Optimize hard enough against a fixed golden set and the offline score rises while the feature gets no better at the cases the set didn’t anticipate. The golden set tells you if you’re ready to ship. Only production traffic tells you if you were right.

One rule makes all of it mean something: gate on a pass rate measured at scale, never on one green run — and “at scale” means two different things. Repetition — the same input many times — catches non-determinism: the case that passes once but fails one time in fifty. Coverage — many different inputs — catches the open input space: the weird real cases your golden set never imagined. Ten thousand reruns of one case measure reliability; ten thousand distinct cases measure reach. Don’t mistake one for the other.
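The two meanings of “at scale” produce two different numbers from the same run log. A sketch, assuming each eval run is recorded as a `(case_id, passed)` pair; the function names and the sample log are illustrative:

```python
from collections import defaultdict

Run = tuple[str, bool]  # (case_id, passed)


def per_case_pass_rate(runs: list[Run]) -> dict[str, float]:
    """Repetition: the pass rate of each case across its reruns."""
    totals, passes = defaultdict(int), defaultdict(int)
    for case_id, passed in runs:
        totals[case_id] += 1
        passes[case_id] += int(passed)
    return {cid: passes[cid] / totals[cid] for cid in totals}


def coverage_pass_rate(runs: list[Run], min_case_rate: float) -> float:
    """Coverage: the fraction of distinct cases whose own pass rate clears `min_case_rate`."""
    rates = per_case_pass_rate(runs)
    return sum(r >= min_case_rate for r in rates.values()) / len(rates)


runs = ([("a", True)] * 50                      # reliable case
        + [("b", True)] * 49 + [("b", False)]   # fails one time in fifty
        + [("c", True)] * 50)                   # reliable case
print(per_case_pass_rate(runs)["b"])                 # 0.98
print(coverage_pass_rate(runs, min_case_rate=0.99))  # 2 of 3 cases clear 99%
```

Pooling all 150 runs into one pass rate would report about 99.3% and hide that case `b` is the one failing.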

4.4 Does the method answer the problem?

Check it against Section 2’s five failures — each row names the distinguishing catch; the scorer underlies them all:

Failure a demo can’t see	Caught by
Fail on repeat	repetition — the same input rerun until the pass rate is a measurement, not luck
Fail on model upgrade	the regression gate — re-run on your changes and on a schedule; drift shows up as a score, not an incident
Fail on cost	the bar — cost per call is scored like correctness; right-but-too-expensive fails
Fail on data growth	coverage offline, online eval after — production’s weird cases feed the golden set
Fail silently	the scorer — every output judged against a known answer or validated rubric; fluent-but-wrong scores wrong

Five failures a demo can’t see; five named mechanisms. That is what “the try must be an eval” means in full.

5. Two tries, and a third outcome

The loop hands down three verdicts, all of them honest: yes to this attempt (settle it); no to this attempt (make the smallest change and run again); no at this bar (stop). Only the first and the last end the loop; the middle one steers it. The first two cases below are real; the third gets named because nobody writes it up.

5.1 The try that said yes: intent recognition

The feature: recognise user intent from free text, to route a conversation correctly.

First attempt: a prompt written as clean conditional logic — if the message looks like this, it’s intent X; if like that, intent Y. It read like good engineering. The eval was a small set of critical cases — including the one that mattered most, an irrelevant input that must not trigger an intent switch — each rerun 10,000 times, scored by exact match. The conditional prompt passed about half the time.

The fix wasn’t a bigger model. Same model, GPT-4o-mini. The prompt was rewritten — not as conditions, but as a plain statement of the goal: describe what the model is trying to determine, and let it reason rather than pattern-match against rules. Same cases, same 10,000 reruns each: zero failures.

Be careful with what that number means. Ten thousand clean runs don’t prove the prompt never fails. They show, with 95% confidence, that the failure rate on each of those cases is below roughly three in ten thousand (the statistician’s rule of three: zero failures in n runs bounds the rate at about 3/n). That’s the honest form of the claim, and notice its shape: a distribution claim, the only kind an AI feature can make.
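The rule of three is a shortcut for an exact calculation, and the two agree closely at this sample size. A short check, assuming independent runs (the function names are illustrative):

```python
def upper_bound_zero_failures(n_runs: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the failure rate after n clean runs.

    Solves (1 - p) ** n_runs = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


def rule_of_three(n_runs: int) -> float:
    """The shortcut: roughly 3 / n at 95% confidence."""
    return 3.0 / n_runs


n = 10_000
print(f"exact:         {upper_bound_zero_failures(n):.6f}")  # about 0.0003
print(f"rule of three: {rule_of_three(n):.6f}")              # 0.000300
```

The bound shrinks only as fast as the run count grows: ten times tighter costs ten times the runs.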

A demo that happened to pass the case two or three times would have looked exactly as shippable — and would have been running a coin flip in production. This is repetition doing its job: turning a lucky green into an honest 50%, or an honest bound. (Coverage — whether the prompt holds on inputs nobody hand-picked — is the separate axis, and online eval’s job.) And for a genuinely semantic task, a conceptual instruction beat a programmatic one, on the same model, by a wide margin.

5.2 The try that said no: the RAG rejection

The feature: long-term memory for an ongoing conversation — recall relevant context from earlier sessions.

The rule says escalate only when the technique in hand can’t clear the bar. Here prompt and context engineering failed on a structural gap, the kind 3.2 says you can know from facts before running anything: the context window simply couldn’t hold enough history. So the next candidate for a knowledge gap was tried: RAG, pulling past context by similarity search.

The eval found two disqualifying problems, both measured, neither guessed. First, no similarity threshold gave acceptable recall and precision at the same time: every threshold that caught enough relevant context also caught too much irrelevant context. Second, no latency budget was left for a reranker that might have fixed it.
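The threshold problem has a recognisable shape in a sweep. The sketch below uses made-up similarity scores and made-up targets, not data from the case above, to show what “no threshold works” looks like when relevant and irrelevant chunks overlap:

```python
def precision_recall(scored: list[tuple[float, bool]],
                     threshold: float) -> tuple[float, float]:
    """`scored` holds (similarity, is_relevant) for each candidate chunk."""
    retrieved = [rel for sim, rel in scored if sim >= threshold]
    total_relevant = sum(rel for _, rel in scored)
    true_pos = sum(retrieved)
    precision = true_pos / len(retrieved) if retrieved else 1.0
    recall = true_pos / total_relevant if total_relevant else 1.0
    return precision, recall


def acceptable_thresholds(scored, thresholds, min_precision, min_recall):
    """Every threshold that meets both targets; an empty list means none does."""
    ok = []
    for t in thresholds:
        p, r = precision_recall(scored, t)
        if p >= min_precision and r >= min_recall:
            ok.append(t)
    return ok


# Made-up scores: relevant (True) and irrelevant (False) chunks interleave.
scored = [(0.91, True), (0.88, False), (0.84, True), (0.80, False),
          (0.79, True), (0.75, False), (0.70, True), (0.62, False)]
print(acceptable_thresholds(scored, [0.6, 0.7, 0.8, 0.9],
                            min_precision=0.8, min_recall=0.75))  # []
```

An empty list is a measured result: raising the threshold buys precision only by giving up recall, and no setting meets both targets.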

The decision: de-escalate — replace RAG with deliberate, explicit context construction. Not a consolation prize — a decision made on the same evidence standard as 5.1, pointing the other way. This is where the R&D framing earns its keep: a well-supported “no” is a win. In research, a negative result that arrives before the spend is a success.

5.3 The third outcome: the options run out

Neither case above hit it, but the loop has one more exit, and the definition of AI R&D isn’t complete without it: the last candidate technique has been tried, or the try budget is spent, and the bar still isn’t cleared. The honest reading is not “the team failed.” It is “this feature is not feasible at this bar” — which leaves exactly two moves, both belonging to the business: renegotiate the bar, back in 3.4, with evidence about what relaxing it would buy; or don’t build the feature.

That answer is R&D’s most valuable product, offered at the lowest price it will ever be available for. The same discovery can always be made later, in production — paid for in incidents, in churn, and in the headline the bar existed to prevent.

6. Close

You can’t know in advance what AI can do for your feature, or how reliably. A demo asks “can it work once?” An eval asks “how often does it work — and is that often enough?” Try before you develop — and make the try one whose answer you can trust.

Demos lie. Evals decide.

Sources

MIT, The State of AI in Business 2025 (NANDA initiative) — 95% of enterprise GenAI pilots fail to deliver measurable impact.
Gartner — more than 40% of agentic AI projects predicted to be cancelled by end of 2027.
Chip Huyen, AI Engineering (O’Reilly) — evaluation as the central bottleneck; without an eval pipeline, you’re flying blind.
Xia et al., Evaluation-Driven Development of LLM Agents, 2024 — eval as a continuous governing function spanning offline and online use.