AI & ML · April 29, 2026 · 9 min read

What it takes to add AI to a real product (not a demo)

Building an LLM demo takes a weekend. Shipping AI inside a product that 50,000 customers depend on is a different problem. Here's what changes.

The Pinnacle team
Engineering

Anyone with a credit card and a free afternoon can build a chatbot that "knows" their company. Paste 50 documents into a vector database, wire up an LLM, hand-write a system prompt that says "you are a helpful assistant," demo it to the leadership team, get a standing ovation.

That demo is not a product. We have shipped the productionized version of that demo about 30 times now, and the gap between the demo and the production system is larger than the gap between the empty repo and the demo.

Here's what changes.

Latency stops being free

Demo latency is fine. The user asks a question, waits four seconds, gets an answer. They are impressed. In production, those four seconds compound into bounced sessions and abandoned forms.

Production latency requires three things that demos rarely have:

  • Streaming responses. Users tolerate slow if they see progress. They abandon if the screen sits blank. Streaming is not free, especially behind a CDN; you need to configure your edge correctly or your stream becomes a 4-second pause followed by a 200-byte burst.
  • Retrieval latency budgets. A naive RAG pipeline does a vector search, then a re-rank, then a prompt assembly, then the LLM call. Each step adds 200-800 ms. Budget them.
  • Provider failover. Every major LLM provider has 30-minute outages every few months. If your product cannot fail over, your product is down whenever your provider is down. A minimal failover sketch follows this list.
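
Failover is mostly plumbing. Here is a minimal sketch in TypeScript, assuming each provider is wrapped in a client that exposes one chat call; the names, timeout, and ordering are placeholders rather than any particular vendor SDK.

```typescript
// Try providers in order; abort a slow call and move on to the next one.
// Provider names and the chat signature are assumptions for illustration.
type ChatFn = (prompt: string, signal: AbortSignal) => Promise<string>;

interface Provider {
  name: string;
  complete: ChatFn;
}

async function withTimeout<T>(fn: (signal: AbortSignal) => Promise<T>, ms: number): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fn(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

async function chatWithFailover(prompt: string, providers: Provider[], timeoutMs = 10_000): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await withTimeout((signal) => provider.complete(prompt, signal), timeoutMs);
    } catch (err) {
      lastError = err; // log it, then try the next provider in the list
      console.warn(`provider ${provider.name} failed, failing over`, err);
    }
  }
  throw lastError; // every provider failed: surface the error to the caller
}
```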

Quality stops being subjective

In a demo, the leadership team eyeballs five questions and says "looks great." In production, your support team sees 500 wrong answers a week and starts forwarding them to engineering with the subject line "WTF."

You need three things to keep quality high in production:

  1. An evaluation set. Twenty to fifty real questions with real expected answers, scored by humans. Every change to your prompt, your retrieval, your model gets evaluated against the same set. Without this, you are flying blind.
  2. A feedback loop. Thumbs up and thumbs down on every answer, written to a database. Once a week, look at the thumbs-down answers. Patterns emerge fast.
  3. A way to override. When the LLM consistently gets a specific question wrong, you need a way to short-circuit the LLM for that question with a hand-written answer. We call this the "frozen answers" table. Almost every production system needs one; there's a sketch after this list.
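
What the frozen-answers lookup can look like, as a sketch: a hand-curated table checked before the model is ever called. The exact-match normalization and the table contents here are assumptions; a real system might match against the same embeddings it already uses for retrieval.

```typescript
// Check the curated table first; only fall through to the LLM on a miss.
// Table contents and helper names are illustrative.
const frozenAnswers = new Map<string, string>([
  ["how do i cancel my subscription", "Go to Settings, then Billing, then Cancel plan."],
]);

function normalize(question: string): string {
  return question.toLowerCase().replace(/[^\w\s]/g, "").trim();
}

async function answer(question: string, callLlm: (q: string) => Promise<string>): Promise<string> {
  const frozen = frozenAnswers.get(normalize(question));
  if (frozen) return frozen; // short-circuit: the curated answer wins
  return callLlm(question);
}
```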

Cost stops being a rounding error

Demo costs are not real. You make a hundred calls, spend two dollars, move on. Production costs scale with usage, and AI usage scales nonlinearly with adoption. We have seen monthly bills go from $300 to $30,000 in eight weeks.

Three levers control cost:

  • Model choice. The expensive frontier model is rarely the right choice for every call. We typically route 80% of traffic to a smaller, cheaper model and reserve the frontier model for the 20% of queries that need it (see the routing sketch after this list).
  • Caching. Identical or near-identical queries should not hit the LLM twice. A semantic cache with a high similarity threshold can cut bills in half.
  • Token discipline. Long system prompts and long retrieved contexts are quietly expensive. Trim aggressively. Measure your average prompt length weekly.
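
A sketch of the routing lever, assuming a cheap heuristic is enough to decide which queries need the bigger model; the thresholds, keywords, and model names are placeholders, and plenty of teams use a small classifier model instead of a heuristic.

```typescript
// Route simple queries to the small model; reserve the frontier model for hard ones.
// Thresholds, keywords, and model names are assumptions for illustration.
interface RoutedCall {
  model: "small-model" | "frontier-model";
  prompt: string;
}

function routeQuery(query: string, retrievedChars: number): RoutedCall {
  const looksHard =
    query.length > 400 ||                                   // long, multi-part questions
    retrievedChars > 8_000 ||                                // lots of context to reason over
    /compare|explain why|step[- ]by[- ]step/i.test(query);   // analytical phrasing

  return { model: looksHard ? "frontier-model" : "small-model", prompt: query };
}
```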

Safety stops being optional

A demo can hallucinate. A production system that hallucinates an answer about a customer's account, a medical question, or a price will get you sued, regulated, or fired.

What we ship instead of "smart and unconstrained":

  • Refusal patterns. The model knows what it does not know. If a question is outside the supported domain, it says so and routes to a human.
  • PII handling. No PII goes into logs, prompts, or training data. We audit this with regex sweeps and log redactors.
  • Output validation. Structured outputs are validated against a Zod schema before being shown to the user. If validation fails, retry or fall back. Never display unvalidated JSON; a sketch follows this list.
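
A sketch of that validation step with Zod's safeParse; the schema fields are illustrative rather than any specific product's contract.

```typescript
import { z } from "zod";

// The model's structured output must pass this schema before it reaches the UI.
// Field names are assumptions for illustration.
const AnswerSchema = z.object({
  answer: z.string().min(1),
  sources: z.array(z.string().url()),
  confidence: z.enum(["high", "medium", "low"]),
});

function parseModelOutput(raw: string): z.infer<typeof AnswerSchema> | null {
  let json: unknown;
  try {
    json = JSON.parse(raw);
  } catch {
    return null; // not even JSON: the caller retries or falls back
  }
  const result = AnswerSchema.safeParse(json);
  return result.success ? result.data : null; // never show unvalidated output
}
```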

Operations stops being a side concern

Demos don't have ops. Production systems do. You need:

  • A dashboard with request volume, latency percentiles, error rate, cost per day, and average tokens per query.
  • Alerting on cost spikes, latency regressions, and error rate jumps.
  • Logs that let you reconstruct what happened on any specific query within 30 seconds.
  • A way to roll back a prompt change in one command (a versioned-prompt sketch follows this list).
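
The last item is easier when prompts live in versioned config with an explicit active pointer. A sketch, with placeholder IDs and an in-memory store standing in for wherever you actually keep prompts:

```typescript
// Prompts as versioned config: rolling back means flipping the active pointer.
// IDs, contents, and the in-memory store are placeholders.
const promptVersions: Record<string, string> = {
  "support-bot@3": "You are a support assistant. Answer only from the provided context.",
  "support-bot@4": "You are a support assistant. Answer only from the provided context. Cite sources.",
};

let activePromptId = "support-bot@4";

function getSystemPrompt(): string {
  return promptVersions[activePromptId];
}

function rollbackPrompt(toId: string): void {
  if (!(toId in promptVersions)) throw new Error(`unknown prompt version: ${toId}`);
  activePromptId = toId; // e.g. rollbackPrompt("support-bot@3")
}
```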

If you don't have these, you don't have a production AI system. You have a demo that happens to be accessible from the internet.

The shorter version

Building the demo is 20% of the work. The other 80% is making the demo trustworthy under load, predictable in cost, defensible in court, and recoverable when something goes wrong.

If you are six weeks into "just adding a chatbot" and the launch keeps slipping, this is why. The demo was the easy part.

We do this kind of work. Tell us what you're building and we'll be honest about whether AI is the right lever for your problem.

Tags
#ai #llm #rag #production
