Building a customer-service agent that does not hallucinate
An agent that hallucinates in customer service does two kinds of damage. The first is the wrong answer. The second is the loss of operator trust that follows. Once the operator stops trusting the agent's outputs, every flag becomes a manual review and the agent stops earning its keep. The agents that survive operator review do not stop hallucinating. They get better at saying when they do not know.
Grounding beats prompting
Most teams reach for prompt engineering when an agent gives a wrong answer. A different prompt, a few more examples, a stricter system instruction. This works at low volume and breaks at high volume. The reliable lever is retrieval design. A model that has been handed the right three paragraphs from your knowledge base before answering will outperform a model that was prompted into a corner. The prompt is the last twenty percent of the answer. The retrieval set is the first eighty.
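A minimal sketch of that retrieve-then-answer flow, in Python. The in-memory keyword retriever and the prompt template are illustrative stand-ins, not a recommendation of any particular search stack; in production the retriever is a vector or hybrid search over your own knowledge base.

    KNOWLEDGE_BASE = [
        "Refunds go back to the original payment method within 5 business days.",
        "Subscriptions renew automatically unless cancelled 24 hours before renewal.",
        "Password reset links are sent to the account email and expire after one hour.",
    ]

    def retrieve(question: str, top_k: int = 3) -> list[str]:
        # Toy scorer: rank passages by word overlap with the question.
        # A real deployment swaps this for vector or hybrid search.
        q_words = set(question.lower().split())
        ranked = sorted(
            KNOWLEDGE_BASE,
            key=lambda p: len(q_words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

    def build_prompt(question: str, passages: list[str]) -> str:
        # The retrieval set does the first eighty percent of the work;
        # the instruction at the end is the last twenty.
        context = "\n".join(f"- {p}" for p in passages)
        return (
            "Answer using only the passages below. If they do not contain "
            f"the answer, say you do not know.\n\nPassages:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )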
The retrieval set
A focused retrieval set is built from three sources: real ticket history (with personal information stripped), the policy library, and the product knowledge base. The size at which retrieval starts to fail varies by content type. For a typical service queue the knee in the curve is around fifty thousand passages. Below that, retrieval is reliable. Above that, retrieval quality drops and the set needs to be partitioned by topic and routed to the right shard before the model is called. Architects who skip the partitioning step end up with an agent that is correct on the easy questions and creative on the hard ones.
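A sketch of the partition-and-route step, assuming three topic shards and a keyword router. The shard names and keyword lists are invented for illustration; in practice the router is a small classifier, and each shard stays below the size at which retrieval quality drops.

    SHARDS: dict[str, list[str]] = {
        "billing": [],   # invoices, refunds, renewals
        "shipping": [],  # delivery, tracking, returns
        "account": [],   # login, security, profile
    }

    TOPIC_KEYWORDS = {
        "billing": {"refund", "invoice", "charge", "renewal"},
        "shipping": {"delivery", "tracking", "return", "package"},
        "account": {"password", "login", "email", "profile"},
    }

    def route(question: str) -> str:
        # Toy router: pick the topic whose keywords overlap the question most.
        words = set(question.lower().split())
        return max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))

    def retrieve_sharded(question: str, top_k: int = 3) -> list[str]:
        # Retrieval runs against one routed shard, never the whole corpus.
        shard = SHARDS[route(question)]
        q_words = set(question.lower().split())
        ranked = sorted(
            shard,
            key=lambda p: len(q_words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]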
The handoff rule
When the model is not confident, it must hand off to a human, not improvise. A handoff rule has three parts: a confidence threshold, a fallback message, and a queue routing rule. The threshold is set by sampling production traffic and finding the point at which operator-flagged errors cluster. The fallback message is short, human, and never apologises for the agent. The routing rule sends the case to the right human queue with the model's reasoning attached so the human does not start from scratch. Build the handoff before you build the answer.
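A sketch of the three parts as one function. The threshold value, the queue names, and the shape of the confidence score are all assumptions for illustration; as above, the real threshold comes from sampling production traffic, not from a default.

    from dataclasses import dataclass

    HANDOFF_THRESHOLD = 0.72  # illustrative; set it where operator-flagged errors cluster

    FALLBACK_MESSAGE = (
        "I want you to get the right answer on this one, so I am passing "
        "it to a colleague who can help."
    )  # short, human, never apologises for the agent

    QUEUE_BY_TOPIC = {
        "billing": "billing-tier2",
        "shipping": "logistics",
        "account": "account-security",
    }  # hypothetical queue names

    @dataclass
    class Handoff:
        queue: str
        message: str
        reasoning: str  # the model's reasoning travels with the case

    def maybe_handoff(topic: str, confidence: float, reasoning: str) -> Handoff | None:
        # Below the threshold the agent hands off; it never improvises.
        if confidence >= HANDOFF_THRESHOLD:
            return None
        return Handoff(
            queue=QUEUE_BY_TOPIC.get(topic, "general"),
            message=FALLBACK_MESSAGE,
            reasoning=reasoning,
        )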
Tuning against the operator
The operator, not the data scientist, is the ground truth for what counts as a correct response. A weekly tuning cycle against operator-flagged calls is the right cadence. Each week the operator picks the ten worst calls from the previous week, the engineering team diagnoses each one (retrieval, prompt, threshold, or handoff) and ships the fix. In our experience, after eight weeks the worst-call rate has typically dropped by seventy to eighty percent. The agent is not hallucinating less because it knows more. It is better at not answering questions it should not answer.
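A sketch of the weekly review record, assuming one row per flagged call. The four diagnosis categories are the ones named above; the field names and the breakdown helper are illustrative.

    from collections import Counter
    from dataclasses import dataclass
    from enum import Enum

    class Diagnosis(Enum):
        RETRIEVAL = "retrieval"   # wrong or missing passages
        PROMPT = "prompt"         # right passages, wrong instruction
        THRESHOLD = "threshold"   # should have handed off and did not
        HANDOFF = "handoff"       # handed off to the wrong queue or badly

    @dataclass
    class WorstCall:
        call_id: str
        week: str  # e.g. "2026-W02"
        diagnosis: Diagnosis
        fix_shipped: bool

    def diagnosis_breakdown(calls: list[WorstCall]) -> Counter:
        # Where the week's ten worst calls cluster tells you which lever to pull.
        return Counter(c.diagnosis for c in calls)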
A model is not the system. A model plus retrieval, plus a handoff rule, plus an operator review cycle, is the system.
Allow the agent to say it does not know
The agent that does not hallucinate is the one that is allowed to say it does not know. Build the not-knowing into the system before the knowing. Score the agent on its handoff rate as well as its answer rate. Reward operator escalations that catch a wrong answer. An agent that hands off ten percent of the time and is right on the ninety percent it answers is a better agent than one that answers everything and is right on eighty percent.
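A sketch of that comparison as a score, under the assumption that a wrong answer costs more than a handoff. The cost weights are invented; the two calls at the bottom are the two agents from the paragraph above.

    def agent_score(answered: int, correct: int, handed_off: int,
                    wrong_cost: float = 3.0, handoff_cost: float = 1.0) -> float:
        # Reward correct answers; penalise wrong ones harder than escalations.
        wrong = answered - correct
        return correct - wrong_cost * wrong - handoff_cost * handed_off

    # Hands off 10 of 100, right on all 90 it answers: scores 80.0.
    print(agent_score(answered=90, correct=90, handed_off=10))
    # Answers all 100, right on 80: scores 20.0.
    print(agent_score(answered=100, correct=80, handed_off=0))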
Close
If your team is shipping a customer-service agent, the first build target is the handoff path, not the answer quality. Get that right and the rest is tuning. If this maps to something you are scoping, the next step is a written proposal: request one and we will scope the build with the handoff path written first.