
Why We Don't Just Trust the Model

Feb 3

5 min read

Joe Marlo

The case for grounding, retrieval, and verification in enterprise AI



Hallucinations are not a bug we're waiting for OpenAI or Anthropic to fix. They're a fundamental property of how large language models work. Understanding why they happen is the first step toward building systems that work reliably for your business.

Why Language Models Hallucinate

Large language models are, at their core, next-token prediction machines. They're trained to predict what word comes next given all the words that came before. This is called causal language modeling, and it's the foundation of GPT, Claude, and most other modern LLMs.


The training process has a subtle but important consequence: models are rewarded for confident, fluent responses rather than honest uncertainty. OpenAI's own research explains why: training and evaluation procedures fundamentally reward guessing over acknowledging uncertainty. As the researchers put it, when asked for someone's birthday, guessing "September 10th" has a 1-in-365 chance of being right, while saying "I don't know" guarantees zero points on an evaluation benchmark.
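The scoring asymmetry the researchers describe is easy to check with a little arithmetic. Under a binary-graded benchmark, guessing a specific date has a small but positive expected score, while abstaining scores nothing, so any training signal that maximizes benchmark score favors the guess:

```python
# Expected benchmark score under binary grading (1 point if exactly right).
# A model that guesses a specific birthday is right 1 time in 365;
# a model that says "I don't know" is marked wrong every time.
p_correct_guess = 1 / 365

expected_score_guessing = p_correct_guess * 1   # small, but positive
expected_score_abstain = 0                      # "I don't know" earns nothing

# Any training or evaluation signal that maximizes this score
# therefore rewards confident guessing over honest uncertainty.
print(expected_score_guessing > expected_score_abstain)  # True
```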


There are other contributing factors that only amplify the problem: training data is never complete, the model's context window is finite, and the probabilistic nature of token generation means there's always some chance of producing incorrect information. Recent research goes so far as to argue that hallucinations are an innate limitation of the architecture itself.


How bad is the problem in practice? It depends heavily on the domain and task. A 2025 study by Mount Sinai medical researchers found hallucination rates ranging from 50% to 83% across different LLMs when summarizing clinical cases. In legal contexts, Stanford researchers found that models hallucinate 58% to 82% of the time on legal questions. These numbers should give anyone pause before deploying a raw model in a high-stakes enterprise context.

How Grounding Changes the Equation

The solution is not to wait for better models. The solution is to change what the model has access to when it generates a response.


Grounding refers to connecting an AI system's outputs to verifiable data sources at inference time. Rather than asking the model "What were our Q3 sales figures?" and hoping it remembers something from training (it won't, because your proprietary data wasn't in its training set), you retrieve the actual sales data from your database and provide it alongside the question. The model's job shifts from recalling facts to synthesizing information you've already verified.


To date, the most common implementation has been Retrieval-Augmented Generation (RAG). The basic architecture involves four steps:


  1. User asks a question

  2. The system searches your knowledge base for relevant documents and ranks them by relevance

  3. Retrieved documents are injected into the prompt alongside the question

  4. The model generates a response grounded in the retrieved context
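The four steps above can be sketched in a few lines of Python. This is an illustrative toy, not a production system: the knowledge base is a hardcoded list with made-up figures, retrieval is naive keyword overlap rather than vector search, and the final LLM call is left as a placeholder:

```python
# Toy RAG pipeline: retrieve -> rank -> inject -> generate.
# The documents and figures below are invented for illustration.
KNOWLEDGE_BASE = [
    "Q3 sales totaled $4.2M, up 8% from Q2.",
    "The returns policy allows refunds within 30 days.",
    "Q3 marketing spend was $600K.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, docs: list[str]) -> str:
    """Inject retrieved documents into the prompt alongside the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "What were our Q3 sales figures?"
prompt = build_prompt(question, retrieve(question))
# In a real system, the prompt would now go to the model:
# answer = llm.generate(prompt)
```

A production version swaps the keyword overlap for embedding similarity over a vector database, but the shape of the pipeline is the same.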


The effectiveness is real, if imperfect. Stanford researchers found that RAG-based legal research tools reduced hallucination rates to 17-33%, compared to 58-82% for general-purpose LLMs on legal queries, roughly cutting errors in half. In enterprise deployments, results can be even better: one production system reduced hallucination rates from 14-21% down to 2-4% on structured workflow generation tasks.


Platforms like Contextual AI have built their entire product around this approach. Their RAG system jointly optimizes document understanding, retrieval, and language modeling as a unified system rather than stitching together separate components. Their benchmarks show a 5.4% improvement in end-to-end RAG performance over the next best baseline. 

One important caveat: RAG reduces hallucinations about facts in your retrieved documents, but if your knowledge base contains errors, the model will confidently repeat them. Grounding is only as good as the data you ground in.

The Emerging Role of MCP

A newer development is the Model Context Protocol (MCP), introduced by Anthropic in November 2024. MCP is an open standard that provides a universal way to connect AI models to data sources and tools, similar to how USB-C standardized device connections.


Before MCP, connecting an AI model to your Salesforce instance, your internal wiki, and your analytics database required building three separate integrations with custom code for each. This created a combinatorial M×N problem: connecting M models to N data sources required M × N custom integrations.
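The arithmetic behind that scaling problem is stark, assuming each integration is roughly the same amount of work:

```python
# Custom integrations needed: every model wired to every data source.
models = 4        # e.g. the LLM vendors your teams use
data_sources = 6  # CRM, wiki, analytics DB, ticketing, email, file storage

point_to_point = models * data_sources  # each pairing is bespoke
with_standard = models + data_sources   # each side implements the protocol once

print(point_to_point, with_standard)  # 24 vs 10
```

Adding a seventh data source costs six more integrations in the first scheme and exactly one in the second.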


MCP addresses this with a standardized protocol. Build an MCP server once for your data source, and any MCP-compatible model can connect to it. 
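To make the "build once" idea concrete, here is a plain-Python stand-in for what an MCP-style server exposes: named tools with typed inputs that any compliant client can discover and call. This illustrates the pattern only; the real protocol speaks JSON-RPC over stdio or HTTP, and a real server would use the official MCP SDK rather than this hypothetical `ToolServer` class:

```python
# Illustrative stand-in for an MCP-style server: tools are registered once,
# then any client can list and call them by name. Not the actual protocol.
from typing import Callable

class ToolServer:
    def __init__(self, name: str):
        self.name = name
        self._tools: dict[str, Callable] = {}

    def tool(self, fn: Callable) -> Callable:
        """Decorator: register a function as a callable tool."""
        self._tools[fn.__name__] = fn
        return fn

    def list_tools(self) -> list[str]:
        return sorted(self._tools)

    def call(self, tool_name: str, **kwargs):
        return self._tools[tool_name](**kwargs)

server = ToolServer("crm")

@server.tool
def lookup_contract(customer_id: str) -> dict:
    """Fetch live contract terms instead of letting the model guess them."""
    fake_db = {"c-001": {"term": "12 months", "renewal": "auto"}}
    return fake_db.get(customer_id, {})

# Any compatible client could now discover and invoke the tool:
print(server.list_tools())
print(server.call("lookup_contract", customer_id="c-001"))
```

The point is the shape of the interface: the data-source owner writes `lookup_contract` once, and every model that speaks the protocol gets it for free.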


Think of MCP as infrastructure for the grounding approaches described above. A RAG system needs to retrieve documents from somewhere: your internal wiki, your CRM, your analytics database. Before MCP, each of those connections required custom integration code. MCP standardizes those connections, making it easier to build and maintain the retrieval layer that RAG depends on.


For enterprise applications, this matters because grounding is only useful if it's connected to the right data. MCP reduces the engineering burden of keeping those connections current, which means your RAG system can pull from more sources with less maintenance overhead. The protocol doesn't replace RAG; it makes RAG more practical to implement at scale.

Designing Systems, Not Deploying Models

The key mindset shift is this: we're not deploying a model. We're designing a system where the model is one component.


When building AI applications for clients, the architecture typically includes:


  • Data retrieval layer: Vector databases, keyword search, and semantic ranking to surface relevant information from your existing systems. This is where the grounding happens.

  • Tool use and function calling: Connecting the model to calculators, databases, APIs, and code execution so it retrieves live data rather than generating it from memory. When a model calls your CRM to look up a customer's contract terms instead of guessing, it can't hallucinate those terms.

  • Prompt engineering: Carefully constructed system prompts that constrain the model's behavior, specify output formats, and include explicit instructions about uncertainty.

  • Guardrails and validation: Output checking to catch obvious errors, citation requirements so users can verify claims, and fallback behaviors when confidence is low.

  • Monitoring and feedback loops: Logging interactions to identify failure patterns, measuring accuracy over time, and continuously improving retrieval quality.
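Several of these layers are simple to sketch. Here is one hypothetical guardrail from the validation layer: require the model's answer to cite a document that was actually retrieved, and fall back to human escalation when it can't. The function names, ID format, and fallback message are illustrative assumptions, not any specific product's API:

```python
# Hypothetical output guardrail: accept an answer only if every citation
# points at a document that was actually retrieved; otherwise escalate.
FALLBACK = "I'm not confident in an answer. Routing this to a support agent."

def validate_answer(answer: str, cited_ids: list[str],
                    retrieved_ids: list[str]) -> str:
    # Reject answers with no citations at all.
    if not cited_ids:
        return FALLBACK
    # Reject citations that don't correspond to retrieved documents.
    if not all(cid in retrieved_ids for cid in cited_ids):
        return FALLBACK
    return answer

retrieved = ["doc-14", "doc-92"]
validate_answer("Refunds are allowed within 30 days [doc-14].",
                ["doc-14"], retrieved)          # passes through unchanged
validate_answer("Refunds are always allowed.",
                [], retrieved)                  # no citation: falls back
```

A check this simple won't catch a wrong claim that cites the right document, which is why it sits alongside monitoring rather than replacing it.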


The model itself is the least interesting part, engineering-wise. What matters is the infrastructure around it that ensures the model has access to accurate, relevant information and that its outputs are appropriately constrained and verified. This echoes the old chestnut that data scientists spend 80% of their time engineering their data and only 20% modeling.

What This Means in Practice

For a client asking "Can AI help us answer customer questions about our products?", the answer isn't "Yes, GPT-5 is really good now." The answer is a system design conversation:


  • Where does your product information live? 

  • How do we keep it current? 

  • What happens when a customer asks about a product feature that doesn't exist? 

  • How do we handle questions the system isn't confident about? 

  • What's the escalation path to a human?


The hallucination problem doesn't go away because we choose the right model. It gets managed through careful system design that acknowledges the model's limitations and builds appropriate safeguards.


None of this is to say AI can't deliver substantial value for enterprise applications. It absolutely can. But that value comes from treating it as a component in a well-designed system rather than a magic box that just knows things. The clients who get the most value from AI are the ones who understand this distinction.


The goal isn't perfection. It's building fault-tolerant systems.


Joe Marlo

Director of Data Science

Lander Analytics


Subscribe to our Substack, or sign up below for our monthly emails, for practical AI strategies for your organization: what to build, what to avoid, and how to make systems reliable in the real world.


Work with us: If you want help identifying the right first workflow, building a permissioned knowledge base, or training your team to ship responsibly, reach out at info@landeranalytics.com.


About the author: Joe Marlo is Director of Data Science at Lander Analytics, where he designs agentic workflows, statistical models, and interactive frontends that put rigorous analysis into production.


© 2026 Lander Analytics
