
How We Built a Real-Time Claim Verification Engine

GroundTruth Team

January 5, 2025

When we set out to build GroundTruth, we had one core technical challenge: verify every factual claim in an AI-generated response against a knowledge base, and do it in under two seconds. Here is how we did it.

The architecture overview

Our verification pipeline has five stages: claim extraction, evidence retrieval, claim-evidence matching, risk scoring, and safe rewriting. Each stage is optimized for latency, and stages run in parallel where possible.

The entire pipeline is stateless (from the caller's perspective) and exposed via a single REST endpoint. You POST the AI-generated draft, and we return the verified result with full evidence.
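As a sketch, the request and response bodies might look like this. The endpoint field names, knowledge base ID, and evidence text below are illustrative assumptions, not the actual API contract:

```python
import json

# Illustrative request/response shapes for the verification endpoint.
# Field names ("draft", "knowledge_base_id", "verdict", ...) are
# assumptions for illustration only.
request_body = {
    "draft": "You can return items within 60 days for a full refund.",
    "knowledge_base_id": "kb_acme_support",
}

response_body = {
    "risk": "high",
    "claims": [
        {
            "text": "You can return items within 60 days for a full refund.",
            "verdict": "contradiction",
            "evidence": "Returns are accepted within 30 days of purchase.",
        }
    ],
    "rewrite": "You can return items within 30 days for a full refund.",
}

print(json.dumps(response_body, indent=2))
```

Because the pipeline is stateless, the caller holds no session: every POST carries the full draft and the knowledge base to verify it against.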

Stage 1: Claim extraction

The first challenge is breaking a natural-language response into atomic, independently verifiable claims. A sentence like "You can return items within 60 days for a full refund and we offer free shipping on orders over $50" contains two distinct factual claims that need to be verified separately.

We use an LLM for this step, with carefully tuned prompts that instruct it to extract only factual claims (ignoring greetings, opinions, and hedging language). We experimented with rule-based approaches and smaller models but found that GPT-class models produce significantly better decompositions, especially for compound sentences.
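A minimal sketch of this step, assuming the LLM is prompted to emit one claim per line (the prompt wording and output format here are illustrative, not our production prompt):

```python
# Illustrative extraction prompt; the production prompt is more involved.
EXTRACTION_PROMPT = """\
Extract every independently verifiable factual claim from the text below.
Ignore greetings, opinions, and hedging language.
Output one claim per line, with no numbering or commentary.

Text: {text}
"""

def parse_claims(llm_output: str) -> list[str]:
    """Split the model's line-delimited output into atomic claims."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]

# The compound sentence from above decomposes into two atomic claims.
sample_output = (
    "You can return items within 60 days for a full refund.\n"
    "Free shipping is offered on orders over $50.\n"
)
claims = parse_claims(sample_output)
print(claims)
```

Each claim then flows through the rest of the pipeline independently, which is what makes the later stages parallelizable.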

Stage 2: Hybrid evidence retrieval

For each extracted claim, we need to find the most relevant passages in the customer's knowledge base. Pure vector search (embedding similarity) misses exact terms like product names and policy numbers. Pure keyword search (BM25) misses semantic matches when different words express the same meaning.

Our solution is hybrid retrieval with Reciprocal Rank Fusion (RRF). We run a FAISS-based vector search and a BM25 keyword search in parallel, then merge the results using RRF. This gives us the best of both worlds: semantic understanding plus keyword precision.
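The fusion step itself is small. Here is a sketch of RRF, using the conventional damping constant k = 60; the inputs are ranked lists of document IDs, best first:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across the lists that
    returned it, so items ranked well by multiple retrievers rise
    to the top without any score normalization between systems.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # FAISS order
keyword_hits = ["doc_b", "doc_c", "doc_d"]  # BM25 order
print(rrf_merge([vector_hits, keyword_hits]))
```

Note how doc_b wins despite never being ranked first by either retriever: appearing near the top of both lists beats topping only one. That robustness to disagreement between retrievers is why RRF needs no score calibration.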

The FAISS index is pre-built when the knowledge base is uploaded and kept in memory for fast retrieval. For BM25, we maintain an inverted index that is updated incrementally as documents are added. Both indexes support hot reloading, so knowledge bases can be updated without downtime.

Stage 3: Claim-evidence matching

Once we have the top-k evidence passages for each claim, we need to determine whether the evidence supports, contradicts, or is insufficient to judge the claim. This is essentially a natural language inference (NLI) task.

We use a fine-tuned cross-encoder model for this step. The cross-encoder takes a (claim, evidence) pair and outputs a three-way classification: entailment (supported), contradiction (unsupported), or neutral (needs review), along with a confidence score.
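Conceptually, the cross-encoder reduces each (claim, evidence) pair to three logits. A sketch of the post-processing that turns those logits into a verdict and confidence (the label order is an assumption; actual model heads vary):

```python
import math

LABELS = ("entailment", "contradiction", "neutral")  # assumed head order

def classify(logits: tuple[float, float, float]) -> tuple[str, float]:
    """Softmax the three logits and return (verdict, confidence)."""
    shifted = [x - max(logits) for x in logits]   # for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs[best]

verdict, confidence = classify((3.2, -1.0, 0.4))
print(verdict, round(confidence, 3))
```

The confidence score is carried forward into risk scoring, so a shaky "contradiction" verdict counts for less than a decisive one.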

Stage 4: Risk scoring

The per-claim verdicts are aggregated into an overall risk score for the response. The scoring algorithm weights claims by their potential impact (e.g., pricing claims are weighted higher than general product descriptions) and by the confidence of the NLI model.

Customers can configure their own thresholds for what constitutes low, medium, and high risk. They can also configure what action is taken at each risk level: pass through, flag for review, auto-rewrite, or block entirely.
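A sketch of the aggregation, assuming per-claim risk is the verdict mapped onto [0, 1] and scaled by NLI confidence, with impact weights and thresholds supplied per customer. All the numbers here are illustrative, not our production defaults:

```python
VERDICT_RISK = {"entailment": 0.0, "neutral": 0.5, "contradiction": 1.0}

def score_response(claims, weights, thresholds=(0.2, 0.6)):
    """Weighted average of per-claim risk, mapped to a risk level.

    claims: list of (category, verdict, confidence) tuples.
    weights: impact weight per category, e.g. pricing claims count 3x.
    thresholds: (low/medium, medium/high) cut points, customer-configurable.
    """
    total = weighted = 0.0
    for category, verdict, confidence in claims:
        w = weights.get(category, 1.0)
        weighted += w * VERDICT_RISK[verdict] * confidence
        total += w
    risk = weighted / total if total else 0.0
    low_cut, high_cut = thresholds
    level = "low" if risk < low_cut else "medium" if risk < high_cut else "high"
    return risk, level

claims = [
    ("pricing", "contradiction", 0.95),  # high-impact, unsupported
    ("general", "entailment", 0.90),     # low-impact, supported
]
print(score_response(claims, {"pricing": 3.0, "general": 1.0}))
```

In this example a single unsupported pricing claim pushes the whole response to "high" even though the other claim checks out, which is exactly the behavior the impact weighting is meant to produce.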

Stage 5: Safe rewriting

When the risk score exceeds the configured threshold, we generate a safe rewrite. The rewrite preserves the original response's tone and style but replaces unsupported claims with verified information from the knowledge base. Claims that cannot be verified are replaced with soft deflections like "Please check our website for the latest details."

Latency optimization

Getting all five stages to complete in under two seconds required significant optimization. Here are the key techniques we used:

  • Parallel retrieval. FAISS and BM25 searches run in parallel for each claim, and claims are processed concurrently.
  • In-memory indexes. FAISS indexes and BM25 inverted indexes are kept in memory for sub-millisecond lookup times.
  • Batched NLI inference. All (claim, evidence) pairs are batched into a single inference call to the cross-encoder, amortizing per-call overhead across the whole batch.
  • Streaming claim extraction. We begin retrieval for the first extracted claim before all claims have been extracted, using streaming output from the LLM.
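The first two techniques can be sketched with asyncio: both retrievers run concurrently for each claim, and all claims fan out together. The search functions here are stubs standing in for the FAISS and BM25 lookups:

```python
import asyncio

async def vector_search(claim: str) -> list[str]:
    await asyncio.sleep(0.01)        # stands in for a FAISS lookup
    return [f"vec_hit:{claim}"]

async def keyword_search(claim: str) -> list[str]:
    await asyncio.sleep(0.01)        # stands in for a BM25 lookup
    return [f"bm25_hit:{claim}"]

async def retrieve(claim: str) -> list[str]:
    # Both retrievers run concurrently for a single claim.
    vec, kw = await asyncio.gather(vector_search(claim), keyword_search(claim))
    return vec + kw

async def retrieve_all(claims: list[str]) -> list[list[str]]:
    # All claims fan out concurrently as well.
    return await asyncio.gather(*(retrieve(c) for c in claims))

results = asyncio.run(retrieve_all(["claim one", "claim two"]))
print(results)
```

With this shape, total retrieval latency is roughly the slowest single lookup rather than the sum of all lookups.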

What is next

We are continuously improving the pipeline. Current areas of focus include multi-hop reasoning (verifying claims that require combining information from multiple documents), temporal awareness (detecting when information might be outdated), and support for structured data sources like databases and APIs.

If you are interested in the technical details or want to discuss our approach, reach out to us at our contact page. We love talking about this stuff.