When an LLM Gets It Wrong

Accountability and Verification in Production AI Systems

In 2026, the question for most Hong Kong businesses is no longer whether LLMs can handle document processing, query answering, or report generation. That has been demonstrated. The less-asked question is: when the system produces an incorrect answer, how do you know, and who is responsible?

Hallucination Is a Statistical Property, Not a Bug

LLMs are probabilistic systems. Each token in a response is sampled from a probability distribution conditioned on the preceding context. The model does not verify facts against an external ground truth before generating output. As a result, confident-sounding but factually incorrect answers, known as hallucinations, are an inherent property of the technology, not a defect that gets patched out.
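To make the sampling step concrete, here is a minimal Python sketch. The candidate tokens and their scores are invented for illustration and do not come from any real model. The sketch only shows that the next token is drawn by probability, with no step that checks which candidate is true.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token following "The policy excess is HK$".
# Only one of these can match the real document, but the sampler has no way to know.
logits = {"1,000": 2.0, "5,000": 1.6, "10,000": 0.9}
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(list(probs), weights=list(probs.values()), k=1000)
counts = {tok: samples.count(tok) for tok in probs}

for tok in probs:
    print(f"{tok:>7}  p={probs[tok]:.2f}  sampled {counts[tok]} times")
```

Every candidate is emitted some of the time, and each one would read as equally fluent in the finished sentence.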

For general-purpose use, occasional errors carry low consequence. For financial, insurance, and legal document processing in Hong Kong, a fabricated policy clause or an incorrect contract detail is a liability. The output is used in a regulated context, where "the AI said so" is not a defensible explanation.

The Gap Between a Demo and a Production System

Most AI solutions are demonstrated by showing the system answering questions correctly. Few demonstrations show what the system does when it cannot answer correctly.

A production-grade system requires engineering around the generation layer, not just within it.

Retrieval grounding means every answer is derived from your actual documents, not from the model's training memory. Each response can be traced to a source passage — it is verifiable, not asserted.
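As a rough illustration of retrieval grounding, the sketch below ranks passages by simple word overlap and builds a prompt that contains only the retrieved text, each passage tagged with a source ID. The `Passage` type, the ID format, the sample policy text, and the overlap scoring are all assumptions made for this example. A real system would more likely use embedding search.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    doc_id: str  # hypothetical identifier: document plus clause number
    text: str

def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?:;()").lower() for w in text.split()} - {""}

def retrieve(question, passages, top_k=2):
    """Rank passages by word overlap with the question (toy scoring)."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p.text)), p) for p in passages]
    scored = [sp for sp in scored if sp[0] > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:top_k]]

def build_prompt(question, sources):
    """Give the model only retrieved text, each passage tagged with its ID."""
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in sources)
    return (
        "Answer using only the sources below and cite the source ID.\n"
        f"{context}\n"
        f"Question: {question}"
    )

# Invented sample passages for illustration only.
passages = [
    Passage("policy-A#4.2", "The excess for water damage claims is HK$2,000 per claim."),
    Passage("policy-A#7.1", "Claims must be reported within 30 days of the incident."),
]
question = "What is the excess for water damage claims?"
sources = retrieve(question, passages)
prompt = build_prompt(question, sources)
print(prompt)
```

Because the source IDs travel with the text, a reviewer can check the cited clause against the original document instead of trusting the answer.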

Confidence boundaries mean the system recognises the limits of its retrieval. When relevant content is not found, the correct behaviour is to return "not found" — not to generate a plausible-sounding answer from general knowledge.
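One way to sketch a confidence boundary is to abstain whenever the best retrieved passage covers too little of the question. The keyword-coverage score, the 0.5 threshold, the stopword list, and the `generate` callable (a stand-in for the model call) are illustrative assumptions here, not a recommended configuration.

```python
STOPWORDS = {"what", "is", "the", "for", "a", "an", "of", "in", "to"}
NOT_FOUND = "Not found in the provided documents."

def keywords(text):
    """Lowercased words minus punctuation and common stopwords."""
    words = {w.strip(".,?:;()").lower() for w in text.split()}
    return words - STOPWORDS - {""}

def coverage(question, passage):
    """Fraction of the question's keywords that appear in the passage."""
    q = keywords(question)
    return len(q & keywords(passage)) / len(q) if q else 0.0

def answer_or_abstain(question, passages, generate, min_coverage=0.5):
    """Call the generator only when retrieval clears the threshold."""
    best = max(passages, key=lambda p: coverage(question, p), default=None)
    if best is None or coverage(question, best) < min_coverage:
        return NOT_FOUND
    return generate(question, best)

# Invented sample passage and a fake generator standing in for an LLM call.
passages = ["The excess for water damage claims is HK$2,000 per claim."]
fake_generate = lambda q, p: f"Based on the source: {p}"

in_scope = answer_or_abstain(
    "What is the excess for water damage claims?", passages, fake_generate
)
out_of_scope = answer_or_abstain(
    "What is the cancellation fee?", passages, fake_generate
)
print(in_scope)
print(out_of_scope)
```

The threshold has to be tuned on real questions. Set it too low and the system guesses; set it too high and it refuses questions it could have answered.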

Evaluation (evals) means testing the system against real business questions before deployment to quantify accuracy, and monitoring output quality continuously after deployment. Without this layer, quality is a matter of opinion rather than measurement.
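A bare-bones eval harness might look like the sketch below. The test cases, the substring check, and `stub_system` (a stand-in for the deployed pipeline) are invented for illustration. Real evals usually need stricter matching and far more cases.

```python
def run_evals(system, cases):
    """Run each (question, expected substring) case and collect failures."""
    failures = []
    for question, expected in cases:
        got = system(question)
        if expected.lower() not in got.lower():
            failures.append({"question": question, "expected": expected, "got": got})
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Stand-in for the real system: canned answers, one deliberately wrong.
canned = {
    "What is the water damage excess?": "The excess is HK$2,000 per claim.",
    "How long do I have to report a claim?": "Within 60 days.",
}

def stub_system(question):
    return canned.get(question, "Not found in the provided documents.")

cases = [
    ("What is the water damage excess?", "HK$2,000"),
    ("How long do I have to report a claim?", "30 days"),
    ("What is the cancellation fee?", "not found"),  # abstention is the correct result
]

accuracy, failures = run_evals(stub_system, cases)
print(f"accuracy: {accuracy:.0%}")
for failure in failures:
    print("FAILED:", failure)
```

The third case matters as much as the first two: the set should include questions the documents cannot answer, so that abstention is tested alongside correctness.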

Audit logging means every query, every retrieved passage, and every generated response is recorded. When something goes wrong, there is a traceable record. Without logs, investigation starts from zero.
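A minimal sketch of such a record, written as one JSON object per line, is shown below. The field names and the sample values are assumptions for illustration. In production the stream would be an append-only file or a logging service rather than an in-memory buffer.

```python
import io
import json
from datetime import datetime, timezone

def log_interaction(stream, query, retrieved, response):
    """Append one JSON line holding the query, retrieved passages, and response."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved": retrieved,  # list of {"doc_id": ..., "text": ...} dicts
        "response": response,
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

# In-memory buffer for the example; a real deployment might use
# open("audit.jsonl", "a", encoding="utf-8") instead (hypothetical file name).
buffer = io.StringIO()
log_interaction(
    buffer,
    query="What is the excess for water damage claims?",
    retrieved=[{"doc_id": "policy-A#4.2",
                "text": "The excess for water damage claims is HK$2,000 per claim."}],
    response="HK$2,000 per claim [policy-A#4.2]",
)

# Reading the log back gives an investigator the full trail for each answer.
entries = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(entries[0]["query"], "->", entries[0]["response"])
```

Storing the retrieved passages next to the response lets a later review separate a retrieval failure from a generation failure.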

This Is a Current Compliance Question, Not a Future One

Hong Kong's financial, insurance, and legal sectors process documents where outputs need to be defensible. A regulator will not accept "the AI answered it" as an explanation for an incorrect client-facing response.

Every customer-facing AI system deployed today is something the business will later be held accountable for. Building without verification layers is cheaper at the start and significantly more expensive when an incident occurs. Retrofitting these layers into a production system typically costs several times more than building them in from the start.

Four Questions to Ask Any AI Vendor

These questions apply to both off-the-shelf AI tools and custom-built systems. They do not require technical background to ask, and the response quality is itself informative.

One: Can answers be traced back to source documents? If not, every response is an unverifiable assertion.

Two: Under what conditions will the system say it doesn't know? If the answer is never, hallucination risk is unaddressed.

Three: How is accuracy measured? If there is no defined evaluation method, quality is unknown.

Four: Where are error logs stored? If there are no logs, there is nothing to investigate when something goes wrong.

Honest Expectations for AI Systems

No production AI system has a zero error rate. An honest engineer will tell you clearly: under what conditions the system is reliable, how it degrades when those conditions are not met, and how you will know when something goes wrong.

A vendor who commits to perfect accuracy is describing a sales position. A vendor who describes failure modes and monitoring design is describing an engineering position.

Levi is a Hong Kong-based independent AI engineer building production LLM and RAG systems for Hong Kong and Greater Bay Area businesses, with retrieval grounding, confidence boundaries, and audit logging as standard components.

Email: support@hksoka.com