Levi · LinkedIn

Your Documents Are Half Chinese, Half English — This Is Where Most AI Systems Fail

Cantonese queries, Traditional Chinese documents, English clauses — why benchmark accuracy doesn't hold in Hong Kong

Tags: RAG, Cantonese AI, Bilingual Documents, Hong Kong, Multilingual, Semantic Retrieval

"Our AI is great in English, but our documents are half Chinese, half English — whenever we ask in Cantonese, it misses the point entirely."

This is one of the most common complaints from Hong Kong companies after deploying AI. Vendors demo in English, and the results look excellent. In day-to-day use, employees ask in Cantonese and the documents are in Traditional Chinese with English clauses, so retrieval quality drops immediately.

This article explains why this happens, which parts are technically solvable, and which limitations vendors will not proactively tell you about.

The Root Cause: Embedding Is Not Translation

The core of a RAG system is embedding: converting text into vectors, then using vector similarity to find relevant content. The problem is that many mainstream embedding models are trained predominantly on English data, so meaning expressed in other languages is not always aligned with its English equivalent.

The practical consequence: the same meaning, written in English versus Chinese, can end up far apart in vector space. You ask in Cantonese: "How much is the critical illness benefit on this policy?" But the document says: "Critical Illness Benefit shall be payable...", and the system fails to retrieve it. This is not because the system is unintelligent. From its perspective, the two passages simply do not look similar.
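To make "far apart in vector space" concrete, here is a minimal sketch of how a retriever ranks chunks by cosine similarity. The three-dimensional vectors and the chunk names are invented purely for illustration. Real embeddings have hundreds or thousands of dimensions, and their actual values depend on the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return chunk ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


# Toy vectors (assumption): the Cantonese query lands near other Chinese
# text, not near the English clause that actually answers it.
query_vec = [0.9, 0.1, 0.2]
chunk_vecs = {
    "en_critical_illness_clause": [0.1, 0.9, 0.3],  # the correct answer
    "zh_unrelated_schedule": [0.8, 0.2, 0.1],       # same language, wrong topic
}

ranking = rank_chunks(query_vec, chunk_vecs)
print(ranking)
```

With these toy numbers the unrelated Chinese schedule outranks the English clause that contains the answer, which is the failure pattern described above.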

Hong Kong documents add another layer of complexity: within the same contract, definitions are in English, schedules are in Chinese, and annotations are mixed. This is not an edge case — it is the standard format of Hong Kong commercial documents.

Spoken Cantonese Is a Separate Layer

Written Chinese and spoken Cantonese are also distinct in embedding space. A colloquial "When does this thing expire?" and a formal "The expiry date of this document" mean the same thing, yet their vectors can still sit far apart.

If your employees query in spoken language (and they will) while your knowledge base is in written form, a retrieval gap will exist.

How to Solve This Technically

This problem has engineering solutions, but they need to be addressed at the architecture layer, not retrofitted after deployment:

Cross-lingual embedding strategy. Address language differences at the embedding stage. One example is embedding content together with its translation, so that Chinese queries and English content are genuinely close in vector space. This is a design decision: if the problem is discovered only after the system is built, the rework costs several times more.
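As a rough sketch of the "embed content together with its translation" idea: the `toy_embed` function below is a stand-in for a real embedding model (it only counts English words and Chinese characters), and the Chinese translation is a hand-written placeholder. Both are assumptions for illustration, not a description of any particular production pipeline.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model (assumption): a bag of
    lowercase English words and individual CJK characters."""
    return Counter(re.findall(r"[a-z]+|[\u4e00-\u9fff]", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def embed_with_translation(text, translation, embed):
    """Index the original text and its translation as one unit."""
    return embed(text + "\n" + translation)


clause = "Critical Illness Benefit shall be payable"
translation = "危疾保障賠償須予支付"  # hand-written placeholder translation
query = "呢份保單危疾賠償有幾多"      # Cantonese: how much is the critical illness payout?

score_plain = cosine(toy_embed(query), toy_embed(clause))
score_with_translation = cosine(
    toy_embed(query), embed_with_translation(clause, translation, toy_embed)
)
print(score_plain, score_with_translation)
```

In this toy setup the English-only clause shares nothing with the Cantonese query, while the clause indexed with its translation does. A real embedding model behaves less starkly, but the design choice is the same: give the index something in the query's language to match against.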

Test with real queries. Before go-live, test with questions your employees would actually ask — spoken Cantonese, mixed Chinese-English, business terminology. Not vendor-prepared English demo questions.
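A pre-launch test set can be as simple as a list of realistic queries, each paired with the chunk that should come back. The sketch below assumes a hypothetical `retrieve(query)` function that returns chunk ids in ranked order. `stub_retrieve` is a deliberately naive stand-in, and the queries and chunk ids are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryCase:
    text: str               # the question as an employee would really ask it
    style: str              # e.g. spoken Cantonese, mixed, English
    expected_chunk_id: str  # the chunk that should be retrieved


def hit_at_k(retrieve, case, k=3):
    """True if the expected chunk appears in the top-k results."""
    return case.expected_chunk_id in retrieve(case.text)[:k]


def run_suite(retrieve, cases, k=3):
    """Run every case and pair it with its hit/miss result."""
    return [(case, hit_at_k(retrieve, case, k)) for case in cases]


def stub_retrieve(query):
    """Stand-in retriever (assumption): only handles pure-ASCII queries."""
    if query.isascii():
        return ["ci_benefit_clause", "exclusions"]
    return ["exclusions", "premium_schedule"]


cases = [
    QueryCase("呢份保單危疾賠幾多?", "spoken Cantonese", "ci_benefit_clause"),
    QueryCase("Critical Illness Benefit 賠幾多?", "mixed Chinese-English", "ci_benefit_clause"),
    QueryCase("How much is the critical illness benefit?", "English", "ci_benefit_clause"),
]

results = run_suite(stub_retrieve, cases)
failed_styles = [case.style for case, hit in results if not hit]
print(failed_styles)
```

Swapping `stub_retrieve` for the vendor's actual retrieval call turns this into a quick acceptance check built from your own employees' questions.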

Measure retrieval quality separately. Chinese query and English query retrieval accuracy need separate measurement. A single overall accuracy figure can completely hide poor Chinese performance.
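A small sketch of why the breakdown matters. The counts below are hypothetical: 80 English queries at 90% accuracy and 20 Cantonese queries at 60% accuracy, chosen only to show how a blended figure can look healthy.

```python
from collections import defaultdict


def accuracy_by_language(results):
    """results: iterable of (language, hit) pairs. Returns accuracy per language."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for language, hit in results:
        totals[language] += 1
        hits[language] += int(hit)
    return {language: hits[language] / totals[language] for language in totals}


def overall_accuracy(results):
    """Single blended accuracy across all queries."""
    return sum(int(hit) for _, hit in results) / len(results)


# Hypothetical evaluation results (assumption, not measured data).
results = (
    [("english", True)] * 72 + [("english", False)] * 8
    + [("cantonese", True)] * 12 + [("cantonese", False)] * 8
)

per_language = accuracy_by_language(results)
overall = overall_accuracy(results)
print(per_language, overall)
```

With these numbers the blended accuracy is 84%, which hides the fact that Cantonese queries succeed only 60% of the time. The skew gets worse the more the test set leans towards English.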

How Buyers Can Verify Before Signing

No technical background needed. Three actions can expose problems before you sign:

First: Insist on demoing in Cantonese. Do not let vendors finish the demo in English and call it done.

Second: Bring one of your own mixed-language documents for a live test. Not their prepared sample.

Third: Ask directly: "How do you handle Cantonese queries searching English content?" Someone who answers with a specific mechanism has done this work. Someone who answers "the model supports multilingual" has not actually solved the problem — a model "supporting" Chinese and retrieval being "accurate" are two different things.

Why This Matters Specifically in Hong Kong

For Western companies, multilingual capability is a bonus feature. For Hong Kong companies, Chinese-English mixing is the fundamental format of documents — insurance policies, legal contracts, government circulars, internal reports all follow this pattern.

A system that achieves 90% accuracy on English documents but drops to 60% on Chinese queries is not "mostly usable" in Hong Kong — it is a core use case failure. And this gap is completely invisible in an English demo.

I build production-grade RAG systems for Hong Kong enterprises. Cross-lingual retrieval is part of the architecture design, not a post-deployment fix. Production deployments include a Hong Kong critical illness insurance RAG comparison platform (mixed Chinese-English policy documents) and HKSoka, a Claude-powered conversational platform with bilingual embedding design.

If your documents are Chinese-English mixed and you want to understand actual retrieval quality before deployment, get in touch.

Contact: smartai.hk+ai.consulting@proton.me
LinkedIn: linkedin.com/in/levi-innovation
