"Does My Data Go to America?"
What Hong Kong Businesses Need to Know About AI Data Flows Before Processing Confidential Documents
In every AI project conversation with Hong Kong businesses, the most common question isn't about price — it's this: "Do our contracts, insurance policies, and client data all get sent to OpenAI?"
The question is entirely legitimate. But the answers on the market tend to be binary: vendors say "no problem," or companies ban all AI internally. Both reactions come from not knowing where the data actually goes.
This article traces where it goes, step by step.
Consumer ChatGPT and Enterprise API Are Two Different Things
The biggest misconception is treating these as the same thing.
When an employee opens free ChatGPT and pastes company documents into it — those conversations can, by default, be used for model training. Your contract terms could theoretically become part of the model. This is why many companies ban AI, and the concern is valid.
But when enterprise systems call models through a commercial API (Claude, GPT, or Gemini), a different contractual framework applies. The major providers' commercial API terms state that, by default, API inputs and outputs are not used for model training. Anthropic's and OpenAI's commercial terms both include explicit commitments on this.
Put plainly: your company's biggest data leakage risk right now is not a future AI system — it's the free ChatGPT your employees are already using today. A properly designed internal system actually brings that risk back into a controlled environment.
"Not Used for Training" Does Not Mean "Nothing Is Retained"
To be honest: API providers typically retain request logs for a short period (commonly up to 30 days) for abuse monitoring, then delete them. Some providers offer zero-retention options, but these require enterprise-tier agreements.
For most Hong Kong SME document processing needs, standard API terms are adequate. But if your industry has specific data residency requirements — such as certain financial institutions' internal policies — this needs to be raised at the very start of scoping discussions, because it directly changes architecture choices and costs.
Any engineer who tells you "data never leaves your company" while using cloud LLM APIs is not being truthful. Data is transmitted to the model provider through encrypted channels for processing — that is a technical fact. The question is not "whether it gets sent," but "what are the contractual protections and retention policies after it is sent."
What Architecture Can Do: Reduce What Gets Sent in the First Place
Data flows are not just a terms-and-conditions issue — they are a design issue.
A privacy-conscious system can de-identify data before sending it: client names, ID numbers, and account numbers are replaced with placeholders locally, then restored after the model processes the content. The model sees "Client A's policy," not a real name.
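The placeholder swap described above can be sketched in a few lines of Python. This is a minimal illustration, not a production de-identifier: the regular expressions, the placeholder format, and the function names deidentify and restore are all assumptions made for this example, and the list of client names is assumed to come from your own records.

```python
import re

# Hypothetical patterns for illustration only. Real documents need patterns
# tuned to the fields that are actually sensitive in your business.
PATTERNS = {
    "HKID": re.compile(r"\b[A-Z]{1,2}\d{6}\([0-9A]\)"),
    "ACCOUNT": re.compile(r"\b\d{3}-\d{6}-\d{3}\b"),
}


def deidentify(text, known_names=()):
    """Replace sensitive values with placeholders.

    Returns the masked text and a mapping of placeholder -> original value.
    The mapping stays on the local machine and is never sent to the model.
    """
    mapping = {}   # placeholder -> original value
    reverse = {}   # original value -> placeholder (same value, same token)
    counters = {}  # per-kind running number

    def placeholder_for(kind, value):
        if value in reverse:
            return reverse[value]
        counters[kind] = counters.get(kind, 0) + 1
        token = f"[{kind}_{counters[kind]}]"
        mapping[token] = value
        reverse[value] = token
        return token

    # Longest names first, so a short name never splits a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        if name in text:
            text = text.replace(name, placeholder_for("CLIENT", name))

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, k=kind: placeholder_for(k, m.group(0)), text
        )
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text returned by the model."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


original = "Chan Tai Man (HKID A123456(7)) holds account 012-345678-001."
masked, mapping = deidentify(original, known_names=["Chan Tai Man"])
print(masked)  # this masked string is what would be sent to the model API
print(restore("Summary for [CLIENT_1]: account [ACCOUNT_1] is active.", mapping))
```

Only the masked string crosses the network; the mapping that links placeholders to real values stays local. Pattern matching alone will miss sensitive values it has no pattern for, which is why the list of genuinely sensitive fields matters so much at the scoping stage.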
The retrieval architecture itself has privacy implications: a RAG system sends only the document fragments relevant to the current question, not the entire document library. Your ten years of contract files sit in your own database; the model only touches the few paragraphs needed to answer the current question.
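To make the point concrete, here is a simplified sketch of the selection step. It assumes a toy keyword-overlap score in place of the embedding search a real RAG system would normally use; the function names, the stopword list, and the sample clauses are all invented for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use embeddings
# or a proper search index instead of keyword overlap.
STOPWORDS = {"the", "is", "a", "an", "of", "what", "to", "in", "and"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric words that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def score(question, chunk):
    """Count how many question words also appear in the chunk."""
    q = Counter(tokenize(question))
    c = Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def select_chunks(question, chunks, top_k=2):
    """Return at most top_k chunks that share at least one word with the question."""
    ranked = sorted(chunks, key=lambda ch: score(question, ch), reverse=True)
    return [ch for ch in ranked[:top_k] if score(question, ch) > 0]


def build_prompt(question, chunks, top_k=2):
    """Build the text that would actually be sent to the model API."""
    context = "\n---\n".join(select_chunks(question, chunks, top_k))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


# The full library stays in your own database; only the selected chunks leave it.
library = [
    "Clause 4: The termination notice period is 60 days.",
    "Clause 7: Payment is due within 30 days of invoice.",
    "Clause 9: The governing law is Hong Kong law.",
    "Clause 12: Insurance coverage must be maintained at all times.",
]
print(build_prompt("What is the termination notice period?", library))
```

In this example only the termination clause ends up in the prompt. The payment, governing-law, and insurance clauses never leave the local store, which is the privacy property the retrieval step provides.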
These are not optional extras; they are architecture decisions. They need to be settled when the project is scoped and priced, because retrofitting them later costs several times more.
The On-Premise Myth
Some companies ask: can we run everything locally, with no cloud API at all?
Technically yes — open-source models can be deployed on your own servers. But the real picture needs to be stated clearly: GPU hardware capable of running commercially viable models requires upfront investment in the hundreds of thousands of HKD, not counting ongoing maintenance. And open-source model performance on Cantonese and mixed Chinese-English documents has a visible gap compared to top API models.
For the vast majority of SMEs, the trade-off does not add up: roughly ten times the cost for roughly seventy percent of the quality. The genuine use case for on-premise deployment is specific compliance requirements at large institutions, and those institutions need internal infrastructure teams, not project-based engineers.
Four Questions Worth Asking Vendors Before You Sign
First: Which model provider does your system use? What do the API terms say about training use and data retention? If they cannot answer, they have not read the terms of what they are building on.
Second: Where is my document library stored? Which parts get sent to the model API, and when?
Third: Is sensitive data de-identified? At which layer does this happen?
Fourth: If regulators require audit records, what can your system provide?
None of these questions requires a technical background. How clearly a vendor can answer them is itself an indicator of their maturity.
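On the fourth question, one possible shape for an audit record is sketched below. The field names, the JSON-lines file format, and the function names are assumptions for illustration; what a regulator actually expects depends on your industry. The sketch stores a hash and length of each payload rather than the payload itself, so the audit log does not become a second copy of the sensitive data.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(user, provider, model, payload, deidentified):
    """Describe one outbound API call without storing the payload itself."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "provider": provider,          # hypothetical label for the model vendor
        "model": model,                # hypothetical model identifier
        "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload_chars": len(payload),
        "deidentified": deidentified,  # whether masking ran before sending
    }


def append_audit_log(path, record):
    """Append the record as one JSON line to a local log file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


record = audit_record(
    user="analyst01",
    provider="example-provider",
    model="example-model",
    payload="[CLIENT_1] policy summary",
    deidentified=True,
)
print(json.dumps(record, indent=2))
```

A record like this lets you answer who sent what, to which provider, and when, and the hash lets you later prove that a given document fragment was or was not the one transmitted.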
Privacy Is Designed, Not Promised
"Absolute security" is a sales line. Clearly explaining where data goes at each step, where protections exist, and where trade-offs are made — that is engineering.
If your company wants to use AI for document work but has been stuck on data security concerns, the most useful first step is: list which fields in your documents are genuinely sensitive. That list will make any scoping conversation substantially more concrete.
Independent AI engineer based in Hong Kong, building production-grade LLM applications, RAG pipelines, and document intelligence systems. Project-based engagements with clearly scoped deliverables before work begins.
Contact: smartai.hk+ai.consulting@proton.me
LinkedIn: linkedin.com/in/levi-innovation