2026-06-21 · Levi · LinkedIn

Long-Term AI Memory: The Architecture and Engineering Behind HKSoka

A four-layer memory model, proposition-level chunking, bilingual embeddings, and conversation lifecycle management — a full engineering breakdown

AI Memory RAG Architecture Vector Search LLM Engineering Enterprise AI

Large language model conversations share a common limitation: every new conversation resets context, and the user has to re-explain background information from scratch. HKSoka is an AI platform supporting Traditional Chinese, Simplified Chinese, and English that treats "memory" as a core engineering problem to be broken down — not just whether the system remembers, but when it remembers, what it remembers, how it's retrieved afterward, and whether the user can correct it directly. The following breaks this system down layer by layer, from concept to implementation detail.

1. The Memory System: Four Components

Memory is split into four parts, each responsible for a different time scale, with the user able to control each one individually.

1.1 Seed Memory

Long-term background the user deliberately puts in. Any text can be pasted in as a named "seed," and long documents (25+ pages) can be uploaded and integrated the same way, without needing to create a separate project.

Creation and toggling: each seed's full content is preserved in storage, and separately chunked and embedded for retrieval — what gets saved is the complete version, what retrieval returns is a slice of it. Every seed can be switched on or off individually depending on what a given conversation needs, without interfering with the others, so users can manage multiple background sets without having to enable or disable them all at once.

Editing triggers re-indexing: when a seed's content is edited, the system first clears that seed's old retrieval chunks, then re-chunks and re-embeds the new content. The full content and the retrievable chunks stay in sync as a result — there's no gap where the saved version is new but retrieval still returns the old one.

Deletion cascades: deleting a seed removes its associated retrieval chunks along with it.

1.2 Learned Memory

Generated automatically after a conversation has been idle for a while — the system extracts stable, user-relevant facts from the conversation and logs them individually.

Two different trigger conditions: a normal conversation is only processed after sitting idle for a set period. If an account has no learned memory at all yet, that idle threshold is shortened significantly, so new users feel the system "remembering" earlier conversations sooner. The background processing itself runs in small batches to avoid a sudden spike in processing volume slowing down the wider system.

Strict extraction rules: only stable facts that are personal to the user get extracted — anything temporary or relevant only within a single conversation is skipped. The same fact is only recorded once: each new extraction pass first checks against existing memory to avoid accumulating near-duplicate entries. Wording stays as close to the user's original phrasing as possible, without subjective interpretation or over-inference.

Forgetting is respected: after a user deletes a learned memory entry, that topic is added to an exclusion list, and future automatic extraction is explicitly told not to record similar content again. In other words, deletion isn't a one-off removal — it's a standing instruction, preventing the same thing from being re-extracted in a later conversation.

1.3 Critical Memory

A daily scheduled job condenses learned memory into a short summary that is injected into every conversation by default, with no retrieval filtering step in between.

Clear selection criteria: whether a piece of information is worth promoting to critical memory is judged against three criteria — would the user be materially affected if the AI forgot this, is it a stable fact rather than a one-off opinion, and would forgetting it cause the AI to give a wrong or inappropriate response. Only content that satisfies all three gets included.

Deliberately offset scheduling: the consolidation job runs during fixed low-traffic windows, and each run only processes users whose learned memory was updated within the last day but hasn't been processed in the last hour, to avoid the same user being re-processed repeatedly in a short span.

A different injection method from other memory types: seed memory and learned memory both rely on retrieval ranked by relevance, so they aren't guaranteed to surface on every query. Critical memory skips retrieval filtering entirely and is injected directly into every conversation. This trades a fixed, small token cost for guaranteed availability of the most important information.

1.4 Custom Instructions

A persistent setting the user writes themselves (up to 500 characters) that defines the AI's tone, persona, or response style.

Defines identity and tone directly, without relying on accumulated learning: seed memory and learned memory are both controlled by the user in terms of content, but learned memory has to be gradually inferred by the system from conversations. Custom instructions, by contrast, let the user state upfront exactly what persona or tone the AI should use, without waiting for the system to accumulate enough conversation to infer it. The distinction comes down to the nature and mechanics of the content: seed memory is knowledge retrieved by relevance, custom instructions are a standing behavioral setting — both are fully user-driven, just for different purposes.

Auto-generated through onboarding: on first use, the system asks a few short questions to get to know the user — what to call them, any background they'd like to add, their preferred response style, and language preference — then combines the answers into a custom instruction. Most users end up with a starting point without having to write anything themselves.

Placement accounts for caching cost: custom instructions rarely change mid-conversation, so they're placed in the "stable" part of the system prompt, separate from the memory fragments that change with every query. The stable portion can be reused via prompt caching instead of being recomputed with every message — a design choice that directly affects the cost of every conversation.

2. The Retrieval Engine: Do the Work at Write Time, Benefit at Query Time

Retrieval accuracy matters more than storage volume. HKSoka's approach is to put the processing cost at the moment data is written, so the hit rate is higher at query time.

2.1 Proposition-Level Chunking

Long content isn't stored as whole blocks — it's broken down into individual, self-contained "propositions."

Segment first, then extract: longer documents are first split into multiple sections by length, and each section is sent independently for proposition extraction, keeping the amount of content processed at once within a reliable range so extraction quality doesn't degrade from processing too much at once.

Self-containment is a hard requirement: every proposition must be understandable on its own, without depending on surrounding context to make sense. That's because a proposition will later be retrieved independently and inserted into a completely different conversation context, with no chance of appearing alongside its original neighboring content.

Parsing failures have a fallback: if model output can't be fully parsed into the structured format, the system tries to salvage whatever usable partial propositions it can from the output, rather than discarding the entire block.

2.2 Bilingual Embeddings

Most users write in Chinese, but mixed Chinese-English queries are common.

A translation is generated at write time: each proposition gets a corresponding English version generated alongside it as it's extracted. The text actually used to compute the vector combines the original and the translation, but what's stored and shown to the user stays in the original language.

The cost is paid once, at write time: the translation computation only happens at the moment memory is written, so the benefit of bilingual retrieval doesn't show up as added latency or cost on every query.

The effect depends on translation quality: the trade-off of this design is that translation quality directly affects vector accuracy — if a translation drifts, retrieval hit rate drifts with it.

2.3 Hybrid Retrieval and Tiered Quotas

At query time, the system compares using both semantic vectors and keyword matching at the same time.

Vectors handle meaning, keywords handle precision: pure semantic matching tends to miss when wording differs a lot; pure keyword matching tends to miss when a user phrases something with a near-synonym. Combining both improves the stability of the hit rate.

Candidate pool size and injected count are calculated separately: the system first pulls a larger candidate pool, then separately caps how many seed-memory and learned-memory chunks actually get injected into the conversation. This exists because heavy document-uploaders can end up with far more seed-memory chunks than learned-memory ones — without separate caps, learned memory could get crowded out of the results entirely. Tiered quotas guarantee both types get a chance to appear.

Throttling and de-duplication on the write side: when a large volume of content is written at once, embedding requests are batched with spacing between them to avoid sending too many requests to an external service in a short window. Database writes are designed to prevent duplicates, so even if the same fact gets extracted more than once, it doesn't produce more than one vector record in the database.

3. Conversation Lifecycle Management

3.1 Context Compression for Long Conversations

Once a conversation's content builds up past a certain length, the system uses two layers to control cost and stability.

A hard cap as a safety net: once the estimated token usage crosses a fixed threshold, older content is moved out of the active processing range, guaranteeing it never accumulates without bound under any circumstance.

Summary compression as the primary mechanism: a background job continuously maintains a condensed summary for long conversations, instead of simply discarding old content. Each summary update only processes what's new since the last pass and merges it with the existing summary, keeping the cost from rising indefinitely as a conversation gets longer.

Summary takes priority once it exists: once a conversation has a summary, future turns use "summary + the most recent few messages" in place of the full history — preserving context while keeping the amount of content each turn needs to process stable.

3.2 Editing a Message Branches, It Doesn't Overwrite History

Editing a previously sent message and re-asking doesn't overwrite the original conversation record.

The original conversation stays fully intact: editing opens a new conversation branch at that message's position, while the original conversation, including the AI's original reply, stays unchanged. Users can keep both "the reply to the original phrasing" and "the reply to the edited phrasing" side by side for comparison.

Branches are generated server-side: a branch's content is created by the server truncating the original conversation record up to the specified point and building a new conversation record from it, rather than relying on the frontend to stitch it together — keeping the branch consistent with the database state.

3.3 Deletion and Forgetting Are Linked

Deleting a conversation doesn't just hide it from the screen.

Conversations use soft deletion: a deleted conversation is marked hidden and no longer appears in the list, with a double-tap confirmation in the interface to prevent accidental deletion.

Memory generated from that conversation is removed too: deletion also identifies which learned-memory entries were extracted from that conversation, and removes the corresponding memory entries and retrieval chunks along with it. In other words, deleting a conversation never leaves a gap where "the conversation is hidden from the screen but the AI still retains that content" — and memory from other conversations is unaffected, since the deletion is targeted, not a blanket wipe.

3.4 Incognito Mode

With incognito mode on, the system skips memory retrieval and injection entirely. The conversation isn't saved once it ends, and it never enters the learned-memory processing pipeline — it's a complete bypass, not something hidden after the fact.

4. The Artifact Workspace: Editing in Real Time, Inside the Conversation

When you need to produce a document, a piece of code, an article, or a short piece of creative writing, you can open Artifact mode, which displays the content in a separate panel alongside the conversation.

4.1 Editing Strategy: Targeted Edits Take Priority Over Full Rewrites

By default the system edits using a "locate and replace" approach, and only rewrites the entire piece when the content is entirely new or most of it needs to change.

Exact matching is required: a targeted edit requires the text being changed to match the panel content exactly, or it won't take effect — a constraint that forces every edit to be a verifiable, precise operation.

The practical benefit for users: for a long document or piece of code being refined over multiple passes, you only need to describe what needs to change rather than regenerating the whole thing every time, which speeds up the back-and-forth of revision.

4.2 Versioning and Rollback

Version history is retrievable: if the same Artifact goes through multiple edits within one conversation, every version is kept, and users can switch back to view a previous version in the panel.

The server parses each result independently to keep data consistent: for every edit, the server independently parses the result and writes it to the database rather than simply trusting what the frontend reports — ensuring the version that's stored matches what was actually produced, so even if the frontend has a display glitch, the version in the database stays accurate.

Editing can continue across devices: if the frontend has no locally saved content — for example, after switching devices or refreshing the page — the system automatically pulls the last saved version from the database so editing can continue without starting over.

4.3 Exporting

Finished content supports one-click copying, and can also be downloaded as a text file under a specified filename, making it easy to drop directly into other tools or delivery workflows.

5. Other Practical Features

Web search: when real-time information is needed, search can be turned on and the response will be based on what's found. On follow-up questions, the system clearly marks that this part of the content came from an earlier live search rather than training data, so both the user and the system stay clear on the source and recency of the information. The search process is capped to avoid one question triggering excessive rounds of searching and a long wait.

PDF and image uploads: PDFs or images can be uploaded directly with a question, and pasting an image directly from the clipboard is supported too, without needing to save it to a file first. If one of several files fails to load, the system simply skips that file rather than failing the whole request.

Conversation and memory export: both conversation history and learned memory can be downloaded as text files with one click, for backup or for moving elsewhere.

Trilingual interface: Traditional Chinese, Simplified Chinese, and English.

6. The Reasoning Behind the Architecture Decisions

Asynchronous processing: document chunking, proposition extraction, and embedding are computationally heavy steps, so they're split off from the main conversation flow and handled by an independent asynchronous process, keeping the real-time conversation experience smooth. The trade-off is that newly uploaded content takes a short while before it becomes retrievable.

Tiered models and caching: background, repetitive work (fact extraction, summary consolidation) runs on a lighter model, while the main conversation keeps a higher-quality model. The stable part of the system prompt (base settings, custom instructions, critical memory) is handled separately from the part that changes with every query (retrieved chunks) — the stable part can be reused, cutting some of the cost. This is a concrete trade-off between token cost and response quality.

Scheduled rather than real-time processing: learned memory doesn't need to take effect instantly, so scheduled processing trades a bit of immediacy for lower system complexity. New users get a shorter processing threshold so the memory feature can still be felt sooner — balancing system simplicity against user experience.

7. An Honest Look at the Limitations

Any responsible AI system should be clear about its own boundaries.

The ceiling on retrieval: RAG is built on semantic similarity, and when an answer requires chaining several scattered concepts together through multi-step reasoning, similarity matching itself has a ceiling — that's an architectural trade-off, not something an individual fix can resolve.

Bilingual embeddings depend on translation quality: if the translation generated at write time is off, the vector drifts with it — but this doesn't affect the stored original content. What the user sees stays accurate; only the retrieval hit rate is affected.

Learned-memory accuracy depends on the conversation itself: extraction is based on what's explicitly expressed in the conversation — the clearer the user's wording, the more accurate the record. Every entry can be reviewed and corrected at any time.

New memory has a short activation delay: learned memory only starts processing after a conversation has gone idle, so something just said needs to wait for the background job to finish before it shows up in retrieval results.

Document import takes time: after a longer document is uploaded, proposition extraction and bilingual embedding need a certain amount of compute time — the longer the document, the longer the preparation time relative to it.

Understanding these boundaries is a prerequisite for using the system in the right setting, and it's also the kind of question worth asking when evaluating any AI solution.

About HKSoka and Consulting

HKSoka is built in Hong Kong, supports Traditional Chinese, Simplified Chinese, and English, and brings memory, the Artifact workspace, web search, and document upload together in a single interface. Platform: www.hksoka.com

The same memory and retrieval architecture applies to enterprise settings too — consolidating scattered documents, conversations, and knowledge into a layer of memory that's retrievable, controllable, and auditable. If your team is evaluating how to bring an LLM into an actual business workflow, feel free to reach out via LinkedIn (linkedin.com/in/levi-innovation) or email smartai.hk+ai.consulting@proton.me.

Frequently Asked Questions

An SME may not have its own technical team — does adopting this require engineering resources?

The platform itself runs entirely through the conversation interface, with no extra development or deployment needed. Memory, document upload, and the Artifact workspace are all built-in features — open the interface and they're ready to use. Applying the same memory architecture to an enterprise's own internal systems (such as connecting it to a company's existing documents or tools) is where additional engineering resources for integration would be needed.

What does HKSoka's memory system actually do differently?

The difference is mainly in the structure and controllability of the memory itself. The system splits memory into three layers — seed, learned, and critical — each responsible for a different role: background the user actively provides, facts the system automatically infers, and a daily condensed summary of the highlights. Every entry can be viewed, edited, or deleted individually. Memory writes also handle bilingual Chinese-English retrieval, paired with a trilingual Traditional Chinese, Simplified Chinese, and English interface, which suits mixed Chinese-English query habits.

Could the data recorded by the system get mixed up with other users?

No. Each user's memory and retrieval chunks are stored independently with no overlap — one user's memory never appears in another user's retrieval results.

If the system records something incorrectly, can it be deleted?

Yes. Every learned-memory entry can be viewed, edited, or deleted individually. After deletion, the system remembers that this kind of content shouldn't be recorded again, preventing the same thing from being re-extracted next time. Seed memory can similarly be toggled or deleted at any time — control stays with the user.

How are uploaded documents handled? Can they be queried right away?

After a document is uploaded, the system chunks it and builds an index in the background, which takes some time before it can be queried in conversation. The longer the document, the longer the relative preparation time. This is background processing and doesn't block the conversation that's currently in progress.

Can this memory architecture be used in our own company's systems, or is it limited to this platform?

The version of HKSoka actually running in production is, in effect, a live validation of this tiered memory and hybrid retrieval design — proof that the architecture is viable in a production environment. This kind of tiered memory and retrieval design has also previously been applied to client projects. What an enterprise actually needs is usually not identical to this platform: existing systems, document formats, data sources, and use cases all differ, so directly applying an off-the-shelf platform isn't always the right fit. The same design thinking can be re-broken-down and rebuilt around an enterprise's own situation — which memory tiers are actually necessary, how retrieval logic should fit existing systems, what data needs to be prioritized — the answers depend on the actual use case. If you'd like to know how this architecture could fit your company's use case, feel free to reach out directly — starting from understanding the requirements, then deciding how to build it.

If you'd like to know how this memory and retrieval architecture could fit your actual business case, feel free to reach out directly via LinkedIn (linkedin.com/in/levi-innovation) or smartai.hk+ai.consulting@proton.me.

Levi is an independent AI engineer based in Hong Kong, building production-grade LLM applications, RAG pipelines, and document intelligence systems for SMEs pursuing AI digitalization internationally, working remotely.

Get in touch → More enterprise case studies → Discuss your project →