Receipts · ~10 min read · filed 2026-06-10

Mnemonic: self-hosted, categorized memory for AI agents

A small self-hosted FastAPI memory server for AI agents, built on mem0 and Qdrant. It auto-sorts every memory into seven semantic categories, serves a tiered L0/L1/L2 context tree instead of dumping the whole pile, and keeps the conversation history on your own box. About $2/month self-hosted versus $20+/month for the cloud memory APIs.

seven semantic categories · self-hosted · MIT

receipts
open-source
agent-memory
self-hosted
mem0
qdrant

// TL;DR · abstract

Mnemonic is a self-hosted FastAPI memory server for AI agents, built on mem0 (storage plus per-fact reasoning) and Qdrant (vectors). It runs on your own box, so the conversation history and vectors never leave it. MIT, Python, v4.0.0.
Every memory is auto-sorted into one of seven semantic categories (personal, business, technical, decision, relationship, temporal, uncategorized) with an importance score. Not four layers; that earlier framing was wrong and is corrected.
The distinctive design is the tiered context tree: the /context endpoint loads L0 category summaries eagerly, then L1 and L2 detail only down the branch a query actually takes, instead of pasting the whole store into every prompt.
Fact resolution today is mem0's add/update/delete, so a changed fact updates in place. The explicit ContradictionDetector is built and unit-tested but NOT yet wired into /add; that wiring is the first open issue, not a finished feature.
About $2/month self-hosted on a small VPS plus Qdrant (an estimate, not a billing screenshot) versus $20+/month for cloud memory APIs. 36 unit tests cover the pure-function modules; there are no route-level tests yet, few stars, and no external users.

An agent with no memory is a coworker with amnesia. It re-asks you for things you told it an hour ago, and it contradicts itself across a session because it cannot remember what it already decided. That part of the problem is not interesting to anyone, which is exactly why it stays broken. Everyone wants to build the planner and the tool-calls; almost nobody wants to own the boring layer underneath that lets the agent know what kind of thing it knows and stop losing its early turns.

Mnemonic is my attempt at that boring layer: a small self-hosted memory server for AI agents, MIT-licensed, Python, FastAPI. This essay describes what it actually is, what is shipped, and the one piece that is built but not yet wired in. I am careful about that last distinction on purpose, because the difference between "tested module" and "running in production" is the whole game in agent memory, and most write-ups blur it. The case file is at /work/mnemonic; the code is at github.com/omarnagy91/mnemonic.

01 · Why agent memory is the unglamorous half

The hosted memory APIs solve a real problem and are not wrong to exist. But they want your agents' entire conversation history sitting on their server, and they bill around $20 a month per developer for it. For someone running a fleet of agents on their own box, both of those are friction: the data leaves, and the cost scales with the number of agents you run.

The alternative most people fall back to is a raw vector store: embed everything, query by similarity, hope the right thing comes back. That works until two things happen. First, the agent has no notion of category, so a question about a person and a question about a past decision both come back as one undifferentiated list of nearest neighbours. Second, a long session hits the context window and the early turns silently fall off, which is the amnesia again, one layer down. Mnemonic does not invent a new vector database to fix this. It sits on two pieces that already work and adds the layers a raw library leaves to you.

02 · What Mnemonic actually is

Underneath, it is mem0 for the storage and the per-fact reasoning, and Qdrant for the vectors. mem0 owns the part that decides, when a new fact arrives, whether to add it, update an existing one, or delete it. Qdrant owns the vector index. Mnemonic is the server wrapped around them, plus the things a serious agent needs that the raw library hands back to you as homework: automatic categorization, a tiered context tree, a compaction hook, an import pipeline, and a visual explorer. It runs on your own machine, so the conversation history and the vectors never leave it.

I did not write a vector store. I wrote the layer that makes mem0 plus Qdrant usable as an agent's memory without a week of plumbing, and I am generalizing it in the open.

03 · Seven categories, not four layers

Every memory that goes in gets auto-sorted into exactly one of seven semantic categories, each with an importance score. The categories are: personal, business, technical, decision, relationship, temporal, and uncategorized. That last one is deliberate. A memory that does not confidently belong anywhere lands in uncategorized rather than being forced into a bucket it does not fit, which keeps the other six honest.
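To make the shape of that concrete, here is a minimal sketch of the seven categories and the fall-back to uncategorized. It is an illustration under assumptions, not Mnemonic's actual categorizer: the names `CategorizedMemory` and `assign_category`, the 0.6 confidence threshold, and the idea that per-category confidence scores arrive from some upstream classifier are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(str, Enum):
    """The seven categories named in the essay."""
    PERSONAL = "personal"
    BUSINESS = "business"
    TECHNICAL = "technical"
    DECISION = "decision"
    RELATIONSHIP = "relationship"
    TEMPORAL = "temporal"
    UNCATEGORIZED = "uncategorized"


@dataclass(frozen=True)
class CategorizedMemory:
    text: str
    category: Category
    importance: float  # assumed here to be in [0.0, 1.0]


def assign_category(text, scores, importance, min_confidence=0.6):
    """Pick the best-scoring category, or fall back to UNCATEGORIZED.

    `scores` maps Category -> confidence from an upstream classifier.
    Both the classifier and the threshold are assumptions of this sketch.
    """
    candidates = {c: s for c, s in scores.items() if c is not Category.UNCATEGORIZED}
    if not candidates:
        return CategorizedMemory(text, Category.UNCATEGORIZED, importance)
    best = max(candidates, key=candidates.get)
    if candidates[best] < min_confidence:
        # Not confident enough: do not force the memory into a bucket.
        best = Category.UNCATEGORIZED
    return CategorizedMemory(text, best, importance)
```

The point of the threshold is the one made above: a weak best guess is treated as no guess, so the six named categories only hold memories that scored confidently.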

The payoff is that retrieval and summaries can reason about kinds of memory, not just vector distance. "What did we decide about pricing" leans on the decision category; "what do I know about this client" leans on relationship and business. The category is a cheap structural signal on top of similarity, and it costs almost nothing to carry.

One correction, because the correction is the point. Mnemonic does not have "four layers" called scratchpad, episodic, semantic, and relationships. That framing was wrong and is gone. It has seven flat semantic categories plus an importance score.

04 · The context tree is the distinctive design choice

When an agent asks for context, the obvious move is to dump the whole memory store into the prompt. That is expensive, and it gets more expensive every session. The /context endpoint does the opposite: it assembles a tiered view and loads it lazily.

L0 · category summaries

loaded eagerly · cheap

A short summary per category, loaded first, every time. This is the agent's compact map of what it knows: seven small summaries, not a thousand memories. It is cheap enough to send on every call.

L1 · the relevant slice

on demand

The memories inside the categories the query actually touches. If the question is about a decision, you pull the decision branch and leave the rest unloaded.

L2 · full detail on demand

only down the branch

Individual memories at full fidelity, pulled only when the answer needs them, only down the branch the question went.

A worked example. The agent gets a query about what a client decided on scope. It loads L0 first: seven one-line summaries, including "decision: 14 items, latest on pricing and scope." The query touches decisions, so it drills into L1 for that category and goes to L2 for the exact wording only if the answer needs it. The personal, technical, and temporal branches are never loaded.
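The worked example can be sketched in a few lines. This is a toy in-memory model of the idea, not the /context implementation: `ContextTree`, `build_context`, the summary format, and the 40-character previews are hypothetical, and the `loaded` list exists only to show which branches were actually read.

```python
from dataclasses import dataclass, field


@dataclass
class ContextTree:
    store: dict  # category name -> list of full memory texts (stand-in for the real store)
    loaded: list = field(default_factory=list)  # records which L1 branches were read

    def l0(self):
        """One short summary line per category; cheap, so always loaded."""
        return {cat: f"{cat}: {len(items)} items" for cat, items in self.store.items()}

    def l1(self, category, limit=3, preview_chars=40):
        """Truncated previews for one category, read only when a query touches it."""
        self.loaded.append(category)
        return [text[:preview_chars] for text in self.store.get(category, [])[:limit]]

    def l2(self, category):
        """Full-fidelity memories for one category, read only on request."""
        return list(self.store.get(category, []))


def build_context(tree, touched, need_detail=False):
    """Assemble L0 eagerly, then L1 (and optionally L2) only for touched categories."""
    context = {"l0": tree.l0(), "l1": {}, "l2": {}}
    for category in touched:
        context["l1"][category] = tree.l1(category)
        if need_detail:
            context["l2"][category] = tree.l2(category)
    return context
```

Calling `build_context(tree, ["decision"])` returns every L0 summary but only the decision branch at L1, and `tree.loaded` stays at `["decision"]`: the other branches are never read.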

This is the same instinct as routing in a cost-cutting pass, applied to memory instead of model choice. If you have read how I cut our LLM bill 28% without changing models, you know the principle: the cheapest call is the one you never make. The tiered context tree is that principle pointed at retrieval: the cheapest tokens are the ones you never load.

05 · Fact resolution, told precisely

This is the section I am most careful about, because it is the easiest place to overclaim. The failure mode is the contradictory fact: the user moved to a new city last month, but the old city is still sitting in the store. A flat vector store keeps both and surfaces whichever scores higher on a given query. Two things in Mnemonic address this, and they are not the same thing.

The first is mem0, and it is what runs today. mem0 underneath already reasons about whether an incoming fact should add, update, or delete an existing memory. So "moved to a new city" updates the location in place rather than stacking a second, contradictory one next to the first. This is the fact resolution that actually happens on every write right now.

The second is an explicit ContradictionDetector that Mnemonic ships. It is an LLM-judged step that decides keep-old, keep-new, or merge, for stricter and more auditable control than mem0's built-in reasoning. Here is the precise status: it is built, and it is covered by unit tests, but it is not yet wired into the /add path. Wiring it in behind a config flag is the first open issue on the repo, not a finished feature.
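For readers who want the shape of that step, here is a hedged sketch of a keep-old / keep-new / merge decision gated by a flag. It is not the repo's ContradictionDetector: `Verdict`, `resolve`, the `enabled` parameter, and the merge format are hypothetical, and `judge` is a plain callable standing in for the LLM call.

```python
from enum import Enum


class Verdict(str, Enum):
    KEEP_OLD = "keep_old"
    KEEP_NEW = "keep_new"
    MERGE = "merge"


def resolve(old, new, judge, enabled=False):
    """Return the text to store when `new` may contradict `old`.

    `judge(old, new)` returns a Verdict; in a real system it would wrap an
    LLM call. With the flag off, the new fact passes through unchanged and
    resolution is left to the underlying store (an assumption of this sketch).
    """
    if not enabled:
        return new
    verdict = judge(old, new)
    if verdict is Verdict.KEEP_OLD:
        return old
    if verdict is Verdict.MERGE:
        # Hypothetical merge format: keep the new fact, retain the old for audit.
        return f"{new} (previously: {old})"
    return new
```

Keeping the judge as an injected callable is what makes a step like this unit-testable without a model: a stub that always returns one verdict exercises each branch.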

The reason I will not paper over that gap: in agent memory, "we have a module for it" and "it runs on every write" are different claims, and the second is the only one that matters in production. mem0 carries the floor today; the detector raises the ceiling once it is wired.

06 · The rest of the server, with the same honesty

Three more pieces are real, listed with the same care about what is finished.

Compaction hook

shipped

When a session approaches the model's token limit, /compact saves the working context before older turns fall off the window. The direct fix for the amnesia in the opening paragraph.

Import pipeline

parsers tested · wiring open

/import ingests history from text, JSON, and CSV. The parsers are unit-tested, but finishing the end-to-end wiring is an open issue, so this is "parsers done, pipeline in flight," not "import works."

Visual explorer

shipped

A graph, a timeline, and a per-category dashboard in the browser. A memory layer you cannot see is a memory layer you cannot debug, so the explorer ships with the server.

Retrieval itself weighs vector similarity together with recency and importance, and a /reflect endpoint synthesizes a direct answer across the memories it pulled instead of handing back a list of fragments. That synthesis step is the difference between a pile of search hits and an answer.
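One way to combine those three signals is a weighted sum with an exponential recency decay. The sketch below is only that, one way: the weights, the 30-day half-life, and the names `Hit`, `score`, and `rank` are assumptions for illustration, not the formula Mnemonic uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    similarity: float  # vector similarity, assumed normalized to [0, 1]
    age_days: float    # time since the memory was written or last updated
    importance: float  # importance score, assumed in [0, 1]


def score(hit, w_sim=0.6, w_recency=0.25, w_importance=0.15, half_life_days=30.0):
    """Weighted sum of similarity, recency, and importance (hypothetical weights)."""
    # Recency halves every `half_life_days`: 1.0 when fresh, approaching 0 when old.
    recency = 0.5 ** (hit.age_days / half_life_days)
    return w_sim * hit.similarity + w_recency * recency + w_importance * hit.importance


def rank(hits, top_k=5):
    """Return the top_k hits by combined score, best first."""
    return sorted(hits, key=score, reverse=True)[:top_k]
```

With these weights, a slightly less similar but much fresher memory outranks a stale near-duplicate, which is the behaviour a pure nearest-neighbour list cannot give you.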

07 · The self-host argument, with the estimate labelled

Self-hosting here is not ideology. It is about where the data sits and what it costs.

// monthly run cost · self-hosted vs cloud memory APIs

Self-hosted on a small VPS with Qdrant runs about $2/month at this scale. The cloud memory APIs list $20+/month per developer. The $2 figure is an estimate for a small box, not a billing screenshot.
Option                               USD per month
Mnemonic · self-hosted (estimate)    ~2
Cloud memory APIs (listed price)     20+

To be exact about that $2: it is an estimate for a small box, not a billing screenshot I am holding up. The number that is not an estimate is the structural one. With a cloud memory API, your agents' conversation history lives on a vendor's server. Self-hosted, it lives on yours. For an agent reasoning over private business context, that is the deciding factor, not a preference. The same logic I would apply to a Postgres connection string applies to an agent's reasoning history, so I built the version where it does not have to leave the box.

08 · The honest history and where it actually stands

Mnemonic was not born as a clean open-source project. It was built as the memory layer for an in-house multi-agent stack of mine, which is why the repo still ships a plugin for that now-dead stack. Generalizing it in the open is the whole reason it was recently rebranded from its internal name.

On maturity, the real numbers: 36 unit tests cover the pure-function modules (the categorizer, the parsers, the retrieval scoring, and the detector). Route-level tests do not exist yet, a gap I am naming rather than hiding. It is at v4.0.0, with low stars and no external users I know of. None of that is a pitch; it is the state of the thing.

What works today, stated plainly: the FastAPI server, the seven-category auto-sorting, the L0/L1/L2 context tree, weighted retrieval with the /reflect synthesis step, the compaction hook, and the visual explorer. What is in flight: the explicit contradiction detector wired into /add, and the end-to-end import pipeline. Both are open issues, both visible on the repo.

I publish the unfinished edges as issues on purpose, instead of smoothing them over for a launch. An OSS memory layer has to earn trust that it does what it says, because you are going to put your agent's history inside it, and a clean story that hides the seams buys a worse outcome than naming them. The same discipline runs the audit sprint: lead with where the leak is, not where the win is.

09 · When to use something else instead

I would rather point you at the right tool than win an argument. Mnemonic is the right call when you run your own agents, want the conversation history on your own box, and want categorization plus the tiered context tree without writing that layer yourself. It is small and you can read it end to end.

It is the wrong call in a few clear cases. If you want a managed service with an SLA and zero ops, the hosted memory APIs exist for exactly that, and paying $20 a month to not run infrastructure is a reasonable trade. If you are happy embedding directly and do not need categories or tiered loading, mem0 on its own is less to operate. And if your need is a hosted, batteries-included product with a team behind it, the Mem0 Platform, Zep, and Supermemory are built to be that, where Mnemonic is the self-hosted layer you own.

The case for Mnemonic is narrow and I will keep it narrow: self-hosted, categorized, tiered memory for someone running a real agent fleet who wants the architecture without the lease. If that is you, grab it from GitHub, point it at a Qdrant instance, and read the issues to see exactly what you get and what is still being wired in. If you want it integrated into a live fleet without the plumbing, Find the leak is the applied version: one week, your stack, the right memory shape for your domain, and a working prototype against your data.

// sources cited

https://github.com/omarnagy91/mnemonic

// next move

Want this level of rigor on your own stack?

Find the leak: 1 week, $950, fixed scope. A plain-English plan plus one real fix built and working, yours to keep regardless.

Find the leak · $950 · start here, or email omar@neurascale.org

// related essays

llm-eval-ci: the gate that fails the PR when your LLM quietly regresses

2,831 orders on 30 commits: the boring system a donut shop kept over the better one I built