Every AI agent accumulates memory over time. Without a compression strategy, that memory becomes an unsearchable mess of contradictions, duplicates, and stale noise. This is how you prevent it — with graph theory, Union-Find clustering, and LLM-driven synthesis.
Here's a number most AI engineers never look at: the rate at which their agent's memory graph grows. In a production setting with a typical conversational agent, you're looking at roughly 15–40 new memory nodes per session. After 3 months of daily use, that's roughly 1,350–3,600 nodes. After a year, 5,500–14,600.
That sounds manageable until you add edges. A fully-connected semantic layer between N nodes has O(N²) potential edges (N(N−1)/2 for an undirected graph). At 3,000 nodes, that's roughly 4.5 million potential edges. Context-window injection of that graph is obviously impossible. Even a k-nearest-neighbours index over raw embeddings degrades significantly past about 500 nodes for the retrieval patterns agents actually use.
"The graph that felt fast at 200 nodes will feel wrong at 2,000. Not slow — wrong. The retrieved memories will be stale, contradictory, or contextually irrelevant. The agent will confidently reason from outdated information."
The solution isn't to cap memory size — that just throws away information. The solution is consolidation: periodically compressing fine-grained episodic fragments into higher-density semantic nodes, the same way human memory converts short-term episodes into long-term knowledge during sleep.
Before getting to the right answer, let's look at the three approaches most engineers try first — and why they fail.
Delete memories older than N days. Simple, fast, completely wrong. The most important memories in an agent's graph might be 6 months old — "user is allergic to framework X because of incident Y". TTL pruning has no concept of importance, only age. You'll delete exactly the institutional knowledge you need most.
Age and importance are uncorrelated. A 2-year-old memory about a fundamental architecture decision is infinitely more valuable than a 2-hour-old memory about what the user had for lunch.
If two nodes have cosine similarity > 0.9, delete one. Also wrong. High-similarity nodes are often carrying different causal or temporal information. "The auth module was broken on Tuesday" and "The auth module was broken on Thursday" are highly similar embeddings representing two different events that might have different causes. Dedup them and you lose half the signal.
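To make the failure mode concrete, here is a minimal sketch of how a similarity-threshold dedup rule fires on two fragments that encode distinct events. The vectors are made-up illustrations, not real embeddings; `cosineSimilarity` is the standard formula, not a library API.

```javascript
// Why similarity-threshold dedup is lossy (illustrative vectors only)
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// "Tuesday" and "Thursday" style fragments: nearly identical direction,
// so a > 0.9 rule deletes one and loses a distinct event.
const tuesday = [0.82, 0.31, 0.47, 0.12];
const thursday = [0.80, 0.33, 0.46, 0.15];
const sim = cosineSimilarity(tuesday, thursday);
console.log(sim > 0.9); // true: the naive rule would merge two different incidents
```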
Keep only the N most recently accessed nodes. Wrong in a different way. This creates recency bias — memories that are accessed frequently aren't necessarily important, they might just be frequently referenced because they're wrong and the agent keeps trying to reconcile them. LRU rewards confusion.
The core of the REM clustering phase is Union-Find — a classical data structure from graph theory that efficiently groups nodes into connected components. If you've implemented Kruskal's algorithm for minimum spanning trees, you've used it. Here's how it applies to memory.
The problem: given a graph with hundreds of memory nodes, find which groups of nodes are "about the same thing" — sufficiently connected or similar that they can be safely consolidated into a single synthesis node. This is a connected-components problem.
```javascript
class UnionFind {
  constructor(nodes) {
    this.parent = Object.fromEntries(nodes.map(n => [n.id, n.id]));
    this.rank = Object.fromEntries(nodes.map(n => [n.id, 0]));
  }
  find(id) { // Path compression
    if (this.parent[id] !== id)
      this.parent[id] = this.find(this.parent[id]);
    return this.parent[id];
  }
  union(a, b) { // Union by rank
    let [ra, rb] = [this.find(a), this.find(b)]; // `let`, not `const`: roots may be swapped below
    if (ra === rb) return;
    if (this.rank[ra] < this.rank[rb]) [ra, rb] = [rb, ra]; // swap so ra is the higher-rank root
    this.parent[rb] = ra;
    if (this.rank[ra] === this.rank[rb]) this.rank[ra]++;
  }
  clusters() { // Return groups with more than one member
    const groups = {};
    for (const id of Object.keys(this.parent)) {
      const root = this.find(id);
      (groups[root] ??= []).push(id);
    }
    return Object.values(groups).filter(g => g.length > 1);
  }
}

// Usage in REM cycle: link nodes that share entity connections
// or exceed semantic similarity threshold θ
async function clusterForREM(flaggedNodes, graph, θ = 0.78) {
  const uf = new UnionFind(flaggedNodes);
  const flaggedIds = new Set(flaggedNodes.map(n => n.id)); // only cluster within the flagged set
  // Connect nodes sharing entity edges
  for (const node of flaggedNodes) {
    const neighbours = await graph.entityNeighbours(node.id);
    neighbours.filter(n => flaggedIds.has(n.id)).forEach(n => uf.union(node.id, n.id));
  }
  // Connect nodes above similarity threshold
  for (const node of flaggedNodes) {
    const similar = await graph.similarNodes(node.embedding, { k: 8, threshold: θ });
    similar.filter(s => flaggedIds.has(s.id)).forEach(s => uf.union(node.id, s.id));
  }
  return uf.clusters(); // → [[id1, id2, id3], [id4, id5], ...]
}
```

The threshold θ = 0.78 is the key hyperparameter. Too low and you merge nodes that should stay separate. Too high and you get no compression. In practice, 0.75–0.82 works well for most agent use cases. It's worth running an offline analysis on your graph to tune this for your domain.
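That offline analysis doesn't need much machinery. Here's a sketch of a θ sweep over a sampled set of node pairs with precomputed similarities; `sweepTheta` and the sample data are illustrative, not part of any real pipeline.

```javascript
// Offline θ sweep (sketch): for each candidate threshold, cluster a sample
// of node pairs by similarity and report how aggressively nodes would merge.
function sweepTheta(nodeIds, pairs, thetas) {
  return thetas.map(theta => {
    const parent = Object.fromEntries(nodeIds.map(id => [id, id]));
    const find = id => parent[id] === id ? id : (parent[id] = find(parent[id]));
    for (const { a, b, sim } of pairs) {
      if (sim >= theta) parent[find(a)] = find(b); // union when above threshold
    }
    const roots = new Set(nodeIds.map(find));
    // fewer roots means more aggressive merging at this threshold
    return { theta, clusters: roots.size, merged: nodeIds.length - roots.size };
  });
}

const nodes = ['a', 'b', 'c', 'd'];
const pairs = [
  { a: 'a', b: 'b', sim: 0.91 },
  { a: 'b', b: 'c', sim: 0.79 },
  { a: 'c', b: 'd', sim: 0.55 },
];
console.log(sweepTheta(nodes, pairs, [0.75, 0.82, 0.90]));
```

Plotting `merged` against θ on your own graph usually shows a knee; picking θ just above that knee is a reasonable starting point before qualitative review.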
Once Union-Find has identified a cluster, the synthesis step sends it to an LLM with a carefully structured prompt. This is where the actual intelligence lives — the LLM distils the cluster into a higher-level insight that contains the essential signal from all members.
The prompt design matters enormously. A naive "summarise these memories" produces a lossy average. A well-structured synthesis prompt preserves causal structure, temporal ordering, entity relationships, and contradiction signals.
```javascript
function buildSynthesisPrompt(cluster, edges) {
  return `You are a memory consolidation engine for an AI agent.
Below are ${cluster.length} related memory fragments collected over time.
Your task: synthesise them into ONE high-density insight node.

RULES:
1. Preserve all causal relationships (X caused Y, X led to Z)
2. Preserve temporal ordering (before/after, sequence)
3. Preserve entity associations (people, projects, decisions)
4. If fragments contradict each other, note the resolution
5. Output must be MORE information-dense than any single fragment
6. DO NOT average or generalise — distil and compress

FRAGMENTS:
${cluster.map((m, i) => `[${i+1}] ${m.content} (importance: ${m.importance.toFixed(2)})`).join('\n')}

CAUSAL EDGES WITHIN CLUSTER:
${edges.filter(e => e.type === 'causal').map(e => `${e.from_summary} → ${e.to_summary}`).join('\n')}

OUTPUT FORMAT (JSON):
{
  "synthesis": "the distilled insight — one dense paragraph",
  "key_entities": ["entity1", "entity2"],
  "causal_chains": ["cause → effect"],
  "temporal_anchors": ["earliest event", "latest event"],
  "importance": 0.0–1.0
}`;
}
```

The JSON output structure is deliberate. It forces the LLM to explicitly preserve what matters — causal chains and temporal anchors — rather than producing a fluffy narrative summary that loses the structure. The `importance` field becomes the synthesis node's score, typically 15–30% higher than the mean of its cluster members.
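Whatever client you call the model with, the output needs validation before it becomes a graph node. Here's a minimal sketch of parsing the JSON the prompt requests; the field names match the OUTPUT FORMAT above, while the null-on-failure policy and the importance clamping are assumptions, not a fixed API.

```javascript
// Validate and normalise the model's JSON output before it becomes a node.
// Returns null on any failure so the caller can leave the cluster intact.
function parseSynthesis(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // model wrapped the JSON in prose or fences: retry later
  }
  if (typeof parsed.synthesis !== 'string' || typeof parsed.importance !== 'number') {
    return null; // missing the fields the rest of the pipeline depends on
  }
  return {
    content: parsed.synthesis,
    entities: parsed.key_entities ?? [],
    causalChains: parsed.causal_chains ?? [],
    importance: Math.min(1, Math.max(0, parsed.importance)), // clamp to [0, 1]
  };
}

const ok = parseSynthesis('{"synthesis":"auth broke twice due to cert rotation","importance":0.84}');
const bad = parseSynthesis('Sure! Here is the summary: ...');
console.log(ok.importance, bad); // 0.84 null
```

Failing closed here matters: a cluster that survives one REM cycle unconsolidated is cheap, but a malformed node injected into the graph is not.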
After synthesis, you have a new node. But the cluster members had edges to other nodes in the graph — connections to entities, causes, temporally adjacent memories. If you just delete the cluster and add the synthesis node, you lose all those connections. You've created a floating island.
```javascript
async function migrateEdges(clusterIds, synthesisNodeId, db) {
  const placeholders = clusterIds.map(() => '?').join(',');
  // Find all outbound edges from cluster members to OUTSIDE the cluster
  const externalEdges = await db.all(`
    SELECT * FROM edges
    WHERE from_id IN (${placeholders})
      AND to_id NOT IN (${placeholders})
  `, [...clusterIds, ...clusterIds]);
  // Deduplicate: if multiple cluster members pointed to the same target,
  // keep the highest-weight edge only
  const deduped = deduplicateEdges(externalEdges);
  // Repoint all edges to the synthesis node
  for (const edge of deduped) {
    await db.run(
      'INSERT INTO edges (from_id, to_id, type, weight) VALUES (?, ?, ?, ?)',
      [synthesisNodeId, edge.to_id, edge.type, edge.weight]
    );
  }
  // Soft-delete cluster members — don't hard delete in case rollback needed
  await db.run(
    `UPDATE memories SET deprecated = 1 WHERE id IN (${placeholders})`,
    clusterIds
  );
}

// Keep only the highest-weight edge per (target, type) pair
function deduplicateEdges(edges) {
  const best = new Map();
  for (const e of edges) {
    const key = `${e.to_id}:${e.type}`;
    if (!best.has(key) || e.weight > best.get(key).weight) best.set(key, e);
  }
  return [...best.values()];
}
```

The REM cycle only targets nodes below an importance threshold τ. Getting the scoring function right determines whether your agent becomes smarter or dumber over time.
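The exact scoring formula isn't reproduced here. As a hypothetical illustration only, one common recipe blends recency decay, access frequency, graph centrality, and a model-rated salience; every weight and constant below is a placeholder you would tune offline, not a recommended value.

```javascript
// Illustrative importance score, NOT a canonical formula.
// Assumed node shape: { createdAt, accessCount, degree, salience }
function importanceScore(node, now = Date.now()) {
  const ageDays = (now - node.createdAt) / 86_400_000;
  const recency = Math.exp(-ageDays / 90);            // decays over ~3 months
  const frequency = Math.log1p(node.accessCount) / 5;  // damps heavy hitters
  const centrality = Math.min(1, node.degree / 20);    // edge count, capped at 1
  const salience = node.salience;                      // 0–1, rated at write time
  return 0.2 * recency + 0.2 * frequency + 0.3 * centrality + 0.3 * salience;
}

// Nodes scoring below τ are eligible for the next REM consolidation pass
const τ = 0.4;
const node = { createdAt: Date.now() - 30 * 86_400_000, accessCount: 2, degree: 3, salience: 0.5 };
console.log(importanceScore(node) < τ); // true for this month-old, low-traffic node
```

Note that age only enters through a decay term, never as a hard cutoff, which is exactly what keeps this from collapsing into TTL pruning.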
Running REM too frequently wastes LLM tokens on clusters too small to meaningfully compress. Running it too infrequently lets the graph degrade. The right trigger is event-based, not time-based.
In production, combine the quality-decay and idle-period triggers: monitor retrieval quality continuously, and also run REM during every idle period longer than 4 hours. The idle run catches routine fragmentation. The quality-decay trigger catches unexpected spikes.
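The combined trigger reduces to a single predicate. In this sketch, the 0.6 hit-rate floor, the one-hour cooldown, and the `state` shape are all assumptions to adapt to your own metrics:

```javascript
// Event-based REM trigger: fire on long idle OR retrieval-quality decay,
// never more often than the cooldown allows. All thresholds are placeholders.
function shouldRunREM(state, now = Date.now()) {
  const idleMs = now - state.lastActivityAt;
  const idleTrigger = idleMs > 4 * 3600 * 1000;           // idle longer than 4h
  const qualityTrigger = state.retrievalHitRate < 0.6;     // quality decay spike
  const cooldownOk = now - state.lastRemAt > 3600 * 1000;  // avoid thrashing
  return cooldownOk && (idleTrigger || qualityTrigger);
}

const now = Date.now();
console.log(shouldRunREM({
  lastActivityAt: now - 5 * 3600 * 1000,  // idle for 5 hours
  retrievalHitRate: 0.9,
  lastRemAt: now - 24 * 3600 * 1000,
}, now)); // true: the idle trigger fires
```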
Memory compression for AI agents is a solved problem — it just hasn't been widely adopted yet. The research is clear (arXiv:2601.02163, arXiv:2601.03236), the algorithms are classical (Union-Find, graph traversal), and the infrastructure is minimal (SQLite + sqlite-vec).
The agent that runs REM compression becomes genuinely more capable over time. It doesn't just remember more efficiently — it abstracts episodic details into semantic knowledge, the same transition that separates expertise from experience. The graph gets smarter, not just smaller.
What you need to build this: a graph schema that separates nodes from edges, an importance scoring function with the four components above, a Union-Find implementation for clustering, a synthesis prompt that preserves causal and temporal structure, and an edge inheritance step that keeps the graph connected after pruning.
Everything described here is implemented in VEKTOR's open architecture at vektormemory.com. The research papers are all available on arXiv. The SQLite-vec extension is open source. There's no reason your agent should be forgetting who you are.