
Intelligent Context Compression

Non-blocking, continuous context management that makes AI conversations feel infinite.

The Problem I Kept Running Into

Most AI agents hit a wall when conversations get long. The standard approach — LangChain's summarizationMiddleware — blocks the entire agent while it generates a summary. The user stares at a spinner, the agent can't respond, and the most recent messages (the freshest, most relevant context) get caught in the summarization window. It's a tedious interruption that completely breaks flow.

The fallback is even worse: when context overflows, it just trims old messages — no summarization, just information silently deleted. You lose context you still needed.

Neither of these is acceptable if you're building something that feels like a real thinking partner.


Two-Phase Architecture

Idea 1: Background Snapshot (non-blocking compaction)

Instead of blocking the agent to summarize, run a background Cerebras agent that creates a context bundle while the user keeps chatting. The bundle gets injected seamlessly at the next turn boundary. The user never sees a spinner. The agent never pauses. It just keeps going.

Here's the flow:

User + Agent chatting normally

    [70% context reached]

         ├──→ Background Cerebras agent starts summarizing
         │    (runs in parallel — user and agent NOT blocked)

         ├──→ Conversation continues normally...
         │    Messages after snapshot index are tracked verbatim

    [Cerebras finishes — usually 10-15 seconds for 100K tokens]

    [Next turn boundary]

         └──→ Inject context bundle:
              1. System prompt (preserved)
              2. Compressed context bundle
              3. Recent messages since snapshot (verbatim, untouched)

              UI: subtle "context compressed" divider — no drama
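The injection step above can be sketched as a small prompt-assembly function. The `ContextBundle` shape and names here are illustrative assumptions, not the project's actual types:

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical bundle produced by the background agent.
interface ContextBundle {
  summary: string;       // compressed history
  snapshotIndex: number; // last message index covered by the summary
}

// Rebuild the prompt at the turn boundary: system prompt first,
// then the compressed bundle, then every message after the snapshot
// index, verbatim and untouched.
function assemblePrompt(
  systemPrompt: string,
  bundle: ContextBundle,
  messages: Message[],
): Message[] {
  const recent = messages.slice(bundle.snapshotIndex + 1);
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: `[Compressed context]\n${bundle.summary}` },
    ...recent,
  ];
}
```

Because recent messages are passed through untouched, nothing the user said after the snapshot is ever paraphrased by the summarizer.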

A few considerations that make it non-blocking:

The background provider is Cerebras running at ~1000 tokens/second. A 100K token conversation compresses in 10–15 seconds — fast enough that by the time the user sends their next message, the bundle is almost always ready. If it's not, wait briefly (up to 5 seconds) or fall back to trim-based compression. Everything after the snapshot index is kept verbatim, so in-flight context is never summarized mid-thought.
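The wait-briefly-or-fall-back logic is a race between the pending bundle and a short timeout. A minimal sketch, where `trimOldest` and the timeout value stand in for whatever the real fallback does:

```typescript
// Fallback: keep only the most recent messages (illustrative).
function trimOldest<T>(messages: T[], keep: number): T[] {
  return messages.slice(-keep);
}

// Wait up to `timeoutMs` for the background summary; if it isn't
// ready, fall back to trim-based compression instead of blocking.
async function resolveContext<T>(
  pendingBundle: Promise<string>,
  messages: T[],
  timeoutMs = 5000,
): Promise<{ summary: string | null; messages: T[] }> {
  const timeout = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), timeoutMs),
  );
  const summary = await Promise.race([pendingBundle, timeout]);
  if (summary !== null) {
    // Bundle arrived in time: inject it, keep post-snapshot messages.
    return { summary, messages };
  }
  // Bundle still cooking: trim instead, never block the user.
  return { summary: null, messages: trimOldest(messages, 20) };
}
```

The agent loop never awaits the summarizer directly — only this bounded race — so the worst case is a few seconds of extra latency, not an indefinite spinner.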

The summary prompt is also custom — not LangChain's default. It explicitly captures what actually matters: decisions made and their reasoning, files modified, current task state and progress, user preferences expressed, errors encountered and how they were resolved, and any active constraints or open questions. The things an agent actually needs to keep working.
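Concretely, a compaction prompt along those lines might look like the following. The exact wording here is an assumption for illustration, not the project's actual prompt:

```typescript
// Illustrative compaction prompt covering the categories described
// above. The real prompt may be worded differently.
const COMPACTION_PROMPT = `
Summarize the conversation below for an AI agent that will keep working on the task.
Capture, as structured sections:
- Decisions made and the reasoning behind them
- Files modified (paths and what changed)
- Current task state and progress
- User preferences expressed
- Errors encountered and how they were resolved
- Active constraints and open questions
Omit pleasantries and dead-end exploration. Be dense but precise.
`.trim();
```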


Idea 2: Continuous Context Pipe (infinite memory)

Idea 1 is the snapshot approach — one-shot compression at 65–70%. Idea 2 is much more interesting.

Instead of waiting until context fills up, evolve the system into a continuously-maintained context store that updates after every agent turn. The agent's context window stays perpetually lean — always under 50% — and context never "runs out" because it's never accumulating unchecked in the first place.

Every agent turn completes

    [Cerebras context processor]
    "Here's the latest exchange. Here's the current store. Update it."


    [Persistent Context Store]  ← structured markdown, not a vector DB
    ┌─────────────────────────┐
    │ ## Task State            │  ← always current
    │ ## Decisions             │  ← append-only, never loses decisions
    │ ## Files & Code          │  ← tracks all modifications
    │ ## Important Context     │  ← drift detection: fades stale items
    │ ## User Preferences      │  ← permanent unless contradicted
    └─────────────────────────┘


    [Agent prompt assembly]
    System prompt + Context store snapshot + Last ~20 messages
    Always < 50% of context limit
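The per-turn pipeline above reduces to two small functions: update the store after each exchange, and assemble the next prompt from the store plus a short verbatim tail. The updater callback stands in for the Cerebras context processor; all names here are assumptions:

```typescript
interface Turn {
  user: string;
  assistant: string;
}

// Stand-in for the Cerebras context-processor call:
// "Here's the latest exchange. Here's the current store. Update it."
type StoreUpdater = (store: string, turn: Turn) => Promise<string>;

async function afterTurn(
  store: string,
  turn: Turn,
  update: StoreUpdater,
): Promise<string> {
  return update(store, turn);
}

// Prompt assembly: system prompt + store snapshot + last ~20 messages
// (10 exchanges) verbatim. Context usage stays bounded because the
// store is the only thing that grows, and the processor keeps it lean.
function buildAgentPrompt(system: string, store: string, history: Turn[]): string {
  const recent = history.slice(-10);
  const transcript = recent
    .map((t) => `User: ${t.user}\nAssistant: ${t.assistant}`)
    .join("\n");
  return `${system}\n\n[Context store]\n${store}\n\n${transcript}`;
}
```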

The key differences from Phase 1:

| Aspect        | Phase 1                     | Phase 2                             |
| ------------- | --------------------------- | ----------------------------------- |
| When          | One-shot at 65–70%          | After every turn                    |
| Store         | Ephemeral                   | Persistent per-thread               |
| Content       | Full conversation summary   | Structured, categorized, with drift |
| Context usage | Grows → compacts → regrows  | Always < 50%, no spikes             |

Drift detection is one of the more interesting problems here. Not everything stays relevant. Debugging traces, exploration dead-ends, and superseded approaches should fade. Decisions, user preferences, and file modifications should never be lost. And if something old gets referenced again, it gets promoted back to the top. The context store becomes a living, self-managing document.
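One way to sketch those fade/never-fade/promote rules, assuming store items carry a category and a last-referenced turn (both are illustrative assumptions about the store's shape):

```typescript
interface StoreItem {
  text: string;
  category: "decision" | "preference" | "file" | "context";
  lastReferencedTurn: number;
}

// Only generic "context" items are allowed to fade; decisions,
// preferences, and file modifications are never dropped.
const FADE_AFTER_TURNS = 15; // hypothetical threshold

function driftPass(items: StoreItem[], currentTurn: number): StoreItem[] {
  return items.filter(
    (it) =>
      it.category !== "context" ||
      currentTurn - it.lastReferencedTurn <= FADE_AFTER_TURNS,
  );
}

// Re-referencing an old item promotes it back to "fresh".
function touch(item: StoreItem, currentTurn: number): StoreItem {
  return { ...item, lastReferencedTurn: currentTurn };
}
```

Debugging traces fade after enough turns of silence; a decision from turn 3 survives turn 300; and anything the conversation circles back to gets its clock reset.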


Why This Matters

A blocking spinner when you're mid-flow is annoying, but the context pollution at 70% and the ongoing need to repeat yourself are worse. An agent that manages context gracefully — one that never pauses, never forgets a decision you made 40 messages ago, and never silently drops your preferences — would feel like a true step up.

That's what ICC is designed to be.
