Short-term context is like a person’s working memory during a meeting. They remember everything discussed so far in the meeting, but once the meeting ends, only the important takeaways persist as long-term memories.

How short-term context builds

Every conversational turn adds to the short-term context. A “turn” consists of a user message and the corresponding assistant response. As the conversation progresses, the context grows:
Turn 1:  User: "What's our current API rate limit?"
         Assistant: "Your current rate limit is 1,000 requests per minute."
         Context size: ~50 tokens

Turn 2:  User: "Can we increase that for our enterprise plan?"
         Assistant: "Yes, enterprise plans support up to 10,000 req/min..."
         Context size: ~120 tokens

Turn 3:  User: "What about burst handling?"
         Assistant: "Burst allowances provide a 2x multiplier for 30 seconds..."
         Context size: ~200 tokens

  ...

Turn 25: Context size: ~4,000 tokens
Turn 50: Context size: ~8,000 tokens
Each turn is appended to the running conversation history. Your agent sees the full history on every subsequent turn, which is what gives it the ability to reference earlier parts of the conversation (“as I mentioned earlier…” or “going back to your question about rate limits…”).
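
The append-and-resend behavior described above can be sketched in a few lines. This is a minimal illustration, not Synap's implementation: the `Turn` and `ConversationHistory` names are hypothetical, and `estimate_tokens` uses a rough four-characters-per-token heuristic in place of a real tokenizer.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


@dataclass
class Turn:
    user: str
    assistant: str

    def tokens(self) -> int:
        return estimate_tokens(self.user) + estimate_tokens(self.assistant)


@dataclass
class ConversationHistory:
    turns: list[Turn] = field(default_factory=list)

    def append(self, user: str, assistant: str) -> None:
        """Add one completed turn to the running history."""
        self.turns.append(Turn(user, assistant))

    def total_tokens(self) -> int:
        """Approximate size of the history the model sees on the next turn."""
        return sum(turn.tokens() for turn in self.turns)


history = ConversationHistory()
history.append("What's our current API rate limit?",
               "Your current rate limit is 1,000 requests per minute.")
size_after_one = history.total_tokens()
history.append("Can we increase that for our enterprise plan?",
               "Yes, enterprise plans support up to 10,000 req/min...")
size_after_two = history.total_tokens()
print(size_after_one, size_after_two)  # the second value is the larger one
```

Because nothing is ever removed, `total_tokens()` can only grow, which is the behavior the rest of this page addresses.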

The context window problem

Short-term context cannot grow indefinitely. There are three practical constraints:

Token limits

Every LLM has a maximum context window. Once the conversation history exceeds this limit, older content must be dropped or compressed. Even with large context windows (100K+ tokens), filling them entirely with conversation history leaves little room for retrieved long-term memories and system instructions.
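
To make the trade-off concrete, the sketch below works out how many tokens remain for retrieved long-term memories once history, system instructions, and an output reserve are accounted for. The function name and every number are illustrative assumptions, not Synap defaults.

```python
def remaining_budget(context_window: int, history_tokens: int,
                     system_tokens: int, reserved_for_output: int) -> int:
    """Tokens left for retrieved long-term memories after fixed costs."""
    return context_window - history_tokens - system_tokens - reserved_for_output


# All numbers below are illustrative assumptions, not Synap defaults.
window = 100_000
light = remaining_budget(window, history_tokens=8_000,
                         system_tokens=1_500, reserved_for_output=2_000)
heavy = remaining_budget(window, history_tokens=95_000,
                         system_tokens=1_500, reserved_for_output=2_000)
print(light)  # 88500
print(heavy)  # 1500
```

With a short history almost the whole window is available for retrieval; with a history that nearly fills the window, only a sliver remains.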

Cost scaling

LLM costs scale with input token count. A turn that resends 50,000 tokens of history processes 25 times as many input tokens as one that resends 2,000, and costs correspondingly more. For high-volume applications, unbounded context growth can make costs unsustainable.
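
Because the full history is resent on every turn, total input tokens across a conversation grow much faster than the turn count. The sketch below assumes a steady ~160 tokens per turn (consistent with the ~4,000 tokens at turn 25 and ~8,000 at turn 50 shown earlier) and approximates the input on turn t as t times that figure. The function name is hypothetical, and no pricing is assumed.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when the full history is resent every turn."""
    # On turn t the model reads roughly t * tokens_per_turn tokens.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


after_25 = cumulative_input_tokens(25, tokens_per_turn=160)
after_50 = cumulative_input_tokens(50, tokens_per_turn=160)
print(after_25)  # 52000
print(after_50)  # 204000
```

Doubling the conversation length from 25 to 50 turns nearly quadruples the cumulative input, which is why unbounded growth becomes expensive at volume.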

Quality degradation

Research shows that LLMs pay less attention to content in the middle of long contexts (the “lost in the middle” phenomenon). Extremely long conversation histories can actually degrade response quality as the model struggles to identify the most relevant parts.
This is where context compaction becomes essential.

When compaction kicks in

Synap monitors the short-term context and triggers compaction when configurable thresholds are exceeded. You can configure these thresholds through the Memory Architecture Configuration:

| Threshold | What it measures | Default | Configurable via |
| --- | --- | --- | --- |
| Token count | Total tokens in the conversation history | Varies by model | MACA retrieval.budget.max_tokens |
| Turn count | Number of conversational turns | Configurable | MACA retrieval settings |
| Cost threshold | Accumulated cost of context per conversation | Configurable | MACA retrieval settings |

When any threshold is exceeded, Synap initiates the compaction process.
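
The "any threshold" rule can be sketched as a simple check. This is an illustration only: the `Thresholds` and `ContextStats` classes and their field names are hypothetical (the only setting named on this page is `retrieval.budget.max_tokens`), and the limits used are made-up values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    # Hypothetical field names; None means "no limit configured".
    max_tokens: Optional[int] = None
    max_turns: Optional[int] = None
    max_cost_usd: Optional[float] = None


@dataclass
class ContextStats:
    tokens: int
    turns: int
    cost_usd: float


def exceeded_thresholds(stats: ContextStats, limits: Thresholds) -> list[str]:
    """Return the name of every threshold the conversation has exceeded."""
    exceeded = []
    if limits.max_tokens is not None and stats.tokens > limits.max_tokens:
        exceeded.append("token_count")
    if limits.max_turns is not None and stats.turns > limits.max_turns:
        exceeded.append("turn_count")
    if limits.max_cost_usd is not None and stats.cost_usd > limits.max_cost_usd:
        exceeded.append("cost_threshold")
    return exceeded


def should_compact(stats: ContextStats, limits: Thresholds) -> bool:
    """Compaction starts as soon as any single threshold is exceeded."""
    return bool(exceeded_thresholds(stats, limits))


limits = Thresholds(max_tokens=6_000, max_turns=40)
early = ContextStats(tokens=4_000, turns=25, cost_usd=0.0)
late = ContextStats(tokens=8_000, turns=50, cost_usd=0.0)
print(should_compact(early, limits))  # False
print(should_compact(late, limits))   # True
```

Returning the list of exceeded thresholds, rather than only a boolean, makes it easy to log why compaction was triggered.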

What happens during compaction

Compaction is not simply truncating old messages. Synap performs intelligent compression that preserves essential information while reducing token count.
1. Analyze the conversation

Synap examines the full conversation history to identify key information: facts established, decisions made, preferences expressed, action items created, and the current topic of discussion.

2. Extract essential information

Critical information is extracted into structured summaries. This includes:
  • Active facts: information that is still relevant to the conversation
  • Decisions made: choices or conclusions reached during the discussion
  • Current state: what the user is currently working on or asking about
  • Pending items: unanswered questions or open action items
  • User preferences: communication style and format preferences expressed during the conversation

3. Compress the history

The original verbose conversation history is replaced with a compressed representation. Early turns that have been fully resolved are condensed into summaries. Recent turns (typically the last 3-5) are preserved verbatim to maintain immediate conversational flow.

4. Persist to long-term memory

Information extracted during compaction that has lasting value (facts, preferences, episodes, temporal events) is also routed through the ingestion pipeline to become long-term memories. This ensures that valuable knowledge from conversations is not lost when the short-term context is compressed.
Compaction is a lossy process. While Synap preserves the most important information, some conversational nuance and detail from early turns is inevitably lost. For applications where complete conversation fidelity is required, consider storing full transcripts separately and relying on Synap for the semantic layer.
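
The "summarize older turns, keep recent turns verbatim" idea from the compression step can be sketched as follows. This is not Synap's algorithm: `compact` and `summarize` are hypothetical names, and the summarizer is a deliberately naive placeholder (it keeps the first sentence of each answer) standing in for whatever extraction Synap actually performs. Routing extracted knowledge to long-term memory is not shown.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    assistant: str


def summarize(turns: list[Turn]) -> str:
    """Placeholder summarizer: keeps only the first sentence of each answer."""
    points = [turn.assistant.split(". ")[0].rstrip(".") for turn in turns]
    return "Summary of earlier turns: " + "; ".join(points) + "."


def compact(turns: list[Turn], keep_recent: int = 3) -> tuple[str, list[Turn]]:
    """Split history into a summary of older turns plus verbatim recent turns."""
    if len(turns) <= keep_recent:
        return "", list(turns)  # nothing old enough to compress
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return summarize(older), recent


history = [Turn(f"question {i}", f"Answer {i}. Extra detail {i}.")
           for i in range(1, 11)]
summary, recent = compact(history, keep_recent=3)
print(summary)
print(len(recent))     # 3
print(recent[0].user)  # question 8
```

The lossy nature of compaction is visible here: the "Extra detail" sentences from the first seven turns do not survive in the summary.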

Compaction in practice

Here is a simplified example of a conversation history before compaction:
Turn 1: User asks about API rate limits
Turn 2: Assistant explains 1,000 req/min default
Turn 3: User asks about enterprise pricing
Turn 4: Assistant provides pricing tiers
Turn 5: User asks about SSO support
Turn 6: Assistant confirms SAML and OIDC support
Turn 7: User asks about data residency
Turn 8: Assistant explains EU and US region options
Turn 9: User asks about migration from competitor
Turn 10: Assistant outlines migration process
...
Turn 20: User asks a follow-up about migration timeline
Context size: ~6,000 tokens
After compaction, the compressed version retains the essential information while reducing context size by approximately 75%, from ~6,000 tokens to roughly 1,500. The most recent turns are preserved verbatim for conversational continuity.
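
As a rough picture of what a compressed representation of this conversation might hold, the sketch below groups information under the categories extracted during compaction. The `CompactedContext` class, its field names, and the rendered layout are assumptions for illustration; the entries are taken from the example turns above, and Synap's actual output format is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class CompactedContext:
    """Hypothetical container mirroring the categories extracted during compaction."""
    active_facts: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    current_state: str = ""
    pending_items: list[str] = field(default_factory=list)
    user_preferences: list[str] = field(default_factory=list)
    recent_turns: list[str] = field(default_factory=list)  # kept verbatim

    def render(self) -> str:
        """Flatten the summary into text that could be placed in a prompt."""
        sections = [
            ("Active facts", self.active_facts),
            ("Decisions", self.decisions),
            ("Current state", [self.current_state] if self.current_state else []),
            ("Pending items", self.pending_items),
            ("User preferences", self.user_preferences),
            ("Recent turns", self.recent_turns),
        ]
        lines = []
        for title, items in sections:
            if items:  # skip empty categories to save tokens
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


compacted = CompactedContext(
    active_facts=[
        "Default API rate limit is 1,000 req/min",
        "SSO is supported via SAML and OIDC",
        "Data residency options: EU and US regions",
    ],
    current_state="Discussing migration from a competitor",
    pending_items=["Follow-up question about migration timeline"],
    recent_turns=["User: (turn 20) follow-up about migration timeline"],
)
rendered = compacted.render()
print(rendered)
```

Empty categories (here, decisions and user preferences) are omitted from the rendered text so they cost no tokens.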

Relationship to long-term memory

Short-term and long-term memory work together. During a conversation:
  1. Long-term memories are retrieved at the start of each turn to provide background knowledge
  2. Short-term context provides immediate conversational continuity
  3. During compaction, extracted knowledge is routed to long-term storage
  4. After the conversation ends, remaining short-term context can be ingested as a conversation document, producing additional long-term memories
# During a conversation, both context types work together.
# Assumes `sdk` is an already-initialized Synap SDK client and that this
# call runs inside an async function (it uses `await`).
context = await sdk.user.context.fetch(
    user_id="user_alice",
    customer_id="acme_corp",
    search_query=["migration timeline"],
    conversation_id="conv_abc123",  # Links to short-term context
)

# context includes:
# - Long-term memories: "Alice is evaluating enterprise plan" (from past sessions)
# - Short-term context: "Currently discussing migration from competitor" (this session)
This dual-memory architecture mirrors how human memory works: we draw on both what we remember from past experiences (long-term) and what we are currently thinking about (short-term) to formulate responses.
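
One way an application might combine the two memory types into a single prompt is sketched below. The shape of the object returned by `sdk.user.context.fetch` is not documented on this page, so plain lists of strings stand in for it; `build_prompt` and the section labels are hypothetical.

```python
def build_prompt(system: str, long_term: list[str],
                 short_term: list[str], user_message: str) -> str:
    """Combine both memory types into a single prompt string."""
    parts = [system]
    if long_term:
        parts.append("Background (long-term memories):")
        parts.extend(f"- {memory}" for memory in long_term)
    if short_term:
        parts.append("This conversation so far (short-term context):")
        parts.extend(f"- {item}" for item in short_term)
    parts.append(f"User: {user_message}")
    return "\n".join(parts)


prompt = build_prompt(
    system="You are a helpful support assistant.",
    long_term=["Alice is evaluating enterprise plan"],
    short_term=["Currently discussing migration from competitor"],
    user_message="How long will the migration take?",
)
print(prompt)
```

Keeping the two sources in separately labeled sections lets the model distinguish background knowledge from what was said in the current session.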

Next steps

Context Compaction

Technical deep dive into the compaction algorithm and configuration options.

Long-term Context

How persistent knowledge is stored and retrieved across sessions.

SDK: Context Compaction

Implement and configure context compaction in your application.

Memories & Context

Return to the overview of memories and context in Synap.