If you are building a real-time chatbot or customer-facing agent, start with fast mode for both ingestion and retrieval. You can selectively upgrade specific interactions to accurate/long-range mode as needed without changing your overall architecture.

Fast ingestion

Fast ingestion runs a lightweight version of the extraction pipeline, optimized to make memories available as quickly as possible.

What it does

| Stage | Behavior |
| --- | --- |
| Chunking | Basic semantic chunking by paragraph and sentence boundaries |
| Entity extraction | Lightweight named entity recognition (people, organizations, products) |
| Embedding | Vector embeddings generated for each chunk (1536 dimensions) |
| Preference detection | Basic keyword-based preference identification |
| Storage | Chunks stored in vector store; entities indexed for lookup |
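The chunking stage can be pictured with a minimal sketch. This is not the actual Synap pipeline, just an illustration of paragraph-then-sentence splitting; the `max_chars` threshold and the regex boundaries are assumptions:

```python
import re

def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    """Illustrative paragraph-then-sentence chunker (not the real pipeline).

    Splits on blank-line paragraph boundaries first, then packs sentences
    into chunks of at most max_chars characters.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        current = ""
        for sentence in sentences:
            # Start a new chunk when adding this sentence would overflow
            if current and len(current) + len(sentence) + 1 > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Each resulting chunk is then embedded and stored independently, which is why retrieval in fast mode only sees entities that co-occur within a chunk.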

What it skips

| Stage | Skipped in fast mode |
| --- | --- |
| Deep entity resolution | No cross-reference matching against the full entity registry. Auto-registration still occurs, but semantic matching against existing entries is limited. |
| Relationship mapping | No graph edges created between entities. The relationships between people, projects, and decisions are not explicitly modeled. |
| Advanced categorization | No topic hierarchy classification or domain-specific tagging beyond basic entity types. |
| Emotional analysis | No sentiment or emotional tone analysis of the conversation. |

Processing time

Fast ingestion typically completes in 1-5 seconds. Memories become available for vector-based retrieval almost immediately after processing.

When to use fast ingestion

  • Real-time chat logging: Every conversation turn in a live agent interaction.
  • High-throughput pipelines: Applications that ingest hundreds of documents per minute.
  • Non-critical context: Routine conversations, status updates, simple Q&A interactions.
  • Ephemeral content: Data that is useful for near-term context but does not need deep relationship modeling.

Code example

from synap import Synap

sdk = Synap(api_key="your_api_key")

# Fast ingestion for a routine conversation turn
await sdk.memories.create(
    document="User: What's the status of my order?\n"
             "Assistant: Your order #4521 shipped yesterday and should arrive by Thursday.",
    document_type="ai-chat-conversation",
    user_id="user_123",
    customer_id="acme_corp",
    mode="fast"
)
# Returns immediately. Memory available for retrieval within seconds.

Fast retrieval

Fast retrieval uses vector similarity search exclusively, skipping graph traversal and multi-signal ranking. This produces results in ~50-100ms, making it suitable for the hot path of real-time conversations.

How it works

1. Query embedding: The user’s query is converted into a vector embedding using the same model used during ingestion (1536 dimensions).
2. Vector similarity search: The query embedding is compared against stored memory embeddings using cosine similarity. The search is scoped to the applicable scope levels (user, customer, client, world) based on the provided user_id and customer_id.
3. Ranking: Results are ranked by cosine similarity score. No additional ranking signals (recency, graph centrality, confidence) are applied.
4. Return: The top-k results (determined by the configured budget) are returned as structured context.
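The steps above amount to a single nearest-neighbor ranking pass. A minimal sketch in pure Python (the production search runs against an indexed vector store, and `rank_memories` is an illustrative helper, not an SDK function):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def rank_memories(query_embedding, memories, top_k=5):
    """Rank (memory_id, embedding) pairs by similarity to the query.

    In fast mode, this similarity score is the only ranking signal.
    """
    scored = [
        (memory_id, cosine_similarity(query_embedding, embedding))
        for memory_id, embedding in memories
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because ranking is purely score-based, two memories with identical similarity are returned regardless of age or graph connectivity.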

Latency

| Metric | Value |
| --- | --- |
| P50 latency | ~50ms |
| P95 latency | ~100ms |
| P99 latency | ~150ms |

These latencies are measured from API call to response and include network overhead. Actual search time is typically under 30ms.

What it returns

Fast retrieval returns memories that are semantically similar to the query. It finds content that talks about the same topics or uses similar language.
# Fast retrieval for a real-time conversation
context = await sdk.conversation.context.fetch(
    user_id="user_123",
    customer_id="acme_corp",
    query="What do we know about Project Atlas?",
    mode="fast"
)

# Returns: memories that mention "Project Atlas" directly
# Does NOT return: connected entities (team members, deadlines, decisions)
# unless they co-occur in the same chunks as "Project Atlas"

for fact in context.facts:
    print(f"[{fact.confidence:.2f}] {fact.content}")

What it misses

Fast retrieval does not traverse the knowledge graph. This means it will not surface:
  • Connected entities: If “Project Atlas” is associated with team members Sarah and James in the graph, but a specific memory chunk only mentions the project name, fast mode will not follow the graph edges to retrieve context about Sarah and James.
  • Indirect relationships: A memory about “the Q3 deadline” that is connected to Project Atlas through a graph relationship but does not contain the words “Project Atlas” will not be found.
  • Multi-hop context: Information that requires traversing two or more relationship edges to reach (e.g., “Project Atlas” -> “Sarah” -> “Engineering Team” -> “current priorities”).
For these scenarios, use accurate mode.

Tradeoffs: fast vs accurate

| Aspect | Fast mode | Accurate mode |
| --- | --- | --- |
| Ingestion processing time | 1-5 seconds | 10 seconds to several minutes |
| Retrieval latency | ~50-100ms | ~200-500ms |
| Search method | Vector similarity only | Vector + graph traversal + cross-engine ranking |
| Ranking signals | Cosine similarity | Similarity + recency + graph centrality + confidence |
| Entity resolution | Lightweight (basic NER) | Full pipeline (semantic matching, cross-reference) |
| Relationship awareness | None (co-occurrence only) | Full (explicit graph edges and traversal) |
| Best for | Real-time chat, simple queries, high throughput | Complex queries, relationship-aware context, deep analysis |
| Cost | Lower compute usage | Higher compute usage |
Fast and accurate modes are not mutually exclusive. You can use fast mode for retrieval during real-time conversations and switch to accurate mode for specific high-value queries — all within the same application and the same Synap Instance.

When to use fast mode

Fast mode is the right choice for the majority of agent interactions. Use it when:
  • Real-time chat: Any conversation where the user is waiting for a response. The ~50-100ms retrieval latency is imperceptible, and the overall response time is dominated by LLM generation (typically 500ms-3s).
  • Simple factual queries: Questions about a specific topic, person, or event where the answer is likely contained in a single memory chunk. Examples: “What is our refund policy?”, “When is Alice’s birthday?”, “What did the user say about dark mode?”
  • High-throughput retrieval: Applications that retrieve context on every user message. At scale (thousands of concurrent users), the lower compute cost of fast mode is significant.
  • Cost-sensitive workloads: Fast mode uses less compute per retrieval than accurate mode. For applications with high query volumes and tight cost budgets, fast mode provides a meaningful cost reduction.
  • Latency-critical applications: Voice agents, real-time collaboration tools, and other applications where every millisecond of latency matters. Fast mode’s ~50-100ms retrieval is 2-5x faster than accurate mode.

When to upgrade to accurate mode

Consider switching to accurate mode for specific interactions:
  • Complex queries spanning multiple entities: “Summarize everything we know about the Atlas project, who is involved, and what decisions have been made.”
  • Relationship queries: “How is Sarah connected to the infrastructure migration?”
  • Comprehensive summaries: “Give me a full briefing on this customer.”
  • Onboarding or profile-building conversations: Where deep extraction builds a richer user profile from the start.
You can mix modes within the same application:
# Default: fast mode for real-time chat
context = await sdk.conversation.context.fetch(
    user_id=user_id,
    customer_id=customer_id,
    query=user_message,
    mode="fast"
)

# Upgrade to accurate mode when the query is complex
if is_complex_query(user_message):
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="accurate"
    )
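The `is_complex_query` helper in the snippet above is application-specific. One minimal heuristic sketch (the marker keywords and word-count threshold are illustrative assumptions to tune per application, not part of the SDK):

```python
# Illustrative heuristic for routing queries to accurate mode.
# Marker phrases and the length threshold are assumptions to tune.
COMPLEX_MARKERS = (
    "summarize", "everything", "briefing", "how is", "connected",
    "involved", "relationship", "history of",
)

def is_complex_query(message: str, min_words: int = 25) -> bool:
    lowered = message.lower()
    if any(marker in lowered for marker in COMPLEX_MARKERS):
        return True
    # Long, multi-clause questions tend to span multiple entities
    return len(lowered.split()) >= min_words
```

In production, a lightweight classifier or an LLM-based router can replace this keyword check, but a simple heuristic keeps the routing decision off the hot path.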

Code examples

Full agent loop with fast mode

from synap import Synap
from openai import AsyncOpenAI

sdk = Synap(api_key="synap_api_key")
openai_client = AsyncOpenAI(api_key="openai_api_key")

async def handle_message(user_message: str, user_id: str, customer_id: str):
    # Fast retrieval: ~50-100ms
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="fast"
    )

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful assistant.\n\nContext:\n{context.formatted_context}"
            },
            {"role": "user", "content": user_message}
        ]
    )

    assistant_message = response.choices[0].message.content

    # Fast ingestion: returns immediately, processes in 1-5 seconds
    await sdk.memories.create(
        document=f"User: {user_message}\nAssistant: {assistant_message}",
        document_type="ai-chat-conversation",
        user_id=user_id,
        customer_id=customer_id,
        mode="fast"
    )

    return assistant_message

Batch ingestion in fast mode

Fast mode can also be used for bootstrap ingestion when speed is more important than extraction depth:
# Fast bootstrap for high-volume, lower-priority data
result = await sdk.memories.batch_create(
    documents=[
        {
            "document": log_entry,
            "document_type": "ai-chat-conversation",
            "document_id": f"log_{entry_id}",
            "user_id": user_id,
            "customer_id": customer_id,
            "mode": "fast"  # Override the default long-range for batch
        }
        for entry_id, log_entry, user_id, customer_id in log_entries
    ],
    fail_fast=False
)

Next steps

Accurate Mode

Understand the thorough alternative for complex queries and important documents.

Context Fetch SDK

Full SDK reference for retrieval methods, including mode selection.

Runtime Ingestion

How runtime ingestion integrates fast mode into the agent loop.

Agent Interactions

The full retrieve-generate-ingest pattern for memory-enabled agents.