If you are building a real-time chatbot or customer-facing agent, start with fast mode for both ingestion and retrieval. You can selectively upgrade specific interactions to accurate/long-range mode as needed without changing your overall architecture.

Fast ingestion

Fast ingestion runs a lightweight version of the extraction pipeline, optimized to make memories available as quickly as possible.

What it does

| Stage | Behavior |
| --- | --- |
| Chunking | Basic semantic chunking by paragraph and sentence boundaries |
| Entity extraction | Lightweight named entity recognition (people, organizations, products) |
| Embedding | Vector embeddings generated for each chunk |
| Preference detection | Basic keyword-based preference identification |
| Storage | Chunks stored in vector store; entities indexed for lookup |
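
To make the chunking and preference-detection stages concrete, here is a minimal sketch of what "paragraph and sentence boundaries" and "keyword-based" could mean in practice. It is an illustration only, not Synap's actual pipeline: the function names, the `max_chars` limit, and the keyword list are all assumptions.

```python
import re

# Hypothetical keyword list; the real pipeline's vocabulary is not documented here.
PREFERENCE_KEYWORDS = ("prefer", "like", "love", "hate", "always", "never")


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines (paragraphs), then pack whole sentences into chunks."""
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        current = ""
        for sentence in sentences:
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                # Adding this sentence would overflow the chunk, so start a new one.
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def has_preference(chunk: str) -> bool:
    """Flag a chunk that contains any preference keyword as a whole word."""
    words = set(re.findall(r"[a-z']+", chunk.lower()))
    return any(keyword in words for keyword in PREFERENCE_KEYWORDS)
```

Keyword matching of this kind is cheap but crude (it cannot tell "I like dark mode" from "it looks like rain"), which is the sort of tradeoff a lightweight pipeline accepts in exchange for speed.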

What it skips

| Stage | Skipped in fast mode |
| --- | --- |
| Deep entity resolution | No cross-reference matching against the full entity registry. Auto-registration still occurs, but semantic matching against existing entries is limited. |
| Relationship mapping | No graph edges created between entities. The relationships between people, projects, and decisions are not explicitly modeled. |
| Advanced categorization | No topic hierarchy classification or domain-specific tagging beyond basic entity types. |
| Emotional analysis | No sentiment or emotional tone analysis of the conversation. |

Processing time

Fast ingestion typically completes in 1-5 seconds. Memories become available for vector-based retrieval almost immediately after processing.

When to use fast ingestion

  • Real-time chat logging: Every conversation turn in a live agent interaction.
  • High-throughput pipelines: Applications that ingest hundreds of documents per minute.
  • Non-critical context: Routine conversations, status updates, simple Q&A interactions.
  • Ephemeral content: Data that is useful for near-term context but does not need deep relationship modeling.

Code example

import asyncio

from maximem_synap import MaximemSynapSDK

sdk = MaximemSynapSDK(api_key="your_api_key")


async def main() -> None:
    # Fast ingestion for a routine conversation turn
    await sdk.memories.create(
        document="User: What's the status of my order?\n"
                 "Assistant: Your order #4521 shipped yesterday and should arrive by Thursday.",
        document_type="ai-chat-conversation",
        user_id="user_123",
        customer_id="acme_corp",
        mode="fast",
    )
    # The call returns immediately; the memory is available for retrieval within seconds.


# In a script, `await` must run inside a coroutine, so drive it with asyncio.run().
asyncio.run(main())

Fast retrieval

Fast retrieval queries both the vector store and the knowledge graph, but skips the LLM-driven subquery decomposition and reranking that accurate mode performs. This keeps latency low, making it suitable for the hot path of real-time conversations.

How it works

  1. Query embedding: The user’s query is converted into a vector embedding, consistent with the embeddings created during ingestion.
  2. Vector similarity search: The query embedding is compared against stored memory embeddings using cosine similarity. The search is scoped to the applicable scope levels (user, customer, client, world) based on the provided user_id and customer_id.
  3. Ranking: Results are ranked by cosine similarity score. No additional ranking signals (recency, graph centrality, confidence) are applied.
  4. Return: The top-k results (determined by the configured budget) are returned as structured context.
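
The scoping, ranking, and top-k steps above can be sketched in a few lines of plain Python. This is a conceptual illustration, not the Synap implementation: the in-memory `memories` list, its dictionary layout, and the `top_k` helper are assumptions, and a production vector store would use an index rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, memories, allowed_scopes, k=3):
    """Rank memories by cosine similarity only, restricted to allowed scopes.

    `memories` is a hypothetical list of dicts with 'content', 'scope',
    and 'embedding' keys.
    """
    candidates = [m for m in memories if m["scope"] in allowed_scopes]
    scored = [
        (cosine_similarity(query_vec, m["embedding"]), m["content"])
        for m in candidates
    ]
    # Similarity is the only ranking signal: no recency, centrality, or confidence.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because similarity is the only signal, two memories with near-identical wording rank next to each other regardless of how old or how reliable each one is.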

Latency

Fast retrieval returns context quickly enough that retrieval is rarely the bottleneck in an agent’s overall response time, which is usually dominated by LLM generation.

What it returns

Fast retrieval returns memories that are semantically similar to the query: content that covers the same topics or uses similar language.

# Fast retrieval for a real-time conversation
context = await sdk.conversation.context.fetch(
    conversation_id="conv_123",
    user_id="user_123",
    customer_id="acme_corp",
    search_query=["What do we know about Project Atlas?"],
    mode="fast"
)

# Returns: memories that mention "Project Atlas", plus directly connected graph context
# May miss: related details (team members, deadlines, decisions) that would only
# surface through LLM subquery decomposition or reranking

for fact in context.facts:
    print(f"[{fact.confidence:.2f}] {fact.content}")

What it misses

Fast retrieval queries both the vector store and the knowledge graph, so it does surface directly connected graph context. What it skips are the LLM-driven refinements that accurate mode adds:
  • LLM subquery decomposition: Accurate mode uses an LLM to break a complex query into focused sub-queries, expanding coverage across multiple entities and angles. Fast mode runs the query as given, so broad multi-part questions may retrieve less complete context.
  • Reranking: Accurate mode applies an additional reranking pass to reorder candidates for relevance. Fast mode returns results without this extra refinement, so the ordering is less finely tuned for complex, multi-entity queries.
For these scenarios, use accurate mode.

Tradeoffs: fast vs accurate

| Aspect | Fast Mode | Accurate Mode |
| --- | --- | --- |
| Ingestion processing time | 1-5 seconds | 10 seconds to several minutes |
| Retrieval latency | Lower latency | Higher latency |
| Search method | Vector + graph (no LLM query decomposition) | Vector + graph + LLM subquery decomposition + reranking |
| Ranking signals | Cosine similarity | Similarity + recency + graph centrality + confidence |
| Entity resolution | Lightweight (basic NER) | Full pipeline (semantic matching, cross-reference) |
| Relationship awareness | Graph relationships, without LLM-driven multi-hop decomposition | Full (explicit graph edges with LLM-driven multi-hop decomposition) |
| Best for | Real-time chat, simple queries, high throughput | Complex queries, relationship-aware context, deep analysis |
| Cost | Lower compute usage | Higher compute usage |

Fast and accurate modes are not mutually exclusive. You can use fast mode for retrieval during real-time conversations and switch to accurate mode for specific high-value queries, all within the same application and the same Synap Instance.

When to use fast mode

Fast mode is the right choice for the majority of agent interactions. Use it for:
  • Any conversation where the user is waiting for a response. Fast retrieval adds little latency, so overall response time is dominated by LLM generation.
  • Questions about a specific topic, person, or event where the answer is likely contained in a single memory chunk. Examples: “What is our refund policy?”, “When is Alice’s birthday?”, “What did the user say about dark mode?”
  • Applications that retrieve context on every user message. At scale (thousands of concurrent users), the lower compute cost of fast mode is significant.
  • Applications with high query volumes and tight cost budgets. Fast mode uses less compute per retrieval than accurate mode.
  • Voice agents, real-time collaboration tools, and other applications where every millisecond of latency matters.

When to upgrade to accurate mode

Consider switching to accurate mode for specific interactions:
  • Complex queries spanning multiple entities: “Summarize everything we know about the Atlas project, who is involved, and what decisions have been made.”
  • Relationship queries: “How is Sarah connected to the infrastructure migration?”
  • Comprehensive summaries: “Give me a full briefing on this customer.”
  • Onboarding or profile-building conversations: Deep extraction builds a richer user profile from the start.

You can mix modes within the same application:

# Default: fast mode for real-time chat
context = await sdk.conversation.context.fetch(
    conversation_id=conversation_id,
    user_id=user_id,
    customer_id=customer_id,
    search_query=[user_message],
    mode="fast"
)

# Upgrade to accurate mode when the query is complex
if is_complex_query(user_message):
    context = await sdk.conversation.context.fetch(
        conversation_id=conversation_id,
        user_id=user_id,
        customer_id=customer_id,
        search_query=[user_message],
        mode="accurate"
    )
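
The snippet above calls `is_complex_query` without defining it; it is not described as part of the SDK, so the application has to supply it. One possible sketch is a cheap keyword-and-structure heuristic, shown below. The marker phrases and thresholds are assumptions chosen to match the example queries in this section and should be tuned against real traffic.

```python
import re

# Hypothetical trigger phrases drawn from the example queries in this section.
COMPLEX_MARKERS = (
    "summarize",
    "everything",
    "full briefing",
    "connected to",
    "relationship",
    "who is involved",
)


def is_complex_query(message: str, max_simple_words: int = 25) -> bool:
    """Return True when a query looks multi-part or relationship-oriented."""
    text = message.lower()
    if any(marker in text for marker in COMPLEX_MARKERS):
        return True
    # Several clauses joined by commas or "and" often signal a multi-part question.
    clause_breaks = len(re.findall(r",|\band\b", text))
    if clause_breaks >= 2:
        return True
    # Very long questions are treated as complex by default.
    return len(text.split()) > max_simple_words
```

A heuristic like this runs in microseconds, so it does not eat into the latency that fast mode saves. If it decides up front which mode to use, the application can issue a single fetch instead of a fast fetch followed by an accurate one.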

Code examples

Full agent loop with fast mode

from maximem_synap import MaximemSynapSDK
from openai import AsyncOpenAI

sdk = MaximemSynapSDK(api_key="synap_api_key")
openai_client = AsyncOpenAI(api_key="openai_api_key")

async def handle_message(user_message: str, user_id: str, customer_id: str, conversation_id: str):
    # Fast retrieval: low latency
    context = await sdk.conversation.context.fetch(
        conversation_id=conversation_id,
        user_id=user_id,
        customer_id=customer_id,
        search_query=[user_message],
        mode="fast"
    )

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful assistant.\n\nContext:\n{context.formatted_context}"
            },
            {"role": "user", "content": user_message}
        ]
    )

    assistant_message = response.choices[0].message.content

    # Fast ingestion: returns immediately, processes in 1-5 seconds
    await sdk.memories.create(
        document=f"User: {user_message}\nAssistant: {assistant_message}",
        document_type="ai-chat-conversation",
        user_id=user_id,
        customer_id=customer_id,
        mode="fast"
    )

    return assistant_message

Batch ingestion in fast mode

Fast mode can also be used for bootstrap ingestion when speed is more important than extraction depth:

# Fast bootstrap for high-volume, lower-priority data
result = await sdk.memories.batch_create(
    documents=[
        {
            "document": log_entry,
            "document_type": "ai-chat-conversation",
            "document_id": f"log_{entry_id}",
            "user_id": user_id,
            "customer_id": customer_id,
            "mode": "fast"  # Override the default long-range for batch
        }
        for entry_id, log_entry, user_id, customer_id in log_entries
    ],
    fail_fast=False
)
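
The batch example iterates over `log_entries` without showing its shape. The sketch below assumes each entry is a four-field record in the same order the comprehension unpacks it, and builds the `documents` payload separately so it can be inspected or unit-tested before any network call. `LogEntry` and `build_batch_documents` are hypothetical helpers, not SDK types.

```python
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical record; field order matches the unpacking in the batch example."""
    entry_id: str
    text: str
    user_id: str
    customer_id: str


def build_batch_documents(entries: Iterable[LogEntry], mode: str = "fast") -> list[dict]:
    """Build the per-document dictionaries used by the batch example above."""
    return [
        {
            "document": entry.text,
            "document_type": "ai-chat-conversation",
            "document_id": f"log_{entry.entry_id}",
            "user_id": entry.user_id,
            "customer_id": entry.customer_id,
            "mode": mode,
        }
        for entry in entries
    ]
```

Because `LogEntry` is a tuple, a list of these records can also be passed directly as `log_entries` to the original comprehension.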

Next steps

  • Accurate Mode: Understand the thorough alternative for complex queries and important documents.
  • Context Fetch SDK: Full SDK reference for retrieval methods, including mode selection.
  • Runtime Ingestion: How runtime ingestion integrates fast mode into the agent loop.
  • Agent Interactions: The full retrieve-generate-ingest pattern for memory-enabled agents.