If you are building a real-time chatbot or customer-facing agent, start with fast mode for both ingestion and retrieval. You can selectively upgrade specific interactions to accurate/long-range mode as needed without changing your overall architecture.

Fast ingestion

Fast ingestion runs a lightweight version of the extraction pipeline, optimized to make memories available as quickly as possible.

What it does

| Stage | Behavior |
| --- | --- |
| Chunking | Basic semantic chunking by paragraph and sentence boundaries |
| Entity extraction | Lightweight named entity recognition (people, organizations, products) |
| Embedding | Vector embeddings generated for each chunk (1536 dimensions) |
| Preference detection | Basic keyword-based preference identification |
| Storage | Chunks stored in vector store; entities indexed for lookup |
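The chunking stage can be pictured with a minimal sketch. This is not the actual Synap pipeline, just an illustration of paragraph-then-sentence splitting; the `max_chars` threshold and the regex boundaries are assumptions:

```python
import re

def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    """Illustrative paragraph-then-sentence chunker (not the real pipeline).

    Splits on blank-line paragraph boundaries first, then packs sentences
    into chunks of at most max_chars characters.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        current = ""
        for sentence in sentences:
            # Start a new chunk when adding this sentence would overflow
            if current and len(current) + len(sentence) + 1 > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Each resulting chunk is then embedded and stored independently, which is why retrieval in fast mode only sees entities that co-occur within a chunk.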

What it skips

| Stage | Skipped in fast mode |
| --- | --- |
| Deep entity resolution | No cross-reference matching against the full entity registry. Auto-registration still occurs, but semantic matching against existing entries is limited. |
| Relationship mapping | No graph edges created between entities. The relationships between people, projects, and decisions are not explicitly modeled. |
| Advanced categorization | No topic hierarchy classification or domain-specific tagging beyond basic entity types. |
| Emotional analysis | No sentiment or emotional tone analysis of the conversation. |

Processing time

Fast ingestion typically completes in 1-5 seconds. Memories become available for vector-based retrieval almost immediately after processing.

When to use fast ingestion

  • Real-time chat logging: Every conversation turn in a live agent interaction.
  • High-throughput pipelines: Applications that ingest hundreds of documents per minute.
  • Non-critical context: Routine conversations, status updates, simple Q&A interactions.
  • Ephemeral content: Data that is useful for near-term context but does not need deep relationship modeling.

Code example

from synap import Synap

sdk = Synap(api_key="your_api_key")

# Fast ingestion for a routine conversation turn
await sdk.memories.create(
    document="User: What's the status of my order?\n"
             "Assistant: Your order #4521 shipped yesterday and should arrive by Thursday.",
    document_type="ai-chat-conversation",
    user_id="user_123",
    customer_id="acme_corp",
    mode="fast"
)
# Returns immediately. Memory available for retrieval within seconds.

Fast retrieval

Fast retrieval uses vector similarity search exclusively, skipping graph traversal and multi-signal ranking. This produces results in ~50-100ms, making it suitable for the hot path of real-time conversations.

How it works

1. Query embedding: The user’s query is converted into a vector embedding using the same model used during ingestion (1536 dimensions).
2. Vector similarity search: The query embedding is compared against stored memory embeddings using cosine similarity. The search is scoped to the applicable scope levels (user, customer, client, world) based on the provided user_id and customer_id.
3. Ranking: Results are ranked by cosine similarity score. No additional ranking signals (recency, graph centrality, confidence) are applied.
4. Return: The top-k results (determined by the configured budget) are returned as structured context.
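The steps above amount to a single nearest-neighbor ranking pass. A minimal sketch in pure Python (the production search runs against an indexed vector store, and `rank_memories` is an illustrative helper, not an SDK function):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def rank_memories(query_embedding, memories, top_k=5):
    """Rank (memory_id, embedding) pairs by similarity to the query.

    In fast mode, this similarity score is the only ranking signal.
    """
    scored = [
        (memory_id, cosine_similarity(query_embedding, embedding))
        for memory_id, embedding in memories
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because ranking is purely score-based, two memories with identical similarity are returned regardless of age or graph connectivity.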

Latency

| Metric | Value |
| --- | --- |
| P50 latency | ~50ms |
| P95 latency | ~100ms |
| P99 latency | ~150ms |

These latencies are measured from API call to response and include network overhead. Actual search time is typically under 30ms.

What it returns

Fast retrieval returns memories that are semantically similar to the query. It finds content that talks about the same topics or uses similar language.
# Fast retrieval for a real-time conversation
context = await sdk.conversation.context.fetch(
    user_id="user_123",
    customer_id="acme_corp",
    query="What do we know about Project Atlas?",
    mode="fast"
)

# Returns: memories that mention "Project Atlas" directly
# Does NOT return: connected entities (team members, deadlines, decisions)
# unless they co-occur in the same chunks as "Project Atlas"

for fact in context.facts:
    print(f"[{fact.confidence:.2f}] {fact.content}")

What it misses

Fast retrieval does not traverse the knowledge graph. This means it will not surface:
  • Connected entities: If “Project Atlas” is associated with team members Sarah and James in the graph, but a specific memory chunk only mentions the project name, fast mode will not follow the graph edges to retrieve context about Sarah and James.
  • Indirect relationships: A memory about “the Q3 deadline” that is connected to Project Atlas through a graph relationship but does not contain the words “Project Atlas” will not be found.
  • Multi-hop context: Information that requires traversing two or more relationship edges to reach (e.g., “Project Atlas” -> “Sarah” -> “Engineering Team” -> “current priorities”).
For these scenarios, use accurate mode.

Tradeoffs: fast vs accurate

| Aspect | Fast mode | Accurate mode |
| --- | --- | --- |
| Ingestion processing time | 1-5 seconds | 10 seconds to several minutes |
| Retrieval latency | ~50-100ms | ~200-500ms |
| Search method | Vector similarity only | Vector + graph traversal + cross-engine ranking |
| Ranking signals | Cosine similarity | Similarity + recency + graph centrality + confidence |
| Entity resolution | Lightweight (basic NER) | Full pipeline (semantic matching, cross-reference) |
| Relationship awareness | None (co-occurrence only) | Full (explicit graph edges and traversal) |
| Best for | Real-time chat, simple queries, high throughput | Complex queries, relationship-aware context, deep analysis |
| Cost | Lower compute usage | Higher compute usage |
Fast and accurate modes are not mutually exclusive. You can use fast mode for retrieval during real-time conversations and switch to accurate mode for specific high-value queries — all within the same application and the same Synap Instance.

When to use fast mode

Fast mode is the right choice for the majority of agent interactions. Use it when:
  • Real-time chat: Any conversation where the user is waiting for a response. The ~50-100ms retrieval latency is imperceptible, and the overall response time is dominated by LLM generation (typically 500ms-3s).
  • Simple factual queries: Questions about a specific topic, person, or event where the answer is likely contained in a single memory chunk. Examples: “What is our refund policy?”, “When is Alice’s birthday?”, “What did the user say about dark mode?”
  • High-throughput retrieval: Applications that retrieve context on every user message. At scale (thousands of concurrent users), the lower compute cost of fast mode is significant.
  • Cost-sensitive workloads: Fast mode uses less compute per retrieval than accurate mode. For applications with high query volumes and tight cost budgets, fast mode provides a meaningful cost reduction.
  • Latency-critical applications: Voice agents, real-time collaboration tools, and other applications where every millisecond of latency matters. Fast mode’s ~50-100ms retrieval is 2-5x faster than accurate mode.

When to upgrade to accurate mode

Consider switching to accurate mode for specific interactions:
  • Complex queries spanning multiple entities: “Summarize everything we know about the Atlas project, who is involved, and what decisions have been made.”
  • Relationship queries: “How is Sarah connected to the infrastructure migration?”
  • Comprehensive summaries: “Give me a full briefing on this customer.”
  • Onboarding or profile-building conversations: Where deep extraction builds a richer user profile from the start.
You can mix modes within the same application:
# Default: fast mode for real-time chat
context = await sdk.conversation.context.fetch(
    user_id=user_id,
    customer_id=customer_id,
    query=user_message,
    mode="fast"
)

# Upgrade to accurate mode when the query is complex
if is_complex_query(user_message):
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="accurate"
    )
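The `is_complex_query` helper in the snippet above is application-specific. One minimal heuristic sketch (the marker keywords and word-count threshold are illustrative assumptions to tune per application, not part of the SDK):

```python
# Illustrative heuristic for routing queries to accurate mode.
# Marker phrases and the length threshold are assumptions to tune.
COMPLEX_MARKERS = (
    "summarize", "everything", "briefing", "how is", "connected",
    "involved", "relationship", "history of",
)

def is_complex_query(message: str, min_words: int = 25) -> bool:
    lowered = message.lower()
    if any(marker in lowered for marker in COMPLEX_MARKERS):
        return True
    # Long, multi-clause questions tend to span multiple entities
    return len(lowered.split()) >= min_words
```

In production, a lightweight classifier or an LLM-based router can replace this keyword check, but a simple heuristic keeps the routing decision off the hot path.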

Code examples

Full agent loop with fast mode

from synap import Synap
from openai import AsyncOpenAI

sdk = Synap(api_key="synap_api_key")
openai_client = AsyncOpenAI(api_key="openai_api_key")

async def handle_message(user_message: str, user_id: str, customer_id: str):
    # Fast retrieval: ~50-100ms
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="fast"
    )

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": f"You are a helpful assistant.\n\nContext:\n{context.formatted_context}"
            },
            {"role": "user", "content": user_message}
        ]
    )

    assistant_message = response.choices[0].message.content

    # Fast ingestion: returns immediately, processes in 1-5 seconds
    await sdk.memories.create(
        document=f"User: {user_message}\nAssistant: {assistant_message}",
        document_type="ai-chat-conversation",
        user_id=user_id,
        customer_id=customer_id,
        mode="fast"
    )

    return assistant_message

Batch ingestion in fast mode

Fast mode can also be used for bootstrap ingestion when speed is more important than extraction depth:
# Fast bootstrap for high-volume, lower-priority data
result = await sdk.memories.batch_create(
    documents=[
        {
            "document": log_entry,
            "document_type": "ai-chat-conversation",
            "document_id": f"log_{entry_id}",
            "user_id": user_id,
            "customer_id": customer_id,
            "mode": "fast"  # Override the default long-range for batch
        }
        for entry_id, log_entry, user_id, customer_id in log_entries
    ],
    fail_fast=False
)

Next steps

Accurate Mode

Understand the thorough alternative for complex queries and important documents.

Context Fetch SDK

Full SDK reference for retrieval methods, including mode selection.

Runtime Ingestion

How runtime ingestion integrates fast mode into the agent loop.

Agent Interactions

The full retrieve-generate-ingest pattern for memory-enabled agents.