Fast ingestion
Fast ingestion runs a lightweight version of the extraction pipeline, optimized to make memories available as quickly as possible.
What it does
| Stage | Behavior |
|---|---|
| Chunking | Basic semantic chunking by paragraph and sentence boundaries |
| Entity extraction | Lightweight named entity recognition (people, organizations, products) |
| Embedding | Vector embeddings generated for each chunk |
| Preference detection | Basic keyword-based preference identification |
| Storage | Chunks stored in vector store; entities indexed for lookup |
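The chunking stage in the table above can be pictured with a small sketch. This is not the pipeline's actual implementation: the function name `chunk_text`, the `max_chars` limit, and the regex-based sentence splitting are all assumptions made for illustration, and the sketch splits only on paragraph and sentence boundaries without any semantic scoring.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text on paragraph boundaries, then on sentence boundaries.

    Hypothetical helper: the name and the max_chars limit are assumptions.
    """
    chunks: list[str] = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        # Long paragraph: pack whole sentences into chunks of up to max_chars.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks


sample = "Alice likes tea. Bob likes coffee.\n\nShort one."
small_chunks = chunk_text(sample, max_chars=20)
default_chunks = chunk_text(sample)
```

A short paragraph stays whole, while a paragraph longer than `max_chars` is broken at sentence boundaries. A single sentence longer than the limit is kept intact rather than cut mid-sentence.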
What it skips
| Stage | Skipped in fast mode |
|---|---|
| Deep entity resolution | No cross-reference matching against the full entity registry. Auto-registration still occurs, but semantic matching against existing entries is limited. |
| Relationship mapping | No graph edges created between entities. The relationships between people, projects, and decisions are not explicitly modeled. |
| Advanced categorization | No topic hierarchy classification or domain-specific tagging beyond basic entity types. |
| Emotional analysis | No sentiment or emotional tone analysis of the conversation. |
Processing time
Fast ingestion typically completes in 1-5 seconds. Memories become available for vector-based retrieval almost immediately after processing.
When to use fast ingestion
- Real-time chat logging: Every conversation turn in a live agent interaction.
- High-throughput pipelines: Applications that ingest hundreds of documents per minute.
- Non-critical context: Routine conversations, status updates, simple Q&A interactions.
- Ephemeral content: Data that is useful for near-term context but does not need deep relationship modeling.
Code example
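The sketch below shows one way an application might route content to fast ingestion using the criteria listed above. It does not call the Synap SDK. The `IngestItem` type, the content `kind` labels, the `needs_relationships` flag, the 100-documents-per-minute threshold, and the fallback to accurate mode for unrecognized content are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical content categories; these labels are not SDK names.
FAST_KINDS = {"chat_turn", "status_update", "simple_qa", "ephemeral"}


@dataclass
class IngestItem:
    kind: str
    text: str
    needs_relationships: bool = False  # assumed flag set by the application


def choose_ingestion_mode(item: IngestItem, docs_per_minute: int = 0) -> str:
    """Return "fast" or "accurate" for one item (illustrative heuristic)."""
    # Content that needs deep relationship modeling goes to accurate mode.
    if item.needs_relationships:
        return "accurate"
    # Routine or ephemeral content is a natural fit for fast mode.
    if item.kind in FAST_KINDS:
        return "fast"
    # High-throughput pipelines favor fast mode (threshold is an assumption).
    if docs_per_minute >= 100:
        return "fast"
    # Assumed default for everything else.
    return "accurate"


chat_mode = choose_ingestion_mode(IngestItem("chat_turn", "hi there"))
contract_mode = choose_ingestion_mode(
    IngestItem("contract", "Master services agreement", needs_relationships=True)
)
bulk_mode = choose_ingestion_mode(IngestItem("report", "Q3 numbers"), docs_per_minute=300)
```

The returned string would then be passed to whatever ingestion call the application uses. The exact parameter name for mode selection is documented in the SDK reference, not here.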
Fast retrieval
Fast retrieval queries both the vector store and the knowledge graph, but skips the LLM-driven subquery decomposition and reranking that accurate mode performs. This keeps latency low, making it suitable for the hot path of real-time conversations.
How it works
Query embedding
The user’s query is converted into a vector embedding, consistent with the embeddings created during ingestion.
Vector similarity search
The query embedding is compared against stored memory embeddings using cosine similarity. The search is scoped to the applicable scope levels (user, customer, client, world) based on the provided user_id and customer_id.
Ranking
Results are ranked by cosine similarity score. No additional ranking signals (recency, graph centrality, confidence) are applied.
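The scoping and ranking steps above can be sketched in a few lines of plain Python. This is a toy illustration rather than the service's implementation: the memory record layout (`text`, `embedding`, `scope`), the `allowed_scopes` argument, and the function names are assumptions, and how user_id and customer_id resolve to scope levels is not modeled.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_memories(query_vec, memories, allowed_scopes, top_k=3):
    """Filter by scope, then rank by cosine similarity only."""
    scored = [
        (cosine_similarity(query_vec, m["embedding"]), m["text"])
        for m in memories
        if m["scope"] in allowed_scopes  # assumed scope field on each record
    ]
    # No recency, graph centrality, or confidence signals are applied.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


memories = [
    {"text": "refund policy", "embedding": [1.0, 0.0], "scope": "customer"},
    {"text": "dark mode", "embedding": [0.0, 1.0], "scope": "user"},
    {"text": "other tenant", "embedding": [1.0, 0.0], "scope": "client"},
]
ranked = rank_memories([1.0, 0.1], memories, {"user", "customer"})
ranked_texts = [text for _, text in ranked]
```

Here the "other tenant" record is excluded by the scope filter even though its embedding matches the query closely, and the remaining records are ordered purely by similarity score.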
Latency
Fast retrieval returns context quickly enough that retrieval is rarely the bottleneck in an agent’s overall response time, which is why it suits the hot path of real-time conversations.
What it returns
Fast retrieval returns memories that are semantically similar to the query. It finds content that talks about the same topics or uses similar language.
What it misses
Fast retrieval queries both the vector store and the knowledge graph, so it does surface directly connected graph context. What it skips are the LLM-driven refinements that accurate mode adds:
- LLM subquery decomposition: Accurate mode uses an LLM to break a complex query into focused sub-queries, expanding coverage across multiple entities and angles. Fast mode runs the query as given, so broad multi-part questions may retrieve less complete context.
- Reranking: Accurate mode applies an additional reranking pass to reorder candidates for relevance. Fast mode returns results without this extra refinement, so the ordering is less finely tuned for complex, multi-entity queries.
Tradeoffs: fast vs accurate
| Aspect | Fast Mode | Accurate Mode |
|---|---|---|
| Ingestion processing time | 1-5 seconds | 10 seconds to several minutes |
| Retrieval latency | Lower latency | Higher latency |
| Search method | Vector + graph (no LLM query decomposition) | Vector + graph + LLM subquery decomposition + reranking |
| Ranking signals | Cosine similarity | Similarity + recency + graph centrality + confidence |
| Entity resolution | Lightweight (basic NER) | Full pipeline (semantic matching, cross-reference) |
| Relationship awareness | Graph relationships, without LLM-driven multi-hop decomposition | Full (explicit graph edges with LLM-driven multi-hop decomposition) |
| Best for | Real-time chat, simple queries, high throughput | Complex queries, relationship-aware context, deep analysis |
| Cost | Lower compute usage | Higher compute usage |
Fast and accurate modes are not mutually exclusive. You can use fast mode for retrieval during real-time conversations and switch to accurate mode for specific high-value queries, all within the same application and the same Synap Instance.
When to use fast mode
Fast mode is the right choice for the majority of agent interactions. Use it when:
Real-time conversations
Any conversation where the user is waiting for a response. Fast retrieval adds little perceptible delay, so the overall response time is dominated by LLM generation.
Single-topic queries
Questions about a specific topic, person, or event where the answer is likely contained in a single memory chunk. Examples: “What is our refund policy?”, “When is Alice’s birthday?”, “What did the user say about dark mode?”
High-frequency retrieval
Applications that retrieve context on every user message. At scale (thousands of concurrent users), the lower compute cost of fast mode is significant.
Cost-sensitive applications
Fast mode uses less compute per retrieval than accurate mode. For applications with high query volumes and tight cost budgets, fast mode provides a meaningful cost reduction.
Latency-sensitive applications
Voice agents, real-time collaboration tools, and other applications where every millisecond of latency matters. Fast mode is lower-latency than accurate mode.
When to upgrade to accurate mode
Consider switching to accurate mode for specific interactions:
- Complex queries spanning multiple entities: “Summarize everything we know about the Atlas project, who is involved, and what decisions have been made.”
- Relationship queries: “How is Sarah connected to the infrastructure migration?”
- Comprehensive summaries: “Give me a full briefing on this customer.”
- Onboarding or profile-building conversations: Where deep extraction builds a richer user profile from the start.
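One way to act on the guidance above is a small per-query router that defaults to fast mode and upgrades only when a query looks complex. The sketch below is a keyword heuristic written for illustration; the hint phrases, the comma-counting rule, and the function name `choose_retrieval_mode` are assumptions, and a production router might use a classifier or explicit application signals instead.

```python
# Hypothetical phrase lists; tune these for your own traffic.
RELATIONSHIP_HINTS = ("connected to", "related to", "relationship between", "who is involved")
SUMMARY_HINTS = ("summarize", "everything we know", "full briefing", "overview of")


def choose_retrieval_mode(query: str) -> str:
    """Return "accurate" for relationship, summary, or multi-part queries."""
    q = query.lower()
    if any(hint in q for hint in RELATIONSHIP_HINTS + SUMMARY_HINTS):
        return "accurate"
    # Rough signal for multi-part questions: several clauses joined with "and".
    if q.count(",") >= 2 and " and " in q:
        return "accurate"
    return "fast"


atlas_mode = choose_retrieval_mode(
    "Summarize everything we know about the Atlas project, "
    "who is involved, and what decisions have been made."
)
sarah_mode = choose_retrieval_mode("How is Sarah connected to the infrastructure migration?")
refund_mode = choose_retrieval_mode("What is our refund policy?")
```

Defaulting to fast mode keeps the common path cheap, and a false negative only means a complex query gets less complete context, not an error.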
Code examples
Full agent loop with fast mode
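The retrieve-generate-ingest pattern can be sketched without any SDK dependency. Everything below is a stand-in: `InMemoryStore`, its `retrieve` and `ingest` methods, and the `mode` argument are hypothetical and are not the Synap SDK's API. Word overlap replaces vector similarity so the example runs on its own.

```python
from typing import Callable


class InMemoryStore:
    """Toy stand-in for a memory service; not the Synap SDK."""

    def __init__(self) -> None:
        self.memories: list[str] = []

    def retrieve(self, query: str, mode: str = "fast", top_k: int = 3) -> list[str]:
        # Word overlap stands in for vector similarity in this sketch.
        # The mode argument is accepted but ignored by this toy store.
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), m) for m in self.memories]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [memory for _, memory in scored[:top_k]]

    def ingest(self, text: str, mode: str = "fast") -> None:
        self.memories.append(text)


def agent_turn(
    store: InMemoryStore,
    user_message: str,
    generate: Callable[[str, list[str]], str],
) -> str:
    """One turn: retrieve context, generate a reply, ingest both sides."""
    context = store.retrieve(user_message, mode="fast")
    reply = generate(user_message, context)
    store.ingest(f"user: {user_message}", mode="fast")
    store.ingest(f"assistant: {reply}", mode="fast")
    return reply


def fake_llm(message: str, context: list[str]) -> str:
    # Placeholder for a real LLM call.
    return f"seen {len(context)} memories"


store = InMemoryStore()
first_reply = agent_turn(store, "I prefer dark mode", fake_llm)
second_reply = agent_turn(store, "What did I say about dark mode?", fake_llm)
```

The first turn finds no context because the store is empty. The second turn retrieves the user's earlier message, which shows how each ingested turn becomes available to the next retrieval.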
Batch ingestion in fast mode
Fast mode can also be used for bootstrap ingestion when speed is more important than extraction depth.
Next steps
Accurate Mode
Understand the thorough alternative for complex queries and important documents.
Context Fetch SDK
Full SDK reference for retrieval methods, including mode selection.
Runtime Ingestion
How runtime ingestion integrates fast mode into the agent loop.
Agent Interactions
The full retrieve-generate-ingest pattern for memory-enabled agents.