Fast ingestion
Fast ingestion runs a lightweight version of the extraction pipeline, optimized to make memories available as quickly as possible.
What it does
| Stage | Behavior |
|---|---|
| Chunking | Basic semantic chunking by paragraph and sentence boundaries |
| Entity extraction | Lightweight named entity recognition (people, organizations, products) |
| Embedding | Vector embeddings generated for each chunk (1536 dimensions) |
| Preference detection | Basic keyword-based preference identification |
| Storage | Chunks stored in vector store; entities indexed for lookup |
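The stages above can be sketched end-to-end in plain Python. This is an illustrative toy, not the production pipeline: the real chunker, NER model, and 1536-dimension embedder are far more sophisticated, and the stand-ins here (a regex "NER", a hash-based "embedding") exist only to show how the stages compose.

```python
import hashlib
import re

def chunk(text):
    # Basic semantic chunking: split on blank lines (paragraphs),
    # then on sentence boundaries for very long paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        if len(p) > 500:
            chunks.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip())
        else:
            chunks.append(p)
    return chunks

def extract_entities(chunk_text):
    # Lightweight NER stand-in: capitalized spans (people, orgs, products).
    return re.findall(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b|\b[A-Z][a-z]+\b", chunk_text)

def embed(chunk_text, dims=1536):
    # Toy deterministic embedding; the real pipeline uses a 1536-dim model.
    h = hashlib.sha256(chunk_text.encode()).digest()
    return [h[i % len(h)] / 255.0 for i in range(dims)]

def detect_preferences(chunk_text):
    # Basic keyword-based preference identification.
    keywords = ("prefer", "like", "always", "never", "favorite")
    return [kw for kw in keywords if kw in chunk_text.lower()]

def fast_ingest(text):
    # Store each chunk with its entities, embedding, and preference flags.
    store = []
    for c in chunk(text):
        store.append({
            "text": c,
            "entities": extract_entities(c),
            "embedding": embed(c),
            "preferences": detect_preferences(c),
        })
    return store

records = fast_ingest("Sarah prefers dark mode.\n\nProject Atlas shipped on time.")
```

Note that every stage is a single cheap pass over the text — nothing cross-references the full entity registry or builds graph edges, which is what keeps fast ingestion in the 1-5 second range.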
What it skips
| Stage | Skipped in fast mode |
|---|---|
| Deep entity resolution | No cross-reference matching against the full entity registry. Auto-registration still occurs, but semantic matching against existing entries is limited. |
| Relationship mapping | No graph edges created between entities. The relationships between people, projects, and decisions are not explicitly modeled. |
| Advanced categorization | No topic hierarchy classification or domain-specific tagging beyond basic entity types. |
| Emotional analysis | No sentiment or emotional tone analysis of the conversation. |
Processing time
Fast ingestion typically completes in 1-5 seconds. Memories become available for vector-based retrieval almost immediately after processing.
When to use fast ingestion
- Real-time chat logging: Every conversation turn in a live agent interaction.
- High-throughput pipelines: Applications that ingest hundreds of documents per minute.
- Non-critical context: Routine conversations, status updates, simple Q&A interactions.
- Ephemeral content: Data that is useful for near-term context but does not need deep relationship modeling.
Code example
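The original snippet is not reproduced here, so this is a hedged sketch of what a fast-mode ingestion call might look like, assuming a hypothetical `SynapClient` with an `ingest` method and a `mode` parameter (all names are illustrative, not the actual SDK surface; a stub client stands in for the real one):

```python
# Illustrative stub; the real SDK client is assumed to expose a similar surface.
class SynapClient:
    def __init__(self, api_key):
        self.api_key = api_key
        self._store = []

    def ingest(self, content, user_id=None, customer_id=None, mode="fast"):
        # In fast mode, memories become retrievable within ~1-5 seconds.
        record = {"content": content, "user_id": user_id,
                  "customer_id": customer_id, "mode": mode}
        self._store.append(record)
        return record

client = SynapClient(api_key="sk-...")
result = client.ingest(
    content="User: I prefer email summaries over Slack pings.",
    user_id="user-123",
    customer_id="acme-corp",
    mode="fast",  # lightweight pipeline: chunk, NER, embed, store
)
```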
Fast retrieval
Fast retrieval uses vector similarity search exclusively, skipping graph traversal and multi-signal ranking. This produces results in ~50-100ms, making it suitable for the hot path of real-time conversations.
How it works
Query embedding
The user’s query is converted into a vector embedding using the same model used during ingestion (1536 dimensions).
Vector similarity search
The query embedding is compared against stored memory embeddings using cosine similarity. The search is scoped to the applicable scope levels (user, customer, client, world) based on the provided user_id and customer_id.
Ranking
Results are ranked by cosine similarity score. No additional ranking signals (recency, graph centrality, confidence) are applied.
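Ranking by cosine similarity alone can be sketched as follows. This is illustrative (tiny 2-dimension vectors instead of the real 1536-dimension embeddings), but the key property holds: the only ranking signal is the similarity score.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fast_retrieve(query_vec, memories, top_k=3):
    # Rank purely by cosine similarity; no recency, graph-centrality,
    # or confidence signals are applied in fast mode.
    scored = [(cosine(query_vec, m["embedding"]), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]

memories = [
    {"text": "refund policy", "embedding": [1.0, 0.0]},
    {"text": "dark mode",     "embedding": [0.0, 1.0]},
    {"text": "refunds FAQ",   "embedding": [0.9, 0.1]},
]
results = fast_retrieve([1.0, 0.0], memories, top_k=2)
```

Because the scoring is a single similarity pass over the candidate set, latency stays flat and predictable — which is where the ~50-100ms figures below come from.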
Latency
| Metric | Value |
|---|---|
| P50 latency | ~50ms |
| P95 latency | ~100ms |
| P99 latency | ~150ms |
What it returns
Fast retrieval returns memories that are semantically similar to the query. It finds content that talks about the same topics or uses similar language.
What it misses
Fast retrieval does not traverse the knowledge graph. This means it will not surface:
- Connected entities: If “Project Atlas” is associated with team members Sarah and James in the graph, but a specific memory chunk only mentions the project name, fast mode will not follow the graph edges to retrieve context about Sarah and James.
- Indirect relationships: A memory about “the Q3 deadline” that is connected to Project Atlas through a graph relationship but does not contain the words “Project Atlas” will not be found.
- Multi-hop context: Information that requires traversing two or more relationship edges to reach (e.g., “Project Atlas” -> “Sarah” -> “Engineering Team” -> “current priorities”).
Tradeoffs: fast vs accurate
| Aspect | Fast Mode | Accurate Mode |
|---|---|---|
| Ingestion processing time | 1-5 seconds | 10 seconds to several minutes |
| Retrieval latency | ~50-100ms | ~200-500ms |
| Search method | Vector similarity only | Vector + graph traversal + cross-engine ranking |
| Ranking signals | Cosine similarity | Similarity + recency + graph centrality + confidence |
| Entity resolution | Lightweight (basic NER) | Full pipeline (semantic matching, cross-reference) |
| Relationship awareness | None (co-occurrence only) | Full (explicit graph edges and traversal) |
| Best for | Real-time chat, simple queries, high throughput | Complex queries, relationship-aware context, deep analysis |
| Cost | Lower compute usage | Higher compute usage |
Fast and accurate modes are not mutually exclusive. You can use fast mode for retrieval during real-time conversations and switch to accurate mode for specific high-value queries — all within the same application and the same Synap Instance.
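Mixing modes within one application might look like the heuristic below. The `choose_mode` function and its trigger phrases are entirely illustrative — a real application would tune the cue list (or use an LLM-based classifier) to its own query patterns.

```python
def choose_mode(query):
    # Crude heuristic: upgrade to accurate mode for queries that span
    # multiple entities or ask about relationships; default to fast.
    relationship_cues = ("connected", "related", "involved",
                        "summarize everything", "full briefing")
    q = query.lower()
    return "accurate" if any(cue in q for cue in relationship_cues) else "fast"

# Routine lookups stay on the fast path; relationship queries upgrade.
modes = {
    "What is our refund policy?": choose_mode("What is our refund policy?"),
    "How is Sarah connected to the migration?":
        choose_mode("How is Sarah connected to the migration?"),
}
```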
When to use fast mode
Fast mode is the right choice for the majority of agent interactions. Use it when:
Real-time conversations
Any conversation where the user is waiting for a response. The ~50-100ms retrieval latency is imperceptible, and the overall response time is dominated by LLM generation (typically 500ms-3s).
Single-topic queries
Questions about a specific topic, person, or event where the answer is likely contained in a single memory chunk. Examples: “What is our refund policy?”, “When is Alice’s birthday?”, “What did the user say about dark mode?”
High-frequency retrieval
Applications that retrieve context on every user message. At scale (thousands of concurrent users), the lower compute cost of fast mode is significant.
Cost-sensitive applications
Fast mode uses less compute per retrieval than accurate mode. For applications with high query volumes and tight cost budgets, fast mode provides a meaningful cost reduction.
Latency-sensitive applications
Voice agents, real-time collaboration tools, and other applications where every millisecond of latency matters. Fast mode’s ~50-100ms retrieval is 2-5x faster than accurate mode.
When to upgrade to accurate mode
Consider switching to accurate mode for specific interactions:
- Complex queries spanning multiple entities: “Summarize everything we know about the Atlas project, who is involved, and what decisions have been made.”
- Relationship queries: “How is Sarah connected to the infrastructure migration?”
- Comprehensive summaries: “Give me a full briefing on this customer.”
- Onboarding or profile-building conversations: Where deep extraction builds a richer user profile from the start.
Code examples
Full agent loop with fast mode
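The original code block is not included above, so here is a self-contained sketch of the retrieve-generate-ingest loop in fast mode. The `retrieve`, `generate`, and `ingest` functions are illustrative stand-ins, not the real SDK — in production, `retrieve` would be the fast vector search (~50-100ms), `generate` the LLM call that dominates response time, and `ingest` the fast-mode logging of the completed turn.

```python
def retrieve(query, mode="fast"):
    # Stand-in for fast vector retrieval (~50-100ms in the real system).
    return ["User prefers email summaries."]

def generate(query, context):
    # Stand-in for the LLM call, which dominates response time (500ms-3s).
    return f"Answer to {query!r} using {len(context)} memories."

def ingest(turn, mode="fast"):
    # Stand-in for fast ingestion of the new conversation turn.
    return {"status": "queued", "mode": mode}

def agent_turn(user_message):
    context = retrieve(user_message, mode="fast")                  # 1. retrieve context
    reply = generate(user_message, context)                        # 2. generate response
    ingest(f"User: {user_message}\nAgent: {reply}", mode="fast")   # 3. log the turn
    return reply

reply = agent_turn("How should I send you updates?")
```

The pattern to note: both the retrieval before generation and the ingestion after it run in fast mode, so neither step adds perceptible latency to the user-facing reply.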
Batch ingestion in fast mode
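A sketch of batch ingestion in fast mode, useful when bootstrapping a memory store from an existing backlog. The `SynapClient` stub and `batch_ingest` helper are hypothetical — the real SDK surface may differ — but the shape of the loop is the point:

```python
# Illustrative stub; the real SDK is assumed to expose a similar surface.
class SynapClient:
    def __init__(self):
        self._store = []

    def ingest(self, content, mode="fast"):
        self._store.append({"content": content, "mode": mode})

def batch_ingest(client, documents, mode="fast"):
    # Fast mode keeps per-document processing to ~1-5 seconds, so a large
    # backlog becomes retrievable quickly at the cost of extraction depth.
    for doc in documents:
        client.ingest(content=doc, mode=mode)
    return len(client._store)

client = SynapClient()
count = batch_ingest(client, ["doc one", "doc two", "doc three"])
```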
Fast mode can also be used for bootstrap ingestion when speed is more important than extraction depth.
Next steps
Accurate Mode
Understand the thorough alternative for complex queries and important documents.
Context Fetch SDK
Full SDK reference for retrieval methods, including mode selection.
Runtime Ingestion
How runtime ingestion integrates fast mode into the agent loop.
Agent Interactions
The full retrieve-generate-ingest pattern for memory-enabled agents.