Building a memory-enabled agent follows a consistent pattern: retrieve context, generate a response, and ingest the conversation. This cycle repeats for every interaction, gradually building a richer memory that makes each subsequent response more informed and personalized. This page walks through the full pattern, from architecture to production-ready code.
This page focuses on the integration pattern — how your agent uses Synap during conversations. For details on the ingestion pipeline itself, see Runtime Ingestion. For retrieval configuration, see Fast Mode and Accurate Mode.
The core pattern for a memory-enabled agent is a three-phase cycle: Retrieve, Generate, Ingest.
1
User sends a message
Your application receives a message from the user through whatever channel you support — a chat widget, API endpoint, mobile app, voice interface, or other integration.
2
Agent retrieves relevant context from Synap
Before calling the LLM, the agent queries Synap for memories relevant to the current message. This retrieval considers the user’s history, their organization’s shared knowledge, and any client-scoped information.
3
Agent assembles the prompt
The agent assembles the full prompt for the LLM: a system prompt, the retrieved memories as context, the recent conversation history, and the current user message. The retrieved context bridges the gap between what the LLM knows (nothing about this user) and what it needs to know.
4
Agent generates a response
The agent calls the LLM (OpenAI, Anthropic, or any provider) with the assembled prompt. The LLM generates a response that is informed by the user’s history and organizational context.
5
Agent delivers the response
The generated response is sent back to the user through your application’s interface.
6
Agent ingests the conversation turn
After the response is delivered, the agent sends the full conversation turn (user message + agent response) to Synap for ingestion. This is asynchronous and non-blocking — the user does not wait for ingestion to complete.
When the user sends the next message, the cycle begins again. This time, the retrieval step may return memories from the conversation turn that was just ingested, creating a continuously improving feedback loop.
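The cycle above can be condensed into a single handler. This is a minimal sketch: `retrieve_context`, `generate`, and `ingest_turn` are hypothetical stand-ins for the Synap retrieval call, your LLM call, and the Synap ingestion call shown in the real examples later on this page.

```python
# A minimal sketch of the retrieve -> generate -> ingest cycle.
# retrieve_context, generate, and ingest_turn are hypothetical
# stand-ins for the Synap retrieval call, your LLM call, and the
# Synap ingestion call.

def retrieve_context(user_id: str, query: str) -> str:
    # Stand-in for sdk.conversation.context.fetch(...)
    return "User prefers concise answers."

def generate(system_prompt: str, context: str,
             history: list, message: str) -> str:
    # Stand-in for the LLM call with the assembled prompt
    return f"[reply informed by: {context}]"

def ingest_turn(user_id: str, user_message: str, response: str) -> None:
    # Stand-in for sdk.memories.create(...);
    # asynchronous and fire-and-forget in production
    pass

def handle_message(user_id: str, history: list, message: str) -> str:
    context = retrieve_context(user_id, message)        # 1. Retrieve
    response = generate("You are a helpful assistant.",
                        context, history, message)      # 2. Generate
    ingest_turn(user_id, message, response)             # 3. Ingest
    history += [{"role": "user", "content": message},
                {"role": "assistant", "content": response}]
    return response
```

On the next call to `handle_message`, the retrieval step can surface memories produced by the previous turn's ingestion, which is the feedback loop described above.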
The most critical part of the integration is how you structure the retrieved context within your LLM prompt. The retrieved memories need to be clearly separated from the system instructions and the conversation history so the LLM can use them effectively.
[System instructions] - Your agent's persona, capabilities, and behavioral guidelines
[Retrieved context from Synap] - Relevant facts, preferences, and historical context - Clearly labeled as "context from memory"
[Conversation history] - Recent messages in the current session
[Current user message] - The message being responded to
```python
def build_prompt(
    system_instructions: str,
    context,
    conversation_history: list,
    user_message: str,
):
    """Build the full prompt with retrieved memories injected."""
    messages = [
        {
            "role": "system",
            "content": (
                f"{system_instructions}\n\n"
                "## Context from memory\n"
                "The following information has been retrieved from previous conversations "
                "and documents. Use it to personalize your response and maintain continuity "
                "across interactions. If the context is not relevant to the current question, "
                "do not force it into your response.\n\n"
                f"{context.formatted_context}"
            )
        }
    ]

    # Add recent conversation history
    for msg in conversation_history:
        messages.append({"role": msg["role"], "content": msg["content"]})

    # Add the current user message
    messages.append({"role": "user", "content": user_message})

    return messages
```
Include a brief instruction telling the LLM how to use the retrieved context. Phrases like “Use this to personalize your response” and “If the context is not relevant, do not force it” help the LLM apply memories appropriately without hallucinating connections.
There are three common strategies for when to ingest conversation data, each with different tradeoffs:
After every turn
At conversation end
Hybrid approach
Ingest each conversation turn (user message + agent response) immediately after the response is delivered. This is the most common pattern.

Pros:
Memories are available for retrieval within the same conversation session
No risk of data loss if the session ends unexpectedly
Fine-grained temporal resolution
Cons:
Higher API call volume
Each turn is ingested independently, without full conversation context
```python
# After each response
await sdk.memories.create(
    document=f"User: {user_message}\nAssistant: {response}",
    document_type="ai-chat-conversation",
    user_id=user_id,
    customer_id=customer_id,
    mode="fast"
)
```
Accumulate the full conversation and ingest it as a single document when the session ends.

Pros:
Fewer API calls
Full conversation context available for extraction — better entity resolution and relationship mapping
More efficient for long-range mode processing
Cons:
Memories from this session are not available during the session itself
Risk of data loss if the session terminates unexpectedly (crash, timeout)
Requires session lifecycle management
```python
# At conversation end
full_transcript = "\n".join(
    f"{'User' if msg['role'] == 'user' else 'Assistant'}: {msg['content']}"
    for msg in conversation_history
)

await sdk.memories.create(
    document=full_transcript,
    document_type="ai-chat-conversation",
    user_id=user_id,
    customer_id=customer_id,
    mode="long-range"
)
```
Ingest each turn in fast mode for immediate availability, then ingest the full conversation in long-range mode at session end for deeper extraction.

Pros:
Immediate availability of basic memories
Deep extraction from the full conversation context
Resilient to unexpected session termination
Cons:
Higher API call volume and processing cost
Requires deduplication logic (use document_id to handle overlapping content)
```python
# During conversation: fast mode per turn
await sdk.memories.create(
    document=f"User: {user_message}\nAssistant: {response}",
    document_type="ai-chat-conversation",
    document_id=f"turn_{session_id}_{turn_number}",
    user_id=user_id,
    customer_id=customer_id,
    mode="fast"
)

# At conversation end: long-range mode for the full transcript
await sdk.memories.create(
    document=full_transcript,
    document_type="ai-chat-conversation",
    document_id=f"session_{session_id}_full",
    user_id=user_id,
    customer_id=customer_id,
    mode="long-range"
)
```
If your agent uses streaming responses (returning tokens as they are generated), ingest the conversation turn only after the full response has been assembled. Do not ingest partial responses.
```python
async def handle_streaming_message(user_message: str, user_id: str, customer_id: str):
    """Handle a streaming response with post-stream ingestion."""
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="fast"
    )

    messages = build_prompt(SYSTEM_PROMPT, context, conversation_history, user_message)

    # Stream the response to the user
    full_response = ""
    async for chunk in await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    ):
        token = chunk.choices[0].delta.content or ""
        full_response += token
        yield token  # Stream to user in real-time

    # Ingest AFTER the full response is complete
    await sdk.memories.create(
        document=f"User: {user_message}\nAssistant: {full_response}",
        document_type="ai-chat-conversation",
        user_id=user_id,
        customer_id=customer_id,
        mode="fast"
    )
```
Never ingest a partial or streaming response. Synap expects complete content to produce high-quality memories. Partial content produces fragmented, low-quality results.
Retrieval is on the critical path of your agent’s response time. The user is waiting while your agent fetches context from Synap. Choosing the right retrieval mode helps balance speed and quality.
For most real-time conversational agents, use fast mode for retrieval. The latency is imperceptible to users and adds minimal overhead to the overall response time. Reserve accurate mode for use cases where retrieval quality matters more than speed — for example, generating end-of-day summaries or answering complex analytical questions.
```python
# Fast retrieval for real-time chat (recommended default)
context = await sdk.conversation.context.fetch(
    user_id=user_id,
    customer_id=customer_id,
    query=user_message,
    mode="fast"
)

# Accurate retrieval for complex queries
context = await sdk.conversation.context.fetch(
    user_id=user_id,
    customer_id=customer_id,
    query="Summarize everything we know about Project Atlas, "
          "including all team members and key decisions",
    mode="accurate"
)
```
Multiple agents can share a single Synap Instance, each ingesting and retrieving from the same memory store. This enables sophisticated multi-agent architectures where specialized agents handle different aspects of a user’s needs while sharing a unified memory.
```python
# Sales agent ingests a conversation
await sdk.memories.create(
    document="User: We're evaluating your enterprise plan for 500 users.\n"
             "Assistant: Great! The enterprise plan includes SSO, priority support, "
             "and custom integrations. I'll send over a detailed proposal.",
    document_type="ai-chat-conversation",
    user_id="user_123",
    customer_id="acme_corp",
    mode="fast"
)

# Support agent later retrieves this context automatically
context = await sdk.conversation.context.fetch(
    user_id="user_123",
    customer_id="acme_corp",
    query="I have a question about setting up SSO"
)
# context includes: "User is evaluating enterprise plan for 500 users, interested in SSO"
```
In this pattern, the support agent automatically knows about the sales conversation because both agents share the same Instance and scope. The support agent can provide more informed assistance without the user having to repeat context.
Multi-agent memory sharing is controlled by scope. If both agents use the same user_id and customer_id, they share memories for that user and organization. If they operate on the same Instance, they share Instance-level memories. See Memory Scopes for details.
This complete example shows a production-ready memory-enabled chatbot using OpenAI, with context retrieval, response generation, and conversation ingestion:
```python
import asyncio

from openai import AsyncOpenAI
from synap import Synap

sdk = Synap(api_key="synap_api_key")
openai_client = AsyncOpenAI(api_key="openai_api_key")

SYSTEM_PROMPT = """You are a helpful customer success assistant. You remember previous
conversations and use that context to provide personalized, continuity-aware responses.

Guidelines:
- Reference relevant past interactions naturally, without being repetitive.
- If you recall a user's preference, apply it without asking again.
- If the retrieved context contradicts something the user just said, trust the user's
  current statement (people change their minds).
- Never fabricate memories. If you don't have context, say so honestly."""


async def chat(
    user_message: str,
    user_id: str,
    customer_id: str,
    conversation_history: list[dict]
) -> str:
    """Handle a single chat turn with full memory integration."""

    # 1. Retrieve relevant context from Synap
    context = await sdk.conversation.context.fetch(
        user_id=user_id,
        customer_id=customer_id,
        query=user_message,
        mode="fast"
    )

    # 2. Build the prompt with memory context
    messages = [
        {
            "role": "system",
            "content": (
                f"{SYSTEM_PROMPT}\n\n"
                "## Retrieved context from memory\n"
                "Use the following information from previous interactions to inform "
                "your response. Do not reference this section directly -- integrate "
                "the knowledge naturally.\n\n"
                f"{context.formatted_context if context.formatted_context else 'No relevant context found.'}"
            )
        }
    ]

    # Add conversation history (recent turns in this session)
    for msg in conversation_history[-10:]:  # Keep last 10 turns for context window
        messages.append({"role": msg["role"], "content": msg["content"]})

    # Add the current user message
    messages.append({"role": "user", "content": user_message})

    # 3. Generate the response
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7,
        max_tokens=1024
    )
    assistant_message = response.choices[0].message.content

    # 4. Ingest the conversation turn (non-blocking, fire-and-forget)
    try:
        await sdk.memories.create(
            document=f"User: {user_message}\nAssistant: {assistant_message}",
            document_type="ai-chat-conversation",
            user_id=user_id,
            customer_id=customer_id,
            mode="fast"
        )
    except Exception as e:
        # Log but do not fail the response
        print(f"Warning: ingestion failed: {e}")

    return assistant_message


# Usage
async def main():
    conversation_history = []
    while True:
        user_input = input("You: ")
        if user_input.lower() in ("quit", "exit"):
            break

        response = await chat(
            user_message=user_input,
            user_id="user_123",
            customer_id="acme_corp",
            conversation_history=conversation_history
        )

        # Update local conversation history
        conversation_history.append({"role": "user", "content": user_input})
        conversation_history.append({"role": "assistant", "content": response})

        print(f"Assistant: {response}")


if __name__ == "__main__":
    asyncio.run(main())
```
Always call retrieval, even for simple queries
Even if you think the current query does not need historical context, always call retrieval. Synap’s ranking ensures that irrelevant context is not returned, so the overhead is minimal. Skipping retrieval means your agent cannot benefit from accumulated memory.
Keep system prompt instructions about memory concise
The LLM does not need to know how Synap works internally. Simply tell it to use the retrieved context naturally and to not fabricate memories. Over-engineering the memory instructions can cause the LLM to behave unnaturally.
Separate retrieval latency from generation latency
Log retrieval and generation latencies independently. If your agent feels slow, retrieval in fast mode is rarely the bottleneck — LLM generation is usually the dominant factor. Measuring both independently helps you optimize the right layer.
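One lightweight way to measure the two independently is a small timing wrapper. This is a sketch; `fetch_context` and `call_llm` in the usage comments stand for your own retrieval and generation helpers:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.latency")

def timed(label: str, fn, *args, **kwargs):
    """Run fn, log its wall-clock latency under label, and return (result, ms)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.info("%s took %.1f ms", label, elapsed_ms)
    return result, elapsed_ms

# Usage inside a chat turn (fetch_context and call_llm are your own wrappers):
# context, retrieval_ms = timed("retrieval", fetch_context, user_message)
# reply, generation_ms = timed("generation", call_llm, messages)
```

Logging the two numbers separately per turn makes it obvious which layer to optimize when overall response time creeps up.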
Use conversation_id for multi-turn tracking
When your application supports multi-turn conversations, pass a consistent conversation_id alongside the user_id. This helps Synap group related turns for better contextual understanding during retrieval.
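A sketch of keeping the id stable across turns, assuming `sdk.memories.create` accepts a `conversation_id` keyword as described; the id format here is purely illustrative:

```python
import uuid

# Generate one conversation_id when the session starts and
# reuse it for every turn in that session.
conversation_id = f"conv_{uuid.uuid4().hex}"

def turn_payload(user_message: str, response: str, user_id: str) -> dict:
    """Build the ingestion payload for one turn, keeping conversation_id stable."""
    return {
        "document": f"User: {user_message}\nAssistant: {response}",
        "document_type": "ai-chat-conversation",
        "user_id": user_id,
        "conversation_id": conversation_id,  # same value for every turn this session
        "mode": "fast",
    }

# Then, per turn: await sdk.memories.create(**turn_payload(...))
```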
Gracefully degrade when Synap is unavailable
Your agent should still function if Synap is temporarily unreachable. Skip the retrieval step and generate a response without memory context. The user experience degrades (no personalization) but does not break entirely.
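One way to sketch this fallback is a wrapper around the retrieval call that returns `None` instead of raising when Synap cannot be reached; the caller then builds the prompt without a memory section:

```python
async def fetch_context_safe(sdk, user_id: str, customer_id: str, query: str):
    """Retrieve context, falling back to None if Synap is unreachable."""
    try:
        return await sdk.conversation.context.fetch(
            user_id=user_id,
            customer_id=customer_id,
            query=query,
            mode="fast"
        )
    except Exception as exc:
        # Degrade gracefully: respond without memory rather than failing the turn
        print(f"Warning: retrieval failed, continuing without memory: {exc}")
        return None
```

Pair this with the fire-and-forget ingestion error handling shown in the complete example above, and a Synap outage costs you personalization for a few turns instead of taking the agent down.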