Working Memory
Working memory stores the current conversation for a session. It holds messages, tracks context, and automatically summarizes old messages when the conversation gets too long.
What Working Memory Does
- Stores conversation messages — The chat history for a session
- Tracks session data — Arbitrary key-value data that lives only in this session
- Automatically summarizes — When messages exceed the token limit, older messages are summarized and removed
- Promotes memories — Structured memories added here get indexed in long-term storage
Quick Reference
| Feature | Details |
|---|---|
| Scope | One session |
| Lifespan | Persistent (default) or TTL-based |
| Storage | Redis JSON |
| Key Feature | Automatic summarization |
| Search | None (use long-term memory for search) |
Data Structure
Working memory contains:
| Field | Description |
|---|---|
| messages | Conversation history (role/content pairs) |
| context | Summary of older messages (populated by auto-summarization) |
| memories | Structured memory records that get promoted to long-term storage |
| data | Arbitrary JSON key-value storage for the session |
| user_id | Owner of this session |
| namespace | Logical grouping |
| ttl_seconds | Optional expiration time |
Automatic Summarization
When your conversation approaches the limit of the model's context window, working memory automatically:
- Summarizes older messages into a compact summary
- Stores the summary in the context field
- Removes the summarized messages to free space
- Keeps recent messages intact
This happens transparently; you don't need to trigger it.
How It Works
The server tracks token usage against your model's context window. When messages exceed a threshold (default: 70% of the context window), summarization kicks in:
Messages: [msg1, msg2, msg3, msg4, msg5, msg6, msg7, msg8, msg9, msg10]
↓ (exceeds threshold)
↓ summarize older messages
Context: "User discussed trip planning to Paris, preferences for museums..."
Messages: [msg8, msg9, msg10] ← recent messages preserved
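The flow in the diagram can be sketched in a few lines of Python. This is an illustration only, not the server's implementation: `count_tokens`, `Session`, `maybe_summarize`, and the `keep_recent` parameter are hypothetical names, the token counter is a word count, and the "summary" is a plain join where a real server would call the generation model.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: one token per word."""
    return len(text.split())


@dataclass
class Session:
    messages: list
    context: str = ""


def maybe_summarize(session, context_window, threshold=0.7, keep_recent=3):
    """Fold older messages into `context` once the token budget is exceeded.

    Returns True if summarization happened, False otherwise.
    """
    used = sum(count_tokens(m) for m in session.messages)
    if used <= context_window * threshold:
        return False  # still under the threshold, nothing to do

    older = session.messages[:-keep_recent]
    recent = session.messages[-keep_recent:]
    if not older:
        return False  # nothing old enough to summarize

    # Assumption: a real server would ask an LLM for the summary here.
    summary = " / ".join(older)
    session.context = f"{session.context} {summary}".strip()
    session.messages = recent
    return True
```

With ten three-word messages and a 40-token window, the 28-token threshold is crossed and only the last three messages remain in `messages`; the rest are folded into `context`.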
Finding the Summary
The summary is stored in the context field of working memory:
# After summarization has occurred
working_memory = await get_working_memory("session_123")
print(working_memory.context)
# "User discussed trip planning to Paris, preferences for museums and food,
# budget constraints around $3000, and interest in Impressionist art..."
print(working_memory.messages)
# [recent messages only]
Monitoring Summarization
The WorkingMemoryResponse includes fields to track context usage:
response = await get_working_memory("session_123")
# How much of the total context window is used (0-100%)
print(response.context_percentage_total_used) # e.g., 45.2
# How close to triggering summarization (0-100%)
print(response.context_percentage_until_summarization) # e.g., 64.5
# When this hits 100%, summarization triggers
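One plausible way to read the two fields is that both come from the same token count, with the second scaled by the summarization threshold. The sketch below encodes that reading; the function name and the formula are assumptions for illustration, not the server's confirmed calculation.

```python
def context_usage(tokens_used: int, context_window: int, threshold: float = 0.7):
    """Return (percent of window used, percent of the way to summarization).

    Assumption: summarization triggers at `threshold * context_window` tokens.
    """
    total_used = 100 * tokens_used / context_window
    until_summarization = 100 * tokens_used / (context_window * threshold)
    # Cap both values at 100 so they stay in the documented 0-100 range.
    return min(total_used, 100.0), min(until_summarization, 100.0)
```

Under this reading, a session that uses 45.2% of the window is about 64.6% of the way to a 0.7 threshold, which is in line with the example values above.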
Configuring Summarization
| Environment Variable | Default | Description |
|---|---|---|
| SUMMARIZATION_THRESHOLD | 0.7 | Fraction of context window that triggers summarization |
| GENERATION_MODEL | gpt-4o-mini | Model used for summarization |
| PROGRESSIVE_SUMMARIZATION_PROMPT | (see below) | Custom prompt for summarization |
The summarization prompt can be customized. It must include the {prev_summary} and {messages_joined} placeholders.
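A custom prompt can be checked for both placeholders before it is deployed. The sketch below assumes the template is filled with Python-style `str.format` substitution; the prompt wording, `REQUIRED_FIELDS`, and `validate_prompt` are illustrative and not the built-in prompt or a server API.

```python
import string

REQUIRED_FIELDS = {"prev_summary", "messages_joined"}


def validate_prompt(template: str) -> None:
    """Raise ValueError if a required placeholder is missing from the template."""
    found = {
        field_name
        for _, field_name, _, _ in string.Formatter().parse(template)
        if field_name
    }
    missing = REQUIRED_FIELDS - found
    if missing:
        raise ValueError(f"prompt is missing placeholders: {sorted(missing)}")


# Illustrative prompt text, not the server's default.
CUSTOM_PROMPT = (
    "Existing summary:\n{prev_summary}\n\n"
    "New messages:\n{messages_joined}\n\n"
    "Write an updated summary that keeps names, dates, and decisions."
)

validate_prompt(CUSTOM_PROMPT)
rendered = CUSTOM_PROMPT.format(
    prev_summary="(none)",
    messages_joined="user: I'm planning a trip to Paris next month",
)
```

The validated string is what you would place in PROGRESSIVE_SUMMARIZATION_PROMPT.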
Storing Messages
The primary use of working memory is storing conversation messages:
from datetime import datetime, UTC
import ulid

working_memory = WorkingMemory(
    session_id="chat_123",
    messages=[
        MemoryMessage(
            role="user",
            content="I'm planning a trip to Paris next month",
            id=ulid.ULID(),
            created_at=datetime.now(UTC)
        ),
        MemoryMessage(
            role="assistant",
            content="What type of activities interest you?",
            id=ulid.ULID(),
            created_at=datetime.now(UTC)
        ),
    ]
)
⚠️ Always provide created_at timestamps
This ensures correct message ordering and proper temporal context when promoting to long-term memory. Omitting created_at triggers a deprecation warning; it will become required in a future version.
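One way to make sure no message goes out without a timestamp is to stamp it at the moment it is created rather than when the session is saved. The helper below is a minimal sketch using plain dictionaries that mirror the fields shown above; `make_message` is a hypothetical name, and a UUID stands in for the ULID used in the example.

```python
from datetime import datetime, timezone
import uuid


def make_message(role, content, created_at=None):
    """Build a message dict, defaulting created_at to the current UTC time."""
    stamp = created_at or datetime.now(timezone.utc)
    return {
        "id": str(uuid.uuid4()),  # stand-in for a ULID
        "role": role,
        "content": content,
        "created_at": stamp.isoformat(),  # timezone-aware ISO 8601 string
    }
```

Passing an explicit `created_at` is useful when replaying history from another system, where the original send time is known.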
Session-Specific Data
Use the data field for temporary information that doesn't need to persist across conversations:
working_memory = WorkingMemory(
    session_id="chat_123",
    data={
        "current_topic": "trip_planning",
        "user_timezone": "America/New_York",
    }
)
Structured Memories
Use the memories field for facts that should persist beyond this session:
working_memory = WorkingMemory(
    session_id="chat_123",
    memories=[
        MemoryRecord(
            text="User is planning a trip to Paris next month",
            id="trip_planning_paris",
            memory_type="episodic",
            topics=["travel"],
            entities=["Paris"]
        )
    ]
)
These are automatically promoted to long-term storage and become searchable across all sessions.
Key distinction:
- data → session-only, not searchable, not persisted beyond session
- memories → promoted to long-term storage, searchable, persistent
Memory Promotion to Long-Term Storage
Memories added to the memories field are automatically promoted to long-term storage:
- Server identifies memories with persisted_at=null
- Generates vector embeddings
- Indexes in long-term storage
- Updates working memory with persisted_at timestamps
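The promotion steps above can be sketched with in-memory stand-ins. Everything here is hypothetical: `fake_embed` replaces a real embedding model, a plain dict replaces the long-term index, and `promote` is not a server function. The point is the filter on `persisted_at` and the write-back of the timestamp.

```python
from datetime import datetime, timezone


def fake_embed(text):
    """Hypothetical embedding; a real system would call an embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def promote(memories, index):
    """Index every memory that has no persisted_at yet; return how many."""
    promoted = 0
    for memory in memories:
        if memory.get("persisted_at") is not None:
            continue  # already in long-term storage
        index[memory["id"]] = {
            "text": memory["text"],
            "vector": fake_embed(memory["text"]),
        }
        # Mark the working-memory copy so it is not promoted again.
        memory["persisted_at"] = datetime.now(timezone.utc).isoformat()
        promoted += 1
    return promoted
```

Because the timestamp is written back, running the function a second time over the same list promotes nothing.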
You can also configure background extraction to automatically extract memories from conversation messages:
working_memory = WorkingMemory(
    session_id="chat_123",
    messages=[...],
    long_term_memory_strategy=MemoryStrategyConfig(
        strategy="discrete",  # or "summary", "preferences", "custom"
        config={}
    ),
)
See Memory Extraction Strategies for configuration options.
API Reference
# Get working memory
GET /v1/working-memory/{session_id}?namespace=demo&model_name=gpt-4o
# Set working memory (replaces existing)
PUT /v1/working-memory/{session_id}?ttl_seconds=3600
# Delete working memory
DELETE /v1/working-memory/{session_id}?namespace=demo
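When calling these endpoints without a client library, the session ID and query parameters need to be URL-encoded. The helper below is a small sketch using only the standard library; `BASE_URL` and `working_memory_url` are assumptions for illustration, and the paths and parameter names are the ones listed above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def working_memory_url(session_id, **params):
    """Build a working-memory URL, skipping parameters that are None."""
    path = f"/v1/working-memory/{quote(session_id, safe='')}"
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


get_url = working_memory_url("chat_123", namespace="demo", model_name="gpt-4o")
put_url = working_memory_url("chat_123", ttl_seconds=3600)
delete_url = working_memory_url("chat_123", namespace="demo")
```

Quoting the session ID with `safe=''` keeps IDs that contain slashes or spaces from changing the request path.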
TTL and Persistence
Working memory is persistent by default. Set ttl_seconds to auto-expire:
# Persistent (default)
working_memory = WorkingMemory(session_id="chat_123", messages=[...])
# Expires after 1 hour
working_memory = WorkingMemory(session_id="chat_123", messages=[...], ttl_seconds=3600)
Use TTL for: temporary sessions, privacy requirements, resource constraints.
Keep persistent for: conversation history, multi-turn context, support applications.
Reconstruction from Long-Term Memory
With INDEX_ALL_MESSAGES_IN_LONG_TERM_MEMORY=true, working memory can be reconstructed after TTL expiration:
- Messages are indexed in long-term memory as they flow through
- When working memory expires, messages remain in long-term storage
- Requesting an expired session reconstructs it from long-term memory
This lets you use TTL to save Redis memory while maintaining conversation continuity.
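The fallback behaviour can be pictured as a read-through lookup. The sketch below uses a dict for the working store and a list for the long-term index; `get_session` and the record shape are hypothetical, and writing the rebuilt session back into the working store is an assumption rather than documented behaviour.

```python
def get_session(session_id, working_store, long_term_index):
    """Return the live session, or rebuild it from indexed messages."""
    session = working_store.get(session_id)
    if session is not None:
        return session  # not expired, use it as-is

    # Expired or unknown: collect this session's messages in time order.
    messages = sorted(
        (m for m in long_term_index if m["session_id"] == session_id),
        key=lambda m: m["created_at"],
    )
    if not messages:
        return None  # nothing was ever indexed for this session

    rebuilt = {"session_id": session_id, "messages": messages}
    working_store[session_id] = rebuilt  # assumption: cache the rebuilt copy
    return rebuilt
```

Sorting by `created_at` is what makes accurate message timestamps matter for reconstruction.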
Configuration Reference
| Variable | Default | Description |
|---|---|---|
| SUMMARIZATION_THRESHOLD | 0.7 | Fraction of context window that triggers summarization |
| GENERATION_MODEL | gpt-4o-mini | Model for summarization |
| PROGRESSIVE_SUMMARIZATION_PROMPT | (built-in) | Custom summarization prompt |
| LONG_TERM_MEMORY | true | Enable long-term memory features |
| INDEX_ALL_MESSAGES_IN_LONG_TERM_MEMORY | false | Index messages for reconstruction |
See the Configuration Guide for all options.
Related Documentation
- Long-term Memory — Persistent, cross-session storage
- Memory Integration Patterns — How to integrate memory
- Memory Extraction Strategies — Automatic memory extraction
- LLM Providers — Configure OpenAI, Anthropic, Bedrock, Ollama