Skip to content

Working Memory

Working memory stores the current conversation for a session. It holds messages, tracks context, and automatically summarizes old messages when the conversation gets too long.

What Working Memory Does

  1. Stores conversation messages — The chat history for a session
  2. Tracks session data — Arbitrary key-value data that lives only in this session
  3. Automatically summarizes — When messages exceed the token limit, older messages are summarized and removed
  4. Promotes memories — Structured memories added here get indexed in long-term storage

Quick Reference

Feature Details
Scope One session
Lifespan Persistent (default) or TTL-based
Storage Redis JSON
Key Feature Automatic summarization
Search None (use long-term memory for search)

Data Structure

Working memory contains:

Field Description
messages Conversation history (role/content pairs)
context Summary of older messages (populated by auto-summarization)
memories Structured memory records that get promoted to long-term storage
data Arbitrary JSON key-value storage for the session
user_id Owner of this session
namespace Logical grouping
ttl_seconds Optional expiration time

Automatic Summarization

When your conversation exceeds the model's context window, working memory automatically:

  1. Summarizes older messages into a compact summary
  2. Stores the summary in the context field
  3. Removes the summarized messages to free space
  4. Keeps recent messages intact

This happens transparently—you don't need to trigger it.

How It Works

The server tracks token usage against your model's context window. When messages exceed a threshold (default: 70% of the context window), summarization kicks in:

Messages: [msg1, msg2, msg3, msg4, msg5, msg6, msg7, msg8, msg9, msg10]
                  ↓ (exceeds threshold)
                  ↓ summarize older messages
Context:  "User discussed trip planning to Paris, preferences for museums..."
Messages: [msg8, msg9, msg10]  ← recent messages preserved

Finding the Summary

The summary is stored in the context field of working memory:

# After summarization has occurred
working_memory = await get_working_memory("session_123")

print(working_memory.context)
# "User discussed trip planning to Paris, preferences for museums and food,
#  budget constraints around $3000, and interest in Impressionist art..."

print(working_memory.messages)
# [recent messages only]

Monitoring Summarization

The WorkingMemoryResponse includes fields to track context usage:

response = await get_working_memory("session_123")

# How much of the total context window is used (0-100%)
print(response.context_percentage_total_used)  # e.g., 45.2

# How close to triggering summarization (0-100%)
print(response.context_percentage_until_summarization)  # e.g., 64.5
# When this hits 100%, summarization triggers

Configuring Summarization

Environment Variable Default Description
SUMMARIZATION_THRESHOLD 0.7 Fraction of context window that triggers summarization
GENERATION_MODEL gpt-4o-mini Model used for summarization
PROGRESSIVE_SUMMARIZATION_PROMPT (see below) Custom prompt for summarization

The summarization prompt can be customized. It must include {prev_summary} and {messages_joined} placeholders:

PROGRESSIVE_SUMMARIZATION_PROMPT="Your custom prompt with {prev_summary} and {messages_joined}..."

Storing Messages

The primary use of working memory is storing conversation messages:

from datetime import datetime, UTC
import ulid

working_memory = WorkingMemory(
    session_id="chat_123",
    messages=[
        MemoryMessage(
            role="user",
            content="I'm planning a trip to Paris next month",
            id=ulid.ULID(),
            created_at=datetime.now(UTC)
        ),
        MemoryMessage(
            role="assistant",
            content="What type of activities interest you?",
            id=ulid.ULID(),
            created_at=datetime.now(UTC)
        ),
    ]
)

⚠️ Always provide created_at timestamps

This ensures correct message ordering and proper temporal context when promoting to long-term memory. Omitting created_at triggers a deprecation warning—it will become required in a future version.

Session-Specific Data

Use the data field for temporary information that doesn't need to persist across conversations:

working_memory = WorkingMemory(
    session_id="chat_123",
    data={
        "current_topic": "trip_planning",
        "user_timezone": "America/New_York",
    }
)

Structured Memories

Use the memories field for facts that should persist beyond this session:

working_memory = WorkingMemory(
    session_id="chat_123",
    memories=[
        MemoryRecord(
            text="User is planning a trip to Paris next month",
            id="trip_planning_paris",
            memory_type="episodic",
            topics=["travel"],
            entities=["Paris"]
        )
    ]
)

These are automatically promoted to long-term storage and become searchable across all sessions.

Key distinction: - data → session-only, not searchable, not persisted beyond session - memories → promoted to long-term storage, searchable, persistent

Memory Promotion to Long-Term Storage

Memories added to the memories field are automatically promoted to long-term storage:

  1. Server identifies memories with persisted_at=null
  2. Generates vector embeddings
  3. Indexes in long-term storage
  4. Updates working memory with persisted_at timestamps

You can also configure background extraction to automatically extract memories from conversation messages:

working_memory = WorkingMemory(
    session_id="chat_123",
    messages=[...],
    long_term_memory_strategy=MemoryStrategyConfig(
        strategy="discrete",  # or "summary", "preferences", "custom"
        config={}
    ),
)

See Memory Extraction Strategies for configuration options.

API Reference

# Get working memory
GET /v1/working-memory/{session_id}?namespace=demo&model_name=gpt-4o

# Set working memory (replaces existing)
PUT /v1/working-memory/{session_id}?ttl_seconds=3600

# Delete working memory
DELETE /v1/working-memory/{session_id}?namespace=demo

TTL and Persistence

Working memory is persistent by default. Set ttl_seconds to auto-expire:

# Persistent (default)
working_memory = WorkingMemory(session_id="chat_123", messages=[...])

# Expires after 1 hour
working_memory = WorkingMemory(session_id="chat_123", messages=[...], ttl_seconds=3600)

Use TTL for: temporary sessions, privacy requirements, resource constraints.

Keep persistent for: conversation history, multi-turn context, support applications.

Reconstruction from Long-Term Memory

With INDEX_ALL_MESSAGES_IN_LONG_TERM_MEMORY=true, working memory can be reconstructed after TTL expiration:

  1. Messages are indexed in long-term memory as they flow through
  2. When working memory expires, messages remain in long-term storage
  3. Requesting an expired session reconstructs it from long-term memory

This lets you use TTL to save Redis memory while maintaining conversation continuity.

Configuration Reference

Variable Default Description
SUMMARIZATION_THRESHOLD 0.7 Fraction of context window that triggers summarization
GENERATION_MODEL gpt-4o-mini Model for summarization
PROGRESSIVE_SUMMARIZATION_PROMPT (built-in) Custom summarization prompt
LONG_TERM_MEMORY true Enable long-term memory features
INDEX_ALL_MESSAGES_IN_LONG_TERM_MEMORY false Index messages for reconstruction

See the Configuration Guide for all options.