LLM Cache
RedisVL provides powerful caching mechanisms to optimize Large Language Model (LLM) applications. By caching responses and embeddings, you can significantly reduce costs, improve latency, and increase throughput.
Why Cache LLM Responses?
LLM API calls are:
- Expensive - Each API call costs money
- Slow - Network latency and inference time add up
- Repetitive - Users often ask similar questions
Caching helps address all three issues by storing and reusing results for similar queries.
Semantic Cache
Semantic Cache stores LLM responses and retrieves them based on semantic similarity rather than exact matches. This allows you to serve cached responses for questions that are similar in meaning even if they’re worded differently.
How It Works
- User asks a question
- Convert the question to an embedding vector
- Search for semantically similar cached questions
- If found (above threshold), return the cached response
- Otherwise, call the LLM and cache the new result
Basic Usage
import com.redis.vl.extensions.cache.SemanticCache;
import com.redis.vl.utils.vectorize.BaseVectorizer;
import redis.clients.jedis.UnifiedJedis;
// Connect to Redis
UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");
// Create a vectorizer (using LangChain4J or ONNX)
BaseVectorizer vectorizer = /* create your vectorizer */;
// Create semantic cache
SemanticCache cache = new SemanticCache(
jedis,
vectorizer,
"llm-cache", // index name
0.9 // similarity threshold (0-1)
);
// Check for cached response
String prompt = "What is Redis?";
Optional<CacheHit> hit = cache.check(prompt);
if (hit.isPresent()) {
// Cache hit! Use cached response
System.out.println("Cache hit! Response: " + hit.get().getResponse());
} else {
// Cache miss - call LLM
String response = callLLM(prompt); // your LLM call
// Store in cache for future use
cache.store(prompt, response);
System.out.println("Cached new response: " + response);
}
Configuration Options
// Create with custom configuration
SemanticCache cache = SemanticCache.builder()
.jedis(jedis)
.vectorizer(vectorizer)
.name("my-llm-cache")
.threshold(0.85) // Similarity threshold (default: 0.9)
.ttl(3600) // Time-to-live in seconds (optional)
.returnFields("prompt", "response", "timestamp")
.build();
Threshold Selection
The threshold determines how similar a query must be to return a cached result:
- 0.95-1.0 - Very strict, almost exact matches
- 0.85-0.95 - Balanced, good for most use cases
- 0.70-0.85 - Loose, more cache hits but less precision
- Below 0.70 - Too loose, may return irrelevant results
Tip: Start with 0.9 and adjust based on your use case. Monitor cache hit rates and response quality.
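One rough way to calibrate the threshold is to replay a small set of paraphrase pairs against caches built with different thresholds and compare hit rates. The sketch below is illustrative only: the index names, paraphrase pairs, and placeholder response are made up, and it reuses the SemanticCache constructor and check/store calls shown above.
// Calibration sketch (illustrative): measure hit rate per candidate threshold
List<String[]> pairs = List.of(
    new String[]{"How do I reset my password?", "I forgot my password, what should I do?"},
    new String[]{"What is Redis?", "Can you explain what Redis is?"}
);
for (double threshold : new double[]{0.95, 0.9, 0.85, 0.8}) {
    // One throwaway cache per candidate threshold (index name is made up)
    SemanticCache candidate = new SemanticCache(
        jedis, vectorizer, "threshold-test-" + threshold, threshold);
    int hits = 0;
    for (String[] pair : pairs) {
        candidate.store(pair[0], "placeholder response");
        if (candidate.check(pair[1]).isPresent()) {
            hits++;
        }
    }
    System.out.printf("threshold=%.2f hit rate=%.0f%%%n",
        threshold, 100.0 * hits / pairs.size());
}
A higher hit rate at a looser threshold is only useful if the returned answers are still correct, so review the matched pairs before lowering the production threshold.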
Example: Customer Support Bot
import com.redis.vl.extensions.cache.SemanticCache;
public class SupportBot {
private final SemanticCache cache;
private final LLMClient llmClient;
public SupportBot(UnifiedJedis jedis, BaseVectorizer vectorizer) {
this.cache = new SemanticCache(jedis, vectorizer, "support-cache", 0.88);
this.llmClient = new LLMClient(); // your LLM client
}
public String answerQuestion(String userQuestion) {
// Check cache first
Optional<CacheHit> hit = cache.check(userQuestion);
if (hit.isPresent()) {
System.out.println("✓ Cache hit! Saved API call and time.");
return hit.get().getResponse();
}
// Cache miss - call LLM
System.out.println("✗ Cache miss - calling LLM...");
String response = llmClient.complete(userQuestion);
// Store for future use
cache.store(userQuestion, response);
return response;
}
public void clearOldEntries() {
// clear() removes all cached entries; use the ttl option for time-based expiry
cache.clear();
}
}
// Usage
SupportBot bot = new SupportBot(jedis, vectorizer);
// These semantically similar questions will hit the cache
String answer1 = bot.answerQuestion("How do I reset my password?");
String answer2 = bot.answerQuestion("I forgot my password, what should I do?");
String answer3 = bot.answerQuestion("Password reset procedure?");
// answer2 and answer3 will be cache hits if threshold allows
LangCache Semantic Cache
LangCacheSemanticCache provides integration with the LangCache managed service, allowing you to leverage cloud-based semantic caching without managing your own Redis infrastructure. LangCache handles embedding generation, similarity search, and cache management through a simple API.
For more information about LangCache, see the official LangCache documentation.
How It Works
LangCacheSemanticCache acts as a bridge between RedisVL and the LangCache API service:
- User asks a question
- LangCache generates embeddings and searches for similar cached questions
- If found (above threshold), return the cached response from LangCache
- Otherwise, call the LLM and cache the result via LangCache
Basic Usage
import com.redis.vl.extensions.cache.LangCacheSemanticCache;
// Create LangCache wrapper
LangCacheSemanticCache cache = new LangCacheSemanticCache.Builder()
.name("my-langcache")
.serverUrl("https://api.langcache.com")
.cacheId("your-cache-id")
.apiKey("your-api-key")
.ttl(3600) // Optional: 1 hour TTL
.build();
// Check for cached response
String prompt = "What is Redis?";
List<Map<String, Object>> hits = cache.check(prompt, null, 1, null, null, null);
if (!hits.isEmpty()) {
// Cache hit! Use cached response
Map<String, Object> hit = hits.get(0);
System.out.println("Cache hit! Response: " + hit.get("response"));
} else {
// Cache miss - call LLM
String response = callLLM(prompt); // your LLM call
// Store in cache for future use
String entryId = cache.store(prompt, response, null);
System.out.println("Cached new response with ID: " + entryId);
}
Configuration Options
// Create with custom configuration
LangCacheSemanticCache cache = new LangCacheSemanticCache.Builder()
.name("my-langcache")
.serverUrl("https://api.langcache.com")
.cacheId("your-cache-id")
.apiKey("your-api-key")
.ttl(7200) // Time-to-live in seconds
.useExactSearch(true) // Enable exact match search (default: true)
.useSemanticSearch(true) // Enable semantic search (default: true)
.distanceScale("normalized") // "normalized" (0-1) or "redis" (0-2)
.build();
Distance Scale
LangCache returns similarity scores (0-1, higher is better), but RedisVL uses distance (lower is better). The distanceScale parameter controls how distance thresholds are interpreted:
- "normalized" (default): Distance thresholds are 0-1
  - distance_threshold=0.1 means "accept results with distance ≤ 0.1"
  - Converts to the LangCache similarity threshold: similarity_threshold = 1.0 - 0.1 = 0.9
- "redis": Distance thresholds are in Redis COSINE format (0-2)
  - distance_threshold=0.4 means "accept results with Redis COSINE distance ≤ 0.4"
  - Converts to the LangCache similarity threshold: similarity_threshold = (2 - 0.4) / 2 = 0.8
// Using normalized scale (default)
LangCacheSemanticCache normalizedCache = new LangCacheSemanticCache.Builder()
.cacheId("cache-id")
.apiKey("api-key")
.distanceScale("normalized")
.build();
// Distance threshold 0.15 = similarity threshold 0.85
normalizedCache.check(prompt, null, 1, null, null, 0.15f);
// Using Redis COSINE scale
LangCacheSemanticCache redisCache = new LangCacheSemanticCache.Builder()
.cacheId("cache-id")
.apiKey("api-key")
.distanceScale("redis")
.build();
// Distance threshold 0.3 = similarity threshold 0.85
redisCache.check(prompt, null, 1, null, null, 0.3f);
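The two scales reduce to simple arithmetic. The helpers below are illustrative only (they are not part of the RedisVL or LangCache API); they show the conversion each distanceScale setting applies before the request reaches LangCache.
// Illustrative conversion helpers (not library API)
static double normalizedDistanceToSimilarity(double distanceThreshold) {
    // "normalized": similarity_threshold = 1.0 - distance_threshold
    return 1.0 - distanceThreshold;            // e.g. 0.15 -> 0.85
}
static double redisCosineDistanceToSimilarity(double distanceThreshold) {
    // "redis": similarity_threshold = (2 - distance_threshold) / 2
    return (2.0 - distanceThreshold) / 2.0;    // e.g. 0.3 -> 0.85
}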
Metadata and Filtering
Store and retrieve cache entries with custom attributes (metadata):
// Store with metadata
Map<String, Object> metadata = Map.of(
"user_id", "12345",
"topic", "technical",
"language", "en"
);
cache.store(prompt, response, metadata);
// Search with attribute filtering
Map<String, Object> attributes = Map.of("topic", "technical");
List<Map<String, Object>> hits = cache.check(
prompt,
attributes, // Filter by attributes
5, // Return up to 5 results
null, // returnFields (not used)
null, // filterExpression (not supported)
0.15f // distance threshold
);
for (Map<String, Object> hit : hits) {
System.out.println("Response: " + hit.get("response"));
System.out.println("Distance: " + hit.get("vector_distance"));
System.out.println("Metadata: " + hit.get("metadata"));
}
Delete Operations
// Delete entire cache
cache.delete();
// Delete specific entry by ID
String entryId = cache.store(prompt, response, null);
cache.deleteById(entryId);
// Delete entries matching attributes
Map<String, Object> attributes = Map.of("topic", "outdated");
cache.deleteByAttributes(attributes);
// Clear cache (alias for delete)
cache.clear();
Search Strategies
LangCache supports both exact and semantic search. You can enable either or both:
// Both exact and semantic search (default)
LangCacheSemanticCache cache = new LangCacheSemanticCache.Builder()
.cacheId("cache-id")
.apiKey("api-key")
.useExactSearch(true)
.useSemanticSearch(true)
.build();
// Only semantic search
LangCacheSemanticCache semanticOnly = new LangCacheSemanticCache.Builder()
.cacheId("cache-id")
.apiKey("api-key")
.useExactSearch(false)
.useSemanticSearch(true)
.build();
// Only exact search
LangCacheSemanticCache exactOnly = new LangCacheSemanticCache.Builder()
.cacheId("cache-id")
.apiKey("api-key")
.useExactSearch(true)
.useSemanticSearch(false)
.build();
Example: Multi-Tenant Chatbot
public class MultiTenantChatbot {
private final LangCacheSemanticCache cache;
private final LLMClient llmClient;
public MultiTenantChatbot(String cacheId, String apiKey) {
this.cache = new LangCacheSemanticCache.Builder()
.name("tenant-cache")
.cacheId(cacheId)
.apiKey(apiKey)
.distanceScale("normalized")
.build();
this.llmClient = new LLMClient();
}
public String answer(String userId, String tenantId, String question) {
// Build attributes for filtering
Map<String, Object> attributes = Map.of(
"user_id", userId,
"tenant_id", tenantId
);
// Check cache with attributes
List<Map<String, Object>> hits = cache.check(
question,
attributes,
1,
null,
null,
0.1f // Accept results with distance ≤ 0.1
);
if (!hits.isEmpty()) {
System.out.println("✓ Cache hit!");
return (String) hits.get(0).get("response");
}
// Cache miss - call LLM
String response = llmClient.complete(question);
// Store with attributes
cache.store(question, response, attributes);
return response;
}
public void clearTenantCache(String tenantId) {
// Delete all entries for this tenant
cache.deleteByAttributes(Map.of("tenant_id", tenantId));
}
}
Limitations
LangCacheSemanticCache has some limitations compared to SemanticCache:
- No custom vectors: LangCache generates embeddings internally
- No filter expressions: Only attribute-based filtering is supported
- No updates: Entries cannot be updated; delete and recreate instead (see the sketch after this list)
- No per-entry TTL: TTL is configured at the cache level via the LangCache console
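Because entries cannot be updated in place, refreshing a stale answer means deleting the old entry and storing a new one. A minimal sketch using only the store and deleteById calls shown above; the prompt text, responses, and the ID-tracking map are made up for illustration:
// "Update" pattern under the no-updates limitation: delete, then re-store
Map<String, String> entryIds = new HashMap<>();
String prompt = "What is the current pricing?";
entryIds.put(prompt, cache.store(prompt, "Old pricing answer", null));
// Later, when the answer changes:
String staleId = entryIds.get(prompt);
if (staleId != null) {
    cache.deleteById(staleId);                 // remove the outdated entry
}
entryIds.put(prompt, cache.store(prompt, "Updated pricing answer", null));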
When to Use LangCache vs SemanticCache
Use LangCacheSemanticCache when:
- You want a fully managed solution
- You don’t want to manage Redis infrastructure
- You need multi-region caching
- You want built-in monitoring and analytics
- You prefer pay-as-you-go pricing
Use SemanticCache when:
- You already have Redis infrastructure
- You need full control over embeddings
- You want to use custom vectorizers
- You need filter expressions
- You prefer self-hosted solutions
Embeddings Cache
Embeddings Cache stores vector embeddings to avoid recomputing them. This is useful when you frequently need embeddings for the same text.
How It Works
- Convert text to a cache key (hash of the text)
- Check if an embedding exists in the cache
- If found, return the cached embedding
- Otherwise, compute the embedding and cache it
Basic Usage
import com.redis.vl.extensions.cache.EmbeddingsCache;
// Create embeddings cache
EmbeddingsCache embCache = new EmbeddingsCache(
jedis,
vectorizer,
"embeddings-cache"
);
// Get embedding (will cache automatically)
String text = "Redis is an in-memory database";
float[] embedding = embCache.embed(text);
// Second call will hit cache
float[] cachedEmbedding = embCache.embed(text); // Much faster!
// Batch embedding with caching
List<String> texts = List.of(
"First document",
"Second document",
"First document" // Will hit cache
);
List<float[]> embeddings = embCache.embed(texts);
Configuration
// Create with custom configuration
EmbeddingsCache embCache = EmbeddingsCache.builder()
.jedis(jedis)
.vectorizer(vectorizer)
.name("my-emb-cache")
.ttl(86400) // 24 hours
.build();
Example: Document Processing Pipeline
public class DocumentProcessor {
private final EmbeddingsCache embCache;
private final SearchIndex index;
public DocumentProcessor(UnifiedJedis jedis, BaseVectorizer vectorizer) {
this.embCache = new EmbeddingsCache(jedis, vectorizer, "doc-embeddings");
this.index = /* your search index */;
}
public void processDocuments(List<String> documents) {
List<Map<String, Object>> data = new ArrayList<>();
for (String doc : documents) {
// Get embedding (cached if possible)
float[] embedding = embCache.embed(doc);
data.add(Map.of(
"content", doc,
"embedding", embedding,
"processed_at", System.currentTimeMillis()
));
}
// Load into search index
index.load(data);
}
public void updateDocument(String oldContent, String newContent) {
// Clear old embedding from cache
embCache.delete(oldContent);
// Process new content
float[] newEmbedding = embCache.embed(newContent);
// Update in index...
}
}
Cache Statistics
Monitor your cache performance:
// Semantic Cache stats
Map<String, Object> stats = cache.getStats();
System.out.println("Cache hits: " + stats.get("hits"));
System.out.println("Cache misses: " + stats.get("misses"));
System.out.println("Hit rate: " + stats.get("hit_rate") + "%");
// Check cache size
int size = cache.size();
System.out.println("Cached entries: " + size);
Cache Management
Best Practices
- Choose Appropriate Thresholds
  - Start with 0.9 for semantic cache
  - Adjust based on cache hit rate and quality
  - Monitor false positives
- Set Reasonable TTLs
  - Short TTLs (minutes-hours) for frequently changing content
  - Long TTLs (days-weeks) for stable content
  - No TTL for permanent caching
- Monitor Performance (see the sketch after this list)
  - Track cache hit rates
  - Measure latency improvements
  - Calculate cost savings
- Handle Cache Misses Gracefully
try {
    Optional<CacheHit> hit = cache.check(prompt);
    if (hit.isPresent()) {
        return hit.get().getResponse();
    }
} catch (Exception e) {
    logger.warn("Cache check failed, falling back to LLM", e);
}
// Always have a fallback to the LLM
return callLLM(prompt);
- Clear Stale Data
// Periodic cache cleanup
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(() -> {
    cache.clear(); // or selective cleanup
}, 0, 24, TimeUnit.HOURS);
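The Monitor Performance item above suggests tracking hit rates and cost savings. A rough monitoring sketch, reusing the getStats() map from the Cache Statistics section; the cost-per-call figure is an assumed placeholder, not a measured value:
// Derive hit rate and estimated savings from cache statistics
Map<String, Object> stats = cache.getStats();
long hits = ((Number) stats.get("hits")).longValue();
long misses = ((Number) stats.get("misses")).longValue();
double hitRate = (hits + misses) == 0 ? 0.0 : (double) hits / (hits + misses);
double assumedCostPerCall = 0.005;             // dollars per avoided LLM call (assumption)
double estimatedSavings = hits * assumedCostPerCall;
System.out.printf("Hit rate: %.1f%%, estimated savings: $%.2f%n",
    hitRate * 100, estimatedSavings);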
Performance Impact
Illustrative improvements with semantic caching:
| Metric | Without Cache | With Cache (80% hit rate) |
|---|---|---|
| API Calls | 10,000/day | 2,000/day (-80%) |
| Avg Latency | 500ms | ~150ms (-70%) |
| Daily Cost | $50 | $10 (-80%) |
| Throughput | 100 req/s | 400+ req/s (+300%) |
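As a sanity check on the latency row, the average at an 80% hit rate is a weighted sum of cache-hit and cache-miss latency: the ~150ms figure is consistent with a cache lookup of roughly 60ms, since 0.8 × 62.5ms + 0.2 × 500ms ≈ 150ms. Your actual numbers depend on your hit rate, model, and network.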
Complete Example: Chatbot with Caching
public class CachedChatbot {
private final SemanticCache semanticCache;
private final EmbeddingsCache embeddingsCache;
private final LLMClient llmClient;
public CachedChatbot(UnifiedJedis jedis, BaseVectorizer vectorizer) {
this.semanticCache = new SemanticCache(
jedis, vectorizer, "chat-cache", 0.9
);
this.embeddingsCache = new EmbeddingsCache(
jedis, vectorizer, "chat-embeddings"
);
this.llmClient = new LLMClient();
}
public String chat(String userMessage, List<String> conversationHistory) {
// Check semantic cache
Optional<CacheHit> hit = semanticCache.check(userMessage);
if (hit.isPresent()) {
return hit.get().getResponse();
}
// Build context with cached embeddings
List<float[]> historyEmbeddings = embeddingsCache.embed(conversationHistory);
// Call LLM with context
String response = llmClient.complete(userMessage, conversationHistory);
// Cache the response
semanticCache.store(userMessage, response);
// Cache embeddings for this message
embeddingsCache.embed(userMessage);
embeddingsCache.embed(response);
return response;
}
public void showStats() {
System.out.println("=== Cache Statistics ===");
System.out.println("Semantic cache size: " + semanticCache.size());
System.out.println("Embeddings cache size: " + embeddingsCache.size());
}
}
Next Steps
- Vectorizers - Learn about embedding generation options
- Hybrid Queries - Combine caching with vector search
- Hash vs JSON - Understand storage options