Every LLM call costs money and takes time. If you're calling the API with the same inputs more than once, you're wasting both.
Cache it.
When to Cache
Good candidates for caching:
- Embeddings - Same text = same vector
- Classifications - Same input = same category
- Translations - Same source = same translation
- Summaries - Same document = same summary
Bad candidates:
- Chat conversations (context changes)
- Creative generation (want variety)
- Time-sensitive queries
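One way to make this split concrete is a small guard in front of your cache. A minimal sketch, assuming a hypothetical CallOptions shape on your LLM wrapper; the task list and rules are yours to tune:

// Hypothetical request shape -- adjust to whatever your LLM wrapper uses.
interface CallOptions {
  task: "embedding" | "classification" | "translation" | "summary" | "chat" | "creative";
  temperature: number;
  timeSensitive?: boolean;
}

const CACHEABLE_TASKS = new Set(["embedding", "classification", "translation", "summary"]);

function shouldCache(opts: CallOptions): boolean {
  // Only cache deterministic, repeatable work; skip chat, creative, and time-sensitive calls.
  return CACHEABLE_TASKS.has(opts.task) && opts.temperature === 0 && !opts.timeSensitive;
}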
Basic Implementation
import crypto from "node:crypto";
import { LRUCache } from "lru-cache";

// Keep up to 1,000 responses, each for at most an hour.
const cache = new LRUCache<string, string>({
  max: 1000,
  ttl: 1000 * 60 * 60 // 1 hour
});
async function cachedComplete(prompt: string): Promise<string> {
  const key = createCacheKey(prompt);
  const cached = cache.get(key);
  if (cached) {
    console.log("Cache hit!");
    return cached;
  }
  const response = await llm.complete(prompt);
  cache.set(key, response);
  return response;
}
function createCacheKey(prompt: string): string {
  // Include the model name and any config that affects output
  return crypto.createHash("sha256")
    .update(JSON.stringify({ prompt, model: "gpt-4", temp: 0 }))
    .digest("hex");
}
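Calling it twice with the same prompt exercises both paths; the second call returns from memory instead of hitting the API:

const first = await cachedComplete("Classify the sentiment: 'great product'");
const second = await cachedComplete("Classify the sentiment: 'great product'"); // logs "Cache hit!"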
Redis for Production
In-memory caches don't survive restarts. Use Redis for production:
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

async function cachedComplete(prompt: string): Promise<string> {
  const key = `llm:${createCacheKey(prompt)}`;
  const cached = await redis.get(key);
  if (cached) return cached;
  const response = await llm.complete(prompt);
  await redis.setex(key, 3600, response); // 1 hour TTL
  return response;
}
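One detail worth adding in production: if Redis is unreachable, fall through to the model instead of failing the request. A sketch of that fail-open pattern, reusing the same placeholder llm.complete and the createCacheKey helper above:

async function cachedCompleteSafe(prompt: string): Promise<string> {
  const key = `llm:${createCacheKey(prompt)}`;
  try {
    const cached = await redis.get(key);
    if (cached) return cached;
  } catch (err) {
    console.warn("Cache read failed, falling through to the model", err);
  }
  const response = await llm.complete(prompt);
  // Best-effort write: a cache outage should never break the request.
  redis.setex(key, 3600, response).catch(() => {});
  return response;
}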
Caching Embeddings
Embeddings are perfect for caching - deterministic and expensive:
import OpenAI from "openai";

const openai = new OpenAI();

// Content hash used for embedding and invalidation keys below.
function hash(text: string): string {
  return crypto.createHash("sha256").update(text).digest("hex");
}

async function getEmbedding(text: string): Promise<number[]> {
  const key = `embed:${hash(text)}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const embedding = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: text
  });
  const vector = embedding.data[0].embedding;
  await redis.setex(key, 86400 * 7, JSON.stringify(vector)); // 7 days
  return vector;
}
Embedding the same document twice? That's just burning money.
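The same pattern extends to batches: check the cache for every text first, then embed only the misses in a single API call. A sketch, assuming the redis, openai, and hash pieces above:

async function getEmbeddings(texts: string[]): Promise<number[][]> {
  if (texts.length === 0) return [];
  const keys = texts.map((t) => `embed:${hash(t)}`);
  const cached = await redis.mget(...keys);

  // Indices of texts that were not in the cache.
  const missingIdx = cached
    .map((hit, i) => (hit ? -1 : i))
    .filter((i) => i !== -1);

  // Start from the cache hits; fill the gaps from the API below.
  const vectors: number[][] = cached.map((hit) => (hit ? JSON.parse(hit) : []));

  if (missingIdx.length > 0) {
    const response = await openai.embeddings.create({
      model: "text-embedding-ada-002",
      input: missingIdx.map((i) => texts[i])
    });
    await Promise.all(
      response.data.map((d, j) => {
        const i = missingIdx[j];
        vectors[i] = d.embedding;
        return redis.setex(keys[i], 86400 * 7, JSON.stringify(d.embedding));
      })
    );
  }
  return vectors;
}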
Semantic Caching (Advanced)
What if inputs are similar but not identical? Use embedding similarity:
async function semanticCache(query: string): Promise<string | null> {
  const queryEmbedding = await getEmbedding(query);
  // Search the vector store for similar cached queries
  const results = await vectorDB.search(queryEmbedding, {
    topK: 1,
    minScore: 0.95 // very similar threshold
  });
  if (results.length > 0) {
    return results[0].metadata.response;
  }
  return null;
}
This catches variations like "What's the weather?" vs "What's the weather today?" as cache hits.
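The lookup only pays off if you also write entries on a miss. A sketch of the write path, assuming the placeholder vectorDB exposes an upsert method; your vector store's API will differ:

async function semanticComplete(query: string): Promise<string> {
  const hit = await semanticCache(query);
  if (hit) return hit;

  const response = await llm.complete(query);
  // Store the query's embedding next to the response so future
  // near-duplicate phrasings resolve as cache hits.
  // getEmbedding is itself cached, so the repeated call here is cheap.
  await vectorDB.upsert({
    id: hash(query),
    vector: await getEmbedding(query),
    metadata: { query, response }
  });
  return response;
}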
Cost Tracking
Know how much you're saving:
const stats = {
  hits: 0,
  misses: 0,
  savedTokens: 0,
  savedCost: 0
};

async function cachedWithStats(prompt: string) {
  const key = createCacheKey(prompt);
  const cached = cache.get(key);
  if (cached) {
    stats.hits++;
    const tokens = estimateTokens(prompt);
    stats.savedTokens += tokens;
    stats.savedCost += tokens * 0.00003; // GPT-4 input cost (~$0.03 per 1K tokens)
    return cached;
  }
  stats.misses++;
  const response = await llm.complete(prompt);
  cache.set(key, response);
  return response;
}

// Log periodically
setInterval(() => {
  const total = stats.hits + stats.misses;
  if (total === 0) return;
  const hitRate = stats.hits / total;
  console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`);
  console.log(`Saved: $${stats.savedCost.toFixed(2)}`);
}, 60000);
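estimateTokens is assumed above; for cost tracking a rough character-count heuristic is usually close enough, though a real tokenizer (e.g. tiktoken) is more accurate:

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for typical English text.
  return Math.ceil(text.length / 4);
}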
Cache Invalidation
The two hard problems: naming things and cache invalidation.
// Invalidate when source data changes
async function updateDocument(docId: string, content: string) {
  await db.update(docId, content);
  // Clear related caches
  await redis.del(`summary:${docId}`);
  await redis.del(`embed:${docId}`);
}

// Or use content-based keys that auto-invalidate
function createContentKey(content: string) {
  return `llm:${hash(content)}`; // New content = new key
}
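Content-based keys work nicely for derived artifacts like summaries: when the document changes, the key changes with it, and the stale entry simply expires on its own. A sketch using the helpers above; the summarization prompt is illustrative:

async function cachedSummary(content: string): Promise<string> {
  const key = createContentKey(content);
  const cached = await redis.get(key);
  if (cached) return cached;

  const summary = await llm.complete(`Summarize the following document:\n\n${content}`);
  await redis.setex(key, 86400, summary); // stale versions just age out
  return summary;
}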
Quick Wins
- Cache embeddings - Biggest bang for buck
- Cache repeated prompts - Classification, extraction, etc.
- Set reasonable TTLs - Hours for static content, minutes for dynamic
- Monitor hit rates - Should be >50% or the cache isn't helping
- Warm the cache - Pre-compute common queries (see the sketch below)
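A minimal warming pass just runs your most common prompts through the cached path at deploy time or on a schedule. The query list here is a hypothetical stand-in for whatever your logs say users actually ask:

const COMMON_QUERIES = [
  "What is your refund policy?",
  "How do I reset my password?",
  "What are your support hours?"
];

async function warmCache() {
  // Run common prompts through the cached path so the first real user gets a hit.
  for (const query of COMMON_QUERIES) {
    await cachedComplete(query);
  }
}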
LLM calls at scale get expensive fast. A good caching strategy can cut your bill in half while making everything faster.
