Every LLM call costs money and takes time. If you're calling the API with the same inputs more than once, you're wasting both.
Cache it.
When to Cache
Good candidates for caching:
- Embeddings - Same text = same vector
- Classifications - Same input = same category
- Translations - Same source = same translation
- Summaries - Same document = same summary
Bad candidates:
- Chat conversations (context changes)
- Creative generation (want variety)
- Time-sensitive queries
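One way to make this split concrete is a small guard in front of your cache. A minimal sketch, assuming a hypothetical CallOptions shape on your LLM wrapper; the task list and rules are yours to tune:

// Hypothetical request shape -- adjust to whatever your LLM wrapper uses.
interface CallOptions {
  task: "embedding" | "classification" | "translation" | "summary" | "chat" | "creative";
  temperature: number;
  timeSensitive?: boolean;
}

const CACHEABLE_TASKS = new Set(["embedding", "classification", "translation", "summary"]);

function shouldCache(opts: CallOptions): boolean {
  // Only cache deterministic, repeatable work; skip chat, creative, and time-sensitive calls.
  return CACHEABLE_TASKS.has(opts.task) && opts.temperature === 0 && !opts.timeSensitive;
}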
Basic Implementation
import crypto from "node:crypto";
import { LRUCache } from "lru-cache";

// Keep up to 1,000 responses, each for at most an hour.
const cache = new LRUCache<string, string>({
  max: 1000,
  ttl: 1000 * 60 * 60 // 1 hour
});
async function cachedComplete(prompt: string): Promise<string> {
  const key = createCacheKey(prompt);
  const cached = cache.get(key);
  if (cached) {
    console.log("Cache hit!");
    return cached;
  }
  const response = await llm.complete(prompt);
  cache.set(key, response);
  return response;
}
function createCacheKey(prompt: string): string {
  // Include the model name and any config that affects output
  return crypto.createHash("sha256")
    .update(JSON.stringify({ prompt, model: "gpt-4", temp: 0 }))
    .digest("hex");
}
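Calling it twice with the same prompt exercises both paths; the second call returns from memory instead of hitting the API:

const first = await cachedComplete("Classify the sentiment: 'great product'");
const second = await cachedComplete("Classify the sentiment: 'great product'"); // logs "Cache hit!"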
Redis for Production
In-memory caches don't survive restarts. Use Redis for production:
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

async function cachedComplete(prompt: string): Promise<string> {
  const key = `llm:${createCacheKey(prompt)}`;
  const cached = await redis.get(key);
  if (cached) return cached;
  const response = await llm.complete(prompt);
  await redis.setex(key, 3600, response); // 1 hour TTL
  return response;
}
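One detail worth adding in production: if Redis is unreachable, fall through to the model instead of failing the request. A sketch of that fail-open pattern, reusing the same placeholder llm.complete and the createCacheKey helper above:

async function cachedCompleteSafe(prompt: string): Promise<string> {
  const key = `llm:${createCacheKey(prompt)}`;
  try {
    const cached = await redis.get(key);
    if (cached) return cached;
  } catch (err) {
    console.warn("Cache read failed, falling through to the model", err);
  }
  const response = await llm.complete(prompt);
  // Best-effort write: a cache outage should never break the request.
  redis.setex(key, 3600, response).catch(() => {});
  return response;
}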
Caching Embeddings
Embeddings are perfect for caching - deterministic and expensive:
import OpenAI from "openai";

const openai = new OpenAI();

// Content hash used for embedding and invalidation keys below.
function hash(text: string): string {
  return crypto.createHash("sha256").update(text).digest("hex");
}

async function getEmbedding(text: string): Promise<number[]> {
  const key = `embed:${hash(text)}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const embedding = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: text
  });
  const vector = embedding.data[0].embedding;
  await redis.setex(key, 86400 * 7, JSON.stringify(vector)); // 7 days
  return vector;
}
Embedding the same document twice? That's just burning money.
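The same pattern extends to batches: check the cache for every text first, then embed only the misses in a single API call. A sketch, assuming the redis, openai, and hash pieces above:

async function getEmbeddings(texts: string[]): Promise<number[][]> {
  if (texts.length === 0) return [];
  const keys = texts.map((t) => `embed:${hash(t)}`);
  const cached = await redis.mget(...keys);

  // Indices of texts that were not in the cache.
  const missingIdx = cached
    .map((hit, i) => (hit ? -1 : i))
    .filter((i) => i !== -1);

  // Start from the cache hits; fill the gaps from the API below.
  const vectors: number[][] = cached.map((hit) => (hit ? JSON.parse(hit) : []));

  if (missingIdx.length > 0) {
    const response = await openai.embeddings.create({
      model: "text-embedding-ada-002",
      input: missingIdx.map((i) => texts[i])
    });
    await Promise.all(
      response.data.map((d, j) => {
        const i = missingIdx[j];
        vectors[i] = d.embedding;
        return redis.setex(keys[i], 86400 * 7, JSON.stringify(d.embedding));
      })
    );
  }
  return vectors;
}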
Semantic Caching (Advanced)
What if inputs are similar but not identical? Use embedding similarity:
async function semanticCache(query: string): Promise<string | null> {
  const queryEmbedding = await getEmbedding(query);
  // Search the vector store for similar cached queries
  const results = await vectorDB.search(queryEmbedding, {
    topK: 1,
    minScore: 0.95 // very similar threshold
  });
  if (results.length > 0) {
    return results[0].metadata.response;
  }
  return null;
}
This catches variations like "What's the weather?" vs "What's the weather today?" as cache hits.
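The lookup only pays off if you also write entries on a miss. A sketch of the write path, assuming the placeholder vectorDB exposes an upsert method; your vector store's API will differ:

async function semanticComplete(query: string): Promise<string> {
  const hit = await semanticCache(query);
  if (hit) return hit;

  const response = await llm.complete(query);
  // Store the query's embedding next to the response so future
  // near-duplicate phrasings resolve as cache hits.
  // getEmbedding is itself cached, so the repeated call here is cheap.
  await vectorDB.upsert({
    id: hash(query),
    vector: await getEmbedding(query),
    metadata: { query, response }
  });
  return response;
}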
Cost Tracking
Know how much you're saving:
const stats = {
  hits: 0,
  misses: 0,
  savedTokens: 0,
  savedCost: 0
};

async function cachedWithStats(prompt: string) {
  const key = createCacheKey(prompt);
  const cached = cache.get(key);
  if (cached) {
    stats.hits++;
    const tokens = estimateTokens(prompt);
    stats.savedTokens += tokens;
    stats.savedCost += tokens * 0.00003; // GPT-4 input cost (~$0.03 per 1K tokens)
    return cached;
  }
  stats.misses++;
  const response = await llm.complete(prompt);
  cache.set(key, response);
  return response;
}

// Log periodically
setInterval(() => {
  const total = stats.hits + stats.misses;
  if (total === 0) return;
  const hitRate = stats.hits / total;
  console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`);
  console.log(`Saved: $${stats.savedCost.toFixed(2)}`);
}, 60000);
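estimateTokens is assumed above; for cost tracking a rough character-count heuristic is usually close enough, though a real tokenizer (e.g. tiktoken) is more accurate:

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for typical English text.
  return Math.ceil(text.length / 4);
}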
Cache Invalidation
The two hard problems: naming things and cache invalidation.
// Invalidate when source data changes
async function updateDocument(docId: string, content: string) {
  await db.update(docId, content);
  // Clear related caches
  await redis.del(`summary:${docId}`);
  await redis.del(`embed:${docId}`);
}

// Or use content-based keys that auto-invalidate
function createContentKey(content: string) {
  return `llm:${hash(content)}`; // New content = new key
}
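Content-based keys work nicely for derived artifacts like summaries: when the document changes, the key changes with it, and the stale entry simply expires on its own. A sketch using the helpers above; the summarization prompt is illustrative:

async function cachedSummary(content: string): Promise<string> {
  const key = createContentKey(content);
  const cached = await redis.get(key);
  if (cached) return cached;

  const summary = await llm.complete(`Summarize the following document:\n\n${content}`);
  await redis.setex(key, 86400, summary); // stale versions just age out
  return summary;
}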
Quick Wins
- Cache embeddings - Biggest bang for buck
- Cache repeated prompts - Classification, extraction, etc.
- Set reasonable TTLs - Hours for static content, minutes for dynamic
- Monitor hit rates - Should be >50% or the cache isn't helping
- Warm the cache - Pre-compute common queries (see the sketch below)
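A minimal warming pass just runs your most common prompts through the cached path at deploy time or on a schedule. The query list here is a hypothetical stand-in for whatever your logs say users actually ask:

const COMMON_QUERIES = [
  "What is your refund policy?",
  "How do I reset my password?",
  "What are your support hours?"
];

async function warmCache() {
  // Run common prompts through the cached path so the first real user gets a hit.
  for (const query of COMMON_QUERIES) {
    await cachedComplete(query);
  }
}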
LLM calls at scale get expensive fast. A good caching strategy can cut your bill in half while making everything faster.
