Prompt Caching Patterns

Implement caching strategies that cut LLM cost and latency. Provider-level prompt caching alone can reduce the price of cached input tokens by up to 90%.

When to Use

Same or similar prompts are sent repeatedly
Large system prompts are reused across requests
Responses can be reused for identical queries
Common requests need lower latency
High-volume applications need lower costs

Caching Strategies

1. Provider-Level Caching (Anthropic)

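Anthropic offers built-in prompt caching. Cached input tokens are billed at about 10% of the base input price, a 90% reduction on those tokens rather than on the whole request.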

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

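// Large system context that will be reused. It must meet the model's minimum
// cacheable length, or the request is processed without caching.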
const systemContext = `
[Your long system prompt, documentation, examples, etc.]
This can be many thousands of tokens that you want to cache.
`;

async function queryWithCache(userQuestion: string) {
  const response = await client.messages.create({
    // Prompt caching is only available on some models; check the provider's supported-model list
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: systemContext,
        cache_control: { type: 'ephemeral' } // Cached for 5 minutes; the TTL refreshes on each cache read
      }
    ],
    messages: [
      { role: 'user', content: userQuestion }
    ]
  });

  // Check cache usage
  console.log('Cache read tokens:', response.usage.cache_read_input_tokens);
  console.log('Cache creation tokens:', response.usage.cache_creation_input_tokens);

  return response;
}

Pricing with cache:

Cache write: 25% more than base input price
Cache read: 90% less than base input price
Break-even: the second request with the same cached content (1.25 + 0.10 = 1.35x the base cost, versus 2x uncached)
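The break-even figure above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the multipliers from the pricing list, expresses cost in units of "one uncached read of the prefix", and assumes every request arrives while the cache entry is still alive. The function and constant names are made up for this example.

```typescript
// Multipliers taken from the pricing list above.
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the cache
const CACHE_READ_MULTIPLIER = 0.1;   // later requests read from it

// Relative cost of sending the same prefix `requests` times.
function prefixCost(requests: number, cached: boolean): number {
  if (requests <= 0) return 0;
  if (!cached) return requests; // every request pays the full base price
  return CACHE_WRITE_MULTIPLIER + (requests - 1) * CACHE_READ_MULTIPLIER;
}

for (const n of [1, 2, 10]) {
  const saved = 1 - prefixCost(n, true) / prefixCost(n, false);
  console.log(`${n} request(s): ${(saved * 100).toFixed(1)}% saved on the prefix`);
}
```

A single request costs more with caching than without, because only the write premium is paid. Savings start with the second request and grow toward the 90% ceiling as the number of cache reads increases.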

2. Response Caching

Cache LLM responses for identical or similar queries.

import * as crypto from 'crypto';

interface CacheEntry {
  response: string;
  createdAt: number;
  ttlMs: number;
  // Optional: set() accepts entries without metadata
  metadata?: {
    model: string;
    inputTokens: number;
    outputTokens: number;
  };
}

class ResponseCache {
  private cache = new Map<string, CacheEntry>();

  private hashPrompt(prompt: string): string {
    // Simple hash for exact matching
    return crypto.createHash('sha256').update(prompt).digest('hex');
  }

  get(prompt: string): string | null {
    const key = this.hashPrompt(prompt);
    const entry = this.cache.get(key);

    if (!entry) return null;

    // Check TTL
    if (Date.now() - entry.createdAt > entry.ttlMs) {
      this.cache.delete(key);
      return null;
    }

    return entry.response;
  }

  set(prompt: string, response: string, options: { ttlMs?: number; metadata?: any } = {}): void {
    const key = this.hashPrompt(prompt);
    this.cache.set(key, {
      response,
      createdAt: Date.now(),
      ttlMs: options.ttlMs ?? 3600000, // 1 hour default; ?? keeps an explicit ttlMs of 0
      metadata: options.metadata
    });
  }
}

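// Usage
const cache = new ResponseCache();

// `llm` is a placeholder for your own client wrapper; it is assumed to
// expose complete(prompt) and resolve to the response text.
async function cachedQuery(prompt: string): Promise<string> {
  // Check cache first; compare against null so an empty cached response still counts as a hit
  const cached = cache.get(prompt);
  if (cached !== null) {
    console.log('Cache hit!');
    return cached;
  }

  // Make API call
  const response = await llm.complete(prompt);

  // Cache the response
  cache.set(prompt, response, { ttlMs: 3600000 });

  return response;
}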
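The exact-match cache above keys on the prompt text alone, so the same prompt sent to a different model or with different sampling settings would share one entry. One possible remedy is to build the key from everything that affects the output. The sketch below is an assumption rather than part of the cache above: `RequestParams`, its fields, and `buildCacheKey` are hypothetical names.

```typescript
// Hypothetical set of request settings that influence the response.
interface RequestParams {
  model: string;
  temperature: number;
  maxTokens: number;
}

// Build a deterministic key string from the prompt plus its settings.
function buildCacheKey(prompt: string, params: RequestParams): string {
  // A fixed field order keeps the key stable no matter how `params` was constructed.
  return JSON.stringify([params.model, params.temperature, params.maxTokens, prompt]);
}

const keyA = buildCacheKey('Summarize this text', { model: 'model-a', temperature: 0, maxTokens: 256 });
const keyB = buildCacheKey('Summarize this text', { model: 'model-b', temperature: 0, maxTokens: 256 });
console.log(keyA === keyB); // false: different models get different entries
```

The resulting string can be passed to the cache in place of the raw prompt. With a temperature above 0 the model's output varies between calls, so a cached response pins one sample for the lifetime of the entry.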

3. Semantic Caching

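Cache based on meaning rather than exact match: embed each prompt and reuse a stored response when a new prompt's embedding is close enough to a cached one.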

// Older LangChain releases exported this class from 'langchain/embeddings/openai'
import { OpenAIEmbeddings } from '@langchain/openai';

class SemanticCache {
  private entries: { embedding: number[]; response: string; prompt: string }[] = [];
  private embeddings: OpenAIEmbeddings;
  private similarityThreshold = 0.95;

  constructor() {
    this.embeddings = new OpenAIEmbeddings();
  }

  async get(prompt: string): Promise<string | null> {
    const queryEmbedding = await this.embeddings.embedQuery(prompt);

    // Find most similar cached prompt
    let bestMatch: { similarity: number; response: string } | null = null;

    for (const entry of this.entries) {
      const similarity = this.cosineSimilarity(queryEmbedding, entry.embedding);

      if (similarity > this.similarityThreshold) {
        if (!bestMatch || similarity > bestMatch.similarity) {
          bestMatch = { similarity, response: entry.response };
        }
      }
    }

    return bestMatch?.response ?? null;
  }

  async set(prompt: string, response: string): Promise<void> {
    const embedding = await this.embeddings.embedQuery(prompt);
    this.entries.push({ embedding, response, prompt });
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }

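    // Return 0 for zero-length vectors instead of dividing by zero
    const denominator = Math.sqrt(normA) * Math.sqrt(normB);
    return denominator === 0 ? 0 : dotProduct / denominator;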
  }
}

// Usage
const semanticCache = new SemanticCache();

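// Once the first prompt is cached, a close paraphrase can be served from the cache.
// Whether it clears the 0.95 threshold depends on the embedding model.
// await semanticCache.set('What is the capital of France?', 'Paris');
// await semanticCache.get("What's France's capital city?"); // hit if similarity > 0.95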
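The threshold comparison inside `SemanticCache.get` can be seen in isolation with hand-made vectors. The sketch below uses toy three-dimensional "embeddings" invented for illustration; real embedding models return far more dimensions, and the similarity scores they produce for paraphrases vary by model.

```typescript
// Standalone cosine similarity, returning 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Toy vectors: two point in nearly the same direction, one does not.
const cachedPrompt = [0.9, 0.1, 0.0];
const paraphrase = [0.85, 0.15, 0.05];
const unrelated = [0.1, 0.2, 0.95];

const threshold = 0.95;
console.log(cosineSimilarity(cachedPrompt, paraphrase) > threshold); // true
console.log(cosineSimilarity(cachedPrompt, unrelated) > threshold);  // false
```

A threshold that is too low returns answers to questions that only look similar, while one that is too high rarely hits. The right value is best tuned against real traffic for the embedding model in use.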

4. Template Caching

Cache static parts, vary dynamic parts.

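import * as crypto from 'crypto';

interface PromptTemplate {
  staticPart: string;
  dynamicParts: string[];
}

class TemplateCache {
  private templates = new Map<string, {
    staticPartHash: string;
    responses: Map<string, string>; // dynamicHash -> response
  }>();

  // SHA-256 helper used for both the template key and the values key
  private hash(input: string): string {
    return crypto.createHash('sha256').update(input).digest('hex');
  }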

  generateKey(template: PromptTemplate, values: Record<string, string>): {
    templateKey: string;
    valuesKey: string;
  } {
    const templateKey = this.hash(template.staticPart);
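    // Sort entries so the key does not depend on property order in `values`
    const sortedValues = Object.entries(values).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    const valuesKey = this.hash(JSON.stringify(sortedValues));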
    return { templateKey, valuesKey };
  }

  get(template: PromptTemplate, values: Record<string, string>): string | null {
    const { templateKey, valuesKey } = this.generateKey(template, values);
    return this.templates.get(templateKey)?.responses.get(valuesKey) ?? null;
  }

  set(template: PromptTemplate, values: Record<string, string>, response: string): void {
    const { templateKey, valuesKey } = this.generateKey(template, values);

    if (!this.templates.has(templateKey)) {
      this.templates.set(templateKey, {
        staticPartHash: templateKey,
        responses: new Map()
      });
    }

    this.templates.get(templateKey)!.responses.set(valuesKey, response);
  }
}

// Usage
const template: PromptTemplate = {
  staticPart: `You are a helpful assistant that translates text.
    Translate the following to the target language.
    Be accurate and natural.`,
  dynamicParts: ['text', 'targetLanguage']
};

const templateCache = new TemplateCache();

// Returns the stored response for the same text + language combination, or null on a miss
const cached = templateCache.get(template, {
  text: 'Hello world',
  targetLanguage: 'Spanish'
});

Redis-Based Distributed Cache

import * as crypto from 'crypto';
import Redis from 'ioredis';

class DistributedPromptCache {
  private redis: Redis;
  private prefix = 'llm:cache:';

  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }

  private key(prompt: string): string {
    const hash = crypto.createHash('sha256').update(prompt).digest('hex');
    return `${this.prefix}${hash}`;
  }

  async get(prompt: string): Promise<string | null> {
    const cached = await this.redis.get(this.key(prompt));
    if (cached) {
      await this.redis.hincrby(`${this.prefix}stats`, 'hits', 1);
    } else {
      await this.redis.hincrby(`${this.prefix}stats`, 'misses', 1);
    }
    return cached;
  }

  async set(prompt: string, response: string, ttlSeconds: number = 3600): Promise<void> {
    await this.redis.setex(this.key(prompt), ttlSeconds, response);
  }

  async getStats(): Promise<{ hits: number; misses: number; hitRate: number }> {
    const stats = await this.redis.hgetall(`${this.prefix}stats`);
    const hits = parseInt(stats.hits || '0', 10);
    const misses = parseInt(stats.misses || '0', 10);
    const total = hits + misses;

    return {
      hits,
      misses,
      hitRate: total > 0 ? hits / total : 0
    };
  }
}

Cache Invalidation

interface CachePolicy {
  ttlMs: number;
  invalidateOn: string[]; // Events that invalidate cache
  tags: string[]; // For tag-based invalidation
}

class SmartCache {
  private cache = new Map<string, { value: string; policy: CachePolicy; createdAt: number }>();
  private tagIndex = new Map<string, Set<string>>(); // tag -> keys

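  set(key: string, value: string, policy: CachePolicy): void {
    this.cache.set(key, { value, policy, createdAt: Date.now() });

    // Index by tags
    for (const tag of policy.tags) {
      if (!this.tagIndex.has(tag)) {
        this.tagIndex.set(tag, new Set());
      }
      this.tagIndex.get(tag)!.add(key);
    }
  }

  get(key: string): string | null {
    const entry = this.cache.get(key);
    if (!entry) return null;

    // Enforce the policy's TTL on read
    if (Date.now() - entry.createdAt > entry.policy.ttlMs) {
      this.cache.delete(key);
      return null;
    }

    return entry.value;
  }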

  invalidateByTag(tag: string): number {
    const keys = this.tagIndex.get(tag) || new Set();
    let count = 0;

    for (const key of keys) {
      if (this.cache.delete(key)) count++;
    }

    this.tagIndex.delete(tag);
    return count;
  }

  invalidateByEvent(event: string): number {
    let count = 0;

    for (const [key, entry] of this.cache) {
      if (entry.policy.invalidateOn.includes(event)) {
        this.cache.delete(key);
        count++;
      }
    }

    return count;
  }
}

// Usage (`response` stands for an LLM output obtained elsewhere)
const smartCache = new SmartCache();

smartCache.set('user:123:summary', response, {
  ttlMs: 3600000,
  invalidateOn: ['user:123:updated', 'user:123:deleted'],
  tags: ['user:123', 'summaries']
});

// When user updates their profile
smartCache.invalidateByEvent('user:123:updated');

// Or invalidate all summaries
smartCache.invalidateByTag('summaries');

Cost Savings Calculator

function calculateCacheSavings(
  stats: { hits: number; misses: number },
  avgInputTokens: number,
  avgOutputTokens: number,
  pricing: { inputPer1M: number; outputPer1M: number }
): {
  withoutCache: number;
  withCache: number;
  savings: number;
  savingsPercent: number;
} {
  const totalRequests = stats.hits + stats.misses;

  // Without cache: all requests hit API
  const withoutCache = totalRequests * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  // With cache: only misses hit API
  const withCache = stats.misses * (
    (avgInputTokens / 1_000_000) * pricing.inputPer1M +
    (avgOutputTokens / 1_000_000) * pricing.outputPer1M
  );

  return {
    withoutCache,
    withCache,
    savings: withoutCache - withCache,
    // Guard against division by zero when there were no requests
    savingsPercent: withoutCache > 0 ? ((withoutCache - withCache) / withoutCache) * 100 : 0
  };
}
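Because the calculator above charges every API call the same average cost, its savings percentage reduces to the cache hit rate. The short derivation below makes that explicit, writing h for `stats.hits`, m for `stats.misses`, and c for the average cost of one API call (symbols introduced here for illustration):

```latex
\begin{aligned}
c &= \frac{\text{avgInputTokens}}{10^{6}} \cdot \text{inputPer1M}
   + \frac{\text{avgOutputTokens}}{10^{6}} \cdot \text{outputPer1M} \\
\text{withoutCache} &= (h + m)\,c, \qquad \text{withCache} = m\,c \\
\text{savingsPercent} &= \frac{(h + m)\,c - m\,c}{(h + m)\,c} \times 100
                       = \frac{h}{h + m} \times 100
\end{aligned}
```

For example, 800 hits and 200 misses give a savings of 80%, whatever the token prices are. The figure is an upper bound in practice, since it leaves out the cost of running the cache itself and any embedding calls made for semantic lookups.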

Best Practices

Cache at the right level - Response, prompt part, or embedding
Set appropriate TTLs - Balance freshness vs. savings
Monitor hit rates - Low hit rate means cache isn't helping
Invalidate intelligently - Don't serve stale data
Use semantic caching carefully - Embedding costs add up
Warm the cache - Pre-populate for known queries
Consider cache size - Memory isn't free either
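The in-memory caches in this document grow without limit, which is where the last practice comes in. One common way to cap memory is least-recently-used (LRU) eviction. The sketch below is an assumption, not part of the classes above: `LruCache` and `maxEntries` are hypothetical names, and it relies on `Map` preserving insertion order.

```typescript
// Size-bounded cache with least-recently-used eviction.
// Map iterates in insertion order, so the first key is always the oldest.
class LruCache {
  private store = new Map<string, string>();

  constructor(private maxEntries: number) {}

  get(key: string): string | null {
    const value = this.store.get(key);
    if (value === undefined) return null;
    // Re-insert to mark the entry as most recently used.
    this.store.delete(key);
    this.store.set(key, value);
    return value;
  }

  set(key: string, value: string): void {
    // Remove any existing entry first so the key moves to the newest position.
    if (this.store.has(key)) this.store.delete(key);
    this.store.set(key, value);
    if (this.store.size > this.maxEntries) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
  }

  get size(): number {
    return this.store.size;
  }
}

const lru = new LruCache(2);
lru.set('a', '1');
lru.set('b', '2');
lru.get('a');      // 'a' becomes the most recently used entry
lru.set('c', '3'); // over capacity: evicts 'b'
console.log(lru.get('b')); // null
console.log(lru.get('a')); // '1'
```

Counting entries is a rough proxy for memory, because LLM responses vary widely in length. A production cache might bound total bytes instead, or delegate eviction to a store such as Redis.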