How to Implement AI Response Caching
Cache repeated AI queries to cut costs and improve response times.
Jay Banlasan
The AI Systems Guy
AI response caching is one of the fastest wins I found when scaling AI systems for clients. The idea is simple: if the same query comes in more than once, you return the stored result instead of paying for another API call. I have seen caching cut monthly API costs by 40-60% in production systems where users ask overlapping questions.
For businesses running AI at scale, this is not optional. A support bot fielding 500 messages a day will get the same 20 questions over and over. Without a cache layer, you pay full price every time. With one, you pay once and serve the stored answer until it expires.
What You Need Before Starting
- Python 3.9+
- An OpenAI or Anthropic API key
- Redis (local or managed, like Redis Cloud free tier) or SQLite for simpler setups
- The redis and openai Python packages (hashlib ships with the standard library)
Step 1: Choose Your Cache Backend
Redis is the right choice for production. It handles expiry natively and works across multiple processes. SQLite works fine for single-process scripts or local development.
Install the Redis client:

```
pip install redis openai
```
For SQLite-only setups (no extra infra needed):

```
pip install openai
```
Step 2: Build the Cache Key
The cache key needs to uniquely represent the query. I hash the model name plus the exact prompt text. This prevents collisions when you use different models or system prompts.
```python
import hashlib
import json

def make_cache_key(model: str, messages: list, temperature: float = 0) -> str:
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```
Note: only cache requests at temperature 0. Higher temperatures mean you want variety, so caching defeats the purpose.
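To sanity-check the key function, here is a quick standalone demo (the message content is a placeholder). Changing the model changes the key, while an identical request reproduces the same key:

```python
import hashlib
import json

def make_cache_key(model: str, messages: list, temperature: float = 0) -> str:
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

msgs = [{"role": "user", "content": "What are your hours?"}]

key_a = make_cache_key("gpt-4o-mini", msgs)
key_b = make_cache_key("gpt-4o", msgs)       # different model
key_c = make_cache_key("gpt-4o-mini", msgs)  # identical request

print(key_a != key_b)  # different models produce different keys
print(key_a == key_c)  # same request reproduces the same key
```

The `sort_keys=True` in the JSON dump matters: it keeps the serialization stable so logically identical requests always hash to the same key.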
Step 3: Build the Cache Layer with Redis
```python
import redis
import json
import openai
import hashlib

client = openai.OpenAI(api_key="YOUR_API_KEY")
cache = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

CACHE_TTL_SECONDS = 86400  # 24 hours

def make_cache_key(model, messages, temperature=0):
    payload = json.dumps({"model": model, "messages": messages, "temperature": temperature}, sort_keys=True)
    return "ai_cache:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list, temperature: float = 0) -> str:
    # Only cache deterministic requests
    if temperature > 0:
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=temperature
        )
        return response.choices[0].message.content

    key = make_cache_key(model, messages, temperature)
    cached = cache.get(key)
    if cached:
        print("[CACHE HIT]")
        return cached

    print("[CACHE MISS] Calling API...")
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature
    )
    result = response.choices[0].message.content
    cache.setex(key, CACHE_TTL_SECONDS, result)
    return result
```
Step 4: SQLite Fallback for Simple Projects
If you do not want to run Redis, this SQLite version works for scripts and small bots:
```python
import sqlite3
import hashlib
import json
import openai
from datetime import datetime, timedelta
from typing import Optional

DB_PATH = "ai_cache.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cache (
            key TEXT PRIMARY KEY,
            response TEXT,
            created_at TEXT
        )
    """)
    conn.commit()
    conn.close()

def get_cached(key: str, ttl_hours: int = 24) -> Optional[str]:
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute("SELECT response, created_at FROM cache WHERE key = ?", (key,)).fetchone()
    conn.close()
    if not row:
        return None
    created = datetime.fromisoformat(row[1])
    if datetime.now() - created > timedelta(hours=ttl_hours):
        return None  # entry expired
    return row[0]

def set_cached(key: str, response: str):
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, response, created_at) VALUES (?, ?, ?)",
        (key, response, datetime.now().isoformat())
    )
    conn.commit()
    conn.close()

init_db()
client = openai.OpenAI(api_key="YOUR_API_KEY")

def cached_completion(model: str, messages: list) -> str:
    key = "ai:" + hashlib.sha256(json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()).hexdigest()
    cached = get_cached(key)
    if cached:
        return cached
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    set_cached(key, result)
    return result
```

Note the Optional[str] annotation: the str | None syntax only works on Python 3.10+, and the prerequisites above start at 3.9.
Step 5: Add Cache Hit Logging
Track your hit rate so you can see the savings. A 70%+ hit rate on a support bot is normal once it has been running for a week.
```python
# Builds on the Redis client, cache, and make_cache_key from Step 3
stats = {"hits": 0, "misses": 0, "cost_saved": 0.0}

def cached_completion_with_stats(model, messages, cost_per_call=0.002):
    key = make_cache_key(model, messages)
    cached = cache.get(key)
    if cached:
        stats["hits"] += 1
        stats["cost_saved"] += cost_per_call
        return cached
    stats["misses"] += 1
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    cache.setex(key, CACHE_TTL_SECONDS, result)
    return result

def print_stats():
    total = stats["hits"] + stats["misses"]
    hit_rate = (stats["hits"] / total * 100) if total > 0 else 0
    print(f"Hit rate: {hit_rate:.1f}% | Saved: ${stats['cost_saved']:.4f}")
```
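As a quick sanity check on the math, here is the stats logic run against hypothetical traffic numbers (the counts and the $0.002 per-call cost are made up for illustration):

```python
stats = {"hits": 0, "misses": 0, "cost_saved": 0.0}

def print_stats():
    total = stats["hits"] + stats["misses"]
    hit_rate = (stats["hits"] / total * 100) if total > 0 else 0
    print(f"Hit rate: {hit_rate:.1f}% | Saved: ${stats['cost_saved']:.4f}")

# Simulate a week of traffic: 350 hits out of 500 requests at $0.002 a call
stats["hits"] = 350
stats["misses"] = 150
stats["cost_saved"] = 350 * 0.002

print_stats()  # Hit rate: 70.0% | Saved: $0.7000
```

That 70% hit rate is the benchmark mentioned above; at real per-call prices the weekly savings scale accordingly.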
Step 6: Set Cache Invalidation Rules
Not everything should be cached forever. Use shorter TTLs for time-sensitive content and longer TTLs for stable reference answers.
```python
TTL_BY_TYPE = {
    "faq": 7 * 86400,        # 7 days
    "product_info": 86400,   # 24 hours
    "news_summary": 3600,    # 1 hour
    "greeting": 30 * 86400,  # 30 days
}

def cached_completion_typed(model, messages, cache_type="faq"):
    ttl = TTL_BY_TYPE.get(cache_type, 86400)
    key = make_cache_key(model, messages) + f":{cache_type}"
    cached = cache.get(key)
    if cached:
        return cached
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    cache.setex(key, ttl, result)
    return result
```
What to Build Next
- Add a cache warming script that pre-populates your top 50 most common queries at deploy time
- Build a cache dashboard that shows hit rates per query category so you know which content is worth pre-warming
- Implement semantic caching using embeddings to match near-identical queries, not just exact matches
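To make the semantic-caching idea concrete, here is a minimal sketch of the matching logic. Everything in it is illustrative: toy_embed is a bag-of-words stand-in for a real embedding model, and the 0.85 threshold is a placeholder you would tune. In production you would swap in an embeddings endpoint and a proper vector index instead of a linear scan.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector
    # is enough to demonstrate similarity-based matching.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

semantic_cache = []  # list of (embedding, response) pairs

def semantic_lookup(query: str, threshold: float = 0.85):
    # Return a cached response if any stored query is close enough
    q = toy_embed(query)
    for emb, response in semantic_cache:
        if cosine_similarity(q, emb) >= threshold:
            return response
    return None

def semantic_store(query: str, response: str):
    semantic_cache.append((toy_embed(query), response))

semantic_store("what are your opening hours", "We are open 9-5, Monday to Friday.")
print(semantic_lookup("what are your opening hours today"))  # near-match: cache hit
print(semantic_lookup("how do refunds work"))                # no match: None
```

The payoff over exact-match hashing is that "what are your opening hours" and "what are your opening hours today" now share one cached answer instead of costing two API calls.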
Related Reading
- How to Write System Prompts That Control AI Behavior - pairs well because consistent prompts improve cache hit rates
- How to Build AI Guardrails for Safe Outputs - layer guardrails on top of your cache to keep responses clean
- How to Build Persona-Based AI Assistants - persona systems benefit most from caching repeated greeting and FAQ patterns
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment