How to Implement AI Response Caching
Cache repeated AI queries to cut costs and improve response times.
Jay Banlasan
The AI Systems Guy
AI response caching is one of the fastest wins I found when scaling AI systems for clients. The idea is simple: if the same query comes in more than once, you return the stored result instead of paying for another API call. I have seen caching cut monthly API costs by 40-60% in production systems where users ask overlapping questions.
For businesses running AI at scale, this is not optional. A support bot fielding 500 messages a day will get the same 20 questions over and over. Without a cache layer, you pay full price every time. With one, you pay once and serve the stored answer until it expires.
What You Need Before Starting
- Python 3.9+
- An OpenAI or Anthropic API key
- Redis (local or managed, like Redis Cloud free tier) or SQLite for simpler setups
- The redis and openai Python packages (hashlib ships with the standard library)
Step 1: Choose Your Cache Backend
Redis is the right choice for production. It handles expiry natively and works across multiple processes. SQLite works fine for single-process scripts or local development.
Install the Redis client:

```
pip install redis openai
```
For SQLite-only setups (no extra infra needed):

```
pip install openai
```
Step 2: Build the Cache Key
The cache key needs to uniquely represent the query. I hash the model name plus the exact prompt text. This prevents collisions when you use different models or system prompts.
```python
import hashlib
import json

def make_cache_key(model: str, messages: list, temperature: float = 0) -> str:
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```
Note: only cache requests at temperature 0. Higher temperatures mean you want variety, so caching defeats the purpose.
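To sanity-check the key function, here is a quick standalone demo (the message content is a placeholder). Changing the model changes the key, while an identical request reproduces the same key:

```python
import hashlib
import json

def make_cache_key(model: str, messages: list, temperature: float = 0) -> str:
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

msgs = [{"role": "user", "content": "What are your hours?"}]

key_a = make_cache_key("gpt-4o-mini", msgs)
key_b = make_cache_key("gpt-4o", msgs)       # different model
key_c = make_cache_key("gpt-4o-mini", msgs)  # identical request

print(key_a != key_b)  # different models produce different keys
print(key_a == key_c)  # same request reproduces the same key
```

The `sort_keys=True` in the JSON dump matters: it keeps the serialization stable so logically identical requests always hash to the same key.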
Step 3: Build the Cache Layer with Redis
```python
import redis
import json
import openai
import hashlib

client = openai.OpenAI(api_key="YOUR_API_KEY")
cache = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

CACHE_TTL_SECONDS = 86400  # 24 hours

def make_cache_key(model, messages, temperature=0):
    payload = json.dumps({"model": model, "messages": messages, "temperature": temperature}, sort_keys=True)
    return "ai_cache:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list, temperature: float = 0) -> str:
    # Only cache deterministic requests
    if temperature > 0:
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=temperature
        )
        return response.choices[0].message.content

    key = make_cache_key(model, messages, temperature)
    cached = cache.get(key)
    if cached:
        print("[CACHE HIT]")
        return cached

    print("[CACHE MISS] Calling API...")
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature
    )
    result = response.choices[0].message.content
    cache.setex(key, CACHE_TTL_SECONDS, result)
    return result
```
Step 4: SQLite Fallback for Simple Projects
If you do not want to run Redis, this SQLite version works for scripts and small bots:
```python
import sqlite3
import hashlib
import json
import openai
from datetime import datetime, timedelta
from typing import Optional

DB_PATH = "ai_cache.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cache (
            key TEXT PRIMARY KEY,
            response TEXT,
            created_at TEXT
        )
    """)
    conn.commit()
    conn.close()

def get_cached(key: str, ttl_hours: int = 24) -> Optional[str]:
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute("SELECT response, created_at FROM cache WHERE key = ?", (key,)).fetchone()
    conn.close()
    if not row:
        return None
    created = datetime.fromisoformat(row[1])
    if datetime.now() - created > timedelta(hours=ttl_hours):
        return None  # entry expired
    return row[0]

def set_cached(key: str, response: str):
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, response, created_at) VALUES (?, ?, ?)",
        (key, response, datetime.now().isoformat())
    )
    conn.commit()
    conn.close()

init_db()
client = openai.OpenAI(api_key="YOUR_API_KEY")

def cached_completion(model: str, messages: list) -> str:
    key = "ai:" + hashlib.sha256(json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()).hexdigest()
    cached = get_cached(key)
    if cached:
        return cached
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    set_cached(key, result)
    return result
```

Note the Optional[str] annotation: the str | None syntax only works on Python 3.10+, and the prerequisites above start at 3.9.
Step 5: Add Cache Hit Logging
Track your hit rate so you can see the savings. A 70%+ hit rate on a support bot is normal once it has been running for a week.
```python
# Builds on the Redis client, cache, and make_cache_key from Step 3
stats = {"hits": 0, "misses": 0, "cost_saved": 0.0}

def cached_completion_with_stats(model, messages, cost_per_call=0.002):
    key = make_cache_key(model, messages)
    cached = cache.get(key)
    if cached:
        stats["hits"] += 1
        stats["cost_saved"] += cost_per_call
        return cached
    stats["misses"] += 1
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    cache.setex(key, CACHE_TTL_SECONDS, result)
    return result

def print_stats():
    total = stats["hits"] + stats["misses"]
    hit_rate = (stats["hits"] / total * 100) if total > 0 else 0
    print(f"Hit rate: {hit_rate:.1f}% | Saved: ${stats['cost_saved']:.4f}")
```
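As a quick sanity check on the math, here is the stats logic run against hypothetical traffic numbers (the counts and the $0.002 per-call cost are made up for illustration):

```python
stats = {"hits": 0, "misses": 0, "cost_saved": 0.0}

def print_stats():
    total = stats["hits"] + stats["misses"]
    hit_rate = (stats["hits"] / total * 100) if total > 0 else 0
    print(f"Hit rate: {hit_rate:.1f}% | Saved: ${stats['cost_saved']:.4f}")

# Simulate a week of traffic: 350 hits out of 500 requests at $0.002 a call
stats["hits"] = 350
stats["misses"] = 150
stats["cost_saved"] = 350 * 0.002

print_stats()  # Hit rate: 70.0% | Saved: $0.7000
```

That 70% hit rate is the benchmark mentioned above; at real per-call prices the weekly savings scale accordingly.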
Step 6: Set Cache Invalidation Rules
Not everything should be cached forever. Use shorter TTLs for time-sensitive content and longer TTLs for stable reference answers.
```python
TTL_BY_TYPE = {
    "faq": 7 * 86400,        # 7 days
    "product_info": 86400,   # 24 hours
    "news_summary": 3600,    # 1 hour
    "greeting": 30 * 86400,  # 30 days
}

def cached_completion_typed(model, messages, cache_type="faq"):
    ttl = TTL_BY_TYPE.get(cache_type, 86400)
    key = make_cache_key(model, messages) + f":{cache_type}"
    cached = cache.get(key)
    if cached:
        return cached
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    result = response.choices[0].message.content
    cache.setex(key, ttl, result)
    return result
```
What to Build Next
- Add a cache warming script that pre-populates your top 50 most common queries at deploy time
- Build a cache dashboard that shows hit rates per query category so you know which content is worth pre-warming
- Implement semantic caching using embeddings to match near-identical queries, not just exact matches
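To make the semantic-caching idea concrete, here is a minimal sketch of the matching logic. Everything in it is illustrative: toy_embed is a bag-of-words stand-in for a real embedding model, and the 0.85 threshold is a placeholder you would tune. In production you would swap in an embeddings endpoint and a proper vector index instead of a linear scan.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector
    # is enough to demonstrate similarity-based matching.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

semantic_cache = []  # list of (embedding, response) pairs

def semantic_lookup(query: str, threshold: float = 0.85):
    # Return a cached response if any stored query is close enough
    q = toy_embed(query)
    for emb, response in semantic_cache:
        if cosine_similarity(q, emb) >= threshold:
            return response
    return None

def semantic_store(query: str, response: str):
    semantic_cache.append((toy_embed(query), response))

semantic_store("what are your opening hours", "We are open 9-5, Monday to Friday.")
print(semantic_lookup("what are your opening hours today"))  # near-match: cache hit
print(semantic_lookup("how do refunds work"))                # no match: None
```

The payoff over exact-match hashing is that "what are your opening hours" and "what are your opening hours today" now share one cached answer instead of costing two API calls.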
Related Reading
- How to Write System Prompts That Control AI Behavior - pairs well because consistent prompts improve cache hit rates
- How to Build AI Guardrails for Safe Outputs - layer guardrails on top of your cache to keep responses clean
- How to Build Persona-Based AI Assistants - persona systems benefit most from caching repeated greeting and FAQ patterns
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment