How to Optimize AI Prompts for Speed
Rewrite prompts to get the same quality output in fewer tokens and less time.
Jay Banlasan
The AI Systems Guy
I had a classification prompt that was 1,200 tokens long because I kept adding context and examples to improve accuracy. Response latency averaged 3.2 seconds. I spent two hours rewriting it using these optimization principles and got it down to 380 tokens, with responses 2.8 seconds faster and nearly identical classification accuracy. That's what learning to optimize prompts for speed and latency is worth when you're making thousands of calls per day.
Prompt optimization is not about making prompts shorter for the sake of it. It's about removing what doesn't add value and structuring what remains so the model reaches the answer faster with fewer tokens. Every token in your prompt costs money and adds latency. Every unnecessary token is a tax you pay on every single call.
What You Need Before Starting
- Python 3.10+
- anthropic SDK and tiktoken for token counting (pip install anthropic tiktoken)
- Baseline prompts you're currently using in production
- A way to measure quality (evaluation function or human review set)
Step 1: Measure Your Baseline
Before optimizing anything, measure where you are. You need both speed and quality baselines.
import anthropic
import time
import tiktoken

_client = anthropic.Anthropic()
_enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer; close enough for rough Anthropic estimates

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def benchmark_prompt(prompt: str, n_runs: int = 5,
                     model: str = "claude-3-5-haiku-latest") -> dict:
    latencies = []
    token_counts = []
    for _ in range(n_runs):
        start = time.time()
        response = _client.messages.create(
            model=model, max_tokens=512,
            messages=[{"role": "user", "content": prompt}]
        )
        latencies.append(time.time() - start)
        token_counts.append(response.usage.input_tokens)
    return {
        "avg_latency_s": round(sum(latencies) / n_runs, 3),
        # With a small n_runs this is effectively the max, not a true p95
        "p95_latency_s": round(sorted(latencies)[min(int(n_runs * 0.95), n_runs - 1)], 3),
        "avg_input_tokens": round(sum(token_counts) / n_runs),
        "prompt_chars": len(prompt),
        # $0.25 per million input tokens (Haiku input pricing) = $0.00025 per 1K tokens
        "est_cost_per_1000_calls": round(sum(token_counts) / n_runs * 0.00025, 4),
    }

# Baseline your current prompt
original_prompt = """Your prompt here..."""
baseline = benchmark_prompt(original_prompt)
print(baseline)
Run this before and after every optimization. Numbers over intuition.
Step 2: Remove Filler and Redundant Instructions
The most common source of prompt bloat is over-explaining. Models don't need paragraphs of context to do simple tasks.
# BEFORE (347 tokens)
bloated = """
You are a helpful AI assistant with expertise in customer service. Your job is to
carefully read the customer message below and determine the sentiment of the message.
The sentiment can be positive, negative, or neutral. Please make sure to consider
the entire message and think about it carefully before responding. It is important
that your response is accurate because this information will be used to route the
customer to the appropriate team. Please respond with only the word that describes
the sentiment: positive, negative, or neutral.
Customer message: {message}
Please provide your sentiment classification:
"""
# AFTER (41 tokens)
lean = """Classify sentiment: positive, negative, or neutral. One word only.
Message: {message}"""
The lean version produces identical classification accuracy, and the 306 tokens saved per call get multiplied by however many times that prompt runs per month.
Common bloat patterns to cut:
- "You are a helpful AI assistant" (models know this)
- "Please make sure to" / "It is important that" (redundant emphasis)
- "Please provide your" / "Your response should be" (just state the format directly)
- Repeated restatements of the same instruction
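The savings from a rewrite like this are easy to quantify before you ever deploy. A minimal sketch, using a rough chars/4 token heuristic instead of a real tokenizer (swap in tiktoken for exact counts) and assuming Haiku's $0.25-per-million input price:

```python
# Rough sketch: estimate monthly savings from a prompt rewrite.
# The chars/4 heuristic and the default price are assumptions for illustration.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 chars per token for English)."""
    return max(1, len(text) // 4)

def monthly_savings(before: str, after: str,
                    calls_per_month: int,
                    usd_per_1k_input_tokens: float = 0.00025) -> dict:
    saved_per_call = estimate_tokens(before) - estimate_tokens(after)
    saved_total = saved_per_call * calls_per_month
    return {
        "tokens_saved_per_call": saved_per_call,
        "tokens_saved_per_month": saved_total,
        "usd_saved_per_month": round(saved_total / 1000 * usd_per_1k_input_tokens, 2),
    }
```

The dollar figure is usually small per prompt; the latency savings across thousands of daily calls is the bigger win.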
Step 3: Use Structured Output Instructions
When you need structured responses, explicit format instructions reduce output tokens and parsing overhead.
# BEFORE — vague, verbose output (model writes long explanations)
slow_format = """
Analyze this lead and tell me about their company, what they might need,
and whether they're a good fit. Be thorough.
"""
# AFTER — structured, fast to parse
fast_format = """Analyze this B2B lead. Return JSON only, no prose:
{
"company_summary": "one sentence",
"likely_pain_point": "one phrase",
"fit_score": 1-10,
"recommended_action": "call|email|skip"
}
Lead data: {lead_data}"""
Structured outputs also let you set max_tokens lower because you know exactly how long the response needs to be.
Step 4: Move Examples Out of the Prompt
Few-shot examples dramatically increase token counts. Move them to a system prompt that uses prefix caching, or replace them with a clear format instruction.
# BEFORE — 3 few-shot examples add ~300 tokens every call
with_examples = """
Classify this support ticket into: billing, technical, general
Example 1:
Input: "I was charged twice this month"
Output: billing
Example 2:
Input: "The app crashes on login"
Output: technical
Example 3:
Input: "What are your business hours?"
Output: general
Now classify: {ticket}
Output:"""
# AFTER — move the examples' knowledge into a system prompt
# With Anthropic's prompt caching (opt-in via cache_control), the system
# prompt can be reused across calls instead of re-billed at full price
system_with_examples = """Classify support tickets as: billing | technical | general
billing = payments, invoices, charges
technical = bugs, crashes, errors, features
general = hours, policies, information
Respond with the category only."""
user_prompt = "Classify: {ticket}"
def classify_ticket(ticket: str) -> str:
    response = _client.messages.create(
        model="claude-3-5-haiku-latest", max_tokens=20,
        system=system_with_examples,
        messages=[{"role": "user", "content": f"Classify: {ticket}"}]
    )
    return response.content[0].text.strip()
One caveat: Anthropic's prompt caching is opt-in, not automatic. You mark the system prompt with a cache_control breakpoint; once the cache is written, subsequent calls that reuse the same prefix read it at a steep discount, so you stop paying full price for the system prompt on every call. Prompts below the model's minimum cacheable length (on the order of 1-2K tokens, depending on model) won't be cached at all.
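The opt-in is a cache_control marker on the system block. A sketch of the request shape, based on Anthropic's prompt caching API (the helper name and ticket text are illustrative):

```python
# Sketch: build messages.create kwargs with the system prompt marked cacheable.

def cached_classify_kwargs(system_text: str, ticket: str,
                           model: str = "claude-3-5-haiku-latest") -> dict:
    """Request kwargs with the system block opted into prompt caching."""
    return {
        "model": model,
        "max_tokens": 20,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }],
        "messages": [{"role": "user", "content": f"Classify: {ticket}"}],
    }

# kwargs = cached_classify_kwargs(system_with_examples, "App crashes on login")
# response = _client.messages.create(**kwargs)
```

The response's usage block reports cache reads and writes separately, so you can verify the cache is actually being hit.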
Step 5: Set Tight max_tokens Limits
The max_tokens parameter is a ceiling, not a target, and a high ceiling doesn't make short responses slower by itself. But a tight limit does two useful things: it caps worst-case latency when the model rambles past the answer, and it keeps rate-limit accounting honest, since Anthropic estimates output-token usage from the max_tokens you request. If you're generating a one-word classification, don't request a 1,024-token ceiling.
MAX_TOKENS_BY_TASK = {
    "classify_sentiment": 5,     # single word
    "classify_ticket": 10,       # one to three words
    "extract_json_small": 200,   # small JSON object
    "lead_summary": 150,         # two to three sentences
    "email_subject": 25,         # one line
    "blog_outline": 400,         # structured list
    "full_blog_post": 2000,      # long form
}

def smart_call(prompt: str, task_type: str,
               model: str = "claude-3-5-haiku-latest") -> str:
    max_tokens = MAX_TOKENS_BY_TASK.get(task_type, 512)
    response = _client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Step 6: Run an Optimization Comparison Test
Before deploying optimized prompts, compare them side by side on your real use cases.
def compare_prompts(
    original: str,
    optimized: str,
    test_inputs: list[str],
    evaluator_fn=None  # function(original_output, optimized_output) -> bool
) -> dict:
    orig_stats = []
    opt_stats = []
    quality_matches = 0
    for inp in test_inputs:
        orig_prompt = original.format(input=inp)
        opt_prompt = optimized.format(input=inp)

        # Original
        t0 = time.time()
        orig_resp = _client.messages.create(
            model="claude-3-5-haiku-latest", max_tokens=512,
            messages=[{"role": "user", "content": orig_prompt}]
        )
        orig_stats.append({"latency": time.time() - t0,
                           "tokens": orig_resp.usage.input_tokens,
                           "output": orig_resp.content[0].text})

        # Optimized
        t0 = time.time()
        opt_resp = _client.messages.create(
            model="claude-3-5-haiku-latest", max_tokens=512,
            messages=[{"role": "user", "content": opt_prompt}]
        )
        opt_stats.append({"latency": time.time() - t0,
                          "tokens": opt_resp.usage.input_tokens,
                          "output": opt_resp.content[0].text})

        if evaluator_fn and evaluator_fn(orig_stats[-1]["output"], opt_stats[-1]["output"]):
            quality_matches += 1

    avg = lambda lst, k: sum(x[k] for x in lst) / len(lst)
    return {
        "original": {"avg_latency": round(avg(orig_stats, "latency"), 3),
                     "avg_tokens": round(avg(orig_stats, "tokens"))},
        "optimized": {"avg_latency": round(avg(opt_stats, "latency"), 3),
                      "avg_tokens": round(avg(opt_stats, "tokens"))},
        "latency_improvement_pct": round((1 - avg(opt_stats, "latency") / avg(orig_stats, "latency")) * 100, 1),
        "token_reduction_pct": round((1 - avg(opt_stats, "tokens") / avg(orig_stats, "tokens")) * 100, 1),
        "quality_match_rate": round(quality_matches / len(test_inputs) * 100, 1) if evaluator_fn else None,
    }
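For classification tasks, the simplest evaluator_fn is normalized exact match: treat the two prompts as equivalent when they produce the same label. A hypothetical example you could plug in:

```python
# Hypothetical evaluator_fn for classification outputs: match after
# stripping whitespace, case, and trailing punctuation.

def exact_match_evaluator(original_output: str, optimized_output: str) -> bool:
    """True when both prompts produced the same label."""
    norm = lambda s: s.strip().lower().rstrip(".")
    return norm(original_output) == norm(optimized_output)

# results = compare_prompts(bloated, lean,
#                           test_inputs=["I was charged twice", "App crashes"],
#                           evaluator_fn=exact_match_evaluator)
```

For free-form outputs, swap in a semantic-similarity check or an LLM judge instead.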
What to Build Next
- Build a prompt versioning system that stores all tested variants alongside their benchmark results
- Automate the benchmarking step so it runs as part of your CI pipeline before prompt changes go to production
- Add a prompt token budget so any prompt that exceeds a threshold triggers an automatic optimization review
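The token-budget idea above can start as a few lines in CI. A sketch, where the budget numbers and the chars/4 estimate are placeholders (use tiktoken and your own thresholds in practice):

```python
# Sketch: fail fast when a prompt template exceeds its token budget.
# Budget values and the chars/4 heuristic are illustrative assumptions.

PROMPT_BUDGETS = {"classify_ticket": 100, "lead_summary": 400}

def check_budget(name: str, prompt: str, default_budget: int = 500) -> None:
    est_tokens = len(prompt) // 4  # rough heuristic; swap in a real tokenizer
    budget = PROMPT_BUDGETS.get(name, default_budget)
    if est_tokens > budget:
        raise ValueError(
            f"Prompt '{name}' is ~{est_tokens} tokens, over its {budget}-token budget"
        )
```

Run it over every template at build time and a bloated prompt never reaches production unreviewed.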
Related Reading
- How to Build a Multi-Model AI Router - leaner prompts make model selection more accurate and routing cheaper
- How to Build Automatic Model Failover Systems - faster prompts mean less latency during failover transitions
- How to Implement Semantic Caching for AI Queries - optimized consistent prompts improve cache hit rates
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment