
How to Optimize AI Prompts for Speed

Rewrite prompts to get the same quality output in fewer tokens and less time.

Jay Banlasan

The AI Systems Guy

I had a classification prompt that was 1,200 tokens long because I kept adding context and examples to improve accuracy. Average response latency was 3.2 seconds. I spent two hours rewriting it using the principles below and got it down to 380 tokens, shaved 2.8 seconds off the response time, and kept nearly identical classification accuracy. That's what prompt optimization is worth when you're making thousands of calls per day.

Prompt optimization is not about making prompts shorter for the sake of it. It's about removing what doesn't add value and structuring what remains so the model reaches the answer faster with fewer tokens. Every token in your prompt costs money and adds latency. Every unnecessary token is a tax you pay on every single call.
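To put a number on that tax, a back-of-envelope savings calculation helps. A quick sketch (the default price assumes Haiku-class input pricing of $0.25 per million tokens; substitute your model's actual rate):

```python
def monthly_savings(tokens_before: int, tokens_after: int,
                    calls_per_day: int,
                    price_per_mtok: float = 0.25) -> float:
    """Estimated monthly input-cost savings in dollars.

    price_per_mtok is an assumed input price in $ per million tokens;
    check your provider's current pricing before trusting the output.
    """
    saved_per_call = (tokens_before - tokens_after) * price_per_mtok / 1_000_000
    return round(saved_per_call * calls_per_day * 30, 2)

# The 1,200 -> 380 token rewrite from above, at 5,000 calls/day:
print(monthly_savings(1200, 380, 5000))
```

The dollar amount looks small for a cheap model, but the same arithmetic on a frontier model's pricing, or on tens of thousands of calls per day, gets serious fast, and the latency savings apply regardless of price.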


Step 1: Measure Your Baseline

Before optimizing anything, measure where you are. You need both speed and quality baselines.

import anthropic
import time
import tiktoken

_client = anthropic.Anthropic()
_enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer; a rough approximation for Anthropic models

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def benchmark_prompt(prompt: str, n_runs: int = 5,
                      model: str = "claude-3-haiku-20240307") -> dict:
    latencies = []
    token_counts = []
    
    for _ in range(n_runs):
        start = time.time()
        response = _client.messages.create(
            model=model, max_tokens=512,
            messages=[{"role": "user", "content": prompt}]
        )
        latency = time.time() - start
        latencies.append(latency)
        token_counts.append(response.usage.input_tokens)
    
    avg_tokens = sum(token_counts) / n_runs
    return {
        "avg_latency_s":    round(sum(latencies) / n_runs, 3),
        "p95_latency_s":    round(sorted(latencies)[min(int(n_runs * 0.95), n_runs - 1)], 3),
        "avg_input_tokens": round(avg_tokens),
        "prompt_chars":     len(prompt),
        # $0.25 per million input tokens (Haiku pricing) x 1,000 calls
        "est_cost_per_1000_calls": round(avg_tokens * 0.00025, 4)
    }

# Baseline your current prompt
original_prompt = """Your prompt here..."""
baseline = benchmark_prompt(original_prompt)
print(baseline)

Run this before and after every optimization. Numbers over intuition.
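Before spending API calls on a full benchmark, you can also compare candidate rewrites offline. A hypothetical helper, using a rough ~4-characters-per-token heuristic by default (swap in the tiktoken-based count_tokens above for real counts):

```python
from typing import Callable

def token_diff(original: str, optimized: str,
               count_fn: Callable[[str], int] = lambda s: max(1, len(s) // 4)) -> dict:
    """Compare two prompt variants before benchmarking.

    The default count_fn is a crude ~4-chars-per-token estimate;
    pass a real tokenizer-based counter for accurate numbers.
    """
    before, after = count_fn(original), count_fn(optimized)
    return {
        "tokens_before": before,
        "tokens_after":  after,
        "reduction_pct": round((1 - after / before) * 100, 1),
    }

print(token_diff("a" * 400, "a" * 100))  # ~100 -> ~25 tokens, 75.0% reduction
```

This won't tell you anything about latency or quality, but it's a free first filter: if a rewrite doesn't cut tokens meaningfully, it's not worth benchmarking.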

Step 2: Remove Filler and Redundant Instructions

The most common source of prompt bloat is over-explaining. Models don't need paragraphs of context to do simple tasks.

# BEFORE (347 tokens)
bloated = """
You are a helpful AI assistant with expertise in customer service. Your job is to 
carefully read the customer message below and determine the sentiment of the message.
The sentiment can be positive, negative, or neutral. Please make sure to consider 
the entire message and think about it carefully before responding. It is important 
that your response is accurate because this information will be used to route the 
customer to the appropriate team. Please respond with only the word that describes 
the sentiment: positive, negative, or neutral.

Customer message: {message}

Please provide your sentiment classification:
"""

# AFTER (41 tokens)
lean = """Classify sentiment: positive, negative, or neutral. One word only.

Message: {message}"""

The lean version produces identical classification accuracy, and the 306 tokens it saves are saved on every call: multiply by however many times that prompt runs per month to see the real impact.

Common bloat patterns to cut:

- Role preambles ("You are a helpful AI assistant with expertise in...")
- Politeness filler ("please", "make sure to", "carefully")
- Justifications the model doesn't need ("this is important because this information will be used to...")
- Restating the task ("determine the sentiment... provide your sentiment classification")
- Trailing re-asks that repeat an instruction already given

Step 3: Use Structured Output Instructions

When you need structured responses, explicit format instructions reduce output tokens and parsing overhead.

# BEFORE — vague, verbose output (model writes long explanations)
slow_format = """
Analyze this lead and tell me about their company, what they might need,
and whether they're a good fit. Be thorough.
"""

# AFTER — structured, fast to parse
fast_format = """Analyze this B2B lead. Return JSON only, no prose:
{
  "company_summary": "one sentence",
  "likely_pain_point": "one phrase",
  "fit_score": 1-10,
  "recommended_action": "call|email|skip"
}

Lead data: {lead_data}"""

Structured outputs also let you set max_tokens lower because you know exactly how long the response needs to be.
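Even with a "JSON only" instruction, models occasionally wrap the object in prose, so a defensive parse is worth a few lines. A sketch (the regex fallback is an assumption about the common failure mode, not a guarantee):

```python
import json
import re

def parse_json_response(text: str) -> dict:
    """Parse a model response that should be JSON.

    Tries a direct parse first, then falls back to extracting the
    first {...} span in case the model added surrounding prose.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise ValueError(f"No JSON object found in response: {text[:100]!r}")

print(parse_json_response('Here you go: {"fit_score": 8, "recommended_action": "call"}'))
```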

Step 4: Move Examples Out of the Prompt

Few-shot examples dramatically increase token counts. Move them to a system prompt that uses prefix caching, or replace them with a clear format instruction.

# BEFORE — 3 few-shot examples add ~300 tokens every call
with_examples = """
Classify this support ticket into: billing, technical, general

Example 1:
Input: "I was charged twice this month"
Output: billing

Example 2:
Input: "The app crashes on login"
Output: technical

Example 3:
Input: "What are your business hours?"
Output: general

Now classify: {ticket}
Output:"""

# AFTER — compress the examples into category rules in the system prompt
# (for long prefixes, Anthropic's opt-in prompt caching via cache_control can cut cost further)
system_with_examples = """Classify support tickets as: billing | technical | general
billing = payments, invoices, charges
technical = bugs, crashes, errors, features
general = hours, policies, information
Respond with the category only."""

user_prompt = "Classify: {ticket}"

def classify_ticket(ticket: str) -> str:
    response = _client.messages.create(
        model="claude-3-haiku-20240307", max_tokens=20,
        system=system_with_examples,
        messages=[{"role": "user", "content": f"Classify: {ticket}"}]
    )
    return response.content[0].text.strip()

A note on caching: Anthropic's prompt caching is opt-in, not automatic. You mark a stable prefix (such as a long system prompt) with a cache_control breakpoint, and subsequent calls that reuse that exact prefix read it from cache at a reduced input rate. There is also a minimum cacheable length, so a short system prompt like the one above won't be cached at all; the win here comes from compressing ~300 tokens of examples into ~40 tokens of rules that you pay for on every call.
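For genuinely long shared prefixes, such as a big rule set or many examples, opting into the cache means passing the system prompt as a content block with a cache_control breakpoint. A sketch of the request shape (check Anthropic's prompt caching docs for current minimum lengths and billing details):

```python
def cacheable_system_block(text: str) -> list[dict]:
    """Build a system parameter marked with a prompt-caching breakpoint.

    The prefix is written to the cache on the first call and read back
    at a reduced rate on subsequent calls that reuse the exact same text.
    """
    return [{
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},
    }]

# Usage (network call, not run here; _client is the Anthropic client from Step 1):
# response = _client.messages.create(
#     model="claude-3-haiku-20240307", max_tokens=20,
#     system=cacheable_system_block(long_rules_text),
#     messages=[{"role": "user", "content": f"Classify: {ticket}"}],
# )
```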

Step 5: Set Tight max_tokens Limits

The max_tokens parameter is a ceiling, not a target, and a lower ceiling doesn't make the model write individual tokens any faster. What it controls is the worst case: with max_tokens=1024 on a one-word classification, a misbehaving response can ramble for hundreds of billable output tokens, and some providers count the requested max_tokens against your rate limits regardless of what you actually receive. Set the ceiling to match the task.

MAX_TOKENS_BY_TASK = {
    "classify_sentiment":  5,      # single word
    "classify_ticket":     10,     # one to three words
    "extract_json_small":  200,    # small JSON object
    "lead_summary":        150,    # two to three sentences
    "email_subject":       25,     # one line
    "blog_outline":        400,    # structured list
    "full_blog_post":      2000,   # long form
}

def smart_call(prompt: str, task_type: str,
               model: str = "claude-3-haiku-20240307") -> str:
    max_tokens = MAX_TOKENS_BY_TASK.get(task_type, 512)
    response = _client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
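The risk with tight ceilings is silent truncation. The Messages API reports why generation stopped in response.stop_reason, so a cheap guard catches limits set too low. A hypothetical helper (the warning policy is an illustrative choice):

```python
def check_truncation(stop_reason: str, task_type: str) -> bool:
    """Return True if the response was cut off by the max_tokens ceiling.

    A stop_reason of "max_tokens" means the ceiling fired before the
    model finished; "end_turn" means the model stopped on its own.
    """
    truncated = stop_reason == "max_tokens"
    if truncated:
        print(f"[warn] {task_type}: response hit the max_tokens ceiling; "
              f"consider raising MAX_TOKENS_BY_TASK[{task_type!r}]")
    return truncated

# After a call: check_truncation(response.stop_reason, "lead_summary")
```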

Step 6: Run an Optimization Comparison Test

Before deploying optimized prompts, compare them side by side on your real use cases.

def compare_prompts(
    original: str,
    optimized: str,
    test_inputs: list[str],
    evaluator_fn=None  # function(original_output, optimized_output) -> bool
) -> dict:
    orig_stats = []
    opt_stats  = []
    quality_matches = 0
    
    for inp in test_inputs:
        orig_prompt = original.format(input=inp)
        opt_prompt  = optimized.format(input=inp)
        
        # Original
        t0 = time.time()
        orig_resp = _client.messages.create(
            model="claude-3-haiku-20240307", max_tokens=512,
            messages=[{"role": "user", "content": orig_prompt}]
        )
        orig_stats.append({"latency": time.time()-t0, "tokens": orig_resp.usage.input_tokens,
                           "output": orig_resp.content[0].text})
        
        # Optimized
        t0 = time.time()
        opt_resp = _client.messages.create(
            model="claude-3-haiku-20240307", max_tokens=512,
            messages=[{"role": "user", "content": opt_prompt}]
        )
        opt_stats.append({"latency": time.time()-t0, "tokens": opt_resp.usage.input_tokens,
                          "output": opt_resp.content[0].text})
        
        if evaluator_fn and evaluator_fn(orig_stats[-1]["output"], opt_stats[-1]["output"]):
            quality_matches += 1
    
    avg = lambda lst, k: sum(x[k] for x in lst) / len(lst)
    
    return {
        "original":  {"avg_latency": round(avg(orig_stats,"latency"),3), "avg_tokens": round(avg(orig_stats,"tokens"))},
        "optimized": {"avg_latency": round(avg(opt_stats,"latency"),3), "avg_tokens": round(avg(opt_stats,"tokens"))},
        "latency_improvement_pct": round((1 - avg(opt_stats,"latency")/avg(orig_stats,"latency"))*100, 1),
        "token_reduction_pct":     round((1 - avg(opt_stats,"tokens")/avg(orig_stats,"tokens"))*100, 1),
        "quality_match_rate":      round(quality_matches/len(test_inputs)*100, 1) if evaluator_fn else None
    }
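For classification tasks, a reasonable evaluator_fn is normalized exact match. A minimal example (the commented-out usage is hypothetical; optimized_prompt is whatever rewrite you're testing):

```python
def exact_match(original_output: str, optimized_output: str) -> bool:
    """Quality check: both prompt versions produced the same label."""
    return original_output.strip().lower() == optimized_output.strip().lower()

# Usage against real prompts (makes API calls):
# results = compare_prompts(original_prompt, optimized_prompt,
#                           test_inputs=["I was charged twice this month"],
#                           evaluator_fn=exact_match)
# print(results)

print(exact_match(" Positive\n", "positive"))  # → True
```

For generative tasks where exact match is too strict, swap in a similarity metric or a second model as judge; the comparison harness doesn't care what the evaluator does internally.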
