How to Create AI Provider Comparison Automation
Automatically benchmark AI providers on quality, speed, and cost for your tasks.
Jay Banlasan
The AI Systems Guy
If you want to compare AI providers with automated benchmarking, you need a system that tests each model against your actual workload. Not synthetic benchmarks from someone else's blog. Your prompts, your data, your latency requirements.
I run this across every client onboarding. Before picking a model, I benchmark Claude, GPT-4, Gemini, and whatever open-source option fits. The results always surprise people.
What You Need Before Starting
- API keys for each provider you want to test (OpenAI, Anthropic, Google)
- Python 3.8+ with `anthropic`, `openai`, and `google-generativeai` installed
- A set of 10-20 representative prompts from your actual business use case
- A SQLite database to store results
Step 1: Create Your Benchmark Prompt Set
Pull real prompts from your workflow. If you run ad copy generation, use 15 actual ad copy requests. If you do customer support, use 15 real tickets.
Save them in a JSON file:
[
{"id": 1, "prompt": "Write a Meta ad headline for a dog grooming business targeting pet owners aged 25-45", "category": "ad_copy"},
{"id": 2, "prompt": "Summarize this customer complaint and suggest a resolution: ...", "category": "support"}
]
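Before running anything, it helps to load and sanity-check the prompt set so a malformed entry doesn't fail halfway through a benchmark run. A minimal loader sketch (the filename `prompts.json` is an assumption; use whatever path you saved to):

```python
import json

def load_prompts(path="prompts.json"):
    """Load the benchmark prompt set and verify each entry has the expected keys."""
    with open(path) as f:
        prompts = json.load(f)
    for p in prompts:
        # Fail fast on malformed entries instead of mid-benchmark
        assert {"id", "prompt", "category"} <= set(p), f"Missing keys in prompt: {p}"
    return prompts
```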
Step 2: Build the Benchmark Runner
import anthropic
import openai
import time
import json
import sqlite3
def benchmark_provider(provider, model, prompt, max_tokens=500):
    start = time.time()
    try:
        if provider == "anthropic":
            client = anthropic.Anthropic()
            resp = client.messages.create(model=model, max_tokens=max_tokens,
                                          messages=[{"role": "user", "content": prompt}])
            output = resp.content[0].text
            tokens_used = resp.usage.input_tokens + resp.usage.output_tokens
        elif provider == "openai":
            client = openai.OpenAI()
            resp = client.chat.completions.create(model=model, max_tokens=max_tokens,
                                                  messages=[{"role": "user", "content": prompt}])
            output = resp.choices[0].message.content
            tokens_used = resp.usage.total_tokens
        else:
            # Add a branch here for google-generativeai if you benchmark Gemini
            raise ValueError(f"Unknown provider: {provider}")
        latency = time.time() - start
        return {"output": output, "latency": latency, "tokens": tokens_used, "error": None}
    except Exception as e:
        return {"output": None, "latency": time.time() - start, "tokens": 0, "error": str(e)}
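With the single-call function in place, a small driver loop can sweep every prompt across every (provider, model) pair. A sketch of one way to structure it; the benchmark function is passed in as a parameter (`run_fn`, a hypothetical name) so the loop stays testable without live API calls:

```python
def run_benchmarks(prompts, targets, run_fn):
    """Run every prompt against every (provider, model) pair.

    `prompts` is the list of dicts from Step 1; `targets` is a list of
    (provider, model) tuples; `run_fn` is the benchmark function from Step 2.
    """
    results = []
    for provider, model in targets:
        for p in prompts:
            r = run_fn(provider, model, p["prompt"])
            # Tag each result with its origin so Step 4 can store it
            r.update(provider=provider, model=model, prompt_id=p["id"])
            results.append(r)
    return results
```

In real use you would call `run_benchmarks(prompts, [("anthropic", "claude-sonnet-4-20250514"), ("openai", "gpt-4o")], benchmark_provider)`; the model names here are illustrative.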
Step 3: Score Quality Automatically
Use a judge model to rate outputs. I use Claude for this since it handles rubric-based scoring well:
def score_output(output, prompt, criteria="accuracy, clarity, actionability"):
    client = anthropic.Anthropic()
    judge_prompt = f"""Score this AI output from 1-10 on: {criteria}
Original prompt: {prompt}
Output to score: {output}
Return ONLY a JSON object: {{"score": N, "reasoning": "..."}}"""
    resp = client.messages.create(model="claude-sonnet-4-20250514", max_tokens=200,
                                  messages=[{"role": "user", "content": judge_prompt}])
    return json.loads(resp.content[0].text)
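Even when instructed to return only JSON, judge models occasionally wrap the object in markdown fences or a sentence of preamble, which makes a bare `json.loads` throw. A defensive parsing helper, as a sketch:

```python
import json
import re

def parse_judge_json(text):
    """Extract the first JSON object from a judge response.

    Pulls out the outermost {...} span before parsing, so stray prose
    or ```json fences around the object don't break json.loads.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object found in judge response: {text!r}")
    return json.loads(match.group(0))
```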
Step 4: Store and Compare Results
def save_result(db_path, provider, model, prompt_id, latency, tokens, score):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS benchmarks (
        id INTEGER PRIMARY KEY, provider TEXT, model TEXT, prompt_id INTEGER,
        latency REAL, tokens INTEGER, quality_score REAL, timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
    )""")
    conn.execute("INSERT INTO benchmarks (provider, model, prompt_id, latency, tokens, quality_score) VALUES (?,?,?,?,?,?)",
                 (provider, model, prompt_id, latency, tokens, score))
    conn.commit()
    conn.close()
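To confirm rows are landing in the table before a long run, a quick sanity-check query helps. A hypothetical helper matching the schema above:

```python
import sqlite3

def recent_results(db_path, limit=5):
    """Fetch the most recent benchmark rows for a quick sanity check."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT provider, model, prompt_id, quality_score FROM benchmarks "
        "ORDER BY timestamp DESC LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return rows
```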
Step 5: Generate the Comparison Report
Query the database to get averages per provider:
def comparison_report(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT provider, model,
               AVG(quality_score) as avg_quality,
               AVG(latency) as avg_latency,
               AVG(tokens) as avg_tokens
        FROM benchmarks GROUP BY provider, model
    """).fetchall()
    conn.close()
    for row in rows:
        print(f"{row[0]} ({row[1]}): Quality={row[2]:.1f} Latency={row[3]:.2f}s Tokens={row[4]:.0f}")
Run this weekly. Models update, pricing changes, and your workload evolves. What wins today might not win next month.
What to Build Next
Add cost calculations using each provider's published per-token pricing. Then feed the results into your model router so it picks the best provider per task automatically.
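Provider pricing generally has to be maintained by hand, so a hard-coded price table keyed by (provider, model) is the simplest starting point. A sketch with placeholder rates (not real prices; check each provider's pricing page before relying on the numbers):

```python
# Placeholder per-million-token blended rates (USD). These are NOT real
# prices; update from each provider's pricing page.
PRICE_PER_M_TOKENS = {
    ("anthropic", "claude-sonnet-4-20250514"): 3.0,
    ("openai", "gpt-4o"): 2.5,
}

def estimate_cost(provider, model, tokens):
    """Rough cost estimate: tokens * price-per-million.

    Real pricing splits input vs. output tokens; this blends them
    into a single rate for ballpark comparison only.
    """
    rate = PRICE_PER_M_TOKENS.get((provider, model))
    if rate is None:
        return None  # unknown model: skip rather than guess
    return tokens / 1_000_000 * rate
```

Multiply this per-prompt estimate against your monthly request volume, and the quality-per-dollar tradeoff between providers becomes concrete.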
Related Reading
- Build vs Buy: The AI Framework - how to decide whether to build or buy AI solutions
- The Measurement Framework That Actually Works - measuring what matters in AI systems
- Cost of Manual vs Cost of Automated - understanding the true cost of manual operations
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment