How to Create AI Provider Comparison Automation
Automatically benchmark AI providers on quality, speed, and cost for your tasks.
Jay Banlasan
The AI Systems Guy
If you want to compare AI providers with automated benchmarking, you need a system that tests each model against your actual workload. Not synthetic benchmarks from someone else's blog. Your prompts, your data, your latency requirements.
I run this across every client onboarding. Before picking a model, I benchmark Claude, GPT-4, Gemini, and whatever open-source option fits. The results always surprise people.
What You Need Before Starting
- API keys for each provider you want to test (OpenAI, Anthropic, Google)
- Python 3.8+ with `anthropic`, `openai`, and `google-generativeai` installed
- A set of 10-20 representative prompts from your actual business use case
- A SQLite database to store results
Step 1: Create Your Benchmark Prompt Set
Pull real prompts from your workflow. If you run ad copy generation, use 15 actual ad copy requests. If you do customer support, use 15 real tickets.
Save them in a JSON file:
[
{"id": 1, "prompt": "Write a Meta ad headline for a dog grooming business targeting pet owners aged 25-45", "category": "ad_copy"},
{"id": 2, "prompt": "Summarize this customer complaint and suggest a resolution: ...", "category": "support"}
]
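Before running anything, it helps to load and sanity-check the prompt set so a malformed entry doesn't fail halfway through a benchmark run. A minimal loader sketch (the filename `prompts.json` is an assumption; use whatever path you saved to):

```python
import json

def load_prompts(path="prompts.json"):
    """Load the benchmark prompt set and verify each entry has the expected keys."""
    with open(path) as f:
        prompts = json.load(f)
    for p in prompts:
        # Fail fast on malformed entries instead of mid-benchmark
        assert {"id", "prompt", "category"} <= set(p), f"Missing keys in prompt: {p}"
    return prompts
```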
Step 2: Build the Benchmark Runner
import anthropic
import openai
import time
import json
import sqlite3
def benchmark_provider(provider, model, prompt, max_tokens=500):
    start = time.time()
    try:
        if provider == "anthropic":
            client = anthropic.Anthropic()
            resp = client.messages.create(model=model, max_tokens=max_tokens,
                                          messages=[{"role": "user", "content": prompt}])
            output = resp.content[0].text
            tokens_used = resp.usage.input_tokens + resp.usage.output_tokens
        elif provider == "openai":
            client = openai.OpenAI()
            resp = client.chat.completions.create(model=model, max_tokens=max_tokens,
                                                  messages=[{"role": "user", "content": prompt}])
            output = resp.choices[0].message.content
            tokens_used = resp.usage.total_tokens
        else:
            # Add a branch here for google-generativeai if you benchmark Gemini
            raise ValueError(f"Unknown provider: {provider}")
        latency = time.time() - start
        return {"output": output, "latency": latency, "tokens": tokens_used, "error": None}
    except Exception as e:
        return {"output": None, "latency": time.time() - start, "tokens": 0, "error": str(e)}
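With the single-call function in place, a small driver loop can sweep every prompt across every (provider, model) pair. A sketch of one way to structure it; the benchmark function is passed in as a parameter (`run_fn`, a hypothetical name) so the loop stays testable without live API calls:

```python
def run_benchmarks(prompts, targets, run_fn):
    """Run every prompt against every (provider, model) pair.

    `prompts` is the list of dicts from Step 1; `targets` is a list of
    (provider, model) tuples; `run_fn` is the benchmark function from Step 2.
    """
    results = []
    for provider, model in targets:
        for p in prompts:
            r = run_fn(provider, model, p["prompt"])
            # Tag each result with its origin so Step 4 can store it
            r.update(provider=provider, model=model, prompt_id=p["id"])
            results.append(r)
    return results
```

In real use you would call `run_benchmarks(prompts, [("anthropic", "claude-sonnet-4-20250514"), ("openai", "gpt-4o")], benchmark_provider)`; the model names here are illustrative.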
Step 3: Score Quality Automatically
Use a judge model to rate outputs. I use Claude for this since it handles rubric-based scoring well:
def score_output(output, prompt, criteria="accuracy, clarity, actionability"):
    client = anthropic.Anthropic()
    judge_prompt = f"""Score this AI output from 1-10 on: {criteria}
Original prompt: {prompt}
Output to score: {output}
Return ONLY a JSON object: {{"score": N, "reasoning": "..."}}"""
    resp = client.messages.create(model="claude-sonnet-4-20250514", max_tokens=200,
                                  messages=[{"role": "user", "content": judge_prompt}])
    return json.loads(resp.content[0].text)
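Even when instructed to return only JSON, judge models occasionally wrap the object in markdown fences or a sentence of preamble, which makes a bare `json.loads` throw. A defensive parsing helper, as a sketch:

```python
import json
import re

def parse_judge_json(text):
    """Extract the first JSON object from a judge response.

    Pulls out the outermost {...} span before parsing, so stray prose
    or ```json fences around the object don't break json.loads.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object found in judge response: {text!r}")
    return json.loads(match.group(0))
```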
Step 4: Store and Compare Results
def save_result(db_path, provider, model, prompt_id, latency, tokens, score):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS benchmarks (
        id INTEGER PRIMARY KEY, provider TEXT, model TEXT, prompt_id INTEGER,
        latency REAL, tokens INTEGER, quality_score REAL, timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
    )""")
    conn.execute("INSERT INTO benchmarks (provider, model, prompt_id, latency, tokens, quality_score) VALUES (?,?,?,?,?,?)",
                 (provider, model, prompt_id, latency, tokens, score))
    conn.commit()
    conn.close()
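To confirm rows are landing in the table before a long run, a quick sanity-check query helps. A hypothetical helper matching the schema above:

```python
import sqlite3

def recent_results(db_path, limit=5):
    """Fetch the most recent benchmark rows for a quick sanity check."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT provider, model, prompt_id, quality_score FROM benchmarks "
        "ORDER BY timestamp DESC LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return rows
```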
Step 5: Generate the Comparison Report
Query the database to get averages per provider:
def comparison_report(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT provider, model,
               AVG(quality_score) as avg_quality,
               AVG(latency) as avg_latency,
               AVG(tokens) as avg_tokens
        FROM benchmarks GROUP BY provider, model
    """).fetchall()
    conn.close()
    for row in rows:
        print(f"{row[0]} ({row[1]}): Quality={row[2]:.1f} Latency={row[3]:.2f}s Tokens={row[4]:.0f}")
Run this weekly. Models update, pricing changes, and your workload evolves. What wins today might not win next month.
What to Build Next
Add cost calculations using each provider's published per-token pricing. Then feed the results into your model router so it picks the best provider per task automatically.
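Provider pricing generally has to be maintained by hand, so a hard-coded price table keyed by (provider, model) is the simplest starting point. A sketch with placeholder rates (not real prices; check each provider's pricing page before relying on the numbers):

```python
# Placeholder per-million-token blended rates (USD). These are NOT real
# prices; update from each provider's pricing page.
PRICE_PER_M_TOKENS = {
    ("anthropic", "claude-sonnet-4-20250514"): 3.0,
    ("openai", "gpt-4o"): 2.5,
}

def estimate_cost(provider, model, tokens):
    """Rough cost estimate: tokens * price-per-million.

    Real pricing splits input vs. output tokens; this blends them
    into a single rate for ballpark comparison only.
    """
    rate = PRICE_PER_M_TOKENS.get((provider, model))
    if rate is None:
        return None  # unknown model: skip rather than guess
    return tokens / 1_000_000 * rate
```

Multiply this per-prompt estimate against your monthly request volume, and the quality-per-dollar tradeoff between providers becomes concrete.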
Related Reading
- Build vs Buy: The AI Framework - how to decide whether to build or buy AI solutions
- The Measurement Framework That Actually Works - measuring what matters in AI systems
- Cost of Manual vs Cost of Automated - understanding the true cost of manual operations
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment