How to Build AI Evaluation Pipelines
Automate quality scoring of AI outputs using rubrics and judge models.
Jay Banlasan
The AI Systems Guy
Most AI systems get built, deployed, and then silently degrade. The model updates, user inputs drift, and nobody notices until a client complains. An AI evaluation pipeline catches that before it becomes a problem. I use judge models to automatically score outputs against rubrics, log the results, and alert when quality drops below a threshold.
This matters for any business that relies on AI-generated content, support responses, or data extraction at scale. One bad batch of outputs going unchecked costs more in trust than the entire system cost to build.
What You Need Before Starting
- Python 3.9+
- An OpenAI or Anthropic API key (the examples below use the OpenAI SDK)
- A sample dataset of 20-50 representative inputs (expected outputs are optional; the rubric-based judge below does not require them)
- SQLite or a simple file store for logging scores
Step 1: Define Your Evaluation Rubric
Before writing any code, define what "good" means for your specific task. A rubric for a support bot is different from a rubric for ad copy generation.
```python
SUPPORT_BOT_RUBRIC = """
Score the AI response on these dimensions (1-5 each):
1. Accuracy: Does the response correctly answer the question?
2. Completeness: Does it cover all parts of the user's question?
3. Tone: Is it professional and friendly without being sycophantic?
4. Actionability: Does the user know what to do next?
5. Brevity: Is it appropriately concise? No padding or repetition.
Return JSON: {"accuracy": N, "completeness": N, "tone": N, "actionability": N, "brevity": N, "overall": N, "reason": "..."}
"""
```
Define your rubric in plain language first. The model needs to understand the criteria, not just apply numbers.
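For contrast, here is what an ad copy rubric might look like. This one is entirely illustrative; the dimensions and key names are my own, and you should adapt them to your brand and channel:

```python
# Hypothetical rubric for ad copy generation; dimensions are illustrative,
# not a standard. Adapt them to your own task before using.
AD_COPY_RUBRIC = """
Score the AI-generated ad copy on these dimensions (1-5 each):
1. Hook: Does the opening line grab attention?
2. Clarity: Is the offer immediately understandable?
3. Brand fit: Does the voice match the brand guidelines?
4. Call to action: Is the next step explicit and compelling?
5. Compliance: Does it avoid unsupported or misleading claims?
Return JSON: {"hook": N, "clarity": N, "brand_fit": N, "call_to_action": N, "compliance": N, "overall": N, "reason": "..."}
"""
```

The point is that the judge is only as good as the rubric: a support-bot dimension like "actionability" is meaningless for ad copy, and "hook" is meaningless for support.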
Step 2: Build the Judge Model Function
The judge model reads the original query, the AI response, and the rubric, then returns structured scores.
```python
import json

import openai

judge_client = openai.OpenAI(api_key="YOUR_API_KEY")


def judge_response(
    user_query: str,
    ai_response: str,
    rubric: str,
    judge_model: str = "gpt-4o",
) -> dict:
    """Score an AI response against a rubric and return the parsed JSON scores."""
    prompt = f"""You are an objective evaluator. Score the AI response strictly and fairly.
USER QUERY:
{user_query}
AI RESPONSE:
{ai_response}
{rubric}
Respond with ONLY the JSON object. No explanation outside the JSON."""
    response = judge_client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Using temperature 0 on the judge model keeps scores consistent across runs. It does not eliminate variance entirely, but it removes most of it.
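Even with `response_format` forcing valid JSON, the judge can still return out-of-range or missing scores. A minimal sanity check helps; the function name and key set below are my own, matched to the support-bot rubric above:

```python
# Dimensions we expect the judge to return, per the rubric above.
EXPECTED_KEYS = {"accuracy", "completeness", "tone", "actionability", "brevity", "overall"}


def validate_scores(scores: dict) -> bool:
    """Return True if every rubric dimension is present and each score is in 1-5."""
    if not EXPECTED_KEYS.issubset(scores):
        return False
    return all(
        isinstance(scores[k], (int, float)) and 1 <= scores[k] <= 5
        for k in EXPECTED_KEYS
    )
```

If validation fails, retry the judge call once rather than logging a garbage score.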
Step 3: Build the Evaluation Logger
Store every evaluation so you can track quality trends over time.
```python
import json
import sqlite3
from datetime import datetime

DB_PATH = "eval_results.db"


def init_eval_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT,
            pipeline_name TEXT,
            user_query TEXT,
            ai_response TEXT,
            model_used TEXT,
            scores_json TEXT,
            overall_score REAL,
            flagged INTEGER DEFAULT 0
        )
    """)
    conn.commit()
    conn.close()


def log_evaluation(pipeline: str, query: str, response: str, model: str, scores: dict) -> int:
    conn = sqlite3.connect(DB_PATH)
    overall = scores.get("overall", 0)
    flagged = 1 if overall < 3 else 0  # flag anything scoring below 3/5
    conn.execute(
        """INSERT INTO evaluations
           (timestamp, pipeline_name, user_query, ai_response, model_used, scores_json, overall_score, flagged)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (datetime.now().isoformat(), pipeline, query, response, model, json.dumps(scores), overall, flagged),
    )
    conn.commit()
    conn.close()
    return flagged


init_eval_db()
```
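Once evaluations accumulate, you will want to pull the flagged rows back out for review. Here is one possible sketch; the function name is my own, and `db_path` is parameterized so you can point it at a copy of the database while testing:

```python
import json
import sqlite3


def fetch_flagged(db_path: str, limit: int = 20) -> list[dict]:
    """Return the most recent flagged evaluations, worst score first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT timestamp, pipeline_name, user_query, overall_score, scores_json
           FROM evaluations
           WHERE flagged = 1
           ORDER BY overall_score ASC, timestamp DESC
           LIMIT ?""",
        (limit,),
    ).fetchall()
    conn.close()
    return [
        {
            "timestamp": ts,
            "pipeline": name,
            "query": query,
            "overall": overall,
            "scores": json.loads(scores_json),
        }
        for ts, name, query, overall, scores_json in rows
    ]
```

Reviewing the worst outputs by hand is where most prompt improvements come from.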
Step 4: Run Evaluations Against a Test Set
Build a function that runs your pipeline against a known dataset and reports aggregate scores.
```python
def run_evaluation_batch(
    test_cases: list[dict],  # each: {"query": "..."}
    pipeline_name: str,
    production_model: str = "gpt-4o-mini",
    rubric: str = SUPPORT_BOT_RUBRIC,
) -> dict:
    if not test_cases:
        raise ValueError("test_cases is empty")
    results = []
    flagged_count = 0
    production_client = openai.OpenAI(api_key="YOUR_API_KEY")
    for case in test_cases:
        # Get the production response
        prod_response = production_client.chat.completions.create(
            model=production_model,
            messages=[
                {"role": "system", "content": "You are a helpful support assistant."},
                {"role": "user", "content": case["query"]},
            ],
            temperature=0,
        )
        ai_output = prod_response.choices[0].message.content
        # Judge it
        scores = judge_response(case["query"], ai_output, rubric)
        flagged = log_evaluation(pipeline_name, case["query"], ai_output, production_model, scores)
        if flagged:
            flagged_count += 1
        results.append(scores)
        print(f"Query: {case['query'][:50]}... | Score: {scores.get('overall', 0)}/5")
    avg_overall = sum(r.get("overall", 0) for r in results) / len(results)
    return {
        "total_cases": len(test_cases),
        "avg_overall_score": round(avg_overall, 2),
        "flagged_count": flagged_count,
        "flag_rate": round(flagged_count / len(test_cases) * 100, 1),
    }
```
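The batch summary only reports the overall average, which hides which rubric dimension is dragging scores down. A small helper (the name is my own) that averages each numeric dimension separately makes the weak spot obvious:

```python
def dimension_averages(results: list[dict]) -> dict[str, float]:
    """Average each numeric score key across a batch of judge results.

    Non-numeric keys (like "reason") are skipped automatically.
    """
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for scores in results:
        for key, value in scores.items():
            if isinstance(value, (int, float)):
                totals[key] = totals.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}
```

If "accuracy" averages 4.8 but "brevity" averages 2.9, you know exactly which part of the prompt to tighten.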
Step 5: Add Regression Detection
Compare current batch scores against your historical baseline and alert when quality drops. One caveat: because the nightly run logs its results before this check executes, the current batch is included in the 7-day average, which slightly softens the measured drop. Exclude the current run's rows if you want a strict before/after comparison.
```python
def check_regression(pipeline_name: str, current_avg: float, threshold: float = 0.3) -> bool:
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute(
        """SELECT AVG(overall_score) FROM evaluations
           WHERE pipeline_name = ?
           AND timestamp > datetime('now', '-7 days')""",
        (pipeline_name,),
    ).fetchone()
    conn.close()
    if row[0] is None:
        return False  # no history yet
    historical_avg = row[0]
    drop = historical_avg - current_avg
    if drop >= threshold:
        print(f"REGRESSION DETECTED: {pipeline_name}")
        print(f"Historical avg (7d): {historical_avg:.2f} | Current: {current_avg:.2f} | Drop: {drop:.2f}")
        return True
    return False
```
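An average over only a handful of historical rows is noisy, so a drop of 0.3 might just be sampling variance. One possible guard (the function name and the default of 30 rows are my own choices) that requires a minimum sample size before trusting the comparison:

```python
import sqlite3


def has_enough_history(db_path: str, pipeline_name: str, min_rows: int = 30) -> bool:
    """Return True if at least min_rows evaluations exist for the pipeline in the last 7 days."""
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute(
        """SELECT COUNT(*) FROM evaluations
           WHERE pipeline_name = ?
           AND timestamp > datetime('now', '-7 days')""",
        (pipeline_name,),
    ).fetchone()
    conn.close()
    return count >= min_rows
```

Call this before `check_regression` and skip the alert when it returns False, so a sparse week of data cannot trigger a false alarm.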
Step 6: Schedule Nightly Evaluation Runs
Run evaluations automatically so you catch issues before users do.
```python
import time

import schedule

TEST_CASES = [
    {"query": "How do I cancel my subscription?"},
    {"query": "I was charged twice this month."},
    {"query": "What's your refund policy?"},
    # Add 20-50 representative queries
]


def nightly_eval():
    print(f"Running nightly eval at {datetime.now()}")
    summary = run_evaluation_batch(TEST_CASES, "support-bot")
    print(f"Summary: {summary}")
    if summary["avg_overall_score"] < 3.5:
        print("ALERT: Quality below threshold. Review flagged outputs.")
        # Add a Slack or email alert here
    if check_regression("support-bot", summary["avg_overall_score"]):
        print("ALERT: Regression detected vs last 7 days.")


schedule.every().day.at("02:00").do(nightly_eval)

while True:
    schedule.run_pending()
    time.sleep(60)
```
What to Build Next
- Build a side-by-side comparison tool that runs the same test set against two different models or prompt versions so you can A/B test before promoting changes
- Add human review tooling for the top 10% lowest-scoring outputs so you can improve your prompts with real failure examples
- Connect evaluation scores to your deployment pipeline so a score drop below threshold blocks a prompt update from going to production
Related Reading
- How to Write System Prompts That Control AI Behavior - evaluation without good prompts is just measuring the wrong thing
- How to Build AI Guardrails for Safe Outputs - combine guardrails with evals for a complete quality system
- How to Build Persona-Based AI Assistants - persona consistency is a dimension you can add to your rubric
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment