How to Build AI Evaluation Pipelines
Automate quality scoring of AI outputs using rubrics and judge models.
Jay Banlasan
The AI Systems Guy
Most AI systems get built, deployed, and then silently degrade. The model updates, user inputs drift, and nobody notices until a client complains. An AI evaluation pipeline catches that before it becomes a problem. I use judge models to automatically score outputs against rubrics, log the results, and alert when quality drops below a threshold.
This matters for any business that relies on AI-generated content, support responses, or data extraction at scale. One bad batch of outputs going unchecked costs more in trust than the entire system cost to build.
What You Need Before Starting
- Python 3.9+
- An OpenAI or Anthropic API key (the examples below use the OpenAI SDK)
- A sample dataset of 20-50 representative inputs (expected outputs are optional; the rubric-based judge below does not require them)
- SQLite or a simple file store for logging scores
Step 1: Define Your Evaluation Rubric
Before writing any code, define what "good" means for your specific task. A rubric for a support bot is different from a rubric for ad copy generation.
```python
SUPPORT_BOT_RUBRIC = """
Score the AI response on these dimensions (1-5 each):
1. Accuracy: Does the response correctly answer the question?
2. Completeness: Does it cover all parts of the user's question?
3. Tone: Is it professional and friendly without being sycophantic?
4. Actionability: Does the user know what to do next?
5. Brevity: Is it appropriately concise? No padding or repetition.
Return JSON: {"accuracy": N, "completeness": N, "tone": N, "actionability": N, "brevity": N, "overall": N, "reason": "..."}
"""
```
Define your rubric in plain language first. The model needs to understand the criteria, not just apply numbers.
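For contrast, here is what an ad copy rubric might look like. This one is entirely illustrative; the dimensions and key names are my own, and you should adapt them to your brand and channel:

```python
# Hypothetical rubric for ad copy generation; dimensions are illustrative,
# not a standard. Adapt them to your own task before using.
AD_COPY_RUBRIC = """
Score the AI-generated ad copy on these dimensions (1-5 each):
1. Hook: Does the opening line grab attention?
2. Clarity: Is the offer immediately understandable?
3. Brand fit: Does the voice match the brand guidelines?
4. Call to action: Is the next step explicit and compelling?
5. Compliance: Does it avoid unsupported or misleading claims?
Return JSON: {"hook": N, "clarity": N, "brand_fit": N, "call_to_action": N, "compliance": N, "overall": N, "reason": "..."}
"""
```

The point is that the judge is only as good as the rubric: a support-bot dimension like "actionability" is meaningless for ad copy, and "hook" is meaningless for support.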
Step 2: Build the Judge Model Function
The judge model reads the original query, the AI response, and the rubric, then returns structured scores.
```python
import json

import openai

judge_client = openai.OpenAI(api_key="YOUR_API_KEY")


def judge_response(
    user_query: str,
    ai_response: str,
    rubric: str,
    judge_model: str = "gpt-4o",
) -> dict:
    """Score an AI response against a rubric and return the parsed JSON scores."""
    prompt = f"""You are an objective evaluator. Score the AI response strictly and fairly.
USER QUERY:
{user_query}
AI RESPONSE:
{ai_response}
{rubric}
Respond with ONLY the JSON object. No explanation outside the JSON."""
    response = judge_client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Using temperature 0 on the judge model keeps scores consistent across runs. It does not eliminate variance entirely, but it removes most of it.
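Even with `response_format` forcing valid JSON, the judge can still return out-of-range or missing scores. A minimal sanity check helps; the function name and key set below are my own, matched to the support-bot rubric above:

```python
# Dimensions we expect the judge to return, per the rubric above.
EXPECTED_KEYS = {"accuracy", "completeness", "tone", "actionability", "brevity", "overall"}


def validate_scores(scores: dict) -> bool:
    """Return True if every rubric dimension is present and each score is in 1-5."""
    if not EXPECTED_KEYS.issubset(scores):
        return False
    return all(
        isinstance(scores[k], (int, float)) and 1 <= scores[k] <= 5
        for k in EXPECTED_KEYS
    )
```

If validation fails, retry the judge call once rather than logging a garbage score.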
Step 3: Build the Evaluation Logger
Store every evaluation so you can track quality trends over time.
```python
import json
import sqlite3
from datetime import datetime

DB_PATH = "eval_results.db"


def init_eval_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS evaluations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT,
            pipeline_name TEXT,
            user_query TEXT,
            ai_response TEXT,
            model_used TEXT,
            scores_json TEXT,
            overall_score REAL,
            flagged INTEGER DEFAULT 0
        )
    """)
    conn.commit()
    conn.close()


def log_evaluation(pipeline: str, query: str, response: str, model: str, scores: dict) -> int:
    conn = sqlite3.connect(DB_PATH)
    overall = scores.get("overall", 0)
    flagged = 1 if overall < 3 else 0  # flag anything scoring below 3/5
    conn.execute(
        """INSERT INTO evaluations
           (timestamp, pipeline_name, user_query, ai_response, model_used, scores_json, overall_score, flagged)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (datetime.now().isoformat(), pipeline, query, response, model, json.dumps(scores), overall, flagged),
    )
    conn.commit()
    conn.close()
    return flagged


init_eval_db()
```
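Once evaluations accumulate, you will want to pull the flagged rows back out for review. Here is one possible sketch; the function name is my own, and `db_path` is parameterized so you can point it at a copy of the database while testing:

```python
import json
import sqlite3


def fetch_flagged(db_path: str, limit: int = 20) -> list[dict]:
    """Return the most recent flagged evaluations, worst score first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT timestamp, pipeline_name, user_query, overall_score, scores_json
           FROM evaluations
           WHERE flagged = 1
           ORDER BY overall_score ASC, timestamp DESC
           LIMIT ?""",
        (limit,),
    ).fetchall()
    conn.close()
    return [
        {
            "timestamp": ts,
            "pipeline": name,
            "query": query,
            "overall": overall,
            "scores": json.loads(scores_json),
        }
        for ts, name, query, overall, scores_json in rows
    ]
```

Reviewing the worst outputs by hand is where most prompt improvements come from.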
Step 4: Run Evaluations Against a Test Set
Build a function that runs your pipeline against a known dataset and reports aggregate scores.
```python
def run_evaluation_batch(
    test_cases: list[dict],  # each: {"query": "..."}
    pipeline_name: str,
    production_model: str = "gpt-4o-mini",
    rubric: str = SUPPORT_BOT_RUBRIC,
) -> dict:
    if not test_cases:
        raise ValueError("test_cases is empty")
    results = []
    flagged_count = 0
    production_client = openai.OpenAI(api_key="YOUR_API_KEY")
    for case in test_cases:
        # Get the production response
        prod_response = production_client.chat.completions.create(
            model=production_model,
            messages=[
                {"role": "system", "content": "You are a helpful support assistant."},
                {"role": "user", "content": case["query"]},
            ],
            temperature=0,
        )
        ai_output = prod_response.choices[0].message.content
        # Judge it
        scores = judge_response(case["query"], ai_output, rubric)
        flagged = log_evaluation(pipeline_name, case["query"], ai_output, production_model, scores)
        if flagged:
            flagged_count += 1
        results.append(scores)
        print(f"Query: {case['query'][:50]}... | Score: {scores.get('overall', 0)}/5")
    avg_overall = sum(r.get("overall", 0) for r in results) / len(results)
    return {
        "total_cases": len(test_cases),
        "avg_overall_score": round(avg_overall, 2),
        "flagged_count": flagged_count,
        "flag_rate": round(flagged_count / len(test_cases) * 100, 1),
    }
```
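The batch summary only reports the overall average, which hides which rubric dimension is dragging scores down. A small helper (the name is my own) that averages each numeric dimension separately makes the weak spot obvious:

```python
def dimension_averages(results: list[dict]) -> dict[str, float]:
    """Average each numeric score key across a batch of judge results.

    Non-numeric keys (like "reason") are skipped automatically.
    """
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for scores in results:
        for key, value in scores.items():
            if isinstance(value, (int, float)):
                totals[key] = totals.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}
```

If "accuracy" averages 4.8 but "brevity" averages 2.9, you know exactly which part of the prompt to tighten.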
Step 5: Add Regression Detection
Compare current batch scores against your historical baseline and alert when quality drops. One caveat: because the nightly run logs its results before this check executes, the current batch is included in the 7-day average, which slightly softens the measured drop. Exclude the current run's rows if you want a strict before/after comparison.
```python
def check_regression(pipeline_name: str, current_avg: float, threshold: float = 0.3) -> bool:
    conn = sqlite3.connect(DB_PATH)
    row = conn.execute(
        """SELECT AVG(overall_score) FROM evaluations
           WHERE pipeline_name = ?
           AND timestamp > datetime('now', '-7 days')""",
        (pipeline_name,),
    ).fetchone()
    conn.close()
    if row[0] is None:
        return False  # no history yet
    historical_avg = row[0]
    drop = historical_avg - current_avg
    if drop >= threshold:
        print(f"REGRESSION DETECTED: {pipeline_name}")
        print(f"Historical avg (7d): {historical_avg:.2f} | Current: {current_avg:.2f} | Drop: {drop:.2f}")
        return True
    return False
```
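An average over only a handful of historical rows is noisy, so a drop of 0.3 might just be sampling variance. One possible guard (the function name and the default of 30 rows are my own choices) that requires a minimum sample size before trusting the comparison:

```python
import sqlite3


def has_enough_history(db_path: str, pipeline_name: str, min_rows: int = 30) -> bool:
    """Return True if at least min_rows evaluations exist for the pipeline in the last 7 days."""
    conn = sqlite3.connect(db_path)
    (count,) = conn.execute(
        """SELECT COUNT(*) FROM evaluations
           WHERE pipeline_name = ?
           AND timestamp > datetime('now', '-7 days')""",
        (pipeline_name,),
    ).fetchone()
    conn.close()
    return count >= min_rows
```

Call this before `check_regression` and skip the alert when it returns False, so a sparse week of data cannot trigger a false alarm.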
Step 6: Schedule Nightly Evaluation Runs
Run evaluations automatically so you catch issues before users do.
```python
import time

import schedule

TEST_CASES = [
    {"query": "How do I cancel my subscription?"},
    {"query": "I was charged twice this month."},
    {"query": "What's your refund policy?"},
    # Add 20-50 representative queries
]


def nightly_eval():
    print(f"Running nightly eval at {datetime.now()}")
    summary = run_evaluation_batch(TEST_CASES, "support-bot")
    print(f"Summary: {summary}")
    if summary["avg_overall_score"] < 3.5:
        print("ALERT: Quality below threshold. Review flagged outputs.")
        # Add a Slack or email alert here
    if check_regression("support-bot", summary["avg_overall_score"]):
        print("ALERT: Regression detected vs last 7 days.")


schedule.every().day.at("02:00").do(nightly_eval)

while True:
    schedule.run_pending()
    time.sleep(60)
```
What to Build Next
- Build a side-by-side comparison tool that runs the same test set against two different models or prompt versions so you can A/B test before promoting changes
- Add human review tooling for the top 10% lowest-scoring outputs so you can improve your prompts with real failure examples
- Connect evaluation scores to your deployment pipeline so a score drop below threshold blocks a prompt update from going to production
Related Reading
- How to Write System Prompts That Control AI Behavior - evaluation without good prompts is just measuring the wrong thing
- How to Build AI Guardrails for Safe Outputs - combine guardrails with evals for a complete quality system
- How to Build Persona-Based AI Assistants - persona consistency is a dimension you can add to your rubric
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment