
How to Test AI API Responses Before Production

Build a testing framework to validate AI outputs before deploying to production.

Jay Banlasan

The AI Systems Guy

Most people deploy AI features and find out they are broken when a user complains. Testing AI API responses in a staging environment is how you catch problems before they hit production: wrong format, hallucinated data, off-brand tone, missing required fields. I build test suites before every AI feature ships. It takes two hours upfront and saves five hours of debugging later.

The challenge is that AI outputs are probabilistic. You cannot do exact string matching. You need assertion patterns that check structure, content presence, format compliance, and semantic quality without requiring an identical output every time.


Step 1: Define What a Good Response Looks Like

Before writing tests, write a spec. For each AI function you are testing, define:

  1. Required fields or structure (if JSON output)
  2. Length bounds (min/max words or tokens)
  3. Things that must be present (keywords, phrases)
  4. Things that must not be present (competitor names, prohibited topics, hallucination patterns)
  5. Format rules (starts with a greeting, ends with a CTA, is valid JSON)

Example spec for a lead follow-up generator:

FOLLOW_UP_SPEC = {
    "max_words": 100,
    "min_words": 30,
    "must_include_any": ["thank", "reach out", "happy to", "call", "meeting"],
    "must_not_include": ["unsubscribe", "unfortunately", "I apologize", "as an AI"],
    "must_end_with_question": True,
    "no_placeholders": ["[Name]", "[Company]", "{name}", "INSERT"],
}

Writing the spec first forces you to be specific about what you actually want from the model.
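Once the spec exists, it can drive the checks mechanically. Below is a minimal sketch of that idea; `check_against_spec` is a hypothetical helper for illustration, and Step 2 builds the same checks as standalone assertion functions.

```python
def check_against_spec(text: str, spec: dict) -> list:
    """Return a list of human-readable spec violations (empty list = pass)."""
    violations = []
    words = len(text.split())
    lower = text.lower()
    if words < spec.get("min_words", 0):
        violations.append(f"too short: {words} words")
    if words > spec.get("max_words", 10**9):
        violations.append(f"too long: {words} words")
    if spec.get("must_include_any") and not any(p.lower() in lower for p in spec["must_include_any"]):
        violations.append("missing all required phrases")
    for phrase in spec.get("must_not_include", []):
        if phrase.lower() in lower:
            violations.append(f"forbidden phrase: {phrase}")
    if spec.get("must_end_with_question") and not text.rstrip().endswith("?"):
        violations.append("does not end with a question")
    for ph in spec.get("no_placeholders", []):
        if ph in text:
            violations.append(f"unfilled placeholder: {ph}")
    return violations

print(check_against_spec(
    "Hi [Name], unfortunately I cannot help.",
    {"min_words": 30, "must_not_include": ["unfortunately"],
     "must_end_with_question": True, "no_placeholders": ["[Name]"]},
))
# flags: too short, forbidden phrase, no ending question, unfilled placeholder
```

Returning a list instead of raising on the first problem means a single failed generation reports every violation at once, which makes prompt debugging much faster.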

Step 2: Build Assertion Helpers

Create reusable assertion functions for common checks.


def assert_word_count(text: str, min_words: int = 0, max_words: int = 9999, label: str = ""):
    words = len(text.split())
    assert words >= min_words, f"{label}: Too short ({words} words, min {min_words})"
    assert words <= max_words, f"{label}: Too long ({words} words, max {max_words})"

def assert_contains_any(text: str, options: list, label: str = ""):
    text_lower = text.lower()
    found = any(opt.lower() in text_lower for opt in options)
    assert found, f"{label}: None of {options} found in response"

def assert_contains_none(text: str, forbidden: list, label: str = ""):
    text_lower = text.lower()
    for item in forbidden:
        assert item.lower() not in text_lower, f"{label}: Forbidden phrase '{item}' found in response"

def assert_no_placeholders(text: str, placeholders: list = None, label: str = ""):
    defaults = ["[Name]", "[Company]", "{name}", "{company}", "INSERT", "PLACEHOLDER"]
    checks = placeholders or defaults
    for ph in checks:
        assert ph not in text, f"{label}: Unfilled placeholder '{ph}' found in response"

def assert_valid_json(text: str, required_keys: list = None, label: str = ""):
    import json
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        raise AssertionError(f"{label}: Response is not valid JSON: {e}") from e

    if required_keys:
        for key in required_keys:
            assert key in data, f"{label}: Required key '{key}' missing from JSON response"
    return data

def assert_ends_with_question(text: str, label: str = ""):
    # Check the actual ending; scanning the last "."-delimited chunk for a "?"
    # wrongly passes responses like "Are you free Thursday? Thanks."
    stripped = text.rstrip()
    assert stripped.endswith("?"), \
        f"{label}: Response does not end with a question. Ending: '{stripped[-60:]}'"
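One caution: these helpers are code too, and a helper that never fires is worse than no helper. Before trusting them in a live suite, sanity-check each one against a known-bad and a known-good string. A quick sketch, re-declaring `assert_contains_none` so it runs standalone:

```python
def assert_contains_none(text: str, forbidden: list, label: str = ""):
    text_lower = text.lower()
    for item in forbidden:
        assert item.lower() not in text_lower, f"{label}: Forbidden phrase '{item}' found in response"

# A known-bad string must trigger the assertion...
caught = False
try:
    assert_contains_none("As an AI, I cannot share pricing.", ["as an AI"], label="sanity")
except AssertionError:
    caught = True
assert caught, "helper missed a forbidden phrase it should have caught"

# ...and a known-good string must pass silently.
assert_contains_none("Happy to walk through pricing on a quick call.", ["as an AI"], label="sanity")
print("helper sanity checks passed")
```

Two minutes of this catches inverted conditions and case-sensitivity bugs before they silently wave bad outputs through.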

Step 3: Build a Test Runner for Your Prompts

Create a test class that calls the AI and runs your assertions.

import anthropic
import pytest

client = anthropic.Anthropic()

def run_prompt(system_prompt: str, user_message: str, model: str = "claude-haiku-4-5") -> str:
    response = client.messages.create(
        model=model,
        max_tokens=500,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

class TestFollowUpGenerator:
    SYSTEM_PROMPT = """You write short follow-up emails for a B2B SaaS company.
Be conversational, genuine, and end with one specific question.
Never use templates or placeholders. Under 80 words."""

    def test_basic_output_format(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "Write a follow-up for Sarah Chen at TechCorp who attended our webinar on AI automation."
        )
        assert_word_count(response, min_words=20, max_words=100, label="follow-up")
        assert_contains_none(response, ["as an AI", "I cannot", "I apologize"], label="follow-up")
        assert_no_placeholders(response, label="follow-up")

    def test_ends_with_question(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "Write a follow-up for Marcus at BuildCo who downloaded our pricing guide."
        )
        assert_ends_with_question(response, label="follow-up")

    def test_does_not_use_template_language(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "Write a follow-up for a prospect named Alex who requested a demo."
        )
        assert_contains_none(
            response,
            ["Dear [", "Hello [", "Hi [", "template", "as per our"],
            label="anti-template"
        )

    def test_mentions_relevant_context(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "Write a follow-up for Janet who asked specifically about API integration."
        )
        assert_contains_any(
            response,
            ["api", "integrat", "connect", "technical"],
            label="context-relevance"
        )

Step 4: Test JSON Output Prompts

For prompts that return structured JSON, test the schema strictly.

class TestTicketClassifier:
    SYSTEM_PROMPT = """Classify support tickets. Return ONLY valid JSON, no explanation.
Schema: {"category": string, "priority": "urgent|high|normal|low", "summary": string}"""

    def test_returns_valid_json(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "My payment failed three times and I have a demo with a client in 2 hours."
        )
        data = assert_valid_json(
            response,
            required_keys=["category", "priority", "summary"],
            label="ticket-classifier"
        )
        assert data["priority"] in ["urgent", "high", "normal", "low"], \
            f"Invalid priority value: {data['priority']}"

    def test_urgent_ticket_gets_high_priority(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "URGENT: Production is down, all users are locked out right now."
        )
        data = assert_valid_json(response, required_keys=["priority"], label="ticket-classifier")
        assert data["priority"] in ["urgent", "high"], \
            f"Production outage should be urgent/high, got: {data['priority']}"

    def test_low_stakes_gets_lower_priority(self):
        response = run_prompt(
            self.SYSTEM_PROMPT,
            "Hi, just wondering if you have dark mode planned for the future?"
        )
        data = assert_valid_json(response, required_keys=["priority"], label="ticket-classifier")
        assert data["priority"] in ["low", "normal"], \
            f"Feature request should be low/normal, got: {data['priority']}"

Step 5: Run Tests Multiple Times to Catch Flakiness

AI outputs vary. Run each test case 3-5 times and flag prompts that fail more than 20% of the time.

def run_test_multiple_times(test_fn, n: int = 5) -> dict:
    results = {"passed": 0, "failed": 0, "errors": []}

    for i in range(n):
        try:
            test_fn()
            results["passed"] += 1
        except AssertionError as e:
            results["failed"] += 1
            results["errors"].append(str(e))

    results["pass_rate"] = results["passed"] / n
    return results

# Usage
def test_follow_up_stability():
    classifier = TestFollowUpGenerator()
    result = run_test_multiple_times(classifier.test_basic_output_format, n=5)
    print(f"Pass rate: {result['pass_rate']:.0%}")
    if result["errors"]:
        print("Sample failures:")
        for err in result["errors"][:3]:
            print(f"  - {err}")

    assert result["pass_rate"] >= 0.8, \
        f"Prompt is too flaky: only {result['pass_rate']:.0%} pass rate"
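Keep the statistics in mind: five runs is a small sample, and an 80% gate at n=5 means allowing one failure. A quick binomial check shows what that gate can and cannot detect (`gate_pass_probability` is a hypothetical helper, not part of the suite above):

```python
from math import comb

def gate_pass_probability(p: float, n: int, min_passes: int) -> float:
    """Chance that at least min_passes of n independent runs pass,
    when each run passes with probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(min_passes, n + 1))

# A healthy prompt (95% per-run pass rate) clears a 4-of-5 gate almost always:
print(f"{gate_pass_probability(0.95, 5, 4):.0%}")  # → 98%
# A flaky prompt (60% per-run pass rate) still clears it about a third of the time:
print(f"{gate_pass_probability(0.60, 5, 4):.0%}")  # → 34%
```

A prompt that only passes 60% of the time still slips through the 4-of-5 gate about one run in three, so for prompts that matter, raise n in run_test_multiple_times rather than trusting a single five-run verdict.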

Step 6: Add a Regression Suite

Save your test inputs and run them before every deployment. If a prompt change breaks existing behavior, you catch it before it ships.

REGRESSION_CASES = [
    {
        "name": "basic-follow-up",
        "system": "Write follow-up emails. Under 80 words. End with a question.",
        "user": "Follow-up for Tom at BuildCo who attended our webinar.",
        "assertions": [
            lambda r: assert_word_count(r, 20, 90),
            lambda r: assert_ends_with_question(r),
            lambda r: assert_no_placeholders(r),
        ]
    },
    # Add more cases as you build more prompts
]

def run_regression_suite():
    failures = []
    for case in REGRESSION_CASES:
        response = run_prompt(case["system"], case["user"])
        for assertion in case["assertions"]:
            try:
                assertion(response)
            except AssertionError as e:
                failures.append({"case": case["name"], "error": str(e), "response": response[:200]})

    if failures:
        print(f"\nREGRESSION FAILURES: {len(failures)}")
        for f in failures:
            print(f"\n  Case: {f['case']}")
            print(f"  Error: {f['error']}")
            print(f"  Response preview: {f['response']}")
        return False

    print(f"All {len(REGRESSION_CASES)} regression cases passed.")
    return True

if __name__ == "__main__":
    run_regression_suite()

Run pytest on your test files: pytest test_ai_prompts.py -v
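If you would rather have pytest manage the case list than the custom runner above, `pytest.mark.parametrize` expresses the same idea. A sketch using canned strings in place of live API output (the case names and texts are invented for illustration):

```python
import pytest

# (name, canned_text, should_end_with_question) -- stand-ins for saved model outputs.
CASES = [
    ("webinar-follow-up", "Thanks for joining. Would Thursday work for a quick call?", True),
    ("flat-ending", "Thanks for joining. Talk soon.", False),
]

@pytest.mark.parametrize("name,text,expected", CASES, ids=[c[0] for c in CASES])
def test_ends_with_question(name, text, expected):
    assert text.rstrip().endswith("?") == expected, f"{name}: unexpected ending"
```

Each case shows up as its own line in pytest -v output, so a regression in one prompt is immediately visible by name.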
