How to Build Few-Shot Prompts for Consistent Output
Use example-based prompting to get reliable, formatted AI responses every time.
Jay Banlasan
The AI Systems Guy
Few-shot prompt engineering for consistent output is the fastest way to get a model to produce exactly the format and style you want without fine-tuning. Instead of describing what you want in abstract rules, you show the model three to five examples of the exact input-output pairs you expect. The model pattern-matches and produces outputs that follow your demonstrated format. I use this on every project where output format matters and where the cost of fine-tuning is not justified.
The technique works because large language models are trained to continue patterns. When you give them explicit examples, you are not asking them to follow instructions abstractly; you are showing them the pattern to continue.
What You Need Before Starting
- An API key for an LLM provider (the code below uses Anthropic's Claude)
- 3-10 high-quality input-output examples that represent your desired output format
- A clear task where format consistency is the problem you are solving
Step 1: Understand When Few-Shot Beats Zero-Shot
Zero-shot prompting: "Classify this support ticket as billing, technical, or account."
Few-shot prompting: the same instruction plus three examples showing the exact classification format.
Use few-shot when:
- Zero-shot produces correct content but inconsistent format
- Your output categories have overlap the model keeps confusing
- You need a specific writing style that is hard to describe in rules
- The task involves judgment calls you want the model to replicate
Zero-shot is fine when:
- The task is straightforward (summarize, translate)
- Format is simple (yes/no, single number)
- You are already getting consistent outputs
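Laid out side by side, the two prompt styles look like this. A minimal sketch; the ticket text and example pairs are illustrative, not from a real dataset:

```python
# Zero-shot: the instruction alone
zero_shot = (
    "Classify this support ticket as billing, technical, or account.\n"
    "Ticket: My card was charged twice."
)

# Few-shot: the same instruction plus demonstrated input-output pairs,
# ending exactly where the model should continue the pattern
few_shot = """Classify this support ticket as billing, technical, or account.

Input: My card was charged twice.
Output: billing

Input: The API is returning 500 errors.
Output: technical

Input: I can't get the password reset email.
Output: account

Input: My invoice shows the wrong plan.
Output:"""
```

The few-shot version ends mid-pattern on `Output:`, so the most natural continuation for the model is a single category label in the demonstrated format.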
Step 2: Build Your Example Set
Examples are the core of few-shot prompting. Quality beats quantity.
```python
# Good few-shot examples for a ticket classifier
TICKET_CLASSIFICATION_EXAMPLES = [
    {
        "input": "My card was charged twice for the same subscription.",
        "output": '{"category": "billing", "priority": "high", "action": "refund_check"}'
    },
    {
        "input": "The API is returning 500 errors when I try to fetch user data.",
        "output": '{"category": "technical", "priority": "urgent", "action": "engineering_escalate"}'
    },
    {
        "input": "I forgot my password and can't get the reset email.",
        "output": '{"category": "account", "priority": "normal", "action": "manual_reset"}'
    },
    {
        "input": "Would love to see dark mode added to the dashboard.",
        "output": '{"category": "feature_request", "priority": "low", "action": "log_feedback"}'
    },
    {
        "input": "I've been a customer for 3 years and this is unacceptable service.",
        "output": '{"category": "account", "priority": "high", "action": "csm_callback"}'
    }
]
```
Rules for good examples:
- Cover the range of inputs you expect in production
- Include edge cases that are genuinely ambiguous
- Every example output must be exactly the format you want
- No mediocre examples - each one teaches a pattern
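The rules above can be enforced mechanically before any examples reach a prompt. A small sketch, assuming the JSON schema from the classifier above; `REQUIRED_KEYS` is specific to that schema and should be adapted to your own format:

```python
import json

# Keys every example output must contain (assumed from the ticket schema)
REQUIRED_KEYS = {"category", "priority", "action"}

def validate_examples(examples: list) -> list:
    """Return (index, problem) pairs for examples that break the format."""
    problems = []
    for i, example in enumerate(examples):
        raw = example.get("output", "")
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((i, "output is not valid JSON"))
            continue
        missing = REQUIRED_KEYS - parsed.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
    return problems
```

Run this whenever you edit the example set; an empty result means every example demonstrates exactly the format you want the model to copy.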
Step 3: Build the Few-Shot Prompt Function
```python
import anthropic
import json
import re

client = anthropic.Anthropic()

def build_few_shot_prompt(
    task_instruction: str,
    examples: list,
    new_input: str,
    input_label: str = "Input",
    output_label: str = "Output"
) -> str:
    prompt_parts = [task_instruction, ""]
    for example in examples:
        prompt_parts.append(f"{input_label}: {example['input']}")
        prompt_parts.append(f"{output_label}: {example['output']}")
        prompt_parts.append("")
    prompt_parts.append(f"{input_label}: {new_input}")
    prompt_parts.append(f"{output_label}:")
    return "\n".join(prompt_parts)

def classify_ticket_few_shot(ticket_text: str) -> dict:
    instruction = """Classify support tickets. Return JSON only.
Schema: {"category": "billing|technical|account|feature_request|other", "priority": "urgent|high|normal|low", "action": string}"""
    prompt = build_few_shot_prompt(
        instruction,
        TICKET_CLASSIFICATION_EXAMPLES,
        ticket_text
    )
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    raw_output = response.content[0].text.strip()
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError:
        # Fall back to extracting the first JSON object if there is extra text
        match = re.search(r'\{.*?\}', raw_output, re.DOTALL)
        if match:
            return json.loads(match.group())
        return {"error": "Parse failed", "raw": raw_output}

# Test
result = classify_ticket_few_shot("I need to update the billing email on my account.")
print(result)
```
Step 4: Few-Shot for Style and Voice
Few-shot is especially powerful for style replication. Give it samples of writing and it mimics the voice.
```python
STYLE_EXAMPLES = [
    {
        "input": "Write a follow-up for someone who attended our webinar on AI automation.",
        "output": "Hey [Name] - good to have you on the AI automation call. The workflow piece you asked about during Q&A - I can show you exactly how we built that. Worth a 20-minute call? I have Thursday afternoon free."
    },
    {
        "input": "Write a follow-up for someone who downloaded our pricing guide 3 days ago.",
        "output": "[Name] - you grabbed the pricing guide a few days back. Usually people have questions about the setup timeline after reading it. Anything I can clear up, or does the scope feel about right for what you're working on?"
    },
    {
        "input": "Write a follow-up for someone who went cold after two previous emails.",
        "output": "[Name], I'll keep this short. Still working on the [problem area]? If the timing's off, totally fine - just let me know and I'll check back in Q3. If you're ready to move, I can get you a live demo this week."
    }
]

def generate_follow_up_few_shot(context: str) -> str:
    instruction = "Write a short B2B follow-up email. Under 80 words. Conversational. End with one question. Use [Name] as placeholder."
    prompt = build_few_shot_prompt(instruction, STYLE_EXAMPLES, context)
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
```
Step 5: Dynamic Few-Shot Selection
For tasks with many categories, selecting the most relevant examples for each input improves accuracy more than using the same fixed set every time.
```python
def select_relevant_examples(
    input_text: str,
    example_pool: list,
    n: int = 3
) -> list:
    """Select examples most similar to the input using basic keyword matching.
    For production, use embedding similarity instead."""
    input_words = set(input_text.lower().split())
    scored = []
    for example in example_pool:
        example_words = set(example["input"].lower().split())
        overlap = len(input_words & example_words)
        scored.append((overlap, example))
    # Sort by overlap score, highest first, and keep the top n
    scored.sort(key=lambda x: x[0], reverse=True)
    return [example for _, example in scored[:n]]

def classify_with_dynamic_examples(ticket_text: str) -> dict:
    relevant_examples = select_relevant_examples(
        ticket_text,
        TICKET_CLASSIFICATION_EXAMPLES,
        n=3
    )
    instruction = """Classify support tickets. Return JSON only.
Schema: {"category": "billing|technical|account|feature_request|other", "priority": "urgent|high|normal|low", "action": string}"""
    prompt = build_few_shot_prompt(instruction, relevant_examples, ticket_text)
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return json.loads(response.content[0].text.strip())
    except json.JSONDecodeError:
        return {"error": "parse_failed", "raw": response.content[0].text}
```
For production, replace keyword matching with embedding cosine similarity using a model like text-embedding-3-small.
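That swap can be sketched as follows. `embed_fn` is a placeholder for whatever embedding call you use (for example, a wrapper around text-embedding-3-small); in production you would precompute and cache the pool embeddings rather than re-embedding on every request:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_by_embedding(input_text: str, example_pool: list, embed_fn, n: int = 3) -> list:
    """Rank pool examples by embedding similarity to the input and return the top n.
    embed_fn maps a string to a vector; plug in your embedding API here."""
    input_vec = embed_fn(input_text)
    scored = [
        (cosine_similarity(input_vec, embed_fn(example["input"])), example)
        for example in example_pool
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:n]]
```

This is a drop-in replacement for `select_relevant_examples`: same inputs and output shape, just a better similarity signal.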
Step 6: Measure Consistency Across Runs
Few-shot prompts are more consistent than zero-shot but still vary. Measure it.
```python
from collections import Counter

def measure_consistency(
    prompt_fn,
    test_input: str,
    n_runs: int = 10
) -> dict:
    outputs = [prompt_fn(test_input) for _ in range(n_runs)]
    # For classification tasks, measure how often the same category appears
    if outputs and isinstance(outputs[0], dict):
        categories = [o.get("category", "error") for o in outputs]
        category_counts = Counter(categories)
        most_common = category_counts.most_common(1)[0]
        consistency_rate = most_common[1] / n_runs
        return {
            "consistency_rate": consistency_rate,
            "dominant_output": most_common[0],
            "distribution": dict(category_counts),
            "n_runs": n_runs
        }
    return {"outputs": outputs}

# Measure classification consistency
result = measure_consistency(
    classify_ticket_few_shot,
    "I need to cancel my subscription and get a refund for this month.",
    n_runs=5
)
print(f"Consistency: {result['consistency_rate']:.0%} agreed on '{result['dominant_output']}'")
```
Anything below 80% consistency on a clear-cut input means your examples need refinement. Add a counterexample for the case causing disagreement.
What to Build Next
- Build an example library organized by task type that you reuse across projects
- Add embedding-based example selection so prompts automatically pick the most relevant examples at runtime
- Create a consistency dashboard that tracks output stability across model versions
Related Reading
- Few-Shot Prompting: Teaching AI by Example
- Using JSON Mode for Reliable API Output
- Input, Process, Output: The Universal AI Framework
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment