
How to Build a Citation System for RAG Answers

Show source citations for every AI answer to build user trust.

Jay Banlasan

The AI Systems Guy

A RAG citation system with source attribution for every answer builds the trust that makes AI answers usable for real decisions. I build these because "trust me" is not an acceptable answer from an AI. Users need to verify claims. Citations link every statement to the specific document, page, and paragraph it came from.

This turns RAG from a "magic black box" into a "verifiable reference tool."

What You Need Before Starting

- Python with the anthropic, sentence-transformers, and chromadb packages installed
- An ANTHROPIC_API_KEY environment variable set
- A ChromaDB collection of embedded document chunks whose metadata includes a source field and, ideally, a page field

Step 1: Structure Citations in the Prompt

import anthropic
import json

client = anthropic.Anthropic()

CITATION_PROMPT = """Answer the question using the numbered sources below.
For EVERY factual claim in your answer, add a citation like [1] or [2].
If multiple sources support a claim, cite all of them like [1][3].
If no source supports a claim, do not make that claim.

After your answer, list which sources you cited and why.

Sources:
{sources}"""

def query_with_citations(question, collection):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # in production, load once at module level

    query_embedding = model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=5)

    # Number each retrieved chunk so the model can cite it as [1], [2], ...
    sources = []
    for i in range(len(results["ids"][0])):
        sources.append({
            "index": i + 1,
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i]
        })

    source_text = "\n\n".join(
        f"[{s['index']}] ({s['metadata']['source']}, p.{s['metadata'].get('page', 'N/A')})\n{s['text']}"
        for s in sources
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=600,
        system=CITATION_PROMPT.format(sources=source_text),
        messages=[{"role": "user", "content": question}]
    )

    return {"answer": response.content[0].text, "sources": sources}
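To see exactly what the model receives, here is the source-formatting step above run on two hypothetical chunks (the filenames and text are made up for illustration):

```python
sources = [
    {"index": 1, "text": "Refunds are processed within 5 business days.",
     "metadata": {"source": "refund-policy.pdf", "page": 2}},
    {"index": 2, "text": "Annual plans are billed upfront.",
     "metadata": {"source": "billing-faq.md"}},  # no page metadata
]

# Format each chunk exactly as the system prompt expects: numbered, with provenance.
source_text = "\n\n".join(
    f"[{s['index']}] ({s['metadata']['source']}, p.{s['metadata'].get('page', 'N/A')})\n{s['text']}"
    for s in sources
)
print(source_text)
# [1] (refund-policy.pdf, p.2)
# Refunds are processed within 5 business days.
#
# [2] (billing-faq.md, p.N/A)
# Annual plans are billed upfront.
```

Chunks missing a page number fall back to "N/A" rather than breaking the format, which keeps the citation indices stable.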

Step 2: Parse and Validate Citations

import re

def extract_citations(answer_text):
    citation_pattern = r'\[(\d+)\]'
    cited_indices = set(int(m) for m in re.findall(citation_pattern, answer_text))
    return cited_indices

def validate_citations(answer, sources):
    cited = extract_citations(answer)
    available = set(s["index"] for s in sources)
    invalid = cited - available
    uncited_sources = available - cited

    return {
        "valid_citations": cited & available,
        "invalid_citations": invalid,
        "uncited_sources": uncited_sources,
        "all_valid": len(invalid) == 0
    }
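A quick check with a made-up answer shows how the validator catches the most common failure mode, a citation index that was never in the retrieved set:

```python
import re

# Step 2's extractor, applied to a sample answer that includes a bogus [7].
def extract_citations(answer_text):
    return set(int(m) for m in re.findall(r'\[(\d+)\]', answer_text))

answer = "The limit is 50 requests per minute [1][3]. Retries use exponential backoff [7]."
available = {1, 2, 3, 4, 5}  # indices of the five retrieved sources

cited = extract_citations(answer)
print(cited & available)  # valid citations: {1, 3}
print(cited - available)  # invalid: {7} -- the model cited a source it was never given
```

An invalid citation like [7] is a strong hallucination signal: the claim it decorates should be treated as unsupported, not just mislabeled.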

Step 3: Build Clickable Citation Links

def format_citations_for_frontend(answer, sources):
    citation_data = {}
    for source in sources:
        citation_data[source["index"]] = {
            "document": source["metadata"]["source"],
            "page": source["metadata"].get("page", "N/A"),
            "excerpt": source["text"][:200],
            "url": source["metadata"].get("url", "")
        }

    def replace_citation(match):
        idx = int(match.group(1))
        if idx in citation_data:
            data = citation_data[idx]
            # Escape metadata values (html.escape) before injecting into attributes in production
            return f'<cite data-source="{idx}" data-doc="{data["document"]}" data-page="{data["page"]}">[{idx}]</cite>'
        return match.group(0)

    formatted = re.sub(r'\[(\d+)\]', replace_citation, answer)
    return {"html": formatted, "citation_data": citation_data}
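Here is the substitution step in isolation, using hypothetical citation data for a single source; a [2] with no matching entry is deliberately left untouched so broken references stay visible:

```python
import re

citation_data = {1: {"document": "pricing.pdf", "page": 4}}

def replace_citation(match):
    idx = int(match.group(1))
    if idx in citation_data:
        d = citation_data[idx]
        return (f'<cite data-source="{idx}" data-doc="{d["document"]}" '
                f'data-page="{d["page"]}">[{idx}]</cite>')
    return match.group(0)  # unknown index: leave the bare [n] in place

html = re.sub(r'\[(\d+)\]', replace_citation, "Plans start at $20/month [1], billed yearly [2].")
print(html)
```

In a real frontend, run document names and page values through html.escape before embedding them in attributes, since they come from document metadata you may not control.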

Step 4: Track Citation Quality

import sqlite3

def log_citation_metrics(answer, sources):
    cited = extract_citations(answer)
    validation = validate_citations(answer, sources)

    conn = sqlite3.connect("citations.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS citation_log (
            total_sources INTEGER, cited_count INTEGER,
            valid_count INTEGER, logged_at TEXT
        )
    """)
    conn.execute("""
        INSERT INTO citation_log (total_sources, cited_count, valid_count, logged_at)
        VALUES (?, ?, ?, datetime('now'))
    """, (len(sources), len(cited), len(validation["valid_citations"])))
    conn.commit()
    conn.close()

def get_citation_report(days=30):
    conn = sqlite3.connect("citations.db")
    avg_cited = conn.execute(
        "SELECT AVG(cited_count), AVG(valid_count) FROM citation_log "
        "WHERE logged_at > datetime('now', ?)",
        (f"-{days} days",)
    ).fetchone()
    conn.close()
    return {
        "avg_citations_per_answer": round(avg_cited[0] or 0, 1),
        "avg_valid": round(avg_cited[1] or 0, 1)
    }

Step 5: Handle Citation Conflicts

When sources disagree, surface the conflict:

CONFLICT_PROMPT = """If any of the sources below contradict each other, note the discrepancy.
State what each source says and let the user decide which applies to their situation.
Do not pick a side unless one source is clearly more recent."""
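One way to wire this in is to prepend the conflict instruction to the citation system prompt before filling in the sources. The shortened prompt strings and the sample source line below are stand-ins for the full versions above:

```python
CITATION_PROMPT = "Answer the question using the numbered sources below.\n\nSources:\n{sources}"
CONFLICT_PROMPT = ("If any of the sources below contradict each other, note the discrepancy. "
                   "State what each source says and let the user decide which applies.")

# Conflict handling goes first so it applies to everything that follows.
system_prompt = CONFLICT_PROMPT + "\n\n" + CITATION_PROMPT.format(
    sources="[1] (refund-policy.pdf, p.2) Refunds take 5 business days."
)
print(system_prompt)
```

Keeping the two prompts separate also lets you A/B test whether the conflict instruction changes citation behavior.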

What to Build Next

Add citation feedback. Let users click a citation and mark it as "relevant" or "not relevant." That feedback improves retrieval over time and helps you identify which documents need updating.
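A minimal sketch of that feedback loop, using a hypothetical citation_feedback table alongside the metrics database (an in-memory connection here stands in for citations.db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use citations.db in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS citation_feedback (
        source_doc TEXT,
        source_index INTEGER,
        relevant INTEGER,  -- 1 = user marked relevant, 0 = not relevant
        logged_at TEXT DEFAULT (datetime('now'))
    )
""")

def record_feedback(conn, source_doc, source_index, relevant):
    conn.execute(
        "INSERT INTO citation_feedback (source_doc, source_index, relevant) VALUES (?, ?, ?)",
        (source_doc, source_index, int(relevant)),
    )
    conn.commit()

record_feedback(conn, "refund-policy.pdf", 1, True)
record_feedback(conn, "refund-policy.pdf", 2, False)

# Per-document relevance rate flags stale sources worth re-chunking or retiring.
rate = conn.execute(
    "SELECT AVG(relevant) FROM citation_feedback WHERE source_doc = ?",
    ("refund-policy.pdf",),
).fetchone()[0]
```

Documents whose relevance rate drifts downward are the first candidates for review: either the content is outdated or the chunks are being retrieved for the wrong questions.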

Want this system built for your business?

Get a free assessment. We will map every system your business needs and show you the ROI.
