
How to Build a Citation System for RAG Answers

Show source citations for every AI answer to build user trust.

Jay Banlasan

The AI Systems Guy

A RAG citation system with source attribution for every answer builds the trust that makes AI answers usable for real decisions. I build these because "trust me" is not an acceptable answer from an AI. Users need to verify claims. Citations link every statement to the specific document, page, and paragraph it came from.

This turns RAG from a "magic black box" into a "verifiable reference tool."

What You Need Before Starting

- Python with the anthropic, sentence-transformers, and chromadb packages installed
- An ANTHROPIC_API_KEY environment variable set
- A ChromaDB collection of embedded document chunks whose metadata includes a source field and, ideally, a page field

Step 1: Structure Citations in the Prompt

import anthropic
import json

client = anthropic.Anthropic()

CITATION_PROMPT = """Answer the question using the numbered sources below.
For EVERY factual claim in your answer, add a citation like [1] or [2].
If multiple sources support a claim, cite all of them like [1][3].
If no source supports a claim, do not make that claim.

After your answer, list which sources you cited and why.

Sources:
{sources}"""

def query_with_citations(question, collection):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # in production, load once at module level

    query_embedding = model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=5)

    # Number each retrieved chunk so the model can cite it as [1], [2], ...
    sources = []
    for i in range(len(results["ids"][0])):
        sources.append({
            "index": i + 1,
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i]
        })

    source_text = "\n\n".join(
        f"[{s['index']}] ({s['metadata']['source']}, p.{s['metadata'].get('page', 'N/A')})\n{s['text']}"
        for s in sources
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=600,
        system=CITATION_PROMPT.format(sources=source_text),
        messages=[{"role": "user", "content": question}]
    )

    return {"answer": response.content[0].text, "sources": sources}
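To see exactly what the model receives, here is the source-formatting step above run on two hypothetical chunks (the filenames and text are made up for illustration):

```python
sources = [
    {"index": 1, "text": "Refunds are processed within 5 business days.",
     "metadata": {"source": "refund-policy.pdf", "page": 2}},
    {"index": 2, "text": "Annual plans are billed upfront.",
     "metadata": {"source": "billing-faq.md"}},  # no page metadata
]

# Format each chunk exactly as the system prompt expects: numbered, with provenance.
source_text = "\n\n".join(
    f"[{s['index']}] ({s['metadata']['source']}, p.{s['metadata'].get('page', 'N/A')})\n{s['text']}"
    for s in sources
)
print(source_text)
# [1] (refund-policy.pdf, p.2)
# Refunds are processed within 5 business days.
#
# [2] (billing-faq.md, p.N/A)
# Annual plans are billed upfront.
```

Chunks missing a page number fall back to "N/A" rather than breaking the format, which keeps the citation indices stable.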

Step 2: Parse and Validate Citations

import re

def extract_citations(answer_text):
    citation_pattern = r'\[(\d+)\]'
    cited_indices = set(int(m) for m in re.findall(citation_pattern, answer_text))
    return cited_indices

def validate_citations(answer, sources):
    cited = extract_citations(answer)
    available = set(s["index"] for s in sources)
    invalid = cited - available
    uncited_sources = available - cited

    return {
        "valid_citations": cited & available,
        "invalid_citations": invalid,
        "uncited_sources": uncited_sources,
        "all_valid": len(invalid) == 0
    }
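A quick check with a made-up answer shows how the validator catches the most common failure mode, a citation index that was never in the retrieved set:

```python
import re

# Step 2's extractor, applied to a sample answer that includes a bogus [7].
def extract_citations(answer_text):
    return set(int(m) for m in re.findall(r'\[(\d+)\]', answer_text))

answer = "The limit is 50 requests per minute [1][3]. Retries use exponential backoff [7]."
available = {1, 2, 3, 4, 5}  # indices of the five retrieved sources

cited = extract_citations(answer)
print(cited & available)  # valid citations: {1, 3}
print(cited - available)  # invalid: {7} -- the model cited a source it was never given
```

An invalid citation like [7] is a strong hallucination signal: the claim it decorates should be treated as unsupported, not just mislabeled.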

Step 3: Build Clickable Citation Links

def format_citations_for_frontend(answer, sources):
    citation_data = {}
    for source in sources:
        citation_data[source["index"]] = {
            "document": source["metadata"]["source"],
            "page": source["metadata"].get("page", "N/A"),
            "excerpt": source["text"][:200],
            "url": source["metadata"].get("url", "")
        }

    def replace_citation(match):
        idx = int(match.group(1))
        if idx in citation_data:
            data = citation_data[idx]
            # Escape metadata values (html.escape) before injecting into attributes in production
            return f'<cite data-source="{idx}" data-doc="{data["document"]}" data-page="{data["page"]}">[{idx}]</cite>'
        return match.group(0)

    formatted = re.sub(r'\[(\d+)\]', replace_citation, answer)
    return {"html": formatted, "citation_data": citation_data}
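Here is the substitution step in isolation, using hypothetical citation data for a single source; a [2] with no matching entry is deliberately left untouched so broken references stay visible:

```python
import re

citation_data = {1: {"document": "pricing.pdf", "page": 4}}

def replace_citation(match):
    idx = int(match.group(1))
    if idx in citation_data:
        d = citation_data[idx]
        return (f'<cite data-source="{idx}" data-doc="{d["document"]}" '
                f'data-page="{d["page"]}">[{idx}]</cite>')
    return match.group(0)  # unknown index: leave the bare [n] in place

html = re.sub(r'\[(\d+)\]', replace_citation, "Plans start at $20/month [1], billed yearly [2].")
print(html)
```

In a real frontend, run document names and page values through html.escape before embedding them in attributes, since they come from document metadata you may not control.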

Step 4: Track Citation Quality

import sqlite3

def log_citation_metrics(answer, sources):
    cited = extract_citations(answer)
    validation = validate_citations(answer, sources)

    conn = sqlite3.connect("citations.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS citation_log (
            total_sources INTEGER, cited_count INTEGER,
            valid_count INTEGER, logged_at TEXT
        )
    """)
    conn.execute("""
        INSERT INTO citation_log (total_sources, cited_count, valid_count, logged_at)
        VALUES (?, ?, ?, datetime('now'))
    """, (len(sources), len(cited), len(validation["valid_citations"])))
    conn.commit()
    conn.close()

def get_citation_report(days=30):
    conn = sqlite3.connect("citations.db")
    avg_cited = conn.execute(
        "SELECT AVG(cited_count), AVG(valid_count) FROM citation_log "
        "WHERE logged_at > datetime('now', ?)",
        (f"-{days} days",)
    ).fetchone()
    conn.close()
    return {
        "avg_citations_per_answer": round(avg_cited[0] or 0, 1),
        "avg_valid": round(avg_cited[1] or 0, 1)
    }

Step 5: Handle Citation Conflicts

When sources disagree, surface the conflict:

CONFLICT_PROMPT = """If any of the sources below contradict each other, note the discrepancy.
State what each source says and let the user decide which applies to their situation.
Do not pick a side unless one source is clearly more recent."""
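One way to wire this in is to prepend the conflict instruction to the citation system prompt before filling in the sources. The shortened prompt strings and the sample source line below are stand-ins for the full versions above:

```python
CITATION_PROMPT = "Answer the question using the numbered sources below.\n\nSources:\n{sources}"
CONFLICT_PROMPT = ("If any of the sources below contradict each other, note the discrepancy. "
                   "State what each source says and let the user decide which applies.")

# Conflict handling goes first so it applies to everything that follows.
system_prompt = CONFLICT_PROMPT + "\n\n" + CITATION_PROMPT.format(
    sources="[1] (refund-policy.pdf, p.2) Refunds take 5 business days."
)
print(system_prompt)
```

Keeping the two prompts separate also lets you A/B test whether the conflict instruction changes citation behavior.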

What to Build Next

Add citation feedback. Let users click a citation and mark it as "relevant" or "not relevant." That feedback improves retrieval over time and helps you identify which documents need updating.
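A minimal sketch of that feedback loop, using a hypothetical citation_feedback table alongside the metrics database (an in-memory connection here stands in for citations.db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use citations.db in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS citation_feedback (
        source_doc TEXT,
        source_index INTEGER,
        relevant INTEGER,  -- 1 = user marked relevant, 0 = not relevant
        logged_at TEXT DEFAULT (datetime('now'))
    )
""")

def record_feedback(conn, source_doc, source_index, relevant):
    conn.execute(
        "INSERT INTO citation_feedback (source_doc, source_index, relevant) VALUES (?, ?, ?)",
        (source_doc, source_index, int(relevant)),
    )
    conn.commit()

record_feedback(conn, "refund-policy.pdf", 1, True)
record_feedback(conn, "refund-policy.pdf", 2, False)

# Per-document relevance rate flags stale sources worth re-chunking or retiring.
rate = conn.execute(
    "SELECT AVG(relevant) FROM citation_feedback WHERE source_doc = ?",
    ("refund-policy.pdf",),
).fetchone()[0]
```

Documents whose relevance rate drifts downward are the first candidates for review: either the content is outdated or the chunks are being retrieved for the wrong questions.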

Want this system built for your business?

Get a free assessment. We will map every system your business needs and show you the ROI.
