
How to Automate RAG Knowledge Base Maintenance

Keep RAG knowledge bases current with automated content updates.

Jay Banlasan

The AI Systems Guy

When you automate RAG knowledge base updates and maintenance, your AI answers stay accurate without manual re-indexing. I build maintenance pipelines because stale RAG systems give wrong answers confidently: a product changed three months ago, but the old docs are still indexed. That is worse than having no answer at all.

The system detects stale content, re-indexes changed documents, and reports on knowledge base health weekly.

What You Need Before Starting

You need Python 3 with SQLite (both in the standard library usage below), a vector store that exposes a Chroma-style collection API (get, delete, and metadata filters), an existing ingestion pipeline that chunks and embeds documents, and a directory of source documents to track.

Step 1: Track Document Freshness

import sqlite3
import os
from datetime import datetime

def scan_documents(docs_path):
    conn = sqlite3.connect("rag_maintenance.db")
    # Tracking table: one row per document, with its last indexing time
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_index "
        "(path TEXT PRIMARY KEY, indexed_at TEXT)"
    )
    try:
        for root, dirs, files in os.walk(docs_path):
            for f in files:
                path = os.path.join(root, f)
                modified = datetime.fromtimestamp(os.path.getmtime(path)).isoformat()
                last_indexed = conn.execute(
                    "SELECT indexed_at FROM document_index WHERE path = ?", (path,)
                ).fetchone()

                # Flag anything never indexed, or modified since its last indexing
                if not last_indexed or last_indexed[0] < modified:
                    yield {"path": path, "modified": modified, "status": "needs_reindex"}
    finally:
        conn.close()

Step 2: Re-Index Changed Documents

def reindex_stale_documents(docs_path, collection):
    stale = list(scan_documents(docs_path))

    for doc in stale:
        # Delete the old chunks first so stale and fresh text never coexist
        remove_old_chunks(doc["path"], collection)
        # process_document and index_chunks come from your existing
        # ingestion pipeline; update_index_record stamps the tracking table
        chunks = process_document(doc["path"])
        index_chunks(chunks, collection, source=doc["path"])
        update_index_record(doc["path"])

    return {"reindexed": len(stale)}

def remove_old_chunks(doc_path, collection):
    # Look up every chunk tagged with this source file, then delete by id
    existing = collection.get(where={"source": doc_path})
    if existing["ids"]:
        collection.delete(ids=existing["ids"])
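Of the helpers above, update_index_record is the only one that lives entirely in the maintenance layer rather than your ingestion pipeline. A minimal sketch against the same SQLite tracking table (the db_path parameter and the upsert shape are assumptions; adapt them to your schema):

```python
import sqlite3
from datetime import datetime

def update_index_record(doc_path, db_path="rag_maintenance.db"):
    conn = sqlite3.connect(db_path)
    # Same table scan_documents reads from; created here if missing
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_index "
        "(path TEXT PRIMARY KEY, indexed_at TEXT)"
    )
    # Upsert so a re-indexed doc gets a fresh timestamp, not a duplicate row
    conn.execute(
        "INSERT INTO document_index (path, indexed_at) VALUES (?, ?) "
        "ON CONFLICT(path) DO UPDATE SET indexed_at = excluded.indexed_at",
        (doc_path, datetime.now().isoformat()),
    )
    conn.commit()
    conn.close()
```

The upsert matters: without it, every maintenance run would insert a duplicate row and scan_documents would keep reading the oldest timestamp.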

Step 3: Detect Dead Content

Find indexed documents that no longer exist:

def find_orphaned_chunks(collection):
    # Collect every source path referenced by any indexed chunk
    all_chunks = collection.get()
    sources = {m["source"] for m in all_chunks["metadatas"] if m and "source" in m}

    # A source is orphaned when its file no longer exists on disk
    return [s for s in sources if not os.path.exists(s)]

def cleanup_orphans(collection):
    orphaned = find_orphaned_chunks(collection)
    for source in orphaned:
        chunks = collection.get(where={"source": source})
        if chunks["ids"]:
            collection.delete(ids=chunks["ids"])
    return {"removed_sources": len(orphaned)}
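The orphan check reduces to a set difference between indexed sources and files on disk. A toy illustration with hard-coded metadata standing in for the collection's output (the paths are hypothetical):

```python
import os

# Stand-in for collection.get(): one source that still exists, one deleted
all_chunks = {
    "metadatas": [
        {"source": os.getcwd()},           # still present on disk
        {"source": "missing/old_doc.md"},  # deleted since it was indexed
    ]
}

sources = {m["source"] for m in all_chunks["metadatas"]}
orphaned = [s for s in sources if not os.path.exists(s)]
print(orphaned)  # ['missing/old_doc.md']
```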

Step 4: Generate Health Reports

def get_kb_health_report():
    conn = sqlite3.connect("rag_maintenance.db")

    total_docs = conn.execute("SELECT COUNT(*) FROM document_index").fetchone()[0]
    # Anything not re-indexed in 30 days counts as stale
    stale = conn.execute("""
        SELECT COUNT(*) FROM document_index
        WHERE indexed_at < datetime('now', '-30 days')
    """).fetchone()[0]

    # query_log is written by the query pipeline at answer time
    recent_queries = conn.execute(
        "SELECT COUNT(*) FROM query_log WHERE asked_at > datetime('now', '-7 days')"
    ).fetchone()[0]
    failed_queries = conn.execute("""
        SELECT COUNT(*) FROM query_log
        WHERE answer LIKE '%could not find%' AND asked_at > datetime('now', '-7 days')
    """).fetchone()[0]
    conn.close()

    return {
        "total_documents": total_docs,
        "stale_documents": stale,
        "freshness_pct": round((total_docs - stale) / total_docs * 100, 1) if total_docs else 100,
        "weekly_queries": recent_queries,
        "failed_queries": failed_queries,
        "answer_rate": round((recent_queries - failed_queries) / recent_queries * 100, 1) if recent_queries else 100
    }
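The report assumes a query_log table that your answering pipeline writes to. A minimal sketch of that schema, with the failed-query count exercised against two hypothetical rows (the column names and sample answers are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE query_log (
        question TEXT,
        answer   TEXT,
        asked_at TEXT DEFAULT (datetime('now'))
    )
""")
conn.executemany(
    "INSERT INTO query_log (question, answer) VALUES (?, ?)",
    [
        ("How do refunds work?", "Refunds are processed within 5 days."),
        ("What is the SSO setup?", "I could not find that in the knowledge base."),
    ],
)

# Same failure heuristic the health report uses
failed = conn.execute(
    "SELECT COUNT(*) FROM query_log WHERE answer LIKE '%could not find%'"
).fetchone()[0]
print(failed)  # 1
```

The LIKE heuristic only works if your answer prompt uses a consistent phrase for "no answer found"; if yours varies, log an explicit boolean column instead.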

Step 5: Schedule Maintenance

from apscheduler.schedulers.blocking import BlockingScheduler

def run_maintenance():
    # collection is your vector store handle; send_alert and
    # send_weekly_report are whatever notification hooks you already use
    reindexed = reindex_stale_documents("./documents", collection)
    orphans = cleanup_orphans(collection)
    health = get_kb_health_report()
    health.update(reindexed)
    health.update(orphans)

    if health["freshness_pct"] < 80:
        send_alert(f"RAG knowledge base freshness below 80%: {health['freshness_pct']}%")

    send_weekly_report(health)

# Run every Sunday at 2 AM; BlockingScheduler occupies the process,
# so run this as its own service
scheduler = BlockingScheduler()
scheduler.add_job(run_maintenance, "cron", day_of_week="sun", hour=2)
scheduler.start()

What to Build Next

Add content quality scoring. Not all documents contribute equally to answer quality. Track which documents get cited most in successful answers and which never get retrieved. Low-value content can be removed or consolidated to improve retrieval speed.
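Citation tracking can start as a simple counter keyed by source document. A sketch under assumed names (record_answer is a hypothetical hook called whenever an answer cites sources):

```python
from collections import Counter

# How often each source document was cited in successful answers
citations = Counter()

def record_answer(cited_sources):
    citations.update(cited_sources)

# Hypothetical answers citing their source docs
record_answer(["docs/pricing.md", "docs/api.md"])
record_answer(["docs/pricing.md"])

# High counts = high-value docs; indexed docs that never appear
# here are candidates for consolidation or removal
print(citations.most_common(2))  # [('docs/pricing.md', 2), ('docs/api.md', 1)]
```

Persist the counter to the same rag_maintenance.db so the weekly report can include a least-cited list.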


Want this system built for your business?

Get a free assessment. We will map every system your business needs and show you the ROI.
