How to Automate RAG Knowledge Base Maintenance
Keep RAG knowledge bases current with automated content updates.
Jay Banlasan
The AI Systems Guy
When you automate RAG knowledge base updates and maintenance, your AI answers stay accurate without manual re-indexing. I build maintenance pipelines because stale RAG systems give wrong answers confidently. If a product changed three months ago and the old docs are still indexed, the system keeps answering from them. That is worse than having no answer at all.
The system detects stale content, re-indexes changed documents, and reports on knowledge base health weekly.
What You Need Before Starting
- A working RAG system (see system 409)
- Document source with modification timestamps
- Python 3.8+ with a scheduler (Step 5 uses APScheduler)
- A freshness tracking database (the examples use SQLite)
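The queries in the steps below read from two tables, document_index and query_log, that are never created explicitly. Here is a minimal sketch of one schema that would satisfy them. The column names path, indexed_at, answer, and asked_at come from those queries; the question column, the id column, and the init_maintenance_db helper are assumptions added for illustration.

```python
import sqlite3

# Assumed schema: only path, indexed_at, answer, and asked_at are taken
# from the queries in this article; everything else is illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document_index (
    path       TEXT PRIMARY KEY,
    indexed_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS query_log (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    question TEXT NOT NULL,
    answer   TEXT,
    asked_at TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

def init_maintenance_db(db_path="rag_maintenance.db"):
    """Create the tracking tables if they are missing and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn
```

Calling init_maintenance_db() once at startup is enough; the IF NOT EXISTS clauses make repeated calls harmless.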
Step 1: Track Document Freshness
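import sqlite3
import os
from datetime import datetime

def scan_documents(docs_path):
    """Yield every file that is new or was modified after it was last indexed."""
    conn = sqlite3.connect("rag_maintenance.db")
    try:
        # Create the tracking table on first run so the SELECT below cannot fail.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS document_index "
            "(path TEXT PRIMARY KEY, indexed_at TEXT NOT NULL)"
        )
        for root, dirs, files in os.walk(docs_path):
            for f in files:
                path = os.path.join(root, f)
                modified = datetime.fromtimestamp(os.path.getmtime(path)).isoformat()
                last_indexed = conn.execute(
                    "SELECT indexed_at FROM document_index WHERE path = ?", (path,)
                ).fetchone()
                # ISO 8601 strings sort chronologically, so comparing them as text works.
                if not last_indexed or last_indexed[0] < modified:
                    yield {"path": path, "modified": modified, "status": "needs_reindex"}
    finally:
        conn.close()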
Step 2: Re-Index Changed Documents
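def reindex_stale_documents(docs_path, collection):
    """Re-index every document that scan_documents reports as changed."""
    # process_document, index_chunks, and update_index_record are helpers you
    # supply; collection is assumed to be a Chroma-style collection.
    stale = list(scan_documents(docs_path))
    for doc in stale:
        # Chunk first, so a parsing failure leaves the old chunks in place.
        chunks = process_document(doc["path"])
        remove_old_chunks(doc["path"], collection)
        index_chunks(chunks, collection, source=doc["path"])
        update_index_record(doc["path"])
    return {"reindexed": len(stale)}

def remove_old_chunks(doc_path, collection):
    """Delete all chunks whose "source" metadata matches doc_path."""
    existing = collection.get(where={"source": doc_path})
    if existing["ids"]:
        collection.delete(ids=existing["ids"])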
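Step 2 calls update_index_record without showing it. One possible version is sketched below, assuming the document_index table is keyed by path and stores indexed_at as a local-time ISO 8601 string so it compares correctly with the modified timestamps from Step 1. The db_path parameter and the INSERT OR REPLACE approach are assumptions, not something the steps above prescribe.

```python
import sqlite3
from datetime import datetime

def update_index_record(path, db_path="rag_maintenance.db"):
    """Record that `path` was indexed just now, replacing any earlier row."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumed table layout; path must be unique for the upsert to work.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS document_index "
            "(path TEXT PRIMARY KEY, indexed_at TEXT NOT NULL)"
        )
        # Same clock and format as the mtime conversion in Step 1.
        conn.execute(
            "INSERT OR REPLACE INTO document_index (path, indexed_at) VALUES (?, ?)",
            (path, datetime.now().isoformat()),
        )
        conn.commit()
    finally:
        conn.close()
```

Writing the record only after index_chunks succeeds means a failed run leaves the document marked as stale, so the next scheduled pass picks it up again.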
Step 3: Detect Dead Content
Find indexed documents that no longer exist:
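def find_orphaned_chunks(collection):
    """Return the source paths of indexed chunks whose file no longer exists."""
    # Only metadata is needed here, so skip loading the chunk text.
    all_chunks = collection.get(include=["metadatas"])
    sources = set(m["source"] for m in all_chunks["metadatas"] if m and "source" in m)
    return [source for source in sources if not os.path.exists(source)]

def cleanup_orphans(collection):
    """Delete orphaned chunks and their rows in the tracking table."""
    orphaned = find_orphaned_chunks(collection)
    conn = sqlite3.connect("rag_maintenance.db")
    for source in orphaned:
        chunks = collection.get(where={"source": source})
        if chunks["ids"]:
            collection.delete(ids=chunks["ids"])
        # Drop the tracking row too, so deleted files stop counting in the health report.
        conn.execute("DELETE FROM document_index WHERE path = ?", (source,))
    conn.commit()
    conn.close()
    return {"removed_sources": len(orphaned)}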
Step 4: Generate Health Reports
def get_kb_health_report():
    conn = sqlite3.connect("rag_maintenance.db")
    total_docs = conn.execute("SELECT COUNT(*) FROM document_index").fetchone()[0]
    # Stale here means "not re-indexed in the last 30 days". datetime() converts
    # the stored ISO timestamp to SQLite's own format so the comparison is valid.
    stale = conn.execute("""
        SELECT COUNT(*) FROM document_index
        WHERE datetime(indexed_at) < datetime('now', '-30 days')
    """).fetchone()[0]
    recent_queries = conn.execute("""
        SELECT COUNT(*) FROM query_log
        WHERE datetime(asked_at) > datetime('now', '-7 days')
    """).fetchone()[0]
    # A query counts as failed when the stored answer contains the fallback phrase.
    failed_queries = conn.execute("""
        SELECT COUNT(*) FROM query_log
        WHERE answer LIKE '%could not find%'
          AND datetime(asked_at) > datetime('now', '-7 days')
    """).fetchone()[0]
    conn.close()
    return {
        "total_documents": total_docs,
        "stale_documents": stale,
        "freshness_pct": round((total_docs - stale) / total_docs * 100, 1) if total_docs else 100,
        "weekly_queries": recent_queries,
        "failed_queries": failed_queries,
        "answer_rate": round((recent_queries - failed_queries) / recent_queries * 100, 1) if recent_queries else 100
    }
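The report depends on a query_log table that the RAG application has to fill. The sketch below shows one way to do that, assuming the answer pipeline returns a fixed fallback sentence containing "could not find" when retrieval comes up empty. NO_ANSWER_TEXT, ensure_query_log, log_query, and the question column are hypothetical names introduced for this example.

```python
import sqlite3

# Hypothetical fallback sentence; it must contain "could not find"
# for the failed-query filter in the health report to match it.
NO_ANSWER_TEXT = "I could not find that in the knowledge base."

def ensure_query_log(conn):
    """Create the query_log table if it does not exist yet (assumed layout)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS query_log (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            question TEXT NOT NULL,
            answer   TEXT,
            asked_at TEXT NOT NULL
        )
    """)

def log_query(conn, question, answer):
    """Append one question/answer pair with a UTC timestamp set by SQLite."""
    ensure_query_log(conn)
    conn.execute(
        "INSERT INTO query_log (question, answer, asked_at) "
        "VALUES (?, ?, datetime('now'))",
        (question, answer),
    )
    conn.commit()
```

Call log_query right after the RAG pipeline produces an answer, passing NO_ANSWER_TEXT whenever it has nothing to return. Matching on a phrase is fragile, so a dedicated boolean column may be a better fit if the fallback wording is likely to change.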
Step 5: Schedule Maintenance
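from apscheduler.schedulers.blocking import BlockingScheduler

def run_maintenance():
    # collection, send_alert, and send_weekly_report are assumed to be defined
    # elsewhere: the vector store collection and your notification helpers.
    result = reindex_stale_documents("./documents", collection)
    orphans = cleanup_orphans(collection)
    health = get_kb_health_report()
    # Include this run's counts in the report instead of discarding them.
    health.update(result)
    health.update(orphans)
    if health["freshness_pct"] < 80:
        send_alert(f"RAG knowledge base freshness below 80%: {health['freshness_pct']}%")
    send_weekly_report(health)

# Run every Sunday at 02:00 in the scheduler's timezone (local time by default).
scheduler = BlockingScheduler()
scheduler.add_job(run_maintenance, "cron", day_of_week="sun", hour=2)
scheduler.start()  # Blocks this process, so run it as a long-lived service.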
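send_weekly_report is left to you, since delivery depends on whether you use email, Slack, or something else. As a starting point, the sketch below turns the health dictionary from Step 4 into a plain-text summary and hands it to a delivery callable. format_health_report and the deliver parameter are hypothetical, and print is only a stand-in for a real channel.

```python
def format_health_report(health):
    """Render the Step 4 health dictionary as a short plain-text summary."""
    lines = [
        "RAG knowledge base weekly report",
        f"Documents indexed: {health['total_documents']}",
        f"Stale documents: {health['stale_documents']} ({health['freshness_pct']}% fresh)",
        f"Queries this week: {health['weekly_queries']}",
        f"Unanswered queries: {health['failed_queries']} ({health['answer_rate']}% answered)",
    ]
    return "\n".join(lines)

def send_weekly_report(health, deliver=print):
    """Format the report and pass it to `deliver` (print is a placeholder)."""
    deliver(format_health_report(health))
```

Swapping print for a function that posts to a chat webhook or sends an email keeps the formatting logic unchanged.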
What to Build Next
Add content quality scoring. Not all documents contribute equally to answer quality. Track which documents get cited most in successful answers and which never get retrieved. Low-value content can be removed or consolidated to improve retrieval speed.
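A rough sketch of how that tracking could start, assuming you record one row per retrieved source in a new retrieval_log table and flag whether the final answer was successful. The retrieval_log table and the log_retrieval, score_documents, and never_retrieved helpers are all hypothetical; document_index is the tracking table from Step 1.

```python
import sqlite3

def ensure_retrieval_log(conn):
    """Create the hypothetical retrieval_log table if it is missing."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS retrieval_log (
            source    TEXT NOT NULL,
            answered  INTEGER NOT NULL,
            logged_at TEXT NOT NULL DEFAULT (datetime('now'))
        )
    """)

def log_retrieval(conn, source, answered):
    """Record that `source` was retrieved and whether the answer succeeded."""
    ensure_retrieval_log(conn)
    conn.execute(
        "INSERT INTO retrieval_log (source, answered) VALUES (?, ?)",
        (source, int(answered)),
    )
    conn.commit()

def score_documents(conn):
    """Return per-document retrieval counts, most useful documents first."""
    ensure_retrieval_log(conn)
    rows = conn.execute("""
        SELECT d.path,
               COUNT(r.source) AS retrievals,
               COALESCE(SUM(r.answered), 0) AS successful
        FROM document_index d
        LEFT JOIN retrieval_log r ON r.source = d.path
        GROUP BY d.path
        ORDER BY successful DESC, retrievals DESC, d.path
    """).fetchall()
    return [{"path": p, "retrievals": n, "successful": s} for p, n, s in rows]

def never_retrieved(conn):
    """List indexed documents that no query has ever pulled in."""
    return [d["path"] for d in score_documents(conn) if d["retrievals"] == 0]
```

Documents returned by never_retrieved are candidates for review rather than automatic deletion, since a document may simply cover a topic nobody has asked about yet.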