How to Automate Document Backup and Archival
Back up and archive business documents automatically on schedule.
Jay Banlasan
The AI Systems Guy
Losing a critical document because nobody backed it up is a problem you should only have once. I built an automated document backup and archival system that runs on schedule, copies files to cloud storage, and cleans up old versions so nothing important gets lost.
This system handles the boring work of automated document backup and archival, and the same approach works with any cloud provider.
What You Need Before Starting
- Python 3.8+
- Cloud storage credentials (AWS S3, Google Cloud Storage, or Backblaze B2)
- A schedule runner (cron on Linux, Task Scheduler on Windows)
Step 1: Define Your Backup Configuration
BACKUP_CONFIG = {
    "source_dirs": [
        "/home/user/contracts",
        "/home/user/client-briefs",
        "/home/user/sops"
    ],
    "destination": "s3://my-backups/documents/",
    "retention_days": 90,
    "file_types": [".pdf", ".docx", ".md", ".xlsx"],
    "max_file_size_mb": 100
}
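The destination is written as an s3:// URI, while the later steps take a bucket and prefix separately. A small helper can split one into the other; `parse_destination` is my name for it, not part of the config above:

```python
from urllib.parse import urlparse

def parse_destination(destination):
    # "s3://my-backups/documents/" -> ("my-backups", "documents/")
    parsed = urlparse(destination)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = parse_destination("s3://my-backups/documents/")
```

This keeps the config as the single source of truth instead of repeating the bucket name elsewhere.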
Step 2: Build the File Collector
from pathlib import Path
from datetime import datetime

def collect_files(config):
    files_to_backup = []
    max_bytes = config["max_file_size_mb"] * 1024 * 1024
    for source_dir in config["source_dirs"]:
        for filepath in Path(source_dir).rglob("*"):
            if not filepath.is_file():
                continue
            if filepath.suffix.lower() not in config["file_types"]:
                continue
            if filepath.stat().st_size > max_bytes:
                continue
            files_to_backup.append(filepath)
    return files_to_backup
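If you run this daily, you may not want to re-upload files that have not changed since the last run. A minimal incremental-backup sketch, assuming a hypothetical `modified_since` helper driven by each file's mtime:

```python
from datetime import datetime, timedelta, timezone

def modified_since(files, cutoff):
    # Keep only files whose modification time is newer than the cutoff.
    cutoff_ts = cutoff.timestamp()
    return [f for f in files if f.stat().st_mtime >= cutoff_ts]

# Example: only back up files touched in the last 24 hours.
# recent = modified_since(files_to_backup, datetime.now(timezone.utc) - timedelta(days=1))
```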
Step 3: Upload to Cloud Storage
Using boto3 for S3:
import boto3
from datetime import datetime

def upload_to_s3(files, bucket, prefix):
    s3 = boto3.client("s3")
    date_prefix = datetime.now().strftime("%Y/%m/%d")
    uploaded = []
    for filepath in files:
        key = f"{prefix}{date_prefix}/{filepath.name}"
        s3.upload_file(str(filepath), bucket, key)
        uploaded.append(key)
        print(f"Uploaded: {key}")
    return uploaded
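One caveat with the keys above: they use only the file's name, so two files named contract.pdf in different source directories would overwrite each other in the bucket. A sketch that keeps each file's path relative to its source directory in the key instead; `relative_key` is an illustration, not part of the code above:

```python
from pathlib import Path

def relative_key(prefix, date_prefix, source_dir, filepath):
    # Keep the subdirectory structure under the source dir in the S3 key,
    # e.g. documents/<date>/acme/contract.pdf instead of documents/<date>/contract.pdf
    rel = Path(filepath).relative_to(source_dir)
    return f"{prefix}{date_prefix}/{rel.as_posix()}"
```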
Step 4: Clean Up Old Backups
Delete backups older than your retention period:
from datetime import datetime, timedelta, timezone

def cleanup_old_backups(bucket, prefix, retention_days):
    s3 = boto3.client("s3")
    # S3 returns timezone-aware LastModified timestamps, so compare against
    # an aware UTC cutoff rather than stripping tzinfo and using local time.
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    paginator = s3.get_paginator("list_objects_v2")
    deleted_count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
                deleted_count += 1
    print(f"Cleaned up {deleted_count} expired backups")
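Deleting one object per request gets slow once a bucket holds thousands of backups. S3's delete_objects call accepts up to 1,000 keys per request, so a batched version is worth sketching; the helper names here are mine:

```python
def chunked(items, size=1000):
    # Yield successive lists of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_delete(bucket, keys):
    import boto3  # imported here so the pure helper above stays importable
    s3 = boto3.client("s3")
    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch]},
        )
```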
Step 5: Run on Schedule
def run_backup():
    files = collect_files(BACKUP_CONFIG)
    print(f"Found {len(files)} files to back up")
    # Matches BACKUP_CONFIG["destination"] = "s3://my-backups/documents/"
    bucket = "my-backups"
    prefix = "documents/"
    uploaded = upload_to_s3(files, bucket, prefix)
    cleanup_old_backups(bucket, prefix, BACKUP_CONFIG["retention_days"])
    print(f"Backup complete: {len(uploaded)} files")

if __name__ == "__main__":
    run_backup()
Schedule with cron for daily runs:
0 2 * * * python3 /path/to/backup_system.py >> /var/log/backup.log 2>&1
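On Windows, the rough equivalent with Task Scheduler from the command line looks like this (the task name and script path are placeholders; adjust for your machine):

```shell
schtasks /create /tn "DocumentBackup" /tr "python C:\scripts\backup_system.py" /sc daily /st 02:00
```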
What to Build Next
Add backup verification that downloads a random sample and checks file integrity after each run. Or add email notifications with a summary of what was backed up and what was cleaned up.
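The verification idea can start small: take a random sample of the keys you just uploaded, download each one, and compare its SHA-256 hash against the local original. `verify_sample` and its signature are a sketch of that, not a finished implementation:

```python
import hashlib
import random

def sha256_of(path):
    # Hash a local file in chunks so large files don't load into memory at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def verify_sample(bucket, uploaded, local_by_key, sample_size=5):
    import boto3  # deferred so sha256_of stays usable without boto3 installed
    s3 = boto3.client("s3")
    for key in random.sample(uploaded, min(sample_size, len(uploaded))):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if hashlib.sha256(body).hexdigest() != sha256_of(local_by_key[key]):
            raise RuntimeError(f"Integrity check failed for {key}")
```

Here `local_by_key` is an assumed mapping from each uploaded S3 key back to its local file path, which you would build during the upload step.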
Related Reading
- The Retry Strategy
- The Documentation Habit
- Scaling from One Automation to an Operations Stack
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment