
How to Automate Document Backup and Archival

Back up and archive business documents automatically on schedule.

Jay Banlasan


The AI Systems Guy

Losing a critical document because nobody backed it up is a problem you should only have once. I built an automated document backup and archival system that runs on schedule, copies files to cloud storage, and cleans up old versions so nothing important gets lost.

This system handles the boring work of automating document backup and archival. The examples below use S3, but the pattern works with any cloud provider.

What You Need Before Starting

Python 3, the boto3 package (pip install boto3), AWS credentials with write access to your backup bucket, and cron (or another scheduler) on the machine that runs the job.

Step 1: Define Your Backup Configuration

BACKUP_CONFIG = {
    "source_dirs": [
        "/home/user/contracts",
        "/home/user/client-briefs",
        "/home/user/sops"
    ],
    "destination": "s3://my-backups/documents/",
    "retention_days": 90,
    "file_types": [".pdf", ".docx", ".md", ".xlsx"],
    "max_file_size_mb": 100
}
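Before the first run, it's worth failing fast on a bad config rather than discovering a typo in a source path at 2 a.m. A minimal validation sketch (the validate_config helper is my own addition, not part of the system above):

```python
from pathlib import Path

def validate_config(config):
    """Return a list of config problems; an empty list means it's safe to run."""
    problems = []
    for source_dir in config["source_dirs"]:
        if not Path(source_dir).is_dir():
            problems.append(f"missing source dir: {source_dir}")
    if config["retention_days"] <= 0:
        problems.append("retention_days must be positive")
    if config["max_file_size_mb"] <= 0:
        problems.append("max_file_size_mb must be positive")
    return problems
```

Call it at the top of the backup run and abort if the list is non-empty.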

Step 2: Build the File Collector

from pathlib import Path
from datetime import datetime

def collect_files(config):
    files_to_backup = []
    max_bytes = config["max_file_size_mb"] * 1024 * 1024

    for source_dir in config["source_dirs"]:
        for filepath in Path(source_dir).rglob("*"):
            if not filepath.is_file():
                continue
            if filepath.suffix.lower() not in config["file_types"]:
                continue
            if filepath.stat().st_size > max_bytes:
                continue
            files_to_backup.append(filepath)

    return files_to_backup
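With daily runs you may only want files that changed since the last run, instead of re-uploading everything. A sketch of an mtime filter that layers on top of the collector's output (the modified_within helper is my own name, not part of the system above):

```python
import time

def modified_within(files, hours=24):
    """Keep only files whose modification time falls within the last `hours`."""
    cutoff = time.time() - hours * 3600
    return [f for f in files if f.stat().st_mtime >= cutoff]
```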

Step 3: Upload to Cloud Storage

Using boto3 for S3:

import boto3
from datetime import datetime

def upload_to_s3(files, bucket, prefix):
    s3 = boto3.client("s3")
    date_prefix = datetime.now().strftime("%Y/%m/%d")
    uploaded = []

    for filepath in files:
        key = f"{prefix}{date_prefix}/{filepath.name}"
        s3.upload_file(str(filepath), bucket, key)
        uploaded.append(key)
        print(f"Uploaded: {key}")

    return uploaded
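s3.upload_file can fail transiently (network blips, throttling), and as written one bad file kills the whole run. One way to harden it is a retry wrapper; this sketch takes the upload as a callable so the retry logic stays testable (the function name and shape are my own, not a boto3 API):

```python
def upload_with_retries(files, upload, max_attempts=3):
    """Attempt `upload(filepath)` up to max_attempts times per file.

    Returns (uploaded, failed), where failed holds (filepath, error) pairs
    so the run can finish and report problems instead of crashing.
    """
    uploaded, failed = [], []
    for filepath in files:
        for attempt in range(1, max_attempts + 1):
            try:
                upload(filepath)
                uploaded.append(filepath)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    failed.append((filepath, str(exc)))
    return uploaded, failed
```

In the real run, `upload` would be a small lambda around s3.upload_file with the same key scheme as above.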

Step 4: Clean Up Old Backups

Delete backups older than your retention period:

import boto3
from datetime import datetime, timedelta, timezone

def cleanup_old_backups(bucket, prefix, retention_days):
    s3 = boto3.client("s3")
    # S3 returns timezone-aware UTC timestamps, so compare in UTC
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

    paginator = s3.get_paginator("list_objects_v2")
    deleted_count = 0

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
                deleted_count += 1

    print(f"Cleaned up {deleted_count} expired backups")
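If your destination is S3, you can also let the bucket expire old objects itself with a lifecycle rule instead of deleting them from Python. This is a sketch of the equivalent rule for the retention settings above (bucket and prefix are the example values from this post):

```python
# Expire objects under documents/ after 90 days, matching retention_days above
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "expire-document-backups",
            "Filter": {"Prefix": "documents/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        }
    ]
}

def apply_lifecycle(bucket):
    # boto3 imported here so the rule can be inspected without it installed
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_RULES
    )
```

The tradeoff: lifecycle rules are fire-and-forget, but the Python cleanup gives you a log line of exactly what was deleted.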

Step 5: Run on Schedule

def run_backup():
    files = collect_files(BACKUP_CONFIG)
    print(f"Found {len(files)} files to back up")

    # bucket and prefix must match the "destination" in BACKUP_CONFIG
    bucket = "my-backups"
    prefix = "documents/"

    uploaded = upload_to_s3(files, bucket, prefix)
    cleanup_old_backups(bucket, prefix, BACKUP_CONFIG["retention_days"])

    print(f"Backup complete: {len(uploaded)} files")

if __name__ == "__main__":
    run_backup()

Schedule with cron for daily runs:

0 2 * * * python3 /path/to/backup_system.py >> /var/log/backup.log 2>&1
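If a run ever takes longer than a day (a large first backup, a slow network), cron will happily start a second copy on top of the first. On Linux you can guard against overlapping runs with flock, which skips the new run if the previous one still holds the lock:

```
0 2 * * * flock -n /tmp/backup.lock python3 /path/to/backup_system.py >> /var/log/backup.log 2>&1
```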

What to Build Next

Add backup verification that downloads a random sample and checks file integrity after each run. Or add email notifications with a summary of what was backed up and what was cleaned up.
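The verification idea can be sketched with checksums. This version takes a fetch_backup callable that returns a local path to the downloaded copy (in practice it would wrap s3.download_file into a temp directory); the helper names are mine, not part of the system above:

```python
import hashlib
import random

def sha256_of(path):
    """Stream a file through SHA-256 so large documents don't load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sample(local_files, fetch_backup, sample_size=3):
    """Checksum a random sample against the backup copies; return mismatches."""
    sample = random.sample(list(local_files), min(sample_size, len(local_files)))
    return [f for f in sample if sha256_of(f) != sha256_of(fetch_backup(f))]
```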


Want this system built for your business?

Get a free assessment. We will map every system your business needs and show you the ROI.
