How to Automate Document Backup and Archival
Back up and archive business documents automatically on schedule.
Jay Banlasan
The AI Systems Guy
Losing a critical document because nobody backed it up is a problem you should only have once. I built an automated document backup and archival system that runs on schedule, copies files to cloud storage, and cleans up old versions so nothing important gets lost.
This system handles the boring work of automated document backup and archival, and the same approach works with any cloud provider.
What You Need Before Starting
- Python 3.8+
- Cloud storage credentials (AWS S3, Google Cloud Storage, or Backblaze B2)
- A schedule runner (cron on Linux, Task Scheduler on Windows)
Step 1: Define Your Backup Configuration
BACKUP_CONFIG = {
    "source_dirs": [
        "/home/user/contracts",
        "/home/user/client-briefs",
        "/home/user/sops"
    ],
    "destination": "s3://my-backups/documents/",
    "retention_days": 90,
    "file_types": [".pdf", ".docx", ".md", ".xlsx"],
    "max_file_size_mb": 100
}
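The destination is written as an s3:// URI, while the later steps take a bucket and prefix separately. A small helper can split one into the other; `parse_destination` is my name for it, not part of the config above:

```python
from urllib.parse import urlparse

def parse_destination(destination):
    # "s3://my-backups/documents/" -> ("my-backups", "documents/")
    parsed = urlparse(destination)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = parse_destination("s3://my-backups/documents/")
```

This keeps the config as the single source of truth instead of repeating the bucket name elsewhere.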
Step 2: Build the File Collector
from pathlib import Path
from datetime import datetime

def collect_files(config):
    files_to_backup = []
    max_bytes = config["max_file_size_mb"] * 1024 * 1024
    for source_dir in config["source_dirs"]:
        for filepath in Path(source_dir).rglob("*"):
            if not filepath.is_file():
                continue
            if filepath.suffix.lower() not in config["file_types"]:
                continue
            if filepath.stat().st_size > max_bytes:
                continue
            files_to_backup.append(filepath)
    return files_to_backup
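If you run this daily, you may not want to re-upload files that have not changed since the last run. A minimal incremental-backup sketch, assuming a hypothetical `modified_since` helper driven by each file's mtime:

```python
from datetime import datetime, timedelta, timezone

def modified_since(files, cutoff):
    # Keep only files whose modification time is newer than the cutoff.
    cutoff_ts = cutoff.timestamp()
    return [f for f in files if f.stat().st_mtime >= cutoff_ts]

# Example: only back up files touched in the last 24 hours.
# recent = modified_since(files_to_backup, datetime.now(timezone.utc) - timedelta(days=1))
```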
Step 3: Upload to Cloud Storage
Using boto3 for S3:
import boto3
from datetime import datetime

def upload_to_s3(files, bucket, prefix):
    s3 = boto3.client("s3")
    date_prefix = datetime.now().strftime("%Y/%m/%d")
    uploaded = []
    for filepath in files:
        key = f"{prefix}{date_prefix}/{filepath.name}"
        s3.upload_file(str(filepath), bucket, key)
        uploaded.append(key)
        print(f"Uploaded: {key}")
    return uploaded
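One caveat with the keys above: they use only the file's name, so two files named contract.pdf in different source directories would overwrite each other in the bucket. A sketch that keeps each file's path relative to its source directory in the key instead; `relative_key` is an illustration, not part of the code above:

```python
from pathlib import Path

def relative_key(prefix, date_prefix, source_dir, filepath):
    # Keep the subdirectory structure under the source dir in the S3 key,
    # e.g. documents/<date>/acme/contract.pdf instead of documents/<date>/contract.pdf
    rel = Path(filepath).relative_to(source_dir)
    return f"{prefix}{date_prefix}/{rel.as_posix()}"
```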
Step 4: Clean Up Old Backups
Delete backups older than your retention period:
from datetime import datetime, timedelta, timezone

def cleanup_old_backups(bucket, prefix, retention_days):
    s3 = boto3.client("s3")
    # S3 returns timezone-aware LastModified timestamps, so compare against
    # an aware UTC cutoff rather than stripping tzinfo and using local time.
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    paginator = s3.get_paginator("list_objects_v2")
    deleted_count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
                deleted_count += 1
    print(f"Cleaned up {deleted_count} expired backups")
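Deleting one object per request gets slow once a bucket holds thousands of backups. S3's delete_objects call accepts up to 1,000 keys per request, so a batched version is worth sketching; the helper names here are mine:

```python
def chunked(items, size=1000):
    # Yield successive lists of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_delete(bucket, keys):
    import boto3  # imported here so the pure helper above stays importable
    s3 = boto3.client("s3")
    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch]},
        )
```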
Step 5: Run on Schedule
def run_backup():
    files = collect_files(BACKUP_CONFIG)
    print(f"Found {len(files)} files to back up")
    # Matches BACKUP_CONFIG["destination"] = "s3://my-backups/documents/"
    bucket = "my-backups"
    prefix = "documents/"
    uploaded = upload_to_s3(files, bucket, prefix)
    cleanup_old_backups(bucket, prefix, BACKUP_CONFIG["retention_days"])
    print(f"Backup complete: {len(uploaded)} files")

if __name__ == "__main__":
    run_backup()
Schedule with cron for daily runs:
0 2 * * * python3 /path/to/backup_system.py >> /var/log/backup.log 2>&1
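On Windows, the rough equivalent with Task Scheduler from the command line looks like this (the task name and script path are placeholders; adjust for your machine):

```shell
schtasks /create /tn "DocumentBackup" /tr "python C:\scripts\backup_system.py" /sc daily /st 02:00
```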
What to Build Next
Add backup verification that downloads a random sample and checks file integrity after each run. Or add email notifications with a summary of what was backed up and what was cleaned up.
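The verification idea can start small: take a random sample of the keys you just uploaded, download each one, and compare its SHA-256 hash against the local original. `verify_sample` and its signature are a sketch of that, not a finished implementation:

```python
import hashlib
import random

def sha256_of(path):
    # Hash a local file in chunks so large files don't load into memory at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def verify_sample(bucket, uploaded, local_by_key, sample_size=5):
    import boto3  # deferred so sha256_of stays usable without boto3 installed
    s3 = boto3.client("s3")
    for key in random.sample(uploaded, min(sample_size, len(uploaded))):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if hashlib.sha256(body).hexdigest() != sha256_of(local_by_key[key]):
            raise RuntimeError(f"Integrity check failed for {key}")
```

Here `local_by_key` is an assumed mapping from each uploaded S3 key back to its local file path, which you would build during the upload step.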
Related Reading
- The Retry Strategy
- The Documentation Habit
- Scaling from One Automation to an Operations Stack
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment