How to Create an Automated Incident Response System

Automate incident detection, notification, and initial response.

Jay Banlasan

The AI Systems Guy

An automated incident response system handles the first minutes of an outage without you: detection, notification, initial triage, and sometimes auto-recovery. I built this after waking up to a 3-hour outage that could have been a 3-minute blip if the system had restarted the crashed service automatically.

What You Need Before Starting

Python 3.8+
Health check system already running (see system 596)
Slack webhook for notifications
SSH access to your server
A list of services and their restart commands

Step 1: Define Your Incident Playbooks

PLAYBOOKS = {
    "Main API": {
        "severity": "critical",
        "restart_command": "systemctl restart fastapi-app",
        "auto_restart": True,
        "max_auto_restarts": 3,
        "escalation_after_minutes": 5,
        "notify": ["slack", "sms"]
    },
    "Database": {
        "severity": "critical",
        "restart_command": "systemctl restart postgresql",
        "auto_restart": False,
        "escalation_after_minutes": 2,
        "notify": ["slack", "sms"]
    },
    "Cron Worker": {
        "severity": "warning",
        "restart_command": "systemctl restart cron-worker",
        "auto_restart": True,
        "max_auto_restarts": 5,
        "escalation_after_minutes": 15,
        "notify": ["slack"]
    }
}
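A typo in a playbook key would otherwise only surface in the middle of an outage, so it can help to check the dictionary once at startup. The following is a minimal sketch that assumes the key names and the two severity levels used above; `validate_playbooks`, `REQUIRED_KEYS`, and `VALID_SEVERITIES` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: key names and severities are taken from the playbooks above
REQUIRED_KEYS = {"severity", "restart_command", "auto_restart",
                 "escalation_after_minutes", "notify"}
VALID_SEVERITIES = {"critical", "warning"}

def validate_playbooks(playbooks):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, pb in playbooks.items():
        missing = REQUIRED_KEYS - set(pb)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        if pb.get("severity") not in VALID_SEVERITIES:
            problems.append(f"{name}: unknown severity {pb.get('severity')!r}")
        # Services that restart automatically should carry an explicit cap
        if pb.get("auto_restart") and "max_auto_restarts" not in pb:
            problems.append(f"{name}: auto_restart is on but max_auto_restarts is not set")
    return problems
```

Calling `validate_playbooks(PLAYBOOKS)` when the monitor starts and refusing to run on a non-empty result is one way to use it.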

Step 2: Build the Incident Tracker

import sqlite3
from datetime import datetime, timezone

def init_incident_db(db_path="incidents.db"):
    # Create the incidents table on first run; safe to call repeatedly
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            service_name TEXT,
            severity TEXT,
            status TEXT DEFAULT 'open',
            auto_restarts INTEGER DEFAULT 0,
            detected_at TEXT,
            resolved_at TEXT,
            resolution TEXT
        )
    """)
    conn.commit()
    conn.close()

def open_incident(service_name, severity, db_path="incidents.db"):
    # Record a new open incident and return its id
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(
        "INSERT INTO incidents (service_name, severity, detected_at) VALUES (?,?,?)",
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        (service_name, severity, datetime.now(timezone.utc).isoformat())
    )
    conn.commit()
    incident_id = cursor.lastrowid
    conn.close()
    return incident_id
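The table has `resolved_at` and `resolution` columns, but nothing above writes to them, and nothing looks up an incident that is already open. Here is a minimal sketch of both, assuming the schema from this step. The function names `find_open_incident` and `resolve_incident`, and the `'resolved'` status value, are my own assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

def find_open_incident(service_name, db_path="incidents.db"):
    """Return the id of the newest open incident for a service, or None."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT id FROM incidents WHERE service_name = ? AND status = 'open' "
            "ORDER BY id DESC LIMIT 1",
            (service_name,),
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None

def resolve_incident(incident_id, resolution, db_path="incidents.db"):
    """Mark an incident resolved and record how it was fixed."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            # 'resolved' is an assumed status value; the schema only defines 'open'
            "UPDATE incidents SET status = 'resolved', resolved_at = ?, resolution = ? "
            "WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), resolution, incident_id),
        )
        conn.commit()
    finally:
        conn.close()
```

A failure handler can call `find_open_incident` first and only open a new incident when it returns `None`. That way repeated failed checks add to the same `auto_restarts` counter instead of starting a fresh one at zero each time.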

Step 3: Build the Auto-Recovery Logic

import shlex
import subprocess

def attempt_auto_restart(service_name, incident_id, db_path="incidents.db"):
    playbook = PLAYBOOKS.get(service_name)
    if not playbook or not playbook.get("auto_restart"):
        return False

    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT auto_restarts FROM incidents WHERE id = ?", (incident_id,)
        ).fetchone()

        # Unknown incident, or restart budget already spent: leave it to a human
        if row is None or row[0] >= playbook.get("max_auto_restarts", 3):
            return False

        # shlex.split keeps quoted arguments in the restart command intact
        result = subprocess.run(
            shlex.split(playbook["restart_command"]),
            capture_output=True, text=True
        )

        # Count the attempt whether or not the restart succeeded
        conn.execute(
            "UPDATE incidents SET auto_restarts = auto_restarts + 1 WHERE id = ?",
            (incident_id,)
        )
        conn.commit()
    finally:
        conn.close()

    return result.returncode == 0
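A zero exit code from the restart command does not guarantee the service stays up; it can crash again a few seconds later. One option is to poll `systemctl is-active --quiet`, which exits with 0 when the unit is active. This is a sketch under assumptions: `verify_service_active`, the `unit_name` argument, and the injectable `runner` and `sleep` parameters are hypothetical and exist mainly so the function can be tested without systemd.

```python
import subprocess
import time

def verify_service_active(unit_name, attempts=3, delay_seconds=2.0,
                          runner=subprocess.run, sleep=time.sleep):
    """Poll `systemctl is-active` until the unit is active or attempts run out."""
    for attempt in range(attempts):
        result = runner(
            ["systemctl", "is-active", "--quiet", unit_name],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True
        # Wait before the next poll, but not after the final one
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

For the Main API playbook above, the call would be `verify_service_active("fastapi-app")` shortly after the restart.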

Step 4: Add Notification Routing

import requests

def notify_incident(service_name, incident_id, message):
    playbook = PLAYBOOKS.get(service_name, {})
    channels = playbook.get("notify", ["slack"])
    
    if "slack" in channels:
        severity_emoji = {"critical": "🔴", "warning": "🟡"}.get(playbook.get("severity", "warning"), "⚪")
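        # Replace YOUR_SLACK_WEBHOOK with your incoming webhook URL.
        # The timeout keeps a slow Slack response from stalling incident handling.
        requests.post("YOUR_SLACK_WEBHOOK", json={
            "text": f"{severity_emoji} Incident #{incident_id} - {service_name}\n{message}"
        }, timeout=10)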
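The playbooks list `"sms"` as a channel, but the function above only handles Slack. One way to route to several channels is a small dispatcher that maps channel names to sender callables. This is a sketch only: `dispatch_notification` and the `senders` mapping are hypothetical, and the SMS sender is left to you because this guide does not name an SMS provider.

```python
def dispatch_notification(channels, text, senders):
    """Send text to each requested channel; return {channel: succeeded}."""
    results = {}
    for channel in channels:
        sender = senders.get(channel)
        if sender is None:
            # No sender registered for this channel name
            results[channel] = False
            continue
        try:
            sender(text)
            results[channel] = True
        except Exception:
            # A failing channel should not block the remaining ones
            results[channel] = False
    return results
```

With a mapping such as `{"slack": post_to_slack, "sms": send_sms}` (both functions hypothetical), the returned dictionary tells you which channels actually delivered, which is worth logging during an incident.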

Step 5: Wire It Into Health Checks

def handle_failure(service_name):
    playbook = PLAYBOOKS.get(service_name, {})
    severity = playbook.get("severity", "warning")
    
    incident_id = open_incident(service_name, severity)
    notify_incident(service_name, incident_id, f"{service_name} is DOWN. Incident opened.")
    
    restarted = attempt_auto_restart(service_name, incident_id)
    if restarted:
        notify_incident(service_name, incident_id, f"Auto-restart command succeeded for {service_name}. Confirm with the next health check.")
    else:
        notify_incident(service_name, incident_id, f"Manual intervention needed for {service_name}.")
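Each playbook sets `escalation_after_minutes`, but nothing in the handler reads it yet. A small pure function can decide when an open incident is overdue. This is a minimal sketch, assuming `detected_at` is stored as an ISO 8601 string as in Step 2; `should_escalate` is a hypothetical name, and timestamps without an offset are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

def should_escalate(detected_at_iso, escalation_after_minutes, now=None):
    """True once an incident has been open for at least the playbook's limit."""
    detected = datetime.fromisoformat(detected_at_iso)
    if detected.tzinfo is None:
        # Assumption: timestamps stored without an offset are UTC
        detected = detected.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - detected >= timedelta(minutes=escalation_after_minutes)
```

Running this against every open incident on each health-check cycle gives you a trigger for the louder channels, for example sending the SMS only once the limit has passed.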

What to Build Next

Add post-incident reports that auto-generate a timeline from the incident database. Include detection time, restart attempts, and resolution. That gives you data to improve the playbooks over time.
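As a starting point, a report can be built from a single row of the incidents table. This sketch assumes the schema from Step 2; `incident_report` and `_parse_utc` are hypothetical names. Because the table stores only a restart count, not a timestamp per attempt, the report summarizes rather than giving a full timeline.

```python
import sqlite3
from datetime import datetime, timezone

def _parse_utc(ts):
    # Assumption: timestamps stored without an offset are UTC
    dt = datetime.fromisoformat(ts)
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)

def incident_report(incident_id, db_path="incidents.db"):
    """Return a plain-text summary of one incident, or None if it is unknown."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT service_name, severity, status, auto_restarts, "
            "detected_at, resolved_at, resolution FROM incidents WHERE id = ?",
            (incident_id,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return None
    service, severity, status, restarts, detected_at, resolved_at, resolution = row
    lines = [
        f"Incident #{incident_id}: {service} ({severity})",
        f"Detected: {detected_at}",
        f"Auto-restart attempts: {restarts}",
    ]
    if resolved_at:
        delta = _parse_utc(resolved_at) - _parse_utc(detected_at)
        minutes = delta.total_seconds() / 60
        lines.append(f"Resolved: {resolved_at} after {minutes:.1f} minutes")
        lines.append(f"Resolution: {resolution or 'not recorded'}")
    else:
        lines.append(f"Status: {status} (not yet resolved)")
    return "\n".join(lines)
```

Adding a separate events table with one row per restart attempt would be the natural next step if you want a true timeline.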
