How to Create an Automated Incident Response System

Automate incident detection, notification, and initial response.

Jay Banlasan

The AI Systems Guy

An automated incident response system handles the first minutes of an outage without you: detection, notification, initial triage, and sometimes even auto-recovery. I built this after waking up to a 3-hour outage that could have been a 3-minute blip if the system had restarted the crashed service automatically.

What You Need Before Starting

Step 1: Define Your Incident Playbooks

PLAYBOOKS = {
    "Main API": {
        "severity": "critical",
        "restart_command": "systemctl restart fastapi-app",
        "auto_restart": True,
        "max_auto_restarts": 3,
        "escalation_after_minutes": 5,
        "notify": ["slack", "sms"]
    },
    "Database": {
        "severity": "critical",
        "restart_command": "systemctl restart postgresql",
        "auto_restart": False,
        "escalation_after_minutes": 2,
        "notify": ["slack", "sms"]
    },
    "Cron Worker": {
        "severity": "warning",
        "restart_command": "systemctl restart cron-worker",
        "auto_restart": True,
        "max_auto_restarts": 5,
        "escalation_after_minutes": 15,
        "notify": ["slack"]
    }
}
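
A misspelled key in a playbook would only surface during an incident, so it may be worth checking the dictionary once at startup. The sketch below is one possible check. The function name `validate_playbooks`, the required-key set, and the sample entry are assumptions for illustration, not part of the original system.

```python
# Hypothetical startup check for a PLAYBOOKS-style dictionary.
REQUIRED_KEYS = {"severity", "restart_command", "auto_restart",
                 "escalation_after_minutes", "notify"}
KNOWN_SEVERITIES = {"critical", "warning"}

def validate_playbooks(playbooks):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, playbook in playbooks.items():
        missing = REQUIRED_KEYS - set(playbook)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        if playbook.get("severity") not in KNOWN_SEVERITIES:
            problems.append(f"{name}: unknown severity {playbook.get('severity')!r}")
        # A service that restarts automatically should have a restart budget.
        if playbook.get("auto_restart") and "max_auto_restarts" not in playbook:
            problems.append(f"{name}: auto_restart is on but max_auto_restarts is not set")
    return problems

# Sample entry with max_auto_restarts deliberately left out.
sample = {
    "Main API": {
        "severity": "critical",
        "restart_command": "systemctl restart fastapi-app",
        "auto_restart": True,
        "escalation_after_minutes": 5,
        "notify": ["slack", "sms"],
    }
}
for problem in validate_playbooks(sample):
    print(problem)
```

Returning a list instead of raising on the first problem lets you report every issue in one pass.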

Step 2: Build the Incident Tracker

import sqlite3
from datetime import datetime, timezone

def init_incident_db(db_path="incidents.db"):
    """Create the incidents table if it does not already exist."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            service_name TEXT,
            severity TEXT,
            status TEXT DEFAULT 'open',
            auto_restarts INTEGER DEFAULT 0,
            detected_at TEXT,
            resolved_at TEXT,
            resolution TEXT
        )
    """)
    conn.commit()
    conn.close()

def open_incident(service_name, severity, db_path="incidents.db"):
    """Record a new open incident and return its id."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(
        "INSERT INTO incidents (service_name, severity, detected_at) VALUES (?,?,?)",
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        (service_name, severity, datetime.now(timezone.utc).isoformat())
    )
    conn.commit()
    # The cursor already knows the id of the row it inserted.
    incident_id = cursor.lastrowid
    conn.close()
    return incident_id
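
The incidents table has `status`, `resolved_at`, and `resolution` columns, but the functions in this step never write to them. A minimal sketch of two helpers that could fill that gap follows. The names `find_open_incident` and `resolve_incident` are hypothetical, and the sketch assumes the table created by `init_incident_db` already exists.

```python
import sqlite3
from datetime import datetime, timezone

def find_open_incident(service_name, db_path="incidents.db"):
    """Return the id of the newest open incident for a service, or None."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT id FROM incidents WHERE service_name = ? AND status = 'open' "
        "ORDER BY id DESC LIMIT 1",
        (service_name,)
    ).fetchone()
    conn.close()
    return row[0] if row else None

def resolve_incident(incident_id, resolution, db_path="incidents.db"):
    """Mark an open incident as resolved; return True if a row was updated."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(
        "UPDATE incidents SET status = 'resolved', resolved_at = ?, resolution = ? "
        "WHERE id = ? AND status = 'open'",
        (datetime.now(timezone.utc).isoformat(), resolution, incident_id)
    )
    updated = cursor.rowcount == 1
    conn.commit()
    conn.close()
    return updated
```

Checking `find_open_incident` before opening a new incident would let repeated failures of the same service share one row, which is what allows the `auto_restarts` counter to accumulate.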

Step 3: Build the Auto-Recovery Logic

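import shlex
import subprocess

def attempt_auto_restart(service_name, incident_id, db_path="incidents.db"):
    """Run the playbook's restart command once; return True if it exited with 0."""
    playbook = PLAYBOOKS.get(service_name)
    if not playbook or not playbook.get("auto_restart"):
        return False
    
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT auto_restarts FROM incidents WHERE id = ?", (incident_id,)
    ).fetchone()
    
    # Stop if the incident is unknown or the restart budget is spent.
    if row is None or row[0] >= playbook.get("max_auto_restarts", 3):
        conn.close()
        return False
    
    # shlex.split respects quoting in the command; str.split() does not.
    result = subprocess.run(
        shlex.split(playbook["restart_command"]),
        capture_output=True, text=True
    )
    
    # Count the attempt whether or not the command succeeded.
    conn.execute(
        "UPDATE incidents SET auto_restarts = auto_restarts + 1 WHERE id = ?",
        (incident_id,)
    )
    conn.commit()
    conn.close()
    
    # Exit code 0 means the command ran cleanly; it does not by itself
    # confirm that the service stays healthy afterwards.
    return result.returncode == 0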
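
A restart command that exits cleanly does not guarantee the service stays up, so a follow-up check can be useful. The sketch below polls `systemctl is-active --quiet`, which exits with 0 when the unit is active. The helper names, the retry count, and the delay are assumptions, and the `runner` and `sleep` parameters exist only so the logic can be exercised without systemd.

```python
import subprocess
import time

def service_is_active(unit, runner=subprocess.run):
    """Return True if `systemctl is-active --quiet <unit>` exits with 0."""
    result = runner(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0

def wait_until_active(unit, attempts=5, delay_seconds=2.0,
                      runner=subprocess.run, sleep=time.sleep):
    """Poll the unit a few times; return True as soon as it reports active."""
    for attempt in range(attempts):
        if service_is_active(unit, runner):
            return True
        # Skip the pause after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

For the playbooks in Step 1, the unit name would be the last word of the restart command, for example `wait_until_active("fastapi-app")`.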

Step 4: Add Notification Routing

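import requests

def notify_incident(service_name, incident_id, message):
    """Send the incident message to the channels listed in the playbook."""
    playbook = PLAYBOOKS.get(service_name, {})
    channels = playbook.get("notify", ["slack"])
    
    if "slack" in channels:
        severity_emoji = {"critical": "🔴", "warning": "🟡"}.get(playbook.get("severity", "warning"), "⚪")
        try:
            # Replace YOUR_SLACK_WEBHOOK with your incoming-webhook URL.
            requests.post("YOUR_SLACK_WEBHOOK", json={
                "text": f"{severity_emoji} Incident #{incident_id} - {service_name}\n{message}"
            }, timeout=10)
        except requests.RequestException as exc:
            # A failed notification should not block the restart attempt.
            print(f"Slack notification failed: {exc}")
    # Note: "sms" appears in the playbooks, but no SMS sender is implemented here.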
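
The playbooks list both "slack" and "sms", while the function above only handles Slack. One way to make the routing explicit is a small handler registry, sketched below. Every name here (`send_slack`, `send_sms`, `CHANNEL_HANDLERS`, `route_notification`) is hypothetical, and both senders are print-only placeholders because the article does not name an SMS provider.

```python
def send_slack(text):
    # Placeholder: post `text` to your Slack webhook here.
    print(f"[slack] {text}")

def send_sms(text):
    # Placeholder: call your SMS provider here.
    print(f"[sms] {text}")

CHANNEL_HANDLERS = {"slack": send_slack, "sms": send_sms}

def route_notification(channels, text, handlers=CHANNEL_HANDLERS):
    """Send text to each channel; return the channels that failed."""
    failed = []
    for channel in channels:
        handler = handlers.get(channel)
        if handler is None:
            # Unknown channel name, for example a typo in a playbook.
            failed.append(channel)
            continue
        try:
            handler(text)
        except Exception:
            # One broken channel should not stop the others.
            failed.append(channel)
    return failed

print(route_notification(["slack", "sms"], "Main API is DOWN."))
```

Returning the failed channels gives the caller a chance to fall back to another channel when the primary one is unreachable.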

Step 5: Wire It Into Health Checks

def handle_failure(service_name):
    """Open an incident, notify, then try one automatic restart."""
    playbook = PLAYBOOKS.get(service_name, {})
    severity = playbook.get("severity", "warning")
    
    # Each call opens a new incident, so call this once per outage,
    # not on every failed health check.
    incident_id = open_incident(service_name, severity)
    notify_incident(service_name, incident_id, f"{service_name} is DOWN. Incident opened.")
    
    restarted = attempt_auto_restart(service_name, incident_id)
    if restarted:
        # True means the restart command exited with 0.
        notify_incident(service_name, incident_id, f"Auto-restart command succeeded for {service_name}.")
    else:
        notify_incident(service_name, incident_id, f"Manual intervention needed for {service_name}.")
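
Each playbook sets `escalation_after_minutes`, but none of the steps read it. A minimal sketch of the timing check is below, assuming `detected_at` is stored as an ISO 8601 string, as `open_incident` does. The function name `should_escalate` is hypothetical, and what escalation actually does (paging someone, sending SMS) is left to you.

```python
from datetime import datetime, timedelta, timezone

def should_escalate(detected_at, escalation_after_minutes, now=None):
    """Return True once an incident has been open for its escalation window."""
    detected = datetime.fromisoformat(detected_at)
    if detected.tzinfo is None:
        # Timestamps stored without an offset are treated as UTC.
        detected = detected.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - detected >= timedelta(minutes=escalation_after_minutes)

detected = "2024-01-01T12:00:00"
four_minutes_in = datetime(2024, 1, 1, 12, 4, tzinfo=timezone.utc)
six_minutes_in = datetime(2024, 1, 1, 12, 6, tzinfo=timezone.utc)
print(should_escalate(detected, 5, now=four_minutes_in))  # False
print(should_escalate(detected, 5, now=six_minutes_in))   # True
```

A periodic job could run this against every open incident and send a louder notification when it returns True.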

What to Build Next

Add post-incident reports that auto-generate a timeline from the incident database. Include detection time, restart attempts, and resolution. That gives you data to improve the playbooks over time.
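
As a starting point, a report for a single incident can be read straight from the incidents table. The sketch below assumes the schema from Step 2. The function name `build_incident_report` and the output layout are assumptions.

```python
import sqlite3

def build_incident_report(incident_id, db_path="incidents.db"):
    """Return a plain-text summary of one incident, or None if it is missing."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT service_name, severity, status, auto_restarts, "
        "detected_at, resolved_at, resolution FROM incidents WHERE id = ?",
        (incident_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return None
    service, severity, status, restarts, detected, resolved, resolution = row
    lines = [
        f"Incident #{incident_id}: {service} ({severity})",
        f"Status: {status}",
        f"Detected: {detected}",
        f"Auto-restart attempts: {restarts}",
        f"Resolved: {resolved or 'not yet'}",
        f"Resolution: {resolution or 'n/a'}",
    ]
    return "\n".join(lines)
```

The table stores only a count of restart attempts, not when each one happened, so a true timeline would need an extra table of timestamped events.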
