How to Create an Automated Incident Response System
Automate incident detection, notification, and initial response.
Jay Banlasan
The AI Systems Guy
An automated incident response system handles the first minutes of an outage without you. Detection, notification, initial triage, and sometimes even auto-recovery. I built this after waking up to a 3-hour outage that could have been a 3-minute blip if the system had restarted the crashed service automatically.
What You Need Before Starting
- Python 3.8+
- Health check system already running (see system 596)
- Slack webhook for notifications
- SSH access to your server
- A list of services and their restart commands
Step 1: Define Your Incident Playbooks
PLAYBOOKS = {
    "Main API": {
        "severity": "critical",
        "restart_command": "systemctl restart fastapi-app",
        "auto_restart": True,
        "max_auto_restarts": 3,
        "escalation_after_minutes": 5,
        "notify": ["slack", "sms"]
    },
    "Database": {
        "severity": "critical",
        "restart_command": "systemctl restart postgresql",
        "auto_restart": False,
        "escalation_after_minutes": 2,
        "notify": ["slack", "sms"]
    },
    "Cron Worker": {
        "severity": "warning",
        "restart_command": "systemctl restart cron-worker",
        "auto_restart": True,
        "max_auto_restarts": 5,
        "escalation_after_minutes": 15,
        "notify": ["slack"]
    }
}
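A typo in a playbook entry surfaces at the worst possible time: mid-incident. A small validation pass at startup catches that early. This is a sketch; the required-key set mirrors the structure above, and the helper name is my own:

```python
# Keys every playbook entry should carry (matches the structure above).
REQUIRED_KEYS = {"severity", "restart_command", "auto_restart",
                 "escalation_after_minutes", "notify"}

def validate_playbooks(playbooks):
    """Return a list of problems so a typo fails at startup, not mid-incident."""
    problems = []
    for name, pb in playbooks.items():
        missing = REQUIRED_KEYS - pb.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        if pb.get("auto_restart") and "max_auto_restarts" not in pb:
            problems.append(f"{name}: auto_restart enabled but no max_auto_restarts")
        if pb.get("severity") not in ("critical", "warning"):
            problems.append(f"{name}: unknown severity {pb.get('severity')!r}")
    return problems
```

Run it once when the monitor boots and refuse to start if it returns anything.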
Step 2: Build the Incident Tracker
import sqlite3
from datetime import datetime, timezone

def init_incident_db(db_path="incidents.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS incidents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            service_name TEXT,
            severity TEXT,
            status TEXT DEFAULT 'open',
            auto_restarts INTEGER DEFAULT 0,
            detected_at TEXT,
            resolved_at TEXT,
            resolution TEXT
        )
    """)
    conn.commit()
    conn.close()

def open_incident(service_name, severity, db_path="incidents.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(
        "INSERT INTO incidents (service_name, severity, detected_at) VALUES (?, ?, ?)",
        (service_name, severity, datetime.now(timezone.utc).isoformat())
    )
    incident_id = cursor.lastrowid  # grab the new row id before closing the connection
    conn.commit()
    conn.close()
    return incident_id
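The schema has resolved_at and resolution columns, so you also want a function to close incidents out. A minimal sketch, assuming the Step 2 schema (the function name is my own):

```python
import sqlite3
from datetime import datetime, timezone

def resolve_incident(incident_id, resolution, db_path="incidents.db"):
    """Mark an incident resolved with a timestamp and a short resolution note."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "UPDATE incidents SET status = 'resolved', resolved_at = ?, resolution = ? "
        "WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), resolution, incident_id),
    )
    conn.commit()
    conn.close()
```

Call it from your health checks when a failed service comes back up, or manually after a hands-on fix.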
Step 3: Build the Auto-Recovery Logic
import subprocess

def attempt_auto_restart(service_name, incident_id, db_path="incidents.db"):
    playbook = PLAYBOOKS.get(service_name)
    if not playbook or not playbook.get("auto_restart"):
        return False
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT auto_restarts FROM incidents WHERE id = ?", (incident_id,)
    ).fetchone()
    # Unknown incident or restart budget exhausted: hand off to a human.
    if row is None or row[0] >= playbook.get("max_auto_restarts", 3):
        conn.close()
        return False
    result = subprocess.run(
        playbook["restart_command"].split(),
        capture_output=True, text=True
    )
    conn.execute(
        "UPDATE incidents SET auto_restarts = auto_restarts + 1 WHERE id = ?",
        (incident_id,)
    )
    conn.commit()
    conn.close()
    return result.returncode == 0
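Before pointing this at real services, it is worth dry-running the max_auto_restarts cap with a harmless command. In this sketch, echo stands in for systemctl and the playbook dict is a local stand-in, so nothing gets restarted for real:

```python
import sqlite3
import subprocess

# Local stand-in playbook: "echo" replaces systemctl so the restart path
# and the max_auto_restarts cap can be exercised safely.
DRY_RUN_PLAYBOOKS = {
    "Main API": {
        "auto_restart": True,
        "max_auto_restarts": 3,
        "restart_command": "echo restart fastapi-app",
    }
}

def attempt_auto_restart_dry(service_name, incident_id, db_path="incidents.db"):
    playbook = DRY_RUN_PLAYBOOKS.get(service_name)
    if not playbook or not playbook.get("auto_restart"):
        return False
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT auto_restarts FROM incidents WHERE id = ?", (incident_id,)
    ).fetchone()
    if row is None or row[0] >= playbook.get("max_auto_restarts", 3):
        conn.close()
        return False
    result = subprocess.run(
        playbook["restart_command"].split(), capture_output=True, text=True
    )
    conn.execute(
        "UPDATE incidents SET auto_restarts = auto_restarts + 1 WHERE id = ?",
        (incident_id,),
    )
    conn.commit()
    conn.close()
    return result.returncode == 0
```

Three attempts should succeed, then the cap kicks in and every further attempt returns False.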
Step 4: Add Notification Routing
import requests

def notify_incident(service_name, incident_id, message):
    playbook = PLAYBOOKS.get(service_name, {})
    channels = playbook.get("notify", ["slack"])
    if "slack" in channels:
        severity = playbook.get("severity", "warning")
        emoji = {"critical": "🔴", "warning": "🟡"}.get(severity, "⚪")
        requests.post("YOUR_SLACK_WEBHOOK", json={
            "text": f"{emoji} Incident #{incident_id} - {service_name}\n{message}"
        }, timeout=10)
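The playbooks above also route to sms, which the function so far ignores. One way to keep channels pluggable is a registry of sender callables, so adding a channel never touches the routing code. A sketch with hypothetical names; the sms sender is a stub you would back with your provider's API:

```python
SENDERS = {}  # channel name -> callable(text)

def register_sender(channel):
    def wrap(fn):
        SENDERS[channel] = fn
        return fn
    return wrap

@register_sender("slack")
def send_slack(text):
    # real call: requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    print(f"[slack] {text}")

@register_sender("sms")
def send_sms(text):
    # stub: wire this to your SMS provider
    print(f"[sms] {text}")

def route_notification(playbook, incident_id, service_name, message):
    """Fan one incident message out to every channel the playbook lists."""
    text = f"Incident #{incident_id} - {service_name}: {message}"
    for channel in playbook.get("notify", ["slack"]):
        sender = SENDERS.get(channel)
        if sender:
            sender(text)
```

Unknown channels are silently skipped here; you may prefer to log them so a misspelled channel name in a playbook does not swallow pages.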
Step 5: Wire It Into Health Checks
def handle_failure(service_name):
    playbook = PLAYBOOKS.get(service_name, {})
    severity = playbook.get("severity", "warning")
    incident_id = open_incident(service_name, severity)
    notify_incident(service_name, incident_id, f"{service_name} is DOWN. Incident opened.")
    if attempt_auto_restart(service_name, incident_id):
        notify_incident(service_name, incident_id, f"Auto-restart of {service_name} succeeded.")
    else:
        notify_incident(service_name, incident_id, f"Auto-restart unavailable or failed. Manual intervention needed for {service_name}.")
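The playbooks define escalation_after_minutes, but nothing enforces it yet. A periodic sweep over open incidents can flag the ones that have outlived their window. A sketch, assuming the Step 2 schema (the function name is my own); run it from the same loop as your health checks:

```python
import sqlite3
from datetime import datetime, timezone

def check_escalations(playbooks, db_path="incidents.db"):
    """Return (id, service_name) for open incidents past their escalation window."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, service_name, detected_at FROM incidents WHERE status = 'open'"
    ).fetchall()
    conn.close()
    overdue = []
    now = datetime.now(timezone.utc)
    for incident_id, service_name, detected_at in rows:
        limit = playbooks.get(service_name, {}).get("escalation_after_minutes", 15)
        detected = datetime.fromisoformat(detected_at)
        if detected.tzinfo is None:  # tolerate naive timestamps, assume UTC
            detected = detected.replace(tzinfo=timezone.utc)
        age_minutes = (now - detected).total_seconds() / 60
        if age_minutes >= limit:
            overdue.append((incident_id, service_name))
    return overdue
```

Anything it returns should trigger the louder channels (sms, phone), regardless of what the original notification used.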
What to Build Next
Add post-incident reports that auto-generate a timeline from the incident database. Include detection time, restart attempts, and resolution. That gives you data to improve the playbooks over time.
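As a starting point, a report can be a plain-text render of one incident row. A minimal sketch against the Step 2 schema (the function name is my own):

```python
import sqlite3

def incident_report(incident_id, db_path="incidents.db"):
    """Render a plain-text summary of one incident, or None if it doesn't exist."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT service_name, severity, status, auto_restarts, "
        "detected_at, resolved_at, resolution FROM incidents WHERE id = ?",
        (incident_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return None
    service, severity, status, restarts, detected, resolved, resolution = row
    lines = [
        f"Incident #{incident_id}: {service} ({severity}, {status})",
        f"  Detected at:   {detected}",
        f"  Auto-restarts: {restarts}",
    ]
    if resolved:
        lines.append(f"  Resolved at:   {resolved} ({resolution})")
    return "\n".join(lines)
```

Post the result to Slack when an incident closes and you get the timeline for free.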
Related Reading
- Designing for Failure - building systems that recover gracefully
- How to Build Automated Alerts That Actually Help - reducing alert fatigue
- Why Monitoring Is Not Optional - the case for proactive monitoring
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment