How to Create Automated Health Check Systems

Run automated health checks on all endpoints and services.

Jay Banlasan

The AI Systems Guy

An automated health check system for endpoint monitoring tells you which services are up, which are degraded, and which are down. I run health checks across every API, database, and external service my systems depend on. When something breaks at 3am, I find out from my alert, not my client.

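What You Need Before Starting

Python 3 with the requests library installed (sqlite3 ships with the standard library), a Slack incoming webhook URL for alerts, and a host with cron to run the script on a schedule.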

Step 1: Define Your Health Check Registry

HEALTH_CHECKS = [
    {
        "name": "Main API",
        "url": "https://api.yoursite.com/health",
        "method": "GET",
        "expected_status": 200,
        "timeout_seconds": 10
    },
    {
        "name": "Database",
        "type": "db",
        "connection_string": "sqlite:///production.db"
    },
    {
        "name": "Claude API",
        "url": "https://api.anthropic.com/v1/messages",
        "method": "HEAD",
        "expected_status": 401,
        "timeout_seconds": 5
    },
    {
        "name": "Webhook Endpoint",
        "url": "https://yoursite.com/webhook/intake",
        "method": "GET",
        "expected_status": 200,
        "timeout_seconds": 8
    }
]

Step 2: Build the Health Check Runner

import requests
import sqlite3
from datetime import datetime

def check_http(check):
    """Probe one HTTP endpoint and compare its status code to the expected one."""
    try:
        response = requests.request(
            method=check.get("method", "GET"),
            url=check["url"],
            timeout=check.get("timeout_seconds", 10)
        )
        is_healthy = response.status_code == check.get("expected_status", 200)
        return {
            "name": check["name"],
            "healthy": is_healthy,
            "status_code": response.status_code,
            "response_ms": round(response.elapsed.total_seconds() * 1000, 2)
        }
    except requests.exceptions.Timeout:
        return {"name": check["name"], "healthy": False, "error": "timeout"}
    except requests.exceptions.ConnectionError:
        return {"name": check["name"], "healthy": False, "error": "connection_failed"}
    except requests.exceptions.RequestException as e:
        # Any other requests failure (invalid URL, too many redirects) must not crash the run.
        return {"name": check["name"], "healthy": False, "error": str(e)}

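def check_database(check):
    """Open the SQLite file read-only and run a trivial query."""
    path = check["connection_string"].replace("sqlite:///", "")
    try:
        # mode=ro fails on a missing file instead of silently creating an empty database.
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        try:
            conn.execute("SELECT 1")
        finally:
            conn.close()
        return {"name": check["name"], "healthy": True}
    except Exception as e:
        return {"name": check["name"], "healthy": False, "error": str(e)}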

def run_all_checks():
    results = []
    for check in HEALTH_CHECKS:
        if check.get("type") == "db":
            results.append(check_database(check))
        else:
            results.append(check_http(check))
    return results
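
The runner reports each service as healthy or not. To get the up, degraded, and down split described in the introduction, one option is to treat a healthy but slow response as degraded. This is a minimal sketch: the classify_status helper and the 2000 ms threshold are assumptions for illustration, not part of the original system.

```python
# Hypothetical threshold: a healthy response slower than this counts as degraded.
DEGRADED_THRESHOLD_MS = 2000

def classify_status(result, degraded_ms=DEGRADED_THRESHOLD_MS):
    """Map a check result dict to 'up', 'degraded', or 'down'."""
    if not result.get("healthy"):
        return "down"
    response_ms = result.get("response_ms")
    # Database checks carry no response_ms, so a healthy one is simply 'up'.
    if response_ms is not None and response_ms > degraded_ms:
        return "degraded"
    return "up"
```

Tune the threshold per service if some endpoints are expected to be slower than others.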

Step 3: Log Results to a Database

def init_health_db(db_path="health_checks.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS check_results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            service_name TEXT,
            healthy INTEGER,
            response_ms REAL,
            error TEXT,
            checked_at TEXT
        )
    """)
    conn.commit()
    conn.close()

def log_results(results, db_path="health_checks.db"):
    conn = sqlite3.connect(db_path)
    for r in results:
        conn.execute(
            "INSERT INTO check_results (service_name, healthy, response_ms, error, checked_at) VALUES (?,?,?,?,?)",
            (r["name"], 1 if r["healthy"] else 0, r.get("response_ms"), r.get("error"), datetime.utcnow().isoformat())
        )
    conn.commit()
    conn.close()
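
Once results are logged, the same table can be read back for a quick uptime figure per service. The sketch below assumes the check_results schema from this step; uptime_summary is a hypothetical helper name.

```python
import sqlite3

def uptime_summary(db_path="health_checks.db"):
    """Return {service_name: percent of logged checks that were healthy}."""
    conn = sqlite3.connect(db_path)
    try:
        # healthy is stored as 0 or 1, so its average is the healthy fraction.
        rows = conn.execute(
            "SELECT service_name, AVG(healthy) * 100.0 "
            "FROM check_results GROUP BY service_name"
        ).fetchall()
    finally:
        conn.close()
    return {name: round(pct, 2) for name, pct in rows}
```

The summary covers every row in the table; add a WHERE clause on checked_at to limit it to a recent window.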

Step 4: Add Alerting for Failures

def alert_on_failures(results):
    failures = [r for r in results if not r["healthy"]]
    if not failures:
        return
    
    message = "Health Check Failures:\n"
    for f in failures:
        error_detail = f.get("error", f"status {f.get('status_code', 'unknown')}")
        message += f"  {f['name']}: {error_detail}\n"
    
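    # Replace YOUR_SLACK_WEBHOOK with your Slack incoming webhook URL.
    # The timeout keeps a slow Slack response from hanging the whole run.
    requests.post("YOUR_SLACK_WEBHOOK", json={"text": message}, timeout=10)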

Step 5: Schedule It

Run the script every two minutes with a cron entry:

*/2 * * * * python3 /path/to/health_check.py

The main script ties it all together:

if __name__ == "__main__":
    init_health_db()
    results = run_all_checks()
    log_results(results)
    alert_on_failures(results)

What to Build Next

Add consecutive failure tracking so you only get alerted after two or three failures in a row. Single blips happen. Sustained outages need action.
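
One way to sketch this is to read the most recent rows per service from the check_results table and flag a service only when all of them failed. The services_failing_in_a_row helper and the default threshold of 3 are assumptions for illustration.

```python
import sqlite3

def services_failing_in_a_row(db_path="health_checks.db", threshold=3):
    """Return names of services whose last `threshold` logged checks all failed."""
    conn = sqlite3.connect(db_path)
    failing = []
    try:
        names = [row[0] for row in conn.execute(
            "SELECT DISTINCT service_name FROM check_results ORDER BY service_name"
        )]
        for name in names:
            recent = conn.execute(
                "SELECT healthy FROM check_results "
                "WHERE service_name = ? ORDER BY id DESC LIMIT ?",
                (name, threshold),
            ).fetchall()
            # Require a full window so a newly added service is not flagged early.
            if len(recent) == threshold and all(h == 0 for (h,) in recent):
                failing.append(name)
    finally:
        conn.close()
    return failing
```

In the main script, call this after log_results and pass alert_on_failures only the results whose name appears in the returned list.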
