Service Health Checker
Monitor service health with automated checks and alerting. The checker designs comprehensive health check systems covering application health, dependency status, and automated recovery.
Usage
Describe your services and dependencies. The checker designs health check endpoints, monitoring configurations, and alerting rules that catch real problems while avoiding false alarms.
Parameters
- Service type: Web application, API, Database, Message queue, or Microservices
- Dependencies: Database, Cache, External APIs, File storage, or Multiple
- Monitoring: Internal health endpoints, External uptime monitoring, or Both
- Recovery: Manual, Semi-automated (alert + runbook), or Fully automated
Examples
- Health Check Endpoint: Design a /health endpoint for a Node.js API that checks database connectivity, Redis connection, disk space, and memory usage, with degraded vs. unhealthy status levels (a sketch follows this list).
- Uptime Monitoring: Set up external monitoring for a web application: HTTP checks from multiple regions, SSL certificate expiry monitoring, and response time threshold alerting.
- Dependency Health Matrix: Monitor a microservice's dependencies: a circuit breaker pattern for external APIs, connection pool monitoring for databases, and queue depth for message brokers (a breaker sketch follows this list).
- Automated Recovery: Configure PM2/systemd automatic restarts on health check failure, with restart limits to prevent restart loops and escalation to a human when the limits are hit (a watchdog sketch follows this list).
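A minimal sketch of the health endpoint, assuming Express, node-postgres, and ioredis; disk and memory checks are omitted for brevity, and the `pool` and `redis` names are placeholders for your own client instances:

```typescript
import express from "express";
import { Pool } from "pg";
import Redis from "ioredis";

const app = express();
const pool = new Pool();   // connection details from PG* env vars (assumed)
const redis = new Redis(); // defaults to localhost:6379 (assumed)

type Status = "healthy" | "degraded" | "unhealthy";

async function checkPostgres(): Promise<Status> {
  try {
    await pool.query("SELECT 1"); // cheap round trip proves connectivity
    return "healthy";
  } catch {
    return "unhealthy"; // the API cannot serve requests without its database
  }
}

async function checkRedis(): Promise<Status> {
  try {
    await redis.ping();
    return "healthy";
  } catch {
    return "degraded"; // cache loss slows the API but does not stop it
  }
}

app.get("/health", async (_req, res) => {
  const [database, cache] = await Promise.all([checkPostgres(), checkRedis()]);
  const components = { database, cache };
  // Overall status is the worst status reported by any component.
  const status: Status =
    database === "unhealthy" ? "unhealthy"
    : cache === "degraded" ? "degraded"
    : "healthy";
  // 503 lets load balancers pull the instance; degraded still returns 200.
  res.status(status === "unhealthy" ? 503 : 200).json({
    status,
    components,
    version: process.env.APP_VERSION ?? "unknown",
    checkedAt: new Date().toISOString(),
  });
});

app.listen(3000);
```

Returning 503 only when unhealthy lets a load balancer act on the same endpoint, while the degraded state keeps the instance in rotation but flags reduced functionality.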
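For the Dependency Health Matrix example, a hand-rolled breaker can illustrate the circuit breaker pattern. This is a sketch only: the thresholds are arbitrary, and a production Node service would more likely use a library such as opossum.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failThreshold = 5,   // consecutive failures before opening
    private cooldownMs = 30_000, // wait before letting a probe through
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // protect the dependency
      }
      this.state = "half-open"; // cooldown elapsed: allow one probe request
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the circuit and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failThreshold) {
        this.state = "open"; // trip (or re-trip after a failed probe)
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage, with a hypothetical external API:
// const breaker = new CircuitBreaker();
// const res = await breaker.call(() => fetch("https://api.example.com/v1/items"));
```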
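For the Automated Recovery example, systemd expresses restart limits natively (`Restart=on-failure` together with `StartLimitBurst` and `StartLimitIntervalSec`); the watchdog below sketches the same logic generically. `restartService` and `pageOncall` are hypothetical hooks into your process manager and alerting system, and the URL and thresholds are assumptions:

```typescript
declare function restartService(): Promise<void>;            // e.g. shells out to `pm2 restart app`
declare function pageOncall(message: string): Promise<void>; // your alerting hook

const HEALTH_URL = "http://localhost:3000/health"; // assumed endpoint
const MAX_RESTARTS = 3;           // restart limit per window
const WINDOW_MS = 10 * 60 * 1000; // ten-minute window
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function isHealthy(): Promise<boolean> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(2_000) });
    return res.ok; // a 503 from the unhealthy state reads as false here
  } catch {
    return false; // connection refused or timed out
  }
}

async function watch(): Promise<void> {
  let restarts: number[] = []; // timestamps of recent restarts
  let backoffMs = 5_000;
  for (;;) {
    if (await isHealthy()) {
      backoffMs = 5_000; // a healthy check resets the backoff timer
      await sleep(30_000);
      continue;
    }
    restarts = restarts.filter((t) => Date.now() - t < WINDOW_MS);
    if (restarts.length >= MAX_RESTARTS) {
      // Restart limit hit: stop restarting and hand off to a human.
      await pageOncall("health checks still failing after repeated restarts");
      await sleep(WINDOW_MS);
      continue;
    }
    restarts.push(Date.now());
    await restartService();
    await sleep(backoffMs); // give the service time to come up
    backoffMs *= 2;         // back off further on each consecutive restart
  }
}

watch();
```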
Guidelines
- Health endpoints return structured JSON with status, component health, and version information
- Three-level health status: Healthy (all good), Degraded (partial functionality), Unhealthy (down)
- Shallow health checks (is the process running?) are separate from deep checks (can it serve requests?)
- Dependency checks use timeouts so the health endpoint cannot hang on an unresponsive service (a timeout guard is sketched after this list)
- Load balancer health checks should be fast and lightweight (shallow check)
- Monitoring checks from multiple geographic locations detect regional issues
- Alert escalation progresses: log → Slack → email → page, based on duration and severity
- Flapping detection prevents alert storms from intermittent issues (a flap filter is sketched after this list)
- Health check history is retained for post-incident analysis and SLA reporting
- Automated recovery includes restart limits, backoff timers, and manual override capabilities
- Health checks are themselves monitored — a broken health check masks real outages
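A sketch of the timeout guard from the guidelines, using a plain Promise.race so no library is assumed; `checkPostgres` and the `Status` type refer to the endpoint sketch above:

```typescript
// Bound any dependency check so a hung service cannot stall /health.
async function withTimeout<T>(check: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const expiry = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms); // report fallback on expiry
  });
  try {
    return await Promise.race([check, expiry]);
  } finally {
    clearTimeout(timer); // Node's clearTimeout tolerates undefined
  }
}

// Usage: treat a database ping slower than 500 ms as unhealthy.
const database = await withTimeout(checkPostgres(), 500, "unhealthy");
```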
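And a minimal flap filter of the kind the flapping-detection guideline describes: alert only after N consecutive failures, resolve only after M consecutive successes, so a check that flips between pass and fail never pages anyone. The thresholds are illustrative:

```typescript
class FlapFilter {
  private fails = 0;
  private passes = 0;
  private alerting = false;

  constructor(
    private failThreshold = 3, // consecutive failures before alerting
    private passThreshold = 2, // consecutive successes before resolving
  ) {}

  // Feed each check result in; returns the action to take, if any.
  observe(ok: boolean): "alert" | "resolve" | "none" {
    if (ok) {
      this.passes += 1;
      this.fails = 0;
      if (this.alerting && this.passes >= this.passThreshold) {
        this.alerting = false;
        return "resolve"; // sustained recovery: close the alert
      }
    } else {
      this.fails += 1;
      this.passes = 0;
      if (!this.alerting && this.fails >= this.failThreshold) {
        this.alerting = true;
        return "alert"; // sustained failure: page once, not per check
      }
    }
    return "none"; // intermittent blips never reach the alerting path
  }
}
```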