Service Health Checker
Monitor service health with automated checks and alerting. The checker designs comprehensive health check systems covering application health, dependency status, and automated recovery.
Usage
Describe your services and dependencies. The checker designs health check endpoints, monitoring configurations, and alerting rules that catch real problems while avoiding false alarms.
Parameters
- Service type: Web application, API, Database, Message queue, or Microservices
- Dependencies: Database, Cache, External APIs, File storage, or Multiple
- Monitoring: Internal health endpoints, External uptime monitoring, or Both
- Recovery: Manual, Semi-automated (alert + runbook), or Fully automated
Examples
- Health Check Endpoint: Design a /health endpoint for a Node.js API that checks database connectivity, Redis connection, disk space, and memory usage, with degraded vs. unhealthy status levels (a sketch follows this list).
- Uptime Monitoring: Set up external monitoring for a web application: HTTP checks from multiple regions, SSL certificate expiry monitoring, and response time threshold alerting.
- Dependency Health Matrix: Monitor a microservice's dependencies: a circuit breaker pattern for external APIs, connection pool monitoring for databases, and queue depth for message brokers (a breaker sketch follows this list).
- Automated Recovery: Configure PM2/systemd automatic restarts on health check failure, with restart limits to prevent restart loops and escalation to a human when the limits are hit (a watchdog sketch follows this list).
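A minimal sketch of the health endpoint, assuming Express, node-postgres, and ioredis; disk and memory checks are omitted for brevity, and the `pool` and `redis` names are placeholders for your own client instances:

```typescript
import express from "express";
import { Pool } from "pg";
import Redis from "ioredis";

const app = express();
const pool = new Pool();   // connection details from PG* env vars (assumed)
const redis = new Redis(); // defaults to localhost:6379 (assumed)

type Status = "healthy" | "degraded" | "unhealthy";

async function checkPostgres(): Promise<Status> {
  try {
    await pool.query("SELECT 1"); // cheap round trip proves connectivity
    return "healthy";
  } catch {
    return "unhealthy"; // the API cannot serve requests without its database
  }
}

async function checkRedis(): Promise<Status> {
  try {
    await redis.ping();
    return "healthy";
  } catch {
    return "degraded"; // cache loss slows the API but does not stop it
  }
}

app.get("/health", async (_req, res) => {
  const [database, cache] = await Promise.all([checkPostgres(), checkRedis()]);
  const components = { database, cache };
  // Overall status is the worst status reported by any component.
  const status: Status =
    database === "unhealthy" ? "unhealthy"
    : cache === "degraded" ? "degraded"
    : "healthy";
  // 503 lets load balancers pull the instance; degraded still returns 200.
  res.status(status === "unhealthy" ? 503 : 200).json({
    status,
    components,
    version: process.env.APP_VERSION ?? "unknown",
    checkedAt: new Date().toISOString(),
  });
});

app.listen(3000);
```

Returning 503 only when unhealthy lets a load balancer act on the same endpoint, while the degraded state keeps the instance in rotation but flags reduced functionality.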
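For the Dependency Health Matrix example, a hand-rolled breaker can illustrate the circuit breaker pattern. This is a sketch only: the thresholds are arbitrary, and a production Node service would more likely use a library such as opossum.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failThreshold = 5,   // consecutive failures before opening
    private cooldownMs = 30_000, // wait before letting a probe through
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // protect the dependency
      }
      this.state = "half-open"; // cooldown elapsed: allow one probe request
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the circuit and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failThreshold) {
        this.state = "open"; // trip (or re-trip after a failed probe)
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage, with a hypothetical external API:
// const breaker = new CircuitBreaker();
// const res = await breaker.call(() => fetch("https://api.example.com/v1/items"));
```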
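For the Automated Recovery example, systemd expresses restart limits natively (`Restart=on-failure` together with `StartLimitBurst` and `StartLimitIntervalSec`); the watchdog below sketches the same logic generically. `restartService` and `pageOncall` are hypothetical hooks into your process manager and alerting system, and the URL and thresholds are assumptions:

```typescript
declare function restartService(): Promise<void>;            // e.g. shells out to `pm2 restart app`
declare function pageOncall(message: string): Promise<void>; // your alerting hook

const HEALTH_URL = "http://localhost:3000/health"; // assumed endpoint
const MAX_RESTARTS = 3;           // restart limit per window
const WINDOW_MS = 10 * 60 * 1000; // ten-minute window
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function isHealthy(): Promise<boolean> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(2_000) });
    return res.ok; // a 503 from the unhealthy state reads as false here
  } catch {
    return false; // connection refused or timed out
  }
}

async function watch(): Promise<void> {
  let restarts: number[] = []; // timestamps of recent restarts
  let backoffMs = 5_000;
  for (;;) {
    if (await isHealthy()) {
      backoffMs = 5_000; // a healthy check resets the backoff timer
      await sleep(30_000);
      continue;
    }
    restarts = restarts.filter((t) => Date.now() - t < WINDOW_MS);
    if (restarts.length >= MAX_RESTARTS) {
      // Restart limit hit: stop restarting and hand off to a human.
      await pageOncall("health checks still failing after repeated restarts");
      await sleep(WINDOW_MS);
      continue;
    }
    restarts.push(Date.now());
    await restartService();
    await sleep(backoffMs); // give the service time to come up
    backoffMs *= 2;         // back off further on each consecutive restart
  }
}

watch();
```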
Guidelines
- Health endpoints return structured JSON with status, component health, and version information
- Three-level health status: Healthy (all good), Degraded (partial functionality), Unhealthy (down)
- Shallow health checks (is the process running?) are separate from deep checks (can it serve requests?)
- Dependency checks use timeouts so the health endpoint cannot hang on an unresponsive service (a timeout guard is sketched after this list)
- Load balancer health checks should be fast and lightweight (shallow check)
- Monitoring checks from multiple geographic locations detect regional issues
- Alert escalation progresses: log → Slack → email → page, based on duration and severity
- Flapping detection prevents alert storms from intermittent issues (a flap filter is sketched after this list)
- Health check history is retained for post-incident analysis and SLA reporting
- Automated recovery includes restart limits, backoff timers, and manual override capabilities
- Health checks are themselves monitored — a broken health check masks real outages
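A sketch of the timeout guard from the guidelines, using a plain Promise.race so no library is assumed; `checkPostgres` and the `Status` type refer to the endpoint sketch above:

```typescript
// Bound any dependency check so a hung service cannot stall /health.
async function withTimeout<T>(check: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const expiry = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms); // report fallback on expiry
  });
  try {
    return await Promise.race([check, expiry]);
  } finally {
    clearTimeout(timer); // Node's clearTimeout tolerates undefined
  }
}

// Usage: treat a database ping slower than 500 ms as unhealthy.
const database = await withTimeout(checkPostgres(), 500, "unhealthy");
```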
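And a minimal flap filter of the kind the flapping-detection guideline describes: alert only after N consecutive failures, resolve only after M consecutive successes, so a check that flips between pass and fail never pages anyone. The thresholds are illustrative:

```typescript
class FlapFilter {
  private fails = 0;
  private passes = 0;
  private alerting = false;

  constructor(
    private failThreshold = 3, // consecutive failures before alerting
    private passThreshold = 2, // consecutive successes before resolving
  ) {}

  // Feed each check result in; returns the action to take, if any.
  observe(ok: boolean): "alert" | "resolve" | "none" {
    if (ok) {
      this.passes += 1;
      this.fails = 0;
      if (this.alerting && this.passes >= this.passThreshold) {
        this.alerting = false;
        return "resolve"; // sustained recovery: close the alert
      }
    } else {
      this.fails += 1;
      this.passes = 0;
      if (!this.alerting && this.fails >= this.failThreshold) {
        this.alerting = true;
        return "alert"; // sustained failure: page once, not per check
      }
    }
    return "none"; // intermittent blips never reach the alerting path
  }
}
```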