System Monitoring Setup
Set up comprehensive system monitoring and alerting for servers and applications. Covers metric collection, dashboard design, alert configuration, and incident response integration.
Usage
Describe your infrastructure and what you need to monitor. The guide recommends tools, configures metric collection, designs dashboards, and sets up meaningful alerts that reduce noise.
Parameters
- Scale: Single server, Small cluster (2-10 servers), or Large infrastructure (more than 10 servers)
- Stack: Prometheus+Grafana, Datadog, CloudWatch, or Lightweight (scripts)
- Metrics: System resources, Application performance, or Both
- Budget: Free/open-source only, Budget-friendly, or Enterprise
Examples
- Single Server Monitoring: Set up lightweight monitoring for a VPS: node_exporter + Prometheus + Grafana with dashboards for CPU, memory, disk, network, and PM2 process metrics (the latter need a separate exporter, since node_exporter does not report PM2 processes on its own).
- Web Application Monitoring: Monitor a Next.js application: response times, error rates, throughput, database query performance, and user-facing availability via uptime checks.
- Docker Host Monitoring: Monitor a Docker host and all its containers: cAdvisor for per-container CPU, memory, and network metrics, plus volume usage and alerts on container restarts.
- Alert Fatigue Reduction: Redesign an alerting system that sends too many notifications: implement severity levels, alert grouping, escalation chains, and root-cause-based alerts.
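The alert grouping mentioned in the last example can be illustrated with a minimal Python sketch. It is not the behavior of any particular tool: the `Alert` class, the `group_alerts` function, and the rule and host names are all hypothetical, and grouping by rule name alone is an assumption made to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str      # alert rule name, e.g. "HighCPU" (hypothetical)
    host: str      # host the alert fired on
    severity: str  # "info", "warning", or "critical"


# Higher rank wins when a group contains mixed severities.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def group_alerts(alerts):
    """Collapse alerts that share a rule name into one notification per rule."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.name].append(alert)

    notifications = []
    for name, members in groups.items():
        worst = max(members, key=lambda a: SEVERITY_RANK[a.severity])
        notifications.append({
            "name": name,
            "severity": worst.severity,
            "hosts": sorted({a.host for a in members}),
            "count": len(members),
        })
    # Most severe first, so pages are listed before warnings.
    notifications.sort(key=lambda n: SEVERITY_RANK[n["severity"]], reverse=True)
    return notifications


if __name__ == "__main__":
    fired = [
        Alert("HighCPU", "web-1", "warning"),
        Alert("HighCPU", "web-2", "critical"),
        Alert("HighCPU", "web-1", "warning"),
        Alert("DiskFull", "db-1", "warning"),
    ]
    for notification in group_alerts(fired):
        print(notification)
```

Here four raw alerts become two notifications, and the grouped notification takes the worst severity among its members. A real setup would likely also group by labels such as cluster or service and apply a time window.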
Guidelines
- System resource monitoring follows the USE method (Utilization, Saturation, Errors)
- Application monitoring follows the RED method (Rate, Errors, Duration) for services
- Alert thresholds are set based on historical baselines, not arbitrary numbers
- Alerts use severity levels: Info (log only), Warning (investigate), Critical (page on-call)
- Dashboard design shows the most important metrics at a glance with drill-down capability
- Data retention balances storage cost with the need for historical trend analysis
- Synthetic monitoring (uptime checks) catches issues before users report them
- Log-based metrics supplement traditional monitoring for application-specific insights
- Runbooks are linked to alerts so responders know what actions to take
- Monitoring itself is monitored: a monitoring stack that has silently failed gives false confidence, which is worse than having no monitoring
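
As a rough illustration of the baseline and severity guidelines above, the following Python sketch derives thresholds from historical samples and maps a new reading to a severity level. The function names, the sample values, and the choice of mean plus a multiple of the standard deviation are assumptions made for the example; percentiles may suit skewed metrics better.

```python
import statistics


def baseline_thresholds(samples, warn_sigma=2.0, crit_sigma=3.0):
    """Derive warning and critical thresholds from historical samples.

    Assumption: thresholds sit a fixed number of sample standard
    deviations above the historical mean.
    """
    if len(samples) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return {
        "warning": mean + warn_sigma * stdev,
        "critical": mean + crit_sigma * stdev,
    }


def classify(value, thresholds):
    """Map a reading to the severity levels used in the guidelines."""
    if value >= thresholds["critical"]:
        return "critical"  # page on-call
    if value >= thresholds["warning"]:
        return "warning"   # investigate
    return "info"          # log only


if __name__ == "__main__":
    # Hypothetical CPU utilization history, in percent.
    history = [40, 42, 38, 41, 39]
    thresholds = baseline_thresholds(history)
    for reading in (41, 44, 50):
        print(reading, classify(reading, thresholds))
```

With this history the mean is 40 and the sample standard deviation is about 1.58, so the warning threshold lands near 43.2 and the critical threshold near 44.7.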