Monitoring Alerts and Incident Response: When Things Go Wrong
Alerts notify the right people at the right time when a system issue requires attention. Without good alerting, you discover problems when users call; with good alerting, you fix problems before users notice. But alert fatigue (too many alerts, too many false positives) is as damaging as having no alerts at all.
Alert Design Principles
- Alert on symptoms, not causes: Alert when users are affected (high error rate, slow response time) rather than on infrastructure metrics (high CPU) that may not impact users. CPU at 90% may be fine if requests are still served quickly; a 5% error rate means users are already seeing failures.
- Every alert requires action: If an alert fires and the correct response is "do nothing", the alert should not exist. Alert fatigue causes real alerts to be missed.
- Set sensible thresholds: Base thresholds on evidence. Ask what level of the metric actually indicates a real problem, and avoid arbitrary round numbers.
- Consider seasonality: Traffic patterns vary by time of day and day of week, so a single static threshold rarely fits every period. A value that signals a real problem during low-traffic hours may be meaningless during peak, and the reverse is also true.
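The principles above can be sketched as a small symptom-based check. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `k = 3` multiplier, the 1% floor, and the 100-request minimum are all hypothetical choices, and `history` is assumed to hold error rates previously observed in the same hour-of-week slot.

```python
from statistics import mean, stdev


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def seasonal_threshold(history: list[float], k: float = 3.0,
                       floor: float = 0.01) -> float:
    """Derive a threshold from past error rates for the same time slot.

    Assumption: `history` contains error rates seen in this hour-of-week
    during earlier weeks. The threshold is mean + k standard deviations,
    but never lower than `floor`, so a very quiet history cannot make
    the alert hypersensitive.
    """
    if len(history) < 2:
        return floor  # not enough evidence yet; fall back to the floor
    return max(floor, mean(history) + k * stdev(history))


def should_alert(errors: int, requests: int, history: list[float],
                 min_requests: int = 100) -> bool:
    """Alert on the user-facing symptom (error rate), not on CPU or memory."""
    if requests < min_requests:
        # Too little traffic for the rate to be meaningful
        # (for example, 1 failure out of 3 requests at night).
        return False
    return error_rate(errors, requests) > seasonal_threshold(history)
```

With a quiet history such as `[0.002, 0.003, 0.0025, 0.004]`, the 1% floor applies, so 50 errors in 1,000 requests triggers the alert while 5 errors in 1,000 requests does not.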
Alert Routing and Escalation
- PagerDuty / OpsGenie: On-call management platforms that route alerts to the right team member, escalate if not acknowledged, and manage on-call rotations
- Severity tiers: P1 (critical — service down, page immediately), P2 (high — degraded service, page within 15 minutes), P3 (medium — non-critical issue, business hours response), P4 (low — informational, no immediate action)
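The severity tiers and escalation behaviour described above could be modelled roughly as follows. This sketch does not use the PagerDuty or OpsGenie APIs; the `RoutingPolicy` fields, the destination names, the responder chain, and the 5-minute escalation step are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class RoutingPolicy:
    page: bool                           # wake someone up?
    page_within_minutes: Optional[int]   # None when no page is sent
    destination: str                     # hypothetical destination name


# Mirrors the four tiers in the text; destinations are made up.
POLICIES = {
    Severity.P1: RoutingPolicy(True, 0, "on-call-pager"),
    Severity.P2: RoutingPolicy(True, 15, "on-call-pager"),
    Severity.P3: RoutingPolicy(False, None, "business-hours-queue"),
    Severity.P4: RoutingPolicy(False, None, "informational-log"),
}


def escalation_target(chain: Sequence[str], minutes_unacknowledged: int,
                      step_minutes: int = 5) -> str:
    """Pick who should be notified for an unacknowledged page.

    Every `step_minutes` without acknowledgement moves one level up
    the chain; the last entry keeps being notified after that.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    level = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[level]
```

For a chain of `["primary", "secondary", "manager"]`, a page that has gone 7 minutes without acknowledgement would be routed to `"secondary"`, and anything at 10 minutes or more to `"manager"`.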
Incident Response Process
A defined incident response process moves through six stages: detect, triage, communicate, investigate, resolve, and review. Post-incident reviews (blameless retrospectives) identify root causes and preventive measures, which improves system reliability over time.
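One way to make the stages concrete is to record a timestamp for each and derive a time-to-resolve figure for the post-incident review. The `Incident` class below is a hypothetical sketch. It enforces the stages in strict order for simplicity, whereas in practice stages often overlap (communication, for instance, usually continues throughout).

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

STAGES = ["detect", "triage", "communicate", "investigate", "resolve", "review"]


class Incident:
    """Records when each stage of an incident was reached."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.timeline: Dict[str, datetime] = {}

    def advance(self, stage: str, at: datetime) -> None:
        """Record the next stage; reject skipped or repeated stages."""
        done = len(self.timeline)
        if done == len(STAGES):
            raise ValueError("incident has already been reviewed")
        expected = STAGES[done]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.timeline[stage] = at

    def time_to_resolve(self) -> Optional[timedelta]:
        """Elapsed time from detection to resolution, if resolved."""
        if "resolve" not in self.timeline:
            return None
        return self.timeline["resolve"] - self.timeline["detect"]
```

Tracking this duration across incidents gives the review step a concrete number to compare over time.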