Infrastructure Monitoring: Keeping Your Systems Healthy
Infrastructure monitoring provides continuous visibility into the health, performance, and availability of your systems. Without monitoring, you find out about problems only when users report them. With it, you can detect, diagnose, and often automatically remediate issues before users are affected.
What We Monitor
- Infrastructure metrics: CPU utilisation, memory usage, disk I/O, network traffic — across all servers and managed services
- Application metrics: Request rate, error rate, latency (the RED method: Rate, Errors, Duration)
- Business metrics: Order rates, user sign-ups, payment success rates — metrics that indicate the health of your business, not just your infrastructure
- Uptime / availability: External synthetic monitoring that checks your endpoints from outside your infrastructure, confirming what users actually experience
- Database metrics: Query performance, connection counts, replication lag, storage utilisation
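To make the RED method from the list above concrete, here is a minimal sketch of how the three numbers could be derived from raw request records. The `Request` record, the `red_summary` function, and the choice to count only 5xx responses as errors are illustrative assumptions, not a description of how any particular tool computes them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    timestamp: float    # seconds since epoch
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests: Iterable[Request], window_seconds: float) -> Dict[str, float]:
    """Summarise Rate, Errors, and Duration for requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    reqs = list(requests)
    total = len(reqs)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}

    # Errors: assumption for this sketch is that only 5xx responses count.
    errors = sum(1 for r in reqs if r.status >= 500)

    # Duration: nearest-rank 95th percentile, using integer ceiling
    # arithmetic to avoid floating-point rounding surprises.
    durations = sorted(r.duration_ms for r in reqs)
    rank = max(1, (95 * total + 99) // 100)

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }
```

A percentile is used for Duration rather than an average because a small number of very slow requests can hide behind a healthy-looking mean.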
Monitoring Levels
- Infrastructure level: VM/container CPU, memory, disk — are the machines healthy?
- Application level: Are API endpoints responding within acceptable latency? What is the error rate?
- User experience level: Real User Monitoring (RUM) — how fast do pages load for real users on real devices and networks?
- Synthetic monitoring: Automated browser sessions that simulate user journeys and alert if they fail
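As a rough illustration of the synthetic checks described above, the sketch below probes a single endpoint and flags it as unhealthy if the request fails, returns an error status, or exceeds a latency budget. It is deliberately simpler than a full browser session: it issues one HTTP GET using Python's standard library. The names `CheckResult`, `run_check`, and `http_get_status`, and the default thresholds, are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_ms: float
    reason: str


def http_get_status(url: str, timeout_s: float) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; the status code is still useful.
        return exc.code


def run_check(
    url: str,
    max_latency_ms: float = 1000.0,   # assumed latency budget
    timeout_s: float = 5.0,           # assumed request timeout
    fetch: Callable[[str, float], int] = http_get_status,
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Probe one endpoint and classify the outcome."""
    start = clock()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:  # DNS failure, refused connection, timeout, ...
        latency_ms = (clock() - start) * 1000.0
        return CheckResult(url, False, None, latency_ms, f"request failed: {exc}")

    latency_ms = (clock() - start) * 1000.0
    if status >= 400:
        return CheckResult(url, False, status, latency_ms, f"unexpected status {status}")
    if latency_ms > max_latency_ms:
        return CheckResult(url, False, status, latency_ms, "response too slow")
    return CheckResult(url, True, status, latency_ms, "ok")
```

The `fetch` and `clock` parameters exist so the check can be exercised without network access. In practice a scheduler would call `run_check` at a fixed interval from a location outside your infrastructure and raise an alert after several consecutive failures.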
Tools We Use
- Datadog: Full-stack monitoring — infrastructure, APM, log management, synthetics. Our default for complex systems.
- Grafana + Prometheus: Open-source monitoring stack — highly customisable, no licensing cost
- AWS CloudWatch: Native AWS monitoring, built into AWS services with basic metrics available by default