Reliability vs Availability
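Availability is whether the system responds when asked; reliability is whether the response is correct. A system can be available but unreliable (up but returning wrong data) or reliable but unavailable (down for maintenance). A production system needs both properties.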
The Nines of Availability
| Availability | Downtime/year | Downtime/month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.3 minutes | 26 seconds |
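
The figures in a table like this follow from one multiplication: the fraction of time the system may be down, times the length of the period. A minimal sketch, assuming a 365-day year and a month of one twelfth of a year; the function and constant names are illustrative:

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime(availability_pct: float,
                     period_hours: float = HOURS_PER_YEAR) -> timedelta:
    """Return the downtime budget for an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=period_hours * unavailable_fraction)
```

For example, `allowed_downtime(99.99)` gives a yearly budget of roughly 52.6 minutes, and passing `HOURS_PER_YEAR / 12` as `period_hours` gives the monthly budget.
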
Eliminating Single Points of Failure
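
One way to see why single points of failure hurt is to compare components chained in series with redundant replicas of the same component. The sketch below assumes failures are independent, which real systems only approximate (shared power, network, or deployment mistakes correlate failures); the function names are illustrative.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Every component must be up, so the availabilities multiply."""
    return math.prod(availabilities)


def redundant_availability(availabilities: Iterable[float]) -> float:
    """At least one replica must be up: 1 minus the chance all are down."""
    return 1 - math.prod(1 - a for a in availabilities)
```

Under these assumptions, two 99% components in series yield about 98.01%, while two 99% replicas of one component yield about 99.99%. Chaining lowers availability; redundancy raises it.
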
Active-Passive vs Active-Active Failover
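
In an active-passive setup one node serves all traffic while standbys wait to take over; in an active-active setup every healthy node serves traffic at once. The following is a simplified sketch of that routing difference only. The `Node` class and both router classes are hypothetical, and real failover also involves state replication and failure detection, which are omitted here.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterable


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActivePassiveRouter:
    """Send all traffic to the first healthy node; the rest are standbys."""

    def __init__(self, nodes: Iterable[Node]):
        self.nodes = list(nodes)  # ordered by priority, primary first

    def route(self) -> Node:
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")


class ActiveActiveRouter:
    """Spread traffic round-robin across every healthy node."""

    def __init__(self, nodes: Iterable[Node]):
        self.nodes = list(nodes)
        self._counter = count()

    def route(self) -> Node:
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy node available")
        return healthy[next(self._counter) % len(healthy)]
```

When the primary is marked unhealthy, the active-passive router starts returning the standby. The active-active router keeps rotating over whichever nodes remain healthy, so losing a node reduces capacity rather than triggering a switchover.
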
Circuit Breaker Pattern
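
A circuit breaker wraps calls to a dependency and tracks failures. While closed, calls pass through; after enough consecutive failures it opens and rejects calls immediately; after a timeout it becomes half-open and lets a trial call through to test recovery. The class below is a minimal single-threaded sketch of that state machine. The thresholds, names, and the injectable `clock` parameter are assumptions for illustration, not a reference implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call in half-open state re-opens immediately;
        # in closed state the circuit opens once the threshold is reached.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Failing fast while the circuit is open is what limits cascading failures: callers stop queuing requests behind a dependency that is already struggling.
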
Health Check Flow
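
A typical health check flow probes a target on an interval and changes its status only after several consecutive results disagree with the current status, so a single dropped probe does not trigger failover. The sketch below models that logic; `HealthChecker`, its thresholds, and the `probe` callable are hypothetical, and scheduling the checks on a timer is left out.

```python
from typing import Callable


class HealthChecker:
    def __init__(self, probe: Callable[[], bool],
                 unhealthy_after: int = 3, healthy_after: int = 2):
        self.probe = probe
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive results that contradict current status

    def check_once(self) -> bool:
        """Run one probe and return the (possibly updated) health status."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_after if ok else self.unhealthy_after
            if self._streak >= needed:
                self.healthy = ok
                self._streak = 0
        return self.healthy
```

Lower thresholds and shorter intervals detect failures faster, at the cost of more false alarms from transient errors.
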
MTTR and MTBF
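
MTBF (mean time between failures) measures how long the system runs between failures, and MTTR (mean time to repair) measures how long it takes to restore service. Taking MTBF as the mean uptime between failures, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). A small sketch of that formula, with illustrative names and units:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With an MTBF of 999 hours, an MTTR of 1 hour gives 99.9%, and cutting MTTR to 30 minutes gives about 99.95%. Faster recovery raises availability even when failures are just as frequent.
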
Key Takeaway
High availability requires all of these together:
- Redundancy at every layer — no SPOFs
- Automatic failover when components fail
- Health checks to detect failures fast
- Circuit breakers to prevent cascade failures
- Graceful degradation — serve partial functionality under failure
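
As an illustration of the last point, graceful degradation can be as simple as falling back to a cheaper answer when the primary path fails. The helper below is a hypothetical sketch; `with_fallback` and the example functions are not from any particular library.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any exception yields `fallback`'s result instead."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapper


def personalized_recommendations(user_id):
    # Stand-in for a call to a dependency that is currently failing.
    raise TimeoutError("recommendation service unavailable")


def popular_items(user_id):
    # Static, non-personalized answer that needs no remote call.
    return ["popular-1", "popular-2"]


recommend = with_fallback(personalized_recommendations, popular_items)
```

Here `recommend("u1")` returns the static list instead of an error, so the page still renders with reduced functionality.
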