TechBlog

Availability and Reliability

What availability and reliability mean in distributed systems, how to measure them, and the patterns used to achieve high availability.

Reliability vs Availability

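Availability is the fraction of time a system is up and able to respond. Reliability is the probability that it does its job correctly over a given period. The two are independent: a system can be available but unreliable (up but returning wrong data) or reliable but unavailable (correct whenever it runs, but down for maintenance). You want both.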


The Nines of Availability

Availability    Downtime/year    Downtime/month
99%             3.65 days        7.2 hours
99.9%           8.7 hours        43.8 minutes
99.99%          52.6 minutes     4.4 minutes
99.999%         5.3 minutes      26 seconds
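
The downtime figures follow directly from the availability percentage: allowed downtime is the period multiplied by (1 - availability). A minimal Python sketch of that arithmetic, assuming a 365-day year; the function name `downtime_budget` is illustrative and not taken from any library:

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


YEAR = timedelta(days=365)  # assumption: non-leap year

for pct in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget(pct, YEAR)
    print(f"{pct}% -> {budget.total_seconds() / 60:.1f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% usually requires automation rather than faster manual response.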

Eliminating Single Points of Failure
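
One way to see why redundancy matters is to compare components chained in series with replicas placed in parallel. The sketch below is a simplified model that assumes failures are independent, which real systems only approximate (shared power, shared network, and correlated bugs break that assumption). The function names are illustrative.

```python
from math import prod


def series_availability(parts):
    """Every component must be up, so the availabilities multiply."""
    return prod(parts)


def parallel_availability(replicas):
    """At least one replica must be up, so the unavailabilities multiply."""
    return 1 - prod(1 - a for a in replicas)


# A single 99% server is a single point of failure for the whole request path.
single = series_availability([0.9999, 0.99, 0.999])

# Replacing it with two independent 99% replicas lifts that tier to 99.99%.
redundant_tier = parallel_availability([0.99, 0.99])
redundant = series_availability([0.9999, redundant_tier, 0.999])

print(f"with SPOF:    {single:.4%}")
print(f"with replica: {redundant:.4%}")
```

In a series chain the weakest component dominates, so adding a replica to the least available tier tends to pay off most.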


Active-Passive vs Active-Active Failover
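
In active-passive failover, one node serves all traffic and a standby takes over only when the primary fails. In active-active, every healthy node serves traffic all the time, so losing one reduces capacity rather than triggering a switch. The following is a rough sketch of the two routing policies; the `Node`, `ActivePassive`, and `ActiveActive` classes are hypothetical, and the `healthy` flag stands in for whatever failure detection a real system would use.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActivePassive:
    """Send every request to the first healthy node in priority order."""

    def __init__(self, nodes):
        self.nodes = nodes  # nodes[0] is the primary, the rest are standbys

    def pick(self) -> Node:
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")


class ActiveActive:
    """Spread requests round-robin across all healthy nodes."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._counter = count()

    def pick(self) -> Node:
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy node available")
        return healthy[next(self._counter) % len(healthy)]


primary, standby = Node("primary"), Node("standby")
router = ActivePassive([primary, standby])
print(router.pick().name)  # primary
primary.healthy = False
print(router.pick().name)  # standby
```

This active-passive sketch fails back to the primary as soon as it is marked healthy again. Some deployments prefer a manual failback to avoid flapping between nodes.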


Circuit Breaker Pattern
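
A circuit breaker wraps calls to a dependency and stops sending requests once that dependency keeps failing, so callers fail fast instead of piling up on a service that is already struggling. The usual description has three states: closed (calls pass through), open (calls are rejected immediately), and half-open (after a timeout, a trial call is allowed through). Below is a minimal single-threaded sketch of that state machine. The class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration, not the API of any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call in the half-open state re-opens the circuit at once.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def _record_success(self):
        self.failures = 0
        self.opened_at = None
```

A caller would typically catch `CircuitOpenError` and return a cached or default response rather than surfacing the error to the user.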


Health Check Flow
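
A typical health check flow probes each instance on a fixed interval, takes it out of rotation after several consecutive failed probes, and puts it back after several consecutive successes. Requiring a streak in both directions keeps one slow response from ejecting a healthy node. The sketch below models only the bookkeeping; `HealthTracker` and its thresholds are hypothetical, and `probe_ok` stands in for the result of a real probe such as an HTTP request to a health endpoint.

```python
class HealthTracker:
    def __init__(self, unhealthy_after=3, healthy_after=2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current status

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) status."""
        if probe_ok == self.healthy:
            # The probe agrees with the current status, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            limit = self.healthy_after if probe_ok else self.unhealthy_after
            if self._streak >= limit:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
history = [tracker.record(ok) for ok in (False, False, False, True, True)]
print(history)  # [True, True, False, False, True]
```

Detection time is roughly the probe interval multiplied by `unhealthy_after`, so lower thresholds detect failures faster at the cost of more false positives.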


MTTR and MTBF
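
MTBF (mean time between failures) measures how long a system runs before it breaks; MTTR (mean time to repair, or recover) measures how long it takes to restore service. With MTBF taken as the mean uptime between failures, steady-state availability is MTBF / (MTBF + MTTR). The sketch below derives all three from an incident log; the function names and the sample numbers are made up for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean uptime and mean repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def summarize(repair_hours, observed_hours):
    """Return (MTBF, MTTR, availability) for a window with the given outages."""
    failures = len(repair_hours)
    if failures == 0:
        raise ValueError("need at least one failure to estimate MTBF and MTTR")
    downtime = sum(repair_hours)
    mttr = downtime / failures
    mtbf = (observed_hours - downtime) / failures  # mean uptime between failures
    return mtbf, mttr, availability(mtbf, mttr)


# Hypothetical 30-day window (720 hours) with three outages.
mtbf, mttr, avail = summarize([1.0, 2.0, 0.6], observed_hours=720)
print(f"MTBF={mtbf:.1f}h MTTR={mttr:.1f}h availability={avail:.3%}")
```

The formula shows two levers: fail less often (raise MTBF) or recover faster (lower MTTR). Cutting MTTR is often the cheaper of the two, which is where automatic failover and fast health checks come in.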


Key Takeaway

High availability requires all of these together:

  1. Redundancy at every layer — no SPOFs
  2. Automatic failover when components fail
  3. Health checks to detect failures fast
  4. Circuit breakers to prevent cascade failures
  5. Graceful degradation — serve partial functionality under failure
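
Graceful degradation, the last item above, can be as simple as falling back to a cheaper answer when the preferred path fails. A minimal sketch, assuming a hypothetical recommendations feature where `personalized` depends on a service that is down and `popular` returns a static list:

```python
def with_fallback(primary, fallback):
    """Return a function that tries `primary` and degrades to `fallback` on error."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of failing the whole request.
            return fallback(*args, **kwargs)
    return wrapped


def personalized(user_id):
    # Hypothetical call to a recommendation service that is currently down.
    raise ConnectionError("recommendation service unavailable")


def popular(user_id):
    # Static, precomputed content that needs no downstream service.
    return ["top-1", "top-2", "top-3"]


recommend = with_fallback(personalized, popular)
print(recommend(42))  # ['top-1', 'top-2', 'top-3']
```

Catching every `Exception` keeps the sketch short; production code would normally catch only the errors that indicate the dependency is unavailable and would log the fallback so the degradation is visible.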