The Two Core Performance Metrics
When measuring system performance, two numbers matter most: latency and throughput. They sound similar but measure completely different things — and optimizing for one can hurt the other.
Latency
Latency is the time it takes to complete a single operation — from request to response.
Common latency benchmarks
| Operation | Typical Latency |
|---|---|
| L1 cache read | ~1 ns |
| L2 cache read | ~4 ns |
| RAM read | ~100 ns |
| SSD read | ~100 µs |
| Network round trip (same DC) | ~500 µs |
| HDD read | ~10 ms |
| Network round trip (cross-region) | ~150 ms |
These numbers are worth memorizing. A database call in the same data center is ~1 ms; a cross-continent network call is ~150 ms, roughly 150 times slower.
Types of latency
Latency is usually reported as percentiles: p50 (the median), p95, and p99 (tail latency). Always look at p99, not just the average. If your average is 50 ms but your p99 is 5 seconds, 1 in 100 requests gets a terrible experience.
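A minimal sketch (nearest-rank percentiles; the sample numbers are hypothetical) shows why the mean hides the tail:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile; p is in (0, 100]."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical samples: most requests fast, a couple very slow.
latencies = [50.0] * 98 + [2_000.0, 5_000.0]

print(sum(latencies) / len(latencies))  # mean ~119 ms: looks fine
print(percentile(latencies, 50))        # p50 = 50 ms: median looks fine
print(percentile(latencies, 99))        # p99 = 2000 ms: the tail hurts
```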
Throughput
Throughput is the number of operations a system can handle per unit of time.
Measured as:
- RPS — requests per second (APIs)
- TPS — transactions per second (databases)
- Mbps/Gbps — megabits or gigabits per second (networks)
What limits throughput?
Whichever resource saturates first: CPU, memory, disk I/O, network bandwidth, or contention on shared resources such as locks, thread pools, and database connections. The first bottleneck sets the ceiling; adding capacity anywhere else changes nothing.
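For example, bandwidth alone puts a hard ceiling on request rate. A back-of-envelope sketch (link speed and response size are hypothetical):

```python
# How many 10 KB responses fit through a 1 Gbps link per second?
LINK_GBPS = 1.0      # hypothetical network link
RESPONSE_KB = 10.0   # hypothetical response size

bytes_per_s = LINK_GBPS * 1e9 / 8              # 1 Gbps = 125 MB/s
rps_ceiling = bytes_per_s / (RESPONSE_KB * 1_000)
print(f"{rps_ceiling:,.0f} RPS")               # 12,500 RPS from bandwidth alone
```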
The Relationship: Little's Law
Little's Law says that, in a stable system, Concurrency = Throughput × Latency. Rearranged for throughput:
Throughput = Concurrency / Latency
If your system handles 100 concurrent requests and each takes 100ms:
Throughput = 100 / 0.1s = 1000 RPS
To double throughput, you can either:
- Double concurrency (add more servers)
- Halve latency (make each request faster)
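A few lines of Python make both levers concrete, using the numbers from the example above:

```python
def throughput_rps(concurrency: float, latency_s: float) -> float:
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / latency_s

print(throughput_rps(100, 0.10))  # 1000.0 RPS -- the example above
print(throughput_rps(200, 0.10))  # 2000.0 RPS -- double the concurrency
print(throughput_rps(100, 0.05))  # 2000.0 RPS -- halve the latency
```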
The Latency vs Throughput Trade-off
Batching is the canonical trade-off: grouping many operations into one amortizes a fixed per-operation cost (a network round trip, an fsync, a transaction commit) across the whole group, so throughput rises, while each individual item now waits for its batch to fill, so latency rises. Kafka producers, database bulk inserts, and HTTP/2 multiplexing all use batching to maximize throughput at the cost of some latency.
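A toy batcher makes the mechanism visible. This is a sketch, not any real library's API: `flush_fn` stands in for whatever fixed-cost operation is being amortized.

```python
import time

class Batcher:
    """Toy batcher: flush when the batch is full or the oldest item is stale."""

    def __init__(self, flush_fn, max_batch=100, max_wait_s=0.01):
        self.flush_fn = flush_fn      # the expensive fixed-cost operation
        self.max_batch = max_batch    # throughput knob: items per flush
        self.max_wait_s = max_wait_s  # latency knob: max time an item waits
        self.buffer = []
        self.oldest = 0.0

    def add(self, item):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(item)
        full = len(self.buffer) >= self.max_batch
        stale = time.monotonic() - self.oldest >= self.max_wait_s
        if full or stale:
            self.flush_fn(self.buffer)  # one flush pays the overhead once
            self.buffer = []

batcher = Batcher(lambda batch: print(f"flushed {len(batch)} items"))
for i in range(250):
    batcher.add(i)  # prints "flushed 100 items" twice; 50 items stay buffered
```

Kafka's producer exposes the same two knobs as `batch.size` and `linger.ms`. (A production batcher would also flush from a background timer so a quiet stream can't strand buffered items.)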
How to Optimize Each
Reducing Latency
- Caching — serve from memory instead of disk/network (see the sketch after this list)
- CDN — serve static assets from edge nodes close to users
- Database indexes — avoid full table scans
- Connection pooling — reuse DB connections instead of creating new ones
- Async processing — don't make users wait for non-critical work
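To make the caching item concrete, here is a minimal read-through cache with a TTL; `load_from_db` is a hypothetical stand-in for the slow path:

```python
import time

_cache: dict[str, tuple[float, str]] = {}  # key -> (stored_at, value)
TTL_S = 60.0

def load_from_db(key: str) -> str:
    time.sleep(0.005)  # stand-in for a ~5 ms database read
    return f"value-for-{key}"

def get(key: str) -> str:
    hit = _cache.get(key)
    if hit is not None and time.monotonic() - hit[0] < TTL_S:
        return hit[1]              # memory hit: nanoseconds, not milliseconds
    value = load_from_db(key)      # slow path only on a miss or expiry
    _cache[key] = (time.monotonic(), value)
    return value

get("user:42")  # first call pays the database latency
get("user:42")  # calls within the next 60 s are served from memory
```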
Increasing Throughput
- Horizontal scaling — more servers = more parallel processing
- Async I/O — don't block threads waiting for I/O (see the sketch after this list)
- Batching — process multiple items in one operation
- Compression — send less data over the network
- Load balancing — distribute work evenly
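For the async I/O item, a sketch with `asyncio`: the 100 ms sleep stands in for a network call, and one thread overlaps 100 of them, so throughput scales with concurrency exactly as Little's Law predicts.

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    await asyncio.sleep(0.1)  # waiting on I/O, not burning CPU
    return i

async def main() -> None:
    start = time.monotonic()
    results = await asyncio.gather(*(fake_request(i) for i in range(100)))
    elapsed = time.monotonic() - start
    # ~0.1 s of wall time for 100 requests, not 10 s: per-request latency
    # is unchanged, but throughput is ~100x (100 / 0.1 s = 1000 RPS).
    print(f"{len(results)} requests in {elapsed:.2f} s")

asyncio.run(main())
```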
Key Takeaway
| | Latency | Throughput |
|---|---|---|
| Measures | Speed of one request | Volume of requests per second |
| Goal | As low as possible | As high as possible |
| Optimized by | Caching, indexes, async | Scaling, batching, parallelism |
| Hurt by | Network hops, blocking I/O | Resource contention, single-threading |
Always define your latency and throughput requirements before designing a system. They drive every architectural decision.