Reliability Playbooks
Capacity and Performance
Capacity and performance work keeps systems healthy as usage changes. It should be driven by user experience, saturation signals, and cost, not by panic during incidents.
Capacity inputs
Plan capacity from:
- Current traffic and growth trends.
- Seasonal or campaign-driven spikes.
- Batch windows and replay needs.
- Dependency limits and quotas.
- Regional or tenant distribution.
- Cost constraints.
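The inputs above can be combined into a rough projection. The following is a minimal sketch rather than a prescribed model: the field names, the compound-growth assumption, and the target utilization figure are all illustrative, and it ignores regional or tenant distribution entirely.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityInputs:
    """Hypothetical planning inputs; every field name here is an assumption."""
    peak_rps: float              # current observed peak requests per second
    monthly_growth: float        # e.g. 0.05 for 5% month-over-month growth
    months_ahead: int            # planning horizon in months
    spike_multiplier: float      # seasonal or campaign spike over normal peak
    batch_rps: float             # extra load from batch windows or replays
    per_instance_rps: float      # sustainable throughput of one instance
    target_utilization: float    # fraction of capacity to use at peak, e.g. 0.6
    dependency_limit_rps: float  # quota of the tightest downstream dependency
    instance_monthly_cost: float # cost of one instance per month


def projected_peak_rps(c: CapacityInputs) -> float:
    # Assumes compound growth, with the spike applied on top of grown traffic.
    grown = c.peak_rps * (1 + c.monthly_growth) ** c.months_ahead
    return grown * c.spike_multiplier + c.batch_rps


def instances_needed(c: CapacityInputs) -> int:
    # Size for the projected peak while staying at the target utilization.
    usable_rps = c.per_instance_rps * c.target_utilization
    return math.ceil(projected_peak_rps(c) / usable_rps)


def monthly_cost(c: CapacityInputs) -> float:
    return instances_needed(c) * c.instance_monthly_cost


def exceeds_dependency_limit(c: CapacityInputs) -> bool:
    # Adding instances does not help if a downstream quota is the ceiling.
    return projected_peak_rps(c) > c.dependency_limit_rps
```

Comparing `monthly_cost` against a budget and checking `exceeds_dependency_limit` before adding instances keeps the cost and quota inputs in the same review as the traffic inputs.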
Signals
Track the four golden signals where they apply:
- Latency.
- Traffic.
- Errors.
- Saturation.
Also track queue depth, consumer lag, database locks, cache hit rate, thread pools, connection pools, and external API limits when they are relevant to the workload.
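As an illustration, the four signals can be derived from a window of request samples plus one resource gauge. This is a sketch under stated assumptions: the `RequestSample` type, the nearest-rank percentile, and the use of connection pool usage as the saturation measure are illustrative choices, not a required implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSample:
    """Hypothetical per-request record collected over one time window."""
    latency_ms: float
    ok: bool


def percentile(values: List[float], pct: int) -> float:
    # Nearest-rank percentile: smallest value with at least pct% at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(samples: List[RequestSample], window_seconds: float,
                   connections_in_use: int, pool_size: int) -> dict:
    # Saturation here is connection pool usage; swap in whichever resource
    # is the tightest for the workload (threads, queue depth, CPU, ...).
    saturation = connections_in_use / pool_size
    if not samples:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}
    failures = sum(1 for s in samples if not s.ok)
    return {
        "latency_p95_ms": percentile([s.latency_ms for s in samples], 95),
        "traffic_rps": len(samples) / window_seconds,
        "error_rate": failures / len(samples),
        "saturation": saturation,
    }
```

In practice a metrics system usually computes these values. The sketch is mainly useful for agreeing on what each signal means for a given service.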
Performance review flow
Practices
- Load test critical paths before major launches.
- Define scaling limits and safe maximums for each service, such as maximum instance counts and sustainable request rates.
- Keep dashboards for saturation and headroom.
- Review capacity after incidents and traffic changes.
- Tie performance work to user-visible outcomes.
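A headroom check against a defined safe maximum might look like the sketch below. The function names and the 30% and 10% thresholds are assumptions chosen for illustration and should be tuned per service.

```python
def headroom(current: float, safe_max: float) -> float:
    """Fraction of the safe maximum still unused; negative when over the limit."""
    if safe_max <= 0:
        raise ValueError("safe_max must be positive")
    return 1 - current / safe_max


def headroom_status(current: float, safe_max: float,
                    warn_below: float = 0.3,
                    critical_below: float = 0.1) -> str:
    # Thresholds are illustrative defaults, not recommended values.
    remaining = headroom(current, safe_max)
    if remaining < critical_below:
        return "critical"
    if remaining < warn_below:
        return "warn"
    return "ok"
```

For example, `headroom_status(80, 100)` returns `"warn"` because only about 20% of the safe maximum remains. The same check can run against load-test peaks to confirm a launch fits within the defined limits.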
Watchouts
- Autoscaling does not fix slow dependencies or bad queries.
- Overprovisioning hides design issues and raises cost.
- Load tests that use unrealistic data volumes or distributions can produce misleading results.
- Performance changes need a rollback plan, like any other production change.
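The first watchout can be made concrete with Little's law (concurrency = throughput × latency). The sketch below assumes a simplified model in which each instance handles a fixed number of concurrent requests and a single downstream dependency enforces a rate limit. All names and numbers are hypothetical.

```python
def max_throughput_rps(instances: int, concurrency_per_instance: int,
                       latency_s: float, dependency_limit_rps: float) -> float:
    """Upper bound on throughput under a simplified Little's law model."""
    # Little's law rearranged: throughput = concurrency / latency.
    local_ceiling = instances * concurrency_per_instance / latency_s
    # The dependency's rate limit caps throughput regardless of instance count.
    return min(local_ceiling, dependency_limit_rps)


# Baseline: 4 instances, 10 concurrent requests each, 250 ms per request.
baseline = max_throughput_rps(4, 10, 0.25, dependency_limit_rps=200)

# Doubling instances only reaches the dependency limit, not double throughput.
scaled_out = max_throughput_rps(8, 10, 0.25, dependency_limit_rps=200)

# If the dependency slows to 500 ms, the same 4 instances serve half as much.
slow_dependency = max_throughput_rps(4, 10, 0.5, dependency_limit_rps=200)
```

Under these assumptions the baseline is 160 requests per second, scaling out stops at the 200 requests per second quota, and the slower dependency drops the original fleet to 80. Fixing the slow call or the query restores capacity that extra instances cannot.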