Capacity and Performance

Capacity and performance work keeps systems healthy as usage changes. It should be driven by user experience, saturation signals, and cost, not by panic during incidents.

Capacity inputs

Plan capacity from:

  • Current traffic and growth trends.
  • Seasonal or campaign-driven spikes.
  • Batch windows and replay needs.
  • Dependency limits and quotas.
  • Regional or tenant distribution.
  • Cost constraints.
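
As a rough illustration of how these inputs combine, the sketch below projects peak demand forward and estimates how many months remain before it exceeds usable capacity. The function names, the compound monthly growth model, and the 70% target utilization are assumptions made for the example, not values from this playbook.

```python
import math


def projected_peak(current_peak_rps: float, monthly_growth: float,
                   months: int, spike_multiplier: float = 1.0) -> float:
    """Peak requests/second after `months` of compound growth, during a spike."""
    return current_peak_rps * (1 + monthly_growth) ** months * spike_multiplier


def months_until_exhausted(current_peak_rps: float, monthly_growth: float,
                           capacity_rps: float, spike_multiplier: float = 1.0,
                           target_utilization: float = 0.7) -> int:
    """Whole months until spike demand reaches usable capacity (0 if already there).

    `capacity_rps` should be the smaller of the service's own limit and any
    dependency quota. `target_utilization` reserves headroom (assumed 70%).
    """
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    usable = capacity_rps * target_utilization
    demand = current_peak_rps * spike_multiplier
    if demand >= usable:
        return 0
    # Solve demand * (1 + g)^n >= usable for the smallest whole n.
    return math.ceil(math.log(usable / demand) / math.log(1 + monthly_growth))


if __name__ == "__main__":
    # Hypothetical inputs: 1000 rps peak, 10% monthly growth, 1.5x campaign spike.
    print(months_until_exhausted(1000, 0.10, 4000, spike_multiplier=1.5))
```

With these hypothetical inputs the runway is 7 months. Batch windows, regional skew, and cost limits would enter as further adjustments to `capacity_rps` or the spike multiplier.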

Signals

Track the four golden signals where they apply:

  • Latency.
  • Traffic.
  • Errors.
  • Saturation.

Also track queue depth, consumer lag, database locks, cache hit rate, thread pools, connection pools, and external API limits when they are relevant to the workload.
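
A minimal sketch of deriving the four signals from one window of request records follows. The `Request` record, the nearest-rank percentile, and the use of connection-pool occupancy as the saturation measure are assumptions for illustration. In practice these values usually come from a metrics system, not from hand-rolled code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarize one observation window as the four golden signals."""
    return {
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        # Saturation here is pool occupancy; other workloads may use
        # queue depth, consumer lag, or thread-pool usage instead.
        "saturation": pool_in_use / pool_size,
    }


if __name__ == "__main__":
    # Hypothetical window: 100 requests over 10 seconds, 2 of them failed.
    sample = [Request(latency_ms=float(i), ok=(i > 2)) for i in range(1, 101)]
    print(golden_signals(sample, window_s=10, pool_in_use=45, pool_size=50))
```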

Performance review flow

Practices

  • Load test critical paths before major launches.
  • Define scaling limits and safe maximums.
  • Keep dashboards for saturation and headroom.
  • Review capacity after incidents and traffic changes.
  • Tie performance work to user-visible outcomes.
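
One way to make "scaling limits and safe maximums" concrete is to clamp any computed replica count and report when the ceiling is hit. The sketch below assumes a simple proportional scaling rule and hypothetical parameter names. It is not the behaviour of any specific autoscaler.

```python
import math


def desired_replicas(current: int, utilization: float, target_utilization: float,
                     min_replicas: int, safe_max: int):
    """Return (replica_count, hit_safe_max).

    Scales the replica count in proportion to utilization / target, then
    clamps to [min_replicas, safe_max]. `hit_safe_max` is True when demand
    asked for more than the safe maximum, which should prompt a capacity
    review instead of a silently raised limit.
    """
    raw = math.ceil(current * utilization / target_utilization)
    clamped = max(min_replicas, min(raw, safe_max))
    return clamped, raw > safe_max


if __name__ == "__main__":
    print(desired_replicas(10, 0.75, 0.5, min_replicas=2, safe_max=20))
    print(desired_replicas(10, 1.0, 0.25, min_replicas=2, safe_max=20))
```

The first call scales to 15 replicas within limits. The second asks for 40, is held at 20, and sets the flag so the shortfall shows up on a dashboard or alert.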

Watchouts

  • Autoscaling does not fix slow dependencies or bad queries.
  • Overprovisioning hides design issues and raises cost.
  • Load tests run on data that does not resemble production in volume or distribution can give misleading results.
  • Performance changes need rollback like any other production change.
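
The first watchout can be illustrated with Little's law (concurrency = throughput × latency). In the sketch below, a dependency with a fixed concurrency limit caps end-to-end throughput no matter how many replicas are added. All names and numbers are hypothetical, and the model ignores queuing and retries.

```python
def max_throughput_rps(replicas: int, per_replica_rps: float,
                       dep_max_concurrency: int, dep_latency_s: float) -> float:
    """Upper bound on throughput when every request calls one dependency."""
    service_limit = replicas * per_replica_rps
    # Little's law rearranged: throughput = concurrency / latency.
    dependency_limit = dep_max_concurrency / dep_latency_s
    return min(service_limit, dependency_limit)


if __name__ == "__main__":
    # Dependency allows 100 in-flight calls at 0.5 s each: a ceiling of 200 rps.
    print(max_throughput_rps(4, 100, 100, 0.5))    # 200.0
    print(max_throughput_rps(8, 100, 100, 0.5))    # still 200.0 after doubling replicas
    # Making the dependency call 4x faster raises the ceiling to 800 rps.
    print(max_throughput_rps(8, 100, 100, 0.125))  # 800.0
```

Doubling replicas leaves throughput unchanged, while fixing the slow call removes the bottleneck.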
