Reliability Playbooks
SLO Implementation
Service-level objectives turn reliability into an engineering decision. They define what users should expect and how much unreliability the team can spend while shipping change.
SLO workflow
Select SLIs
Good service-level indicators reflect user experience.
Common SLIs include:
- Successful request rate.
- Request latency below threshold.
- Job completion within expected time.
- Freshness of data or reports.
- Availability of a critical workflow.
Avoid infrastructure-only SLIs unless they directly represent user impact. CPU usage is useful context, but it is rarely something users are promised.
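As a rough illustration of the first two SLIs in the list, the sketch below computes a success rate and a latency-below-threshold ratio from request records. The `Request` shape, the choice to count only 5xx responses as failures, and the 300 ms threshold are assumptions made for the example, not a prescribed definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One user-facing request (hypothetical record shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request, in milliseconds


def success_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(404, 90.0),
]
print(success_rate(sample))        # 0.75
print(latency_sli(sample, 300.0))  # 0.75
```

Both functions return a ratio of good events to total events, which keeps different SLIs comparable and makes them straightforward to hold against a percentage target.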
Set targets
SLO targets should balance what users need, what the system can realistically deliver, and what the team is willing to invest. A target that requires heroic operations is not a useful objective.
For each SLO, document:
- Measurement source.
- Query or calculation.
- Target and window.
- Included and excluded traffic.
- Known blind spots.
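One way to keep the documented fields together is a small record type that also derives the error budget from the target and window. This is a minimal sketch: the class name, field names, and every example value (including the placeholder query string) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SLODefinition:
    """The documentation checklist as a record (hypothetical field names)."""
    name: str
    measurement_source: str
    query: str
    target: float            # e.g. 0.999 for "99.9% of requests succeed"
    window_days: int
    included_traffic: str
    excluded_traffic: str
    known_blind_spots: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def allowed_bad_events(self, total_events: int) -> int:
        """Events that may fail in the window without missing the target."""
        return round(total_events * (1.0 - self.target))

    def allowed_downtime_minutes(self) -> float:
        """Full-outage minutes the window tolerates (time-based view)."""
        return self.window_days * 24 * 60 * (1.0 - self.target)


# All values below are illustrative placeholders.
checkout_slo = SLODefinition(
    name="checkout-availability",
    measurement_source="load balancer access logs",
    query="good_requests / total_requests (placeholder, not a real query)",
    target=0.999,
    window_days=30,
    included_traffic="external user requests",
    excluded_traffic="synthetic probes and load tests",
    known_blind_spots=("requests dropped before the load balancer",),
)
print(checkout_slo.allowed_bad_events(1_000_000))           # 1000
print(round(checkout_slo.allowed_downtime_minutes(), 2))    # 43.2
```

Writing the budget down in concrete units (failed requests or minutes) tends to make the target easier to debate than a bare percentage.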
Error budget policy
Define what happens when the budget burns too quickly. Typical responses include:
- Increase review for risky changes.
- Prioritize reliability work.
- Pause non-critical launches.
- Escalate ownership or dependency issues.
- Revisit unrealistic objectives.
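A policy like this is easier to apply when "burning too quickly" has a concrete meaning. The sketch below compares the share of budget consumed with the share of the window elapsed and maps the result to some of the responses listed above. The function names and the trigger conditions are assumptions for illustration; a real policy would use whatever thresholds the team has agreed on.

```python
from typing import List


def budget_consumed(target: float, good: int, total: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    allowed_bad = (1.0 - target) * total
    actual_bad = total - good
    return actual_bad / allowed_bad


def policy_actions(consumed: float, window_elapsed: float) -> List[str]:
    """Map budget state to responses (illustrative thresholds only).

    window_elapsed is the fraction of the SLO window that has passed.
    """
    if consumed >= 1.0:
        # Budget is gone: the strongest responses apply.
        return ["pause non-critical launches", "prioritize reliability work"]
    if window_elapsed > 0 and consumed / window_elapsed > 1.0:
        # Spending faster than the window allows, but budget remains.
        return ["increase review for risky changes"]
    return []


# 600 failures out of 1,000,000 requests against a 99.9% target,
# with 40% of the window elapsed.
used = budget_consumed(0.999, good=999_400, total=1_000_000)
print(round(used, 3))              # 0.6
print(policy_actions(used, 0.4))   # ['increase review for risky changes']
```

In this example 60% of the budget is gone after 40% of the window, so the service is on track to miss its objective even though the target has not yet been breached.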
Watchouts
- Too many SLOs create noise.
- Alerting only once the SLO target is breached is often too late; alerting on how fast the error budget is burning leaves time to respond.
- SLOs without product and engineering agreement become dashboard art.
- Error budgets need decision authority to matter.
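To illustrate the watchout about alerting, the sketch below pages on burn rate rather than on the target itself, and requires both a long and a short window to exceed the threshold so that an incident which has already recovered does not page. The function names are hypothetical. The default threshold of 14.4 is a commonly cited value for a one-hour window against a 30-day SLO (it corresponds to spending 2% of the budget in one hour) and is used here as an assumption, not a recommendation.

```python
def burn_rate(error_rate: float, target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    1.0 means the budget would last exactly the full SLO window.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return error_rate / (1.0 - target)


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening.
    """
    return (burn_rate(long_window_error_rate, target) >= threshold
            and burn_rate(short_window_error_rate, target) >= threshold)


# 99.9% target: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(0.02, 0.03, target=0.999))    # True

# Same hour, but the last 5 minutes are back to 0.1% errors.
print(should_page(0.02, 0.001, target=0.999))   # False
```

The second call returns False because the short window shows the problem has subsided, which is the kind of noise reduction that keeps alerts tied to decisions someone still needs to make.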