Reliability Playbooks
On-Call and Alerting
On-call exists to protect users and business operations. Alerting should wake humans only when human action is needed quickly.
Alert quality
A good alert has:
- Clear user or system impact.
- Defined owner and escalation path.
- Actionable runbook.
- Severity aligned to urgency.
- Links to dashboards, logs, and recent deploys.
- Noise low enough that responders still trust every page.
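The checklist above can be partly enforced before an alert ships. The following is a minimal sketch, assuming alerts are described by a small record. The `AlertDefinition` schema, its field names, and the two-level severity scheme are hypothetical, not taken from any particular alerting tool. Noise level cannot be checked statically, so it is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical schema: field names are illustrative, not from a real tool.
@dataclass
class AlertDefinition:
    name: str
    impact: str = ""        # user or system impact statement
    owner: str = ""         # owning team or rotation
    escalation_path: List[str] = field(default_factory=list)
    runbook_url: str = ""
    severity: str = ""      # expected to be one of ALLOWED_SEVERITIES
    links: Dict[str, str] = field(default_factory=dict)

REQUIRED_LINKS = ("dashboard", "logs", "deploys")
ALLOWED_SEVERITIES = {"page", "ticket"}  # assumed two-level scheme


def missing_fields(alert: AlertDefinition) -> List[str]:
    """Return the checklist items this alert definition does not satisfy."""
    missing = []
    if not alert.impact.strip():
        missing.append("impact")
    if not alert.owner.strip():
        missing.append("owner")
    if not alert.escalation_path:
        missing.append("escalation_path")
    if not alert.runbook_url.strip():
        missing.append("runbook_url")
    if alert.severity not in ALLOWED_SEVERITIES:
        missing.append("severity")
    for key in REQUIRED_LINKS:
        if not alert.links.get(key, "").strip():
            missing.append(f"links.{key}")
    return missing
```

A check like this could run in code review or CI so that incomplete alerts are caught before they page anyone.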
Alert types
Prefer alerts for:
- Fast SLO burn.
- Critical workflow failure.
- Data loss or corruption risk.
- Saturation that will soon cause impact.
- Security or access anomalies.
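"Fast SLO burn" is usually expressed as a burn rate: the observed error rate divided by the error budget (1 minus the SLO target). The sketch below is one way to compute it, with a two-window check so that a page fires only when the burn is both sustained and still happening. The function names and the default threshold of 14.4 are assumptions. That threshold corresponds to spending about 2% of a 30-day budget in one hour and should be tuned per service.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(
    long_window: Tuple[int, int],   # (errors, total) over e.g. the last hour
    short_window: Tuple[int, int],  # (errors, total) over e.g. the last 5 minutes
    slo_target: float,
    threshold: float = 14.4,        # assumed default; tune per service
) -> bool:
    """Page only if both windows burn budget faster than the threshold."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

For a 99.9% SLO, 20 errors in 1,000 requests is a burn rate of about 20, which exceeds the assumed threshold. If the short window has already recovered, `should_page` returns `False` and no one is woken for a problem that has passed.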
Avoid paging for:
- Non-actionable warnings.
- One-off transient blips.
- Metrics nobody knows how to interpret.
- Symptoms already covered by a better alert.
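One common way to keep one-off transient blips from paging is to require the failing condition to hold across several consecutive evaluations. The sketch below assumes a hypothetical list of boolean check results, oldest first. The function name and parameter are illustrative only.

```python
from typing import List


def sustained(failing: List[bool], required_consecutive: int) -> bool:
    """True if the most recent `required_consecutive` evaluations all failed."""
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    if len(failing) < required_consecutive:
        return False  # not enough history to call it sustained
    return all(failing[-required_consecutive:])
```

With `required_consecutive=3`, a single failed check between two healthy ones does not page, while three failures in a row do. The cost is a detection delay of that many evaluation intervals.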
Escalation model
On-call health
Track:
- Pages per shift.
- Off-hours pages.
- Alert acknowledgement time.
- Alerts with no action taken.
- Repeated alerts from the same root cause.
- Runbook gaps discovered during incidents.
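Most of the metrics above can be derived from a log of pages. The following is a rough sketch, assuming each page is recorded with the hypothetical fields shown. The `Page` record, the 09:00 to 18:00 weekday definition of business hours, and the use of naive local timestamps are all assumptions. Runbook gaps are not computed here because they are usually recorded by hand during incident review.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional

# Hypothetical page record; field names are illustrative.
@dataclass
class Page:
    alert_name: str
    fired_at: datetime
    acked_at: Optional[datetime]  # None if never acknowledged
    action_taken: bool
    root_cause: Optional[str]     # filled in after review, if known


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Weekends, or weekdays outside the assumed business hours."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def shift_summary(pages: List[Page]) -> dict:
    """Summarize on-call health metrics for one shift's pages."""
    ack_seconds = [
        (p.acked_at - p.fired_at).total_seconds()
        for p in pages
        if p.acked_at is not None
    ]
    causes = Counter(p.root_cause for p in pages if p.root_cause)
    return {
        "pages": len(pages),
        "off_hours_pages": sum(is_off_hours(p.fired_at) for p in pages),
        "median_ack_seconds": median(ack_seconds) if ack_seconds else None,
        "no_action_pages": sum(not p.action_taken for p in pages),
        "repeated_root_causes": {c: n for c, n in causes.items() if n > 1},
    }
```

Reviewing this summary per shift makes trends visible: a rising count of no-action pages or repeated root causes points at alerts to tune or fixes to prioritize.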
Watchouts
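- Alert fatigue is a reliability risk: responders who are paged too often start to delay or ignore real alerts.
- Rotations without the authority to act leave responders unable to mitigate what they are paged for.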
- Escalation paths need to be tested before incidents.
- Business-hours-only support must be explicit in SLOs and customer expectations.