Night Mode Labs Blue Book
Reliability Playbooks

On-Call and Alerting

On-call exists to protect users and business operations. Alerting should wake humans only when human action is needed quickly.

Alert quality

A good alert has:

  • Clear user or system impact.
  • Defined owner and escalation path.
  • Actionable runbook.
  • Severity aligned to urgency.
  • Links to dashboards, logs, and recent deploys.
  • Noise level low enough to preserve trust.
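One way to enforce most of this checklist is to validate alert definitions before they ship. The following is a minimal sketch, not an existing tool: the `AlertDefinition` fields, the severity names, and the required link keys are all hypothetical. Noise level cannot be checked statically, so it is left to review of paging history.

```python
from dataclasses import dataclass, field

# Hypothetical severity levels and link keys; adjust to local conventions.
SEVERITIES = {"page", "ticket", "info"}
REQUIRED_LINKS = ("dashboard", "logs", "deploys")


@dataclass
class AlertDefinition:
    name: str
    impact: str = ""                      # user or system impact statement
    owner: str = ""                       # owning team or rotation
    escalation_path: list[str] = field(default_factory=list)
    runbook_url: str = ""
    severity: str = ""
    links: dict[str, str] = field(default_factory=dict)


def missing_fields(alert: AlertDefinition) -> list[str]:
    """Return the checklist items this alert definition does not satisfy."""
    problems = []
    if not alert.impact.strip():
        problems.append("impact")
    if not alert.owner.strip():
        problems.append("owner")
    if not alert.escalation_path:
        problems.append("escalation_path")
    if not alert.runbook_url.strip():
        problems.append("runbook_url")
    if alert.severity not in SEVERITIES:
        problems.append("severity")
    for key in REQUIRED_LINKS:
        if not alert.links.get(key):
            problems.append(f"links.{key}")
    return problems
```

A check like this could run in code review or CI so that an alert with no owner or runbook never reaches a pager.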

Alert types

Prefer alerts for:

  • Fast SLO burn.
  • Critical workflow failure.
  • Data loss or corruption risk.
  • Saturation that will soon cause impact.
  • Security or access anomalies.
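For the first item, "fast SLO burn" is commonly expressed as a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a multi-window check in which both a long and a short window must exceed the threshold. The default of 14.4 is an assumption, not a value from this playbook: with a 30-day SLO period it corresponds to spending about 2% of the budget in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold; tune per service
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

For a 99.9% target, a sustained 2% error ratio is a burn rate of roughly 20 and would page. The same long-window value with a recovered short window would not.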

Avoid paging for:

  • Non-actionable warnings.
  • One-off transient blips.
  • Metrics nobody knows how to interpret.
  • Symptoms already covered by a better alert.
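One common way to keep one-off transient blips from paging is to require that a condition hold for several consecutive evaluations before it fires. The sketch below illustrates the idea. The class name and the choice of three evaluations are hypothetical, and most alerting systems offer an equivalent built-in setting.

```python
from collections import deque


class SustainedCondition:
    """Fires only after `required_consecutive` breaches in a row."""

    def __init__(self, required_consecutive: int):
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        # Keep only the most recent N observations.
        self._recent = deque(maxlen=required_consecutive)

    def observe(self, breached: bool) -> bool:
        """Record one evaluation; return True if the alert should fire."""
        self._recent.append(breached)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)


# A single recovered blip resets the run, so only a sustained breach fires.
condition = SustainedCondition(required_consecutive=3)
results = [condition.observe(b) for b in [True, False, True, True, True]]
```

The trade-off is detection delay: a longer run filters more noise but pages later, so the count should stay small for high-severity alerts.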

Escalation model

On-call health

Track:

  • Pages per shift.
  • Off-hours pages.
  • Alert acknowledgement time.
  • Alerts with no action taken.
  • Repeated alerts from the same root cause.
  • Runbook gaps discovered during incidents.
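Most of these measures can be derived from a simple page log. The sketch below assumes a hypothetical `Page` record and a 09:00 to 17:00 weekday definition of business hours. Neither comes from this playbook. Runbook gaps are not computed here because they usually come from incident reviews rather than page data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Page:
    fired_at: datetime
    acked_at: datetime
    action_taken: bool
    root_cause: str  # hypothetical label assigned during review


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """Weekends, or weekdays outside the assumed business hours."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def shift_summary(pages: list[Page]) -> dict:
    """Summarize one shift's pages into the tracked health measures."""
    ack_seconds = [(p.acked_at - p.fired_at).total_seconds() for p in pages]
    cause_counts = Counter(p.root_cause for p in pages)
    return {
        "pages": len(pages),
        "off_hours_pages": sum(is_off_hours(p.fired_at) for p in pages),
        "median_ack_seconds": median(ack_seconds) if ack_seconds else None,
        "no_action_pages": sum(not p.action_taken for p in pages),
        "repeated_root_causes": {c: n for c, n in cause_counts.items() if n > 1},
    }
```

Reviewing these numbers per shift makes trends visible, such as a rising share of pages with no action taken.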

Watchouts

  • Alert fatigue is a reliability risk.
  • Rotations without the authority to act create helpless responders.
  • Escalation paths need to be tested before incidents.
  • Business-hours-only support must be explicit in SLOs and customer expectations.
