Night Mode Labs Blue Book
Reliability Playbooks

Chaos and Game Days

Chaos exercises and game days help teams verify that both systems and people can handle failure. Start with controlled, announced exercises before automating random failure injection.

Preconditions

Do not run chaos experiments until the basics exist:

  • Clear owners and responders.
  • Dashboards and alerts.
  • Rollback and mitigation paths.
  • Runbooks for critical systems.
  • Safe test environment or scoped production blast radius.
  • Leadership agreement on risk.
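
The checklist above can be encoded as a gate that an exercise has to pass before it is scheduled. The following is a minimal sketch in Python, not an existing tool: the `Preconditions` record, its field names, and the `missing_preconditions` and `ready_for_chaos` helpers are all hypothetical names chosen to mirror the bullets.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Preconditions:
    """Hypothetical record: one flag per bullet in the checklist above."""
    owners_and_responders: bool = False
    dashboards_and_alerts: bool = False
    rollback_and_mitigation: bool = False
    runbooks_for_critical_systems: bool = False
    safe_env_or_scoped_blast_radius: bool = False
    leadership_risk_agreement: bool = False


def missing_preconditions(p: Preconditions) -> list[str]:
    """Return the names of every precondition that is not yet met."""
    return [f.name for f in fields(p) if not getattr(p, f.name)]


def ready_for_chaos(p: Preconditions) -> bool:
    """An experiment may proceed only when nothing is missing."""
    return not missing_preconditions(p)
```

Defaulting every flag to `False` means a team has to confirm each item explicitly; an unanswered item blocks the exercise rather than passing silently.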

Game day flow

Scenario ideas

  • Primary database unavailable.
  • Queue backlog grows faster than consumers drain it.
  • Third-party API returns errors or high latency.
  • Bad deployment requires rollback.
  • Regional dependency outage.
  • Expired certificate.
  • Privileged access path unavailable.
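
As an illustration of the "third-party API returns errors or high latency" scenario, a dependency call can be wrapped so that failures are injected on purpose in a test environment. This is a sketch under stated assumptions: `with_faults`, `InjectedFault`, and the `error_rate` and `extra_latency_s` parameters are hypothetical names, and a real exercise would likely use whatever fault-injection tooling the team already has.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an exercise."""


def with_faults(
    call: Callable[[], T],
    error_rate: float = 0.0,
    extra_latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap a zero-argument dependency call with optional delay and errors."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def wrapped() -> T:
        if extra_latency_s > 0:
            sleep(extra_latency_s)  # simulate a slow dependency
        if rng.random() < error_rate:
            # simulate the dependency returning an error
            raise InjectedFault("injected dependency failure")
        return call()

    return wrapped
```

Both defaults are zero, so the wrapper changes nothing until a fault is requested. Passing a seeded `random.Random` makes a run repeatable, which helps when the same scenario is replayed after a fix.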

Output

Each exercise should produce:

  • What was tested.
  • What worked.
  • What failed or surprised the team.
  • Gaps in observability, access, or runbooks.
  • Follow-up actions with owners and due dates.
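
The outputs listed above can be captured in a small structured record so that every exercise is written up the same way. This is a minimal sketch, assuming hypothetical `ExerciseReport` and `FollowUp` types; the `problems` check, including the rule that recorded findings need at least one follow-up, is an illustrative assumption rather than a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FollowUp:
    """A follow-up action; owner and due date are both required fields."""
    action: str
    owner: str
    due: date


@dataclass
class ExerciseReport:
    """Hypothetical report shape: one field per bullet in the list above."""
    tested: str
    worked: list[str] = field(default_factory=list)
    failed_or_surprised: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)
    follow_ups: list[FollowUp] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List reasons the report is incomplete; empty means usable."""
        issues = []
        if not self.tested.strip():
            issues.append("missing description of what was tested")
        for i, f in enumerate(self.follow_ups):
            if not f.owner.strip():
                issues.append(f"follow-up {i} has no owner")
        # Assumption: findings with no follow-up actions are incomplete.
        if (self.failed_or_surprised or self.gaps) and not self.follow_ups:
            issues.append("findings recorded but no follow-up actions")
        return issues
```

A report whose `problems()` list is empty has a stated scope, and every finding is backed by at least one owned, dated action.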

Watchouts

  • Chaos without follow-up is theater.
  • Do not surprise teams with production failure injection.
  • Keep blast radius small until confidence grows.
  • Include humans, communications, and access paths, not only servers.
