Labs Blue Book
Reliability Playbooks

Dependency Reliability

Most production systems depend on databases, queues, third-party APIs, identity providers, and shared platforms. Reliability depends on how those dependencies fail and recover.

Dependency inventory

For each critical dependency, record:

  • Owner and support path.
  • Expected availability and latency.
  • Timeout and retry behavior.
  • Quotas, rate limits, and concurrency limits.
  • Failure mode and customer impact.
  • Fallback, cache, queue, or degraded behavior.
  • Monitoring and alert ownership.

Failure controls

Common controls include:

  • Timeouts shorter than user-facing deadlines.
  • Retries with jitter and bounded attempts.
  • Circuit breakers for repeated failures.
  • Bulkheads for resource isolation.
  • Queues for async buffering.
  • Caches for safe stale reads.
  • Feature flags to disable optional dependency paths.
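As a rough sketch of how the first three controls can combine, the code below wraps a dependency call in bounded retries with full-jitter backoff under an overall deadline, alongside a minimal consecutive-failure circuit breaker. All names, thresholds, and delays are assumptions chosen for illustration; production code would usually rely on a maintained library and retry only errors known to be retryable.

```python
import random
import time


def call_with_retries(op, *, deadline_s, per_try_timeout_s, max_attempts=3,
                      base_delay_s=0.1, sleep=time.sleep, clock=time.monotonic):
    """Call op(timeout) with bounded attempts, jittered backoff, and a deadline."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # Never give one attempt more time than the caller has left.
            return op(min(per_try_timeout_s, remaining))
        except Exception as exc:  # sketch only: catch retryable errors in practice
            last_error = exc
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random time up to the exponential cap.
            cap = base_delay_s * (2 ** attempt)
            remaining = deadline_s - (clock() - start)
            sleep(max(0.0, min(random.uniform(0, cap), remaining)))
    raise TimeoutError("dependency call failed within deadline") from last_error


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before invoking `call_with_retries`, then report the outcome with `record_success()` or `record_failure()`. When the breaker is open, the caller skips the dependency and serves the fallback or degraded path instead.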

Watchouts

  • Default SDK retries can exceed user-facing timeouts.
  • Shared dependencies can create correlated failures.
  • Optional dependencies often become required accidentally.
  • Third-party SLAs do not replace internal failure design.
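The first watchout can be checked with simple arithmetic: add up the per-attempt timeouts and the backoff delays, then compare the total with the user-facing deadline. The helper below is a sketch with hypothetical names, and the numbers are made up for illustration rather than taken from any particular SDK's defaults.

```python
def worst_case_seconds(max_attempts, per_try_timeout_s, base_delay_s, multiplier=2.0):
    """Upper bound on elapsed time if every attempt times out."""
    # Backoff happens between attempts, so there are max_attempts - 1 sleeps.
    backoff = sum(base_delay_s * multiplier ** i for i in range(max_attempts - 1))
    return max_attempts * per_try_timeout_s + backoff


def fits_deadline(user_deadline_s, **retry_config):
    """True if the worst-case retry time stays within the user-facing deadline."""
    return worst_case_seconds(**retry_config) <= user_deadline_s


# Illustrative config: 3 attempts, 10 s timeout each, 1 s base backoff.
# Worst case is 3 * 10 + (1 + 2) = 33 s, far beyond a 5 s user-facing deadline.
print(worst_case_seconds(max_attempts=3, per_try_timeout_s=10.0, base_delay_s=1.0))
print(fits_deadline(5.0, max_attempts=3, per_try_timeout_s=10.0, base_delay_s=1.0))
```

Running this kind of check for each entry in the dependency inventory shows where retry settings need tightening before an incident exposes the gap.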
