Reliability Playbooks
Dependency Reliability
Most production systems depend on databases, queues, third-party APIs, identity providers, and shared platforms. A system's reliability therefore depends on how those dependencies fail and recover, and on how the system behaves while they do.
Dependency inventory
For each critical dependency, record:
- Owner and support path.
- Expected availability and latency.
- Timeout and retry behavior.
- Quotas, rate limits, and concurrency limits.
- Failure mode and customer impact.
- Fallback, cache, queue, or degraded behavior.
- Monitoring and alert ownership.
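The inventory fields above could be kept as structured records so that gaps are easy to spot in review. The following is a minimal sketch, not a prescribed schema: the class name `DependencyRecord`, its field names, and the helper `missing_fields` are all hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class DependencyRecord:
    """One inventory entry per critical dependency (hypothetical schema)."""
    name: str
    owner: str = ""
    support_path: str = ""
    expected_availability: str = ""    # free text, e.g. a target percentage
    expected_latency: str = ""
    timeout_and_retry: str = ""
    limits: str = ""                   # quotas, rate limits, concurrency limits
    failure_mode_and_impact: str = ""
    degraded_behavior: str = ""        # fallback, cache, queue, or degraded mode
    alert_owner: str = ""


def missing_fields(record: DependencyRecord) -> list[str]:
    """Return the names of fields that are still empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A review could then call `missing_fields` on every record and treat a non-empty result for a critical dependency as an open action item.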
Failure controls
Common controls include:
- Timeouts shorter than user-facing deadlines.
- Retries with jitter and bounded attempts.
- Circuit breakers for repeated failures.
- Bulkheads for resource isolation.
- Queues for async buffering.
- Caches for safe stale reads.
- Feature flags to disable optional dependency paths.
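As an illustration of the first two controls, the sketch below combines bounded retries, jittered exponential backoff, and an overall deadline that stays inside the user-facing budget. The function name `call_with_retries`, its parameters, and the default values are assumptions chosen for the example. It also assumes the wrapped `operation` accepts a timeout in seconds and raises an exception on failure. Real code would normally retry only errors known to be transient rather than every `Exception`.

```python
import random
import time


def call_with_retries(operation, *, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, deadline_seconds=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation(timeout) until it succeeds, attempts run out,
    or the overall deadline would be exceeded."""
    deadline = clock() + deadline_seconds
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # no budget left for another attempt
        try:
            # Pass the remaining budget so one attempt cannot outlive the deadline.
            return operation(remaining)
        except Exception as exc:  # assumption: every failure is retryable
            last_error = exc
        if attempt == max_attempts - 1:
            break  # attempts exhausted, do not sleep again
        # "Full jitter": wait a random time up to the capped exponential backoff.
        cap = min(max_delay, base_delay * 2 ** attempt)
        delay = random.uniform(0, cap)
        if delay >= deadline - clock():
            break  # sleeping would use up the rest of the budget
        sleep(delay)
    raise TimeoutError("dependency call did not succeed within its budget") from last_error
```

The `sleep` and `clock` parameters exist only so the behavior can be tested without real waiting. A circuit breaker or bulkhead would sit around this call, not inside it.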
Watchouts
- Default SDK retries can exceed user-facing timeouts.
- Shared dependencies can create correlated failures.
- Optional dependencies often become required accidentally.
- Third-party SLAs do not replace internal failure design.
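The first watchout can be checked with simple arithmetic: the worst-case time a client spends on one logical call is roughly the per-attempt timeout multiplied by the number of attempts, plus the backoff between attempts. The sketch below uses hypothetical numbers, not the defaults of any particular SDK, and the helper name `worst_case_seconds` is an assumption.

```python
def worst_case_seconds(per_attempt_timeout, max_attempts, base_delay, max_delay):
    """Upper bound on the time spent if every attempt times out.

    Assumes exponential backoff capped at max_delay, with no jitter
    (jitter can only shorten the waits under a full-jitter scheme).
    """
    backoff = sum(min(max_delay, base_delay * 2 ** i)
                  for i in range(max_attempts - 1))
    return per_attempt_timeout * max_attempts + backoff


# Hypothetical client settings: 3 attempts, 10 s timeout each,
# 0.5 s base backoff capped at 5 s.
total = worst_case_seconds(10.0, 3, 0.5, 5.0)
user_facing_deadline = 5.0  # hypothetical request budget in seconds

print(total)                         # 31.5
print(total > user_facing_deadline)  # True
```

With these example settings the retry policy alone can take 31.5 seconds, more than six times the 5-second user-facing budget, so either the per-attempt timeout or the attempt count has to come down.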