Architecture
Resilience Patterns
Resilience is the ability to keep serving users, degrade safely, and recover quickly when dependencies fail. Design it into service behavior, not only infrastructure.
Failure-first design
For each critical dependency, document:
- What happens when it is slow?
- What happens when it is unavailable?
- What data can be stale?
- What user action should be blocked, queued, or degraded?
- How do operators detect and mitigate the failure?
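One way to keep these answers reviewable is to store them as a structured record per dependency. The sketch below is illustrative only: the `FailureModeRecord` type, its field names, and the `payments-api` dependency are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class ActionPolicy(Enum):
    BLOCK = "block"      # refuse the user action while the dependency is down
    QUEUE = "queue"      # accept the action and process it later
    DEGRADE = "degrade"  # serve a reduced or stale result


@dataclass(frozen=True)
class FailureModeRecord:
    """Answers to the failure-first questions for one dependency (hypothetical schema)."""
    dependency: str
    when_slow: str
    when_unavailable: str
    max_staleness_seconds: int  # 0 means no stale data is acceptable
    user_action_policy: ActionPolicy
    detection: str
    mitigation: str

    def missing_answers(self) -> list[str]:
        """Return the names of free-text fields that were left blank."""
        text_fields = ("when_slow", "when_unavailable", "detection", "mitigation")
        return [name for name in text_fields if not getattr(self, name).strip()]


# Example record with one question still unanswered.
record = FailureModeRecord(
    dependency="payments-api",  # hypothetical dependency name
    when_slow="Checkout waits up to 2 s, then shows a retry prompt.",
    when_unavailable="Orders are accepted and payment capture is queued.",
    max_staleness_seconds=0,
    user_action_policy=ActionPolicy.QUEUE,
    detection="Alert on capture error rate above threshold.",
    mitigation="",
)
print(record.missing_answers())
```

A check like `missing_answers()` could run in review or CI so that a critical dependency is not added with unanswered questions.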
Common patterns
- Timeouts to prevent hung requests.
- Retries with jitter and bounded attempts.
- Circuit breakers for failing dependencies.
- Bulkheads to isolate resource exhaustion.
- Queues to absorb load spikes asynchronously and apply backpressure when consumers fall behind.
- Idempotency keys for retried writes.
- Read-through or write-through caches where consistency allows it.
- Feature flags to disable risky paths quickly.
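Two of the patterns above, bounded retries with jitter and a circuit breaker, can be sketched together as follows. This is a minimal single-threaded illustration under stated assumptions, not a production implementation: the class and function names, thresholds, and delays are all hypothetical, and the wrapped call is assumed to enforce its own timeout.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_jitter(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn up to max_attempts times, sleeping a random, capped delay between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not keep retrying against an open breaker
        except Exception:
            if attempt == max_attempts:
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)


# Example: a call that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


breaker = CircuitBreaker(failure_threshold=5)
result = retry_with_jitter(lambda: breaker.call(flaky), sleep=lambda _: None)
```

The retry loop sits outside the breaker so that every attempt counts toward the breaker's failure total, and an open breaker ends the retries immediately. If the wrapped call is a write, it should carry an idempotency key so that a retried attempt cannot apply the change twice.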
Recovery objectives
Define recovery objectives per capability, not only per system.
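- RTO (recovery time objective): how quickly service must recover.
- RPO (recovery point objective): how much data loss is acceptable, usually expressed as a window of time.
- SLO (service level objective): what reliability users should expect.
- Error budget: how much unreliability can be spent, that is, the gap between 100% and the SLO target.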
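The relationship between an SLO and its error budget is simple arithmetic. The sketch below assumes a time-based availability SLO measured over a rolling window; the function names and the 30-day default are illustrative choices, and request-based SLOs would count failed requests instead of minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent; a negative value means it is overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes")

# After 10.8 minutes of downtime, about three quarters of the budget is left.
remaining = budget_remaining(0.999, 30, downtime_minutes=10.8)
print(f"{remaining:.0%} remaining")
```

Computing this per capability rather than per system makes it visible when one capability has spent its budget while the system-wide average still looks healthy.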
Dependency map
Testing resilience
Test failure modes before incidents force the test.
- Kill dependencies in lower (pre-production) environments.
- Run restore tests for critical data stores.
- Verify alert routes and escalation paths.
- Exercise rollback and feature flag controls.
- Simulate queue buildup and replay behavior.
Chaos testing is useful only after basic observability, ownership, and rollback paths exist.
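Before killing real dependencies in an environment, the same failure mode can be rehearsed in process by injecting a failing client into the code under test. The sketch below assumes a hypothetical `load_recommendations` function that falls back to cached data; all names are illustrative. The tests are plain functions that a runner such as pytest would collect, and they can also be called directly.

```python
class DependencyDown(Exception):
    """Stands in for whatever error the real client raises when the dependency is down."""


def load_recommendations(fetch, cache, user_id):
    """Return (items, degraded). Falls back to cached or empty data when fetch fails."""
    try:
        items = fetch(user_id)
    except DependencyDown:
        # Degrade instead of failing the request: stale data if cached, else nothing.
        return cache.get(user_id, []), True
    cache[user_id] = items
    return items, False


def dead_fetch(user_id):
    """Injected fault: behaves like a dependency that has been killed."""
    raise DependencyDown("injected failure")


def test_serves_stale_data_when_dependency_is_down():
    cache = {"u1": ["a", "b"]}
    items, degraded = load_recommendations(dead_fetch, cache, "u1")
    assert items == ["a", "b"]
    assert degraded is True


def test_degrades_to_empty_when_nothing_is_cached():
    items, degraded = load_recommendations(dead_fetch, {}, "u2")
    assert items == []
    assert degraded is True


def test_healthy_path_refreshes_cache():
    cache = {}
    items, degraded = load_recommendations(lambda uid: ["x"], cache, "u3")
    assert items == ["x"]
    assert degraded is False
    assert cache == {"u3": ["x"]}
```

In-process fault injection of this kind checks only the fallback logic. It does not replace killing the dependency in a real environment, which also exercises timeouts, alerts, and operator response.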