Resilience Patterns

Resilience is the ability to keep serving users, degrade safely, and recover quickly when dependencies fail. Design it into service behavior, not only infrastructure.

Failure-first design

For each critical dependency, document:

What happens when it is slow?
What happens when it is unavailable?
What data can be stale?
Which user actions should be blocked, queued, or degraded?
How do operators detect and mitigate the failure?

Common patterns

Timeouts to prevent hung requests.
Retries with jitter and bounded attempts.
Circuit breakers for failing dependencies.
Bulkheads to isolate resource exhaustion.
Queues to smooth bursts asynchronously and apply backpressure.
Idempotency keys for retried writes.
Read-through or write-through caches where consistency allows it.
Feature flags to disable risky paths quickly.
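
Two of the patterns above, bounded retries with jitter and a circuit breaker, can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the names CircuitBreaker, CircuitOpenError, and retry_with_jitter, and all default thresholds and delays, are assumptions made for this example. The breaker is not thread-safe and lets every caller through once the cool-down has elapsed.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `failure_threshold` consecutive
    failures, then allows trial calls once `reset_after` seconds have passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_jitter(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn at most max_attempts times, sleeping a random fraction of a
    capped exponential delay between attempts ("full jitter")."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # An open circuit is a signal to stop, not to retry.
        except Exception:
            if attempt == max_attempts:
                raise  # Attempts are bounded: surface the last error.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

A caller might combine the two as retry_with_jitter(lambda: breaker.call(fetch)), where fetch is a hypothetical dependency call that enforces its own timeout. Retrying writes this way is only safe when the write carries an idempotency key.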

Recovery objectives

Define recovery objectives per capability, not only per system.

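RTO (recovery time objective): the maximum acceptable time to restore service after a failure.
RPO (recovery point objective): the maximum acceptable data loss, measured as a window of time before the failure.
SLO (service level objective): the reliability target users should expect, such as a success rate over a defined window.
Error budget: the unreliability the SLO permits (1 minus the SLO target), which can be spent on change and risk.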
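
The relationship between an SLO and its error budget is simple arithmetic. The sketch below assumes a time-based availability SLO measured over a rolling window; the function names and the 30-day default are illustrative assumptions, not a prescribed method.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of unavailability the SLO permits over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining_fraction(slo_target, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

For a 99.9% target over 30 days, the budget works out to roughly 43 minutes. A capability that has already been unavailable for about 21.6 minutes in that window has spent half of it.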

Dependency map

Testing resilience

Test failure modes before incidents force the test.

Kill dependencies in lower (pre-production) environments.
Run restore tests for critical data stores.
Verify alert routes and escalation paths.
Exercise rollback and feature flag controls.
Simulate queue buildup and replay behavior.
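
Failure injection can start well below the environment level, in an ordinary test. The sketch below assumes a hypothetical get_profile function that serves a stale cached copy when its dependency is down. Every name here is invented for illustration, and the injected failure is a stub standing in for a real outage.

```python
class DependencyDown(Exception):
    """Stands in for a timeout or connection error from a real dependency."""


def get_profile(user_id, fetch, stale_cache):
    """Return (profile, is_stale). Falls back to the cached copy on failure."""
    try:
        profile = fetch(user_id)
    except DependencyDown:
        if user_id in stale_cache:
            return stale_cache[user_id], True  # Degrade: serve stale data.
        raise  # Nothing cached: fail loudly rather than invent data.
    stale_cache[user_id] = profile
    return profile, False


def killed_dependency(user_id):
    """Injected failure: behaves like the dependency being unavailable."""
    raise DependencyDown("injected failure")


def test_serves_stale_when_dependency_is_down():
    cache = {}
    healthy = lambda uid: {"name": "Ada"}
    assert get_profile("u1", healthy, cache) == ({"name": "Ada"}, False)
    # Same request with the dependency killed: expect the cached copy.
    assert get_profile("u1", killed_dependency, cache) == ({"name": "Ada"}, True)


def test_fails_loudly_without_cached_copy():
    try:
        get_profile("u2", killed_dependency, {})
    except DependencyDown:
        return
    raise AssertionError("expected DependencyDown")
```

Tests like these check the documented degraded behavior, but they do not replace killing the real dependency, which also exercises timeouts, alerts, and operator response.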

Chaos testing is useful only after basic observability, ownership, and rollback paths exist.