Operations
Incident Management
Incident management turns production failure into coordinated response, clear communication, and durable learning. The process should be simple enough to use under stress.
Severity model
Define severity by customer impact, data risk, and operational urgency.
- SEV1: broad outage, data loss risk, security impact, or blocked core business process.
- SEV2: degraded critical path, limited customer impact, or high-risk workaround.
- SEV3: localized degradation with workaround.
- SEV4: low-impact issue or follow-up cleanup.
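As a minimal sketch, the severity levels above can be written as a triage function. The `Signals` fields and the `classify` name are hypothetical, chosen only to mirror the bullets; a real team would map its own monitoring and intake signals onto them.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Signals:
    """Hypothetical triage inputs; each field mirrors one phrase in the list above."""
    broad_outage: bool = False
    data_loss_risk: bool = False
    security_impact: bool = False
    core_process_blocked: bool = False
    critical_path_degraded: bool = False
    limited_customer_impact: bool = False
    high_risk_workaround: bool = False
    localized_degradation: bool = False


def classify(s: Signals) -> Severity:
    """Return the most severe level whose conditions match."""
    # Check the most severe conditions first so the highest match wins.
    if s.broad_outage or s.data_loss_risk or s.security_impact or s.core_process_blocked:
        return Severity.SEV1
    if s.critical_path_degraded or s.limited_customer_impact or s.high_risk_workaround:
        return Severity.SEV2
    if s.localized_degradation:
        return Severity.SEV3
    # Anything else is treated as low-impact or follow-up cleanup.
    return Severity.SEV4
```

Checking SEV1 conditions first means an incident that is both localized and carries data loss risk is still classified as SEV1.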
Incident roles
Assign roles explicitly for significant incidents.
- Incident commander coordinates response.
- Technical lead drives investigation and mitigation.
- Communications lead handles stakeholder updates.
- Scribe records timeline, decisions, and follow-up items.
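One way to make role assignment explicit is to check it mechanically when an incident is declared. The sketch below assumes roles are stored as a simple mapping from role key to person; the key names and the `missing_roles` helper are illustrative, not part of any particular incident tool.

```python
# Role keys are hypothetical identifiers for the four roles listed above.
REQUIRED_ROLES = (
    "incident_commander",
    "technical_lead",
    "communications_lead",
    "scribe",
)


def missing_roles(assignments: dict[str, str]) -> list[str]:
    """Return the required roles that have no named person assigned."""
    # A blank or whitespace-only name counts as unassigned.
    return [role for role in REQUIRED_ROLES if not assignments.get(role, "").strip()]
```

A declaration step could refuse to proceed, or page for volunteers, while `missing_roles` returns a non-empty list.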
Response flow
Communications
Every incident update should include:
- Current impact.
- What changed since the last update.
- Current mitigation or investigation path.
- Next update time.
- Customer or compliance implications, if known.
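The checklist above can be enforced with a small template so that no update goes out with a field missing. This is a sketch under assumed names: `IncidentUpdate` and its fields are hypothetical, and the rendered wording is only one possible layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentUpdate:
    """Hypothetical update record; the four required fields have no defaults."""
    impact: str
    changes: str
    current_path: str
    next_update: str
    implications: Optional[str] = None  # included only "if known"

    def render(self) -> str:
        """Format the update as plain text, one field per line."""
        lines = [
            f"Current impact: {self.impact}",
            f"Changed since last update: {self.changes}",
            f"Current path: {self.current_path}",
            f"Next update: {self.next_update}",
        ]
        # Customer or compliance implications are optional and omitted when unknown.
        if self.implications:
            lines.append(f"Customer or compliance implications: {self.implications}")
        return "\n".join(lines)
```

Because the first four fields are required constructor arguments, an update that omits the next update time fails at construction rather than reaching stakeholders incomplete.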
Post-incident review
Post-incident reviews should be blameless but specific. Record what happened, why existing controls did not prevent it or detect it sooner, and what will change as a result.
Follow-up actions need owners, due dates, and priority. Unowned action items are just incident theater.
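A review tool or checklist script could flag action items that lack an owner, due date, or priority before the review is closed. The following is a minimal sketch; `ActionItem` and `incomplete_fields` are hypothetical names, and priority is kept as a free-form string because the labels a team uses are not specified here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical follow-up record from a post-incident review."""
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None


def incomplete_fields(item: ActionItem) -> list[str]:
    """Return the names of required fields that are missing on this item."""
    missing = []
    # Blank or whitespace-only strings count as missing.
    if not (item.owner or "").strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    if not (item.priority or "").strip():
        missing.append("priority")
    return missing
```

A review could be held open until `incomplete_fields` returns an empty list for every item.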