Incident Management

Incident management turns a production failure into a coordinated response, clear communication, and durable learning. The process should be simple enough to follow under stress.

Severity model

Define severity by customer impact, data risk, and operational urgency.

SEV1: broad outage, data loss risk, security impact, or blocked core business process.
SEV2: degraded critical path, limited customer impact, or high-risk workaround.
SEV3: localized degradation with workaround.
SEV4: low-impact issue or follow-up cleanup.
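The severity levels above can be sketched as a small triage function. This is only an illustration: the `Impact` flags and the `classify` function are hypothetical names, each flag stands for one phrase in the list above, and a real triage decision will usually involve judgment that booleans cannot capture.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Impact:
    # Hypothetical triage flags; each one mirrors a phrase in the severity model.
    broad_outage: bool = False
    data_loss_risk: bool = False
    security_impact: bool = False
    core_process_blocked: bool = False
    critical_path_degraded: bool = False
    limited_customer_impact: bool = False
    high_risk_workaround: bool = False
    localized_degradation_with_workaround: bool = False


def classify(impact: Impact) -> Severity:
    """Return the most severe level whose criteria match."""
    if (impact.broad_outage or impact.data_loss_risk
            or impact.security_impact or impact.core_process_blocked):
        return Severity.SEV1
    if (impact.critical_path_degraded or impact.limited_customer_impact
            or impact.high_risk_workaround):
        return Severity.SEV2
    if impact.localized_degradation_with_workaround:
        return Severity.SEV3
    # Nothing above matched: treat as low-impact issue or follow-up cleanup.
    return Severity.SEV4
```

Checking the most severe criteria first means an incident that matches several levels is always reported at the highest one.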

Incident roles

Assign roles explicitly for significant incidents.

Incident commander coordinates response.
Technical lead drives investigation and mitigation.
Communications lead handles stakeholder updates.
Scribe records timeline, decisions, and follow-up items.

Response flow

Communications

Every incident update should include:

Current impact.
What changed since the last update.
Current mitigation or investigation path.
Next update time.
Customer or compliance implications, if known.
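One way to keep updates consistent under stress is a template that refuses to render when a required item is missing. The sketch below assumes a hypothetical `IncidentUpdate` class; the field names are illustrative, and the next update time is kept as free text (for example "14:30 UTC") rather than a parsed timestamp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncidentUpdate:
    current_impact: str
    changes_since_last_update: str
    current_path: str          # mitigation or investigation path
    next_update_time: str
    implications: Optional[str] = None  # customer or compliance, only if known

    def render(self) -> str:
        """Format the update, raising ValueError if a required item is blank."""
        required = {
            "Current impact": self.current_impact,
            "Changed since last update": self.changes_since_last_update,
            "Current path": self.current_path,
            "Next update": self.next_update_time,
        }
        missing = [label for label, value in required.items() if not value.strip()]
        if missing:
            raise ValueError("update is missing: " + ", ".join(missing))
        lines = [f"{label}: {value.strip()}" for label, value in required.items()]
        # The implications line is optional because it may not be known yet.
        if self.implications and self.implications.strip():
            lines.append(f"Customer or compliance implications: {self.implications.strip()}")
        return "\n".join(lines)
```

A communications lead could fill in the fields and paste the result of `render()` into whatever channel carries stakeholder updates.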

Post-incident review

Postmortems should be blameless but specific. Record what happened, why existing controls did not prevent or detect it sooner, and what will change.

Follow-up actions need an owner, a due date, and a priority. Unowned action items are just incident theater.
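The ownership rule can be checked mechanically before a review is closed. This is a minimal sketch, assuming a hypothetical `ActionItem` record and `incomplete_items` helper; a real tracker would supply these fields from its own data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: Optional[str] = None  # free-form label; the scale is an assumption


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Describe every action item that lacks an owner, due date, or priority."""
    problems = []
    for item in items:
        fields = (("owner", item.owner), ("due date", item.due), ("priority", item.priority))
        missing = [name for name, value in fields if not value]
        if missing:
            problems.append(f"{item.title}: missing {', '.join(missing)}")
    return problems
```

An empty result means every follow-up has the three required attributes; anything else is a list of items to fix before the postmortem is considered done.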