Incident Response Playbook

The First Five Minutes

When an incident starts, the first five minutes determine the next five hours. Establish an incident commander (IC) immediately. Their job is coordination, not the technical fix. Alert the on-call team, open a war room (physical or virtual), and start a timeline document.

Incident Commander: The IC is the single point of coordination. They do not write code or restart servers. They track the incident timeline, assign tasks, and communicate status. The IC role rotates: a different person should take it on each incident, so that the whole organisation builds the capability.

War Room: The war room is not optional. It concentrates attention, reduces distractions, and creates shared context. We have seen incidents where the war room was not opened for 30 minutes, and the team worked in isolation, duplicating effort and missing dependencies.

Timeline Document: Every action is logged with timestamp, actor, and effect. "10:05: Alice identified database CPU at 100%. 10:08: Bob restarted the primary. 10:12: CPU dropped to 20%. 10:15: Failover to secondary initiated." The timeline prevents confusion about what has been tried.
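
The entry structure above (timestamp, actor, effect) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the names TimelineEntry, Timeline, log, and render are hypothetical, and a shared document or chat channel serves the same purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One line of the incident timeline (hypothetical structure)."""
    timestamp: datetime
    actor: str
    action: str
    effect: str = ""

    def render(self) -> str:
        line = f"{self.timestamp:%H:%M}: {self.actor} {self.action}."
        if self.effect:
            line += f" Effect: {self.effect}."
        return line


class Timeline:
    """Append-only log of what was tried, by whom, and what it did."""

    def __init__(self):
        self._entries = []

    def log(self, actor, action, effect="", at=None):
        # Default to the current UTC time; pass timezone-aware datetimes
        # for `at` so that entries can be compared when sorting.
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, action, effect)
        self._entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by timestamp so late-recorded entries still read in order.
        ordered = sorted(self._entries, key=lambda e: e.timestamp)
        return "\n".join(e.render() for e in ordered)
```

Calling log("Bob", "restarted the primary", effect="CPU dropped to 20%") records the action and its result in one entry, which is what lets a newcomer to the war room see what has already been tried.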

Communication Protocols

Internal communication means real-time updates in the war room and a status broadcast every 15 minutes. External communication means status page updates, customer notifications, and stakeholder briefings.

Update Rhythm: Every 15 minutes, the IC broadcasts a status update covering what happened, what we are doing, and when we expect resolution. The update must be honest: if we do not know the cause or the resolution time, we say so. Vague updates create confusion and erode trust.
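
As a rough sketch of that update, the function below builds a message with the three required parts and states plainly when the cause or resolution time is unknown. The function name, field labels, and UTC formatting are assumptions for illustration; only the 15-minute interval comes from the playbook.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=15)  # broadcast rhythm from the playbook


def format_status_update(what_happened, current_actions, sent_at,
                         cause=None, expected_resolution=None):
    """Build one IC status broadcast (hypothetical message layout)."""
    next_update = sent_at + UPDATE_INTERVAL
    lines = [
        f"Status update ({sent_at:%H:%M} UTC)",
        f"What happened: {what_happened}",
        # Say so explicitly when we do not know, rather than leaving it out.
        f"Cause: {cause or 'not yet known'}",
        f"What we are doing: {current_actions}",
        f"Expected resolution: {expected_resolution or 'not yet known'}",
        f"Next update: {next_update:%H:%M} UTC",
    ]
    return "\n".join(lines)
```

Including the time of the next update in every message is a design choice in this sketch: it commits the IC to the rhythm and tells readers when to look again.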

The golden rule of incident communication: never let customers find out before you tell them. If the system is down, notify customers immediately. Delayed notification looks like a cover-up. Proactive notification builds trust. We have seen customers thank us for early notification even when the incident lasted hours.

The Blameless Postmortem

Within 24 hours of resolution, conduct a blameless postmortem. Focus on systemic factors: what about our process, tools, or culture allowed this to happen? The goal is not to assign blame but to prevent recurrence. Action items must be tracked and completed.

Postmortem Template: Timeline (what happened, minute by minute), impact (customers affected, revenue lost, data corrupted), root cause (the systemic factor, not the individual mistake), contributing factors (other things that made the incident worse), and action items (specific, measurable, assigned, and time-bound).
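
One way to keep the template honest is to treat it as a record with five sections and check that none is left blank before the postmortem is circulated. The sketch below assumes every section is required; the class name Postmortem and the method missing_sections are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """The five template sections, in the order given above."""
    timeline: list = field(default_factory=list)
    impact: str = ""
    root_cause: str = ""
    contributing_factors: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def missing_sections(self):
        # A section counts as missing when it is an empty string or list.
        return [name for name, value in vars(self).items() if not value]
```

A draft with only impact and root_cause filled in would report timeline, contributing_factors, and action_items as missing, which is a cheap check to run before the review meeting.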

The blameless principle is non-negotiable. The question is never "who made the mistake?" but "what about our system allowed this mistake to cause an incident?" The person who made the mistake is the expert: they know exactly how the system failed. Their insight is essential for prevention.

Action Item Tracking: Action items are where learning turns into improvement. Each action item has an owner and a deadline. We track action items in the same system as regular work: Jira, Linear, or GitHub Issues. Action items that are not tracked are not completed. We review postmortem action items weekly until all are closed.
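
The weekly review can be reduced to two questions: which action items are still open, and which of those are past their deadline. The following is a small sketch with hypothetical names (ActionItem, weekly_review); in practice the data would come from whichever tracker the team already uses, and no tracker API is assumed here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A postmortem action item: every item has an owner and a deadline."""
    title: str
    owner: str
    deadline: date
    closed: bool = False


def weekly_review(items, today):
    """Split action items into open and overdue for the weekly review."""
    open_items = [item for item in items if not item.closed]
    overdue = [item for item in open_items if item.deadline < today]
    return {"open": open_items, "overdue": overdue}
```

The review is finished for a given postmortem only when the "open" list is empty.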

Our Recommendation

Practice incidents with game days. Simulate failures, measure response times, identify gaps. The teams that handle real incidents well are the teams that practice regularly. Incident response is a skill, not a procedure.

Game Day Format: Announce a simulated incident (e.g., "database primary is down"), start the timer, and observe the response. The IC is appointed, the war room is opened, the timeline is started, and the team works the incident. After 30 minutes, call "all clear" and debrief. The debrief focuses on what went well, what did not, and what to improve.
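
To make the debrief concrete, it helps to record how long each step of the response took. The sketch below is one possible way to do that: the milestone names and the GameDay class are assumptions for illustration, while the 30-minute exercise length comes from the format above.

```python
from datetime import datetime, timedelta

# Milestones taken from the format above; the identifiers are hypothetical.
MILESTONES = ("ic_appointed", "war_room_opened", "timeline_started")
EXERCISE_LENGTH = timedelta(minutes=30)


class GameDay:
    """Records how long the team took to reach each response milestone."""

    def __init__(self, scenario, started_at):
        self.scenario = scenario
        self.started_at = started_at
        self.reached = {}  # milestone name -> time elapsed since the start

    def mark(self, milestone, at):
        if milestone not in MILESTONES:
            raise ValueError(f"unknown milestone: {milestone}")
        # Keep the first time a milestone was reached; ignore repeats.
        self.reached.setdefault(milestone, at - self.started_at)

    def all_clear_at(self):
        return self.started_at + EXERCISE_LENGTH

    def debrief_gaps(self):
        # Milestones never reached are the first topics for the debrief.
        return [m for m in MILESTONES if m not in self.reached]
```

Comparing the elapsed times across successive game days gives the team a simple measure of whether practice is paying off.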

The goal of game days is not to test the system but to test the team. Systems fail in predictable ways; teams fail in unpredictable ways. A team that has never practiced incident response will be slow, confused, and ineffective when a real incident occurs. A team that practices monthly will be fast, coordinated, and effective.

Progressive Complexity: Start with simple game days that simulate a single failure (server down, database slow). Progress to complex scenarios: cascading failures, multiple simultaneous incidents, and communication failures. The complexity should match the team's maturity. A team that cannot handle a simple game day will not handle a complex incident.
