Act within 24 hours: create a three-column incident log recording (A) measurable losses (dollars, hours, headcount), (B) controllable variables, (C) immediate mitigations. Example entry: “Revenue impact: $12,400 (−8% vs. 30-day avg); Lost hours: 160; Responsible module: billing API.” Update the log every 12 hours for the first 72 hours.
Run a rapid triage: freeze non-critical expenditures and produce a 90-day cash-flow reforecast within 48 hours using three scenarios: optimistic (recovery in 30 days), baseline (60 days), adverse (90+ days). Assign one finance owner and require scenario inputs: current runway, monthly burn, receivables age. Target: show runway extension options that buy at least one additional 30-day period.
Use a simple decision matrix (score 1–5) across three axes: business impact, recurrence likelihood, fix cost. Sum scores; treat totals ≥9 as high-priority. Example: impact=4, recurrence=3, cost=3 → total 10 → escalate to leadership and allocate up to 10% of the project budget for a hotfix or alternate path within 14 days.
Address operations with concrete deadlines: list top 5 affected processes, assign owners, set a 48-hour temporary workaround and a 14-day permanent correction plan. Track progress with a daily 10-minute standup and a visual board showing % complete; require any task falling behind by more than 25% to trigger a secondary resource allocation.
Capture lessons using a 7-day micro-retrospective: apply the “5 Whys” for root cause, document three preventive controls and one monitoring metric per control (example: increase automated test coverage by 20% and alert if post-deploy error rate >0.5%). Publish a one-page summary to stakeholders within 10 days and add the incident to the risk register with explicit trigger thresholds.
Regain momentum with measurable targets: set a 90-day recovery roadmap with weekly milestones and three KPIs (cash burn, % on-time delivery, customer satisfaction score). Assign a single owner within 24 hours and publish the first formal update to customers and teammates within 72 hours.
Assess Immediate Consequences and Prioritize Safety Risks
Stop life threats first: apply direct pressure to massive hemorrhage or use a tourniquet placed 5–7 cm proximal to the wound and write the application time on the patient; clear and maintain airway (chin lift/jaw thrust), begin rescue breaths or CPR per local protocol if absent, treat altered mental status as priority for airway control.
Use objective vital thresholds for triage: respiratory rate <10 or >30 breaths/min, systolic blood pressure <90 mmHg, capillary refill >2 seconds, or Glasgow Coma Scale ≤8 require Immediate category and continuous monitoring every 5–10 minutes.
Survey scene hazards within the first 60–90 seconds: identify fire, risk of structural collapse, energized electrical lines, spilled chemicals, visible vapors or smoke. If any hazard exists, remove uninjured personnel upwind/upslope and establish a hard perimeter; for unknown vapors assume an evacuation distance of 200–500 m upwind unless authoritative guidance indicates otherwise.
Communicate concise incident data to emergency responders immediately: exact location (GPS coordinates if available), number of casualties by priority category, dominant hazards (fire, collapse, chemical, electrical), access routes for responders, and any known substance names or SDS references. Use clear text, repeat back instructions, and keep radio transmissions under 20 seconds.
Control entry: restrict access to trained personnel only. Use PPE matched to the hazard (nitrile gloves and face shield for blood/body fluids; APR or SCBA plus chemical-resistant suit for confirmed hazardous material exposure) and log who enters, time in/time out, and respiratory protection used.
Decide shelter versus evacuation using risk tradeoffs: if source uncontrolled or toxic plume present, evacuate upwind at 200–500 m; if evacuation path exposes people to greater harm (night, traffic, severe weather), close windows/vents, seal gaps with tape/cloth, turn off HVAC, and monitor outside air for 30–60 minutes while awaiting responder advice.
Document interventions and timestamps immediately: triage tags, tourniquet application time, medications given, vital signs, and witness statements. Photographs of scene and hazards should be taken only from safe locations and stored with incident records for handover to responders and later analysis.
Conduct a Rapid Root-Cause Check and Record Key Fault Points
Execute a focused five-question analysis within 30 minutes of detecting an incident: (1) what changed in the last 60 minutes; (2) who deployed or modified configs; (3) which metrics crossed thresholds; (4) what external dependencies show errors; (5) what immediate mitigation was attempted. Log the answers to a single incident record ID (format INC-YYYYMMDD-NNN).
Collect these three artifact categories immediately: system logs (−10m..+10m), deployment manifests at current git SHA, and monitoring snapshots for the last 60 minutes. Use atomic captures to avoid drift: capture timestamp, commit, node IDs, and process IDs for every artifact entry.
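An added example (not part of the original page) of the capture metadata described above, with a check that flags a capture set that has drifted. The SHA-256 digest, the 60-second skew limit and all names and sample values are assumptions introduced here.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def capture(kind, content, commit, node_id, pid, captured_at):
    """One manifest entry: the metadata the text requires plus a content digest."""
    return {
        "kind": kind, "captured_at": captured_at, "commit": commit,
        "node_id": node_id, "pid": pid,
        # Assumption (not on the page): a SHA-256 digest makes later edits detectable.
        "sha256": hashlib.sha256(content).hexdigest(),
    }

def drift_findings(entries, max_skew=timedelta(seconds=60)):
    """Explain why a capture set is not one consistent snapshot ([] = consistent)."""
    findings = []
    commits = sorted({e["commit"] for e in entries if e["commit"]})
    if len(commits) > 1:
        findings.append(f"mixed commits: {', '.join(commits)}")
    times = [e["captured_at"] for e in entries]
    if times and max(times) - min(times) > max_skew:
        findings.append(f"capture skew {max(times) - min(times)} exceeds {max_skew}")
    for e in entries:
        missing = [k for k in ("commit", "node_id", "pid") if e[k] in (None, "")]
        if missing:
            findings.append(f"{e['kind']}: missing {', '.join(missing)}")
    return findings

def unchanged(entry, content):
    """True if the stored artifact still matches the digest taken at capture time."""
    return hashlib.sha256(content).hexdigest() == entry["sha256"]

# Constructed sample data (not from a real incident).
T0 = datetime(2024, 5, 14, 14, 5, tzinfo=timezone.utc)
LOG = b"14:02:11 svc ERROR upstream timeout\n"
GOOD = [capture("service-logs", LOG, "a1b2c3d", "node-3", 4121, T0),
        capture("manifest", b"image: svc:a1b2c3d\n", "a1b2c3d", "node-3", 4121,
                T0 + timedelta(seconds=20))]
# Second set: manifest taken five minutes later, after another commit landed.
BAD = [GOOD[0],
       capture("manifest", b"image: svc:9f8e7d6\n", "9f8e7d6", "node-3", 4121,
               T0 + timedelta(minutes=5))]

if __name__ == "__main__":
    print(drift_findings(GOOD))
    print(drift_findings(BAD))
```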
Use measurable criteria to classify fault points and set target actions:
– Severity P1: service unavailable or error rate ≥10% across ≥2 regions – initial hypothesis in 15 minutes, mitigation decision within 30 minutes.
– Severity P2: error rate 1–10% or latency degradation >200 ms – hypothesis within 1 hour, mitigation plan within 4 hours.
– Severity P3: intermittent or single-user impact – hypothesis within 8 hours, remediation scheduled within 48 hours.
| Artifact | Capture Command / Location | Owner | Retention |
|---|---|---|---|
| Service logs | kubectl logs --since=10m -l app=svc or journalctl -u svc -S -10m | SRE on call | retain 30 days |
| Deployment manifest & SHA | git rev-parse HEAD, CI artifact link | Release engineer | retain 90 days |
| Monitoring snapshot | Grafana panel export (last 60m), alert history | Monitoring owner | retain 90 days |
| External dependency status | third-party status page + curl -I health endpoint | Integration owner | retain 30 days |
Populate the incident record with fixed fields: incident_id, start_timestamp, detected_by, hypothesis (one-sentence), evidence_links (artifact URLs), immediate_action, suggested_next_step, assignee, classification (P1/P2/P3). Use a single-line hypothesis; reserve a separate field for supporting notes.
Apply triage rules to attribute root cause quickly: if deployment SHA ≠ last stable SHA and error spike starts within 3 minutes of rollout, mark likely code change; if deployment SHA matches but CPU/memory jump precedes errors, mark infrastructure; if external dependency shows 5xx across regions, mark third-party.
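The severity thresholds, fixed record fields and triage rules above, combined in one added example (not code from the original page). Function names, the sample SHAs and URL, the handling of a single-region error rate of 10% or more, and reading "across regions" as at least two regions are assumptions.

```python
from datetime import datetime, timedelta, timezone

def classify_severity(error_rate_pct, regions, latency_delta_ms=0.0, unavailable=False):
    """Map measurements to P1/P2/P3 using the thresholds listed above."""
    if unavailable or (error_rate_pct >= 10 and regions >= 2):
        return "P1"
    # Assumption: >=10% errors confined to one region is not covered by the
    # page's list, so it is treated as P2 here.
    if error_rate_pct >= 1 or latency_delta_ms > 200:
        return "P2"
    return "P3"

def attribute_cause(deployed_sha, stable_sha, rollout_at, spike_at,
                    resource_jump_at=None, dependency_5xx_regions=0):
    """Apply the three triage rules in the order the text gives them."""
    if deployed_sha != stable_sha:
        if timedelta(0) <= spike_at - rollout_at <= timedelta(minutes=3):
            return "code change"
    elif resource_jump_at is not None and resource_jump_at < spike_at:
        return "infrastructure"
    # Assumption: "across regions" means 5xx seen from at least two regions.
    if dependency_5xx_regions >= 2:
        return "third-party"
    return "undetermined"

def build_record(seq, start, detected_by, hypothesis, evidence_links,
                 immediate_action, suggested_next_step, assignee, classification):
    """Build the fixed-field record; the hypothesis must be a single line."""
    if "\n" in hypothesis.strip():
        raise ValueError("hypothesis must be one line; put detail in a notes field")
    return {
        "incident_id": f"INC-{start:%Y%m%d}-{seq:03d}",
        "start_timestamp": start.isoformat(),
        "detected_by": detected_by,
        "hypothesis": hypothesis.strip(),
        "evidence_links": list(evidence_links),
        "immediate_action": immediate_action,
        "suggested_next_step": suggested_next_step,
        "assignee": assignee,
        "classification": classification,
    }

# Constructed sample: rollout at 14:00 UTC, error spike two minutes later.
ROLLOUT = datetime(2024, 5, 14, 14, 0, tzinfo=timezone.utc)
SPIKE = ROLLOUT + timedelta(minutes=2)
CAUSE = attribute_cause("a1b2c3d", "9f8e7d6", ROLLOUT, SPIKE)
RECORD = build_record(
    seq=1, start=SPIKE, detected_by="alert:error-rate",
    hypothesis=f"Likely {CAUSE}: errors began 2 min after rollout of a1b2c3d",
    evidence_links=["https://example.invalid/artifacts/logs-1400Z"],
    immediate_action="rollback", suggested_next_step="diff a1b2c3d against 9f8e7d6",
    assignee="sre-on-call",
    classification=classify_severity(error_rate_pct=12.0, regions=2),
)
```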
Commands and quick checks to run (copy into incident chat): kubectl rollout history deploy/NAME --revision=N, kubectl get pods -o wide --field-selector=status.phase!=Running, curl -sS -D - https://service/health, aws cloudwatch get-metric-statistics --start-time (or equivalent). Paste outputs to evidence_links.
Close the loop in the record: mark root_cause: confirmed/hypothesis/probable, list corrective action taken (rollback, config patch, scaling), note time-to-contain, and attach exact commands used. Archive the incident record after post-mortem with tags: component, release_sha, primary_cause, human_error? yes/no.
Decide Which Objectives to Salvage, Pause, or Abandon
Prioritize objectives by expected net value per remaining week: salvage efforts with expected net value > $5,000/month or probability of success > 60% with < 8 weeks to completion; pause initiatives with expected net value between $0–$5,000/month or uncertainty 30–60%; abandon items with negative expected net value, required additional investment >100% of original budget, or measurable market interest < 10% based on recent user metrics.
Decision criteria
Evaluate each objective using four numeric indicators: current completion percentage; cost to finish in USD; estimated monthly benefit in USD; probability of delivery within the target window. Apply these thresholds: completion < 30% plus cost-to-finish > 2x remaining budget → lean toward abandonment; completion ≥ 60% with cost-to-finish ≤ remaining budget → prioritize salvage. Use opportunity cost: compare resource redeployment value per month; if redeployment yield > objective yield by ≥ 20%, consider pause or drop.
Execution actions
Decision timeline: make a call within 72 hours of a trigger event (budget overrun > 25%, missed milestone > 2 weeks, or a new competitor signal); record rationale in the project log with numeric fields for cost, benefit, probability. Assign decision authority by category: product features → product manager; process improvements → operations lead; budget increases > $50,000 → CFO approval. For paused objectives, schedule automatic re-evaluation every 30 days with a required update on at least two metrics: market traction and resource availability.
Communication template: notify stakeholders within 48 hours; include decision type (salvage/pause/abandon), three quantified impacts (monthly cost change, expected benefit delta, headcount effect), next steps and deadline for follow-up. Red flags that justify immediate abandonment: >3 consecutive missed milestones, team turnover >25% for the project team, user engagement decline >10% month-over-month. Use the three-question filter before any final move: 1) What remains to deliver? 2) What is the precise cost to finish? 3) What measurable outcome will change if stopped now?
Set Short-Term Remediation Actions with Clear Deadlines and Owners
Assign one accountable owner per immediate action, set a fixed deadline, and capture the required deliverable and max budget in the action record. Example: Owner: Jenna Morris (Engineering Lead) – Deadline: 72 hours – Deliverable: hotfix branch + deployment playbook – Budget cap: $2,000.
Use strict time buckets: 0–24 hours for triage and impact containment, 24–72 hours for a temporary workaround or hotfix, 72 hours–7 days for stabilization and service-level restoration. Attach a measurable target to each bucket (e.g., restore 90% of API throughput, reduce customer error rate to <1% of baseline).
Provide owner authority and limits: Owner must have delegated authority to commit up to 8 engineer-hours or approve third-party services up to $2k without further sign-off. Any request beyond those limits requires escalation with written justification and an estimated cost/time delta.
Define acceptance criteria and artifacts: “Done” requires: ticket ID update, code link or rollback plan, automated test report, deployment timestamp, and a customer-impact log. QA or Ops must verify artifacts within 24 hours and mark the action closed or reopen with clear reasons.
Apply a RACI mapping for rapid clarity: Responsible = owner, Accountable = project sponsor, Consulted = security/ops, Informed = affected stakeholders. Record RACI entry on the action card and surface it in all status updates.
Enforce a fixed communication cadence: 10-minute standup twice daily (09:00 and 16:00 local) for active actions; concise written snapshot every 12 hours in the single-source channel with fields: ActionID, Owner, Current Status, ETA, Blockers, Next Step. Use subject format: [ActionID] Owner – Status – ETA.
Set automatic escalation triggers: escalate to the sponsor if any of these occur: deadline missed by >12 hours, >10 related customer tickets open, SLA breach generating revenue exposure >$5,000/day. Escalation message must include last update, corrective steps taken, and revised ETA.
Timebox efforts and manage resources: cap owner focus to two high-priority actions simultaneously; request additional headcount by submitting a one-paragraph justification with estimated hours and target date. Reassign ownership only with documented handover including pending tasks and risks.
Capture short post-action work: within 48 hours of closure, owner files a short incident note: root cause outline, single permanent fix candidate, two preventive controls, and residual risk. Schedule a 60‑minute review meeting within 14 days to assign owners and deadlines for permanent fixes identified.
Record every change in the action card; require timestamps and names on all status changes to preserve accountability and enable rapid audits.
Update Stakeholders with a Concise Remediation Brief and Next Actions
Issue a one-page executive brief within 24 hours to all key stakeholders listing measured impact, confirmed cause (or investigation status), containment steps completed, prioritized tasks with owners and hard deadlines, plus explicit decision points and escalation triggers.
What the brief must contain
- Headline: single-sentence status (example: “Payments service degraded – 14:02–14:35 UTC; 12% of transactions delayed”).
- Scope & impact: affected components, user segments, exact user count or percentage, estimated revenue-at-risk per hour (numeric).
- Cause status: “Root cause confirmed” with evidence links (log IDs, monitoring alert IDs) or “Under investigation” with next evidence-gathering step and owner.
- Containment actions completed: timestamped list with owner initials and outcome for each action.
- Next actions: numbered tasks (max 6), owner, due date/time in UTC, and measurable acceptance criteria (what success looks like).
- Decision points: options (A/B), recommended option, decision deadline and decision owner.
- Residual risk & rollback criteria: rating (Low/Medium/High) and explicit rollback/stop conditions tied to metrics.
- Customer & public communications: assigned communicator, channel (email/status page/social), next update time.
Delivery and follow-up
- Distribute the brief via email to the stakeholder list and post to the incident channel (Slack/Teams); include a single link to the live incident timeline document.
- If impact is high, schedule a 15-minute stakeholder sync within 2 hours; update the brief after the sync and redistribute with version stamp.
- Maintain one-line status updates every 2 hours (UTC) until normal operations restore; each update must reference the brief version and any changed owners/deadlines.
- Produce a short remedial report within 72 hours containing final root-cause analysis, corrective actions completed, and a timeline for permanent fixes with assigned owners.
Reallocate Resources and Create a Minimal Contingency Budget
Immediately freeze noncritical discretionary spend and transfer 10–20% of the active budget, or an amount equal to two weeks of operating burn, into a separate contingency ledger within 48 hours.
Classify expenditures by priority: Critical (payroll, SLA-bound vendors, regulatory costs) – preserve 95–100%; High-priority (client deliverables, revenue-driving features) – reduce 15–30%; Low-priority (new product experiments, nonessential marketing, conferences) – suspend 100% until review. Target savings by category: payroll adjustments only via hiring freeze or reduced contractor hours; vendor deferrals for 15–30 days; pause advertising buys with >30% conversion lag.
Funding, triggers, approval
Contingency sizing rule: contingency = greater of (10% of active project budget) or (2 weeks of cash burn). Trigger thresholds to access contingency: revenue shortfall ≥10% month-over-month, schedule slip ≥2 weeks for a critical milestone, vendor breach of contract. Authorization matrix: project lead may approve up to $5,000 from contingency; project lead + finance signoff for $5,001–$50,000; CFO or designee for amounts >$50,000. Every draw requires a one-paragraph justification, three cost-line impact, and expected duration of use.
Execution, tracking, replenishment
Execution timeline: 0–48 hours – freeze noncritical spend and open contingency ledger; 72 hours – produce a reforecast with updated burn rate and contingency balance; 7 days – implement vendor negotiations and contractor hour reductions. Tracking: use a dedicated cost code and weekly subledger report; reconcile contingency ledger against bank balance every Monday. Replenishment rule: restore contingency to target within 90 days after cash inflows exceed a 15% improvement in monthly burn, or via monthly transfers of at least 25% of monthly surplus until target reached.
Concrete example: active budget $500,000, monthly burn $100,000 → contingency = max(10% ($50,000), two weeks burn $50,000) = $50,000. Reallocation moves: pause marketing $20,000, reduce contractor hours $10,000, extend vendor terms to free $15,000, draw $5,000 from general reserves. Resulting contingency available: $50,000; expected runway extension: 2 weeks.
Questions and Answers:
What should I do in the first 24 hours after a plan fails?
Take three practical steps: 1) Pause and take a short break to reduce stress so you can think clearly. 2) Make a quick assessment: note what went wrong, who is affected, and any immediate risks that need containment. 3) Communicate with key people (team members, clients, partners) with a concise status update and next steps. These actions stop further harm, keep stakeholders informed, and buy time for a deeper review later.
How can I figure out which parts of my original plan are still useful and which should be abandoned?
Start by listing the plan’s original goals and the outcomes you actually observed. For each element, ask: did this produce value or reduce risk? Use a simple matrix with columns like “works as is,” “needs change,” and “discard.” Run a short root-cause check for failures that seem fixable with small tweaks versus those caused by flawed assumptions. Talk with at least one colleague or advisor to get a different perspective, then prioritize fixes based on effort versus impact. Finally, create a revised timeline that keeps salvageable pieces and removes the rest.