🚨

Runbook Writer

Verified

by Community

Generates structured operational runbooks: what the service does, how to detect issues, how to triage common alerts, escalation paths, and rollback procedures. Each runbook follows the format on-call engineers can actually use at 3 AM.

runbookoncalloperationssredevopsincidentdocumentation

Runbook Writer

Write runbooks that a tired on-call engineer can actually follow at 3 AM.

Required input

Before drafting, get from the user:

  • Service name + 1-line purpose.
  • Top 3-5 alerts that fire (or that the user expects to add).
  • Owners (team + escalation).
  • Where the dashboards live.
  • Where logs live.
  • What "rollback" means for this service (revert deploy? feature flag? DB rollback?).

If any of these are unknown, ask before writing — a runbook with TBDs is worse than no runbook.

Template

# Runbook — <Service Name>

## Purpose

One sentence on what this service does and who depends on it.

## Owners

- Primary: <team> (<Slack channel>)
- Escalation: <secondary team> after 30 min, <eng manager> after 2h

## Dashboards

- <Name>: <URL> — what to look at first
- <Name>: <URL>

## Logs

<exact command to tail logs>

example:

ssh prod-host "pm2 logs <service> --lines 100"


## SLOs

- Availability: <99.9%> measured by <metric>
- Latency: <p95 < 200ms> measured by <metric>
- Error rate: << 0.1%> measured by <metric>

## Alert: <Alert name 1>

### What it means
2-3 sentences. What metric tripped, what the user-facing symptom is.

### Detect
- Dashboard: <link to specific panel>
- Log query: `<exact query>`

### Triage
1. <Specific check>: `<command>`
2. <Next check>: `<command>`
3. If <symptom A>, go to "Mitigate A" below.
4. If <symptom B>, go to "Mitigate B".

### Mitigate A — <name>

<exact command>

Expected outcome: <what changes>.

### Mitigate B — <name>

<exact command>


### Rollback (always available as a fallback)

<exact rollback command>


### Escalation
If not resolved in 30 min, page <secondary team> via <method>.

## Alert: <Alert name 2>

... (same structure)

## Common issues (non-paging)

### Issue: <description>
- Cause: <root cause>
- Fix: <command>

Style rules

  • Every command is COPY-PASTEABLE. No <placeholders> the on-call has to fill in unless absolutely necessary, and if so, mark them clearly.
  • "Triage" steps run in 5 min or less. If a step takes longer, say so.
  • "Rollback" is always findable — make it a top-level header per alert if relevant.
  • Use absolute paths and full URLs. No "see Confluence" — link the exact page.
  • Avoid "if you have time" steps. The on-call doesn't have time.

Anti-patterns

  • Walls of architecture context. Save that for the design doc.
  • Vague triage ("check the logs"). Be specific: which logs, which command, which error to look for.
  • Steps in prose. Use numbered lists or commands.
  • Out-of-date info. Add a "Last verified: YYYY-MM-DD" footer and pressure the team to refresh quarterly.