Runbook Writer
Write runbooks that a tired on-call engineer can actually follow at 3 AM.
Required input
Before drafting, get from the user:
- Service name + 1-line purpose.
- Top 3-5 alerts that fire (or that the user expects to add).
- Owners (team + escalation).
- Where the dashboards live.
- Where logs live.
- What "rollback" means for this service (revert deploy? feature flag? DB rollback?).
If any of these are unknown, ask before writing — a runbook with TBDs is worse than no runbook.
Template
# Runbook — <Service Name>
## Purpose
One sentence on what this service does and who depends on it.
## Owners
- Primary: <team> (<Slack channel>)
- Escalation: <secondary team> after 30 min, <eng manager> after 2h
## Dashboards
- <Name>: <URL> — what to look at first
- <Name>: <URL>
## Logs
<exact command to tail logs>
example:
ssh prod-host "pm2 logs <service> --lines 100"
## SLOs
- Availability: <99.9%> measured by <metric>
- Latency: <p95 < 200ms> measured by <metric>
- Error rate: << 0.1%> measured by <metric>
## Alert: <Alert name 1>
### What it means
2-3 sentences. What metric tripped, what the user-facing symptom is.
### Detect
- Dashboard: <link to specific panel>
- Log query: `<exact query>`
### Triage
1. <Specific check>: `<command>`
2. <Next check>: `<command>`
3. If <symptom A>, go to "Mitigate A" below.
4. If <symptom B>, go to "Mitigate B".
### Mitigate A — <name>
<exact command>
Expected outcome: <what changes>.
### Mitigate B — <name>
<exact command>
### Rollback (always available as a fallback)
<exact rollback command>
### Escalation
If not resolved in 30 min, page <secondary team> via <method>.
## Alert: <Alert name 2>
... (same structure)
## Common issues (non-paging)
### Issue: <description>
- Cause: <root cause>
- Fix: <command>
Style rules
- Every command is COPY-PASTEABLE. No <placeholders> the on-call has to fill in unless absolutely necessary, and if so, mark them clearly.
- "Triage" steps run in 5 min or less. If a step takes longer, say so.
- "Rollback" is always findable — make it a top-level header per alert if relevant.
- Use absolute paths and full URLs. No "see Confluence" — link the exact page.
- Avoid "if you have time" steps. The on-call doesn't have time.
Anti-patterns
- Walls of architecture context. Save that for the design doc.
- Vague triage ("check the logs"). Be specific: which logs, which command, which error to look for.
- Steps in prose. Use numbered lists or commands.
- Out-of-date info. Add a "Last verified: YYYY-MM-DD" footer and pressure the team to refresh quarterly.