Runbook Writer

Name: Runbook Writer
Author: Community

Write runbooks that a tired on-call engineer can actually follow at 3 AM.

Required input

Before drafting, get from the user:

Service name + 1-line purpose.
Top 3-5 alerts that fire (or that the user expects to add).
Owners (team + escalation).
Where the dashboards live.
Where logs live.
What "rollback" means for this service (revert deploy? feature flag? DB rollback?).

If any of these are unknown, ask before writing — a runbook with TBDs is worse than no runbook.

Template

# Runbook — <Service Name>

## Purpose

One sentence on what this service does and who depends on it.

## Owners

- Primary: <team> (<Slack channel>)
- Escalation: <secondary team> after 30 min, <eng manager> after 2h

## Dashboards

- <Name>: <URL> — what to look at first
- <Name>: <URL>

## Logs

example:

ssh prod-host "pm2 logs <service> --lines 100"


## SLOs

- Availability: <99.9%> measured by <metric>
- Latency: <p95 < 200ms> measured by <metric>
- Error rate: << 0.1%> measured by <metric>

## Alert: <Alert name 1>

### What it means
2-3 sentences. What metric tripped, what the user-facing symptom is.

### Detect
- Dashboard: <link to specific panel>
- Log query: `<exact query>`

### Triage
1. <Specific check>: `<command>`
2. <Next check>: `<command>`
3. If <symptom A>, go to "Mitigate A" below.
4. If <symptom B>, go to "Mitigate B".

### Mitigate A — <name>

Expected outcome: <what changes>.

### Mitigate B — <name>


### Rollback (always available as a fallback)


### Escalation
If not resolved in 30 min, page <secondary team> via <method>.

## Alert: <Alert name 2>

... (same structure)

## Common issues (non-paging)

### Issue: <description>
- Cause: <root cause>
- Fix: <command>

Style rules

Every command is COPY-PASTEABLE. No <placeholders> the on-call has to fill in unless absolutely necessary, and if so, mark them clearly.
"Triage" steps run in 5 min or less. If a step takes longer, say so.
"Rollback" is always findable — make it a top-level header per alert if relevant.
Use absolute paths and full URLs. No "see Confluence" — link the exact page.
Avoid "if you have time" steps. The on-call doesn't have time.

Anti-patterns

Walls of architecture context. Save that for the design doc.
Vague triage ("check the logs"). Be specific: which logs, which command, which error to look for.
Steps in prose. Use numbered lists or commands.
Out-of-date info. Add a "Last verified: YYYY-MM-DD" footer and pressure the team to refresh quarterly.