Developer Kit
Outage Response Playbook
Outage Response Playbook generates complete, scenario-specific runbooks rather than generic templates. It produces severity tiers with measurable criteria, role assignments for every step, step-by-step response procedures, escalation trees with observable triggers, communication templates, resolution checklists, and blameless post-mortem templates.

Engineering managers, platform teams, and SRE leads use it to document failure modes before they happen instead of improvising under pressure. A playbook built from this skill is immediately usable, not a starting point that needs another hour of editing before it is safe to hand to an on-call engineer.

What makes it production-grade is specificity: every step has an explicit role owner, severity tiers use measurable thresholds, escalation contacts are role-based, and post-mortems are blameless by construction. Playbooks are written for 3 AM, not for ideal conditions.
One-Time Purchase
$19.99
Outage Playbook: Database Connection Pool Exhaustion
Classification
Type: Infrastructure / Database
Severity Tiers:
- P1: >50% of API requests returning 500 errors for >5 minutes; customer-facing impact confirmed
- P2: >20% of API requests returning 500 errors; degraded performance but partial functionality
- P3: Connection pool warnings in logs; no customer-facing impact yet; preventive investigation
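The tier definitions above can be expressed as a small decision function, which is one way to keep severity calls consistent between responders. This is a minimal sketch, not part of the playbook itself: the function name, parameter names, and the choice to return `None` when no tier matches are all assumptions made for illustration.

```python
from typing import Optional


def classify_severity(
    error_rate: float,         # fraction of API requests returning 500 (0.0 to 1.0)
    minutes_elevated: float,   # how long the error rate has been at this level
    customer_impact: bool,     # customer-facing impact confirmed
    pool_warnings: bool,       # connection pool warnings present in logs
) -> Optional[str]:
    """Map observed signals to P1/P2/P3 using the thresholds listed above."""
    # P1: >50% errors for >5 minutes with confirmed customer impact.
    if error_rate > 0.50 and minutes_elevated > 5 and customer_impact:
        return "P1"
    # P2: >20% errors (this also covers a >50% spike not yet sustained 5 minutes).
    if error_rate > 0.20:
        return "P2"
    # P3: warnings only, with no customer-facing impact yet.
    if pool_warnings and not customer_impact:
        return "P3"
    # Assumption: anything else is outside the tier definitions.
    return None
```

For example, `classify_severity(0.6, 3, True, True)` falls through to P2 because the 5-minute condition for P1 is not yet met.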
Trigger Conditions: Alert "db-pool-exhaustion" fires when available connections drop below 10% of max pool size for >60 seconds
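The trigger condition combines a threshold with a duration, so a single low reading should not fire the alert. The sketch below shows one way that "below 10% for more than 60 seconds" logic could be evaluated over a series of samples. The function name and the sample format (timestamp in seconds, available connections) are hypothetical; a real alert would normally be defined in the monitoring system rather than in application code.

```python
from typing import Iterable, Tuple


def pool_alert_should_fire(
    samples: Iterable[Tuple[float, int]],  # (timestamp_seconds, available_connections), oldest first
    max_pool_size: int,
    threshold: float = 0.10,               # 10% of max pool size
    window_seconds: float = 60.0,          # breach must last longer than this
) -> bool:
    """Return True if availability stayed below the threshold for more than the window."""
    breach_start = None
    for ts, available in samples:
        if available < threshold * max_pool_size:
            if breach_start is None:
                breach_start = ts          # first sample of a new breach
            if ts - breach_start > window_seconds:
                return True
        else:
            breach_start = None            # recovery resets the timer
    return False
```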
Roles
| Role | Responsibility | Default Owner |
|------|----------------|---------------|
| Incident Commander | Coordinates response, makes severity calls, owns communication | On-call Engineering Manager |
| Technical Lead | Diagnoses root cause, executes remediation steps | On-call Backend Engineer |
| Communications Lead | Updates status page, notifies stakeholders | On-call Engineering Manager (doubles) |
Detection & Triage (0–5 min)
- [IC] Acknowledge the PagerDuty alert within 2 minutes. Open #incidents Slack channel.
- [Tech Lead] Verify the alert is real by checking the Grafana dashboard "DB Pool Health":
https://grafana.internal/d/db-pool
- [IC] Classify severity using the tier definitions above. Declare the incident in #incidents.
Response Steps
- [Tech Lead] Check active connections:
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
If the count is >90% of max_connections, confirm pool exhaustion.
- [Tech Lead] Identify long-running queries:
SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;
- [Tech Lead] Kill queries running >5 minutes that are not critical batch jobs, substituting each pid returned by the previous query:
SELECT pg_terminate_backend(<pid>);
- [IC] If killing queries restores the pool within 5 minutes, monitor for 15 minutes and proceed to the resolution checklist.
- [Tech Lead] If pool remains exhausted, restart the application service:
kubectl rollout restart deployment/api-server -n production
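The "kill queries running >5 minutes that are not critical batch jobs" step involves a judgment call that is easy to get wrong under pressure. The sketch below shows one way to pre-filter the rows returned by the long-running-query step before anyone runs `pg_terminate_backend`. It assumes, purely for illustration, that critical batch jobs tag their SQL with a `/* critical-batch */` comment; that marker, the function name, and the row shape are hypothetical and not part of the playbook.

```python
from datetime import timedelta
from typing import Iterable, List, Tuple

# Hypothetical convention: critical batch jobs embed this marker in their SQL.
CRITICAL_MARKERS = ("/* critical-batch */",)


def pids_to_terminate(
    rows: Iterable[Tuple[int, timedelta, str]],  # (pid, duration, query) from pg_stat_activity
    max_duration: timedelta = timedelta(minutes=5),
    critical_markers: Tuple[str, ...] = CRITICAL_MARKERS,
) -> List[int]:
    """Return pids of queries over the duration limit that are not marked critical."""
    victims = []
    for pid, duration, query in rows:
        if duration <= max_duration:
            continue  # still within the allowed runtime
        if any(marker in query for marker in critical_markers):
            continue  # leave critical batch jobs alone
        victims.append(pid)
    return victims
```

The Tech Lead would still review the resulting list before terminating anything; the filter only narrows the candidates.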
Resolution Checklist
- [ ] Connection pool utilization below 60%
- [ ] Error rate returned to baseline (<0.1%)
- [ ] Status page updated: Resolved
- [ ] PagerDuty incident resolved
- [ ] Post-mortem scheduled within 48 hours
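The checklist above mixes two measurable thresholds with three manual confirmations. A minimal sketch of how the Incident Commander could verify all five before closing the incident follows; the function and parameter names are assumptions, and the manual items are simply passed in as booleans.

```python
from typing import List


def unmet_resolution_criteria(
    pool_utilization: float,      # fraction of pool in use (0.0 to 1.0)
    error_rate: float,            # fraction of requests failing (0.0 to 1.0)
    status_page_resolved: bool,
    pagerduty_resolved: bool,
    postmortem_scheduled: bool,   # scheduled within 48 hours
) -> List[str]:
    """Return the checklist items that are not yet satisfied (empty list means done)."""
    checks = [
        ("Connection pool utilization below 60%", pool_utilization < 0.60),
        ("Error rate returned to baseline (<0.1%)", error_rate < 0.001),
        ("Status page updated: Resolved", status_page_resolved),
        ("PagerDuty incident resolved", pagerduty_resolved),
        ("Post-mortem scheduled within 48 hours", postmortem_scheduled),
    ]
    return [label for label, ok in checks if not ok]
```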
All sales final. No refunds on digital products.
Includes support for Claude Code, Codex, and OpenClaw in the same license.
What You Get With This Skill
Generates structured, role-clear incident response playbooks for specific failure scenarios. Covers detection through resolution and post-mortem — ready to use when an incident actually happens.
All ClearPoint Nexus Skills Include
- Production-ready workflow packaging for three supported platforms.
- Reusable structure designed for repeatable operator tasks.
- Clear deliverable format, not just raw prompt output.