On-Call Schedule Best Practices
Building effective on-call schedules requires thoughtful planning around shift length, rotation frequency, coverage patterns, and escalation strategies. This guide covers best practices to help your organization design schedules that balance coverage with responder wellbeing.
Shift Length and Rotation Frequency
Recommended Shift Lengths
- 24-hour shifts - Suitable for small teams or rotations with multiple layers; minimizes context switching
- 12-hour shifts - Good balance between continuous coverage and responder fatigue, enables two-person rotation
- 8-hour shifts - Better for high-incident environments or when multiple shifts per day are staffed, can be paired with follow-the-sun models
- On-demand/Ad-hoc - Use manual schedules for rarely-needed coverage, responders claim shifts as needed
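To make the trade-offs concrete, here is a minimal sketch (in Python, with hypothetical names and helpers) that assigns shifts round-robin for a chosen shift length; real scheduling tools layer time zones, overrides, and holidays on top of this:

```python
from datetime import datetime, timedelta

def build_rotation(responders, start, shift_hours, num_shifts):
    """Assign shifts round-robin to the roster (hypothetical helper)."""
    shifts = []
    for i in range(num_shifts):
        shift_start = start + timedelta(hours=i * shift_hours)
        shifts.append({
            "responder": responders[i % len(responders)],
            "start": shift_start,
            "end": shift_start + timedelta(hours=shift_hours),
        })
    return shifts

# Example: a four-person team on 12-hour shifts, handing off at 9am/9pm.
for s in build_rotation(["alice", "bob", "carol", "dan"],
                        datetime(2024, 1, 1, 9), shift_hours=12, num_shifts=6):
    print(f"{s['start']:%a %d %H:%M} -> {s['end']:%a %d %H:%M}  {s['responder']}")
```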
Handoff Timing
- Consistent handoff times (e.g., 9am, 3pm, 9pm) make it easier for distributed teams to coordinate transitions
- Avoid handoffs during known busy periods when possible (e.g., during peak traffic hours, before major releases)
- Document handoff procedures so new on-call responders know what to expect and what context to gather
Schedule Structures
Single Rotation (Primary)
The simplest model; works well for:
- Small teams (3-5 people)
- Lower-incident services
- Early-stage companies
Pros: Simple to understand and manage
Cons: Single point of failure if the primary responder is unavailable
Dual Rotation (Primary + Secondary)
Recommended for most production services:
- Primary handles first response and decision-making
- Secondary is backup/escalation target
- Both receive alerts simultaneously during overlapping shifts, or secondary activates if primary doesn't respond
Pros: Provides depth, enables faster escalation, good for mentoring
Cons: Requires more scheduling coordination
Follow-the-Sun (Geographic Distribution)
Split coverage across time zones:
- Americas: Primary during local business hours (e.g., 9am-6pm), secondary overnight (6pm-9am)
- EMEA: Primary during its local business hours, overlapping with the Americas morning and the APAC afternoon for handoffs
- APAC: Primary during its local business hours, overlapping with the EMEA evening for handoffs
Pros: Faster response times, no graveyard shifts
Cons: Complex to set up, requires trained responders in multiple time zones
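As a rough illustration of follow-the-sun handoffs, the sketch below maps the current UTC hour to a covering region. The window boundaries are assumptions; a real schedule would follow each region's local business hours and daylight saving rules:

```python
from datetime import datetime, timezone

# Hypothetical handoff boundaries expressed in UTC; a real schedule would
# follow each region's local business hours and observe daylight saving.
REGION_WINDOWS = [
    ("APAC", 0, 8),        # 00:00-08:00 UTC
    ("EMEA", 8, 16),       # 08:00-16:00 UTC
    ("Americas", 16, 24),  # 16:00-24:00 UTC
]

def on_call_region(now: datetime) -> str:
    """Return the region holding primary coverage at the given UTC time."""
    for region, start, end in REGION_WINDOWS:
        if start <= now.hour < end:
            return region
    raise ValueError("hour out of range")

print(on_call_region(datetime.now(timezone.utc)))
```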
Specialized Roles
Separate schedules for different responsibilities:
- Infrastructure/Platform on-call
- Application on-call
- Database on-call
- On-call Manager/Incident Commander
Pros: Clear responsibility boundaries, easier to find the on-call person for specific systems
Cons: Can create silos, requires good escalation policies to coordinate
Escalation Strategy
First Escalation Layer
- Target: Primary on-call responder(s)
- Timeout: 5-10 minutes for acknowledgment
- Goal: Fast initial response to triage and determine severity
Second Escalation Layer
- Target: Secondary on-call or team manager
- Timeout: 10-15 minutes
- Goal: Ensure coverage if primary is unavailable or needs support
Additional Escalation
- Target: Entire team, manager, or executive escalation
- Timeout: 15-30 minutes
- Condition: Critical/P1 incidents only, or if no one from lower layers responds
Tip: Escalation timeouts should account for notification delivery time (SMS, push, email) and responder reaction time. Start conservative (5-10 min) and adjust based on your actual response metrics.
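A minimal sketch of how these layers might be walked in code, assuming hypothetical `page` and `acknowledged` functions standing in for your paging tool's API:

```python
import time

# Layers mirror the escalation strategy above; page() and acknowledged()
# are hypothetical stand-ins for your paging tool's API.
POLICY = [
    {"targets": ["primary-on-call"], "timeout_min": 10},
    {"targets": ["secondary-on-call", "team-manager"], "timeout_min": 15},
    {"targets": ["entire-team"], "timeout_min": 30},
]

def page(target, incident_id):
    print(f"paging {target} for incident {incident_id}")

def acknowledged(incident_id):
    return False  # stub: query your incident tool for an ack here

def escalate(incident_id, poll_seconds=30):
    """Page each layer in turn, waiting out its timeout before escalating."""
    for layer in POLICY:
        for target in layer["targets"]:
            page(target, incident_id)
        deadline = time.monotonic() + layer["timeout_min"] * 60
        while time.monotonic() < deadline:
            if acknowledged(incident_id):
                return True  # someone took the incident; stop escalating
            time.sleep(poll_seconds)
    return False  # all layers exhausted without an acknowledgment
```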
Avoiding On-Call Burnout
Frequency and Breaks
- Rotation frequency: At least 2 weeks between shifts for the same person (ideally 1 week on, 3 weeks off)
- Consecutive rotations: Avoid putting the same person on consecutive weeks when possible
- Break scheduling: Plan for breaks during heavy incident periods or before major projects
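One way to enforce the rotation-frequency guideline is to lint the schedule itself. The sketch below flags responders whose consecutive shifts start less than two weeks apart; the data shape and names are hypothetical:

```python
from datetime import date, timedelta

MIN_GAP = timedelta(weeks=2)  # "at least 2 weeks between shifts"

def gap_violations(shift_starts):
    """Flag people whose consecutive shifts begin less than MIN_GAP apart.

    shift_starts maps responder -> sorted shift start dates; the data
    shape is hypothetical, e.g. exported from your scheduling tool.
    """
    violations = []
    for person, starts in shift_starts.items():
        for a, b in zip(starts, starts[1:]):
            if b - a < MIN_GAP:
                violations.append((person, a, b))
    return violations

schedule = {
    "alice": [date(2024, 1, 1), date(2024, 1, 29)],  # four weeks apart: fine
    "bob": [date(2024, 1, 8), date(2024, 1, 15)],    # back-to-back weeks
}
print(gap_violations(schedule))  # -> [("bob", date(...), date(...))]
```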
Support During Shifts
- Staffing: Have a secondary/backup available; don't rely solely on the primary responder
- Workload: Ensure the on-call responder isn't also expected to carry their normal daily duties during shifts
- Communication: Set the expectation with managers that on-call time is "working time," even when the responder isn't actively handling an incident
Post-Incident Review
- Blameless approach: Focus on processes/systems, not individual failures
- Learning: Use incidents as opportunities to improve runbooks and alerting
- Compensation: Consider comp time, monetary compensation, or schedule adjustments after severe incidents
Alerting and Notification
Notification Preferences
- Encourage responders to configure all notification channels (SMS, push, email, Slack) for redundancy
- Enable phone/voice calls for critical alerts on high-severity systems
- Test notification chains periodically to ensure alerts reach responders
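A minimal sketch of channel redundancy, with placeholder senders standing in for your providers' client libraries; running it periodically doubles as a test of the notification chain:

```python
# Placeholder senders standing in for your SMS/push/email providers.
def send_push(responder, msg):
    print(f"push -> {responder}: {msg}")
    return True

def send_sms(responder, msg):
    print(f"sms  -> {responder}: {msg}")
    return True

def send_email(responder, msg):
    print(f"mail -> {responder}: {msg}")
    return True

CHANNELS = [send_push, send_sms, send_email]  # fastest channel first

def notify(responder, msg):
    """Try each channel until one reports delivery."""
    for channel in CHANNELS:
        try:
            if channel(responder, msg):
                return True
        except Exception:
            continue  # provider error: fall through to the next channel
    return False

# Doubles as a periodic end-to-end test of the notification chain.
notify("alice", "TEST: please acknowledge to confirm alerts reach you")
```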
Alert Fatigue Prevention
- De-duplicate alerts using grouping rules so one issue doesn't create dozens of alerts
- Alert routing: Direct specific alerts to the relevant on-call schedule, not the entire team
- Tuning: Regularly review and tune alert thresholds to reduce false positives
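As a sketch of a grouping rule, the function below suppresses repeat pages for the same fingerprint inside a time window. The service-plus-check fingerprint is an assumption; tune it to whatever identifies one issue in your alert stream:

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)
_last_paged = {}  # fingerprint -> time the last page was sent

def should_page(alert):
    """Suppress repeat pages for the same issue within the grouping window."""
    # Service + check is an assumed fingerprint; tune it to whatever
    # identifies "one issue" in your alert stream.
    fingerprint = (alert["service"], alert["check"])
    now = datetime.now(timezone.utc)
    last = _last_paged.get(fingerprint)
    if last is not None and now - last < WINDOW:
        return False  # duplicate inside the window: group, don't re-page
    _last_paged[fingerprint] = now
    return True

print(should_page({"service": "api", "check": "latency"}))  # True
print(should_page({"service": "api", "check": "latency"}))  # False (grouped)
```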
Team Practices
On-Call Handoff Meetings
- Weekly or before shift changes: Short (15 min) sync to discuss:
- Current known issues or degradation
- Recent major incidents
- Runbook updates or known workarounds
- Availability or connectivity constraints for the incoming responder
Documentation
- Runbooks: Keep incident response runbooks up-to-date and linked in alerts
- Architecture: Maintain clear documentation of system dependencies and critical paths
- Escalation contacts: Document who to page for different types of issues
Training and Readiness
- Onboarding: New team members should do a shadowing shift before going on-call alone
- Regular drills: Conduct monthly or quarterly incident simulations to practice response
- Runbook reviews: Periodically validate that runbooks still work
Tools and Automation
Slack Integration
- Link your team channel to the schedule so handoff messages remind the team who's on-call
- Use `/fh on-call` commands during incidents to quickly identify responders
- Incident channels automatically notify on-call responder(s) when an incident is created
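For teams scripting their own reminders, here is a minimal sketch that posts a handoff announcement through a Slack incoming webhook; the webhook URL and responder names are placeholders:

```python
import requests

# Placeholder URL: create an incoming webhook in Slack and paste it here.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def announce_handoff(incoming, outgoing):
    """Post a handoff reminder to the linked team channel."""
    text = (f"On-call handoff: {outgoing} -> {incoming}. "
            f"{incoming} is now primary; check the handoff notes.")
    resp = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

# announce_handoff(incoming="alice", outgoing="bob")  # run after setting the URL
```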
Integration with Incident Management
- Alert routing: Wire up on-call schedules to receive alerts from your monitoring tools
- Escalation policies: Use escalation policies that match your incident severity levels
- Metrics: Review on-call metrics regularly (response time, escalation frequency, incident volume)
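As a sketch of that metrics review, a few summary statistics computed from hypothetical incident records exported from your incident tool:

```python
from statistics import mean, median

# Hypothetical records exported from your incident tool; ack_seconds is
# the time from first page to acknowledgment.
incidents = [
    {"severity": "P1", "ack_seconds": 240, "escalated": False},
    {"severity": "P2", "ack_seconds": 900, "escalated": True},
    {"severity": "P1", "ack_seconds": 180, "escalated": False},
]

acks = [i["ack_seconds"] for i in incidents]
escalation_rate = sum(i["escalated"] for i in incidents) / len(incidents)
print(f"mean time to ack:   {mean(acks) / 60:.1f} min")
print(f"median time to ack: {median(acks) / 60:.1f} min")
print(f"escalation rate:    {escalation_rate:.0%}")
```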
Review and Iteration
On-call schedules should evolve with your organization:
- Quarterly review: Check whether current schedule structure is working, gather feedback from responders
- Incident metrics: Track response times and patterns - if certain shifts have more alerts, investigate why
- Responder feedback: Regularly ask on-call people what's working and what could be improved
- Adjust gradually: Don't make drastic changes, pilot new structures with small groups first
