Being On-Call

A summary of expectations and helpful information for being on-call.

What is On-Call?

Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise for the system you are responsible for. For example, if you are on-call for your service at PagerDuty and an alarm is triggered for that service, you will receive a "page" (an alert delivered by push notification, email, phone call, or SMS) giving you details on what is broken and how to fix it. You will be expected to take whatever actions are necessary to resolve the issue and return your service to a normal state.

On-call responsibilities extend beyond normal office hours: if you are on-call, you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but it is exactly what our customers go through, and it is the problem that the PagerDuty product itself is trying to fix!

Responsibilities

  1. Prepare

    • Have your laptop and Internet access with you (office, home, a MiFi dongle, a phone with a tethering plan, etc.).
      • Have a way to charge your MiFi.
    • Team alert escalation happens within 5 minutes, so set and stagger your notification timeouts (push, SMS, phone call, etc.) accordingly.
    • Be prepared: your environment is set up and tested on your workstation, a current working copy of the necessary repositories is checked out locally and functioning, your credentials for third-party services are current, etc.
    • Read our incident response documentation (that's this!) to understand how we handle serious incidents, what the different roles and methods of communication are, etc.
    • Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.
  2. Triage

    • Acknowledge and act on alerts whenever you can (see the first point under "Not Responsibilities" below).
    • Determine the urgency of the problem:
      • Is it something that should be worked on right now or escalated into a major incident (e.g. "production server on fire" situations, security alerts)? If so, do so.
      • Is it tactical work that doesn't have to happen during the night? For example, a disk utilization high-watermark alert where there is still plenty of space left and the trend does not indicate impending doom. In that case, snooze the alert until a more suitable time (working hours or the next morning) and get back to fixing it then (see the snooze sketch after this list).
    • Check Slack for current activity. Often, but not always, actions that could potentially cause alerts will be announced there.
    • Do the alert and your initial investigation indicate a general problem, or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, escalate to another team.
  3. Fix

    • You are empowered to dive into any problem and act to fix it.
    • Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe, or if the service or alert is something you have not tackled before.
    • If the issue is not time-sensitive and you have other priority work, create a JIRA ticket (with an appropriate severity) to keep track of it.
  4. Improve

    • If a particular issue keeps happening, or if an issue alerts often but turns out to be a preventable non-issue, perhaps improving this should become a longer-term task.
      • Disks that fill up, logs that should be rotated, noisy alerts...
    • If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.
  5. Support

    • When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note.
    • If you are making a change that impacts the schedule (e.g. adding / removing yourself), let others know since many of us make arrangements around the on-call schedule well in advance.
    • Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration (a sketch of scheduling an override via the API follows this list).
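
If a low-urgency alert can safely wait until working hours (as in the triage point above), snoozing it from the mobile app is usually enough, but it can also be done from a script. The following is a minimal sketch using the PagerDuty REST API; the token, incident ID, and email below are placeholders, and the exact payload should be verified against the current API reference.

```python
# Minimal sketch: snooze a low-urgency incident until working hours via the
# PagerDuty REST API. All credentials and IDs are placeholders; verify the
# endpoint and payload against the current API reference.
import requests

API_TOKEN = "YOUR_API_TOKEN"        # placeholder API token
INCIDENT_ID = "PABC123"             # placeholder incident ID
FROM_EMAIL = "oncall@example.com"   # email of the acting user (required "From" header)

SNOOZE_SECONDS = 8 * 60 * 60        # e.g. sleep on it for 8 hours

response = requests.post(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}/snooze",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": FROM_EMAIL,
    },
    json={"duration": SNOOZE_SECONDS},  # snooze duration in seconds
    timeout=10,
)
response.raise_for_status()
# The incident will re-alert once the snooze duration elapses.
print(response.json()["incident"]["status"])
```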

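Similarly, when you take the pages away from the on-call for a noisy maintenance window, the override can be scheduled through the API instead of the web UI. Again, this is only a sketch with made-up IDs; the schedule and user IDs come from your own account, and the request body should be checked against the current API reference.

```python
# Minimal sketch: schedule an on-call override for a maintenance window via
# the PagerDuty REST API. IDs and token are placeholders; the exact payload
# shape should be verified against the current API reference.
import requests

API_TOKEN = "YOUR_API_TOKEN"   # placeholder API token
SCHEDULE_ID = "PSCHED1"        # placeholder schedule ID
USER_ID = "PUSER42"            # placeholder ID of the person taking the pages

override = {
    "start": "2024-06-01T22:00:00-07:00",  # maintenance window start
    "end": "2024-06-02T02:00:00-07:00",    # maintenance window end
    "user": {"id": USER_ID, "type": "user_reference"},
}

response = requests.post(
    f"https://api.pagerduty.com/schedules/{SCHEDULE_ID}/overrides",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
    },
    json={"override": override},
    timeout=10,
)
response.raise_for_status()
print("Override created:", response.json())
```
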
Not Responsibilities

  1. You are not expected to be the first to acknowledge every alert during the on-call period.

    • Commutes (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's what the backup on-call and the escalation schedule are for.
  2. There's no expectation to fix all issues by yourself.

    • No one knows everything. Your whole team is here to help. There is no shame in, and much to be learned from, escalating issues you are not certain about. Our motto is "Never hesitate to escalate."
    • Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team can avoid mistakes. Measure twice, cut once: it's often best to let the subject matter expert (SME) do the cutting.

Recommendations

If your team is starting its own on-call rotation, here are some scheduling recommendations from the operations team.

[Image: Escalation]

Notification Method Recommendations

You are free to set up your notification rules as you see fit to match how you would like to best respond to incidents. If you're not sure how to configure them, the operations team has some recommendations.

[Image: Mobile Alerts]
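
As one way of staggering notification timeouts (see the "Prepare" list above), the sketch below adds a delayed phone-call rule for a user via the PagerDuty REST API. The token, IDs, and the two-minute delay are illustrative placeholders rather than official recommendations; check the current API reference for the exact payload.

```python
# Minimal sketch: add a staggered high-urgency notification rule (phone call
# after 2 minutes) for a user via the PagerDuty REST API. All IDs are
# placeholders; verify the payload shape against the current API reference.
import requests

API_TOKEN = "YOUR_API_TOKEN"     # placeholder API token
USER_ID = "PUSER42"              # placeholder user ID (your own)
CONTACT_METHOD_ID = "PPHONE1"    # placeholder phone contact method ID

rule = {
    "type": "assignment_notification_rule",
    "start_delay_in_minutes": 2,  # leave time to notice the push/SMS first
    "urgency": "high",
    "contact_method": {"id": CONTACT_METHOD_ID, "type": "phone_contact_method"},
}

response = requests.post(
    f"https://api.pagerduty.com/users/{USER_ID}/notification_rules",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
    },
    json={"notification_rule": rule},
    timeout=10,
)
response.raise_for_status()
print("Created rule:", response.json()["notification_rule"]["id"])
```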

Etiquette

Acknowledging