Severity Levels

The first step in any incident response process is to determine what actually constitutes an incident. Incidents can then be classified by severity, usually done by using "SEV" definitions, with lower numbered severities being more urgent. Operational issues can be classified at one of these severity levels, and in general you are able to take more risky moves to resolve a higher severity issue. Anything above a SEV-3 is automatically considered a "major incident" and gets a more intensive response than a normal incident.

Always Assume The Worst

If you are unsure which level an incident is (e.g. not sure if SEV-2 or SEV-1), treat it as the higher one. During an incident is not the time to discuss or litigate severities, just assume the highest and review during a post-mortem.

Can a SEV-3 be a major incident?

All SEV-2's are major incidents, but not all major incidents need to be SEV-2's. If you require co-ordinated response, even for lower severity issues, then trigger our incident response process. The IC can make a determination on whether full incident response is necessary.

Severity Description Typical Response

Critical issue that warrants public notification and liaison with executive teams.

  • The system is in a critical state and is actively impacting a large number of customers.
  • Functionality has been severely impaired for a long time, breaking SLA.
  • Customer-data-exposing security vulnerability has come to our attention.

Major incident response.

  • Page an IC in Slack !ic page.
  • See During an Incident.
  • Notify internal stakeholders.
  • Public notification.

Critical system issue actively impacting many customers' ability to use the product.

  • Notification pipeline is severely impaired.
  • Incident response functionality (ack, resolve, etc) is severely impaired.
  • Web app is unavailable or experiencing severe performance degradation for most/all users.
  • Monitoring of PagerDuty systems for major incident conditions is impaired.
  • Any other event to which a PagerDuty employee deems necessary of incident response.

Major incident response.

Anything above this line is considered a "Major Incident". Our incident response process should be triggered for any major incidents.

Stability or minor customer-impacting issues that require immediate attention from service owners.

  • Partial loss of functionality, not affecting majority of customers.
  • Something that has the likelihood of becoming a SEV-2 if nothing is done.
  • No redundancy in a service (failure of 1 more node will cause outage).

High-Urgency page to service team.

  • Work on issue as your top priority.
  • Liaise with engineers of affected systems to identify cause.
  • If related to recent deployment, rollback.
  • Monitor status and notice if/when it escalates.
  • Mention on Slack if you think it has the potential to escalate.
  • Trigger incident response if necessary (!ic page).

Minor issues requiring action, but not affecting customer ability to use the product.

  • Performance issues (delays, etc).
  • Individual host failure (i.e. one node out of a cluster).
  • Delayed job failure (not impacting event & notification pipeline).
  • Cron failure (not impacting event & notification pipeline).

Low-Urgency page to service team.

  • Work on the issue as your first priority (above "normal" tasks).
  • Monitor status and notice if/when it escalates.

Cosmetic issues or bugs, not affecting customer ability to use the product.

  • Bugs not impacting the immediate ability to use the system.

JIRA ticket.

  • Create a JIRA ticket and assign to owner of affected system.

Be Specific

These severity descriptions have been changed from the PagerDuty internal definitions to be more generic. For your own documentation, you are encouraged to make your definitions very specific, usually referring to a % of users/accounts affected. You will usually want your severity definitions to be metric driven.