Acronyms and Glossary
This page covers acronyms and terms not covered on the main website's glossary. This page goes into more technical details and will link to additional resources and reading when necessary.
Definitions
Contributing Factors
One of many potential issues that contributed to causing an incident. This is FireHydrant's recommended terminology and method for tracking what others may call "root causes."
Generally, systems are so complex that rarely, if ever, can a significant issue be boiled down to a single item gone wrong (ergo, "root cause"). For a more detailed essay, read John Allspaw's blog post (#2 below) in Additional Reading.
DORA Metrics
The DORA (DevOps Research Assessments) team was a Google research group that did a 7-year program analyzing tech companies and what separated high from low performers. Their research resulted in the 4 "DORA metrics" that the industry has adopted to gauge a company's performance/DevOps maturity:
- Deployment Frequency (DF): How often an organization pushes new features/changes. More is better.
- Lead Time for Changes (LT): How long it takes to make changes and deploy them. Shorter is better.
- Mean Time To Recovery (MTTR): How long it takes to recover from an outage. Shorter is better.
- Change Failure Rate (CFR): How often new changes & deployments break things. Less is better.
You may recognize MTTR, a common metric SRE teams use to gauge reliability and a metric we also offer in our Analytics.
Service Ownership
The principle or idea that "if you build it, you own it." Service Ownership refers to development or engineering teams owning their parts of the system throughout the entire lifecycle, including responding to incidents involving said services.
The idea runs somewhat contrary to a NOC, where a single team steps in for all issues and incidents regardless of whether they were involved in developing and deploying impacted services.
Acronyms
These are acronyms either commonly used in the industry or encountered by individual FireHydrant employees during calls or conferences. We've added these here to share the knowledge.
Acronym | Definition |
---|---|
CMDB | Configuration Management Database. A central repository or database storing information about your IT environment and infrastructure. Often used by larger enterprises that need deep auditing and logging of all settings and changes in their systems. |
DRI | Directly Responsible Individual. Term coined by Apple to refer to someone who ultimately owns a project or outcome. Has been rarely used to refer to Service Owners or Incident Commanders. |
IM | Incident Management. Sometimes used interchangeably with IR. FireHydrant considers Response and Management different things, where Response is reactive and Management is proactive. |
IMOC | Incident Manager On-Call. Individual who owns an incident and often leads and coordinates the response team through handling an incident. Often used interchangeably with Incident Commander. |
IR | Incident Response. Sometimes used interchangeably with IM. FireHydrant considers Response and Management different things, where Response is reactive and Management is proactive. |
KEDB | Known Error Database. A place to track problems found, including root causes of incidents. It could be sophisticated, like an actual issue tracker, or rudimentary, like a spreadsheet or text document. Often used by IT teams. |
NOC | Network Operations Center. Usually a specialized team or location for monitoring network and infrastructure performance. NOC teams function as an umbrella layer of defense, responding to all incidents when detected. This is a more traditional form of incident management and runs contrary to Service Ownership, where engineering teams will react independently to incidents involving their services. |
PIR | Post-Incident Review. Used interchangeably with Postmortems, Retrospectives, and sometimes RCAs. See the Retrospectives page. |
RCA | Root Cause Analysis. The process of identifying and analyzing what caused the incident. Sometimes used to refer to the incident review process overall (e.g., like a Retrospective). |
RFO | Reason For Outage. Another acronym equivalent to root cause. |
SEV | Severity. A shortening of the word "Severity", which is how many organizations measure "how bad" an incident is. Typically measured on a single-digit scale (e.g., SEV1, SEV2, SEV3, etc.) but can vary by organization. Sometimes, people colloquially refer to incidents as "SEVs". |
SLA | Service Level Agreement. Contract/agreement between a company and its clients about service delivery expectations. Typically, it includes uptime, availability, performance, and more. Suppose the company fails to deliver according to the contract (e.g., uptime falls below 99.9%). In that case, the company breaches the SLA, and there may be consequences such as refunds of credits to clients or even legal ramifications. See #3 below. |
SLI | Service Level Indicator. SLIs are the actual, measured values of some metric in your infrastructure. You can think of SLOs as the "ideal" level and SLIs as the "actual" level. If an SLI does not meet an SLO, there may be a breach of SLA. See #3 below. |
SLO | Service Level Objective. The specified goal or "ideal" level of some metric in your infrastructure. For example, you may define a 500ms response time on a particular query. If the SLI, or the measured response time at a particular moment, exceeds 500ms, then the SLO is not met. See #3 below. |
SME | Subject Matter Expert. Sometimes used to refer to key individuals who know a lot about, or own, specific services and functions in a system. Used more often than DRI. |
SOP | Standard Operating Procedure. Refers to any codified process. Sometimes used to refer to incident management/response processes. |
SRE | Site Reliability Engineering. A discipline and term coined by Google, who found that they needed an entirely new team/practice for managing their huge infrastructure at scale. See #1 below. |
Additional Reading
Updated 11 months ago