SRE Metrics and Security Measurement

Image: Google Pipes Datacenter by Jorge Jorquera, via Flickr (CC BY-NC-ND 2.0)

Why can’t IT and security get along better? Disciplined technology teams use data and metrics strategically. But security and risk teams think about metrics differently than the rest of IT. In this post, I’ll describe how the discipline of Site Reliability Engineering (SRE) stresses the importance of service levels, discuss the slightly different terminology used by security and risk teams, and then suggest how using the language of SRE can help security and risk teams build bridges to technologists and across the business.

Some background. For various reasons, I’ve lately been “sharpening the saw,” one of the seven habits Stephen Covey described in his famous book. The saw that I am sharpening is The Cloud; more specifically, I am deepening my understanding of some of the core architectural patterns that underpin modern cloud stacks such as Google’s and Amazon’s. One of Google’s well-known patterns is something called Site Reliability Engineering, or SRE for short. I had a pretty good idea of what SRE was about, but wanted a more formal indoctrination.

Google folk have written a book on SRE. The chapter on Service Level Objectives is invigorating because it delivers an opinionated jolt about metrics, and their importance to reliability:

It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.

Preaching to the converted! With a simple find-and-replace operation one could easily substitute security for service, and the result would closely capture the spirit of what I have been doing for the last five years inside several large financial institutions: namely, building measurement systems for security.

To explain the parallels between security measurement and SRE metrics, I’ll first explain what I mean by “security measurement.” Security measurement tends to measure two things: risks, using Key Risk Indicators (KRIs), and controls, using Key Performance Indicators (KPIs).

Key Risk Indicators attempt to quantify risk. They are often produced by estimation, using a defined methodology such as FAIR, which expresses risk in dollars and cents. They can also be based on objective measurements that we know are good proxies for risk. For example, many compromises are the result of attackers using software exploits to gain access to unpatched systems. So, the “percentage of Internet-facing systems that are missing high-severity security patches” is a pretty good proxy for risk. “High severity,” in this case, means that the patch fixes a vulnerability that is remotely exploitable without needing the target to take any active steps, and is being actively exploited by bad guys right now. There is healthy debate in the security community about whether economic estimates or “proxy” performance indicators are the better KRI strategy. I have always preferred the latter approach (using selected KPIs as proxies) because it is simpler and scales well.
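
To make the proxy KRI above concrete, here is a minimal sketch of how it might be computed from an asset inventory. The `System` record, its field names, and the sample inventory are all hypothetical; a real implementation would pull these facts from a vulnerability scanner or CMDB, and would need its own definition of “high severity.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    internet_facing: bool
    missing_high_severity_patches: int


def unpatched_exposure_kri(systems):
    """Percentage of Internet-facing systems missing at least one
    high-severity patch. Returns 0.0 when nothing is Internet-facing."""
    exposed = [s for s in systems if s.internet_facing]
    if not exposed:
        return 0.0
    unpatched = sum(1 for s in exposed if s.missing_high_severity_patches > 0)
    return 100.0 * unpatched / len(exposed)


# Made-up inventory: four Internet-facing systems, two of them unpatched.
inventory = [
    System("web-01", True, 2),
    System("web-02", True, 0),
    System("vpn-01", True, 0),
    System("mail-01", True, 1),
    System("hr-db", False, 5),  # internal system, excluded from the KRI
]

kri = unpatched_exposure_kri(inventory)
print(f"{kri:.1f}% of Internet-facing systems are missing high-severity patches")
```

Note that the internal database does not move the number at all, however badly patched it is. That is the point of a proxy: it measures the narrow condition we believe drives risk, not everything we could count.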

Key Performance Indicators, by contrast, measure the effectiveness of activities. There is a rich and varied set of definitions of what KPIs are, but in security, the activities they measure are generally “control processes” that discover, constrain, or reduce risk. The controls are generally drawn from frameworks such as the NIST Cybersecurity Framework, COSO, ISO 27001, PCI DSS or other régimes. In the vulnerability example described above, the “control” activity involves patching vulnerable systems. Many companies that have developed patch KPIs express them as the “percentage of systems patched on time,” with different definitions of “on time” (the service level agreement) depending on the importance of each system, and the importance of the patch. Personally, I like three buckets: patch it now (72 hours or less), patch it soon (within 30 days), and patch it later (everything else), along with a glossary that describes in plain English what goes into each bucket.
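
The three-bucket patch KPI can be sketched as follows. The 72-hour and 30-day windows come from the text; everything else is an assumption made for illustration: the `PatchRecord` shape, measuring the window from the patch release time, leaving the “later” bucket out of the KPI because it has no deadline, and ignoring patches that are still pending inside their window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Windows from the text; "later" has no deadline, so it maps to None.
PATCH_WINDOWS = {
    "now": timedelta(hours=72),
    "soon": timedelta(days=30),
    "later": None,
}


@dataclass(frozen=True)
class PatchRecord:
    """Hypothetical record of one patch on one system."""
    bucket: str                   # "now", "soon", or "later"
    released: datetime            # assumed start of the clock
    applied: Optional[datetime]   # None if not yet patched


def patch_status(record, as_of):
    """Classify a record as on_time, late, pending, or no_deadline."""
    window = PATCH_WINDOWS[record.bucket]  # KeyError on an unknown bucket
    if window is None:
        return "no_deadline"
    deadline = record.released + window
    if record.applied is not None:
        return "on_time" if record.applied <= deadline else "late"
    return "pending" if as_of <= deadline else "late"


def patched_on_time_kpi(records, as_of):
    """Percentage patched on time among records that are already decided
    (on time or late). Returns None if nothing is decided yet."""
    statuses = [patch_status(r, as_of) for r in records]
    on_time = statuses.count("on_time")
    late = statuses.count("late")
    if on_time + late == 0:
        return None
    return 100.0 * on_time / (on_time + late)


# Made-up data: every patch was released on the same day.
released = datetime(2024, 1, 1)
as_of = datetime(2024, 3, 1)
records = [
    PatchRecord("now", released, datetime(2024, 1, 2)),    # within 72 hours
    PatchRecord("now", released, datetime(2024, 1, 10)),   # missed 72 hours
    PatchRecord("soon", released, datetime(2024, 1, 20)),  # within 30 days
    PatchRecord("soon", released, None),                   # overdue, unpatched
    PatchRecord("later", released, None),                  # no deadline
]

kpi = patched_on_time_kpi(records, as_of)
print(f"{kpi:.1f}% of systems patched on time")
```

Whether pending patches belong in the denominator is a definitional choice, and exactly the kind of thing the plain-English glossary should settle.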

KPIs and KRIs can have thresholds (upper and lower bounds) that separate goodness from not-so-goodness, and not-so-goodness from badness. They are essentially the numeric boundaries between green, yellow, and red. The collective set of KRIs, considered together with their thresholds, can form the basis of an organization’s risk appetite. One institution I am familiar with expresses its cyber-security risk appetite with a small set of well-defined, compact “proxy” indicators as KRIs. The institution has two thresholds for each KRI: an “internal limit” (yellow!) that triggers immediate remedial action, and a “board limit” (red!) that absolutely nobody wants to get outside of, because it requires a command performance in front of the board.
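
The two-threshold scheme reduces to a small rating function. This is a sketch under stated assumptions: the indicator is one where higher is worse, a value must strictly exceed a limit to breach it, and the 5% and 15% limits are invented numbers, not the institution’s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KriThresholds:
    """Two limits for a higher-is-worse KRI (illustrative names)."""
    internal_limit: float  # breaching this is "yellow"
    board_limit: float     # breaching this is "red"

    def __post_init__(self):
        if self.internal_limit > self.board_limit:
            raise ValueError("internal limit must not exceed board limit")


def rate(value, thresholds):
    """Return 'green', 'yellow', or 'red' for a measured KRI value.
    Assumes a value must strictly exceed a limit to breach it."""
    if value > thresholds.board_limit:
        return "red"
    if value > thresholds.internal_limit:
        return "yellow"
    return "green"


# Hypothetical limits for the unpatched-systems KRI, in percent.
limits = KriThresholds(internal_limit=5.0, board_limit=15.0)

for measured in (3.0, 8.0, 20.0):
    print(f"{measured:4.1f}% -> {rate(measured, limits)}")
```

A lower-is-worse indicator, such as the percentage of systems patched on time, would need the comparisons reversed.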

With that background, let’s compare and contrast these security measurement concepts with Google’s SRE service level objectives. As quoted from Site Reliability Engineering:

Indicators. An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided. Most services consider request latency — how long it takes to return a response to a request — as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile… Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
Objectives. An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds… Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service.
Agreements. Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.

As you can see, one can easily draw an analogy from security measurement to SRE service level metrics. Security KRIs and KPIs neatly map to Service Level Indicators. Security measurement thresholds (and what security teams often call “SLAs”) are very similar to Service Level Objectives. And while Service Level Agreements (as defined in SRE) don’t have a perfect analog, the arrangements many GRC teams make about what to do when they hit their thresholds (for example, report to the board) are essentially agreements.

SRE concept	Security measurement concept
Service level indicators (SLIs)	Key Performance Indicators, Key Risk Indicators
Service level objectives (SLOs)	Indicator thresholds, SLAs
Service level agreements (SLAs)	Escalation thresholds, board limits
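
One way to see the mapping is to apply the SRE book’s own test (“what happens if the SLOs aren’t met?”) to security targets. The sketch below is illustrative only: the `SecurityTarget` type, its field names, and the numeric bounds are hypothetical, and it uses the simple `SLI ≤ target` shape from the quoted passage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SecurityTarget:
    """A security threshold restated in SRE vocabulary (hypothetical type)."""
    indicator: str                     # the SLI: a KPI or KRI
    upper_bound: float                 # the objective, shaped as SLI <= target
    consequence: Optional[str] = None  # explicit consequence, if any

    def sre_term(self):
        """Apply the book's heuristic: no explicit consequence means SLO."""
        return "SLA" if self.consequence else "SLO"

    def is_met(self, measured):
        return measured <= self.upper_bound


sli = "percentage of Internet-facing systems missing high-severity patches"

# An ordinary indicator threshold, with no stated consequence.
threshold = SecurityTarget(sli, upper_bound=5.0)

# A board limit, which carries an explicit consequence.
board_limit = SecurityTarget(sli, upper_bound=15.0,
                             consequence="report to the board")

for target in (threshold, board_limit):
    print(target.sre_term(), "met" if target.is_met(8.0) else "missed")
```

With a measured value of 8%, the ordinary threshold (an SLO) is missed while the board limit (an SLA) is still met, which matches the yellow-but-not-red situation a two-limit scheme is designed to surface.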

The concepts are close enough that many organizations may wish to adopt SRE terminology across the board, instead of clinging to special jargon that only the security and risk teams, and (regrettably) the regulators, speak. Why? I know this verges on heresy, but the advantages are plain enough.

Although “SRE” originated at Google, having explicit service level indicators, objectives and agreements is a modern twist on a classic IT operations discipline. It is well understood by the business.
Because most technology organizations are several orders of magnitude larger than their corresponding security and risk teams, having a common language increases the understanding each team has of the other’s mission.
Last: embracing SRE-aligned definitions of measurement helps clarify what security teams mean when they say “measurement.” If you prefer that it refer explicitly to empirical observations of “ground truth,” and not to other kinds of numbers (such as ordinal 1–5 scales or risk indices made using mystery math), the SRE philosophy should suit you just fine.

Bluntly: software is eating the world. Security and risk folks, you are surrounded by technologists. Why not use their language?
