Image credit: Amazon Web Services Office in Houston, Texas by Tony Webster, via Flickr [CC BY 2.0]

Author’s note: this is the first in a series of posts on cloud security.

A funny thing happened in many of the world’s most rarefied boardrooms. Over the last five years, CEOs realized they aren’t in the datacenter business any more. CIOs and Chief Digital Officers now have a new mandate: move their companies’ applications and infrastructure into public cloud services as fast as possible. The prospect exhilarates technologists, but scares security teams. Nobody wants to be the next Capital One, even though most companies would kill for the nine years of digital transformation experience that Capital One had under its belt at the time of its breach.

The cloud security problem cannot be tackled in a single post, other than in a very generic way, and that would be rather unhelpful. For this reason, instead of zooming out, I’d like to zoom in and talk about a critical point of leverage that every Chief Information Security Officer (CISO) should focus on: managing the drift of their cloud-based applications and infrastructure. As I will explain, the wholesale movement of large workloads to cloud service vendors such as Amazon Web Services (AWS) brings with it new tools and processes. These tools and processes can be profitably exploited by savvy CISOs. By keeping a careful eye on the rate of drift of their technology assets, CISOs can provide evidence to their boards and auditors that their cloud deployments are well-managed. Well-managed cloud deployments are, by definition, less risky, because they are easier to change in a controlled way and more stable as well.

What is Drift?

What do we mean by “drift”? The Oxford English Dictionary defines drift as “a steady movement or development from one thing towards another that is perceived as unwelcome.” In finance, stochastic drift refers to the change in the average value of a random process. That works well for stock prices and interest rates, which seem random to the average investor.

In technology, the processes that build cloud-based applications and infrastructure — including those associated with agile development, continuous integration and continuous delivery — are not random. They are deterministic and should (fingers crossed!) work the same way every time. So in cloud deployments, drift is more straightforward to understand: it is simply the divergence from an expected state. By “expected state”, we mean the intended configuration of an application and its supporting infrastructure, as encoded in configuration management scripts and tools.

Before the widespread adoption of cloud computing, the primary methods for documenting and deploying applications and infrastructure included an odd assortment of Visio diagrams, wiki pages, SharePoint sites, Word documents, READMEs and runbooks. In the cloud era, datacenters and all of their workloads are entirely software-defined. You can now declare your intended state in code. A variety of tools then provision, configure and deploy all of the resources needed to make that state a reality. The tools are absolutely integral to the process because they enable technology teams to scale, ensure consistent processes and eliminate manual errors. You can stand up an entire cloud-based compute environment from a single command in less than five minutes.
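As a sketch of what that single command looks like in practice (assuming the environment’s Terraform configuration lives in the current directory), the whole stand-up reduces to:

```bash
# Download providers and initialize state, then create every
# resource the configuration declares. -auto-approve skips the
# interactive confirmation prompt.
terraform init && terraform apply -auto-approve
```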

Here’s the important point. Because configuration tools are at the heart of all modern cloud-based delivery processes, they can be profitably exploited as a data source to understand drift. Here’s how to do it.

Drift as a Metric

Suppose you have an application that is provisioned by a cloud configuration management tool, such as Terraform, AWS CloudFormation, Ansible, Chef, Puppet, CFEngine, or SaltStack. These tools provision, configure and maintain the state of applications and their accompanying infrastructure. They are also sources of data you can mine to your advantage.

Here’s a real example. My securitymetrics.org test application runs in AWS and is configured using a Terraform plan and an Ansible playbook. The Terraform plan creates 38 AWS resources, including two servers (“EC2 instances”), a virtual private cloud, network routes, host firewalls (“security groups” in AWS-speak), identity policies, randomized passwords, Docker containers, and other resources. The Ansible playbook configures the two hosts created by Terraform, running 121 tasks to bring them into an expected secure state. In sum, these plans and playbooks make, test and fix (if necessary) 159 “promises” about the state of the application and its accompanying infrastructure.
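As an aside, these counts can be pulled straight from the tools themselves. Here is a minimal sketch, with “site.yml” standing in for the playbook’s real name:

```bash
# Count the resources Terraform manages -- the plan's "promises".
terraform state list | wc -l

# Count the tasks in an Ansible playbook. Each task line in the
# --list-tasks output carries a "TAGS:" suffix; filter out the
# per-play header lines, which carry one too.
ansible-playbook site.yml --list-tasks | grep 'TAGS:' | grep -vc 'play #'
```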

Drift is straightforward to calculate: execute the plan and run the playbook, then count the number of changes. That count is the amount of drift. Divide it by the number of tasks that actually ran, and the result, expressed as a percentage, is the drift rate.

More formally, the drift rate d is equal to the number of changes c divided by the difference between the total number of tasks t and the number of skipped tasks ts:

$$ d = \frac{c}{t - t_s} $$

Note that we remove “skipped” tasks from the denominator so that the results are not polluted by tasks that did not apply and therefore did not need to run.
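In shell terms, the arithmetic is a one-liner. A minimal sketch, using the Ansible figures from the worked example that follows:

```bash
# Drift rate d = c / (t - ts), expressed as a percentage.
# c = changes, t = total tasks, ts = skipped tasks.
c=4; t=121; ts=50   # illustrative values; see the example below
awk -v c="$c" -v t="$t" -v ts="$ts" \
    'BEGIN { printf "drift rate: %.1f%%\n", 100 * c / (t - ts) }'
# prints: drift rate: 5.6%
```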

Here is a worked example:

  • The Terraform plan verifies correct state for 38 AWS resources, and indicates that 1 change is needed. The drift from the expected configuration is 1, and the drift rate is 2.6% (1 out of 38 resources).
  • The Ansible playbook runs 121 tasks, skipping 50 because they do not apply to these particular EC2 instances, and completes with 4 changes. The drift from the expected configuration is 4, and the drift rate is 5.6% (4 out of 71 tasks).

The total drift rate for the full stack is therefore 4.6% (5 out of 109 task and resource promises). Because I am working on the environment, that amount of drift is expected. In a steady-state situation, however, I would expect the drift to be zero.

Drift metrics are easy to create. In the case of Ansible and Terraform, the source data comes straight from tool output. Only a few lines of shell script are needed to dig out the data we need. And it is easy to wire up a batch job (for example, an AWS Lambda task) to re-run these tasks on a regular schedule, say daily, so that the environment measures and verifies its configuration (and self-heals!) once a day.
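Here is a sketch of what that script might look like. It leans on the human-readable summaries the tools print (Terraform’s “Plan: X to add, Y to change, Z to destroy.” line and Ansible’s PLAY RECAP counters), so treat the patterns as starting points; “site.yml” is again a stand-in for the real playbook:

```bash
#!/usr/bin/env bash
# Sketch: compute the full-stack drift rate from tool output.
set -u

# Terraform prints "Plan: X to add, Y to change, Z to destroy."
# when live state has drifted from the declared state.
tf_changes=$(terraform plan -no-color \
  | awk '/^Plan:/ { print $2 + $5 + $8 }')
tf_total=$(terraform state list | wc -l)

# Ansible's PLAY RECAP reports changed= and skipped= counters for
# every host. --check measures without modifying anything; drop
# the flag and the same job self-heals as it measures.
recap=$(ansible-playbook site.yml --check | grep 'ok=')
an_changed=$(echo "$recap" \
  | awk -F'changed=' '{ split($2, a, " "); s += a[1] } END { print s + 0 }')
an_skipped=$(echo "$recap" \
  | awk -F'skipped=' '{ split($2, a, " "); s += a[1] } END { print s + 0 }')
an_total=$(ansible-playbook site.yml --list-tasks \
  | grep 'TAGS:' | grep -vc 'play #')

# Drift rate per the formula: d = c / (t - ts).
awk -v c="$(( ${tf_changes:-0} + an_changed ))" \
    -v t="$(( tf_total + an_total ))" -v ts="$an_skipped" \
    'BEGIN { printf "drift: %d change(s), rate: %.1f%%\n", c, 100 * c / (t - ts) }'
```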

Using Measurement of Drift Strategically

The drift rate gives CIOs and CISOs a view into how consistently their technology assets conform to a desired state; in essence, how well the promises they have made about their configurations are being kept. Applications and infrastructure with higher rates of drift are less well-managed than those with lower rates.

Managers can use drift metrics strategically to identify opportunities to improve performance. Any of the standard analytical strategies are appropriate. For example, you can:

  • Group all assets by their owning business units, and compare business units’ drift rates to identify the best and worst performers (a sketch follows this list)
  • Track drift rates over time to identify which parts of the organization manage their technology assets less consistently than others
  • Keep separate drift metrics for each class of asset, for example, networks/infrastructure versus host/applications
  • Divide assets into “in development” and “steady state” populations, and create a tools adoption target for the former population, and a drift rate target for the latter
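To sketch the first of these strategies, suppose each scheduled run appends a row to a hypothetical per-asset CSV, drift.csv, with business_unit and drift_rate columns. Ranking business units is then a one-liner:

```bash
# Average each business unit's drift rate and rank worst-first.
# drift.csv rows look like: payments,4.6
awk -F, '{ sum[$1] += $2; n[$1]++ }
         END { for (bu in sum) printf "%-20s %.1f\n", bu, sum[bu] / n[bu] }' \
    drift.csv | sort -k2,2 -rn
```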

You need not go overboard. Pick one or two strategies to start with. The old chestnut about “what gets measured, gets managed” applies here.

What Drift Does Not Address

Now that you understand how to calculate the drift in your cloud environments, it is worth mentioning the metric’s limits. Drift is a crude metric, capturing only the aggregate number of changes to a technology stack. In addition, it is silent about whether particular state changes are actually important. For example, if Terraform detects that a firewall rule was somehow changed to “any-any” from its expected single-host/single-port configuration, that might be a big problem, but it registers as just one change in the drift count, no different from a trivial one.

And lastly, drift per se does not help CISOs understand whether the desired state is actually any good, or complete. Yes, the CIO can move the organization to an “infrastructure as code” mindset, and then marshal the tools, people and processes to make it a reality. Yes, this is awesome for agility and necessary for transformation. But these things in and of themselves do not ensure that the CISO’s control objectives are reflected in the configuration code. That takes work. I will discuss this topic in a future post.

Nonetheless, measuring the drift rates for your cloud-based applications and infrastructure is an effective proxy for understanding how well-managed they are. The drift rate is a concise and simple metric that you can cut, mash up and measure many different ways to understand where your trouble spots are.

Thanks for reading!