There is a deep vein of academic scholarship investigating what enables organizations to be 'highly reliable', that is to say, to function optimally even when conditions are severe. Examples include emergency rooms and fire departments.
As security threats to our systems mount, and as these systems are so embedded in how every company delivers value, it is important to understand how large enterprises can learn from and emulate not just the unicorns of the web industry but also these highly reliable organizations.
It turns out, however, that the principles and practices of DevOps synthesize critical attributes of highly reliable organizations and provide a template from which enterprises may learn how to become highly reliable.
As much as we, as security professionals, charge our organizations to listen and act, so must we learn to enable our organizations to become reliable in the face of threat.
The Hedgehog and the Fox
Archilochus was a Greek lyric poet of the 7th century BC. Very little of his work remains, and most of it is in the form of scraps and fragments. One aphorism that has endured is, "The fox knows many things, but the hedgehog knows one big thing." For all we know, this could have been a marginal doodle made while watching an Aegean sunset, but it has had quite an impact on a range of disciplines.
For the first section of this post, I will perform a whistle-stop review of some management thinking shaped by this idea and then explain why the results are pertinent for security professionals.
Isaiah Berlin was an Oxford historian and philosopher in the mid-20th century who wrote about the hedgehog and the fox.
He explains the difference in his book of the same name, thus: "For there exists a great chasm between those, on one side, who relate everything to a single central vision... on the other side, those who pursue many ends, often unrelated and even contradictory, connected, if at all, only in some de facto way, for some psychological or physiological cause, related by no moral or aesthetic principle..."
Jim Collins, in "Good to Great", adopted the metaphor to explain how the most successful and enduring companies operated: "Those who built the good-to-great companies were, to one degree or another, hedgehogs. They used their hedgehog nature to drive toward what we came to call a Hedgehog Concept for their companies. Those who led the comparison companies tended to be foxes, never gaining the clarifying advantage of a Hedgehog Concept, being instead scattered, diffused, and inconsistent."
But how well has this idea stood up to testing since it was first published in 2001? Not well. An article by Steven Levitt, of Freakonomics fame, tracked the performance of companies praised in "Good to Great" and found that many had performed poorly. Examples include Fannie Mae (!), Circuit City and Wells Fargo. So is the idea of "the Big Idea" (or Hedgehog Concept) dead?
Phil Rosenzweig, author of "The Halo Effect", thinks so. He cautions against these pat cause-and-effect explanations of performance. For me, the core idea was that resilience is necessary since success is not absolute. The parameters of success are a function of the market, and the market changes pretty rapidly. Bad things happen, such as a "Black Swan", as Nassim Taleb has it. Taleb's idea of anti-fragility is also very powerful. Consider an organization that is improved by change: like a leather satchel that is broken in and improves with age, rather than a crystal wine glass that ceases to function after a small knock.
What do resilient organizations look like? How do we organize to enable us to improve under threat? Is resilience the new 'hedgehog concept' and where does security fit in?
My thesis is that if there is a 'hedgehog concept' in modern business, it is velocity and not resilience, but that resilience (including with respect to security threats) requires fox-like behavior in order to produce reliable business performance.
"Reliability depends on the lack of unwanted, unanticipated, and unexplainable variance in performance," Erik Hollnagel said.
Highly reliable organizations
Karl Weick is an organizational theorist who has studied how organizations process information and use it to make decisions. Much of this work has been in the area of highly reliable organizations.
A useful definition of reliability comes from another academic, Paul Schulman: "The major determinant of reliability in an organization is not how greatly it values reliability or safety per se over other organizational values, but rather how greatly it disvalues the mis-specification, mis-estimation, and misunderstanding of things."
Here are some examples of the kinds of organizations that promote this kind of behavior:
- Naval aircraft carriers
- Chemical production plants
- Offshore drilling rigs
- Air traffic control systems
- Incident command teams
- Wildland firefighting crews
- Hospital ER/Intensive care units
A famous study of a failure of reliability is the loss of the Space Shuttle Columbia, which broke apart on re-entry into the Earth's atmosphere on Feb. 1, 2003. The loss was caused by damage to the heat shielding on a wing of the shuttle, sustained when a piece of foam insulation broke off and struck the wing during launch. The strike was noted shortly after launch. Some engineers at NASA believed that the damage to the wing could be catastrophic, but their concerns were not addressed in the two weeks that Columbia spent in orbit because management believed that even in the case of major damage there was little that could be done to fix it. So how can an organization fail to respond to this kind of information?
Weick identifies some heuristics against which we can rate our capability to be reliable, in other words to respond effectively to experience and improve by it:
- How preoccupied with failure are you? Do you treat near misses as information for improvement or as evidence of your awesomeness as a security team?
- How much do you attempt to simplify? Do you solicit views from outside your security team?
- How sensitive are you to the whole operation? Do teams interact enough with each other to understand the other jobs being done and to form a whole picture of the operation? How much do you share a picture of the threat landscape with the people you are trying to influence?
- Are you committed to resilience? Do you invest in people's competence, especially in terms of informal contacts and networks that can be used to solve problems effectively?
- Do you respect expertise? Often a security team will feel that its expertise is not respected across the organization and that people do not listen. But even within a security team, does everyone know who has the expertise to respond to an issue rather than merely the hierarchical rank to do so?
In the case of the Columbia disaster, many of these questions yielded answers that pointed to a culture of hierarchy and deferred responsibility. As the CAIB report states, "NASA's culture of bureaucratic accountability emphasized chain of command, procedure, following the rules, and going by the book....Allegiance to hierarchy and procedure had replaced deference to NASA engineers' technical expertise."
The behavioral characteristics of highly reliable organizations have great affinity with concepts actively promoted in DevOps-oriented teams.
How high reliability today requires DevOps
I define technical reliability in a large enterprise as: the active solicitation of information that disproves rather than confirms organizational attitudes to security practice, infrastructure build quality and application performance, and the embedding of these lessons in subsequent iterations of a system's deployment.
In order for an organization to be reliable from a security viewpoint, therefore, we must embed security professionals in the process of designing, building and deploying applications, in a manner that is consistent with the behaviors listed above.
The practices below are pointers to how this might be done. Without them, security professionals will continue to be reactive first responders rather than builders of an inherently safer, more reliable organization.
Shift complexity left: Today, using tools such as Chef, we can write explicit tests for compliance with certain security controls. This means that these tests may be run and passed before production, and also that during the iterative development of a system the impact of new features on the security of the system may be observed and discussed. A good example of this is the application of CIS benchmarks by Joshua Timberman of Chef.
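To make the idea of an explicit compliance test concrete, here is a minimal sketch in plain Python. It is not Chef's API or any real tool's interface; the function names and the two controls are assumptions chosen for illustration, loosely modeled on the kind of SSH hardening rules found in published benchmarks.

```python
# Hypothetical sketch: plain Python, not the API of Chef or any other tool.

def parse_sshd_config(text):
    """Parse sshd_config-style text into a dict of lowercase option -> value."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        # sshd uses the first value it sees for a keyword, so keep the first.
        settings.setdefault(key, value)
    return settings


# Illustrative controls (assumed for this example): option -> required value.
REQUIRED = {
    "permitrootlogin": "no",
    "permitemptypasswords": "no",
}


def check_compliance(text, required=REQUIRED):
    """Return a list of failure messages; an empty list means compliant.

    Design choice for this sketch: an option that is not set explicitly
    counts as a failure, so the control must be stated, not inherited.
    """
    settings = parse_sshd_config(text)
    failures = []
    for option, expected in required.items():
        actual = settings.get(option)
        if actual is None or actual.lower() != expected:
            failures.append(f"{option}: expected {expected!r}, found {actual!r}")
    return failures
```

Run against a candidate configuration in a build pipeline, a non-empty result would fail the build, so the conversation about a control happens during development rather than after deployment.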
Blameless post-mortem: The role of the blameless post-mortem in building resilience into software teams can't be exaggerated. Although identification of the person responsible is important, if we assume that people seek to do their best, we need to find out why they were unable to act safely. Our preoccupation with failure in DevOps is rooted in this idea.
Full stack development: Conway's Law is alive and well in most large enterprises. The creation of process, hierarchy and bureaucracy (see the Columbia example) makes it very difficult to gain a whole picture of a system and makes it all too easy to dismiss an issue as 'somebody else's job.' Within DevOps teams, the boundaries between silos are broken down, creating a more efficient information flow but also a more transparent approach to security and the various trade-offs made during development.
Product orientation: Many organizations espouse agile ideas while still perpetuating a project-centric view of the world. This is not a terminal problem; the agile principles will drive great value within this context. However, when the 'system' of which we require everyone to have a clear understanding is a product (i.e. something consumed by the customer) and our accountability is for that product, we make decisions differently. An analogy would be the difference between being responsible for the heat-resistant tiles on the shuttle and seeing the success of the shuttle voyage as your focus. As an engineer you do the former to enable the latter, rather than doing the former without regard for or understanding of the latter.
Software is eating the world: Marc Andreessen's quote is now a truism. If we can code infrastructure and applications, we can code tests for these too. This means that we can begin to code compliance. In other words, we can write code that defines, sets and tests a system's four S's:
- state (enable SELinux?),
- sequence (authenticate before action),
- supervision (set logs to verbose for third-party logins) and
- scope (no global log access to Swiss log data)
- scope (no global log access to Swiss log data)
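As a rough illustration of what coding these four checks might look like, the sketch below tests each dimension against a plain dictionary describing a system. The dictionary layout and every field name in it are assumptions made for this example, not the schema of any real tool, and the sketch covers only the testing side, not defining or setting the state.

```python
# Hypothetical sketch: the system description and its field names are assumed.

def check_state(system):
    """State: SELinux must be in enforcing mode."""
    return system.get("selinux") == "enforcing"


def check_sequence(system):
    """Sequence: a session must authenticate before any other event."""
    authenticated = set()
    for event in system.get("events", []):
        if event["type"] == "authenticate":
            authenticated.add(event["session"])
        elif event["session"] not in authenticated:
            return False  # an action happened before authentication
    return True


def check_supervision(system):
    """Supervision: third-party logins must be logged verbosely."""
    return system.get("log_levels", {}).get("third_party_login") == "verbose"


def check_scope(system):
    """Scope: the global role must not be able to read Swiss log data."""
    return "global" not in system.get("log_readers", {}).get("ch", [])


CHECKS = {
    "state": check_state,
    "sequence": check_sequence,
    "supervision": check_supervision,
    "scope": check_scope,
}


def evaluate(system):
    """Run every check and return a dict of check name -> pass/fail."""
    return {name: check(system) for name, check in CHECKS.items()}


example_system = {
    "selinux": "enforcing",
    "events": [
        {"session": "s1", "type": "authenticate"},
        {"session": "s1", "type": "read"},
    ],
    "log_levels": {"third_party_login": "verbose"},
    "log_readers": {"ch": ["swiss-ops"]},
}
```

Calling `evaluate(example_system)` returns one pass or fail result per S, which keeps each control visible and discussable on its own rather than buried in a single overall verdict.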
Small batches: Here's the catch. All of these rest on one fundamental idea: you can never get it right the first time. So the development of a platform that enables you to iterate and improve functionality (including security) in small increments over time is crucial. It certainly improves the fit of your product with the customer, but from a security perspective, there is no point being sensitive to change if you can't act on it.
Fast iteration is key to embedding and applying the lessons we learn in the things we make. If anything, this is the 'hedgehog concept' of modern business.
But as security professionals we need to 'know many things', being responsive to changes in the threat space but also in the applications we seek to secure. The best way of achieving this is to begin to see, and contribute to, the projects and products we work on as needing to be 'highly reliable'. The behaviors that this approach embodies make everyone responsible for the long-term success of the product... and that also means its security.
Arbuckle is vice president, EMEA, and chief enterprise architect at Chef. He tweets as @dromologue.