Dave Lewis

Build to Fail

Blog Post created by Dave Lewis Employee on Feb 9, 2015

In the course of my writing I often draw from past experiences. In this case I was able to weave some sage advice I received when I started working here at Akamai.

 

I previously posted this article in my Forbes column as "Network Security, Build To Fail."

 

"Early in my information security career I worked as a network security staffer for a large financial institution. While I was there I learned very quickly that a failure would cost a great deal of money for every second the systems were offline. When the Internet banking site went down, as it did on occasion, we would spring into action no matter the time of day and work like people possessed until the systems were back online. I found it strange that this was necessary in the first place. Why were there not redundant systems as part of the design? Why was the site not able to scale under load? This was back before distributed denial of service (DDoS) was in vogue.

 

The problem I noticed as I progressed along my career path was the planned obsolescence that was baked into network security. In addition, far too often systems would be deployed without backups. Required features would be missing from the software, land “on the road map,” and rarely come to fruition. By creating a product set that doesn’t adequately meet the needs of the consumer, are we not putting forward a lifespan-limiting design? The truth of the matter was that it could all be reduced to budgets. Network security has typically been the red-headed stepchild of information technology and has frequently suffered in the annual budget discussions as a result.

 

Limitations noted. Design issues are another aspect. How many times have we all seen a “castle” graphic on a slide presentation to denote the security of the corporate network? Last time I checked, a castle doesn’t scale. We have severe constraints that limit responses. If isolated, the castle will wither and die. Invariably we find that there is a tunnel under the wall, either left there by the builders or dug by smugglers sneaking in and out of the city. As the populace grows, the functional constraints of the castle are tested and ultimately the citizens have to move outside the safety of the walls. A castle has limited utility and as a result we have planned obsolescence.

 

The castle analogy is no longer feasible in the modern network security landscape. It is a model that cannot scale appropriately to meet demands and opens up an organization to security exposures.

 

We need to assume that our systems will fail and fail hard. We need to build network security with failure in mind. There was once a notion of “bricks and clicks” that was meant to demonstrate a delineation between retail and online presence. This too has fallen by the wayside, as online business is now just the business.

 

A couple of companies that come to mind regarding the ability to “build to fail” are Netflix and my own day job, Akamai. Netflix, for instance, uses a script that they call Chaos Monkey. This script will randomly knock systems offline.

 

From Netflix:

 

Failures happen and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week?

 

Pretty solid way to ensure that you are building resilient systems. Having lived through those 3 am phone calls I can safely say, no. I need my coffee.
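To make the idea concrete, here is a minimal sketch in Python of what "randomly knock systems offline" can look like. This is not Netflix's Chaos Monkey and does not touch real infrastructure. The Instance, Pool, unleash_monkey and survives_chaos names are all hypothetical, invented for illustration, and the "systems" are just in-memory objects. It assumes a redundant pool that fails over to the next healthy instance, which is exactly the property you want to test.

```python
import random


class Instance:
    """A stand-in for one server in a redundant pool (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.online = True

    def handle(self, request):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return f"{self.name} served {request}"


class Pool:
    """Routes each request to the first instance that is still online."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instances left")


def unleash_monkey(pool, rng):
    """Knock one randomly chosen online instance offline and return it."""
    online = [i for i in pool.instances if i.online]
    if not online:
        return None  # nothing left to break
    victim = rng.choice(online)
    victim.online = False
    return victim


def survives_chaos(pool, rng, kills):
    """Return True if the pool keeps serving after each of `kills` failures."""
    for _ in range(kills):
        unleash_monkey(pool, rng)
        try:
            pool.handle("GET /balance")
        except RuntimeError:
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(2015)
    pool = Pool(Instance(f"web-{n}") for n in range(3))
    print(survives_chaos(pool, rng, kills=2))  # True: one instance remains
    print(survives_chaos(pool, rng, kills=1))  # False: the last one is gone
```

The point of the sketch is the shape of the test, not the code itself: break something at random, on purpose, during office hours, and check that the service as a whole still answers. A pool of one fails this test immediately, which is the 3 am phone call in miniature.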

 

At my day job our CSO, Andy Ellis, pointed out that we too use a version of the chaos monkey: the Internet. If you want to ensure that your systems and applications will survive, you need to break them. In so doing you help to ensure that your code and your configurations are consistent across your systems. The days of having systems which were built by a summer intern and are held together with duct tape and hope are waning quickly. We need to recognize when the technical debt has become unwieldy.

 

In order to improve we need to have security systems that can stand up under scrutiny. They need to be able to withstand the chaos monkeys."
