Thursday, May 5, 2016

Obscured by ECM Clouds

If your CTO says it is “impossible” for the hyperconverged cloud to go down, you know you and everyone else will be in for a long night at some point during the cloud's stabilizing period. Nothing is infallible, not even the cloud. If you are pushing the technology edge, then you need to own up to the inevitability of a confluence of issues. So you have to ask yourself, “What steps would have to be skipped, or overlooked, during the design, development, and implementation of a cloud system to get to the point of an emergency downtime of your foolproof network?”

Hypothetically, let’s say one bug in the software could blue screen all of the domain controllers in every redundant location at the same time. There are a few points to consider when reviewing this type of failure:

The inexperience of those in control, at both the technical level and the blind-faith managerial level

With new technology, even the experts make mistakes. When the outage happens, are the people caught in the headlights fully trained and part of the initial design and development, or are they the “B support team”? IT leadership and financial stewards make this critical mistake over and over again: it is deemed acceptable to bring in experienced consultants to design and implement a new technology solution, then leave it to a less experienced support team to maintain and upgrade it, without proper training and onsite support.

Lack of resources to provide an acceptable factor of safety

In the rush to curtail costs, the system suffers. The “secure and agile IT services” cloud is not a one-off capital expense. Cut operational costs too drastically and the shortcomings will show up as emergency outages and other incidents over time. As with any system, change must be methodical, with a factor of safety that is understood by all business partners. It’s no excuse to cut corners because there’s “no budget.” Try saying that to a surgeon.
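To make the “factor of safety” idea concrete, here is a back-of-the-envelope sketch in Python. All of the numbers (host counts, per-host capacity, peak load, and the safety factor itself) are hypothetical; the point is simply to show the kind of check the business and IT should agree on before capacity is cut.

```python
# Back-of-the-envelope capacity check with a factor of safety.
# All numbers below are hypothetical, for illustration only.

def surviving_capacity(hosts: int, capacity_per_host: float, failed_hosts: int) -> float:
    """Usable capacity after some hosts fail (N-1, N-2, ...)."""
    return max(hosts - failed_hosts, 0) * capacity_per_host

def meets_safety_factor(peak_load: float, available: float, safety_factor: float) -> bool:
    """True if remaining capacity still covers peak load times the agreed margin."""
    return available >= peak_load * safety_factor

if __name__ == "__main__":
    hosts = 8                  # hypervisor hosts in the cluster (hypothetical)
    capacity_per_host = 100.0  # load units each host can carry (hypothetical)
    peak_load = 550.0          # observed peak load (hypothetical)
    safety_factor = 1.25       # margin agreed with the business (hypothetical)

    for failed in (1, 2):
        remaining = surviving_capacity(hosts, capacity_per_host, failed)
        ok = meets_safety_factor(peak_load, remaining, safety_factor)
        print(f"N-{failed}: {remaining:.0f} units available -> "
              f"{'OK' if ok else 'BELOW the agreed factor of safety'}")
```

With these made-up numbers the cluster survives one host failure with margin to spare, but a second simultaneous failure drops it below the agreed factor of safety, which is exactly the conversation that “no budget” tends to cut short.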

Make sure someone is always accountable

In many cases, the business is cajoled into taking what IT says on faith, but when the system goes down it may be surprised to find that no one is ultimately held accountable. “Virtualizing and hyperconverging its data center” can also end up virtualizing accountability for the system, which in turn means that a Root Cause Analysis will never fully explain what really happened, if it ever gets sent out at all…


Lack of a decoupled, identical test environment

If your company cannot afford a decoupled test environment that mimics the cloud setup, it is adding risk to the implementation. The vendor should at least provide a comparable test environment for testing bug fixes and service packs. If you had this and the outage still occurred, the failure points to the infrastructure team, its manager, its director, and ultimately the CIO.
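As a rough illustration of why “identical” matters, here is a hypothetical Python sketch; the component names and version strings are invented, and a real shop would pull this inventory from its configuration management tooling. It simply diffs what is recorded for production against the test environment before a patch is promoted.

```python
# Hypothetical sketch: compare recorded component versions in production
# against the test environment before promoting a patch.
# The inventories and version strings below are invented for illustration.

production = {
    "hypervisor": "6.5.2",
    "storage_firmware": "3.7.1",
    "domain_controller_os": "2012 R2",
}

test_environment = {
    "hypervisor": "6.5.0",   # drifted from production
    "storage_firmware": "3.7.1",
    "domain_controller_os": "2012 R2",
}

def environment_drift(prod: dict, test: dict) -> dict:
    """Return components whose versions differ (or are missing) between environments."""
    drift = {}
    for component in prod.keys() | test.keys():
        prod_ver = prod.get(component, "MISSING")
        test_ver = test.get(component, "MISSING")
        if prod_ver != test_ver:
            drift[component] = (prod_ver, test_ver)
    return drift

if __name__ == "__main__":
    mismatches = environment_drift(production, test_environment)
    if mismatches:
        print("Test environment does not mirror production; patch results may not be valid:")
        for component, (prod_ver, test_ver) in mismatches.items():
            print(f"  {component}: production={prod_ver}, test={test_ver}")
    else:
        print("Test environment mirrors production for the tracked components.")
```

A patch that passes in a drifted test environment has not really been tested at all, which is how a “fully validated” service pack still manages to take down production.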

Cognitive bias toward “If it runs, don’t upgrade”

Some CTOs have a bias toward fixing bugs only with patches and never upgrading the virtualization software unless the infrastructure requires it. In the end, “hyperconvergence” is a term better suited to theoretical analysis than to ROI, because the hidden costs of implementing this new technology are everywhere; you just have to know where to look. The risks of implementing an internal cloud are also greater than those of going with the established, large cloud services.

