If your CTO says it is “impossible” for the hyperconverged cloud to go down, you know you and everyone else will be in for a long night at
some point during the cloud's stabilizing period. Nothing is infallible, not even the
cloud. If you are pushing the technology edge, then you need to own up to the
inevitability of a confluence of issues. So you have to ask yourself, “What
steps would have to be skipped or overlooked during the design, development, and implementation of a cloud system to get to the point of an emergency downtime on your foolproof network?”
Hypothetically, let’s say one bug in the software could blue-screen all of the domain controllers in every redundant location at the same
time. There are a few points to consider when reviewing this type of failure:
The inexperience of those in control, at both the technical level and the blind-faith managerial level
With new technology, even the experts make mistakes. When the outage happens, are the people caught in the headlights fully trained and part of the initial design and development, or are they the “B support team”? This is a critical mistake made over and over again by IT leadership and financial stewards: it is deemed acceptable to bring in experienced consultants to design and implement a new technology solution, and then leave it to a less experienced support team to maintain and upgrade, without proper training and onsite support.
Lack of resources to provide an acceptable factor of safety
In the rush to curtail costs, the system suffers. The “secure and agile IT services” cloud is not a one-off capital expense. Cut operational costs too drastically and the shortcomings will show up in emergency outages and other incidents over time. As with any system, change must be methodical, with a factor of safety that is understood by all business partners. “No budget” is no excuse to cut corners. Try saying that to a surgeon.
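As a back-of-the-envelope illustration of what a factor of safety means in capacity terms, consider a check like the sketch below. The numbers are hypothetical, not drawn from any particular deployment; the point is that the headroom calculation has to survive losing a host at peak load.

    # Hypothetical capacity check: can the cluster absorb the loss of one
    # host at peak load and still stay under a safety ceiling?
    peak_load_vms = 240      # expected VMs at peak (assumed figure)
    vms_per_host = 40        # capacity of a single host (assumed figure)
    hosts = 8                # hosts in the cluster
    safety_ceiling = 0.80    # never plan to run above 80% of what remains

    surviving_capacity = (hosts - 1) * vms_per_host   # one host down
    utilization = peak_load_vms / surviving_capacity
    print(f"Utilization after losing one host: {utilization:.0%}")
    print("Within the factor of safety" if utilization <= safety_ceiling
          else "Under-provisioned: the savings bought an outage")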
Make sure someone is always accountable
In many cases, the business is cajoled into taking what IT says for granted, but when the system goes down it may be surprised to find that no one is ultimately held accountable. “Virtualizing and hyperconverging its data center” can also end up virtualizing accountability for the system, which in turn means that a Root Cause Analysis will never fully explain what really happened, if it ever gets sent out…
Lack of a decoupled, identical test environment
If your company cannot afford a decoupled test environment that mimics the cloud setup, it is adding risk to the implementation. The vendor should at least provide a comparable test environment in which to test bug fixes and service packs. If you had this and the outage still occurred, the failure points to the infrastructure team, their manager, their director, and ultimately the CIO.
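As a minimal sketch of what keeping that test environment honest could look like, assuming each environment can export a simple version inventory (the file names and fields below are made up for illustration), a parity check before promoting a service pack might be as simple as:

    import json

    # Hypothetical inventories exported from the test and production
    # clusters, e.g. {"hypervisor": "7.0u3", "dc_os": "Server 2019", ...}
    with open("test_inventory.json") as f:
        test_env = json.load(f)
    with open("prod_inventory.json") as f:
        prod_env = json.load(f)

    # A test environment that has drifted from production cannot
    # validate a bug fix or service pack.
    drift = {
        name: (test_env.get(name), prod_ver)
        for name, prod_ver in prod_env.items()
        if test_env.get(name) != prod_ver
    }

    for name, (test_ver, prod_ver) in drift.items():
        print(f"DRIFT {name}: test={test_ver} prod={prod_ver}")
    if drift:
        raise SystemExit("Test does not mirror production; do not promote.")
    print("Test mirrors production for the tracked components.")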
Cognitive Bias toward “If it runs, don’t upgrade”
Some CTOs have a bias toward only fixing bugs with patches, and never upgrading the virtualization system software unless the infrastructure requires it. In the end, “hyperconvergence” is a term that is meant for theoretical analysis, not ROI, because the hidden costs of implementing this new technology are everywhere; you just have to know where to look. The risks of implementing an internal cloud are also greater than those of going with the established, large cloud services.