Saturday, July 10, 2010

Single Point of Failure

I’m going philosophical for a moment here. Why do we have a technical systems oriented concept of “single point of failure”? Aren’t we all “single points of failure” and the recovery from our absence is proportional to the much we shared of our experiences? The risk of losing a connection, process, services, invocation, etc is proportional to how much we know about system’s interactions and what their limits are.

Redundancy is nothing but an attempt to backfill a single point of failure. Redundancy is really just a double point of failure if you think about it. How many points of redundancy is really risk free? If you have one hot backup system to failover to, then what do you have if that fails? It’s better than nothing, but not fail safe. Enough of this sophomoric banter, let’s try to figure out a list of single points to check and test.

Any call for failover and redundancy is mandated by a Service Level Agreement between the business and IT. This agreement has flexibility factors built into it and is most likely an extension of a more comprehensive regulatory or industry standard. It usually happens when an old policy didn’t work according to plan and the C’s freaked out.

Can you perform an in-place upgrade a system without downtime? If not, what services in the application stack are not redundant enough? Here’s a common list of systems and issues to consider:

Application Servers
Should be clustered for failover, however do sessions really failover gracefully or do Users get errors during the failure?

Database Servers
Should be setup with a hot failover like an Oracle RAC system. This would allow for backup recovery. Also, as storage expensive as transaction logs are some cases these could be a life saver.

Service accounts: Local vs. Domain
I understand that applications need to be secured between services and file systems, but to only allow domain accounts for an application sets up a potential for one account to run services that could all fail if there is an account lockout of that one domain account, or Active Directory. Local admin accounts serve the same purpose and in most cases can be substituted for the domain account.

Storage Backup/Restore
Storage itself is relatively straight forward; backing it up and restoring it is a whole different story. Backups are usually done on a rolling basis: full on the weekend, incremental on the weekends. But, what happens when a restore from tape is needed? The normal backups get delayed because the same system is used for both. Do you know how long it takes to restore from the backup? Are the backups redundant? How close is the backup machine to the application? Is there a whole DR off sight solution?

2 comments:

esar said...

DepositPhotos: #1 resource for buying and selling Royalty-free photographs and vector images, Stock Images from 0.1$ (it is very cheap) and Subscription plans from 19$ so visit now by clicking- stock photos | stock images | royalty-free images

mahak said...

Single point of failure is the problem in many systems.And we think that redundancy is the solution to this problem but in some cases its not.There are other things and concept which are given in this post are much better solution to the problem.
records management