Thursday, September 22, 2011

The risks for service outage


Service outage risks fall into three categories:

Server: Each server runs services that are vulnerable to outage. The servers are the Content Server, Index Server, Application Server, Database Server, and Storage Server.
Systemic: The servers depend on one another through their integrations, so an outage on one can take down others. For example, if the content server goes down, the application is out; if the database or storage goes down, the content server is down, and so on.
Disaster: The whole server room is down. In this scenario the DR system would sync and start up.
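
The systemic category can be sketched as a small dependency graph: given one failed service, walk the graph to find everything that goes down with it. This is only an illustration. The service names and the DEPENDS_ON map are assumptions drawn from the examples above, not a description of any particular installation.

```python
from collections import deque

# Assumed dependency map: each service lists the services it needs to be up.
# Adjust this to match your own environment.
DEPENDS_ON = {
    "application": ["content"],
    "index": ["content"],
    "content": ["database", "storage"],
    "database": [],
    "storage": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return the services that go down, directly or transitively, when `failed` is out."""
    # Invert the map so we can ask "who depends on this service?"
    dependents = {name: [] for name in depends_on}
    for service, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(service)

    # Breadth-first walk outward from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return sorted(impacted)
```

With this map, impacted_by("storage") returns the content server plus everything that depends on it, while impacted_by("application") returns an empty list because nothing sits above the application.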

The risk of services going down is real, and outages happen most often at the server level. User complaints tend to come in when performance is slow, which may be a sign that a service is in trouble. Integrations between DCTM and other services are often risky because they assume that the other services are always up. If a company is growing, the network will change, databases will stumble, and even electrical circuits will blow, so keep all of this in mind and in your recovery plans, regardless of assurances that this "will never happen".
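
Since slow performance is often the first warning, a check can report "slow" as well as "down". Below is a minimal sketch of that idea. The function name classify_check and the 2-second threshold are hypothetical choices for illustration; pick a threshold that matches what your users consider slow.

```python
import time


def classify_check(check, slow_after=2.0):
    """Run `check` (a zero-argument callable) and return (status, seconds).

    status is "down" if the check raises an exception, "slow" if it
    succeeds but takes longer than `slow_after` seconds, otherwise "ok".
    """
    start = time.monotonic()
    try:
        check()
    except Exception:
        return "down", time.monotonic() - start
    elapsed = time.monotonic() - start
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

Any callable can be passed in, for example a function that runs a test query or opens a login page. A run of "slow" results is a reason to look at the service before it becomes a "down".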

Risk Matrix by Server


Scope    | Server      | Outage Description         | Integration Dependency        | Risk Level              | Monitoring
---------+-------------+----------------------------+-------------------------------+-------------------------+-------------------
Systemic | Storage App | Storage Services           | Database, Content, Index, App | Low (If HA, redundancy) | monitoring scripts
Systemic | Oracle      | Database Server            | Content Server                | Low (If HA, redundancy) | monitoring scripts
Systemic | LDAP Server | LDAP                       | App/Content Server            | Low (If HA, redundancy) | monitoring scripts
Systemic | DNS Server  | DNS                        | All Servers                   | Low (If HA, redundancy) | monitoring scripts
Server   | DCTM        | Repository Services        | App Servers, Index Servers    | Med (If standalone)     | monitoring scripts
Server   | DCTM        | Java Method Server (JBoss) | Index agents, Jobs, workflow  | Med (If standalone)     | monitoring scripts
Server   | Application | Tomcat                     |                               | Med (If standalone)     | monitoring scripts
Server   | Index       | xPlore Servers and Agents  | App Server Search             | Med (If standalone)     | monitoring scripts
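
The matrix lists "monitoring scripts" for every row without saying what they look like. One simple form is a TCP port check, sketched below. The host names are placeholders, and the ports are commonly seen defaults that I am assuming here (1521 for the Oracle listener, 389 for LDAP, 8080 for Tomcat); verify them against your own configuration. A port that accepts connections does not prove the service is healthy, so treat this as a first-line check only.

```python
import socket

# Hypothetical hosts and assumed default ports; replace with your own values.
CHECKS = [
    ("Oracle listener", "db.example.internal", 1521),
    ("LDAP", "ldap.example.internal", 389),
    ("Tomcat", "app.example.internal", 8080),
]


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def run_checks(checks=CHECKS):
    """Return a dict mapping each check name to True (reachable) or False."""
    return {name: port_is_open(host, port) for name, host, port in checks}
```

Calling run_checks() from a scheduled job and alerting on any False value gives a basic version of the monitoring column above.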

Disaster Recovery systems are replicated systems and carry a low but real risk.
