Tuesday, March 17, 2009

Documentation: What and When

In software development we test everything but the project’s documentation. I can’t tell you how many times I’ve had to scour a project’s documentation for information that should be organized for quick reference and be up-to-date with the latest configuration and customizations. Instead the documentation is usually missing some crucial bit of detail that forces me to search for answers and waste time, mine and the client’s.

So, how do we put more emphasis on verifying and using documentation, beyond test scripts and installation docs, or design and requirements docs?

One approach is to log more diligently all of the issues that happen during the development and deployment of the project. These logs hold vital setup information and deployment hurdles that will never get documented formally. The testers and developers are the keepers of this knowledge and need to document it as they work through the problems they encounter.

The problems point to the most important aspects of the project’s success. A problem’s solution will suffice for the time being, but the problem will strike again in a similar fashion, in a pattern. These patterns are what need to be understood.

For example, you deployed a workflow with an auto activity that timed out during QA testing. The timeout setting was increased, but no one documented it. When the workflow gets deployed to Production the same thing happens, but now users see that the workflow has paused and are concerned and annoyed. The first thing you do is read through the documentation, which has no reference to timeout changes. Then you look at the logs and see that a method has failed with no reason given. The workflow supervisor’s inbox is filled with errors, but you don’t know that because no one documented that this user’s inbox should be checked occasionally. No one even considered that a fast system could have a few workflows timing out.

I think the point here is that documenting is not only writing about the design of the system, its configuration and customizations, but detailing the pitfalls and hurdles of the process as well. There could be two sets of documents: one for the client and one for your sanity when things go wrong, which they will; it’s just a matter of time. Next time you’ll be more prepared, with a cheat sheet and quick references to previous issues and to complex configurations and deployments.

Monday, March 2, 2009

Documentum Maintenance/Procedure Checklist

After the initial Documentum installation and rollout of the first phase, it is essential to
follow a maintenance/procedure checklist to ensure maximum system performance and stability.

Documentum Administrator
Many of the maintenance procedures and jobs are configured or accessed through Documentum
Administrator (DA):
  • Server and Repository configurations
  • LDAP configuration
  • Users, Groups, Roles
  • Security (ACLs)
  • Storage (Locations, Storage, and Filestores)
  • Index Agent’s failed index list should be understood and resubmitted if necessary
Maintenance

Logs to Monitor
It is highly recommended to check all logs periodically for errors and warnings.

Application Server
  • Name: stdout_yyyymmdd.log (example: stdout_20090218.log)
  • Location: \Program Files\Apache Software Foundation\Tomcat 6.0\logs
  • Purpose: shows warnings and errors from Webtop and TBOs.

Content Server Repository Log
  • Name: DocbaseName.log
  • Location: C:\Documentum\dba\log
  • Purpose: Shows the repository startup output and any warnings or errors.
Java Method Server Log
  • Name: access.log and DctmServer_MethodServer_DocbaseName.log
  • Location: C:\Documentum\bea9.2\domains\DctmDomain\servers\DctmServer_MethodServer\logs
  • Purpose: tracks access and status of the Java Method Server
Index Server Log
  • Name: access.log and DctmServer_IndexAgent.log
  • Location: C:\Documentum\bea9.2\domains\DctmDomain\servers\DctmServer_IndexAgent\logs
  • Purpose: tracks access and status of index agent
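
The periodic log check described above could be partly automated. The following is a minimal sketch, not a supported tool: it assumes that lines worth reviewing contain the words ERROR, WARN, or WARNING, which may not hold for every log format listed here. The function names are illustrative only.

```python
import re
from pathlib import Path

# Assumption: interesting lines contain ERROR, WARN, or WARNING (any case).
PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?)\b", re.IGNORECASE)


def scan_log(path):
    """Return (line_number, text) pairs for lines that match PATTERN."""
    hits = []
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if PATTERN.search(line):
                hits.append((number, line.rstrip()))
    return hits


def scan_directory(directory, glob="*.log"):
    """Scan every matching file in a directory; map file name to its hits."""
    results = {}
    for log_file in sorted(Path(directory).glob(glob)):
        hits = scan_log(log_file)
        if hits:
            results[log_file.name] = hits
    return results
```

Calling scan_directory(r"C:\Documentum\dba\log") would cover the repository log location listed above; the other log directories can be passed in the same way.
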
Disk Space Management

The Content Server has a state of the docbase job (dm_StateOfDocbase) which monitors
this. Also the data drive should be monitored.
  • The SQL Server transaction log should be monitored
  • The Webtop cache files should be monitored
  • The Index data drive should be monitored
Database Maintenance and Logs
  • Disk space should be monitored
  • Transaction logs should be monitored
  • CPU and RAM usage patterns
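
As a rough illustration of the disk space checks above, the sketch below reports which drives are at or above a fullness threshold. It is an assumption-laden example: the 85 percent default simply mirrors the "-percent_full 85" value used by dm_ContentWarning in the Jobs section, and the paths you pass in depend on your own drive layout.

```python
import shutil


def percent_full(path):
    """Return how full the drive holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def drives_over_threshold(paths, threshold=85.0):
    """Map each path at or above `threshold` percent full to its usage."""
    over = {}
    for path in paths:
        pct = percent_full(path)
        if pct >= threshold:
            over[path] = round(pct, 1)
    return over
```

A scheduled task could call drives_over_threshold with the filestore, index data, and database drives and send an email whenever the result is not empty.
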
Jobs

Some of the jobs below are not active OOTB. They have to be set to active and started on a schedule. Be sure to set the run times so that they do not conflict with other jobs and backup
schedules.
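
One way to sanity-check run times before activating jobs is to write down each expected run window and look for overlaps. This is a minimal sketch under simplifying assumptions: windows are given as (start, end) in minutes after midnight, no window crosses midnight, and the durations are your own estimates. The times in the usage note are hypothetical.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two (start, end) minute-of-day windows intersect."""
    return a[0] < b[1] and b[0] < a[1]


def find_conflicts(windows):
    """Return pairs of names whose windows overlap.

    `windows` maps a job or backup name to a (start_minute, end_minute) tuple.
    """
    conflicts = []
    for (name1, w1), (name2, w2) in combinations(sorted(windows.items()), 2):
        if overlaps(w1, w2):
            conflicts.append((name1, name2))
    return conflicts
```

For example, with dm_DMClean at (60, 120), dm_LogPurge at (90, 105), a SQL backup at (180, 270), and dm_UpdateStats at (240, 300), find_conflicts flags dm_DMClean against dm_LogPurge and dm_UpdateStats against the backup.
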

dm_ContentWarning
  • Purpose: Warnings for low availability on DM content/fulltext disk devices
  • Method args: -window_interval 720, -queueperson, -percent_full 85
  • Note that the "-percent_full" value is "85", which you may want to lower for more lead time to deal with disk space.

dm_DMClean
  • Purpose: Executes dmclean on a schedule
  • Method args: -queueperson, -clean_content TRUE, -clean_note TRUE, -clean_acl TRUE,
    -clean_wf_template TRUE, -clean_now TRUE, -clean_castore FALSE, -clean_aborted_wf FALSE, -window_interval 1440

dm_LogPurge
  • Purpose: Removes outdated server/session and job/method logs
  • Method args: -queueperson, -cutoff_days 30, -window_interval 1441
  • Note the "cutoff_days" parameter should be set to a reasonable number of days, balancing compliance and troubleshooting needs.
dm_StateOfDocbase
  • Purpose: Lists docbase configuration and status information
  • Shows: Number of docs and Total size of content, among many other stats.
dm_AuditMgt
  • Purpose: Removes old audit trail entries. A key parameter is the cutoff in days, basically how many days’ worth of audits to keep.
  • Method args: -queueperson, -custom_predicate r_gen_source=1, -window_interval 1440,
    -cutoff_days 1
  • Note the "cutoff_days" parameter should be set to a reasonable number of days, balancing compliance and troubleshooting needs.


dm_QueueMgt

  • Purpose: Deletes dequeued items from dm_queue
  • Method args: -queueperson, -cutoff_days 90, -custom_predicate, -window_interval 1440

dm_UpdateStats

  • Purpose: Updates RDBMS statistics and reorgs tables (if RDBMS supports)
  • args: -window_interval 120, -queueperson, -dbreindex READ, -server_name SQL2\SQL2005

dm_ConsistencyChecker

  • Purpose: Checks the consistency and integrity of objects in the docbase

dm_DataDictionaryPublisher

  • Purpose: Publishes data dictionary information

dm_LDAPSynchronization

  • Purpose: One-way synchronization of LDAP users and groups to Docbase
  • Method args: -window_interval 1440, -queueperson, -create_default_cabinet true, -full_sync
    false

dm_FTStateOfIndex

  • Purpose: State of Index
  • Method args: -window_interval 12000, -queueperson dmadmin, -batchsize 1000,
    -writetodb_threshold 1000000, -serverbase F, -usefilter F, -dumpfailedid F,
    -matchsysobjversion F, -matchallversion F

dm_FTIndexAgentBoot

  • Purpose: Boot Index Agents


dm_GwmTask_Alert

  • Purpose: Sends email alert if task duration is exceeded

dm_GwmClean

  • Purpose: Cleans all the orphan decision objects

DQLs to run to check on audit trails and dmi_queue_items

The following statements are some of the DQLs that EMC support had us run to determine the
number of audit trails and queue items that were in the repository:


Select count(*) from dmi_queue_item

Select count(*) from dm_audittrail

Backup Procedures

Ideally, the Content Server should be shut down prior to running the backup of the SQL Server database and started back up afterward. This will reduce the likelihood of the repository becoming out of sync with the database and the content files.
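
The ordering above (stop, back up, restart) can be sketched as a small wrapper. This is only an outline of the sequencing: the three callables are hypothetical placeholders for whatever commands or service calls your environment uses, and nothing here is a Documentum or SQL Server API. The one design choice worth noting is the finally clause, which restarts the Content Server even if the backup fails.

```python
def backup_with_server_stopped(stop_server, run_backup, start_server):
    """Stop the server, run the backup, and always start the server again.

    All three arguments are caller-supplied callables (hypothetical here).
    """
    stop_server()
    try:
        run_backup()
    finally:
        # Restart even when the backup raises, so the repository
        # is not left down after a failed backup.
        start_server()
```
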

OS and Software Upgrades/Patches

Before applying any patches or upgrades to any of the Documentum suite and supporting applications, be sure to check for compatibility. Apply any patches or upgrades to the dev and QA environments and test them first.

Network Connectivity Interruption

If any network interruption occurs, then service logs should be checked for compromised activity. The Content Server and Tomcat server may need to be restarted. The logs of the application and content servers should be periodically monitored for errors and warnings.


Performance


RAM and CPU Utilization Maxed Out

If RAM is filled or CPU utilization is maxed out, then the service responsible should be checked. If the service is a Documentum service, it should be restarted and the root cause should be determined. Utilization should be monitored, and any anticipated spikes in use or additional services need to be load tested and analyzed.

What should you do if Tomcat performance slows? If the number of concurrent users reaches EMC’s limit of 20, EMC recommends adding a second Tomcat server.


Further Java Memory Allocation settings to consider.

EMC Support gives the basic JVM settings to cover for common exceptions and crashes. There
are a number of other settings to add as more traffic occurs on the Tomcat server. From the
DCM Installation Guide: “To achieve better performance, add these parameters to the application server startup command line:

  • -server -XX:+UseParallelOldGC

Document caching can consume at least 80MB of memory. User session caching can consume approximately 2.5 MB to 3 MB per user. Fifty connected users can consume over 200 MB of VM memory on the application server. Increase the values to meet the demands of the expected user load.”
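
The figures quoted above can be turned into a quick back-of-the-envelope estimate. This sketch uses only the numbers from the quote (80 MB for document caching, 2.5 MB to 3 MB per user session) and treats them as a simple linear model, which is an assumption on my part; real usage depends on the workload.

```python
def estimate_cache_mb(users, per_user_mb=(2.5, 3.0), document_cache_mb=80.0):
    """Return a (low, high) estimate in MB for the given number of users."""
    low = document_cache_mb + users * per_user_mb[0]
    high = document_cache_mb + users * per_user_mb[1]
    return low, high


low, high = estimate_cache_mb(50)
print(f"50 users: {low:.0f} MB to {high:.0f} MB")  # 50 users: 205 MB to 230 MB
```

The result for fifty users, 205 MB to 230 MB, agrees with the guide’s "over 200 MB" statement.
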

Monitor Sessions

DA

  • Location: Under Administration > User Management > Session
  • Or DQL: execute show_sessions (to show all active and inactive sessions)


DQL

  • execute list_sessions (to show active sessions)

Via docbasic ebs script

  • Purpose: run this script from a command line prompt to output how many active and inactive sessions are currently on the Content Server. Set the interval between outputs and how many loops to run.
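
The original script is Docbasic and is not reproduced here. As an illustration of the same idea (sample the session counts at a fixed interval for a fixed number of loops), here is a minimal sketch. The fetch_counts callable is hypothetical: it stands in for whatever mechanism you use to obtain the counts, for example by running show_sessions through a client of your choice.

```python
import time


def poll_sessions(fetch_counts, interval_seconds, loops,
                  sleep=time.sleep, report=print):
    """Sample (active, inactive) session counts `loops` times.

    `fetch_counts` is a caller-supplied callable returning a
    (active, inactive) tuple; it is an assumption of this sketch.
    """
    samples = []
    for i in range(loops):
        active, inactive = fetch_counts()
        samples.append((active, inactive))
        report(f"sample {i + 1}: active={active} inactive={inactive}")
        if i < loops - 1:
            sleep(interval_seconds)  # wait between samples, not after the last
    return samples
```
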


Troubleshooting Max Sessions error

Before restarting Tomcat:
Try logging into the content server from docweb using the Doc App Builder
application. If you can, then this isolates the max session error to the Tomcat/Webtop
server.

  • Using DA, look at how many “active” user sessions are currently in the repository,
    and how many “inactive” sessions.
  • Try reducing the session timeout value in the web.xml on the Tomcat server to see if
    the inactive sessions get cleared out faster.
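
Before changing the timeout, it helps to confirm the current value. The sketch below reads the standard servlet session-timeout element (expressed in minutes) from web.xml content. It matches on the element’s local name so that it works whether or not the file declares an XML namespace; the function name is illustrative only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_session_timeout(xml_text):
    """Return the session-timeout value in minutes, or None if absent."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "session-timeout" and element.text:
            return int(element.text.strip())
    return None
```

For a file on disk, read its contents first and pass the text to read_session_timeout, or adapt the function to use ET.parse.
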

Security and Server Access Maintenance

  • Test users and test content should be deleted out of Production
  • The database schema owner account should be locked down
  • The Documentum install owner, “dmadmin”, should be locked down
  • Only scheduled, authorized access to Production should be allowed for all
    servers of the system.
  • Repository audit trails should be configured for certain events, such as deleting of
    content.

Long Term High Availability and Scaling Recommendations

  • As more users access the system, it may become necessary to create a second Tomcat
    (clustered) instance to ease the load on just one application server.
  • As more content gets added to the system, more disk space will need to be added to
    the filestore drive.
  • Set up failover services for all key components
  • Add more Java Method Servers if lifecycle processing overwhelms the existing one.
  • A comprehensive content archiving plan will need to be designed and implemented.
  • Set up a disaster recovery site if the system’s service level agreement (SLA) requires
    recovery sooner than a new system could be rebuilt from backups.