Saturday, February 9, 2008

Polluting the ECM Ecosphere

You know those email spams that fill up your inbox? Well, what about the trail of junk that Content Management Systems leave behind as they forge ahead solving complex business problems?

As I’ve worked on large, small, and medium-sized CMSs, I’m always amazed at how polluted they are with logs, audit files, orphaned work items, queue items, reports, ACLs, versions, etc. The out-of-box cleanup jobs focus on getting rid of unwanted versions, orphaned content, logs, and queue items, to an extent.

The problem is that, given some business reporting requirements, these jobs are too good at deleting files that may be of use for historical analysis. Business users get nervous when you say, “We have to clean things up to maintain performance.” They say, “Can we wait for a while until we really need to do this? What are the risks? What if you delete something that we need at the end of the quarter, or year, or in ten years?”

M. Scott Roth’s “Seven Jobs Every Documentum Developer Should Know and Use” article details the use of seven Documentum jobs: DMClean, DMFilescan, LogPurge, ConsistencyChecker, UpdateStats, QueueMgt, and StateOfDocbase. There’s a job to trim versions, but for whatever reason Scott didn’t include it in his job list. These jobs are all essential to keeping your repository clean and performing the way you expect, but what do you do about ACLs, workflow history, and versions if the deletion is not specific enough?

Content pollution is rampant in all industries and is a direct result of rushed design and overambitious technical solutions to relatively simple business problems. Take a regulated content management system, for example. This system most likely creates a new version of content for every change to its file or its metadata. There could also be an audit trail that records every version’s change, a backup of the file system and the database for nightly and weekly data security, disaster recovery with off-site replication, multiple renditions, and multiple language versions.

The upshot is that the proliferation of versions, logs, and backups is great for storage “archive” companies, but can lead to confusion and a false sense of security. Who’s making the design decisions? Most likely it’s a business user who doesn’t want change, thus forcing an overworked IT Manager and ECM Architect to work out a solution that puts garbage control on the back burner. “We’ll deal with logs and versions later; right now we have to roll out the project on time and within budget.”

So how do we design with conservation in mind? For one, we think a year or two into the future and try to extrapolate the effects of thousands or millions of scraps of content floating around in the CMS, slowing down queries and filling up the more expensive hard disk space. Here are some more ideas:

Design to recycle:
For each log and object type, ask how it will be created, versioned, and disposed of. What is the purpose of this content? How long will it be useful?

Efficiency:
Think of conservative approaches to logging events and to versioning content (regardless of OOTB functionality).

Upgrades, New Development, and Performance Testing:
Logs, database temp space, and temporary migrated content files can pile up everywhere during special testing and migrations. These files are often “hidden” and sometimes move along to production systems, only to clog things up later.
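
One way to make the “Design to recycle” questions concrete is to write the answers down as data at design time. The sketch below is only an illustration in plain Python: the category names, purposes, and retention periods are hypothetical, and it does not call any Documentum API. It only decides whether an item is past its useful life, and it refuses to dispose of anything that has no rule or has open-ended retention.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    """Design-time answers for one kind of content: why it exists, how long it is useful."""
    purpose: str
    keep_days: Optional[int]  # None means "keep until the business says otherwise"


# Hypothetical categories and periods, for illustration only.
RULES = {
    "dfc_trace": RetentionRule("debugging during development", 14),
    "ldap_sync_log": RetentionRule("troubleshooting sync failures", 30),
    "audit_trail": RetentionRule("regulatory and historical reporting", None),
}


def is_disposable(category: str, created: date, today: date) -> bool:
    """Return True only when a rule exists and its retention period has passed."""
    rule = RULES.get(category)
    if rule is None or rule.keep_days is None:
        # No rule, or open-ended retention: never dispose of it automatically.
        return False
    return today - created > timedelta(days=rule.keep_days)
```

Defaulting to “keep” when no rule exists is deliberate: it answers the business user’s worry about losing something that turns out to be needed at the end of the year.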

Site Cache temp files and orphaned site files:
Site Cache Services is notorious for leaving stray temp files, logs, and orphan folders all over the place, especially during failed publishing attempts.

Docapp messes:
Docapp installs, when not performed carefully, can leave references to old lifecycles, workflows, object types, and attributes. These orphaned objects can not only clog the system, they can also corrupt production environments with hardcoded references to filestores and nonexistent owner names.

Repository and LDAP synch logs:
Every time an LDAP synch job runs, logs get stored in the repository and on the Content Server file system. Every time a repository starts up, a new log starts for it. These logs fill up the server file system, which is usually not a large disk.

DFC traces:
During development and testing, trace logs are essential to tracking down bugs and slow performance. These files are usually forgotten and grow into huge, space-choking surprises when you least expect them.
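
A forgotten trace file is easy to find if someone looks for it. The following is a minimal sketch using only the Python standard library; the function name, the directory, and the file pattern are assumptions, since trace and log locations depend on how each environment is configured. It only reports candidates and deletes nothing, so the list can be reviewed before anyone cleans up.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple


def find_stale_files(root: str, pattern: str, max_age_days: float,
                     now: Optional[float] = None) -> List[Tuple[Path, int]]:
    """List (path, size_in_bytes) for files under root that match pattern and
    were last modified more than max_age_days ago. Nothing is deleted."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            info = path.stat()
            if info.st_mtime < cutoff:
                stale.append((path, info.st_size))
    # Largest first, so the worst offenders are at the top of the report.
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Called as, say, `find_stale_files("/path/to/logs", "*.log", 30)` (a placeholder path), it returns the files older than 30 days, largest first.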

Environments such as Sandbox, Dev, Test, Performance, Staging, Prod, DR, Off shore, Business Continuance:
All these environments double, triple, xduple the amount of disk space needed for solutions. Think about ways to migrate only subsets of the needed content, perhaps without versions. Reduce logging in environments that are not used very often or that are dormant for a period of time.
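
The multiplication is easy to underestimate, so a back-of-the-envelope calculation helps when arguing for leaner environments. The sketch below is a deliberately crude model with made-up parameter names: it assumes every version and every rendition is about the same size as the original file and that each environment holds a full copy, and it ignores logs, backups, and database space.

```python
def estimate_storage_gb(documents: int, avg_size_mb: float,
                        versions_per_doc: float, renditions_per_version: float,
                        full_copy_environments: int) -> float:
    """Rough disk estimate in GB: every version and each of its renditions is
    stored, and every full-copy environment repeats the whole lot."""
    files_per_doc = versions_per_doc * (1 + renditions_per_version)
    per_environment_mb = documents * avg_size_mb * files_per_doc
    return per_environment_mb * full_copy_environments / 1024


# Hypothetical figures: 100,000 documents of 2 MB each, four environments.
with_history = estimate_storage_gb(100_000, 2.0, 5, 1, 4)     # 5 versions, 1 rendition each
current_only = estimate_storage_gb(100_000, 2.0, 1, 0, 4)     # current version only
print(f"{with_history:.1f} GB vs {current_only:.1f} GB")      # prints "7812.5 GB vs 781.2 GB"
```

With these illustrative numbers, copying full version and rendition history into every environment costs ten times the disk of copying current versions only.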

Integrations with other applications:
Many integrations of systems require multiple renditions of content for presentation. For example, email messages from Outlook get saved as .msg files in Documentum. Even when EMC’s EmailXtender is installed, an integration with Outlook requires copies of the original email to be imported into Documentum’s repository.