Litigation Automation: An Introduction to Automated Case Management Approaches and Related Technology
- Identifying the Need for Litigation Automation
- Document Management
- Collect Documents
- Create Unique Identity for each Document
- Uniquely Identify each Page
- Prepare Documents for Scanning
- Scan Documents and Store them in a Database
- OCR the Scanned Documents/Images and/or create a Bibliographic Database for Coding the Documents
- Digital Evidence
- Digital Work Product
- Trial Preparation
- Configuration Management
- Hardware Configuration Management
- Software Configuration Management
- Backup Procedures
- Disaster Recovery
- Designation of Alternative Operating Facility
- War Room Designation
- Trial Presentation
- Display Types
- Presentation Software
- Associated Trial Presentation Software
- The Judge and Jury
- War Room Management
- Collecting Documents
- Document Naming Convention
- Image Naming Convention
- Document Indexing
- Optical Character Recognition (OCR)
- Scanning of Documents
- Recognition of Text
- OCR Voting Technology
- Processing of Forms
- Content Recognition
- Other OCR Information
- Document Retention Planning
- Electronic Document Retention
- Joint Application Design (JAD)
- Glossary of Terms
DISCLAIMER: The analyses set out in this white paper represent the opinion of the author only. The information in this white paper should be considered solely as a starting point for the litigation professional, who is responsible for his or her own pleading. The author denies all responsibility and liability for any inaccuracies, misinterpretations or precedential value.
What is Litigation Automation? A great question! Often when someone asks that question, they also deliver their own perception in the form of another question.
"Oh, you mean like a laptop computer at an accident scene recording all the facts from the victim and associated witnesses prior to the ambulance whisking them away?" ...well, not really, but from a capability standpoint I guess it's doable.
From the other end I get . . .
"You mean like the computer playing the Devil's Advocate, utilizing Artificial Intelligence within the capacity of a Neural Network, systematically arguing a case?" ...my impression is that either this person is way more advanced than I am and into a specialized version of litigation technology, or this person is simply trying to impress upon me the fact that he or she knows a couple of buzzwords after reading the latest copy of Law Technology News. The latter often lands that person in the "a little bit of knowledge is dangerous" category.
In reality, Litigation Automation can encompass all the above and quite a bit in between, which is the focus of this white paper. "Litigation Automation . . . A Practical Guide to Legal Technology and Case Organization" is written from a technologist's point of view. The attempt here is to inform you, the reader, of the many different approaches to technology solutions that are available and/or coming. The audience we're writing for is basically anyone with an interest in legal information systems, from both the private and public sector points of view. Many of the technical approaches apply to a number of other businesses; however, my specialty centers around Litigation Automation.
The basic setup of almost every office in the 1990's includes at least a few computers. Word processing, database, e-mail and accounting applications are resident in the majority of these offices, and many have Internet access and some sort of local area network (LAN). At this point a generic office with minimal computing power has been established. This is referred to as Office Automation. Some of the larger offices will have all the above plus a wide area network (WAN . . . this is still Office Automation).
First and foremost, we haven't begun to refer to Litigation Automation yet. It should also be noted that many cases (mostly small, non-document-intensive cases) will not benefit from Litigation Automation (standard Office Automation will manage small amounts of information with great ease). For instance, an uncontested divorce or a DUI case is most likely not a good candidate for Litigation Automation. For these small cases, the related document collections are typically small and in turn very manageable in hard copy form.
Where Litigation Automation comes into play is in the management of large evidentiary collections. Evidentiary items include documents, video, audio and magnetic media. Each type of evidentiary information management will be addressed in this book. Entire books exist on each of the topics/subtopics/aspects mentioned above. Our intent is not to make you an expert in every aspect but to give you a solid feeling of confidence in each area, so that desired evidentiary information is managed and accessible in a timely, organized and understandable manner.
The most sought after Litigation Automation Solution is typically document management. A one hundred page document collection is fairly manageable in a manual capacity. An attorney or support staff person can easily read one hundred pages and identify all the important documents.
When the number of pages gets into the thousands or millions, finding and/or digesting what is in the collection can become a daunting task. Enter Litigation Automation. Document control is a very critical part of any legal proceeding. The majority of an attorney's evidentiary work product, more often than not, is documents. Documents are gathered during the discovery process of a case. Chain of custody issues must be addressed. This is the "audit trail" behind a document: where it came from and to whom it apparently belonged (because of location). If the authenticity of a document is challenged, the chain of custody will attest to the validity of the document in question. Document naming conventions can be utilized that also address chain of custody (usually from an original location point of view). The process for organizing a large document collection has a number of options based on what you want to get out of the collection and what kind of shape the collection is in. Here is the process in the form of a high level outline with a corresponding paragraph for each step.
Collect all the case documents & establish the chain of custody. The chain of custody refers to where the document came from. If a document is challenged in court, the chain of custody defines exactly where the document came from and thus justifies its authenticity (building and room location...this could be a four character code AA01, building AA, office 1). When collecting documents, take everything that seems relevant, to include potential duplicates. It is better to have four copies of a document than no copies. Also, often times a document has unique marginalia, making what seems to be a duplicate document, a unique document.
Uniquely identify each document in some type of sequence & index to chain of custody. It's a good idea to incorporate the chain of custody information into the document identification. A number of different techniques are used to apply document identification, such as:
- A single barcode placed in the lower portion of the margin of the first page,
- a single barcode on the back of the last page (consistency applies),
- a manual Bates stamp on the physical first page,
- or an electronic Bates stamp applied during the scanning process.
The electronic Bates stamp is popular because of the limited chance for error. The ability to digitally shrink a scanned page to 95% of its original size creates a larger margin so that the electronic Bates number can be applied without infringing on any part of the page.
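As a sketch of how chain-of-custody information can be folded into a document identifier (the "AA01" building/office code comes from the chain-of-custody discussion above; the separator and the sequence width are my own assumptions, not a prescribed format):

```python
def make_doc_id(building: str, room: int, sequence: int) -> str:
    """Build a document ID that embeds chain-of-custody location.

    Hypothetical scheme: a two-letter building code plus a two-digit
    room number (e.g. 'AA01'), followed by a zero-padded document
    sequence number.
    """
    location = f"{building.upper()}{room:02d}"   # e.g. 'AA01'
    return f"{location}-{sequence:07d}"          # e.g. 'AA01-0000042'

# Building AA, office 1, 42nd document collected there:
print(make_doc_id("AA", 1, 42))   # AA01-0000042
```

Because the location code is part of the identifier itself, the chain of custody travels with the document number wherever it is cited.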
Interior pages often represent desired evidence. With that in mind, a single page needs to stand on its own for searching and identification purposes. Each page needs to be indexed/related to the document identification. The same identification options for document identification applies to page identification. Enter the document identification and index the correlating image identification into a database (a relational database is preferred).
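The document-to-page indexing described above can be sketched in a relational database. This is a minimal illustration (the table and column names are my assumptions, not a prescribed schema), using an in-memory SQLite database:

```python
import sqlite3

# Two tables: one row per document, one row per scanned page/image,
# with each page related back to its parent document ID.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE documents (
    doc_id   TEXT PRIMARY KEY,   -- e.g. 'AA01-0000042'
    custody  TEXT                -- chain-of-custody location, e.g. 'AA01'
);
CREATE TABLE pages (
    image_id TEXT PRIMARY KEY,   -- unique page/image identifier
    doc_id   TEXT NOT NULL REFERENCES documents(doc_id),
    page_no  INTEGER NOT NULL
);
""")
con.execute("INSERT INTO documents VALUES ('AA01-0000042', 'AA01')")
con.executemany("INSERT INTO pages VALUES (?, 'AA01-0000042', ?)",
                [("00000001", 1), ("00000002", 2)])

# Any single page can now stand on its own: look up its parent
# document, or retrieve every page of a document in order.
rows = con.execute("""SELECT image_id FROM pages
                      WHERE doc_id = 'AA01-0000042'
                      ORDER BY page_no""").fetchall()
print([r[0] for r in rows])   # ['00000001', '00000002']
```

The one-to-many relationship between documents and pages is exactly why a relational database is preferred over a flat list.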
Print out a copy of the document identification with the listing of correlating pages for each document. This list represents guidelines for the scanner technician on how the documents are to be scanned and clearly defines what document each page belongs to. Documents are normally processed by the banker box. Within each banker box, documents are located in folders and assembled by staple, clip or bound together in some other fashion (colored slip sheets will often be used to identify document breaks). Where it is possible, documents are unassembled and prepared as single sheets of paper for scanning (documents are often times required to be reassembled afterwards so as to keep their original integrity intact). Multiple pages can be loaded into a sheet feeder but must be free of any clips or staples.
Scan the documents to some sort of mass storage computer medium. Make sure the naming convention is unique for both the document and page identifications, and import the images into a relational database. In many cases, the most efficient way to scan documents is through a sheet feeder. Most sheet feeders will handle standard or legal size pages (one or the other in batch form); a software setting usually needs to be selected for one or the other. Pages can also be scanned one at a time on the flat bed part of a scanner (much like the flat bed portion of a copier).
If the documents are in good shape (as in 2nd or 3rd generation Xerox copies) they will OCR with a high level of accuracy. Even documents in fair shape offer value with the advances in OCR technology (and tunable fuzzy search capabilities). If the documents are in bad shape (from an OCR standpoint...such as, 10th generation skewed copies, forms, handwritten or an onion skin style document from the late 1940s) it is often better to create a bibliographic database and in some cases do both. Bibliographic information of interest depends on what the case requirements are. Fields such as Author, Document Date, and say Addressee are some of the more popular items coded (as in code the information from a document and enter the data in a corresponding database). If the documents are fairly readable, but lousy OCR candidates, another option is to have the documents re-keyed (creating searchable full text). This is a fairly expensive alternative but can be justified depending on the potential loss or gain.
Cost justifications for organizing documents are twofold. Obviously a certain value is placed on being able to find a "smoking gun" document (good or bad) in time and to leverage that information for the maximum benefit of the client. Other value is gained by knowing that certain parts of a document collection are free and clear of any helpful or harmful documents.
Digital evidence is best defined as any type of magnetic/computer oriented media that is part of a discovery or potential exhibit collection. Hands down, the most popular legacy media is old Email. Legacy/historic Email systems range from archaic mainframe computer systems from the 1960s up through the Email that was sent in the past year. From a historic point of view, old systems that would possibly present a problem because they were created with antiquated technology are now very accessible. There are companies/organizations that specialize in restoring legacy/historic information from numerous proprietary and non-proprietary formats and converting this information into usable modern file formats (such as ASCII). One such company is Convert-It, located in Campbell, CA. Stanford University also has a substantial facility for reading and converting historic digital information. Other forms of digital information include:
- floppy disks (8, 5.25 and 3.5 inch),
- cassette tapes,
- DAT (8mm and 4mm),
- hard disks (you name it...too many to list),
- Zip and Jaz disks,
- platter disks,
- magneto optical disks,
- and 9 track tapes.
Another very interesting aspect of digital evidence is the use of undelete utilities/programs. A disk that has deleted files on it can often be restored quite easily (even disks that have been reformatted are sometimes capable of being salvaged). Sorting through digital media is very similar to digesting a document collection. The first requirement is to address/create a list of hot topics and/or names of interest. The person sorting through the information will have to be familiar with case requirements and recognize (from disk labeling) the difference between data files (which are of great potential interest) and program files (which have limited significance). Data files will consist of email, word processing, spread sheet and database information. Program files will be the actual working software programs and operating system information that read the data files. For a small collection the approach is fairly simple. Look at everything (just as in a small document collection).
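The data-file/program-file triage described above can be sketched as a simple extension-based sort. The extension lists here are illustrative assumptions; in real case work they would be tuned to the software actually found on the media:

```python
from pathlib import Path

# Hypothetical extension lists for a 1990s-era media collection.
DATA_EXTS = {".doc", ".wpd", ".xls", ".dbf", ".mbx", ".txt"}      # high interest
PROGRAM_EXTS = {".exe", ".com", ".dll", ".sys", ".bat"}           # limited significance

def classify(filenames):
    """Split a file listing into data files, program files, and
    everything that needs a human look."""
    data, programs, unknown = [], [], []
    for name in filenames:
        ext = Path(name).suffix.lower()
        if ext in DATA_EXTS:
            data.append(name)
        elif ext in PROGRAM_EXTS:
            programs.append(name)
        else:
            unknown.append(name)
    return data, programs, unknown

d, p, u = classify(["memo.doc", "wp.exe", "budget.xls", "config.sys"])
print(d)   # ['memo.doc', 'budget.xls']
```

A first pass like this narrows the review to the data files, which is where the email, word processing, spreadsheet and database information of interest lives.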
Digital work product needs to be managed in a manner that allows manipulation via the various conversions performed but still accounts for the original file format. A very important part of processing digital evidentiary items is the chain of custody issues. Chain of custody is the path back to the original location of a case evidentiary item and when and how it was acquired during the discovery process. In accounting terms this would be considered an "Audit Trail". Chain of custody is essential in defending the authenticity of a challenged evidentiary item. If any of the digital evidence is challenged in court for authenticity, all the steps from where the file originated (to include the medium the original file was contained on) to the various steps performed to produce the current file format have to be demonstrated.
File formats need to be defined prior to populating a computer system with digital evidence. The most popular for images (both text and graphics) is TIFF Group IV. TIFF Group IV has long been established as a very stable format. Numerous other image formats do exist that handle photos and graphics better, such as JPEG or GIF, but they do a lousy job on text. Other formats (numbering into three figures) such as Adobe's PDF handle text and graphics very well but are proprietary in nature (to include a 4 cent royalty per PDF copy created by an Authorized Adobe Developer). Most COTS will accept the most standard formats such as TIFF Group IV, JPEG, GIF and BMP. The most popular digital video standard format is MPEG. This is a version compressed roughly 10:1 relative to the larger AVI format. MPEG2 has started to take on popularity as being slightly larger in size than MPEG but has yet to be proven better than MPEG. Tests I performed (along with fellow colleagues) of MPEG2 have produced equal or lesser quality digital video. Drawbacks of MPEG2 are that the software and hardware used are somewhat cumbersome to "tweak" for the best quality and it's not readily available like the MPEG format.
Trial preparation is very much a means to an end. As in most things in life, the better the preparation, the higher the chances for ultimate success become. There is a definite correlation between the volume of evidentiary items and the amount of organization needed. Successfully managing and leveraging the power of these evidentiary items is the main goal in establishing a potential digital advantage/upper hand. Digital evidentiary items include:
- discovery documents,
- deposition transcripts with related exhibits,
- trial transcripts,
- prior pleadings (if they exist),
- expert materials,
- audio tape,
- video tape,
- and all related magnetic media (ie: word processing documents, e-mails and electronic spread sheets).
Support staff should be identified as early as possible so that their roles and responsibilities are clearly defined, realistic and attainable. Each primary support person will need a designated backup. At a minimum, a technical person and a paralegal (or law clerk) need to be identified. The technical person will be responsible for all the computer equipment and the paralegal will be the operator of the computer equipment. For smaller cases and/or smaller firms, one person will perform both of these jobs.
Other considerations in trial preparation include designating a courthouse War Room (either in the courthouse or very near it, with limited resources), and a War Room located at a larger facility like the firm's closest office (with substantial resources).
Configuration Management is essential for establishing and maintaining standardization through the life of a case. Configuration Management is a software engineering term/methodology that is part of the "Lifecycle" approach to software development. Configuration Management is initiated by defining a baseline. This baseline can be a general overview of acceptable tools or go to a deeper granular level, such as establishing file formats and master repositories. The Configuration Management Baseline is a document and/or software package that reflects the current status as of the baseline and all approved changes. Configuration Management's deepest impact is derived from Change Control. Simply stated, Change Control allows for approved changes to an established baseline. Either a single person or a group is identified to approve changes, such as the Configuration Management Manager or the Configuration Management Board.
Once a change is approved, it is implemented and documented. The advantage to utilizing Configuration Management is that it creates an "Audit Trail" of all changes. In computer systems a single change can affect multiple areas; this is referred to as a one to many relationship. What repairs a single problem has the potential to create another problem, and potentially multiple problems. The ability to retrace changes and restore previous configurations can be the difference between a short fix for a known problem and a long disruption for an unknown problem.
Hardware Configuration Management should set forth guidelines that include duplicating the make and model of hardware used for a trial. Different manufacturers will enhance a Pentium 400Mhz MMX computer in somewhat unique ways. Standardization within hardware will help alleviate potential configuration problems. Standardize on PC make and model, to include: processor size, media drives (sizes, capacity and manufacturer), RAM population, keyboard, modem, speakers, sound card, mouse, monitor, network card, and any associated peripherals such as a printer, scanner, or projector. If any other "exotic" hardware is utilized, it should also be duplicated (for example a RAID system, magneto optical drives, or large removable hard drives). Configuration Management of hardware implies using exact configurations of multiple computers. In a typical scenario, if such a thing exists...well, let's say a large trial situation, the computer count would be six:
- 2 Systems in the courtroom (optional laptop systems for portability),
- 2 Systems in the local war room (at or near the courthouse) and,
- 2 Systems at the firm war room (closest local office),
A minimal computer count would be three:
- 2 Systems in the courtroom (optional laptop systems for portability) and,
- 1 System in the local war room (at or near the courthouse or at closest local office).
All COTS (Commercial Off the Shelf Software) utilized should be the exact same release and version. Specific applications need to be identified, such as PowerPoint, MS Word and MS Access. Multiple versions of the same software should be discouraged. If WordPerfect 5.0 and 8.0 are used by two different attorneys working on the same case, a compatibility problem will arise because WordPerfect 5.0 (a 16 bit application) attempting to read a WordPerfect 8.0 document (a 32 bit application) will not work. This also presents a support problem for the Information Systems Staff. Supporting one iteration of software is difficult. Supporting multiple releases of the same software is next to impossible.
A "rule of thumb" when it comes to selecting software is to utilize the latest stable version. Often times new releases can be a bit volatile because the vendor neglected to test for certain real world conditions. Waiting 6 months to a year before implementing a new version of software is highly recommended for a few reasons. Initial problems/bugs are usually identified and fixed within this period (especially the major problems). COTS advance from being state of the art/cutting edge to proven state of the art, meaning that the latter has been thoroughly tested and is considered more stable than when it was initially introduced. Upgrading to a stable new version is also recommended because the manufacturer does not normally support legacy versions of software after a period of time. All efforts are usually put into the newest release, and only a small amount of staff might be assigned to supporting a predecessor version. Trying to get support for WordPerfect 5.0 when WordPerfect 8.0 is the most current release will be fruitless.
For instance, if WordPerfect Suite 8 is identified as the word processor of choice by the Configuration Management Board or Systems Manager, then that will be the only version of that word processor that should be used. No legacy versions of WordPerfect should be used and no other word processors should be utilized.
The Configuration Management strategy from a software point of view is very similar to the hardware approach. Identical software, to include versions, needs to be identified. Each system needs to have the exact same operating system, configured in the same way (to include the autoexec.bat and config.sys files). It is a growing trend in law offices, both private and government, to establish firm wide use of a particular version of COTS software. The upside is that only one version will need technical support. The downside is that people become attached to the particular version of a piece of software they are most used to. Incorporating a new piece of software adds a learning curve that many people are not comfortable with...GET OVER IT...GET USED TO IT. The computer business is ever changing and evolving (for the most part in a positive direction). Computer professionals continue to learn throughout their entire careers, building on the foundation of knowledge they have developed through education and/or on the job training. The legal field is also ever evolving, especially in the field of associated technology. A law firm's leverage of and commitment to technology, along with the required legal expertise, offers a clear competitive edge in delivering its services. In order to keep costs at a minimum and maximize profits, many law firms are exploring new ways to do business via technology.
The first three rules of computers in general are: Backup, Backup, Backup.
What is the value of information? In a legal proceeding it is potentially everything. If information is lost, is it easily retrievable? If the systems you're working on haven't been backed up, you run the risk of never having that information again. The most popular industry standard backup strategy combines daily, weekly and monthly backups. On a daily basis, new and changed files are backed up. On a weekly basis, new and changed program files are backed up, as well as all data files. On a monthly basis all files, new and old, are backed up. For a business that is open every day that would be 429 backup sessions per year (365, 52 and 12). For a business that is open during regular business hours the backup sessions will total 324 (260, 52 and 12). Backup media should be saved for a year at a time, and the monthly backup media should be saved forever as legacy support. Typically the best time to run a network backup is in the middle of the night when little to no processing is going on. In a 24 hour a day operation, a time does need to be designated for system backup, and again the most appropriate time is when the least amount of processing is going on. Most backup systems utilize DAT (digital audio tape) media and can be run from a preprogrammed script that is time activated.
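The session counts above fall out of simple arithmetic, which can be sketched as:

```python
# Backup sessions per year under the daily/weekly/monthly strategy
# described above: one daily backup per open day, plus 52 weekly
# and 12 monthly backups.
def sessions_per_year(open_days: int) -> int:
    weekly, monthly = 52, 12
    return open_days + weekly + monthly

print(sessions_per_year(365))   # 429  (open every day)
print(sessions_per_year(260))   # 324  (regular business hours, 5 days/week)
```

These figures match the 429 and 324 totals given in the text.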
The basic concept behind Disaster Recovery is being able to recover operating functionality after a disaster. A disaster can range from something as simple as a hard drive going bad to an entire office building and/or city being blown up. Another important aspect of system backup is designating an alternative off site storage facility. In a worst case scenario, where everything at the entire office is destroyed, backup tapes included, restoring information systems becomes a matter of starting from scratch. If backup tapes, such as the weekly and monthly ones, are stored at an offsite facility, the information lost could be as little as a few days' worth. The point here is to store your backup tapes at an offsite facility. If your firm has multiple offices located in other cities, it would be ideal for one of those offices to retain your backup tapes (and vice versa).
The designation of an alternative operating facility has origins in military operations. In certain situations, such as war, bases of operations are often changing. Having an alternative site set up in advance to run the associated computer systems is very valuable. The acronym the US Army used was COOP (Continuity of Operations Plan). The same holds true for a law office. A large ongoing court case generally has a war room on site or near the courthouse, and the same information is tracked back at the closest supporting law office. Often times law firms have multiple offices/locations. It is a very good idea to set up a contingency plan with the next closest office/facility in the firm. With backup tapes stored off site, restoration of the computer systems, even at limited availability, is better than going without.
War Room designation is often at or near the courthouse. This is usually a small room that is used for trial preparation and/or any type of reactionary production of work product based on trial requirements and proceedings. Depending on the size of the case, the ideal situation is to have all potential and actual evidence stored in the War Room. In a large case this might not be possible because of limited space. A second War Room is sometimes set up at the closest Firm office to the courthouse for larger processing of related work product. The Firm War Room can not only house case specific evidence but also a master repository of case related information from other trials and decisions.
Trial Presentation can be the difference between winning and losing. With all things equal, a case is decided on how well each side presents its view/perception of the facts. It is rare that all things are equal, and as a matter of fact the most persuasive side (in a close case) is often times the victor. Technology has taken a huge leap in the courtroom over the last decade. "Paperless Trial" is a term that is often used (but far from reality). The courts are still a ways away from accepting only electronic exhibits and evidence, but the day will probably come in the near future. Extremely high profile, technology savvy/driven cases such as the OJ Simpson and the Oklahoma City Federal Building Bombing cases are starting to set a standard for future trials. Judges across the country are accepting/embracing the new technology.
There are a number of different display types available. The most common include computer monitors, video monitors, front projection and rear projection. Considerations include the size of the courtroom, what the judge will allow, and budget concerns. The sky is the limit on what is available. Larger displays do get the point across in a more impressive fashion. Whatever is selected should be first approved by the court and second, thoroughly tested prior to using it in court.
The two most popular COTS programs for presentation are PowerPoint and Corel Presentation. Both have excellent capabilities and are worthy choices. If at all possible, only use a single program and version. The advantages to using a single presentation package include only having to install, support and train on one program.
There are a few software packages designed specifically for trial use. Document Director/Trial Director, a COTS program from InData, is one of the most comprehensive. This program is fairly robust in what it has to offer. Document Director is the part of the software that organizes the multimedia portion of a court presentation (to include documents, pictures, video and audio, with an Access database backbone and Summation for full text searching capabilities). Trial Director is the actual presentation part of the software. Because of the potential complexity of a trial presentation, a designated operator is recommended (such as a technical paralegal).
Graphics are a very effective tool for getting evidentiary points across. The biggest mistake in utilizing graphics is making one that's too busy...as in information overload. In trying to draw in a Judge and/or Jury's attention, a graphic needs to offer clear and concise points (and not too many at once). A good rule of thumb is to limit a graphic to four major points. If more than four points need to be made, additional graphics need to be generated that focus in on additional information. Typically the big picture is first explained, followed by snapshots of the big picture.
One of the biggest mistakes is trying to display a single time line with twenty major points. Attention spans, and the importance of following an understandable critical path while getting to a particular point, need to be considered when preparing graphics.
Not too long ago, Judges in general would limit the amount of high tech equipment allowed inside their respective courtrooms. For the most part that has changed (although some judges still do not allow many of the available technology tools into their courtrooms). Often times the clerk of the court will advise each side as to what technology is acceptable and will also specify certain formats and programs that each side will adhere to. The thought behind all of this new technology is to better entertain the Judge and Jury (if one is present), in an attempt to emphasize/clarify a particular point or fact. An expert with only verbal testimony (be it long winded or not) will often, over time, bore a Judge and Jury. Key points and facts could very well not make the impact that they were intended to make because of how they were presented. A multi-media presentation with video, audio, markup of documents, and associated graphics leaves a lasting impression because it is presented in an entertaining fashion.
A War Room is the strategic focal point of the support staff (and of the attorneys during breaks, to include evening and night support if needed). Some of the work performed in a war room includes transcript processing, exhibit tracking, daily trial preparation, replication of all designated trial computer systems, and reactionary damage control.
Transcript processing essentially is making the most recently completed trial session accessible via full text searching. This is accomplished by taking an electronic copy produced by the court reporter and installing/indexing it in some type of full text repository. Longer trial days are typically broken into AM and PM sessions. This is an effective tool for tracking trial proceedings, and most useful for tracking possible discrepancies between what has transpired and what is happening at that particular time.
Exhibit tracking tracks all the pertinent information concerning the exhibit relationships. The relationships tracked include exhibit ID, date of the exhibit, description, associated witnesses, issues related to, relevance and other related evidentiary items as identified by the trial team.
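An exhibit record along the lines just described can be sketched as a small structured type. The field names follow the relationships listed above; the sample values are invented for illustration:

```python
from dataclasses import dataclass, field

# One record per exhibit; additional fields would be added as the
# trial team identifies other evidentiary relationships to track.
@dataclass
class Exhibit:
    exhibit_id: str
    exhibit_date: str
    description: str
    witnesses: list = field(default_factory=list)
    issues: list = field(default_factory=list)
    relevance: str = ""

ex = Exhibit("P-101", "1998-03-14", "Signed purchase agreement",
             witnesses=["J. Smith"], issues=["breach of contract"])
print(ex.exhibit_id, ex.witnesses)   # P-101 ['J. Smith']
```

In practice these records would live in the same relational database as the document collection, so that an exhibit can be joined to its witnesses and issues for daily trial preparation.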
Daily trial preparation includes preparing for the day's expected witnesses, all associated evidentiary items related to those witnesses, trial transcript processing, and completing all tasking that the attorneys have requested for the upcoming day in court.
Each of the trial-designated computer systems will need its files updated to reflect the changes made (mostly the updated index of the associated trial transcripts and the exhibit tracking information).
Reactionary damage control is pretty much what it sounds like. As problems arise, strategies often change. New document search requests and summaries of existing information are examples of the type of information desired, and a quick response is usually part of the request.
A rule of thumb when it comes to collecting documents is to grab everything. Documents can later be sorted for relevance. Duplicate documents are also welcome because two or more copies of a document are much better than no copy. Often, documents that appear to be duplicates are not, because they have unique marginalia.
Depending on the size of a document collection, a document naming convention is either built into the document identification (DOC ID) or, in larger collections (over 100,000 images), associated via an index from a document database.
Image naming conventions need to identify each page of a collection as unique, follow some type of sequence, and be simple enough to understand that non-technical types can follow the logic. Most filename conventions are based around operating system and program limitations. For example, Windows 95/98/NT uses a 3-character file extension to denote the program a file is associated with. An example would be 12345678.tif. In the MS-DOS world the file extension could be utilized for sequencing. The problem is that programs are designed to run on the newer operating systems, and for the most part legacy versions (such as the latest versions of MS-DOS and Windows 3.11, both 16-bit operating systems) are no longer maintained. Many of the newer applications require 32-bit compatibility, which requires Windows 95 or higher, including Windows NT.

Image naming conventions can account for extremely large collections. The use of sequential numbering makes the most sense, but the file naming can utilize alpha characters as well. An 8-character combination offers the possibility of 2,821,109,907,456 unique images. That would be 10 numeric plus 26 alpha characters for a total of 36 possibilities in each character position, from the most significant to the least significant. That is represented by:
36 x 36 x 36 x 36 x 36 x 36 x 36 x 36 = 2,821,109,907,456.
The purely numeric possibilities would be only 99,999,999. This is more than sufficient provided the document collection is static and 100% error free, which is rarely the case. Another consideration is to reserve the three least significant positions for pages inserted after the fact. Re-scanning large documents is very time consuming and expensive. Now that you've read this, let me tell you that you'll probably never approach numbers such as these. Cases in general routinely have nowhere near this amount of documents. In the rare case that you take on a case involving a document collection this large, you will have an understanding of how to uniquely identify each image.
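The 36-possibilities-per-position scheme above is simply base-36 encoding of a sequence number. A minimal sketch (the function name is illustrative):

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 36 symbols per position

def image_id(n: int, width: int = 8) -> str:
    """Encode a sequential page number as a fixed-width base-36 image ID."""
    if not 0 <= n < 36 ** width:
        raise ValueError("sequence number out of range")
    chars = []
    for _ in range(width):
        n, r = divmod(n, 36)
        chars.append(DIGITS[r])
    return "".join(reversed(chars))

print(36 ** 8)             # 2,821,109,907,456 unique IDs, matching the figure above
print(image_id(0))         # 00000000
print(image_id(12345678))  # the 12,345,678th page, still 8 characters wide
```

Zero-padding to a fixed width keeps the IDs in simple alphabetical order, which is part of what makes the convention easy for non-technical staff to follow.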
If a document collection is fewer than 10,000 pages (9,999 or less), the best approach is to utilize a unique, sequential numeric image ID numbering scheme. This is simple and easy to understand. If a collection is over 10,000 pages, utilize the numeric possibilities first (this will create less confusion in the long run).
Some of the very best computer applications in use today are simple and easy to understand. Applications in general should add sophistication only as needed. Simply stated, a computer receives information, processes it, and then distributes the processed information. As a system requires more sophistication, it should be added; removing sophistication often causes more problems than it solves. This is very similar to cooking. It is easy to add oregano to spaghetti sauce, and the best approach is to add a little at a time. If too much oregano has been added, it's almost impossible to remove it from that same sauce (I could have used garlic... but I don't believe in too much garlic).
The same is often true with computer applications. System design is based around defining the information problem, developing a solution, and making that information available to all who require it, in a fashion that the lowest-level user (also known as the lowest common denominator) is able to use and understand.
Document indexing is very important. Every document is assigned a unique identification number. This unique number is related to other information about the document, such as the images it contains, its bibliographic coding, and the associated full text. Once an index becomes corrupted, it is virtually useless until it has been rebuilt. Indexing is the reason an electronic document collection is so much better than a manual collection (especially in larger collections). To find a document manually, you have to know its unique identification, go to the room where it's kept, pull it out of the file cabinet, copy it, and then return it to its correct location. The biggest risk is putting the document back into the wrong place. Electronic indexing will maintain all this information and, in an efficient system, will perform all of this in a fraction of the time it takes to do manually.
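As a sketch of the DOC-ID-to-everything-else relationship, the snippet below builds a tiny relational index in memory. The table and column names are illustrative assumptions, not any product's actual schema.

```python
import sqlite3

# In-memory sketch of a document index: each DOC ID is related to its
# page images and its full text. Schema is hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, doc_date TEXT, doc_type TEXT);
    CREATE TABLE pages (image_id TEXT PRIMARY KEY,
                        doc_id TEXT REFERENCES documents(doc_id),
                        seq INTEGER);
    CREATE TABLE fulltext (doc_id TEXT REFERENCES documents(doc_id), body TEXT);
""")
db.execute("INSERT INTO documents VALUES ('DOC000123', '1997-05-02', 'memo')")
db.executemany("INSERT INTO pages VALUES (?, 'DOC000123', ?)",
               [("00000456", 1), ("00000457", 2)])
db.execute("INSERT INTO fulltext VALUES ('DOC000123', 'Re: quarterly pricing...')")

# One indexed lookup retrieves every page image of a document, in order
rows = db.execute(
    "SELECT image_id FROM pages WHERE doc_id='DOC000123' ORDER BY seq").fetchall()
print([r[0] for r in rows])
```

The query stands in for the manual trip to the file room: given only the unique DOC ID, the index returns the pages (and could equally return coding or full text) without anything ever being misfiled.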
As with many computer-related applications, OCR is relatively young. At the age of 40-something, OCR has developed into a very effective tool for processing text-based documents/bit-mapped images into searchable ASCII data/text. The basic process is to take hard copy documents and scan them into bit-mapped images using the industry standard (and most widely accepted) format of TIFF Group IV. To become usable/searchable text, the bit-mapped images need to be converted into ASCII data (or some other readable format), and the OCR software performs this conversion.

Optical character recognition in its purest sense recognizes machine-printed characters. The better the shape of a document, like a freshly printed e-mail or an article out of yesterday's newspaper, the higher the accuracy. Documents that are deemed ugly because of poor quality, like a copied onionskin typed letter from, say, 1940 (or a newspaper article from the same era), are potentially still of value and often worth processing, with the right tweaking of the OCR software. The sophistication and power of OCR technology available today is comparable to a middle-aged triathlon champion... bring on the power bars. OCR technology made great strides in the late '90s and is considered essential for converting documents into a searchable format. The other option is to re-type the entire collection... let's not even go there.
The use of neural networks (a form of artificial intelligence), topological analysis (deciphering a mapped version of an image), and tunable fuzzy logic (the ability to recognize portions of words and thus take advantage of less-than-perfect text, a.k.a. dirty text) all make for a very powerful tool/application. The larger the litigation (in terms of document population), the more important the OCR application becomes. Depending on the litigation, the use of OCR involves primarily the recognition of text. Other possible uses include voting technology, content recognition, and processing of forms. First and foremost, the documents need to be scanned.
Before any OCR processing starts, documents have to be scanned and stored in some type of database (by all means, if you have the option to choose your own database, select a relational one... you'll thank me later... I'm not kidding). The options for scan rates (the level the scanner software should be set at to create the bit-mapped images of the document collection) for TIFF Group IV files typically range from 200 DPI to 400 DPI (higher settings are available and are sometimes used for court exhibits... up to as much as 1400 DPI). The current thought is to scan at 300 DPI. The quality is slightly better at 300 DPI than at 200 DPI, and 400 DPI tends to scrutinize lower-quality images to the point that the OCR engine becomes confused. The reason for the confusion is that at 400 DPI the pixelation (each dot represents a pixel, be it black, white, or a shade of gray) is more defined, calling into question a character that at 200 or 300 DPI would tend to be more filled in. 300 DPI is also a better choice when it comes to reproduction (in hard copy form or blown up for courtroom display). The difference from 200 to 300 DPI is quite noticeable, but 400 DPI isn't that much better than 300 DPI. The average file size for a 200 DPI scanned image is about 25K. The average file size at 300 DPI is about 35K, and 400 DPI images are only slightly larger and in some instances smaller (the more white in an image, the smaller it is). Reasons for scanning a document collection at 200 DPI would include extremely large collections that take up an extraordinarily large amount of disk space (in situations like this, subsets of the document collection are often prepared at a higher DPI rate for trial presentation purposes). If the documents are in pristine shape, 200 DPI is quite acceptable. Many OCR applications have scanning options built in and can be set up to scan and then OCR in the same session.
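To see why the DPI choice matters for storage, here is a quick back-of-the-envelope calculation of the uncompressed pixel counts behind those 25K-35K compressed figures. The page size and 1-bit-per-pixel assumption are mine; TIFF Group IV compression then shrinks the raw data by a ratio that depends on how much of the page is white.

```python
# Rough, uncompressed sizes for a bilevel US Letter page (8.5 x 11 in)
# at the scan resolutions discussed above.
for dpi in (200, 300, 400):
    pixels = int(8.5 * dpi) * int(11 * dpi)   # width x height in pixels
    raw_kb = pixels / 8 / 1024                # 1 bit per pixel, before compression
    print(f"{dpi} DPI: {pixels:,} pixels, ~{raw_kb:,.0f} KB uncompressed")
```

Doubling the DPI quadruples the raw pixel count, which is why a million-image collection scanned at 400 DPI instead of 200 DPI is a real disk-space decision even after compression.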
OCR processing is often performed during off-hours (evenings, nights, and weekends). The main requirements are computer processor time and instructions on what needs to be performed (this is accomplished by a batch file/program containing a set of instructions, including the start time, the actual OCR processing instructions, where to store the results, and the end point at the completion of the job).
Recognition of text is the OCR function most often utilized in a Litigation Automation environment. One of the biggest obstacles for OCR software is the number of fonts it has to interpret. Non-typical fonts can be a processing nightmare. For the most part, a single OCR engine will isolate one character at a time and attempt to interpret what it believes it sees (followed by the characters next to it, and so on). For example, the number 5 and the letter S could be confused in a non-standard font. A standard font would be defined as a firm-wide font of choice that in most cases is very readable/conservative at a height between 10 and 12 points (examples include Times New Roman, Arial, and Courier). OCR functionality tends to dwindle at character heights between 6 and 8 points.
Most periodicals have adopted a standard font, although an occasional article is printed in a hand-print-style font... this is like kryptonite to the normally superhuman OCR engine... more of a job for an ICR application (intelligent character recognition, of hand print... great for processing hand-written forms). Not to be confused with HRS (handwriting recognition software), a methodology directed at recognizing cursive... that's neat cursive, by the way... you sloppy writers will probably never get the recognition you deserve, at least not in a machine sense.
OCR voting technology is a relatively new approach to OCR processing. OCR engines are all based on somewhat unique algorithms, so they each interpret what they see in a slightly different manner. A combination of 3 or more engines can produce somewhat different results when it comes to deciphering a problem character. A voting technology algorithm will combine the results of what each engine thinks it sees, along with a confidence factor, basically casting a weighted vote. The result is a consensus of what the combined engines have voted on. Voting technology can increase recognition results dramatically. Some companies claim 60% to 85% improvements when certain algorithms are combined with particular engines. These numbers apply when an expert has analyzed the documents and applied the appropriate engines. Realistically, a 30% increase is more likely when applied by a competent data processing person, and that is still considered significant. With the advancement of "fuzzy search" capabilities, dirty text (OCR-generated text that has not been cleaned up, as in correcting the apparent errors) has become much more valuable.
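The weighted-vote idea can be sketched in a few lines. This is an illustrative toy, not any vendor's actual algorithm; real products weight engines per-font and per-document as well.

```python
from collections import defaultdict

def vote(readings):
    """Combine per-engine character guesses into a consensus.

    `readings` is a list of (character, confidence) pairs, one per OCR
    engine; each engine's confidence acts as its vote weight.
    """
    tally = defaultdict(float)
    for char, confidence in readings:
        tally[char] += confidence
    return max(tally, key=tally.get)

# Three engines disagree on a degraded character: is it '5' or 'S'?
print(vote([("5", 0.62), ("S", 0.55), ("5", 0.48)]))  # consensus: '5'
```

Even though the single most confident 'S' reading beats one of the '5' readings, the combined weight of two engines agreeing on '5' carries the vote, which is exactly how the consensus can out-perform any single engine.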
The downside to voting technology software is that it is fairly expensive and often requires a dedicated file server. Tests of Prime Recognition (one of the originators of OCR voting technology) comparing Prime 3 (a 3-OCR-engine combination) with Prime 5 (a 5-engine combination) produced only slightly better results with Prime 5 (less than 5% better), at a cost per license about a third more per seat. Prime 3 is still pricey, but deemed the better value.
Large document collections are excellent candidates for the use of voting technology. In a million plus image population, it is extremely difficult (and expensive) to have each piece of paper read/scanned by human eyes for potential value. In theory, "smoking gun documents" are often known at the beginning of a case and the relative words and phrases that apply are also defined. Searches on the full text of a case document population/enterprise can often reveal the majority (if not all) of related documents.
Processing of forms via OCR deals with predictable locations of structured information. All forms have defined areas/zones that contain very specific information. The data fields are designated in the OCR software as zones by their locations and attributes. The attributes can also be predictable, including alpha, numeric, and alphanumeric. Zip codes (5 or 9 numeric characters), social security numbers (9 numeric digits), dates (day and/or month and/or year), and country are all examples.
Form removal software is required to remove the various boxes, lines, and extraneous text, leaving the unique machine-printed information for capture and subsequent processing. The accuracy of forms processing is often greater than plain text recognition because the fonts used are very readable/standard/conservative (as mentioned earlier, examples include Times New Roman, Arial, and Courier). Form templates define the parameters that the OCR engine should anticipate. Validation routines can compare interrelated information fields, and this helps ensure a higher-quality work product.
Verification tables can also assist in information validation for things such as part numbers or state abbreviations. If the forms being processed represent a distinctive or unique type of application, a relevant dictionary can be developed to assist in the assurance of data integrity. Machine printed forms that might be captured include bank statements, time cards, medical records, invoices and receipts.
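A validation routine for the zone attributes mentioned above can be as simple as a pattern per zone. The zone names and patterns below are illustrative assumptions (e.g. the date zone assumes MM/DD/YYYY), not a complete forms-processing rule set.

```python
import re

# Hypothetical zone validators: each zone's expected attribute
# (numeric, date, etc.) is expressed as a pattern, mirroring the
# zip code / SSN / date examples above.
ZONE_PATTERNS = {
    "zip":  re.compile(r"\d{5}(-\d{4})?"),    # 5 digits, or 5+4
    "ssn":  re.compile(r"\d{3}-\d{2}-\d{4}"), # 9 digits with dashes
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"), # MM/DD/YYYY
}

def validate_zone(zone: str, value: str) -> bool:
    """True if the captured value matches the zone's expected attribute."""
    return bool(ZONE_PATTERNS[zone].fullmatch(value))

print(validate_zone("zip", "90210"))       # True
print(validate_zone("ssn", "123-45-678"))  # False: one digit short
```

A failed validation flags the field for human review rather than letting a mis-read digit flow silently into the database, which is the "higher-quality work product" the validation routines are after.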
The objective of content recognition is to reproduce a document very similar to the original while allowing all the associated text to be converted into a machine-readable format. The upside is that once an image is converted, it takes up much less space than the original bitmap version. This can be performed utilizing the PDF file format (developed by Adobe Systems). The PDF format converts all of the characters that it can and creates small bit-map representations of characters it can't understand (and large bit-map versions of pictures or graphics, which it also doesn't comprehend as characters).
There are many different OCR products on the market with a multitude of features and capabilities. OCR software is capable of interpreting text in photos, illustrations, and graphs. OCR software also has the capability of retaining the original document format. Overall OCR accuracy will depend on the shape of the document collection. This is not simply the job of the OCR engine but actually the combination of the user interface and supporting application software in concert with the OCR engine. Document collections can often be somewhat unique. Before choosing an OCR application, define the type of performance that will best complement the collection. Do your homework! The options are numerous. Some of the features to consider in choosing an OCR application package include:
- user definable page zones,
- page segmentation,
- image cleanup tools/filters like de-speckling (the removal of extraneous subtle markings on a page) and de-skewing (correcting the angle of the text orientation),
- image enhancement for sharpening the overall quality of the bit-mapped image,
- network capability,
- auto-page rotation,
- multiple language capability,
- grayscale filters (this is a better representation of a color bit map),
- automatic font training (the OCR application interprets a different font),
- barcode recognition and,
- preservation of table objects.
Other considerations include output formats. The most common is ASCII (the ASCII character set, the most widely used), followed by TXT (generic text). Various cases come across unique/standard file formats including:
- PDF (Portable Document Format, the Adobe file format),
- HTML (web based),
- RTF (rich text format),
- XLS (Excel Spreadsheet Format),
- DOC (Microsoft Word),
- WPD (WordPerfect),
- Microsoft Office compliant,
- and all other major word processing formats.
The good news is that most of these file formats can be converted into ASCII or TXT formats. Processing with one common file format for full text is a very good idea. The full text should always have a pointer to the original document so that chain of custody and authenticity issues are preserved.
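One way to picture "full text with a pointer back to the original" is a record that carries the source path plus a fingerprint of the extracted text. The field names and the use of a SHA-1 hash are my illustrative assumptions, not a prescribed chain-of-custody standard.

```python
import hashlib

def fulltext_record(doc_id: str, original_path: str, text: str) -> dict:
    """Bundle converted full text with a pointer to its source file."""
    return {
        "doc_id": doc_id,
        "original": original_path,                        # pointer to source file
        "sha1": hashlib.sha1(text.encode()).hexdigest(),  # fingerprint of this text
        "text": text,
    }

rec = fulltext_record("DOC000123", "images/00000456.tif",
                      "Re: quarterly pricing")
print(rec["original"])  # the converted text always knows where it came from
```

If authenticity is ever challenged, the pointer leads back to the unaltered source image and the fingerprint shows the stored text has not changed since conversion.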
Document Retention Planning is the process by which documents are identified to be saved, with the remaining documents discarded in a proper fashion. The driving force behind a Document Retention Plan is often an Inadvertent Disclosure. An Inadvertent Disclosure occurs when one side of a case turns a privileged and/or confidential document over to opposing counsel. An Inadvertent Disclosure can often lead to the reopening of Discovery. In an effort to reduce the impact of reopening Discovery, a Document Retention Plan is often implemented. Only essential documents are retained (both paper and electronic copies). Legal and ethical issues do apply as to what can be destroyed. If an audit trail that will be or was turned over in discovery exists, all associated documents and work product need to be retained.
Electronic document retention refers to any computer-generated document, such as e-mails, word processing documents, and databases. Electronic Discovery is subject to the same scrutiny and rules that paper-based Discovery must abide by. There are a number of major concerns about electronic documents in terms of retention. An electronic document can exist in multiple locations. When an electronic document is created, it is often distributed to others in electronic form. Documents are often saved to network drives, local drives, diskettes, and backup systems. A deleted electronic document is quite easy to recover, as long as the disk hasn't been reformatted or defragmented. Reformatting any type of medium (floppy disk, hard drive, digital tape) is the best way to remove legacy information.
Joint Application Design (JAD) is an approach to system design that teams together both the users and the technical staff in an attempt to create robust solutions. JAD is traditionally applied to software design applications but is also starting to be applied in the customization of COTS products. The idea behind JAD is to capture as many of the desired software capabilities of the users as possible, within a realistic design, that the technical staff will be able to produce.
Copyright © 2001 B. Bruce Barton. All rights reserved. No portion of this article may be reproduced without the express written permission of the copyright holder. If you believe you may lawfully use a quotation, excerpt or paraphrase of this article under the Fair Use exception to copyright law, except as otherwise authorized by the author of the article, you must cite this article as a source for your work and include a link back to the original article from any online materials that incorporate or are derived from the content of this article.