By Edward Sohn, Esq., Assistant Vice President, Litigation Solutions, Pangea3
Predictive coding has become a buzzword in e-discovery, but when it comes to implementing it in your e-discovery processes, it can be a daunting topic to tackle. If you have not yet become acquainted with predictive coding, there is no shortage of articles and blogs available to bring you up to date. For a highly comprehensive framework of the topic, Maura Grossman and Gordon Cormack have published a glossary of technology-assisted review, “The Grossman-Cormack Glossary of Technology-Assisted Review,” updated as of January 2013. To help get your bearings in the brave new world of predictive coding, here are ten important predictive coding concepts.
1. Technology Assisted Review: More Than Predictive Coding
“Technology-assisted” review (sometimes referred to as “computer-assisted” review) describes the integration of technology into the process of human document review. Although predictive coding and technology-assisted review are sometimes used interchangeably, predictive coding is actually just one manifestation of review that is assisted by technology. Technically, even keyword search terms could be cleverly deployed in a technology-assisted review workflow. But technology-assisted review also includes tools that identify near-duplicate documents and latest-in-thread emails. Sometimes the term “advanced analytics” is used for some or all of these tools as well.
2. Seed Set: The Reference for Relevance
In predictive coding workflows, a seed set is a sample of the document universe. This sample set of documents is reviewed by subject matter experts. The determinations made on the seed set comprise the primary reference data to teach the predictive coding machine how to recognize patterns of relevance in the larger document set. Based on the calls made in the documents in the seed set, the computer will be able to predict categorizations for the remaining documents in the larger universe.
Different service providers take different approaches to the seed sets. Some believe that the most representative seed set is a completely randomized sample, constructed without regard to specific fact issues or other knowledge about the documents. Other technology providers take the approach that focuses the machine learning on specific concepts highlighted by the subject matter experts and assembled using criteria from metadata or search term results. Still other providers will take a hybridized approach, employing both. There is debate as to which approach is superior, and the different approaches often are best suited for different technologies.
If predictive coding is being used for other determinations outside of general relevance, like responsiveness to specific issues or assertions of attorney-client privilege, more seed sets may be required. Some tools are capable of learning multiple topics as applied to a single document in the seed set, while other tools require multiple sets.
The important thing to know about seed sets is that they are how the computer learns. It is critical that a seed set is representative and reflects expert determinations.
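To make the competing approaches concrete, here is a minimal sketch in Python of the three seed-set strategies described above: random, judgmental (keyword-driven), and hybrid. The function and document names are purely illustrative and do not correspond to any vendor's tool.

```python
import random

def random_seed_set(doc_ids, size, rng_seed=42):
    """Draw a simple random sample of document IDs to serve as a seed set."""
    rng = random.Random(rng_seed)  # fixed seed so the sample is reproducible and auditable
    return rng.sample(doc_ids, size)

def keyword_seed_set(documents, terms, size):
    """Judgmental sample: documents that hit any of the expert-chosen terms."""
    hits = [doc_id for doc_id, text in documents.items()
            if any(term in text.lower() for term in terms)]
    return hits[:size]

# Hypothetical mini-universe of documents
docs = {1: "Merger agreement draft", 2: "Lunch plans", 3: "Agreement on pricing"}

# Hybrid approach: union of a random sample and a keyword-driven sample
seed = set(random_seed_set(list(docs), 2)) | set(keyword_seed_set(docs, ["agreement"], 2))
```

Whichever strategy is used, the resulting seed set would then be routed to subject matter experts for the relevance calls that train the machine.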
3. Iterative Learning: On-The-Fly Adjustments
Iterative learning refers to the predictive machine’s ability to continue learning even as the main review begins. This can happen through adjusted sets, like new seed sets, that refine the learning, or the machine may incorporate live review data on an ongoing basis. A machine’s ability to learn iteratively is especially helpful when the document dataset is not static. If time pressures require that document review and production commence before all of the documents have been collected and processed, which is often the case, a predictive coding machine’s ability to learn iteratively allows for adjustment to a dynamic document universe.
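The shape of iterative learning can be sketched with a deliberately simple toy model: a learner that folds in new batches of reviewer calls as they arrive and scores documents against the vocabulary it has seen so far. Real predictive coding engines use far more sophisticated statistical models; this is only an illustration of the batch-by-batch update pattern.

```python
from collections import Counter

class IterativeModel:
    """Toy incremental learner: keeps running token counts for relevant and
    non-relevant documents and scores new text by which vocabulary it
    resembles more. Illustrative only, not a production algorithm."""

    def __init__(self):
        self.relevant = Counter()
        self.other = Counter()

    def learn(self, batch):
        """Fold a batch of (text, is_relevant) review calls into the model."""
        for text, is_relevant in batch:
            target = self.relevant if is_relevant else self.other
            target.update(text.lower().split())

    def score(self, text):
        """Positive score = looks more like the relevant vocabulary."""
        return sum(self.relevant[w] - self.other[w] for w in text.lower().split())

model = IterativeModel()
model.learn([("merger price terms", True), ("fantasy football picks", False)])
model.learn([("merger closing date", True)])  # later review calls refine the model
```

The second `learn` call is the iterative step: calls made during live review adjust the model without retraining from scratch.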
4. “Garbage In, Garbage Out”
This has become something of a maxim in predictive coding circles. It simply means that if the seed set receives erroneous judgments, the predictions generated from it will multiply those errors. It is critical to have access to subject matter experts for the seed set. A robust predictive coding workflow has ways to compensate for seed set errors found throughout the review, but correcting errors introduces inefficiency into the process.
5. Predictive Metric 1: Precision
There are two major metrics used to measure the efficacy of a predictive coding tool. The first is precision, which refers to the fraction of documents in the computer-identified set that are actually responsive. If the computer, in trying to identify relevant documents, identifies a set of 10,000 documents, and after human review, 7,500 out of the 10,000 are found to be relevant, the precision of that set is 75%. Currently, the precision of search-term-driven document sets is much lower: most of the documents in most traditional document reviews are ultimately non-responsive.
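The calculation itself is a one-liner; this small sketch reproduces the 75% figure from the example above (the function name is ours, not any tool's):

```python
def precision(num_relevant_in_set, set_size):
    """Fraction of the machine-identified set that reviewers confirm as relevant."""
    return num_relevant_in_set / set_size

# The example above: 7,500 relevant documents out of 10,000 identified
p = precision(7500, 10000)  # 0.75
```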
6. Predictive Metric 2: Recall
Recall refers to the fraction of all relevant documents in the entire dataset that are ultimately discovered. In a data universe of 200,000 documents, assume 30,000 documents are selected for review as the result of search terms or predictive coding technology. If 20,000 documents within the 30,000 are ultimately found to be responsive, the selected set has a precision of 66%. But if another 5,000 relevant documents are found in the remaining 170,000 that were not selected for review, that means the set selected for review has a recall of 80% (20,000 / 25,000).
If this seems confusing, consider this rough fishing analogy. Think of the predictive coding tool as casting a net into the waters of the document universe. Recall is how well you captured the majority of relevant documents from the universe in your net. Once you dump out the contents of the net, precision measures how many documents caught inside the net actually ended up being relevant.
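Using the numbers from the example above, the two metrics can be checked side by side (again, the function name is illustrative):

```python
def recall(relevant_retrieved, relevant_total):
    """Fraction of all relevant documents in the universe that the selected set captured."""
    return relevant_retrieved / relevant_total

# The example above: 20,000 relevant documents found in the selected 30,000;
# another 5,000 relevant documents turned up in the 170,000 left behind.
selected_relevant = 20000
missed_relevant = 5000
r = recall(selected_relevant, selected_relevant + missed_relevant)  # 0.80
precision_of_set = selected_relevant / 30000                        # roughly 0.67
```

In the fishing analogy, `recall` measures how much of the catch made it into the net, and `precision_of_set` measures how much of what is in the net was worth catching.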
7. Statistical Sampling
Both precision and recall can be measured through statistical sampling. This can reduce the number of documents reviewed while still providing a sense of how effectively the predictive workflow is working. Suppose that, after substantial review of documents within a predictive workflow, 500,000 unreviewed documents remain, and a statistically randomized sample of the 500,000 reveals that less than 1% of the sample is responsive. At that point, counsel may declare the review ended and deem the corpus of remaining unreviewed documents an unlikely source of discoverable evidence.
Of critical importance in statistical sampling is the sample size. A formula takes as inputs your desired confidence level (statistical certainty) and confidence interval (margin of error of the statistical estimate) and returns a sample size statistically sufficient to achieve that confidence. The confidence levels and intervals should be tailored for maximum defensibility. Even with a confidence interval of plus or minus 1%, a sample size of 20,000 documents or less can be sufficient for a universe of 400,000 documents or more. However, the exigencies of high-stakes litigation and a non-uniform or intentionally biased universe of documents may require a larger sample size than what is calculated as a matter of statistical integrity.
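The sample-size formula described above is the standard one for estimating a proportion, using the worst-case assumption that half the documents are responsive, plus a finite-population correction. This sketch (function name ours) bears out the claim that, even at 99% confidence with a 1% margin of error, a universe of 400,000 documents needs a sample well under 20,000:

```python
from math import ceil
from statistics import NormalDist

def sample_size(confidence, margin, population=None, p=0.5):
    """Sample size needed to estimate a proportion at the given two-tailed
    confidence level and margin of error, with an optional
    finite-population correction. p=0.5 is the worst case."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # critical value, e.g. ~2.576 at 99%
    n0 = (z ** 2) * p * (1 - p) / margin ** 2       # infinite-population sample size
    if population:
        n0 = n0 / (1 + (n0 - 1) / population)       # finite-population correction
    return ceil(n0)

# 99% confidence, plus or minus 1%, universe of 400,000 documents
n = sample_size(0.99, 0.01, population=400000)      # roughly 16,000 documents
```

Note how weakly the result depends on the population size: the confidence level and margin of error drive the number, which is why sampling scales so well to large document universes.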
8. Concept Clustering
Concept clustering refers to a set of tools that examines when certain words appear together in documents. The computer can generate concept clusters without any human input, finding documents that share the same combinations of words. Concept clusters can be used for smart prioritization of documents for review, and some predictive coding tools have effectively employed concept clustering at the heart of their algorithms.
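A bare-bones sketch of the idea: represent each document as a bag of words, measure word-overlap with cosine similarity, and greedily group documents that resemble one another, with no human input. Commercial tools use far richer algorithms; the names and threshold here are purely illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(documents, threshold=0.3):
    """Greedy clustering: a document joins the first cluster whose representative
    (first) document it resembles; otherwise it starts a new cluster."""
    clusters = []  # list of (representative bag-of-words, [doc indices])
    for i, text in enumerate(documents):
        bag = Counter(text.lower().split())
        for rep, members in clusters:
            if cosine(bag, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((bag, [i]))
    return [members for _, members in clusters]

docs = ["merger price agreement", "agreement on merger price",
        "fantasy football league", "football league schedule"]
groups = cluster(docs)  # the merger documents and the football documents separate
```

Documents that share word combinations end up grouped together, which is exactly the property that makes clusters useful for prioritizing review batches.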
9. Judicial Acceptance
Do courts endorse predictive coding as a defensible approach to e-discovery? The answer to that question is still being worked out, but so far, both federal and state courts have generally approved the use of predictive coding technology as a means of fulfilling discovery obligations and keeping costs proportional to the scale of the controversy. However, in the face of severe sanctions for discovery misconduct and “discovery about discovery,” you should always have a process expert track the process’s transparency and maintain a strong audit trail that will help establish a highly defensible predictive process.
10. Developments in Litigation Strategy
Keep an eye out for emerging strategic nuances in litigation strategy. For instance, a federal court recently opined that a combination of keyword search terms and predictive technology was sufficient to meet the responding party’s discovery obligations (see In re Biomet M2a Magnum Hip Implants Products Liability Litigation, No. 3:12-MD-2391 (N.D. Ind. April 18, 2013)). Predictive coding is often described as a cost-saver for defendants responding to sweeping discovery requests, but in the Biomet case, it was actually the plaintiffs who were pushing for more predictive coding to be used in order to achieve higher recall of responsive documents. The court found that the defendants’ review process was defensible, but the plaintiffs’ argument is one to watch for in the future. In the world of keyword searches, counsel could negotiate a list of words that connect directly to the documents being produced; the list of keyword search terms was critical to defining the agreed-upon scope of discoverable evidence. As complex linguistic algorithms in predictive coding replace the deployment of simple keyword searches, the nature of litigation strategy in discovery will invariably be affected.