
Ediscovery White Paper (US)

The Ultimate Predictive Coding Handbook

A comprehensive guide to predictive coding fundamentals and methods.

The Ultimate Predictive Coding Handbook by KLDiscovery

Copyright 2018 LDiscovery, LLC. All rights reserved. All other brands and product names are trademarks or registered trademarks of their respective owners. This document is neither designed nor intended to provide legal or other professional advice, but is intended to be a starting point for research and information on the subject of electronic discovery. While every attempt has been made to ensure the accuracy of this information, no responsibility can be accepted for errors or omissions. Recipients of information or services provided by KLDiscovery shall maintain full, professional and direct responsibility to their clients for any information or services rendered by KLDiscovery.

Contents

What Is Predictive Coding?
Training the Predictive Coding System
Predictive Coding Workflows in Ediscovery
Validating Predictive Coding with Sampling
Conclusion

What Is Predictive Coding?

In simple terms, predictive coding is the use of a computer system to help determine which documents are representative of a defined category. The system performs this classification based on training it receives via human input (i.e., machine learning). By utilizing machine learning, the system can classify documents with remarkable accuracy, even documents humans have not yet seen. In legal matters, predictive coding is most commonly used to identify documents that are relevant to a legal proceeding.

Predictive Coding in Ediscovery

For decades, litigants have relied on combinations of text/metadata searching and costly attorney review as a means of dealing with large volumes of data in ediscovery. As the growth of big data continues to exceed the economic feasibility of such approaches, the legal industry has turned to computers for assistance. More recently, predictive coding has been a secret weapon for advanced legal teams looking for an edge. Predictive coding is now widely accepted as a critical tool in the ediscovery process. Predictive coding works for ediscovery by solving the following key problems:

- Finding the right documents as fast as possible
- Sorting and grouping documents more efficiently
- Validating work performed before production

(Diagram: the train, evaluate, predict cycle, with documents classified as responsive or non-responsive.)
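To make the train/predict cycle concrete, here is a minimal sketch of a generic text classifier, assuming Python with scikit-learn and a handful of invented example documents. It is illustrative only and does not represent the algorithms inside any particular predictive coding platform.

```python
# Minimal sketch of the train/predict cycle behind predictive coding.
# Hypothetical data; assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Human-coded training documents (the "training" input described above).
train_docs = [
    "Q3 pricing agreement attached for your review",
    "Lunch on Friday? The usual place works for me",
    "Draft contract terms for the acquisition are enclosed",
    "Fantasy football picks are due tonight",
]
train_labels = [1, 0, 1, 0]  # 1 = responsive, 0 = non-responsive

# Convert document text into features and fit a simple classifier.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
model = LogisticRegression().fit(X_train, train_labels)

# Predict on documents no human has reviewed yet.
unseen_docs = ["Please sign the revised pricing contract", "Happy birthday!"]
scores = model.predict_proba(vectorizer.transform(unseen_docs))[:, 1]
for doc, score in zip(unseen_docs, scores):
    print(f"{score:.2f}  {doc}")
```

The scores are probabilities of responsiveness; downstream workflows decide how to act on them (prioritize, cull, or validate).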

Busting Predictive Coding Myths

There is still apprehension toward the use of predictive coding among a minority of legal professionals. Below are the most common myths associated with predictive coding:

Myth: Predictive coding requires a large time investment from expensive subject matter experts (SMEs).
Busted: Relying on SMEs is a highly effective approach to predictive coding, but it is far from the only approach. In fact, some alternative approaches do away with SMEs entirely.

Myth: You need at least 10,000 (or 25,000, or 50,000...) documents to use predictive coding effectively.
Busted: There is no minimum data set size required for modern predictive coding systems. The benefit of predictive coding can be seen even on extremely small data sets.

Myth: Document review and production must wait until predictive coding training is complete.
Busted: The concept of "complete" is something of a myth in and of itself. The benefit of predictive coding can be realized almost immediately, and when the training process should end is a matter of cost-benefit, not a technical requirement.

Myth: Using predictive coding is expensive.
Busted: A properly managed predictive coding workflow utilizing modern systems will always yield a net savings over the technology cost. A vastly improved work product should also be considered, as it leads to substantial indirect savings over the life of a case.

Myth: Predictive coding cannot be used without understanding algorithms, learning strategies, and everything else that happens inside the black box.
Busted: While a detailed understanding of the technicalities may boost some users' confidence, it is not necessary for a successful project. Predictive coding works, and there are very simple ways to prove it on every case.

Myth: All data must be present at the outset of predictive coding training.
Busted: The machine learning process will adjust to changing data sets without issue. Modern predictive coding systems are designed to be flexible and to adapt to the fluid nature of ediscovery.

Training the Predictive Coding System

For a predictive coding system to make accurate categorizations on its own, the system needs direction from humans. Understanding how a system "gets smart" is vital to proper use of predictive coding in ediscovery. This process is commonly referred to as the training phase. During training, humans review documents and select the issues that apply. The documents reviewed during training impact the quality of the classifications made by the system. It is therefore important that the trainers understand basic principles of how the system works.

Focus on the Text
Most predictive coding systems only work with text; therefore, trainers should only consider the text of the document when performing review.

Consistency Counts
Not all documents have to be coded perfectly; however, more consistent training will always yield better results.

The Four Corners Rule
Trainers should make decisions based on the contents of a single document and nothing else. Content contained in other documents should not be considered, even if those documents are related.

Selecting Training Documents
Using contextually valuable documents for training is key to a successful predictive coding project. This is also known as developing an effective learning strategy. The optimal learning strategy maximizes machine learning with the least possible amount of human effort. Training documents can be identified either by humans or by the system itself. It is generally not recommended to rely solely on human selection; however, both options can be used together with great success.

User Selection (Seeds)
What: Training documents that are identified by humans outside of the predictive coding process. Often referred to as seeds or seed sets.
When: Generally used at the beginning of training to kick-start (i.e., "seed") the process, but can be utilized at any time.
Considerations: Not usually a random sample, contrary to popular belief.

Machine Selection
What: Training documents that are identified by the predictive coding system.
When: Almost always. When used in conjunction with seeds, machine selection helps to find the documents that the seeds did a poor job of classifying.
Considerations: In general, there are four different types of machine selection. These are detailed in the next section.

Understanding the Types of Machine Selection

In general, there are four different types of machine selection used for predictive coding sampling. While these go by different names depending on the conversation, we will refer to them as Active Learning, Prioritization, Stratified Sampling and Simple Random Sampling.

Active Learning
Focuses training on documents that have a high degree of uncertainty.
AKA: Focus Training, Uncertainty Sampling, TAR 1.0
- Designed explicitly to reduce the burden of labeling (i.e., training)
- Rankings/classifications are generally not adjusted once training completes
- Advanced systems may incorporate automatic error correction and blind verifications

Prioritization
The highest-scoring documents are escalated to maximize the value of ongoing review.
AKA: Continuous Active Learning (CAL), TAR 2.0+
- The easiest method to understand; takes the emphasis off statistics and difficult concepts
- Rankings/classifications are updated on a regular basis
- Advanced systems may incorporate some degree of Active Learning or Random Sampling

Stratified Sampling
Selective random sampling among a manually created subset of documents.
AKA: Advanced Random Sampling
- Less efficient than Active Learning or Prioritization, and more complicated to understand
- Requires an understanding of how strata should be created
- Provides estimations on subsets of data; useful for specific probability sampling, such as quality control and targeted validations

Simple Random Sampling
Random sampling among the entire data set or a randomly selected subset of the data set.
AKA: Simple Passive Learning (SPL), Random Sampling
- The least efficient type of sampling, requiring a large time commitment from the trainers
- When positive documents are rare, this method can fail completely
- Useful for basic estimation, control sets and validation processes
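The sketch below contrasts how these four selection strategies might choose the next batch of training documents, given hypothetical model scores. The score bands, batch size and document IDs are invented for illustration; real systems implement these strategies in more sophisticated ways.

```python
# Illustrative sketch (not any vendor's implementation) of the four
# machine-selection strategies, applied to hypothetical model scores.
import random

# doc_id -> predicted probability of responsiveness (hypothetical scores)
scores = {f"DOC-{i:04d}": random.random() for i in range(1, 1001)}
BATCH = 10

# Active Learning: documents the model is least certain about (scores near 0.5).
active_learning = sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:BATCH]

# Prioritization (CAL): the highest-scoring documents go to reviewers first.
prioritization = sorted(scores, key=scores.get, reverse=True)[:BATCH]

# Stratified Sampling: random picks within manually created strata,
# here defined (arbitrarily) by score band.
strata = {
    "high": [d for d, s in scores.items() if s >= 0.7],
    "mid": [d for d, s in scores.items() if 0.3 <= s < 0.7],
    "low": [d for d, s in scores.items() if s < 0.3],
}
stratified = [doc for band in strata.values()
              for doc in random.sample(band, min(BATCH // 3, len(band)))]

# Simple Random Sampling: a uniform draw from the entire data set.
simple_random = random.sample(list(scores), BATCH)

print("Prioritization batch:", prioritization)
```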

Predictive Coding Workflows in Ediscovery

There are many effective ways to improve ediscovery processes through the use of predictive coding. Depending on the requirements of the matter, however, certain approaches may be better suited than others. In this guide, we will focus on four applications of predictive coding that are not only common, but also extremely effective:

- SME Training
- Prioritized Review (aka CAL)
- Hybrid Multimodal Review
- Quality Control

SME Training

SME (Subject Matter Expert) Training is the traditional approach to predictive coding in ediscovery and most effectively utilizes Active Learning as the sampling methodology. This is what most people think of when they hear "predictive coding" or "TAR 1.0."

PROS
- Has the highest ceiling for efficiency and cost savings
- Creates a small yet very accurate training set, which can yield very good results with minimal effort
- Helps to quash logistic and cost concerns associated with very large databases
- Can be used as an alternative to review, a supplemental culling mechanism, or for more targeted review
- High benefit potential from seeding

CONS
- Most effective with one or more Subject Matter Experts (SMEs), who generally have limited bandwidth and high associated cost
- Moderate upfront effort can delay the start of a larger process
- Return on investment (ROI) can be an issue for small-to-medium matters
- Validation is strongly recommended and can add to the SME burden

(Diagram: predictive coding sorting documents into privileged/non-privileged and relevant/irrelevant categories, with confirmation and logging, production, and sampling steps.)

Training
An SME trains on a number of documents until comfortable with the results. Typically, Active Learning will be used to select training documents; however, it is not uncommon to supplement with seeds or even random/stratified samples, especially when richness is low.

Process
The system will build a model to identify documents that have a high probability of correct classification into pre-defined categories. Once the model is created, legal teams incorporate the downstream process that best suits the case. This is often additional culling and/or automated/semi-automated review.

Validation
Validation is strongly recommended when using this approach in most, but not all, cases. Generally, a control set is used to gain visibility into the efficacy of the model. Post-categorization null set sampling is also common to ensure that the number of false negatives is minimal.

Prioritized Review (aka CAL)

Prioritized Review (aka CAL) is becoming a very popular predictive coding workflow in ediscovery. Its simplicity and very low learning curve essentially eliminate the barrier to entry often encountered with SME Training. Prioritized Review is very popular in scenarios where there is a desire or requirement to put human eyes on all or most documents at issue in a case. This is occasionally referred to as "TAR 2.0" since it is a more recently adopted workflow.

PROS
- Escalates important documents very rapidly and keeps irrelevant documents at bay
- Is the most efficient way to organize a human review of any data set, assuming that data set needs at least some amount of human review
- Can be used more effectively on multiple issues simultaneously than SME Training
- Enables even small review teams to be extremely productive
- Almost no barrier to entry; any legal team can take advantage at any time

CONS
- Potential for cost savings is reduced compared to SME Training
- The nature of legal review often requires that families (emails with attachments) be reviewed together, which inevitably interferes with a purely prioritized review
- Validation is strongly recommended if review stops before all documents are reviewed, which introduces SME burden to the workflow
- May create an unwieldy model composed of an excessive number of documents

(Diagram: known relevant documents and system updates feed the model; the highest-rated documents are escalated to the front of review. Teams may review all documents, with no validation required, or stop early, followed by sampling and production.)

Training
The vast majority of training is simply the act of reviewing prioritized documents over time. The model is rebuilt at specified intervals to take advantage of new knowledge gained since the previous classification. Training usually commences with some combination of Active Learning, Random Sampling and seeds, but the impact of those samples diminishes as more documents are trained.

Process
A legal team starts review, and the system learns from human decisions when the model is rebuilt at specified intervals. The highest-scoring documents are placed at the front of the review, so they get assigned earlier. The ratio of responsive documents to non-responsive documents is tracked and reviewed (usually daily) to determine when relevant documents have been exhausted. Most teams stop review once the rate reaches a point where further review is not justifiable.

Validation
Validation is strongly recommended only if a legal team decides to stop before all documents are reviewed. In this scenario, a random sample of all remaining documents should be taken, often referred to as the null set or elusion set, to confirm there is no significant percentage of relevant documents remaining. Validation passes when a strong proportionality (diminishing returns) argument can justify discarding the remaining documents.
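As a rough illustration of the stopping and validation logic described above, the following sketch computes a daily responsiveness rate and an elusion (null set) point estimate. All counts are hypothetical.

```python
# Illustrative sketch (hypothetical numbers, not a vendor report) of the two
# checks described above: the daily responsiveness rate used to decide when
# to stop a prioritized review, and an elusion estimate from a null set sample.

# (responsive, total reviewed) per day of a prioritized review.
daily_counts = [(180, 200), (160, 200), (110, 200), (40, 200), (6, 200)]
for day, (responsive, reviewed) in enumerate(daily_counts, start=1):
    rate = responsive / reviewed
    print(f"Day {day}: responsiveness rate {rate:.1%}")
# A steadily falling rate (90% down to 3% here) supports a
# diminishing-returns argument for stopping early.

# Elusion check: sample the unreviewed "null set" and count relevant documents.
null_set_size = 150_000        # documents the team proposes to discard
sample_size = 1_500            # randomly sampled and reviewed by humans
relevant_in_sample = 12        # relevant documents found in the sample

elusion_rate = relevant_in_sample / sample_size
estimated_missed = elusion_rate * null_set_size
print(f"Elusion rate: {elusion_rate:.2%}")
print(f"Point estimate of relevant documents left behind: {estimated_missed:,.0f}")
```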

Hybrid Multimodal Review

Hybrid Multimodal Review combines a Prioritized Review with ongoing SME input. The SME in this workflow utilizes many tools at his or her disposal to drive important and beneficial documents to review, in an effort to minimize the review of irrelevant and redundant documents. There is no pre-defined script with this approach; instead, it is a fluid process that adjusts based on SME leadership.

PROS
- Has most of the benefits of Prioritized Review, plus a heightened ceiling for efficiency
- When led by an experienced SME, this approach yields the highest quality of output of any method (as defined by recall and precision) on most cases

CONS
- A highly skilled SME, beyond the skill of a traditional SME, is required to drive the process for most of the project's duration
- ROI can be an issue for small matters

(Diagram: documents are escalated to the front of review by the SME across score bands; review stops once all predictive coding and TAR tools have been utilized, with an optional null set sample providing proof of whether further review might be needed before production.)

Training
Training is done throughout the entire review, much like in a Prioritized Review. The SME determines which documents are trained by using his or her best judgment and a combination of tools.

Process
Whether working alone or with a team, the SME is in charge of the process. Since the process will vary from case to case and SME to SME, it is both variable and subjective, but highly effective when done correctly. Generally, the SME will halt review before all documents are reviewed.

Validation
Validation is strongly recommended when using this approach to permanently eliminate documents from review. In this scenario, a random sample of all remaining documents should be taken, often referred to as the null set or elusion set, to confirm there is no significant percentage of relevant documents remaining. Validation passes when a strong proportionality (diminishing returns) argument can justify discarding the remaining documents.

Quality Control

Predictive coding can also be used as a Quality Control (QC) mechanism, either exclusively or in conjunction with another workflow. Because the model creates a probabilistic ranking for every document, Review Managers can use that information to assess human performance and take remedial action when appropriate. Predictive coding can also help identify widespread problems, providing an opportunity to evaluate instructions and consistency between teams.

Responsive QC
Any human-coded documents drastically conflicting with the system's suggestion are sent to a defined QC stage in the review workflow. Document scores allow Review Managers to pinpoint which documents are most likely incorrectly coded. Examination of these documents can also form the basis of an ongoing QC check that can be invaluable for ensuring complete productions.

Privilege QC
Review Managers can utilize predictive coding to help identify privileged documents at risk of being inadvertently produced. Additionally, predictive coding can help reduce the burden of privilege logging by reducing the number of false positives.

Individual reviewer QC
Reporting on reviewer decisions measured against predictive rankings can help Review Managers identify problematic individuals on their teams. This is a useful tactic for both remediation and prevention, and is especially effective when used in conjunction with existing QC methodologies designed to flag performance issues.
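A simplified sketch of the conflict check behind Responsive QC and individual reviewer QC might look like the following; the score thresholds and sample data are assumptions for illustration, not recommendations.

```python
# Illustrative sketch (hypothetical thresholds and data) of a responsive-QC
# check: flag human-coded documents that drastically conflict with the
# model's suggested score so a Review Manager can route them to a QC stage.
from collections import Counter

reviewed_docs = [
    # (doc_id, reviewer, human_coded_responsive, model_score)
    ("DOC-0001", "reviewer_a", True, 0.92),
    ("DOC-0002", "reviewer_a", False, 0.88),   # coded non-responsive, model strongly disagrees
    ("DOC-0003", "reviewer_b", True, 0.07),    # coded responsive, model strongly disagrees
    ("DOC-0004", "reviewer_b", False, 0.11),
]

HIGH, LOW = 0.85, 0.15  # assumed cut-offs for a "drastic" conflict

qc_queue = [
    (doc_id, reviewer)
    for doc_id, reviewer, coded_responsive, score in reviewed_docs
    if (coded_responsive and score <= LOW) or (not coded_responsive and score >= HIGH)
]
print("Route to QC:", qc_queue)

# The same comparison, aggregated per reviewer, supports individual reviewer QC.
conflicts_by_reviewer = Counter(reviewer for _, reviewer in qc_queue)
print("Conflicts per reviewer:", conflicts_by_reviewer)
```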

Validating Predictive Coding with Sampling

Although sampling concepts and practices can be confusing, they are often necessary for the validation of a predictive coding project. Understanding sampling methods and derived metrics is an essential skill for evaluating success. Sampling can also be used in a number of other scenarios in ediscovery, even those unrelated to predictive coding.

Statistical Sampling vs. Judgmental Sampling

For predictive coding, two types of samples are most commonly used. A statistical sample is one in which everything was selected at random. A judgmental sample is one in which any other factor was used in selecting the sample set. If a sample is not a statistical sample, then it is a judgmental sample. Legal teams can draw conclusions about an entire document population from a random sample, but not from a judgmental sample.

Statistical Sample: documents picked at random from the entire batch; reflective of the population.
Judgmental Sample: a group of relevant documents picked from the batch; not reflective of the population.

Key Terminology

To understand the benefits of sampling, a user must first grasp a few key terms. Learning these terms can help interpret results and assess how well a document review is progressing. The following terms are helpful to know for an effective review.

(Diagram: a target illustrating the estimate, the margin of error around it, and the percent chance that the result falls within that range.)

Population Size is the total number of items from which a random sample is taken.

Margin of Error (MOE) is the maximum amount by which an estimate based on the sample results might deviate from the actual amount. Note that the Confidence Interval is a related measurement and is equal to 2 x MOE.

Confidence Level refers to the probability that the estimate from the sample, along with the margin of error, includes the actual resulting amount.

Prevalence (also called Richness) is the number of positive items in a sample divided by the total number of items in the same sample.

Applications of Sampling

Sampling has numerous applications in document review. The methods discussed below can be used to check quality, compare methods of coding, and verify that all relevant documents are being queued for review.

(Diagram: a sample drawn from the population serves as a control set, used to compare coding methods and to show whether further review may be needed.)

A Point Estimate is created by applying the percentage of certain documents in a sample to the entire population. The number of positive (i.e., relevant) documents in the sample is used to estimate the number of positive documents in the population.

Control Sets are random samples that are used to conduct an efficacy assessment that is representative of the entire population. Control sets are extremely useful for determining the success of a classification effort without reviewing a larger portion of the population.

Null Sets (also called Elusion samples) are random samples taken from the entire population of negatively classified items.
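The following worked example ties these terms together for a hypothetical simple random sample, using the standard normal-approximation formula for the margin of error of a proportion (no finite population correction) at a 95% confidence level. All counts are invented.

```python
# Worked example (hypothetical numbers) of prevalence, point estimate and
# margin of error at a 95% confidence level for a simple random sample.
import math

population_size = 500_000   # documents in the population
sample_size = 2_400         # randomly sampled and reviewed
positives_in_sample = 312   # documents coded relevant in the sample

prevalence = positives_in_sample / sample_size          # a.k.a. richness
point_estimate = prevalence * population_size           # estimated relevant docs

z = 1.96                                                # z-score for 95% confidence
moe = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)

print(f"Prevalence: {prevalence:.1%}")
print(f"Point estimate: {point_estimate:,.0f} relevant documents")
print(f"Margin of error: +/- {moe:.1%} "
      f"({(prevalence - moe) * population_size:,.0f} to "
      f"{(prevalence + moe) * population_size:,.0f} documents)")
```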

Deriving Metrics to Assess Predictive Coding Results

There are specific metrics that can help quantify the efficacy of a predictive coding project. Many predictive coding systems generate a report with these metrics to provide visibility into classifier performance. To understand these metrics, legal teams must first understand how they are derived. Consider the scenario below, showing classification of responsive vs. non-responsive documents (a common task in ediscovery). Legal teams can understand classifier performance by analyzing the accuracy of these decisions across a sample that is representative of the larger data set.

True positive: a document correctly suggested as responsive.
False positive: a document incorrectly suggested as responsive.
True negative: a document correctly suggested as non-responsive.
False negative: a document incorrectly suggested as non-responsive.

(Diagram: a matrix of actually responsive vs. actually non-responsive documents against those suggested responsive by the system.)
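A small sketch of how these four counts can be tallied from a validation sample, treating human coding as ground truth; the sample data is hypothetical.

```python
# Illustrative sketch (hypothetical sample data) of tallying the four outcome
# types from a validation sample where human coding is treated as ground truth.
sample = [
    # (human_coded_responsive, system_suggested_responsive)
    (True, True), (True, False), (False, True), (False, False),
    (True, True), (False, False), (True, True), (False, True),
]

tp = sum(1 for actual, suggested in sample if actual and suggested)
fp = sum(1 for actual, suggested in sample if not actual and suggested)
tn = sum(1 for actual, suggested in sample if not actual and not suggested)
fn = sum(1 for actual, suggested in sample if actual and not suggested)

print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```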

Precision, Recall, F-Measure

The most common metrics used in predictive coding projects are recall, precision and F-measure.

Precision is a measure of exactness. Precision measures the percentage of truly positive items within the subset of items classified as positive by the model.

Recall is a measure of completeness. Recall measures the percentage of all truly positive items in the sample that were classified as positive by the model.

F-measure is the weighted harmonic mean of recall and precision.

(Diagrams: three views of actually responsive vs. actually non-responsive documents against those suggested responsive.)
1. Exact, but not complete: 100 percent precision, but low recall. The only documents classified as positive were, in fact, positive, but many positive documents were left out.
2. Complete, but not exact: 100 percent recall, but low precision. All documents were classified as positive, but many of these documents are actually negative.
3. Exact and complete: the results are not perfect, but represent very good performance based on high recall and precision.
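For reference, the standard formulas for these metrics in terms of the counts defined above are shown below; F-measure appears in its balanced F1 form, the variant most commonly reported.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```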

Conclusion

In some form or fashion, predictive coding is a valuable tool for any document review. Users of predictive coding are able to minimize costs and maximize efficiency during document reviews. The goal of this guide was to dispel the common myths around predictive coding and present the knowledge needed to use it effectively. With data continuing to grow at rapid rates, predictive coding will play a growing role in litigation and other document review. Organizations and individuals with systems in place to utilize predictive coding in anticipation of litigation will be best positioned to thrive, while unprepared organizations will face the consequences.


Copyright 2018 LDiscovery. All Rights Reserved.
8201 Greensboro Drive, Suite 300, McLean, VA 22102
24/7: 888.811.3789
www.kldiscovery.com