Predictive Coding: The Next Sorta/Kinda/Maybe Big Thing
So You Won’t Get Fooled Again
by Leonard Deutchman
Much has been written about predictive coding. A number of companies, LDiscovery included, offer it, and it has been pushed at trade shows like Legal Tech. Recently, in Da Silva Moore et al. v. Publicis Groupe & MSL Group, 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 23, 2012), Judge Peck wrote an interesting opinion discussing it that is well worth the read. Here, I will make observations about predictive coding as a methodology and as an industry phenomenon, take issue with but ultimately agree with Judge Peck, and, as I plan to do in all of my blogs, plant an obscure reference to The Who.
Predictive coding arose in Da Silva Moore because plaintiffs had asked defendants to search through some three million records from agreed-upon custodians. Plaintiffs were women seeking class status to sue defendants on the basis of systemic gender discrimination. Plaintiffs sought two types of information: that concerning H.R. practices both specific to plaintiffs and generally involving women; and information regarding similarly situated male employees, to see whether the “comparator” set of male employees had been treated the same way.
The parties in Da Silva Moore had been discussing an e-discovery protocol but had had several disagreements. The Court told the parties that it favored predictive coding, citing to a recent, widely-read article it had published on the topic, but warned defendants that if they were to use it, they would have to disclose to plaintiffs their “seed set,” i.e. the set of records which had been tagged so as to teach the application how to predict correctly how the reviewer would code the remaining documents. After further discussions between the parties, plaintiffs, at their next judicial conference, objected to defendants’ plan to review and produce only the top 40,000 documents (defendants calculated the cost of such a production to be $200,000, assuming $5 per document). The Court agreed with this objection. It noted that many predictive coding applications, including the one in use in the instant matter, rated documents with a “confidence level,” i.e. a quantitative measure of how confident the application was, based upon the parameters supplied by the coders in their manual coding of the seed set, that the records coded by the application were coded consistently with the way the manual coders would have coded them. It then found that, while defendants certainly could employ a cost/benefit analysis to justify not producing documents that fell so far below a certain confidence level that the cost of production greatly exceeded the benefit, choosing 40,000 documents was arbitrary. In choosing a cut-off, the Court ruled, defendants would have to pick the point where cost exceeded benefit. “Proportionality requires consideration of results as well as costs,” the Court noted, so “if stopping at 40,000 is going to leave a tremendous number of likely highly responsive documents unproduced,” the cutoff “doesn’t work.”
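The difference between a cut-off fixed by count and one tied to results can be illustrated with a small sketch. This is not the method used in the case: the function names, the toy scores, and the idea of treating each confidence score as a rough probability of responsiveness are all assumptions made for illustration. Only the $5 per document figure and the 40,000-document plan come from the discussion above.

```python
def review_cost(num_documents, cost_per_document=5.0):
    """Cost of reviewing and producing a given number of documents."""
    return num_documents * cost_per_document


def expected_responsive_left_out(scores, cutoff):
    """Sum of scores below the cut-off in the ranked list.

    Assumption: each score is read as a rough probability that the
    document is responsive, so the sum approximates how many responsive
    documents a count-based cut-off would leave unproduced.
    """
    ranked = sorted(scores, reverse=True)
    return sum(ranked[cutoff:])


def cutoff_by_score(scores, min_score):
    """Number of documents at or above a chosen confidence score."""
    return sum(1 for s in scores if s >= min_score)


# Cost of the proposed 40,000-document production at $5 per document.
print(review_cost(40_000))  # 200000.0

# Hypothetical scores for six documents.
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.20, 0.10]

# A count-based cut-off of 2 leaves highly scored documents behind.
print(expected_responsive_left_out(toy_scores, cutoff=2))

# A score-based cut-off of 0.5 keeps every highly scored document.
print(cutoff_by_score(toy_scores, min_score=0.5))  # 4
```

In the toy data, stopping after two documents excludes two documents scored at 0.80 or higher, which is the kind of outcome the Court said “doesn’t work.”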
In the same vein of determining what does or doesn’t work, the Court reviewed how the parties resolved disagreements regarding the predictive coding protocol and how the Court resolved them when the parties could not. The parties agreed to a random sample of 2,399 documents to be reviewed to create the seed set. A problem arose, however, because one defendant had reviewed the set before the parties had agreed to use two additional review tags to flag records pertaining to two additional issues. The parties resolved this issue by agreeing that plaintiffs could code using the two additional tags.
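A sample of 2,399 is consistent with a standard sample-size calculation, sketched below. The inputs are assumptions, since the discussion above does not say how the number was chosen: a 95% confidence level (z = 1.96), a margin of error of plus or minus 2%, the most conservative proportion of 0.5, and a population of roughly three million records.

```python
def sample_size(population, z=1.96, margin=0.02, proportion=0.5):
    """Sample size for estimating a proportion, with the finite
    population correction applied.

    All default values are illustrative assumptions.
    """
    # Size needed for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    # Adjust downward slightly for a finite population.
    return n0 / (1 + (n0 - 1) / population)


print(round(sample_size(3_000_000)))  # 2399 under these assumptions
```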
The parties further agreed as to how to search to “create the seed set to train the predictive coding software.” Defendants coded certain documents through “judgment sampling,” i.e. by a senior attorney doing traditional review by judging that the record was responsive, non-responsive, privileged, etc. The remainder of the seed set was created by using Boolean searches, e.g., using keywords such as “training” found in a record with other keywords, such as “Da Silva Moore,” and reviewing all records that fell within the top fifty hits with those searches. Defendants agreed to provide all hits, save for records they deemed privileged. As well, defendants used additional keywords, supplied by plaintiffs, to identify an additional 4,000 records, which senior counsel then reviewed individually and coded. The resultant seed set, documents deemed by defendants to be privileged having been removed, would then be disclosed to plaintiffs.
The parties contemplated that this iterative process would take several rounds before the “training of the software” would be “stabilized,” i.e. the parties would agree that the application was, in its predictions, tagging the records as the senior reviewers would have. Defendants proposed seven iterative rounds of reviewing at least 500 documents each round. After the seventh round, defendants would review a random sample of 2,399 records marked “non-responsive” to ensure that the software had stabilized, disclosing all of these records to plaintiffs, save the privileged ones. The Court accepted the protocol, modifying it to require that, if after seven rounds of review the software had not stabilized, the process would have to be repeated until it did.
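The control flow of the protocol as modified by the Court can be sketched roughly as follows. This is one reading of the protocol, not a description of any vendor's software: the two callbacks, the function name, and the safety limit of 50 rounds are hypothetical. Only the seven planned rounds and the rule that training continues until the results are stable come from the discussion above.

```python
from typing import Callable


def train_until_stable(run_round: Callable[[int], None],
                       sample_is_stable: Callable[[], bool],
                       planned_rounds: int = 7,
                       max_rounds: int = 50) -> int:
    """Run review rounds until the quality-check sample shows stability.

    run_round(n) stands in for reviewing a batch of documents and
    retraining; sample_is_stable() stands in for the random-sample check.
    The check is only made once the planned rounds are complete.
    """
    completed = 0
    while completed < max_rounds:
        completed += 1
        run_round(completed)
        if completed >= planned_rounds and sample_is_stable():
            return completed
    raise RuntimeError("training did not stabilize within max_rounds")


# Toy usage: pretend the sample check first passes after round 9.
history = []
rounds_needed = train_until_stable(
    run_round=history.append,
    sample_is_stable=lambda: len(history) >= 9,
)
print(rounds_needed)  # 9
```

The point of the sketch is that the number of rounds is a floor, not a ceiling: the exit condition is the stability check, not the round count.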
The parties’ predictive coding protocol, and the Court’s discussion of it, provides a primer for those interested in how to use the new technology. Some observations are in order.
1. The Court and the parties understand how the iterative process of creating a seed set must work. It requires good-faith cooperation and a great deal of back-and-forth. It also requires that the Court be mindful not just of the value of predictive coding as a means of saving the producing side money (if this series of iterations can allow the producing side to review 40,000 records in order to code 3,000,000, I’d call that a bargain), but also that such savings do not come by setting boundaries arbitrarily. Thus, defendants cannot simply decree that they will pick the top 40,000 hits if the confidence level in the first 100,000 hits is very high, or end the iterative process of shaping the seed set at seven iterations if coding is not yet stable. Reliable results guide the process.
2. In focusing on relying upon the parties’ judgment to shape the seed set until it produces reliable results, the Court describes, without so announcing, a version of how all verification of digital forensics and e-discovery processes is done. All such processes involve “black boxes,” i.e. proprietary applications whose inner workings their developers will not reveal to the world at a Daubert hearing since they want to continue to profit from those boxes. To be able to use these applications – here, EnCase and FTK immediately come to mind – organizations such as NIST or law enforcement organizations have tested them by comparing the results obtained through use of the applications with those obtained using other search applications, including manual ones (such as hexadecimal readers, where the examiner goes to the sector on the drive and reads what is there sector by sector). The only way to test a black box is to compare its results to known and trusted results and to have sufficient favorable comparisons to gain confidence in the application. That same process is what the parties, in using the predictive coding application, plan to do in the instant matter.
3. In discussing this process, the Court asserts that Daubert, in which the Supreme Court first interpreted F.R.E. 702, which governs the admissibility of expert testimony, and Kumho Tire Company v. Carmichael, 526 U.S. 137 (1999), which held that Rule 702 applied to any type of expert testimony, not just scientific evidence, were inapplicable because Daubert and Kumho applied solely to situations where the Court, as “gatekeeper,” was determining what evidence could be brought before a jury, whereas here there was no jury and, in fact, no trial. Instead of looking to Rule 702, the Court stated, it “would be interested in both the process used and the results,” i.e. it would “want to know what was done and why that produced defensible results,” and would be “less interested in the science behind the ‘black box’ of the vendor’s software than in whether it produced responsive documents with reasonably high recall and high precision.”
Here the Court is both right and wrong. It is wrong that Daubert and Kumho are inapplicable because there is no jury. By its own terms, Rule 702 applies when “scientific, technical, or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue.” There is no requirement that the evidence be presented to a jury. Here, in the motion, the Court is the trier of fact, so Rule 702 would apply.
The Court, however, regardless of its ostensible rejection of Daubert and Kumho, nevertheless has de facto applied Rule 702 in its analysis. Because the Court is both gatekeeper and trier of fact, application of Rule 702 is hard to see, but it is even harder to see because the “area of expertise” at issue here is the determination of what records are responsive to a discovery request, and so here pretty much everyone – the Court and all counsel – is an expert. By demanding “defensible results” as it did, the Court did, in fact, apply Rule 702, i.e., it demanded that the coding application produce records deemed “responsive” by the experts at issue, that is, counsel for the parties. The Court did not see itself in that role because it is so used to thinking of expert testimony as touching upon something outside of its own particular branch of expertise that it did not recognize how it had internalized Rule 702 and applied it, sub silentio, to the predictive coding protocol.
4. Drawing upon a wide body of knowledge, including United States v. O’Keefe, 537 F. Supp. 2d 14, 24 (D.D.C. 2008) (Facciola, M.J.), Equity Analytics, LLC v. Lundin, 248 F.R.D. 331, 333 (D.D.C. 2008) (Facciola, M.J.), and Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 260, 262 (D. Md. 2008) (Grimm, M.J.), the Court makes the crucial point that predictive coding, or any search application, need not reach the unattainable goal of perfection (i.e., find all responsive records and no non-responsive ones) to be defensible. Traditionally, courts have accepted the judgment of review counsel (absent some demonstrable reason to question such judgment) when reviewing whether a party has selected and produced all responsive, non-privileged documents from those reviewed. Thus, predictive coding produces “stable” results when it can replicate the coding judgment of review counsel; whether that judgment is perfect is another question and, as the Court recognizes, not a relevant one.
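The two measures the Court named, recall and precision, map directly onto the two halves of “perfection” described above: recall is the share of responsive records that were found, and precision is the share of found records that are responsive. A minimal sketch, using hypothetical document IDs and taking review counsel's coding as the reference, might look like this:

```python
def precision_recall(produced, responsive):
    """Return (precision, recall) for a set of produced document IDs.

    `responsive` is the reference set, assumed here to be the documents
    review counsel would have coded as responsive.
    """
    produced = set(produced)
    responsive = set(responsive)
    correctly_produced = len(produced & responsive)
    precision = correctly_produced / len(produced) if produced else 0.0
    recall = correctly_produced / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical IDs: four documents produced, five actually responsive.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # (0.75, 0.6)
```

Perfection would be (1.0, 1.0); the point of the cases cited is that a defensible process can fall short of that.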
5. It is worthy of notice that plaintiffs as well as defendants in the instant matter used experts, referred to as “vendors” by the Court since that is how they are referred to in the industry and because, technically, neither was offered as an expert witness within the meaning of Rule 702. Plaintiffs’ expert was of tremendous assistance in helping plaintiffs craft and understand the technical aspects of the predictive coding protocol. Often the party seeking e-discovery – plaintiffs, usually – does not draw upon vendors because it sees the vendor’s role solely as aiding in the production of e-discovery. Vendors, however, can be of great help in assisting plaintiffs in crafting intelligent discovery requests and in responding intelligently to defense objections to requests. Indeed, I have elsewhere argued that defense counsel are usually much better prepared to make smart discovery requests because they have learned from working with vendors in producing e-discovery. I have no idea whether plaintiffs had to produce discovery here and so had engaged the vendor principally for that purpose, or whether they sought the vendor solely to advise them regarding discovery requests. Regardless, readers should take note that plaintiffs’ use of an expert here allowed them to negotiate a protocol under which three million records would be searched. While in other circumstances, the mere number of records at issue would likely have led to endless motions to reduce the review set as too burdensome under Rule 26(b)(2)(B), here plaintiffs were able to negotiate review of the entire set by agreeing, in an informed manner, to a protocol which made that search far less expensive for defendants.
6. Finally, it must be noted that defendants sought to use predictive coding because three million records were involved. Despite all of the hype involving predictive coding, the fact is that it is used infrequently. Like its older sibling, Concept Searching, predictive coding is a sophisticated feature that vendors like to tout as a way of showing that they are the Real Deal, and that clients want to see offered so they know that the vendor they have chosen is that same Real Deal. Like the unoccupied luxury suites that hotels offer to show the world that the resort at issue is the top of the top of the line, predictive coding is offered and requested far more as a way of measuring the vendor than because the service is wanted.
There are other reasons predictive coding is more inquired about than used. The cost saving of predictive coding over manual review does not arise until the number of records to review is large enough, and the value grows as does the size of the data set. Three million records is well beyond the size the data set needs to be for predictive coding to be a cost saver, but few cases involve three million records, or even half that. The typical case is much smaller, where the value of predictive coding may not outweigh its cost.
Even with a large data set at issue, however, clients still hesitate to choose predictive coding, as they do with Concept Searching, because it is an additional expense to which the client must commit in full as soon as e-discovery begins to be produced. Yet underneath every step that all parties take in civil litigation is the understanding that it is about the money, and so the matter may settle at any point. Commit to $200,000 in predictive coding and, if the matter settles tomorrow, you still owe the vendor $200,000, but if you commit to the number of review hours by review counsel whose fees will total $200,000 and more, you can always stop that review the second there is a hint of settlement and so incur only the cost of the review done so far. It takes a very large data set, and a client convinced beyond any doubt that there is no hope of settlement at that time, for a client to commit to such up-front costs. That rarely happens, so despite the virtues of predictive coding, it is a prudent prediction that it will be more talked about than used.
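The settlement arithmetic can be made concrete with a short sketch. The $200,000 figure comes from the example above; the function name, the simplification that manual review cost accrues in proportion to the work done, and the 25% settlement point are assumptions for illustration.

```python
def cost_at_settlement(total_budget, fraction_done, committed_up_front):
    """Amount owed if the matter settles part-way through review.

    An up-front commitment is owed in full regardless of progress;
    hourly manual review is assumed to cost only what has been done.
    """
    if committed_up_front:
        return total_budget
    return total_budget * fraction_done


# Hypothetical: the matter settles when a quarter of the review is done.
print(cost_at_settlement(200_000, 0.25, committed_up_front=True))   # 200000
print(cost_at_settlement(200_000, 0.25, committed_up_front=False))  # 50000.0
```

The earlier the settlement, the wider the gap, which is why the up-front commitment is the harder sell.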
Predictive coding is a tool every litigator should have in his or her belt. Da Silva Moore explains in great detail how the application works technically and in litigation, and so is well worth the read.