TREC 2007 Legal Discovery Track: Main Task Guidelines

The objective of the main task of the TREC Legal Discovery Track is to evaluate the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation.

Revision History

Prerequisites

New participants are welcome! This is the second year of the TREC Legal Discovery Track, and this year's main task is the successor to the single task run in last year's track. This guidelines document assumes familiarity with TREC and with last year's Legal track. In particular, all participants are requested to have read the following:

All participants should also be subscribed to the track mailing list to participate in any track discussion and to be informed of any late-breaking announcements. Contact oard (at) umd.edu to be added to the list. Archives of the mailing list are available from the track web site at http://trec-legal.umiacs.umd.edu/.

Optionally, participants may also join a second, cross-track mailing list, IReval, for discussing evaluation issues that span the Legal track and other tracks:

It became clear at the TREC 2006 meeting that several different tracks are trying to address the problems associated with evaluating retrieval systems when traditional pooling is no longer viable (because the document set is too large, there is insufficient diversity in the runs, there are not enough resources for sufficiently deep pools, etc.). To avoid having discussions of these issues splintered across multiple track mailing lists, NIST has set up a new mailing list for these discussions. Since the problem is broader than just TREC, the mailing list is named IReval.
To join the list, send an email message to listproc@nist.gov whose body contains only the line

  subscribe IReval <Firstname> <Lastname>

You must subscribe to the list to post to it, but otherwise the list is public. In particular, both the list archives and the list membership will be publicly available from the NIST mailing list web site. Replies to a message on the list will be routed to the entire list to facilitate discussion.

The archives of the IReval mailing list are available at http://cio.nist.gov/esd/emaildir/lists/ireval/maillist.html.

Overview of Differences From Last Year's Task

Documents

The document collection is the same as last year:

The set of documents for the track will be the IIT Complex Document Information Processing test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco. These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The metadata and OCR text can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail from NIST as a set of DVDs.

To download the collection, please fill out the form at the bottom of http://www.ir.iit.edu/projects/CDIP.html and you will be contacted with the ftp information.

Topics

The topics will be "production requests" like last year:

Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the legal case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems.
Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g. a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate.
Queries that are formed completely automatically, using software that existed at the time the evaluation topics were first seen, are considered automatic; all other cases are considered manual queries. Automatic queries provide a reasonably well controlled basis for cross-system comparisons, although they are typically representative of only the first query in an interactive search process. The most common use of manual queries is to demonstrate the retrieval effectiveness that can be obtained after interactive optimization of the query (such runs typically make excellent contributions to the judgment pool and are thus highly desirable), but even interventions as simple as manual removal of stopwords or stop structure result in manual queries.

Last year's topics and relevance judgments are available for training (topics: http://trec-legal.umiacs.umd.edu/trec_legal_eval_topics.v1.3.zip; relevance judgments (qrels): http://trec.nist.gov/data/legal06.html).

New to 2007: For each topic, we intend to make available the list of documents that match the final negotiated Boolean query, for optional use by participants. The B values (the number of documents matching the Boolean query) will be included as part of the topic statement in the topic file. The format of the list of matching documents will be the same as the standard submission format for TREC ad hoc runs (described below); however, the documents will simply be listed in alphabetical document-id order with all rsv and rank values set to a constant (i.e. no Boolean-based ranking will be provided, just the list of matching documents).

Update May 2/07: For training, the lists of documents matching last year's Boolean queries are now available by downloading the "refL06B" run from the Active Participants / Tracks Homepage at http://trec.nist.gov/act_part/tracks.html (direct link: http://trec.nist.gov/act_part/tracks/legal/input.refL06B.gz).

There are also two new training versions of last year's topics (fullL06.xml and shortL06.xml) available in http://trec-legal.umiacs.umd.edu/trec_legal_eval_topics.v1.3.zip:

It is anticipated that this year's reference Boolean run will be made available (July 1) in the same format as refL06B, and that this year's new topics will be made available (July 1) in both the fullL06.xml and shortL06.xml formats.


Evaluation Approach

Background on Estimating Precision and Recall

The Accuracy Issue of Simple Random Sampling: Even though simple random sampling produces unbiased estimates of a parameter, the individual estimates are usually too inaccurate to be useful for comparing retrieval systems. For example, suppose the target collection has 7 million documents, and for a particular topic 700 of these are relevant. Suppose we have the resources to judge 1000 documents. If we pick the 1000 documents from a simple random sample of the collection, most likely 0 of the documents will be judged relevant, producing an estimate of 0 relevant documents, which is far too low. If 1 of the documents was judged relevant, then we would produce an estimate of 7000 relevant documents, which is far too high.
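The short simulation sketch below (not part of the official evaluation; the numbers simply mirror the example above) illustrates the point: with a 1000-document simple random sample, the scaled-up estimate of the number of relevant documents is almost always either 0 or a multiple of 7000, never anything close to the true value of 700.

  # Sketch: variance of a simple-random-sample estimate of the number of
  # relevant documents.  All numbers are the illustrative ones from the text.
  import random

  COLLECTION_SIZE = 7000000   # documents in the collection
  NUM_RELEVANT = 700          # relevant documents for the topic
  SAMPLE_SIZE = 1000          # judging budget

  def estimate_once(rng):
      # Treat documents 0..699 as the relevant ones and draw a simple random sample.
      sample = rng.sample(range(COLLECTION_SIZE), SAMPLE_SIZE)
      judged_relevant = sum(1 for d in sample if d < NUM_RELEVANT)
      # Scale up by the inverse sampling fraction (7,000,000 / 1,000 = 7,000).
      return judged_relevant * (COLLECTION_SIZE // SAMPLE_SIZE)

  rng = random.Random(2007)
  print([estimate_once(rng) for _ in range(20)])
  # Typically prints mostly 0s with an occasional 7000.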

The Traditional Pooling Solution: TREC evaluations have typically dealt with this accuracy issue by using an extreme variant of stratified sampling. The primary stratum, known as the pool, is typically the set of documents ranked in the top-100 for a topic by the participating systems. Traditionally, all of the documents in the pool are judged. Unlike proper stratified sampling, typically none of the unpooled documents are judged; they are simply assumed non-relevant. For the older TREC collections of ~500,000 documents, [Zo98] found that the results of comparing retrieval systems are reasonably reliable, even though probably only 50%-70% of the relevant documents for a topic are assessed, on average.

Traditional Pooling Can Be Too Shallow for Larger Collections: As the judging pools have become relatively shallower, either because TREC collections have grown larger or because the judging depth has been reduced, concerns have been raised about the reliability of results. For example, [Bu06] recently reported bias issues with depth-55 judging for the 1 million-document AQUAINT corpus, and [To06] found that less than 20% of the relevant documents were judged on average for the 7 million-document TREC 2006 Legal collection.

Extending the Pooling Depth with Sampling: The TREC 2006 Terabyte Track [Bue06] recently experimented with taking random samples of 200 documents from (up to) depth-1252 pools, and estimated the average precision score for each run based on this deeper pooling by using the "inferred average precision" (infAP) measure suggested by [Yi06]. The infAP scores correlated highly with the MAP scores based on traditional depth-50 pooling.

The L07 Method

The L07 method for estimating recall and precision is based on how these components are estimated in the infAP calculation. What distinguishes the L07 method is its support for much deeper pooling, achieved by sampling higher-ranked documents with higher probability.

For legal discovery, recall is of central concern. Last year, [To06] found that marginal precision exceeded 4% on average even at depth 9000 for standard vector-based retrieval approaches. Hence we propose to use depth-25000 pooling this year to get better coverage of the relevant documents. (For topics with B>25000, depth-B pooling will be used.)

A simple random sample of a depth-25000 pool, however, would be unlikely to produce accurate estimates of recall at shallower cutoff levels. Hence we propose to sample higher-ranked documents with higher probability, in such a way that recall estimates at all cutoff levels up to max(25000,B) should be of similar accuracy.

An outline of the L07 method is below, followed by a worked-through example. We intend to do a simulation study (by May 31) to verify that we have the software needed to carry out the method and to determine how much judging effort is needed per topic for the estimated scores to be sufficiently accurate for comparing retrieval systems. The simulation approach will be to create a mock complete set of qrels based on last year's submitted Legal track runs, and then investigate whether the sampling approach accurately estimates the scores computed from the complete qrels.

The preferred location for discussion of the evaluation approach is the cross-track IReval mailing list, rather than the track mailing list.

Detailed Outline:

Selection of Documents to Judge for a Topic

Let D be the set of documents in the target collection. For the legal track, |D|=6910912.

For each topic:

Let m be the pooling depth, i.e. the number of top-ranked documents from each submitted run to include in the pool. We will use m=max(25000,B).

Let M be the pool, i.e. the set of documents retrieved in the top-m by at least one submitted run. All submitted runs will be included in the pooling. Note: m <= |M| <= |D|.

Let v be the number of documents we are willing to judge, e.g. v=800.
Let v' be the number of unpooled documents we are willing to judge, e.g. v'=35.

Define h(d) for all documents d in M as the highest rank at which any submitted run retrieved document d.
i.e. h(d) is a number from 1 to m, where 1 is the highest rank possible. Note: h(d) is not defined for d not in M.

Let p(d) be the probability that we include document d in our judging sample.
For d in M, p(d) will be min(C / h(d), 1),
where C is chosen so that the sum of all p(d) (for d in M) is v-v'.
For d not in M, p(d) will be min(v' / (|D|-|M|), C/m, 1).

Notes:

Let V be the set of documents chosen to be judged based on p(d).
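For concreteness, here is a minimal Python sketch of how the inclusion probabilities and the judging sample might be computed from the definitions above. The function and variable names, the bisection search for C, and the assumption that each document is included independently with probability p(d) are illustrative choices, not part of the official track software.

  # Sketch of the document-selection step defined above.
  # Assumed inputs:
  #   h_of_d          - dict mapping each pooled document id to h(d)
  #   collection_size - |D|
  #   v, v_prime      - total judging budget and unpooled judging budget
  #   m               - pooling depth
  import random

  def compute_C(h_of_d, target):
      # Find C such that the sum of min(C / h(d), 1) over the pool equals target.
      # The sum is non-decreasing in C, so a simple bisection suffices.
      lo, hi = 0.0, float(max(h_of_d.values()))
      for _ in range(100):
          mid = (lo + hi) / 2.0
          if sum(min(mid / h, 1.0) for h in h_of_d.values()) < target:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2.0

  def inclusion_probabilities(h_of_d, collection_size, v, v_prime, m):
      C = compute_C(h_of_d, v - v_prime)
      p_pooled = dict((d, min(C / h, 1.0)) for d, h in h_of_d.items())
      # Every unpooled document gets the same small inclusion probability.
      p_unpooled = min(float(v_prime) / (collection_size - len(h_of_d)), C / m, 1.0)
      return p_pooled, p_unpooled

  def draw_sample(p_pooled, p_unpooled, unpooled_doc_ids, rng):
      # One plausible draw process: include each document independently
      # with probability p(d).
      chosen = [d for d, prob in p_pooled.items() if rng.random() < prob]
      chosen += [d for d in unpooled_doc_ids if rng.random() < p_unpooled]
      return chosen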

Estimation of the Number of Relevant Documents for a Topic ("estR")

Let S be a subset of D.

Define JudgedRel(S) to be the set of documents in S which were judged relevant.
Define JudgedNonrel(S) to be the set of documents in S which were judged non-relevant.

Notes:

Define estRel(S) = min(sum of 1/p(d) for all d in JudgedRel(S), |S| - |JudgedNonrel(S)|)
(or 0 if |JudgedRel(S)| = 0).
Note: the min operator ensures that judged non-relevant documents are not inferred to be relevant. In particular, if all of the documents in S are judged, then estRel(S) will equal the actual number of relevant documents in S.

Let estR be the estimated number of relevant documents for the topic.
estR = estRel(D).

If estR is 0 for a topic, the topic will be discarded. Otherwise, estR is guaranteed to be at least 1.0.
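A minimal sketch of estRel(S) and estR under the definitions above. Here judged_rel and judged_nonrel are assumed to be the sets JudgedRel(S) and JudgedNonrel(S), and p maps every judged document to its inclusion probability p(d); the names are illustrative, not the official scoring code.

  def est_rel(S_size, judged_rel, judged_nonrel, p):
      # estRel(S): inverse-probability-weighted count of judged relevant documents,
      # capped so that judged non-relevant documents are never inferred relevant.
      if not judged_rel:
          return 0.0
      weighted = sum(1.0 / p[d] for d in judged_rel)
      return min(weighted, S_size - len(judged_nonrel))

  # estR is estRel applied to the whole collection D:
  #   est_R = est_rel(collection_size, judged_rel_D, judged_nonrel_D, p)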

Estimation of Recall@k for a Run

The L07 approach to estimating recall is similar to how infAP estimates its recall component, i.e. it's based on the run's judged relevant documents found to depth k compared to the total judged relevant documents. In our case, we have to weight the judged relevant documents to account for the sampling probability.

For a particular ranked retrieved set S:

Let S(k) be the set of top-k ranked documents of S.
Note: |S(k)| = min(k, |S|)

Define estimated Recall@k as
estRecall@k = estRel(S(k)) / estR
Note: estRecall@k will always be in the 0..1 range. Also, estRecall@k1 >= estRecall@k2 if k1 >= k2. Also, estR * estRecall@k <= k. Furthermore, if all of the documents in the collection are judged, estRecall@k would be equal to the actual Recall@k.
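Continuing the sketch above (same assumed inputs, plus the run's ranking and a dict of judgments), estRecall@k might be computed as follows:

  def est_recall_at_k(ranked_docs, k, judgments, p, est_R):
      # judgments maps judged document ids to True (relevant) or False
      # (non-relevant); unjudged documents are simply absent.
      top_k = ranked_docs[:k]                    # S(k)
      judged_rel = [d for d in top_k if judgments.get(d) is True]
      judged_nonrel = [d for d in top_k if judgments.get(d) is False]
      return est_rel(len(top_k), judged_rel, judged_nonrel, p) / est_R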

Estimation of Precision@k for a Run

The L07 approach to estimating precision is similar to how infAP estimates its precision component, i.e. it's based on the precision of the run's judged documents to depth k. In our case, we have to weight the judged documents to account for the sampling probability. We also multiply by a factor to ensure that unretrieved documents are not inferred to be relevant.

Define estNonrel(S) = min(sum of 1/p(d) for all d in JudgedNonrel(S), |S| - |JudgedRel(S)|)
(or 0 if |JudgedNonrel(S)| = 0)
Note: the min operator ensures that judged relevant documents are not inferred to be non-relevant.

Note: anomalously, it is possible that (estRel(S) + estNonrel(S)) > |S|.
However, if all of the documents in S are judged, then (estRel(S) + estNonrel(S)) = |S|.

Define estimated Precision@k as
estPrec@k = (estRel(S(k)) / (estRel(S(k)) + estNonrel(S(k)))) * (|S(k)| / k)
or 0 if both estRel(S(k)) and estNonrel(S(k)) are 0.
Note: estPrec@k will always be in the 0..1 range. Anomalously, k * estPrec@k might exceed estR.
However, if all of the documents in S(k) are judged, then estPrec@k would be equal to the actual Precision@k.
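And a matching sketch of estNonrel(S) and estPrec@k, mirroring the definitions above and reusing est_rel from the earlier sketch:

  def est_nonrel(S_size, judged_rel, judged_nonrel, p):
      # estNonrel(S): the same construction as estRel(S), with the roles of
      # the relevant and non-relevant judgments swapped.
      if not judged_nonrel:
          return 0.0
      weighted = sum(1.0 / p[d] for d in judged_nonrel)
      return min(weighted, S_size - len(judged_rel))

  def est_prec_at_k(ranked_docs, k, judgments, p):
      top_k = ranked_docs[:k]                    # S(k)
      judged_rel = [d for d in top_k if judgments.get(d) is True]
      judged_nonrel = [d for d in top_k if judgments.get(d) is False]
      rel = est_rel(len(top_k), judged_rel, judged_nonrel, p)
      nonrel = est_nonrel(len(top_k), judged_rel, judged_nonrel, p)
      if rel == 0.0 and nonrel == 0.0:
          return 0.0
      # The |S(k)|/k factor ensures unretrieved documents are not inferred relevant.
      return (rel / (rel + nonrel)) * (float(len(top_k)) / k)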

Anomalously, estPrec@k and estRecall@k are not in general guaranteed to favor the same system for a particular topic (though they will usually correlate). Note that infAP has analogous issues; e.g. if a system happens to retrieve all the judged relevant documents at the top, followed by all the judged non-relevant documents, and then stops, the system would get an infAP score of 1.0, implying perfect recall, even though there are very likely additional unsampled relevant documents that were not retrieved.

Straying from the infAP model in order to make the estPrec@k and estRecall@k measures perfectly consistent can lead to worse anomalies or inaccuracies. For example, define estRecall2@k = (estPrec@k * k) / estR. Then estRecall2@k and estPrec@k always agree on each topic, but estRecall2@k could decrease for larger k (unlike estRecall@k) and could exceed 100% (unlike estRecall@k), which are worse anomalies. Alternatively, define estPrec2@k = (estRecall@k * estR) / k. Then estRecall@k and estPrec2@k always agree on each topic, but estPrec2@k seems likely to be a less accurate estimate of precision than estPrec@k.

A Completely Worked Through (Small) Example

Let D be a collection of 100 documents, |D|=100.

Suppose we allow runs to submit the top-5 documents for a topic, i.e. m=5.

Suppose we are willing to judge v=6 documents for a topic, including v'=1 document that is not pooled.

Suppose there are 2 runs:

 run1 submits  d1, d2, d4, d6, d8
 run2 submits  d2, d3, d5, d7, d4

Then M (the pool) consists of 8 documents (d1 .. d8).

The h(d) values for the pooled documents are

   d :  d1   d2   d3   d4   d5   d6   d7   d8
 h(d):   1    1    2    3    3    4    4    5

We solve the following equation for C:
"sum of min(C / h(d), 1) over all d in M" = v-v' = 5,
finding that C ≈ 1.6 makes the inclusion probabilities sum to v-v' (5) for this pool.
Note: since this C is so small, we know our estimated scores could be very inaccurate, but for this example we will proceed.

The p(d) values for the pooled documents are hence (from min(C / h(d), 1))

   d :   d1    d2    d3    d4    d5    d6    d7    d8
 h(d):    1     1     2     3     3     4     4     5
 p(d): 1.00  1.00  0.80  0.53  0.53  0.40  0.40  0.32

For the 92 unpooled documents (from |D|-|M| = 100-8) the p(d) value will be 1/92 (from min(v' / (|D|-|M|), C/m, 1)).

Suppose our draw process, based on the p(d) values, selects the following 6 documents for judging:
V = {d1, d2, d3, d5, d7, d51}

And suppose the judging results are as follows:

    JudgedRel(V)    = {d2, d5}
    JudgedNonrel(V) = {d1, d3, d7, d51}

The estimated number of relevant items for the topic (estR) is hence
estR = min(1/p(d2) + 1/p(d5), |D| - |JudgedNonrel(D)|) = min(1/1.00 + 1/0.53, 100 - 4) = min(2.9, 96) = 2.9

Suppose the B value for this topic is B=3, and we wish to estimate Recall@3 and Precision@3 for run1 and run2.

For run1:

 S(3)               = {d1, d2, d4}
 |S(3)|             = 3
 JudgedRel(S(3))    = {d2}
 JudgedNonrel(S(3)) = {d1}

 estRel(S(3)) = min(sum of 1/p(d) for all d in JudgedRel(S(3)), 
                    |S(3)| - |JudgedNonrel(S(3))|)
              = min(1/p(d2), 3-1) = min(1/1.00, 2) = 1.0
 estRecall@3 = estRel(S(3)) / estR
             = 1.0 / 2.9 = 0.3

 estNonrel(S(3)) = min(sum of 1/p(d) for all d in JudgedNonrel(S(3)),
                       |S(3)| - |JudgedRel(S(3))|)
                 = min(1/p(d1), 3-1) = min(1/1.00, 2) = 1.0
 estPrec@3 = (estRel(S(3)) / (estRel(S(3)) + estNonrel(S(3)))) * (|S(3)|/3)
           = (1.0/(1.0+1.0))*(3/3) = 0.5

For run2:

 S(3)               = {d2, d3, d5}
 |S(3)|             = 3
 JudgedRel(S(3))    = {d2, d5}
 JudgedNonrel(S(3)) = {d3}

 estRel(S(3)) = min(sum of 1/p(d) for all d in JudgedRel(S(3)), 
                    |S(3)| - |JudgedNonrel(S(3))|)
              = min(1/p(d2) + 1/p(d5), 3-1) = min(1/1.00 + 1/0.53, 2)
              = min(2.9, 2) = 2
 estRecall@3 = estRel(S(3)) / estR
             = 2 / 2.9 = 0.7

 estNonrel(S(3)) = min(sum of 1/p(d) for all d in JudgedNonrel(S(3)),
                       |S(3)| - |JudgedRel(S(3))|)
                 = min(1/p(d3), 3-2) = min(1/0.8, 1) = min(1.25, 1) = 1
 estPrec@3 = (estRel(S(3)) / (estRel(S(3)) + estNonrel(S(3)))) * (|S(3)|/3)
           = (2/(2+1))*(3/3) = 0.67
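As a cross-check, the small Python sketches given with the definitions above reproduce these numbers (up to the rounding used in the example); all function names are from those sketches.

  # Data from the worked example.
  p = {'d1': 1.00, 'd2': 1.00, 'd3': 0.80, 'd4': 0.53, 'd5': 0.53,
       'd6': 0.40, 'd7': 0.40, 'd8': 0.32, 'd51': 1.0 / 92}
  judgments = {'d2': True, 'd5': True,
               'd1': False, 'd3': False, 'd7': False, 'd51': False}
  judged_rel_D = [d for d, r in judgments.items() if r]
  judged_nonrel_D = [d for d, r in judgments.items() if not r]

  est_R = est_rel(100, judged_rel_D, judged_nonrel_D, p)       # 2.887 (2.9 above)

  run1 = ['d1', 'd2', 'd4', 'd6', 'd8']
  run2 = ['d2', 'd3', 'd5', 'd7', 'd4']
  print(est_recall_at_k(run1, 3, judgments, p, est_R))          # 0.346 (0.3 above)
  print(est_prec_at_k(run1, 3, judgments, p))                   # 0.5
  print(est_recall_at_k(run2, 3, judgments, p, est_R))          # 0.693 (0.7 above)
  print(est_prec_at_k(run2, 3, judgments, p))                   # 0.667 (0.67 above)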

References

[Ba05] Jason R. Baron. Toward a Federal Benchmarking Standard for Evaluating Information Retrieval Products Used in E-Discovery. In 6 Sedona Conference Journal 237 (2005).

[Ba06] Jason R. Baron, David D. Lewis and Douglas W. Oard. TREC-2006 Legal Track Overview. In Proceedings of TREC 2006.

[Bu06] Chris Buckley, Darrin Dimmick, Ian Soboroff and Ellen Voorhees. Bias and the Limits of Pooling. In Proceedings of ACM SIGIR'06.

[Bue06] Stefan Büttcher, Charles L. A. Clarke and Ian Soboroff. The TREC 2006 Terabyte Track. In Proceedings of TREC 2006.

[To06] Stephen Tomlinson. Experiments with the Negotiated Boolean Queries of the TREC 2006 Legal Discovery Track. In Proceedings of TREC 2006.

[Yi06] Emine Yilmaz and Javed A. Aslam. Estimating Average Precision with Incomplete and Imperfect Judgments. In Proceedings of ACM CIKM'06.

[Zo98] Justin Zobel. How Reliable are the Results of Large-Scale Information Retrieval Experiments? In Proceedings of ACM SIGIR'98.


Submitting Results

Like last year, participating sites are invited to submit results from up to 8 runs for official scoring. To facilitate comparisons, we are asking groups this year to include at least one standard condition run that is based only on the (typically one-sentence) request text field. (The reference Boolean results should not be used for the standard condition run, as they are based on a different topic field.)

For each run, up to max(25000, B) results per topic will be accepted. The submission format is the same as last year:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0      - unused field (the literal 'Q0')
docno   - document identifier, taken from the <tid> field of the document
rank    - rank assigned to the document (1 = highest)
score   - numeric score that is non-increasing with rank.  It is the
          score, not the rank, that is used by the evaluation software (see
          the "TREC 2007: Welcome to TREC" message from Ellen Voorhees
          for more details)
tag     - unique identifier for this run (the same for every topic and document)

Participating sites should each adopt some standard convention for
tags that begins with a unique identifier for the site and ends with a
unique run identifier. Tags are limited to 12 letters or numbers with
no punctuation.  
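For concreteness, a hypothetical excerpt from a run file might look like the following; the topic number, document identifiers, scores, and tag are all invented for illustration (actual document identifiers come from the <tid> field of the collection documents):

  52 Q0 aaa00e00 1 0.9213 sitexauto1
  52 Q0 bcd01a10 2 0.8871 sitexauto1
  52 Q0 cde02c01 3 0.8514 sitexauto1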

For each submitted run, the following information will be collected:

Schedule

   Jul 1	      Evaluation topic release
   Aug 1	      Runs submitted by sites
   Oct 1	      Results release
   mid-Oct	      Working notes papers due
   Nov 6-9	      TREC 2007, Gaithersburg, MD

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at) nara.gov
Douglas W. Oard     oard (at) umd.edu
Paul Thompson       paul.thompson (at) dartmouth.edu
Stephen Tomlinson   stephent (at) magma.ca