TREC 2008 Legal Track: Ad Hoc and Relevance Feedback Task Guidelines

The objective of the Ad Hoc and Relevance Feedback tasks of the TREC Legal Track is to evaluate the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation.

Revision History

Prerequisites

New participants are welcome! This is the third year of the TREC Legal Track, a forum for objective study of e-discovery techniques. Like last year, the track will feature 3 tasks: Ad Hoc, Relevance Feedback and Interactive. This document describes the Ad Hoc and Relevance Feedback tasks. For information on the Interactive task, please see the track web site at http://trec-legal.umiacs.umd.edu/.

All participants must be registered with TREC (in particular, access to the information on the Active Participants page and the ability to submit results is limited to registered participants). If you are not registered, please respond to the Call For Participation as soon as possible at http://trec.nist.gov/call08.html.

All participants should also be subscribed to the track mailing list to participate in any track discussion and to be informed of any late-breaking announcements. Contact oard (at) umd.edu to be added to the list. Archives of the mailing list are available from the track web site at http://trec-legal.umiacs.umd.edu/.

While this document is intended to be self-contained, participants are requested to be familiar with past runnings of TREC and the Legal Track. In particular, there are links to past Legal Track overview papers, guidelines, participant papers and presentations on the Legal Track web site at http://trec-legal.umiacs.umd.edu/. More background on TREC is available from the TREC web site at http://trec.nist.gov/.

Optionally, participants may join a second, cross-track mailing list, called IREval, for discussing evaluation issues for both the Legal Track and other tracks:

It became clear at the TREC 2006 meeting that several different tracks are trying to address the problems associated with evaluating retrieval systems when traditional pooling is no longer viable (because the document set is too large, or there is not sufficient diversity in the runs, or there are not enough resources for sufficiently deep pools, etc.). To avoid having discussions of these issues splintered across multiple track mailing lists, NIST has set up a new mailing list for them. Since the problem is broader than just TREC, the mailing list is named IReval.
To join the list, send an email message to listproc@nist.gov such that the body of the message contains only the line "subscribe IReval <Firstname> <Lastname>". You must subscribe to the list to post to it, but otherwise the list is public. In particular, both the list archives and the list membership will be publicly available from the NIST mailing list web site. Replies to a message on the list will be routed to the entire list to facilitate discussion.

The archives of the IREval mailing list are available at http://cio.nist.gov/esd/emaildir/lists/ireval/maillist.html.

Overview of the Ad Hoc Task

The Ad Hoc task studies search of a fixed document collection using queries that the system has not seen before. This is the 3rd year of running the Ad Hoc task (sometimes known as the "main task" in past years). This year's task is enhanced to evaluate not just a system's ranking ability but also its ability to define a "set" of documents for review, balancing the requirements of recall and precision.

Overview of the Relevance Feedback Task

The Relevance Feedback task studies two-pass search in a controlled setting, with some relevant and non-relevant documents manually marked after the first pass. This is the 2nd year of running the Relevance Feedback task. This year, we anticipate that considerably more topics will be assessed, and we hope to see increased participation in the task.

Interactive Task

Participants who have developed systems for the Ad Hoc and/or Relevance Feedback tasks are encouraged to additionally enter the track's Interactive task this year. It is much like the Ad Hoc task, but substantially fewer topics are expected to be required (perhaps just 3 or so), enabling participants to apply interactive techniques (e.g. some manual review of documents for relevance feedback, enhancing the queries, or determining a good cutoff for the F1 measure) which might be too time-consuming to apply to the full set of Ad Hoc or Relevance Feedback topics.

For the details of the Interactive task, please find the guidelines from the track home page at http://trec-legal.umiacs.umd.edu/.

Documents

The document collection is the same as in the past 2 years:

The set of documents for the track will be the IIT Complex Document Information Processing test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco. These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The metadata and OCR can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail as a set of DVDs from NIST.

To download the collection, please fill out the form at the bottom of http://www.ir.iit.edu/projects/CDIP.html and you will be contacted with the ftp information.

Topics

The topics will be "production requests", as in the past 2 years:

Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the legal case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems.
Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g. a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate.
Queries that are formed completely automatically using software that existed at the time the evaluation queries were first seen are considered automatic; all other cases are considered manual queries. Automatic queries provide a reasonably well controlled basis for cross-system comparisons, although they are typically representative of only the first query in an interactive search process. The most common use of manual queries is to demonstrate the retrieval effectiveness that can be obtained after interactive optimization of the query (which typically results in excellent contributions to the judgment pool and is thus highly desirable), but even interventions as simple as manual removal of stopwords or stop structure will result in manual queries.
Reference Boolean run: For each topic, we intend to make available the list of documents which match the final negotiated Boolean query, for optional use by participants. The B values (the number of documents matching the Boolean query) will be included as part of the topic statement in the topic file. The format of the list of matching documents will be the same as the standard submission format for TREC ad hoc runs (described below); however, the documents will just be listed in alphabetical document-id order with all rsv and rank values set to a constant (i.e. no Boolean-based ranking will be provided, just the list of matching documents).
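
For participants who want to make programmatic use of the reference Boolean run, the following minimal sketch (in Python) loads it into per-topic sets of matching document ids. It assumes only the standard six-column run format described above; the function name and file name are illustrative, not part of the track's tooling.

  # Minimal sketch: load the reference Boolean run into per-topic sets of
  # matching document ids. Assumes the standard 6-column TREC run format.
  from collections import defaultdict

  def load_boolean_run(path):
      matches = defaultdict(set)     # topicid -> set of matching docnos
      with open(path) as f:
          for line in f:
              fields = line.split()
              if len(fields) != 6:
                  continue           # skip blank or malformed lines
              topicid, _q0, docno, _rank, _score, _tag = fields
              matches[topicid].add(docno)
      return matches

  # boolean_matches = load_boolean_run("refL08B")   # file name is illustrative
  # B = {topic: len(docs) for topic, docs in boolean_matches.items()}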

Note: The topics, relevance judgments and reference Boolean run for last year (2007) are archived at http://trec.nist.gov/data/legal07.html. An updated version of the l07_eval evaluation software (which includes the new F1@K and F1@R measures being investigated in 2008) is posted at http://trec-legal.umiacs.umd.edu/l07_eval_v20.zip.

Evaluation Methodology

Like last year, a deep sampling method will be used to estimate R (the total number of relevant documents for a topic) and to estimate the recall, precision and F1 of a run for a topic to depths of up to 100,000. The method will follow this outline:

Step 1: Pooling

Like last year, we intend to form a pool of documents for each topic consisting of all of the documents submitted by any run. Also, like last year, we intend to randomly select some documents from the rest of the collection for adding to the pool.

Last year, for the Ad Hoc task, there were 68 submitted runs which retrieved up to 25,000 documents per topic, and the resulting pool size was typically 300,000 documents for a topic. This year, runs may retrieve up to 100,000 documents per topic, hence we anticipate that the typical pool size could be more than a million documents this year.
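
As a rough illustration of this pooling step, the sketch below (in Python) forms a topic's pool as the union of all documents retrieved by any submitted run plus a random sample from the remainder of the collection. The sample size, seed and data structures are assumptions made for the example only; they are not the track's actual pooling procedure.

  # Illustrative sketch of Step 1: pool = union of all submitted documents
  # plus a random sample drawn from the rest of the collection.
  import random

  def build_pool(runs, all_docnos, extra_sample_size=1000, seed=2008):
      """runs: iterable of docno lists (one list per submitted run for this topic)."""
      pool = set()
      for run in runs:
          pool.update(run)
      remainder = [d for d in all_docnos if d not in pool]
      rng = random.Random(seed)                     # seed is an assumption
      sample = rng.sample(remainder, min(extra_sample_size, len(remainder)))
      pool.update(sample)
      return pool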

Step 2: Assigning of p(d) values (judging probabilities)

Tentatively, we plan to set p(d), the probability of judging pooled document d, as follows:

  If (hiRank(d) <= 5) { p(d) = 1.0; }
  Else { p(d) = min(1.0, ((5/100000)+(C/hiRank(d)))); }

where hiRank(d) is the highest rank at which any submitted run retrieved document d, and C is chosen so that the sum of all p(d) (for all submitted documents d) is the number of documents we can judge (last year this number was typically between 500 and 1000).

This approach would, like last year, give us a combination of top-5 document judging from all runs (which typically contributed between 100 and 200 documents to the judging sample last year) and deeper sampling (typically between 300 and 400 documents last year). Measures at depth-100,000 would have the accuracy of approximately 5+C simple random sample points, while measures at other depths would have the accuracy of approximately (at least) C simple random sample points. Last year (with a similar p(d) formula) the C values were fairly low (between 0.3 and 2.4 when just 500 documents were judged), limiting the accuracy of the estimates for individual topics. The mean scores (over multiple topics) should be somewhat more reliable.
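
As an illustration of how C might be calibrated, the sketch below (in Python) applies the tentative p(d) formula to the submitted documents and bisects on C until the expected number of judged documents matches the judging budget. The search bounds and iteration count are assumptions; this is a sketch of the formula above, not the official sampling code.

  # Sketch of Step 2: compute p(d) from hiRank(d) and calibrate C so that
  # the expected number of judged documents matches the judging budget.
  def p_of_d(hi_rank, C):
      if hi_rank <= 5:
          return 1.0
      return min(1.0, 5.0 / 100000 + C / hi_rank)

  def calibrate_C(hi_ranks, budget, lo=0.0, hi=1e6, iters=100):
      """hi_ranks: dict docno -> highest rank at which any run retrieved it."""
      # The sum of p(d) increases monotonically with C, so bisection applies.
      for _ in range(iters):
          mid = (lo + hi) / 2.0
          expected = sum(p_of_d(r, mid) for r in hi_ranks.values())
          if expected < budget:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2.0

  # C = calibrate_C(hi_ranks, budget=500)
  # probabilities = {d: p_of_d(r, C) for d, r in hi_ranks.items()}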

Step 3: Document Assessing

Based on the p(d) values, typically (at least) 500 documents will be drawn from each topic's pool for review by the assessors. (We anticipate that the assessing will be done by volunteers from the legal community, like last year.)

Note that the assessors review the scanned document images (such as http://legacy.library.ucsf.edu/tid/hdz83f00/pdf) rather than the xml-formatted text which is typically used by the automated systems.

Each document will be judged as "not relevant", "relevant", "highly relevant" or, in rare cases, left as "gray". (Gray documents are typically documents that are too long to review fully (e.g. more than 300 pages) and for which the assessor could not find a reason to judge them relevant after a partial review. This category is also used when a technical problem prevents the document from displaying properly in the assessor system.) Last year's guidelines for assessors are available at http://trec-legal.umiacs.umd.edu/TRECLega2007l_HowToGuide_Version1.1.doc (but there will be some updates this year, e.g. to define the "highly relevant" category).

Step 4: Estimating R, the Number of Relevant Documents for a Topic

The same formulas as last year will be used to estimate the number of relevant documents in the pool for each topic (along with the number of non-relevant, gray and (new this year) highly relevant documents):

Let D be the set of documents in the target collection. For the legal track, |D|=6,910,912.
Let S be a subset of D.
Define JudgedRel(S) to be the set of documents in S which were judged relevant (including those judged highly relevant).
Define JudgedNonrel(S) to be the set of documents in S which were judged non-relevant.
Define JudgedHighlyRel(S) to be the set of documents in S which were judged highly relevant (this is a subset of JudgedRel(S)).
Define Gray(S) to be the set of documents in S which were presented to the assessor but judged neither relevant (including highly relevant) nor non-relevant.

Define estRel(S) = min(sum of 1/p(d) for all d in JudgedRel(S), |S| - |JudgedNonrel(S)|)
(or 0 if |JudgedRel(S)| = 0).
Note: the min operator ensures that judged non-relevant documents are not inferred to be relevant. In particular, if all of the documents in S are judged, then estRel(S) will equal the actual number of relevant documents in S.

Let R be the estimated number of relevant documents for the topic.
R = estRel(D).

If R is 0 for a topic, the topic will be discarded. Otherwise, R is guaranteed to be at least 1.0.

Define estNonrel(S) = min(sum of 1/p(d) for all d in JudgedNonrel(S), |S| - |JudgedRel(S)|)
(or 0 if |JudgedNonrel(S)| = 0)
Note: the min operator ensures that judged relevant documents are not inferred to be non-relevant.

Define estGray(S) = min(sum of 1/p(d) for all d in Gray(S), |S| - (|JudgedRel(S)| + |JudgedNonrel(S)|))
(or 0 if |Gray(S)| = 0)

Define estHighlyRel(S) = min(sum of 1/p(d) for all d in JudgedHighlyRel(S), |S| - (|JudgedNonrel(S)| + (|JudgedRel(S)| - |JudgedHighlyRel(S)|)))
(or 0 if |JudgedHighlyRel(S)| = 0).

Let Rh be the estimated number of highly relevant documents for the topic.
Rh = estHighlyRel(D).

Last year, the average R value for the Ad Hoc task was 16,904.
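
For concreteness, the estimators defined above translate directly into code. The following Python sketch implements estRel and estNonrel for a judged sample drawn from a set S; the judgment labels and variable names are illustrative and are not those of the official evaluation software.

  # Sketch of the Step 4 estimators. `judgments` maps docno -> one of
  # "nonrel", "rel", "highly_rel" or "gray" (judged documents in S only);
  # `p` maps docno -> its judging probability; `size_S` is |S|.
  def est_rel(judgments, p, size_S):
      rel = [d for d, j in judgments.items() if j in ("rel", "highly_rel")]
      nonrel = [d for d, j in judgments.items() if j == "nonrel"]
      if not rel:
          return 0.0
      # min() ensures judged non-relevant documents are not inferred relevant
      return min(sum(1.0 / p[d] for d in rel), size_S - len(nonrel))

  def est_nonrel(judgments, p, size_S):
      rel = [d for d, j in judgments.items() if j in ("rel", "highly_rel")]
      nonrel = [d for d, j in judgments.items() if j == "nonrel"]
      if not nonrel:
          return 0.0
      return min(sum(1.0 / p[d] for d in nonrel), size_S - len(rel))

  # R for a topic is est_rel applied over the whole collection D (|D| = 6,910,912);
  # a topic with R = 0 is discarded.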

Step 5: Estimating Recall@K, Precision@K, F1@K and F1@R for a Topic

The formulas for recall and precision will be the same as last year:

For a particular ranked retrieved set S:

Let S(k) be the set of top-k ranked documents of S.
Note: |S(k)| = min(k, |S|)

Define Recall@k = estRel(S(k)) / R

Define Precision@k = (estRel(S(k)) / (estRel(S(k)) + estNonrel(S(k)))) * (|S(k)| / k)
or 0 if both estRel(S(k)) and estNonrel(S(k)) are 0.

Anomalously, the estimated Recall@k and Precision@k are not in general guaranteed to favor the same system for a particular topic (though they will usually correlate). Alternate definitions that make the estimates perfectly consistent can lead to worse anomalies or inaccuracies. For example, define Recall2@k = (Precision@k * k) / R. Then Recall2@k and Precision@k always agree on each topic, but Recall2@k could decrease for larger k (unlike Recall@k) and Recall2@k could exceed 100% (unlike Recall@k), which are worse anomalies. Alternately, define Prec2@k = (Recall@k * R) / k. Then Recall@k and Prec2@k always agree on each topic, but Prec2@k seems likely to be a less accurate estimate of precision than Precision@k.

Define F1@k = 2 * Precision@k * Recall@k / (Precision@k + Recall@k)
or 0 if both Precision@k and Recall@k are 0.

The K and B values are integers and hence can be substituted for k in the above formulas. R, however, can be fractional, hence we provide the following additional definition:

Define F1@R = F1@Rceil
where Rceil is the ceiling of R (i.e. the smallest integer greater than or equal to R).
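
Building on the Step 4 sketch above, the following Python sketch shows how Recall@k, Precision@k, F1@k and F1@R could be computed for one topic from the judged sample. It is an illustration of the formulas above, not the official l07_eval implementation.

  # Sketch of the Step 5 measures for one topic, reusing est_rel/est_nonrel
  # from the Step 4 sketch. `ranked_docs` is the run's ranking (best first);
  # `judgments` and `p` are restricted to this topic's judged sample.
  import math

  def measures_at_k(ranked_docs, k, judgments, p, R):
      top_k = ranked_docs[:k]                    # |S(k)| = min(k, |S|)
      judged_top = {d: judgments[d] for d in top_k if d in judgments}
      er = est_rel(judged_top, p, len(top_k))
      en = est_nonrel(judged_top, p, len(top_k))
      recall = er / R if R > 0 else 0.0
      precision = 0.0 if er + en == 0 else (er / (er + en)) * (len(top_k) / k)
      f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
      return recall, precision, f1

  # F1@R uses k = Rceil, the ceiling of R:
  # _, _, f1_at_R = measures_at_k(ranked_docs, math.ceil(R), judgments, p, R)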

Note: To support training based on last year's Ad Hoc test collection, an updated version of last year's l07_eval utility has been made available which includes the new F1@K and F1@R measures for 2008. Please see http://trec-legal.umiacs.umd.edu/l07_eval_v20.zip.

More background on the evaluation methodology: Last year's method (described in last year's guidelines and in the 2007 overview paper) was referred to as the "L07 Method". Conceptually it turned out to be very similar to the "statAP" method evaluated by Northeastern University in the TREC 2007 Million Query Track. (The common ancestor was the "infAP" method, which also came from Northeastern.) Both methods associate a probability with each document judgment. Our approach used deeper sampling and more judgments per topic, but used many fewer test topics than the Million Query Track. The methods also assigned probabilities differently (though this should not matter on average) and targeted different measures. Please consult the Northeastern work for a more detailed discussion of the theoretical underpinnings of measure estimation.

Submitting Results

For the Ad Hoc task, participating sites are invited to submit results from up to 8 runs for official scoring. For each run, 100,000 results will be accepted for each topic.

For the Relevance Feedback task, participating sites are invited to submit results from up to 8 runs for official scoring. For each run, 101,000 results will be accepted for each topic.

The run submission format for both tasks is the standard TREC format used in past years:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0 -	  unused field (the literal 'Q0')
docno -	  document identifier taken from the <tid> field of the document
rank -	  rank assigned to the document (1=highest)
score -   numeric score that is non-increasing with rank.  It is the
          score, not the rank, that is used by the evaluation software (see
          the "TREC 2008: Welcome to TREC" message from Ellen Voorhees
          (Feb 26/08) for more details)
tag -	  unique identifier for this run (same for every topic and document)

Participating sites should each adopt some standard convention for
tags that begins with a unique identifier for the site and ends with a
unique run identifier. Tags are limited to 12 letters or numbers with
no punctuation.  

Note that at least one document must be submitted for each topic; otherwise, the submission system will not accept the run. If you wish to evaluate a system that does not retrieve a document for a particular topic, we suggest just having it submit the 1st document of the collection (aaa00a00) as a placeholder.

For the run's .K file (i.e. file of K values), specify a separate line for each topic. Each line must consist of the topic number, followed by whitespace, followed by the K value. The K value must be an integer between 0 and 100,000 inclusive for the Ad Hoc task and between 0 and 101,000 inclusive for the Relevance Feedback task. The lines must be in increasing order by topic number. (For an example, please see the provided refL08B.K file for the Ad Hoc task or refRF08B.K file for the Relevance Feedback task.)

For the run's .Kh file (i.e. file of Kh values), the format is the same as for the .K file.

Note: for run submission, the .K and .Kh data must be appended to the run file. The submitted run file must therefore contain the run's retrieved documents for all topics, followed by the K lines for all topics, followed by the Kh lines for all topics, as in the example below.

For example, if there were 3 topics (103, 104, 105), an example combined file could be:

103 Q0 aaa00a00 0 0.9999 runL08
104 Q0 aaa00a00 0 0.9999 runL08
105 Q0 aaa00a00 0 0.9999 runL08

103 16000
104 12000
105 18000
103 8000
104 7000
105 9000

If you do not wish to study how to set K values, you could just append the provided constL08.append file for each Ad Hoc task run and refRF08B.append file for each Relevance Feedback task run.
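
As an informal illustration of the combined file layout (retrieved documents, then the K lines, then the Kh lines), the Python sketch below writes such a file from in-memory results. The function and variable names are made up for the example; the authoritative format check is the NIST perl script mentioned below.

  # Illustrative sketch: assemble a combined submission file in the order
  # retrieved documents, then K lines, then Kh lines.
  def write_submission(path, results, k_values, kh_values, tag):
      """results: dict topicid -> list of (docno, score) with scores non-increasing.
      k_values / kh_values: dict topicid -> integer cutoff."""
      with open(path, "w") as out:
          for topicid in sorted(results, key=int):
              for rank, (docno, score) in enumerate(results[topicid], start=1):
                  out.write("%s Q0 %s %d %f %s\n" % (topicid, docno, rank, score, tag))
          for topicid in sorted(k_values, key=int):
              out.write("%s %d\n" % (topicid, k_values[topicid]))
          for topicid in sorted(kh_values, key=int):
              out.write("%s %d\n" % (topicid, kh_values[topicid]))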

Note: it's strongly recommended that you check your runs' formatting with the NIST perl check script for your task before attempting to submit.

For each submitted run, the following information will be collected:

Schedule

   June 1        Relevance Feedback topic release
   June 15       Ad Hoc topic release (new topics)
   July 20-24    SIGIR 2008 conference in Singapore (some track and NIST
                  coordinators may be unavailable around this time)
   Aug 6         Runs submitted by sites (both Ad Hoc and RF tasks)
   Oct 1         Results release
   mid-Oct       Working notes papers due
   Nov 18-21     TREC 2008, Gaithersburg, MD

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at) nara.gov
Bruce Hedin         bhedin (at) h5technologies.com
Douglas W. Oard     oard (at) umd.edu
Stephen Tomlinson   stephent (at) magma.ca