TREC 2009 Legal Track: Batch Task Guidelines

The objective of the Batch task of the TREC Legal Track is to evaluate the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation.

New participants are welcome! This is the fourth year of the TREC Legal Track, a forum for objective study of e-Discovery techniques. This year the track features two tasks: Batch and Interactive. This document describes the Batch task; for information on the Interactive task, please see the track web site.

All participants must be registered with TREC; in particular, access to the information on the Active Participants page and the ability to submit results are limited to registered participants. If you are not registered, please respond to the Call for Participation as soon as possible.

All participants should also be subscribed to the track mailing list to take part in track discussion and to be informed of any late-breaking announcements. Contact oard (at) to be added to the list. Archives of the mailing list are available from the track web site.

While this document is intended to be self-contained, participants are encouraged to become familiar with past runnings of TREC and the Legal Track. In particular, links to past Legal Track overview papers, guidelines, participant papers and presentations are available on the Legal Track web site. More background on TREC is available from the TREC web site.

Optionally, participants may join a second, cross-track mailing list, called IReval, for discussing evaluation issues for both the Legal track and other tracks.

It became clear at the TREC 2006 meeting that several different tracks are trying to address the problems associated with evaluating retrieval systems when traditional pooling is no longer viable (because the document set is too large, there is not sufficient diversity in the runs, there are not enough resources for sufficiently deep pools, etc.). To avoid having discussions of these issues splintered across multiple track mailing lists, NIST has set up a new mailing list for them. Since the problem is broader than just TREC, the mailing list is named IReval.
To join the list, send an email message whose body contains only the line

    subscribe IReval <Firstname> <Lastname>

You must subscribe to the list to post to it, but otherwise the list is public; in particular, both the list archives and the list membership will be publicly available from the NIST mailing list web site. Replies to a message on the list will be routed to the entire list to facilitate discussion.

The archives of the IReval mailing list are available from the NIST mailing list web site.

Overview of the Batch Task

The Batch task of the TREC 2009 Legal Track is a successor to the Ad Hoc and Relevance Feedback tasks of past years.

What are the goals of the Batch task?

The Batch task supports researching whether substantial improvements in e-Discovery result set quality can be achieved by making use of a set of existing relevance judgments or by other novel ad hoc search techniques.

What document set will be used?

The task will use the same IIT CDIP collection as the past few years (~7 million documents). More information is below.

What test queries will be used?

The task will re-use some of the production request topics from previous years, including the 3 Interactive Task topics from last year which have larger than usual numbers of relevance judgments (up to 6500 for one topic).

Will the evaluation just re-use the old relevance judgments?

(updated June 1/09) No, the old judgments (which are an allowed input to the submitted runs) will not be used at all in this year's evaluation. And unlike the feedback tasks of the past couple of years, residual evaluation will not be used this year. The samples for assessment will be drawn from all of the submitted documents. This year's assessors will not be advised whether any of the drawn documents happen to be documents that were previously judged. Hence it is quite possible that documents judged non-relevant in the past will be judged as relevant for this year's evaluation, and that documents judged relevant in the past will be judged as non-relevant for this year's evaluation. This approach may be seen as modeling the real-world issue that the e-Discovery service provider's internal judgments used to train the system may not perfectly match the final authority's view of relevance.

How many documents may be submitted for a topic?

A key difference from last year is that we will allow up to 1.5 million documents per topic, 15 times more than last year. This number is large enough to cover the highest estimated number of relevant documents for any past topic (786,862 for topic 103), and it is also large enough to accommodate any of last year's Boolean runs (including the plaintiff Boolean run, which matched 1,194,522 documents for topic 133).

How many test topics are used?

This year's task will use just 10 topics (down from the 40-50 typically used in past query sets). A submitted run will still contain more than 3 times as many documents as last year's Ad Hoc runs; its size will likely be up to ~250MB compressed and ~825MB uncompressed. While this number of test topics is likely too few to establish small gains in average performance, it is hoped to be enough to indicate whether a technique provides the large, consistent gains over past baseline techniques that the track is aiming for. The smaller number of topics may also encourage more exploration of manually-aided techniques (e.g. custom Boolean queries). Also, as the submission pools will be much larger than in past years, we may require multiple assessors per topic (as was done in the Interactive task last year) to attain enough judged documents for accurate score estimates.

How many submitted runs are allowed per research group?

This year just 3 submitted runs per participant are allowed (down from 8 in past years). This number is similar to some other tracks, and it keeps the total footprint of a full submission set to only 25% more than last year's Ad Hoc task. Also, a standard condition run is not requested this year.

What is the main evaluation measure?

We will use the same F1@K measure as last year. (There was talk of inventing a new utility measure accounting for financial cost at the workshop last year, but we did not receive a specific proposal.) Like last year, some traditional rank-based measures such as R-Precision (or equivalently, Recall@R or F1@R) will also be reported.

Will the 'highly relevant' judgment category be used this year?

Yes, we will collect Kh values (in addition to K values) for each run again. However, some of the topics may not have highly relevant documents marked in the existing judgments.

What training data is there?

The richest training data is the 3 Interactive test topics from last year (numbers 102, 103 and 104), which have from 2500 to 6500 judgments per topic. These judgments also went through an adjudication process last year (as described in last year's track overview paper).

Is tuning or training allowed on the same topics that are in the test set?

Yes, in fact feedback is an encouraged technique for this task! One possibility for tuning or training would be to use half of the existing judgments for developing the system, and then the other half for self-evaluating it. (If there is interest, we could perhaps define a standard training set and evaluation set of judgments for each test topic.) For most topics, of course, past sampling was fairly sparse for a lot of the collection, so the accuracy of a self-evaluation may be limited, particularly for novel systems, though it likely would still be worthwhile.
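As a minimal sketch of such a split (the function name, fixed seed and 50/50 ratio are our own illustration, not a track-defined protocol):

```python
import random

def split_judgments(judgments, seed=2009):
    """Randomly split one topic's (docno, is_relevant) judgments into a
    development half (for tuning the system) and a held-out half (for
    self-evaluation afterwards)."""
    shuffled = list(judgments)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy judgments: 100 documents, alternating relevant/non-relevant.
judgments = [("doc%05d" % i, i % 2 == 0) for i in range(100)]
dev_half, eval_half = split_judgments(judgments)
```

Because the split is random rather than, say, chronological, both halves should reflect roughly the same mix of relevant and non-relevant documents.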

Is it allowed to make use of submitted Interactive runs from last year?

No, use of past Interactive runs is not allowed. For example, it would not be very interesting if a run achieved an F1 score of 0.7 on topic 103 by simply copying one of the Interactive runs from last year. We would like to see whether runs can achieve this level of performance using just the topic statement and/or the 6500 official judgments for this topic.

Is using the known judgments a required technique?

No, we hope to receive a variety of ad hoc and feedback approaches, both established baseline runs and novel techniques.

What is the proposed schedule?

Please see below.


The document collection is the same as the past 3 years:

The set of documents for the track will be the IIT Complex Document Information Processing test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco. These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The metadata and OCR text can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail from NIST as a set of DVDs.

To download the collection, please fill out the request form, and you will be contacted with the FTP information.


The topics will be "production requests", a subset of those used in the past 3 years:

Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the legal case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems.
Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g. a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate.

Participants are also encouraged (but not required) to make use of the past relevance judgments for the topics. The past judgments for this year's topics will be collected and provided in one file for convenience.

We also note that it may be possible to take advantage of the collection's rich metadata. As has been noted, "Particularly intriguing are the many levels of physical proximity (Bates numbers, file boxes, etc.) of documents that can be suggestive of related meanings."

Reference Boolean run: For each topic, we also will make available the list of documents which match the final negotiated Boolean query, for optional use by participants. The B values (the number of documents matching the Boolean query) will be included as part of the topic statement in the topic file. The format of the list of matching documents will be the same as the standard submission format for TREC ad hoc runs (described below); however, the documents will just be listed in alphabetical document-id order with all rsv and rank values set to a constant (i.e. no Boolean-based ranking will be provided, just the list of matching documents).

Evaluation Methodology

As noted in the overview, the primary measure will be the same F1@K measure as last year: F1@K = 2 * Precision@K * Recall@K / (Precision@K + Recall@K), or 0 if both Precision@K and Recall@K are 0. K is chosen by the system separately for each topic; it is an integer between 0 and 1,500,000 inclusive, representing the threshold at which the system believes the competing demands of recall and precision are best balanced as per the F1@K measure.
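The computation can be sketched as follows (a minimal illustration; the function and argument names are ours, and in the official evaluation the relevant counts come from sampling-based estimates rather than exhaustive judgments):

```python
def f1_at_k(rel_at_k, k, total_relevant):
    """F1@K: harmonic mean of Precision@K and Recall@K (0 if both are 0).

    rel_at_k       -- number of relevant documents among the first K retrieved
    k              -- the cutoff K chosen by the system for this topic
    total_relevant -- R, the (estimated) number of relevant documents
    """
    precision = rel_at_k / k if k > 0 else 0.0
    recall = rel_at_k / total_relevant if total_relevant > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A run that retrieves 600 of 1000 relevant documents in its first K=1500:
# Precision@K = 0.4, Recall@K = 0.6, so F1@K = 0.48
score = f1_at_k(600, 1500, 1000)
```

Note that because F1@K is computed at a single system-chosen cutoff, picking K well matters as much as ranking well.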

Like last year, a deep sampling method will be used to estimate R (the total number of relevant documents for a topic) and to estimate the recall, precision and F1 of a run for a topic at depth K (or any other depth). Please see last year's guidelines for more information on sampling approaches. This year's approach may assign judging probabilities differently than last year's (e.g. we are not guaranteeing to judge the first 5 documents retrieved by each run for each topic this year).
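The general idea of such sampling-based estimation (as a sketch only, not the track's exact procedure) can be illustrated with a Horvitz-Thompson-style estimator: each judged relevant document is weighted by the inverse of the probability with which it was drawn for assessment.

```python
def estimate_total_relevant(judged_sample):
    """Horvitz-Thompson-style estimate of R (total relevant documents).

    judged_sample -- list of (is_relevant, inclusion_probability) pairs for
                     the documents drawn for assessment; each relevant
                     document stands in for 1/p collection documents.
    """
    return sum(1.0 / p for is_relevant, p in judged_sample if is_relevant)

# Toy sample: 3 relevant documents drawn with probability 0.01 (each
# represents 100 documents) and 1 drawn with probability 0.5 (represents 2),
# giving an estimated R of 302.
sample = [(True, 0.01), (True, 0.01), (True, 0.01), (False, 0.01), (True, 0.5)]
estimated_R = estimate_total_relevant(sample)
```

Assigning higher inclusion probabilities to highly-ranked documents concentrates the assessment effort where runs claim relevance, while the inverse-probability weights keep the estimate unbiased.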

As noted in the overview, this year's evaluation is not residual, so past judged documents should be included in the submitted result sets (if the system considers them possibly relevant). If this year's sampling happens to draw documents that were previously judged, this year's assessors will re-assess them (with no knowledge of the previous assessment), and just this year's assessments will be used in this year's evaluation. Past judgments hence should not be considered authoritative, but as just one opinion of relevance for a past sample of documents.

Like last year, the new assessments may include "highly relevant" judgments (in addition to "relevant" and "not relevant"). Hence Kh values for targeting measures which just count highly relevant documents as relevant will be collected again this year.

For possible use in training, the most recent version of the l07_eval evaluation software (version 2.1) is available, and is documented in a file posted (along with other resources from last year) on the track web site.

Submitting Results

For the Batch task, participating sites are invited to submit results from up to 3 runs for official scoring. For each run, up to 1,500,000 results per topic will be accepted.

The run submission format for the task is the standard TREC format used in past years:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0 -	  unused field (the literal 'Q0')
docno -	  document identifier taken from the <tid> field of the document
rank -	  rank assigned to the document (1=highest)
score -   numeric score that is non-increasing with rank.  It is the
          score, not the rank, that is used by the evaluation software (see
          the "TREC 2009: Welcome to TREC" message from Ellen Voorhees
          (Feb 25/09) for more details)
tag -	  unique identifier for this run (same for every topic and document)

Participating sites should each adopt some standard convention for
tags that begins with a unique identifier for the site and ends with a
unique run identifier. Tags are limited to 12 letters or numbers with
no punctuation.  
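As a sketch, a ranked result list can be written in this format as follows (the file name, document ids, scores and tag below are illustrative):

```python
def write_run(path, topic_results, tag):
    """Write results in the standard TREC run format:
        topicid Q0 docno rank score tag
    topic_results maps each topic id to a list of (docno, score) pairs,
    already sorted by non-increasing score."""
    with open(path, "w") as out:
        for topicid, results in sorted(topic_results.items()):
            for rank, (docno, score) in enumerate(results, start=1):
                out.write("%d Q0 %s %d %.4f %s\n"
                          % (topicid, docno, rank, score, tag))

write_run("runL09.txt",
          {103: [("aaa00a00", 0.9999), ("aaa00a01", 0.9871)]},
          "siteLrun1")
```

Remember that it is the score column, not the rank column, that the evaluation software uses, so scores must be non-increasing within each topic.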

Note that at least one document must be submitted for each topic; otherwise, the submission system will not accept the run. If you wish to evaluate a system that does not retrieve a document for a particular topic, we suggest just having it submit the 1st document of the collection (aaa00a00) as a placeholder.

The .K and .Kh data must be appended to the submitted run file, which otherwise uses the same format as last year: after the run lines, append the K values (one "topicid K" line per topic) and then the Kh values (one "topicid Kh" line per topic).

For example, if there were 3 topics (103, 104, 105), a combined file could be

103 Q0 aaa00a00 0 0.9999 runL09
104 Q0 aaa00a00 0 0.9999 runL09
105 Q0 aaa00a00 0 0.9999 runL09

103 16000
104 12000
105 18000
103 8000
104 7000
105 9000

If you do not wish to study how to set K and Kh values, you could just append the provided refL09B.append or estRelL09.append file for each run.
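If you do set your own thresholds, appending them can be sketched as follows (the function name and threshold values are illustrative; the output follows the combined-file layout shown above):

```python
def append_thresholds(path, k_values, kh_values):
    """Append the .K data (one 'topicid K' line per topic) followed by the
    .Kh data (one 'topicid Kh' line per topic) to an existing run file."""
    with open(path, "a") as out:
        out.write("\n")  # blank line separating the run lines from the K data
        for topicid in sorted(k_values):
            out.write("%d %d\n" % (topicid, k_values[topicid]))
        for topicid in sorted(kh_values):
            out.write("%d %d\n" % (topicid, kh_values[topicid]))

append_thresholds("runL09.txt",
                  {103: 16000, 104: 12000, 105: 18000},
                  {103: 8000, 104: 7000, 105: 9000})
```

Each Kh value should be no larger than the corresponding K value, since the highly-relevant documents are a subset of the relevant ones.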

Note: it's strongly recommended that you check your runs' formatting with the NIST perl check script for your task before attempting to submit.

For each submitted run, the following information will be collected:

Schedule

   June 15            Topic release.
   July 19-23         SIGIR 2009 conference in Boston (some track and NIST
                       coordinators may be unavailable around this time)
   Aug 4	      Runs submitted by sites
   Oct 1	      Results release
   mid-Oct	      Working notes papers due
   Nov 17-20	      TREC 2009, Gaithersburg, MD

For Additional Information

Please see the track web site, which contains links to resources and background information, including the track mailing list archives. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at)
Bruce Hedin         bhedin (at)
Douglas W. Oard     oard (at)
Stephen Tomlinson   stephent (at)