TREC 2007 Legal Discovery Track: Main Task Guidelines

The objective of the main task of the TREC Legal Discovery Track is to evaluate the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation.

Revision History

Prerequisites

New participants are welcome! This is the second year of the TREC Legal Discovery Track, and this year's main task is the successor to the single task run in last year's track. This guidelines document assumes familiarity with TREC and with last year's Legal track. In particular, all participants are requested to have read the following:

All participants should also be subscribed to the track mailing list to participate in any track discussion and to be informed of any late-breaking announcements. Contact oard (at) umd.edu to be added to the list. Archives of the mailing list are available from the track web site at http://trec-legal.umiacs.umd.edu/.

Optionally, participants may also join a second, cross-track mailing list, IReval, for discussing evaluation issues that span the Legal track and other tracks:

It became clear at the TREC 2006 meeting that several different tracks are trying to address the problems associated with evaluating retrieval systems when traditional pooling is no longer viable (because the document set is too large, there is insufficient diversity in the runs, there are not enough resources for sufficiently deep pools, etc.). To avoid having discussions of these issues splintered across multiple track mailing lists, NIST has set up a new mailing list for these discussions. Since the problem is broader than just TREC, the mailing list is named IReval.
To join the list, send an email message to listproc@nist.gov whose body contains only the line

  subscribe IReval <Firstname> <Lastname>

You must subscribe to the list to post to it, but otherwise the list is public. In particular, both the list archives and the list membership will be publicly available from the NIST mailing list web site. Replies to a message on the list will be routed to the entire list to facilitate discussion.

The archives of the IReval mailing list are available at http://cio.nist.gov/esd/emaildir/lists/ireval/maillist.html.

Overview of Differences From Last Year's Task

Documents

The document collection is the same as last year:

The set of documents for the track will be the IIT Complex Document Information Processing test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco. These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The metadata and OCR text can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail from NIST as a set of DVDs.

To download the collection, please fill out the form at the bottom of http://www.ir.iit.edu/projects/CDIP.html and you will be contacted with the ftp information.

Topics

The topics will be "production requests" like last year:

Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the legal case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems.
Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g. a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate.
Queries that are formed completely automatically, using software that existed at the time the evaluation topics were first seen, are considered automatic; all other cases are considered manual queries. Automatic queries provide a reasonably well controlled basis for cross-system comparisons, although they are typically representative of only the first query in an interactive search process. The most common use of manual queries is to demonstrate the retrieval effectiveness that can be obtained after interactive optimization of the query (such runs typically make excellent contributions to the judgment pool and are thus highly desirable), but even interventions as simple as manual removal of stopwords or stop structure result in manual queries.

Last year's topics and relevance judgments are available for training (topics: http://trec-legal.umiacs.umd.edu/trec_legal_eval_topics.v1.3.zip; relevance judgments (qrels): http://trec.nist.gov/data/legal06.html).

New to 2007: For each topic, we intend to make available the list of documents that match the final negotiated Boolean query, for optional use by participants. The B values (the number of documents matching the Boolean query) will be included as part of the topic statement in the topic file. The format of the list of matching documents will be the same as the standard submission format for TREC ad hoc runs (described below); however, the documents will simply be listed in alphabetical document-id order with all rsv and rank values set to a constant (i.e. no Boolean-based ranking will be provided, just the list of matching documents).

Update May 2/07: For training, the lists of documents matching last year's Boolean queries are now available by downloading the "refL06B" run from the Active Participants / Tracks Homepage at http://trec.nist.gov/act_part/tracks.html (direct link: http://trec.nist.gov/act_part/tracks/legal/input.refL06B.gz).

There are also two new training versions of last year's topics (fullL06.xml and shortL06.xml) available in http://trec-legal.umiacs.umd.edu/trec_legal_eval_topics.v1.3.zip:

It is anticipated that this year's reference Boolean run will be made available (July 1) in the same format as refL06B, and that this year's new topics will be made available (July 1) in both the fullL06.xml and shortL06.xml formats.


Evaluation Approach

Background on Estimating Precision and Recall

The Accuracy Issue of Simple Random Sampling: Even though simple random sampling produces unbiased estimates of a parameter, the individual estimates are usually too inaccurate to be useful for comparing retrieval systems. For example, suppose the target collection has 7 million documents, and for a particular topic 700 of these are relevant. Suppose we have the resources to judge 1000 documents. If we pick the 1000 documents from a simple random sample of the collection, most likely 0 of the documents will be judged relevant, producing an estimate of 0 relevant documents, which is far too low. If 1 of the documents was judged relevant, then we would produce an estimate of 7000 relevant documents, which is far too high.
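The short simulation sketch below (not part of the official evaluation; the numbers simply mirror the example above) illustrates the point: with a 1000-document simple random sample, the scaled-up estimate of the number of relevant documents is almost always either 0 or a multiple of 7000, never anything close to the true value of 700.

  # Sketch: variance of a simple-random-sample estimate of the number of
  # relevant documents.  All numbers are the illustrative ones from the text.
  import random

  COLLECTION_SIZE = 7000000   # documents in the collection
  NUM_RELEVANT = 700          # relevant documents for the topic
  SAMPLE_SIZE = 1000          # judging budget

  def estimate_once(rng):
      # Treat documents 0..699 as the relevant ones and draw a simple random sample.
      sample = rng.sample(range(COLLECTION_SIZE), SAMPLE_SIZE)
      judged_relevant = sum(1 for d in sample if d < NUM_RELEVANT)
      # Scale up by the inverse sampling fraction (7,000,000 / 1,000 = 7,000).
      return judged_relevant * (COLLECTION_SIZE // SAMPLE_SIZE)

  rng = random.Random(2007)
  print([estimate_once(rng) for _ in range(20)])
  # Typically prints mostly 0s with an occasional 7000.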

The Traditional Pooling Solution: TREC evaluations have typically dealt with this accuracy issue by using an extreme variant of stratified sampling. The primary stratum, known as the pool, is typically the set of documents ranked in the top-100 for a topic by the participating systems. Traditionally, all of the documents in the pool are judged. Unlike proper stratified sampling, typically none of the unpooled documents are judged; they are simply assumed non-relevant. For the older TREC collections of ~500,000 documents, [Zo98] found that the results of comparing retrieval systems are reasonably reliable, even though probably only 50%-70% of the relevant documents for a topic are assessed, on average.

Traditional Pooling Can Be Too Shallow for Larger Collections: As the judging pools have become relatively shallower, either because TREC collections have grown larger or because the judging depth has been reduced, concerns have been raised about the reliability of results. For example, [Bu06] recently reported bias issues with depth-55 judging for the 1 million-document AQUAINT corpus, and [To06] found that less than 20% of the relevant documents were judged on average for the 7 million-document TREC 2006 Legal collection.

Extending the Pooling Depth with Sampling: The TREC 2006 Terabyte Track [Bue06] recently experimented with taking random samples of 200 documents from (up to) depth-1252 pools, and estimated the average precision score for each run based on this deeper pooling by using the "inferred average precision" (infAP) measure suggested by [Yi06]. The infAP scores correlated highly with the MAP scores based on traditional depth-50 pooling.

The L07 Method

The L07 method for estimating recall and precision is based on how these components are estimated in the infAP calculation. What distinguishes the L07 method is its support for much deeper pooling, achieved by sampling higher-ranked documents with higher probability.

For legal discovery, recall is of central concern. Last year, [To06] found that marginal precision exceeded 4% on average even at depth 9000 for standard vector-based retrieval approaches. Hence we propose to use depth-25000 pooling this year to get better coverage of the relevant documents. (For topics with B>25000, depth-B pooling will be used.)

A simple random sample of a depth-25000 pool, however, would be unlikely to produce accurate estimates of recall at shallower cutoff levels. Hence we propose to sample higher-ranked documents with higher probability, in such a way that recall estimates at all cutoff levels up to max(25000,B) should be of similar accuracy.

An outline of the L07 method is below, followed by a worked-through example. We intend to do a simulation study (by May 31) to verify that we have the software needed to carry out the method and to determine how much judging effort is needed per topic for the estimated scores to be sufficiently accurate for comparing retrieval systems. The simulation approach will be to create a mock complete set of qrels based on last year's submitted Legal track runs, and then investigate whether the sampling approach accurately estimates the scores computed from the complete qrels.

The preferred location for discussion of the evaluation approach is the cross-track IReval mailing list, rather than the track mailing list.

Detailed Outline:

Selection of Documents to Judge for a Topic

Let D be the set of documents in the target collection. For the legal track, |D|=6910912.

For each topic:

Let m be the pooling depth, i.e. the number of top-ranked documents from each submitted run to include in the pool. We will use m=max(25000,B).

Let M be the pool, i.e. the set of documents retrieved in the top-m by at least one submitted run. All submitted runs will be included in the pooling. Note: m <= |M| <= |D|.

Let v be the number of documents we are willing to judge, e.g. v=800.
Let v' be the number of unpooled documents we are willing to judge, e.g. v'=35.

Define h(d) for all documents d in M as the highest rank at which any submitted run retrieved document d.
i.e. h(d) is a number from 1 to m, where 1 is the highest rank possible. Note: h(d) is not defined for d not in M.

Let p(d) be the probability that we include document d in our judging sample.
For d in M, p(d) will be min(C / h(d), 1),
where C is chosen so that the sum of all p(d) (for d in M) is v-v'.
For d not in M, p(d) will be min(v' / (|D|-|M|), C/m, 1).

Notes:

Let V be the set of documents chosen to be judged based on p(d).
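For concreteness, here is a minimal Python sketch of how the inclusion probabilities and the judging sample might be computed from the definitions above. The function and variable names, the bisection search for C, and the assumption that each document is included independently with probability p(d) are illustrative choices, not part of the official track software.

  # Sketch of the document-selection step defined above.
  # Assumed inputs:
  #   h_of_d          - dict mapping each pooled document id to h(d)
  #   collection_size - |D|
  #   v, v_prime      - total judging budget and unpooled judging budget
  #   m               - pooling depth
  import random

  def compute_C(h_of_d, target):
      # Find C such that the sum of min(C / h(d), 1) over the pool equals target.
      # The sum is non-decreasing in C, so a simple bisection suffices.
      lo, hi = 0.0, float(max(h_of_d.values()))
      for _ in range(100):
          mid = (lo + hi) / 2.0
          if sum(min(mid / h, 1.0) for h in h_of_d.values()) < target:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2.0

  def inclusion_probabilities(h_of_d, collection_size, v, v_prime, m):
      C = compute_C(h_of_d, v - v_prime)
      p_pooled = dict((d, min(C / h, 1.0)) for d, h in h_of_d.items())
      # Every unpooled document gets the same small inclusion probability.
      p_unpooled = min(float(v_prime) / (collection_size - len(h_of_d)), C / m, 1.0)
      return p_pooled, p_unpooled

  def draw_sample(p_pooled, p_unpooled, unpooled_doc_ids, rng):
      # One plausible draw process: include each document independently
      # with probability p(d).
      chosen = [d for d, prob in p_pooled.items() if rng.random() < prob]
      chosen += [d for d in unpooled_doc_ids if rng.random() < p_unpooled]
      return chosen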

Estimation of the Number of Relevant Documents for a Topic ("estR")

Let S be a subset of D.

Define JudgedRel(S) to be the set of documents in S which were judged relevant.
Define JudgedNonrel(S) to be the set of documents in S which were judged non-relevant.

Notes:

Define estRel(S) = min(sum of 1/p(d) for all d in JudgedRel(S), |S| - |JudgedNonrel(S)|)
(or 0 if |JudgedRel(S)| = 0).
Note: the min operator ensures that judged non-relevant documents are not inferred to be relevant. In particular, if all of the documents in S are judged, then estRel(S) will equal the actual number of relevant documents in S.

Let estR be the estimated number of relevant documents for the topic.
estR = estRel(D).

If estR is 0 for a topic, the topic will be discarded. Otherwise, estR is guaranteed to be at least 1.0.
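A minimal sketch of estRel(S) and estR under the definitions above. Here judged_rel and judged_nonrel are assumed to be the sets JudgedRel(S) and JudgedNonrel(S), and p maps every judged document to its inclusion probability p(d); the names are illustrative, not the official scoring code.

  def est_rel(S_size, judged_rel, judged_nonrel, p):
      # estRel(S): inverse-probability-weighted count of judged relevant documents,
      # capped so that judged non-relevant documents are never inferred relevant.
      if not judged_rel:
          return 0.0
      weighted = sum(1.0 / p[d] for d in judged_rel)
      return min(weighted, S_size - len(judged_nonrel))

  # estR is estRel applied to the whole collection D:
  #   est_R = est_rel(collection_size, judged_rel_D, judged_nonrel_D, p)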

Estimation of Recall@k for a Run

The L07 approach to estimating recall is similar to how infAP estimates its recall component, i.e. it's based on the run's judged relevant documents found to depth k compared to the total judged relevant documents. In our case, we have to weight the judged relevant documents to account for the sampling probability.

For a particular ranked retrieved set S:

Let S(k) be the set of top-k ranked documents of S.
Note: |S(k)| = min(k, |S|)

Define estimated Recall@k as
estRecall@k = estRel(S(k)) / estR
Note: estRecall@k will always be in the 0..1 range. Also, estRecall@k1 >= estRecall@k2 if k1 >= k2. Also, estR * estRecall@k <= k. Furthermore, if all of the documents in the collection are judged, estRecall@k would be equal to the actual Recall@k.
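Continuing the sketch above (same assumed inputs, plus the run's ranking and a dict of judgments), estRecall@k might be computed as follows:

  def est_recall_at_k(ranked_docs, k, judgments, p, est_R):
      # judgments maps judged document ids to True (relevant) or False
      # (non-relevant); unjudged documents are simply absent.
      top_k = ranked_docs[:k]                    # S(k)
      judged_rel = [d for d in top_k if judgments.get(d) is True]
      judged_nonrel = [d for d in top_k if judgments.get(d) is False]
      return est_rel(len(top_k), judged_rel, judged_nonrel, p) / est_R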

Estimation of Precision@k for a Run

The L07 approach to estimating precision is similar to how infAP estimates its precision component, i.e. it's based on the precision of the run's judged documents to depth k. In our case, we have to weight the judged documents to account for the sampling probability. We also multiply by a factor to ensure that unretrieved documents are not inferred to be relevant.

Define estNonrel(S) = min(sum of 1/p(d) for all d in JudgedNonrel(S), |S| - |JudgedRel(S)|)
(or 0 if |JudgedNonrel(S)| = 0)
Note: the min operator ensures that judged relevant documents are not inferred to be non-relevant.

Note: anomalously, it is possible that (estRel(S) + estNonrel(S)) > |S|.
However, if all of the documents in S are judged, then (estRel(S) + estNonrel(S)) = |S|.

Define estimated Precision@k as
estPrec@k = (estRel(S(k)) / (estRel(S(k)) + estNonrel(S(k)))) * (|S(k)| / k)
or 0 if both estRel(S(k)) and estNonrel(S(k)) are 0.
Note: estPrec@k will always be in the 0..1 range. Anomalously, k * estPrec@k might exceed estR.
However, if all of the documents in S(k) are judged, then estPrec@k would be equal to the actual Precision@k.
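And a matching sketch of estNonrel(S) and estPrec@k, mirroring the definitions above and reusing est_rel from the earlier sketch:

  def est_nonrel(S_size, judged_rel, judged_nonrel, p):
      # estNonrel(S): the same construction as estRel(S), with the roles of
      # the relevant and non-relevant judgments swapped.
      if not judged_nonrel:
          return 0.0
      weighted = sum(1.0 / p[d] for d in judged_nonrel)
      return min(weighted, S_size - len(judged_rel))

  def est_prec_at_k(ranked_docs, k, judgments, p):
      top_k = ranked_docs[:k]                    # S(k)
      judged_rel = [d for d in top_k if judgments.get(d) is True]
      judged_nonrel = [d for d in top_k if judgments.get(d) is False]
      rel = est_rel(len(top_k), judged_rel, judged_nonrel, p)
      nonrel = est_nonrel(len(top_k), judged_rel, judged_nonrel, p)
      if rel == 0.0 and nonrel == 0.0:
          return 0.0
      # The |S(k)|/k factor ensures unretrieved documents are not inferred relevant.
      return (rel / (rel + nonrel)) * (float(len(top_k)) / k)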

Anomalously, estPrec@k and estRecall@k are not in general guaranteed to favor the same system for a particular topic (though they will usually correlate). Note that infAP has analogous issues; e.g. if a system happens to retrieve all the judged relevant documents at the top, followed by all the judged non-relevant documents, and then stops, the system would get an infAP score of 1.0, implying perfect recall, even though there are very likely additional unsampled relevant documents that were not retrieved.

Straying from the infAP model in order to make the estPrec@k and estRecall@k measures perfectly consistent can lead to worse anomalies or inaccuracies. For example, define estRecall2@k = (estPrec@k * k) / estR. Then estRecall2@k and estPrec@k always agree on each topic, but estRecall2@k could decrease for larger k (unlike estRecall@k) and could exceed 100% (unlike estRecall@k), which are worse anomalies. Alternatively, define estPrec2@k = (estRecall@k * estR) / k. Then estRecall@k and estPrec2@k always agree on each topic, but estPrec2@k seems likely to be a less accurate estimate of precision than estPrec@k.

A Completely Worked Through (Small) Example

Let D be a collection of 100 documents, |D|=100.

Suppose we allow runs to submit the top-5 documents for a topic, i.e. m=5.

Suppose we are willing to judge v=6 documents for a topic, including v'=1 document that is not pooled.

Suppose there are 2 runs:

 run1 submits  d1, d2, d4, d6, d8
 run2 submits  d2, d3, d5, d7, d4

Then M (the pool) consists of 8 documents (d1 .. d8).

The h(d) values for the pooled documents are

   d :  d1   d2   d3   d4   d5   d6   d7   d8
 h(d):   1    1    2    3    3    4    4    5

We solve the following equation for C:
"sum of min(C / h(d), 1) over all d in M" = v-v' = 5,
finding that C ≈ 1.6 makes the inclusion probabilities sum to v-v' (5) for this pool.
Note: since this C is so small, we know our estimated scores could be very inaccurate, but for this example we will proceed.

The p(d) values for the pooled documents are hence (from min(C / h(d), 1))

   d :   d1    d2    d3    d4    d5    d6    d7    d8
 h(d):    1     1     2     3     3     4     4     5
 p(d): 1.00  1.00  0.80  0.53  0.53  0.40  0.40  0.32

For the 92 unpooled documents (from |D|-|M| = 100-8) the p(d) value will be 1/92 (from min(v' / (|D|-|M|), C/m, 1)).

Suppose our draw process, based on the p(d) values, selects the following 6 documents for judging:
V = {d1, d2, d3, d5, d7, d51}

And suppose the judging results are as follows:

    JudgedRel(V)    = {d2, d5}
    JudgedNonrel(V) = {d1, d3, d7, d51}

The estimated number of relevant items for the topic (estR) is hence
estR = min(1/p(d2) + 1/p(d5), |D| - |JudgedNonrel(D)|) = min(1/1.00 + 1/0.53, 100 - 4) = min(2.9, 96) = 2.9

Suppose the B value for this topic is B=3, and we wish to estimate Recall@3 and Precision@3 for run1 and run2.

For run1:

 S(3)               = {d1, d2, d4}
 |S(3)|             = 3
 JudgedRel(S(3))    = {d2}
 JudgedNonrel(S(3)) = {d1}

 estRel(S(3)) = min(sum of 1/p(d) for all d in JudgedRel(S(3)), 
                    |S(3)| - |JudgedNonrel(S(3))|)
              = min(1/p(d2), 3-1) = min(1/1.00, 2) = 1.0
 estRecall@3 = estRel(S(3)) / estR
             = 1.0 / 2.9 = 0.3

 estNonrel(S(3)) = min(sum of 1/p(d) for all d in JudgedNonrel(S(3)),
                       |S(3)| - |JudgedRel(S(3))|)
                 = min(1/p(d1), 3-1) = min(1/1.00, 2) = 1.0
 estPrec@3 = (estRel(S(3)) / (estRel(S(3)) + estNonrel(S(3)))) * (|S(3)|/3)
           = (1.0/(1.0+1.0))*(3/3) = 0.5

For run2:

 S(3)               = {d2, d3, d5}
 |S(3)|             = 3
 JudgedRel(S(3))    = {d2, d5}
 JudgedNonrel(S(3)) = {d3}

 estRel(S(3)) = min(sum of 1/p(d) for all d in JudgedRel(S(3)), 
                    |S(3)| - |JudgedNonrel(S(3))|)
              = min(1/p(d2) + 1/p(d5), 3-1) = min(1/1.00 + 1/0.53, 2)
              = min(2.9, 2) = 2
 estRecall@3 = estRel(S(3)) / estR
             = 2 / 2.9 = 0.7

 estNonrel(S(3)) = min(sum of 1/p(d) for all d in JudgedNonrel(S(3)),
                       |S(3)| - |JudgedRel(S(3))|)
                 = min(1/p(d3), 3-2) = min(1/0.8, 1) = min(1.25, 1) = 1
 estPrec@3 = (estRel(S(3)) / (estRel(S(3)) + estNonrel(S(3)))) * (|S(3)|/3)
           = (2/(2+1))*(3/3) = 0.67
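As a cross-check, the small Python sketches given with the definitions above reproduce these numbers (up to the rounding used in the example); all function names are from those sketches.

  # Data from the worked example.
  p = {'d1': 1.00, 'd2': 1.00, 'd3': 0.80, 'd4': 0.53, 'd5': 0.53,
       'd6': 0.40, 'd7': 0.40, 'd8': 0.32, 'd51': 1.0 / 92}
  judgments = {'d2': True, 'd5': True,
               'd1': False, 'd3': False, 'd7': False, 'd51': False}
  judged_rel_D = [d for d, r in judgments.items() if r]
  judged_nonrel_D = [d for d, r in judgments.items() if not r]

  est_R = est_rel(100, judged_rel_D, judged_nonrel_D, p)       # 2.887 (2.9 above)

  run1 = ['d1', 'd2', 'd4', 'd6', 'd8']
  run2 = ['d2', 'd3', 'd5', 'd7', 'd4']
  print(est_recall_at_k(run1, 3, judgments, p, est_R))          # 0.346 (0.3 above)
  print(est_prec_at_k(run1, 3, judgments, p))                   # 0.5
  print(est_recall_at_k(run2, 3, judgments, p, est_R))          # 0.693 (0.7 above)
  print(est_prec_at_k(run2, 3, judgments, p))                   # 0.667 (0.67 above)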

References

[Ba05] Jason R. Baron. Toward a Federal Benchmarking Standard for Evaluating Information Retrieval Products Used in E-Discovery. In 6 Sedona Conference Journal 237 (2005).

[Ba06] Jason R. Baron, David D. Lewis and Douglas W. Oard. TREC-2006 Legal Track Overview. In Proceedings of TREC 2006.

[Bu06] Chris Buckley, Darrin Dimmick, Ian Soboroff and Ellen Voorhees. Bias and the Limits of Pooling. In Proceedings of ACM SIGIR'06.

[Bue06] Stefan Büttcher, Charles L. A. Clarke and Ian Soboroff. The TREC 2006 Terabyte Track. In Proceedings of TREC 2006.

[To06] Stephen Tomlinson. Experiments with the Negotiated Boolean Queries of the TREC 2006 Legal Discovery Track. In Proceedings of TREC 2006.

[Yi06] Emine Yilmaz and Javed A. Aslam. Estimating Average Precision with Incomplete and Imperfect Judgments. In Proceedings of ACM CIKM'06.

[Zo98] Justin Zobel. How Reliable are the Results of Large-Scale Information Retrieval Experiments? In Proceedings of ACM SIGIR'98.


Submitting Results

Like last year, participating sites are invited to submit results from up to 8 runs for official scoring. To facilitate comparisons, we are asking groups this year to include at least one standard condition run that is based only on the (typically one-sentence) request text field. (The reference Boolean results should not be used for the standard condition run, as they are based on a different topic field.)

For each run, up to max(25000, B) results per topic will be accepted. The submission format is the same as last year:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0      - unused field (the literal 'Q0')
docno   - document identifier, taken from the <tid> field of the document
rank    - rank assigned to the document (1 = highest)
score   - numeric score that is non-increasing with rank.  It is the
          score, not the rank, that is used by the evaluation software (see
          the "TREC 2007: Welcome to TREC" message from Ellen Voorhees
          for more details)
tag     - unique identifier for this run (the same for every topic and document)

Participating sites should each adopt some standard convention for
tags that begins with a unique identifier for the site and ends with a
unique run identifier. Tags are limited to 12 letters or numbers with
no punctuation.  
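For concreteness, a hypothetical excerpt from a run file might look like the following; the topic number, document identifiers, scores, and tag are all invented for illustration (actual document identifiers come from the <tid> field of the collection documents):

  52 Q0 aaa00e00 1 0.9213 sitexauto1
  52 Q0 bcd01a10 2 0.8871 sitexauto1
  52 Q0 cde02c01 3 0.8514 sitexauto1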

For each submitted run, the following information will be collected:

Schedule

   Jul 1	      Evaluation topic release
   Aug 1	      Runs submitted by sites
   Oct 1	      Results release
   mid-Oct	      Working notes papers due
   Nov 6-9	      TREC 2007, Gaithersburg, MD

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at) nara.gov
Douglas W. Oard     oard (at) umd.edu
Paul Thompson       paul.thompson (at) dartmouth.edu
Stephen Tomlinson   stephent (at) magma.ca