TREC 2008 Legal Track: Ad Hoc and Relevance Feedback Task Guidelines

The objective of the Ad Hoc and Relevance Feedback tasks of the TREC Legal Track is to evaluate the efficacy of automated support for review and production of electronic records in the context of litigation, regulation and legislation.

Revision History

Prerequisites

New participants are welcome! This is the third year of the TREC Legal Track, a forum for objective study of e-discovery techniques. Like last year, the track will feature 3 tasks: Ad Hoc, Relevance Feedback and Interactive. This document describes the Ad Hoc and Relevance Feedback tasks. For information on the Interactive task, please see the track web site at http://trec-legal.umiacs.umd.edu/.

All participants must be registered with TREC (in particular, access to the information on the Active Participants page and the ability to submit results is limited to registered participants). If you are not registered, please respond to the Call For Participation as soon as possible at http://trec.nist.gov/call08.html.

All participants should also be subscribed to the track mailing list to participate in any track discussion and to be informed of any late-breaking announcements. Contact oard (at) umd.edu to be added to the list. Archives of the mailing list are available from the track web site at http://trec-legal.umiacs.umd.edu/.

While this document is intended to be self-contained, participants are requested to be familiar with past runnings of TREC and the Legal Track. In particular, there are links to past Legal Track overview papers, guidelines, participant papers and presentations on the Legal Track web site at http://trec-legal.umiacs.umd.edu/. More background on TREC is available from the TREC web site at http://trec.nist.gov/.

Optionally, participants may join a second, cross-track mailing list, called IREval, for discussing evaluation issues for both the Legal Track and other tracks:

It became clear at the TREC 2006 meeting that several different tracks are trying to address the problems associated with evaluating retrieval systems when traditional pooling is no longer viable (because the document set is too large, or there is not sufficient diversity in the runs, or there are not enough resources for sufficiently deep pools, etc.). To avoid having discussions of these issues splintered across multiple track mailing lists, NIST has set up a new mailing list for them. Since the problem is broader than just TREC, the mailing list is named IReval.
To join the list, send an email message to listproc@nist.gov such that the body of the message contains only the line "subscribe IReval <Firstname> <Lastname>". You must subscribe to the list to post to it, but otherwise the list is public. In particular, both the list archives and the list membership will be publicly available from the NIST mailing list web site. Replies to a message on the list will be routed to the entire list to facilitate discussion.

The archives of the IREval mailing list are available at http://cio.nist.gov/esd/emaildir/lists/ireval/maillist.html.

Overview of the Ad Hoc Task

The Ad Hoc task studies search of a fixed document collection using queries that the system has not seen before. This is the 3rd year of running the Ad Hoc task (sometimes known as the "main task" in past years). This year's task is enhanced to evaluate not just a system's ranking ability but also its ability to define a "set" of documents for review, balancing the requirements of recall and precision.

Overview of the Relevance Feedback Task

The Relevance Feedback task studies two-pass search in a controlled setting, with some relevant and non-relevant documents manually marked after the first pass. This is the 2nd year of running the Relevance Feedback task. This year, we anticipate that considerably more topics will be assessed, and we hope to see increased participation in the task.

Interactive Task

Participants who have developed systems for the Ad Hoc and/or Relevance Feedback tasks are encouraged to additionally enter the track's Interactive task this year. It is much like the Ad Hoc task, but substantially fewer topics are expected to be required (perhaps just 3 or so), enabling participants to apply interactive techniques (e.g. some manual review of documents for relevance feedback, enhancing the queries, or determining a good cutoff for the F1 measure) which might be too time-consuming to apply to the full set of Ad Hoc or Relevance Feedback topics.

For the details of the Interactive task, please find the guidelines from the track home page at http://trec-legal.umiacs.umd.edu/.

Documents

The document collection is the same as in the past 2 years:

The set of documents for the track will be the IIT Complex Document Information Processing test collection. This collection consists of roughly 7 million documents (approximately 57 GB of metadata and OCR text uncompressed, 23 GB compressed) drawn from the Legacy Tobacco Document Library hosted by the University of California at San Francisco. These documents were made public during various legal cases involving US tobacco companies and contain a wide variety of document genres typical of large enterprise environments.
The metadata and OCR can be obtained by FTP at no charge. For teams unable to transfer this quantity of data by FTP, the collection will also be available by mail as a set of DVDs from NIST.

To download the collection, please fill out the form at the bottom of http://www.ir.iit.edu/projects/CDIP.html and you will be contacted with the ftp information.

Topics

The topics will be "production requests", as in the past 2 years:

Participants in the track will search the IIT CDIP collection for documents relevant to a set of production requests. The production requests will be designed to simulate the kinds of requests that parties in a legal case would make for internal documents to be produced by other parties in the legal case. Each production request includes a broad complaint that lays out the background for several requests, one specific request for production of documents, and a negotiated Boolean query that serves as a reference and is also available for use by ranked retrieval systems.
Participating teams may form queries in any way they like, using materials provided in the complaint, the production request, the Boolean query, and any external resources that they have available (e.g. a domain-specific thesaurus). Note in particular that the Boolean query need not be used AS a Boolean query; it is provided as an example of what might be negotiated in present practice, and teams are free to use its contents in whatever way they think is appropriate.
Queries that are formed completely automatically using software that existed at the time the evaluation queries were first seen are considered automatic; all other cases are considered manual queries. Automatic queries provide a reasonably well controlled basis for cross-system comparisons, although they are typically representative of only the first query in an interactive search process. The most common use of manual queries is to demonstrate the retrieval effectiveness that can be obtained after interactive optimization of the query (which typically results in excellent contributions to the judgment pool and is thus highly desirable), but even interventions as simple as manual removal of stopwords or stop structure will result in manual queries.
Reference Boolean run: For each topic, we intend to make available the list of documents which match the final negotiated Boolean query, for optional use by participants. The B values (the number of documents matching the Boolean query) will be included as part of the topic statement in the topic file. The format of the list of matching documents will be the same as the standard submission format for TREC ad hoc runs (described below); however, the documents will just be listed in alphabetical document-id order with all rsv and rank values set to a constant (i.e. no Boolean-based ranking will be provided, just the list of matching documents).
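
For participants who want to make programmatic use of the reference Boolean run, the following minimal sketch (in Python) loads it into per-topic sets of matching document ids. It assumes only the standard six-column run format described above; the function name and file name are illustrative, not part of the track's tooling.

  # Minimal sketch: load the reference Boolean run into per-topic sets of
  # matching document ids. Assumes the standard 6-column TREC run format.
  from collections import defaultdict

  def load_boolean_run(path):
      matches = defaultdict(set)     # topicid -> set of matching docnos
      with open(path) as f:
          for line in f:
              fields = line.split()
              if len(fields) != 6:
                  continue           # skip blank or malformed lines
              topicid, _q0, docno, _rank, _score, _tag = fields
              matches[topicid].add(docno)
      return matches

  # boolean_matches = load_boolean_run("refL08B")   # file name is illustrative
  # B = {topic: len(docs) for topic, docs in boolean_matches.items()}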

Note: The topics, relevance judgments and reference Boolean run for last year (2007) are archived at http://trec.nist.gov/data/legal07.html. An updated version of the l07_eval evaluation software (which includes the new F1@K and F1@R measures being investigated in 2008) is posted at http://trec-legal.umiacs.umd.edu/l07_eval_v20.zip.

Evaluation Methodology

Like last year, a deep sampling method will be used to estimate R (the total number of relevant documents for a topic) and to estimate the recall, precision and F1 of a run for a topic to depths of up to 100,000. The method will follow this outline:

Step 1: Pooling

Like last year, we intend to form a pool of documents for each topic consisting of all of the documents submitted by any run. Also, like last year, we intend to randomly select some documents from the rest of the collection for adding to the pool.

Last year, for the Ad Hoc task, there were 68 submitted runs which retrieved up to 25,000 documents per topic, and the resulting pool size was typically 300,000 documents for a topic. This year, runs may retrieve up to 100,000 documents per topic, hence we anticipate that the typical pool size could be more than a million documents this year.
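
As a rough illustration of this pooling step, the sketch below (in Python) forms a topic's pool as the union of all documents retrieved by any submitted run plus a random sample from the remainder of the collection. The sample size, seed and data structures are assumptions made for the example only; they are not the track's actual pooling procedure.

  # Illustrative sketch of Step 1: pool = union of all submitted documents
  # plus a random sample drawn from the rest of the collection.
  import random

  def build_pool(runs, all_docnos, extra_sample_size=1000, seed=2008):
      """runs: iterable of docno lists (one list per submitted run for this topic)."""
      pool = set()
      for run in runs:
          pool.update(run)
      remainder = [d for d in all_docnos if d not in pool]
      rng = random.Random(seed)                     # seed is an assumption
      sample = rng.sample(remainder, min(extra_sample_size, len(remainder)))
      pool.update(sample)
      return pool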

Step 2: Assigning of p(d) values (judging probabilities)

Tentatively, we plan to set p(d), the probability of judging pooled document d, as follows:

  If (hiRank(d) <= 5) { p(d) = 1.0; }
  Else { p(d) = min(1.0, ((5/100000)+(C/hiRank(d)))); }

where hiRank(d) is the highest rank at which any submitted run retrieved document d, and C is chosen so that the sum of all p(d) (for all submitted documents d) is the number of documents we can judge (last year this number was typically between 500 and 1000).

This approach would, like last year, give us a combination of top-5 document judging from all runs (which typically contributed between 100 and 200 documents to the judging sample last year) and deeper sampling (typically between 300 and 400 documents last year). Measures at depth-100,000 would have the accuracy of approximately 5+C simple random sample points, while measures at other depths would have the accuracy of approximately (at least) C simple random sample points. Last year (with a similar p(d) formula) the C values were fairly low (between 0.3 and 2.4 when just 500 documents were judged), limiting the accuracy of the estimates for individual topics. The mean scores (over multiple topics) should be somewhat more reliable.
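
As an illustration of how C might be calibrated, the sketch below (in Python) applies the tentative p(d) formula to the submitted documents and bisects on C until the expected number of judged documents matches the judging budget. The search bounds and iteration count are assumptions; this is a sketch of the formula above, not the official sampling code.

  # Sketch of Step 2: compute p(d) from hiRank(d) and calibrate C so that
  # the expected number of judged documents matches the judging budget.
  def p_of_d(hi_rank, C):
      if hi_rank <= 5:
          return 1.0
      return min(1.0, 5.0 / 100000 + C / hi_rank)

  def calibrate_C(hi_ranks, budget, lo=0.0, hi=1e6, iters=100):
      """hi_ranks: dict docno -> highest rank at which any run retrieved it."""
      # The sum of p(d) increases monotonically with C, so bisection applies.
      for _ in range(iters):
          mid = (lo + hi) / 2.0
          expected = sum(p_of_d(r, mid) for r in hi_ranks.values())
          if expected < budget:
              lo = mid
          else:
              hi = mid
      return (lo + hi) / 2.0

  # C = calibrate_C(hi_ranks, budget=500)
  # probabilities = {d: p_of_d(r, C) for d, r in hi_ranks.items()}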

Step 3: Document Assessing

Based on the p(d) values, typically (at least) 500 documents will be drawn from each topic's pool for review by the assessors. (We anticipate that the assessing will be done by volunteers from the legal community, like last year.)

Note that the assessors review the scanned document images (such as http://legacy.library.ucsf.edu/tid/hdz83f00/pdf) rather than the xml-formatted text which is typically used by the automated systems.

Each document will be judged as "not relevant", "relevant", "highly relevant" or, in rare cases, left as "gray". (Gray documents are typically documents that are too long to review fully (e.g. more than 300 pages) and for which the assessor could not find a reason to judge them relevant after a partial review. This category is also used when a technical problem prevents the document from displaying properly in the assessor system.) Last year's guidelines for assessors are available at http://trec-legal.umiacs.umd.edu/TRECLega2007l_HowToGuide_Version1.1.doc (but there will be some updates this year, e.g. to define the "highly relevant" category).

Step 4: Estimating R, the Number of Relevant Documents for a Topic

The same formulas as last year will be used to estimate the number of relevant documents in the pool for each topic (along with the number of non-relevant, gray and (new this year) highly relevant documents):

Let D be the set of documents in the target collection. For the legal track, |D|=6,910,912.
Let S be a subset of D.
Define JudgedRel(S) to be the set of documents in S which were judged relevant (including those judged highly relevant).
Define JudgedNonrel(S) to be the set of documents in S which were judged non-relevant.
Define JudgedHighlyRel(S) to be the set of documents in S which were judged highly relevant (this is a subset of JudgedRel(S)).
Define Gray(S) to be the set of documents in S which were presented to the assessor but judged neither relevant (including highly relevant) nor non-relevant.

Define estRel(S) = min(sum of 1/p(d) for all d in JudgedRel(S), |S| - |JudgedNonrel(S)|)
(or 0 if |JudgedRel(S)| = 0).
Note: the min operator ensures that judged non-relevant documents are not inferred to be relevant. In particular, if all of the documents in S are judged, then estRel(S) will equal the actual number of relevant documents in S.

Let R be the estimated number of relevant documents for the topic.
R = estRel(D).

If R is 0 for a topic, the topic will be discarded. Otherwise, R is guaranteed to be at least 1.0.

Define estNonrel(S) = min(sum of 1/p(d) for all d in JudgedNonrel(S), |S| - |JudgedRel(S)|)
(or 0 if |JudgedNonrel(S)| = 0)
Note: the min operator ensures that judged relevant documents are not inferred to be non-relevant.

Define estGray(S) = min(sum of 1/p(d) for all d in Gray(S), |S| - (|JudgedRel(S)| + |JudgedNonrel(S)|))
(or 0 if |Gray(S)| = 0)

Define estHighlyRel(S) = min(sum of 1/p(d) for all d in JudgedHighlyRel(S), |S| - (|JudgedNonrel(S)| + (|JudgedRel(S)| - |JudgedHighlyRel(S)|)))
(or 0 if |JudgedHighlyRel(S)| = 0).

Let Rh be the estimated number of highly relevant documents for the topic.
Rh = estHighlyRel(D).

Last year, the average R value for the Ad Hoc task was 16,904.
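
For concreteness, the estimators defined above translate directly into code. The following Python sketch implements estRel and estNonrel for a judged sample drawn from a set S; the judgment labels and variable names are illustrative and are not those of the official evaluation software.

  # Sketch of the Step 4 estimators. `judgments` maps docno -> one of
  # "nonrel", "rel", "highly_rel" or "gray" (judged documents in S only);
  # `p` maps docno -> its judging probability; `size_S` is |S|.
  def est_rel(judgments, p, size_S):
      rel = [d for d, j in judgments.items() if j in ("rel", "highly_rel")]
      nonrel = [d for d, j in judgments.items() if j == "nonrel"]
      if not rel:
          return 0.0
      # min() ensures judged non-relevant documents are not inferred relevant
      return min(sum(1.0 / p[d] for d in rel), size_S - len(nonrel))

  def est_nonrel(judgments, p, size_S):
      rel = [d for d, j in judgments.items() if j in ("rel", "highly_rel")]
      nonrel = [d for d, j in judgments.items() if j == "nonrel"]
      if not nonrel:
          return 0.0
      return min(sum(1.0 / p[d] for d in nonrel), size_S - len(rel))

  # R for a topic is est_rel applied over the whole collection D (|D| = 6,910,912);
  # a topic with R = 0 is discarded.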

Step 5: Estimating Recall@K, Precision@K, F1@K and F1@R for a Topic

The formulas for recall and precision will be the same as last year:

For a particular ranked retrieved set S:

Let S(k) be the set of top-k ranked documents of S.
Note: |S(k)| = min(k, |S|)

Define Recall@k = estRel(S(k)) / R

Define Precision@k = (estRel(S(k)) / (estRel(S(k)) + estNonrel(S(k)))) * (|S(k)| / k)
or 0 if both estRel(S(k)) and estNonrel(S(k)) are 0.

Anomalously, the estimated Recall@k and Precision@k are not in general guaranteed to favor the same system for a particular topic (though they will usually correlate). Alternate definitions that make the estimates perfectly consistent can lead to worse anomalies or inaccuracies. For example, define Recall2@k = (Precision@k * k) / R. Then Recall2@k and Precision@k always agree on each topic, but Recall2@k could decrease for larger k (unlike Recall@k) and Recall2@k could exceed 100% (unlike Recall@k), which are worse anomalies. Alternately, define Prec2@k = (Recall@k * R) / k. Then Recall@k and Prec2@k always agree on each topic, but Prec2@k seems likely to be a less accurate estimate of precision than Precision@k.

Define F1@k = 2 * Precision@k * Recall@k / (Precision@k + Recall@k)
or 0 if both Precision@k and Recall@k are 0.

The K and B values are integers and hence can be substituted for k in the above formulas. R, however, can be fractional, hence we provide the following additional definition:

Define F1@R = F1@Rceil
where Rceil is the ceiling of R (i.e. the smallest integer greater than or equal to R).
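
Building on the Step 4 sketch above, the following Python sketch shows how Recall@k, Precision@k, F1@k and F1@R could be computed for one topic from the judged sample. It is an illustration of the formulas above, not the official l07_eval implementation.

  # Sketch of the Step 5 measures for one topic, reusing est_rel/est_nonrel
  # from the Step 4 sketch. `ranked_docs` is the run's ranking (best first);
  # `judgments` and `p` are restricted to this topic's judged sample.
  import math

  def measures_at_k(ranked_docs, k, judgments, p, R):
      top_k = ranked_docs[:k]                    # |S(k)| = min(k, |S|)
      judged_top = {d: judgments[d] for d in top_k if d in judgments}
      er = est_rel(judged_top, p, len(top_k))
      en = est_nonrel(judged_top, p, len(top_k))
      recall = er / R if R > 0 else 0.0
      precision = 0.0 if er + en == 0 else (er / (er + en)) * (len(top_k) / k)
      f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
      return recall, precision, f1

  # F1@R uses k = Rceil, the ceiling of R:
  # _, _, f1_at_R = measures_at_k(ranked_docs, math.ceil(R), judgments, p, R)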

Note: To support training based on last year's Ad Hoc test collection, an updated version of last year's l07_eval utility has been made available which includes the new F1@K and F1@R measures for 2008. Please see http://trec-legal.umiacs.umd.edu/l07_eval_v20.zip.

More background on the evaluation methodology: Last year's method (described in last year's guidelines and in the 2007 overview paper) was referred to as the "L07 Method". Conceptually it turned out to be very similar to the "statAP" method evaluated by Northeastern University in the TREC 2007 Million Query Track. (The common ancestor was the "infAP" method, which also came from Northeastern.) Both methods associate a probability with each document judgment. Our approach used deeper sampling and more judgments per topic, but used many fewer test topics than the Million Query Track. The methods also assigned probabilities differently (though this should not matter on average) and targeted different measures. Please consult the Northeastern work for a more detailed discussion of the theoretical underpinnings of measure estimation.

Submitting Results

For the Ad Hoc task, participating sites are invited to submit results from up to 8 runs for official scoring. For each run, 100,000 results will be accepted for each topic.

For the Relevance Feedback task, participating sites are invited to submit results from up to 8 runs for official scoring. For each run, 101,000 results will be accepted for each topic.

The run submission format for both tasks is the standard TREC format used in past years:

topicid Q0 docno rank score tag

topicid - topic identifier
Q0 -	  unused field (the literal 'Q0')
docno -	  document identifier taken from the <tid> field of the document
rank -	  rank assigned to the document (1=highest)
score -   numeric score that is non-increasing with rank.  It is the
          score, not the rank, that is used by the evaluation software (see
          the "TREC 2008: Welcome to TREC" message from Ellen Voorhees
          (Feb 26/08) for more details)
tag -	  unique identifier for this run (same for every topic and document)

Participating sites should each adopt some standard convention for
tags that begins with a unique identifier for the site and ends with a
unique run identifier. Tags are limited to 12 letters or numbers with
no punctuation.  

Note that at least one document must be submitted for each topic; otherwise, the submission system will not accept the run. If you wish to evaluate a system that does not retrieve a document for a particular topic, we suggest just having it submit the 1st document of the collection (aaa00a00) as a placeholder.

For the run's .K file (i.e. file of K values), specify a separate line for each topic. Each line must consist of the topic number, followed by whitespace, followed by the K value. The K value must be an integer between 0 and 100,000 inclusive for the Ad Hoc task and between 0 and 101,000 inclusive for the Relevance Feedback task. The lines must be in increasing order by topic number. (For an example, please see the provided refL08B.K file for the Ad Hoc task or refRF08B.K file for the Relevance Feedback task.)

For the run's .Kh file (i.e. file of Kh values), the format is the same as for the .K file.

Note: for run submission, the .K and .Kh data must be appended to the run file. The submitted run file must therefore contain the run's retrieved documents for all topics, followed by the K lines for all topics, followed by the Kh lines for all topics, as in the example below.

For example, if there were 3 topics (103, 104, 105), an example combined file could be:

103 Q0 aaa00a00 0 0.9999 runL08
104 Q0 aaa00a00 0 0.9999 runL08
105 Q0 aaa00a00 0 0.9999 runL08

103 16000
104 12000
105 18000
103 8000
104 7000
105 9000

If you do not wish to study how to set K values, you could just append the provided constL08.append file for each Ad Hoc task run and refRF08B.append file for each Relevance Feedback task run.
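
As an informal illustration of the combined file layout (retrieved documents, then the K lines, then the Kh lines), the Python sketch below writes such a file from in-memory results. The function and variable names are made up for the example; the authoritative format check is the NIST perl script mentioned below.

  # Illustrative sketch: assemble a combined submission file in the order
  # retrieved documents, then K lines, then Kh lines.
  def write_submission(path, results, k_values, kh_values, tag):
      """results: dict topicid -> list of (docno, score) with scores non-increasing.
      k_values / kh_values: dict topicid -> integer cutoff."""
      with open(path, "w") as out:
          for topicid in sorted(results, key=int):
              for rank, (docno, score) in enumerate(results[topicid], start=1):
                  out.write("%s Q0 %s %d %f %s\n" % (topicid, docno, rank, score, tag))
          for topicid in sorted(k_values, key=int):
              out.write("%s %d\n" % (topicid, k_values[topicid]))
          for topicid in sorted(kh_values, key=int):
              out.write("%s %d\n" % (topicid, kh_values[topicid]))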

Note: it's strongly recommended that you check your runs' formatting with the NIST perl check script for your task before attempting to submit.

For each submitted run, the following information will be collected:

Schedule

   June 1        Relevance Feedback topic release
   June 15       Ad Hoc topic release (new topics)
   July 20-24    SIGIR 2008 conference in Singapore (some track and NIST
                  coordinators may be unavailable around this time)
   Aug 6         Runs submitted by sites (both Ad Hoc and RF tasks)
   Oct 1         Results release
   mid-Oct       Working notes papers due
   Nov 18-21     TREC 2008, Gaithersburg, MD

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at) nara.gov
Bruce Hedin         bhedin (at) h5technologies.com
Douglas W. Oard     oard (at) umd.edu
Stephen Tomlinson   stephent (at) magma.ca