l07_eval v2.0 Usage Notes (TREC Legal Track)

Revision History

What is l07_eval?

l07_eval is the software utility used to calculate evaluation measures (such as Recall@B) of retrieval sets for the test collections of the TREC 2007 Legal Track.

What's new in version 2.0?

Version 2.0 adds new measures being investigated in 2008 (such as F1@K and F1@R). It also fixes some (fortunately minor) bugs that were found in version 1.0 (described below).

What are the files included in l07_eval_v20.zip?

The l07_eval_v20.zip includes the source code and Win32 executables for the l07_eval and l07_sort utilities, batch files with example command lines (sortruns.bat and evalAdHoc.bat), the relevance judgment (qrels) files from 2007, supporting input files such as b07.txt and refL07B.K, and reference evaluation outputs and score summaries (such as refL07B-sorted.eval, hi_all70.eval20 and med_all70.eval20). All of these files are referred to in the sections below.

Do I need to compile the l07_eval software?

If you are using a Win32 platform, you can use the included l07_eval.exe executable. Otherwise, you need to compile l07_eval.c to l07_eval.exe for your platform. The compilation syntax is typically something like this:

gcc -o l07_eval.exe l07_eval.c -lm

Note that as of this writing, compilation of version 2.0 has not been tested on non-Win32 platforms. If you encounter any issues (such as having to add a missing cast to get the compilation to succeed), please let us know (ideally by emailing stephent (at) magma.ca) so that we can make the fix available to everyone.

Another note for non-Win32 platforms: you may also need to compile the l07_sort utility (with syntax like "gcc -o l07_sort.exe l07_sort.c").

What is l07_sort?

l07_sort is the software utility used to convert a retrieval set to the canonical order, before invoking l07_eval. (Unlike the well-known trec_eval utility, the l07_eval utility assumes that the input runs list the documents in evaluation order.)

If your retrieval set is already in the canonical order, you can skip running l07_sort.

What is the canonical order for a retrieval set?

The canonical order used in the TREC 2007 Legal Track was based on the trec_eval canonical order:

  1. Lines are grouped by topic number (column 1), in ascending order.
  2. Within each topic, documents are listed in descending order of rsv score (column 5).
  3. Ties in rsv score are broken by descending alphabetical order of docid (column 3).

Note: the specified rank (column 4 of the retrieval set) is not actually a factor in the canonical order.

Advanced note: It was discovered last year that there can still be a difference between the order produced by l07_sort and the order the trec_eval 8.0 utility would use when rsv scores agree to (approximately) 7 significant figures. The l07_sort utility uses double precision (the C double type) for rsv scores, whereas trec_eval 8.0 uses single precision (the C float type), which produces more ties: for example, 328.9999990 and 328.9999991 are tied in single precision but not in double precision. When rsv scores are tied at the precision being used, descending alphabetical docid breaks the tie (as noted above), so a docid with rsv 328.9999990 may be placed before a docid with rsv 328.9999991 if single precision is used; in principle the same could happen with double precision if the differences in rsv are small enough. For the TREC 2007 Legal Track (and again in 2008), the l07_sort order is the official order.
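For illustration, here is a minimal C sketch of a comparison function implementing the canonical order described above; the struct and function names are made up for this example and are not taken from l07_sort.c.

#include <stdlib.h>
#include <string.h>

/* One parsed line of a retrieval set (illustrative layout only). */
typedef struct {
    int topic;        /* column 1: topic number */
    char docid[32];   /* column 3: document identifier */
    double rsv;       /* column 5: rsv score (double precision, as in l07_sort) */
} RetLine;

/* Canonical order: ascending topic number, then descending rsv score,
   then descending alphabetical docid to break rsv ties.
   Usage: qsort(lines, nLines, sizeof(RetLine), canonical_cmp); */
static int canonical_cmp(const void *a, const void *b)
{
    const RetLine *x = (const RetLine *)a;
    const RetLine *y = (const RetLine *)b;

    if (x->topic != y->topic)
        return x->topic - y->topic;          /* ascending topic number */
    if (x->rsv != y->rsv)
        return (x->rsv < y->rsv) ? 1 : -1;   /* descending rsv score */
    return strcmp(y->docid, x->docid);       /* descending docid on ties */
}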

How do I sort a retrieval set with l07_sort?

Here are the steps for sorting the refL07B retrieval set (the reference Boolean run of the 2007 Ad Hoc task). (You can download refL07B from http://trec.nist.gov/data/legal/07/refL07B.gz; it then needs to be uncompressed, e.g. "gzip -d refL07B.gz".)

1. Save the retrieval set (e.g. refL07B) in a subdirectory named 'unsorted'.

2. Make a subdirectory named 'sorted'.

3. Make a text file named 'sortlist.txt' which contains one line giving each run's name, for example:

refL07B

4. Run l07_sort.exe to sort the runs and output them to the subdirectory named 'sorted' as follows (this command-line is also in the included sortruns.bat file):

l07_sort.exe in=sortlist.txt inDir=unsorted\ outDir=sorted\ trecevalOrder

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The sorting can take several seconds to run. The expected output is as follows:

l07_sort v2.0 (2008-05-19 st)
Using trec_eval ordering
Processing file 1 (refL07B)
 Opening "unsorted\refL07B" for reading
 225257 lines read (6099427 chars saved)
Lines 1 and 2 out of order:
"52     Q0      aak34d00        1       1       refL07B"
"52     Q0      aau90f00        1       1       refL07B"
 225207 out of order, 0 ties
 Sorting 225257 lines.
 Done sorting.
 There were ties in the rsv scores.
 Writing to sorted\refL07B
 Processed refL07B successfully (225257 lines)
Processed 1 files.

5. Verify that the sorted run appears in the 'sorted' subdirectory. For refL07B, the first 3 lines of the sorted version should be as follows:

52	Q0	zzm84e00	1	1	refL07B
52	Q0	zza20a00	1	1	refL07B
52	Q0	zyz25d00	1	1	refL07B
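On Unix platforms, you can check the first 3 lines of the sorted run with a standard command such as the following:

head -3 sorted/refL07B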

How do I evaluate a retrieval set with l07_eval?

Here are the steps for evaluating the refL07B retrieval set (the reference Boolean run of the 2007 Ad Hoc task). (The previous section describes how to download refL07B.)

1. If your retrieval set is not in the canonical order, sort it with l07_sort (as described in the previous section).

Note: For the refL07B retrieval set, you can skip this step if you are just looking at measures at depth B (the number of documents retrieved by the reference Boolean query).

2. Run l07_eval.exe to evaluate refL07B as an Ad Hoc run as follows (this command line is also in the included evalAdHoc.bat file):

l07_eval run=sorted\refL07B q=qrelsL07.probs out=ignore1 out2=ignore2 out5=refL07B.eval stringDisplay=100 M1000=25000 probD=6910912 precB=b07.txt Kfile=refL07B.K

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The command can take a few seconds to run. Once it has completed, please verify that your output in refL07B.eval is the same as the refL07B-sorted.eval that was included in the .zip file. (If you did not sort refL07B, then please verify it is the same as refL07B-unsorted.eval.)
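For an exact comparison on Unix platforms, a command like the following should report no differences, assuming the line endings match those of the included file (on Win32, the fc utility serves the same purpose):

diff refL07B.eval refL07B-sorted.eval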

In particular, you should see these lines near the end of refL07B.eval:

The line with the F1@K score (the main measure in 2008) averaged over 43 topics:

:est_K-F1:     	all	0.1423

The line with the Recall@B score (the main measure in 2007) averaged over 43 topics:

:est_RB:       	all	0.2158
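If you want to extract one of these summary values programmatically, a minimal C sketch along the following lines should work; the helper is purely illustrative (it is not part of the l07_eval distribution), and the file and measure names are just the examples used above.

#include <stdio.h>
#include <string.h>

/* Print the "all"-topics summary value of one measure (e.g. ":est_K-F1:")
   from a .eval file such as refL07B.eval. Illustrative helper only. */
int main(void)
{
    const char *measure = ":est_K-F1:";
    FILE *fp = fopen("refL07B.eval", "r");
    char line[4096], name[64], topic[64];
    double value;

    if (fp == NULL) {
        perror("refL07B.eval");
        return 1;
    }
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Each summary line has the form: measure, topic ("all"), value. */
        if (sscanf(line, "%63s %63s %lf", name, topic, &value) == 3 &&
            strcmp(name, measure) == 0 && strcmp(topic, "all") == 0)
            printf("%s all %.4f\n", name, value);
    }
    fclose(fp);
    return 0;
}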

For evaluating your own Ad Hoc runs (once they are sorted), you just need to replace the run-specific parts of the above command-line. In particular:

  1. Replace sorted\refL07B (the run argument) with the path of your own sorted retrieval set.
  2. Replace refL07B.eval (the out5 argument) with the desired name for your .eval output file.
  3. Replace refL07B.K (the Kfile argument) with your own file of K values (the required format is described in a later section).

What was the highest F1@K score in 2007?

We didn't collect K values for the ranked runs in 2007, so we don't know. For the reference Boolean run (for which K=B), the F1@K score was 0.1423 (as shown above).

What was the highest F1@R score in 2007?

The hi_all70.eval20 file lists the highest scores in each measure of all 70 submitted runs of 2007. In particular, the highest mean F1@R score in 2007 was 0.2068 as per the following line:

:est_R-F1:     	all	0.2068

The hi_req25.eval20 file lists the highest scores in each measure for the 25 "standard condition" runs submitted in 2007 (i.e. the runs which just used the request text field). In particular, the highest mean F1@R score of the standard condition runs in 2007 was 0.1579 as per the following line:

:est_R-F1:     	all	0.1579

The median scores over all 70 runs are available in the med_all70.eval20 file, and the median scores over all 25 standard condition runs are available in the med_req25.eval20 file. (The median mean F1@R score was 0.1268 for all 70 runs and 0.1184 for the 25 standard condition runs.)

For reference, here are the 35 runs with mean F1@R scores above the all-70 median of 0.1268 in 2007: :est_R-F1: all : (1: otL07frw 0.2068) (2: otL07fbe 0.1792) (3: CMUL07ibt 0.1766) (4: wat5nofeed 0.1738) (5: CMUL07ibs 0.1713) (6: otL07pb 0.1704) (7: CMUL07ibp 0.1671) (8: CMUL07o3 0.1641) (9: CMUL07irs 0.1626) (10: wat1fuse 0.1624) (11: otL07rvl 0.1579) (12: IowaSL0704 0.1577) (13: UMKC4 0.1526) (14: otL07fb2x 0.1526) (15: UMKC6 0.1524) (16: wat3desc 0.1520) (17: CMUL07o1 0.1515) (18: IowaSL0703 0.1503) (19: otL07fb 0.1471) (20: wat8gram 0.1468) (21: UMKC1 0.1457) (22: UMKC5 0.1449) (23: UMKC2 0.1435) (24: UMass15 0.1403) (25: SabL07ab1 0.1398) (26: UMass10 0.1366) (27: otL07fv 0.1362) (28: SabL07arbn 0.1357) (29: UMass14 0.1338) (30: IowaSL0702 0.1327) (31: UMKC3 0.1323) (32: ursinus1 0.1287) (33: IowaSL0706 0.1287) (34: UMass13 0.1277) (35: IowaSL0705 0.1275) [median 0.1268].

For reference, here are the 12 standard condition runs with mean F1@R scores above the standard condition median of 0.1184 in 2007: :est_R-F1: all : (1: otL07rvl 0.1579) (2: wat3desc 0.1520) (3: UMKC5 0.1449) (4: UMKC2 0.1435) (5: UMass15 0.1403) (6: UMass10 0.1366) (7: UMass14 0.1338) (8: ursinus1 0.1287) (9: CMUL07std 0.1234) (10: fdwim7rs 0.1232) (11: ursinus4 0.1221) (12: IowaSL07Ref 0.1188) [median 0.1184].

Why are there F1@K scores in the median and high score files when K values were not collected in 2007?

For the runs from 2007, the l07_eval Kfile argument was omitted, and l07_eval defaults to setting the K value for each topic to the number of retrieved documents (which was 25,000 for most of the runs in 2007, because that was the maximum allowed number of documents to submit for each topic). Please ignore the F1@K scores in the median and high score files for 2007.

How should system training be done for the F1@K measure?

This is up to you. A suggestion would be to first train the system's ranking based on F1@R, then hold the ranked retrieval set fixed and train the choice of the K values to optimize F1@K.
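As a concrete illustration of the second step, here is a minimal C sketch that picks, for one topic, the K maximizing F1 = 2PR/(P+R) from arrays of precision and recall at each depth; how those per-depth estimates are obtained on training data is up to you, and the function and array names are made up for this example.

#include <stdio.h>

/* Return the depth K (1..n) maximizing F1 = 2PR / (P + R), where
   precision[k-1] and recall[k-1] hold the estimates at depth k. */
static int best_K(const double *precision, const double *recall, int n)
{
    int k, bestK = 1;
    double bestF1 = 0.0;

    for (k = 1; k <= n; k++) {
        double p = precision[k - 1], r = recall[k - 1];
        double f1 = (p + r > 0.0) ? 2.0 * p * r / (p + r) : 0.0;
        if (f1 > bestF1) {
            bestF1 = f1;
            bestK = k;
        }
    }
    return bestK;
}

int main(void)
{
    /* Placeholder values for illustration only. */
    double p[] = { 1.00, 0.50, 0.67, 0.50 };
    double r[] = { 0.10, 0.10, 0.20, 0.20 };

    printf("best K = %d\n", best_K(p, r, 4));
    return 0;
}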

How do I produce diagnostic retrieval sets for use with the judgments from 2007?

The main task guidelines for 2007 at http://trec-legal.umiacs.umd.edu/main07b.html describe what you need to do, e.g. how to acquire the document collection and the format of a retrieval set.

Some of the resources that you need are now posted at http://trec.nist.gov/data/legal07.html. In particular, the topics (numbered 52 to 101) are in the topicsL07_v1.zip, and refL07B is an example of a valid retrieval set for 2007.

The required format for the file of K values (new to 2008) is one line per topic, where each line consists of the topic number, followed by whitespace, followed by the K value. The refL07B.K file noted earlier is an example of a valid file of K values for use with the 2007 retrieval sets.
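For example, the first few lines of a valid K file for the 2007 topics could look as follows (the K values shown here are arbitrary placeholders, not the values in refL07B.K):

52	18000
53	2500
54	25000

and so on, with one line for each topic through 101.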

The track web site at http://trec-legal.umiacs.umd.edu/ has the details of the tasks for 2008.

What are the codes used in the qrels files?

  1. The 1st column is the topic number (from 52 to 101).
  2. The 2nd column is always a 0.
  3. The 3rd column is the document identifier (e.g. tmw65c00).
  4. The 4th column is the relevance judgment: 1 for "relevant", 0 for "non-relevant", and -1 or -2 for "gray". (In the assessor system, -1 was "unsure" (the default setting for all documents) and -2 was "unjudged" (the intended label for gray documents).)
  5. The 5th column is the probability the document had of being selected for assessment from the pool of all submitted documents.
  6. The 6th column is the highest rank at which any submitted run in 2007 retrieved the document (where 1 is the highest possible rank).
  7. The 7th column is one of the systems that retrieved the document at that rank.

The qrelsL07.normal file just includes the first 4 columns and can be used with the traditional trec_eval utility.

The qrelsL07.probs file has the same first 4 columns as qrelsL07.normal, plus the above listed 5th probability column, which the l07_eval utility can use for estimating scores at greater depths from the judged samples.

The qrelsL07.runids file has the same first 5 columns as qrelsL07.probs, plus the above listed 6th and 7th columns which tell you which system most favored each assessed document. This file was used to produce the lists of the "5 Deepest Sampled Relevant Documents" included in Section 4 of the track overview paper for 2007.

The judgments for 1 additional topic, which were received too late for reporting at the conference, are in separate files: qrelsL07.late1 (4 columns), qrelsL07.late1.probs (5 columns) and qrelsL07.late1.runids (7 columns).
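If you need to read the qrels files in your own code, here is a minimal C sketch for the 5-column qrelsL07.probs layout described above; it simply counts the lines judged relevant, it is not part of the l07_eval distribution, and handling the 7-column qrelsL07.runids layout would just mean reading two additional string fields.

#include <stdio.h>

/* Read the 5-column qrelsL07.probs format:
   topic, 0, docid, judgment, probability.
   Illustrative sketch: counts the documents judged relevant (judgment == 1). */
int main(void)
{
    FILE *fp = fopen("qrelsL07.probs", "r");
    int topic, zero, judgment;
    char docid[64];
    double prob;
    long relevant = 0;

    if (fp == NULL) {
        perror("qrelsL07.probs");
        return 1;
    }
    while (fscanf(fp, "%d %d %63s %d %lf",
                  &topic, &zero, docid, &judgment, &prob) == 5) {
        if (judgment == 1)
            relevant++;
    }
    printf("%ld relevant judgments\n", relevant);
    fclose(fp);
    return 0;
}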

What are the codes used in the .eval output of l07_eval?

The .eval files (such as refL07B.eval) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.

The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.

The next 41 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)

Of the remaining measures, those which start with the "est_" prefix use estimation. For individual topics:

The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.

At the end of the .eval file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).

Some additional information is in the version 1.0 glossary at http://trec.nist.gov/data/legal/07/glossaryL07.html.

What were the bugs in version 1.0?

In l07_eval.c, when the estimated number of non-relevant documents at depth k (called dEstNret) exceeded k minus the raw number of relevant documents in the first k retrieved (called nLimNret), then dEstNret should have been capped at nLimNret, but instead version 1.0 set dEstNret to nLimRret (which was k minus the raw number of non-relevant documents in the first k retrieved).
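In code terms, the fix in version 2.0 amounts to the following sketch, using the variable names from the description above (this is an illustration of the corrected capping, not the actual l07_eval.c source):

/* dEstNret = estimated number of non-relevant documents in the first k retrieved
   nLimNret = k minus the raw number of relevant documents in the first k retrieved
   nLimRret = k minus the raw number of non-relevant documents in the first k retrieved */
if (dEstNret > nLimNret)
    dEstNret = nLimNret;   /* version 1.0 mistakenly assigned nLimRret on this line */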

None of the estimated recall scores were affected by this bug because the estimates of the number of non-relevant documents were not used in the recall calculations. In particular, the track's main measure of Estimated Recall@B was not affected.

The estimated precision scores were sometimes affected by this bug, but fortunately the impact was typically minor. For example, the biggest impact on a precision score in Table 1 of the 2007 track overview paper is that a score of 0.164 should have been listed as 0.161. For the refL07B run, the mean Estimated Precision@B of 0.292 was not impacted at all.

There were also bugs in the calculations of GS30, GS30J and bpref, but these measures were typically not cited in 2007.

An unreleased version of l07_eval (between version 1.0 and 2.0) over-estimated F1@R when R exceeded the number of retrieved documents (which particularly happened when R exceeded 25,000 for some topics; the calculation in this case used Precision@ret as an input instead of Precision@R). There was a minor impact on the F1@R numbers reported in Section 5.1 of the 2007 overview paper from this issue (the reported high score of 0.22 should have been 0.21, and the reported median of 0.14 should have been 0.13). Fortunately, the conclusions in Section 5.1 are still valid.

All of these issues have been resolved in version 2.0.

What if I have further questions?

The best place to ask questions is the track mailing list (see the track web site at http://trec-legal.umiacs.umd.edu/ for how to join the list and see the archives). The mailing list is monitored by practically all of the users of l07_eval, giving you the best chance for a quick response. You are also welcome to send your question directly to stephent (at) magma.ca (but the response time may vary).

(end of usage notes)