TREC 2007 Legal Discovery Track: Main Task Evaluation Topics
Revision History
- 2007-July-2 (st) - updated last year's readme for 2007
Contents:
File List
Background on Topic Creation
Boolean Strings Syntax
Topic File Structure
Further Information
File list for topicsL07_v1.zip:
TREC2007_ComplaintA_v1.doc (background on 17 requests (52-68))
TREC2007_ComplaintB_v1.doc (background on 10 requests (69-78))
TREC2007_ComplaintC_v1.doc (background on 10 requests (79-88))
TREC2007_ComplaintD_v1.doc (background on 13 requests (89-101))
fullL07_v1.xml (official topic file for participants)
shortL07_v1.xml (topic file excluding repeated fields)
readmeL07_v1.txt (this file)
This README file explains the format of the topic (request) files used
for the TREC 2007 Legal track, as well as how the topics were created.
There is a cumulative total of 50 requests to produce among the 4
complaints, representing 50 topics for the TREC Legal track. The 50
topics are assembled in a single XML file (fullL07_v1.xml).
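As a quick cross-check of the file list, the sketch below maps a topic
number to its complaint and confirms that the four ranges add up to 50.
The names COMPLAINT_RANGES and complaint_for_topic are illustrative
only and are not part of the track's tooling.

```python
# Topic number ranges per complaint, copied from the file list above.
COMPLAINT_RANGES = {
    "A": range(52, 69),   # requests 52-68
    "B": range(69, 79),   # requests 69-78
    "C": range(79, 89),   # requests 79-88
    "D": range(89, 102),  # requests 89-101
}


def complaint_for_topic(topic_number):
    """Return the complaint letter (A-D) that a topic number belongs to."""
    for letter, numbers in COMPLAINT_RANGES.items():
        if topic_number in numbers:
            return letter
    raise ValueError(f"topic {topic_number} is outside 52-101")


counts = {letter: len(numbers) for letter, numbers in COMPLAINT_RANGES.items()}
total = sum(counts.values())
print(counts)  # {'A': 17, 'B': 10, 'C': 10, 'D': 13}
print(total)   # 50
```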
Background on Topic Creation:
1. Working with legal documents: the hypothetical Complaints and
Requests to Produce.
In 2007, a TREC track coordinator worked with colleagues from
The Sedona Conference(R) in the drafting of the four hypothetical
"Complaints" marked A through D. (Technically speaking, Complaint D,
representing a fictional letter from the Antitrust Division of the US
Department of Justice, Federal Watchdog Commission, is known as a
"second request" demand letter, but for ease of reference it too will
be deemed a "Complaint.")
As indicated by their titles, each of the four documents focuses on
a different area of the law. Although there are variations in the
makeup of the four Complaints, each generally sets out in narrative
format a fictional set of facts which gives rise to allegations of one
or more violations of law. Every effort was made to fictionalize the
actors involved and, in most cases, the venue of the alleged illegality,
although some real-world places and names have crept in.
Each of the four Complaints includes a Request for Production of
Documents; three also contain Instructions and Definitions included
with the actual Set of Requests to Produce. Each request to produce
corresponds to a "topic" for purposes of the overall TREC project. In
the past, and at least up until the advent of new Federal Rules of
Civil Procedure which went into effect in December 2006, lawyers have
not routinely negotiated "search protocols," including Boolean strings
of terms to accompany or guide the requisite search through large
electronic databases held in corporations and other institutions in
response to discovery demands. Nor are typical requests to produce
aimed at finding a targeted small number of relevant documents in a
much larger universe. Rather, lawyers tend to frame most of their
requests in an open-ended fashion, aiming to have the opposing party
spend a maximum of resources in responding to and/or fighting against
such broad discovery (the problem of so-called "fishing expeditions").
Many real-world requests to produce also overlap heavily with one
another. In
contrast, some attempt was made here to narrow the typical requests
(although with somewhat relaxed constraints for Year 2 of the legal
track, allowing B values to range as high as 25,000), as well as to
make them somewhat more diverse and less redundant than in a typical
set in the "real world"; however, these are still very recognizable
proxies for requests that would be made in a real litigation
context.
Please note in particular that the "Complaints," which in some cases
go on at great length, nevertheless were not drafted with an eye
towards providing the participating TREC community with any special
"clues" or insights helpful to doing searches against the topics
represented in the requests to produce. Rather, they merely provide
context in the same way that real-world legal pleadings would. Thus,
while it may be helpful to read the Complaints for background
purposes, it is not obvious that the Complaints need to be
deconstructed in any intensive way in order to obtain better
search results against the document requests themselves.
2. The Boolean strings. A TREC track coordinator engaged one or more
individual colleagues in some form of conversation or negotiation in
every instance represented here. The strings were not, however,
intended to contain a definitive or "all-inclusive" list of English
language synonyms for material terms; for that, we leave the TREC
community to its own creative resources. A note on Boolean operators:
Boolean Strings Syntax:
- AND, OR, NOT, () : As usual
- BUT NOT: (x BUT NOT y) means same thing as (x AND (NOT (y)))
- x : Match this word exactly (case-insensitive).
- x! : Truncation - matches all strings that begin with substring x.
- !x : Truncation - matches all strings that end with substring x.
- x?y : Single-character wildcard - matches all strings that
begin with substring x, end with substring y,
and have exactly one character between x and y
- x*y : Multiple-character wildcard - matches all strings that
begin with substring x, end with substring y,
and have 0 or more characters between x and y
- "x", "x y", "x y z", etc. : Phrase - match this string or sequence
of words exactly (case-insensitive).
- "y x!", "x! y", etc. : If ! is used internal to a phrase, then do
the truncated match on the words with !, and exact match on the
others. (The * and ? wildcard operators may also be used inside a
phrase.)
- w/k: Proximity - x w/k y means match "x a b ... c y"
or "y a b ... c x" if "a b ... c" contains k or fewer words
- x w/k1 y w/k2 z: Chained proximity - a match requires the same
occurrence of y to satisfy x w/k1 y and y w/k2 z
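The term-level operators and the proximity operators above can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not the matcher used for the reference run: words
are taken to be runs of \w characters, the wildcards ! ? * are assumed
to stand for word characters only, and phrases and AND/OR/NOT are left
out. All function names are hypothetical.

```python
import re


def term_to_regex(term):
    """Compile one query term (x, x!, !x, x?y, x*y) into a regex."""
    pieces = re.split(r"([?*!])", term)
    body = "".join(
        r"\w" if p == "?"               # exactly one character
        else r"\w*" if p in ("*", "!")  # zero or more characters
        else re.escape(p)               # literal text
        for p in pieces
    )
    return re.compile(body, re.IGNORECASE)  # matching is case-insensitive


def tokenize(text):
    """Split text into words (assumption: a word is a run of \\w characters)."""
    return re.findall(r"\w+", text)


def positions(term, tokens):
    """Indexes of the tokens that the term matches as a whole word."""
    pattern = term_to_regex(term)
    return [i for i, tok in enumerate(tokens) if pattern.fullmatch(tok)]


def within(x, k, y, tokens):
    """x w/k y: x and y in either order with k or fewer words between them."""
    xs, ys = positions(x, tokens), positions(y, tokens)
    return any(0 < abs(i - j) <= k + 1 for i in xs for j in ys)


def chained(x, k1, y, k2, z, tokens):
    """x w/k1 y w/k2 z: the same occurrence of y must satisfy both parts."""
    xs, ys, zs = (positions(t, tokens) for t in (x, y, z))
    return any(
        any(0 < abs(j - i) <= k1 + 1 for i in xs)
        and any(0 < abs(j - m) <= k2 + 1 for m in zs)
        for j in ys
    )


tokens = tokenize("The tobacco company marketed cigarettes to young adults")
print(positions("market!", tokens))                # [3]
print(positions("!s", tokens))                     # [4, 7]
print(within("tobacco", 1, "market!", tokens))     # True
print(within("tobacco", 0, "market!", tokens))     # False
```

Note that "k or fewer words between" is read literally here, so two
adjacent words satisfy w/0. Other search systems may count proximity
differently.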
3. Topic file structure (fullL07_v1.xml).
The following layout shows all the elements
used in the file and their relationships:
...
...
...
...
...
...
...
The meaning of each element is explained as follows:
- : the root element of the XML file;
- : request element. One request element
corresponds to one request (topic). Each request element has 8
subelements:
- : a number that uniquely identifies the request, ranging from 52
to 101 inclusive. It is the same as the topic number in a
traditional TREC evaluation;
- : a brief description of the subject of the documents
that are relevant to the request. It has a function similar to the
field in a traditional TREC topic;
- : the intermediary and final query negotiation
results, expressed as Boolean queries. Under this element,
shows the final negotiated query, while
contains the query proposed by the defendant
() the query rejoindered by the plaintiff
(). Unlike last year, may
differ from .
- : (new to 2007) specifies the number of records matching
the final negotiated boolean query (as per the reference boolean
run "refL07B").
- : (new to 2007) specifies the corresponding
complaint (A, B, C or D) and its request number in the
complaint .doc file; e.g. 2007-A-1 is the 1st request in
the complaint A .doc file
- Instruction: describes the possession and entirety requirement
for the documents responsive to the request;
- Definition: defines particular terms in the context of the TREC
Legal track.
- Complaint: the hypothetical complaint that generated the topics
in the form of requests to produce. This element has several
optional elements. Among other things, they specify the case
number of the complaint, the date when the complaint was filed,
the court where the complaint was filed, the plaintiff's name
and the defendant's name; briefly introduce the case; give
additional information about the parties involved in the case;
describe the jurisdiction and venue of the case; provide the
context in which the case arises; describe the legal claims that
the plaintiff is making against the defendant; and describe the
remedy that the plaintiff is seeking, including for example
monetary compensation.
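To see the element layout of a topic file directly, one option is to
list the distinct element paths with Python's standard
xml.etree.ElementTree module. The sketch below makes no assumption
about the element names in fullL07_v1.xml; the function name
element_paths and the tag names in the miniature sample are
placeholders for illustration only.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_data):
    """Return the distinct element paths in document order, e.g. 'root/child'."""
    root = ET.fromstring(xml_data)
    seen = []

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        if path not in seen:
            seen.append(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return seen


# Hypothetical miniature input. These tag names are placeholders,
# not the names used in fullL07_v1.xml.
sample = (
    "<Topics>"
    "<Topic><Num>52</Num><Text>...</Text></Topic>"
    "<Topic><Num>53</Num></Topic>"
    "</Topics>"
)
paths = element_paths(sample)
for path in paths:
    print(path)
```

For the real file, the contents could be read as bytes, for example
with open("fullL07_v1.xml", "rb").read(), and passed to the same
function.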
Entities: Be aware that in XML formatting, entities are used to
represent characters with a special meaning in XML, in particular:
&amp; for &
&lt; for <
&gt; for >
Short form of topic file (shortL07_v1.xml):
shortL07_v1.xml is the same as fullL07_v1.xml except that the
Instruction, Definition and Complaint elements are removed
for convenience. shortL07_v1.xml is ~40,000 bytes,
compared to fullL07_v1.xml which is ~700,000 bytes.
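The relationship between the two files, and the entity handling noted
earlier, can be sketched with Python's standard xml.etree.ElementTree
module. This is an illustration only, not the procedure used to build
shortL07_v1.xml. It assumes the three removed elements are tagged
exactly Instruction, Definition and Complaint; every other tag name in
the sample (Topics, Topic, Num, Query, Court) is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: the tags are spelled exactly as this README names them.
REMOVED_TAGS = {"Instruction", "Definition", "Complaint"}


def strip_repeated_fields(root, removed=REMOVED_TAGS):
    """Remove every element whose tag is in `removed`, modifying root in place."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in removed:
                parent.remove(child)
    return root


# Hypothetical miniature input with placeholder tag names.
sample = (
    "<Topics><Topic><Num>52</Num>"
    "<Query>R&amp;D AND price &lt; cost</Query>"
    "<Instruction>i</Instruction><Definition>d</Definition>"
    "<Complaint><Court>c</Court></Complaint>"
    "</Topic></Topics>"
)
root = ET.fromstring(sample)

# The parser decodes entities, so the text holds the plain characters.
decoded_query = root.find("Topic/Query").text

# Serializing re-encodes the special characters as entities.
short_form = ET.tostring(strip_repeated_fields(root), encoding="unicode")

print(decoded_query)  # R&D AND price < cost
print(short_form)
```

Because the parser decodes entities on reading, a query string taken
from the parsed tree can be used as-is; the entities only matter when
the file is processed as raw text.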
4. Further information about the track can be
found at its web site: http://trec-legal.umiacs.umd.edu/