TREC 2007 Legal Discovery Track: Main Task Evaluation Topics

Revision History
- 2007-July-2 (st) - updated last year's readme for 2007

Contents:
  File List
  Background on Topic Creation
  Boolean Strings Syntax
  Topic File Structure
  Further Information

File list for topicsL07_v1.zip:
  TREC2007_ComplaintA_v1.doc (background on 17 requests (52-68))
  TREC2007_ComplaintB_v1.doc (background on 10 requests (69-78))
  TREC2007_ComplaintC_v1.doc (background on 10 requests (79-88))
  TREC2007_ComplaintD_v1.doc (background on 13 requests (89-101))
  fullL07_v1.xml (official topic file for participants)
  shortL07_v1.xml (topic file excluding repeated fields)
  readmeL07_v1.txt (this file)

This README file explains the format of the topic (request) files used for the TREC 2007 Legal track, as well as how the topics are created. There is a cumulative total of 50 requests to produce among the 4 complaints, representing 50 topics for the TREC legal track. The 50 topics are assembled in a single XML file (fullL07_v1.xml).

Background on Topic Creation:

1. Working with legal documents: the hypothetical Complaints and Requests to Produce.

In 2007, a TREC track coordinator worked with colleagues from The Sedona Conference(R) in the drafting of the four hypothetical "Complaints" marked A through D. (Technically speaking, Complaint D, representing a fictional letter from the Antitrust Division of the US Department of Justice, Federal Watchdog Commission, is known as a "second request" demand letter, but for ease of reference it too will be deemed a "Complaint.") As indicated by their titles, each of the four documents focuses on a different area of the law. Although there are variations in the makeup of the four Complaints, each generally sets out in narrative format a fictional set of facts which gives rise to allegations of one or more violations of law.
Every effort was made to fictionalize the actors involved and in most cases the venue of the alleged illegality, although some real world places and names have crept in.

Each of the four Complaints includes a Request for Production of Documents; three also contain Instructions and Definitions included with the actual Set of Requests to Produce. Each request to produce corresponds to a "topic" for purposes of the overall TREC project.

In the past, and at least up until the advent of new Federal Rules of Civil Procedure which went into effect in December 2006, lawyers have not routinely negotiated "search protocols," including Boolean strings of terms to accompany or guide the requisite search through large electronic databases held in corporations and other institutions in response to discovery demands. Nor are typical requests to produce aimed at finding a targeted small number of relevant documents in a much larger universe. Rather, lawyers typically frame the majority of their requests in a more open-ended fashion, aiming to have the opposing party spend a maximum of resources in responding to and/or fighting against such broad discovery (the problem of so-called "fishing expeditions"). There is also a large amount of overlap in many real-world requests to produce. In contrast, some attempt was made here to narrow the typical requests (although with somewhat relaxed constraints for Year 2 of the legal track, allowing B values to range as high as 25,000), as well as to make them somewhat more diverse and less redundant than in a typical set in the "real world"; however, these are still very recognizable proxies for requests that would be made in a real litigation context.
Please note in particular that the "Complaints," which in some cases go on at great length, nevertheless were not drafted with an eye towards providing the participating TREC community with any special "clues" or insights helpful to doing searches against the topics represented in the requests to produce. Rather, they merely provide context in the same manner as real-world legal pleadings would. Thus, while it may be helpful to read the Complaints for background purposes, it is not obvious that the Complaints need to be deconstructed in any intensive way in order to obtain better search results against the document requests themselves.

2. The Boolean strings.

A TREC track coordinator engaged one or more individual colleagues in some form of conversation or negotiation in every instance represented here. The strings were not, however, intended to contain a definitive or "all-inclusive" list of English language synonyms for material terms; for that, we leave the TREC community to its own creative resources.

A note on Boolean operators:

Boolean Strings Syntax:
- AND, OR, NOT, () : As usual
- BUT NOT: (x BUT NOT y) means the same thing as (x AND (NOT (y)))
- x : Match this word exactly (case-insensitive).
- x! : Truncation - matches all strings that begin with substring x.
- !x : Truncation - matches all strings that end with substring x.
- x?y : Single-character wildcard - matches all strings that begin with substring x, end with substring y, and have exactly one character between x and y
- x*y : Multiple-character wildcard - matches all strings that begin with substring x, end with substring y, and have 0 or more characters between x and y
- "x", "x y", "x y z", etc. : Phrase - match this string or sequence of words exactly (case-insensitive).
- "y x!", "x! y", etc. : If ! is used internal to a phrase, then do the truncated match on the words with !, and exact match on the others. (The * and ? wildcard operators may also be used inside a phrase.)
- w/k: Proximity - x w/k y means match "x a b ... c y" or "y a b ... c x" if "a b ... c" contains k or fewer words
- x w/k1 y w/k2 z: Chained proximity - a match requires the same occurrence of y to satisfy x w/k1 y and y w/k2 z

3. Topic file structure (fullL07_v1.xml).

The following layout shows all the elements used in the file and their relationships:

[XML element layout; the element tags are not preserved in this copy]

The meaning of each element is explained as follows:
- [...]: the root element of the XML file;
- [...]: request element. One request element corresponds to one request (topic). Each request element has 8 subelements:
  - [...]: a number that uniquely identifies the request, ranging from 52 to 101 inclusive. It is the same as the topic number in a traditional TREC evaluation;
  - [...]: a brief description of the subject of the documents that are relevant to the request. It has a function similar to the [...] field in a traditional TREC topic;
  - [...]: the intermediary and final query negotiation results, expressed as Boolean queries. Under this element, [...] shows the final negotiated query, while [...] contains the query proposed by the defendant ([...]) and the query rejoindered by the plaintiff ([...]). Unlike last year, [...] may differ from [...].
  - [...]: (new to 2007) specifies the number of records matching the final negotiated Boolean query (as per the reference Boolean run "refL07B").
  - [...]: (new to 2007) specifies the corresponding complaint (A, B, C or D) and its request number in the complaint .doc file; e.g. 2007-A-1 is the 1st request in the complaint A .doc file
  - [...]: describes the possession and entirety requirements of the responsive documents for the request;
  - [...]: defines particular terms in the context of the TREC legal track.
  - [...]: the hypothetical complaint that generated the topics in the form of requests to produce.
    This element has several optional elements, some of which are:
      [...] specifies the case number of the complaint;
      [...] specifies the date when the complaint is filed;
      [...] specifies the court where the complaint is filed;
      [...] specifies the plaintiff's name in the case;
      [...] specifies the defendant's name in the case;
      [...] briefly introduces the case;
      [...] gives additional information about the parties involved in the case;
      [...] describes the jurisdiction and venue of the case;
      [...] provides the context in which the case arises;
      [...] describes the legal claims that plaintiff is making against defendant in the case; and
      [...] describes the remedy that plaintiff is seeking, including for example monetary compensation.

Entities:
Be aware that in xml formatting, entities are used to represent characters with a special meaning in xml, in particular:
  &amp; for &
  &lt; for <
  &gt; for >

Short form of topic file (shortL07_v1.xml):
shortL07_v1.xml is the same as fullL07_v1.xml except that the Instruction, Definition and Complaint elements are removed for convenience. shortL07_v1.xml is ~40,000 bytes, compared to fullL07_v1.xml which is ~700,000 bytes.

4. Further information about the track can be found at its web site: http://trec-legal.umiacs.umd.edu/
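
Illustration of the Boolean Strings Syntax (section 2):
The wildcard and proximity operators listed under Boolean Strings Syntax can be illustrated with a small Python sketch. This is not the track's reference implementation. The function names (term_to_regex, term_matches, tokenize, within) are hypothetical, and two points are assumptions that this README does not settle: that wildcards match only word characters, and that a "word" is a run of letters, digits or underscores. Chained proximity and phrases are not covered.

```python
import re


def term_to_regex(term):
    """Translate one term of the Boolean string syntax into a compiled regex.

    Assumption: '!', '?' and '*' match word characters only (\\w).
    """
    parts = []
    for ch in term:
        if ch == "!":
            parts.append(r"\w*")  # truncation: any run of characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")   # single-character wildcard: exactly one character
        elif ch == "*":
            parts.append(r"\w*")  # multiple-character wildcard: 0 or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def term_matches(term, word):
    """Return True if the whole word matches the term (case-insensitive)."""
    return term_to_regex(term).fullmatch(word) is not None


def tokenize(text):
    """Split text into words. Assumption: a word is a run of \\w characters."""
    return re.findall(r"\w+", text)


def within(x, y, k, text):
    """x w/k y: some x and some y, in either order, with k or fewer words between."""
    words = tokenize(text)
    xs = [i for i, w in enumerate(words) if term_matches(x, w)]
    ys = [i for i, w in enumerate(words) if term_matches(y, w)]
    return any(i != j and abs(i - j) - 1 <= k for i in xs for j in ys)
```

For example, under these assumptions term_matches("wom?n", "women") holds, and within("price", "fix!", 2, "the price was secretly fixed") holds because exactly two words separate "price" and "fixed".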