From misclists1@daviddlewis.com Thu May 17 12:21:20 2007
Date: Thu, 17 May 2007 11:18:53 -0500
From: "Dave Lewis (address for public mailing lists)"
To: trec-legal@umiacs.umd.edu
Subject: [Trec-legal] 17-May-07 update of description of IIT CDIP v. 1.0 / TREC 2007 data

This version corrects some errors in the counts of elements, which were based on an early version of the data. Thanks to Howard Turtle for catching this.

Dave

---------

Data Elements in Version 1 of the IIT Complex Document Information Processing Test Collection (DRAFT)

David D. Lewis
David D. Lewis Consulting
Chicago, IL
cdipdoc20070514@DavidDLewis.com

Version: 17-May-2007

************************************************************************ ****
************************************************************************ ****

I. Introduction

I.A. Some Quick Advice for Text Retrieval Researchers

Each document in this collection is an XML-formatted element called . The element contains a unique ID.

The records contain a daunting number of elements with a variety of awkward properties (redundancy, inconsistent formatting, complex semantics, etc.). Text retrieval researchers wishing to get a system running on this data quickly may wish to use only a subset of the data. Here are some thoughts on a few quick and dirty approaches to try.

1. Text Only

Ignore the metadata and use the standard textual elements:

   : Body of the text (as OCR output)
   : Title of the document

The reason this isn't quite as easy as it sounds is that the OCR output is sometimes *very* noisy. There are huge numbers of noise words, which may break indexing software and conceivably degrade within document term weighting strategies. You may wish to discard low frequency strings before indexing.

2. High Quality, Text-Like Metadata

There are a number of metadata fields that contain high quality, text-like content identifiers (e.g. lists of names manually extracted from documents).
Using the following elements would produce a compact, moderate-quality index while avoiding OCR output, mysterious identifiers, etc.:

   / : Title
   / : Authors (people, or unspecified)
   / : Authors (organizations)
   / : Attendees
   / : Brands
   / : Copied (people CC'd on a document)
   / : Short description of the document.
   / : Organizational names mentioned in document.
   / : Personal or unspecified names mentioned in document.
   / : Organizations who received the document.
   / : People who received the document.
   / : Controlled vocabulary categories.

3. And Beyond

Using OCR text plus metadata is the obvious next step. Beyond that there's a huge range of other information in the metadata. Particularly intriguing are the many levels of physical proximity (Bates numbers, file boxes, etc.) of documents that can be suggestive of related meanings. Distinguishing documents by corporate source, or site within a corporation, may be of interest to those working on distributed search. There is rich genre/format information that may be helpful as well.

************************************************************************ ****
************************************************************************ ****

II. Structure of the Data

Each file in the IIT CDIP collection is an XML file that contains a single element. The element contains some documentation elements, followed by a set of elements. The elements are the documents for this test collection.

elements have the following properties:

Occurrences: 6910192
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: none
Description: Each element contains information on a single tobacco document. Most of the interesting information is in two subelements of : and .

II. vs. ?

The information in the and elements is largely, but not completely, redundant with each other. The data in the elements was produced more recently and fixes some known minor glitches with data in the elements.
On the other hand, some of the interesting data in the elements is in XML comments, while all the data in the elements is in XML subelements. Also, some data is unique to each. Only the elements contain information on the corporate source of the data (see or ) and only the elements contain the OCR text (see ).

************************************************************************ ****
************************************************************************ ****

III. The , , , and Elements

Occurrences: 6909982
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: none
Description: We just wrap this around as a reminder that the data is older than the data. See the description of in the next section. Note that due to processing glitches, 210 records in IIT CDIP v. 1.0 do not contain a element and thus do not contain an element.

Occurrences: 6910192
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: none
Description: Pathname of original metadata file containing element.

Occurrences: 6910192
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: none
Description: MD5 checksum of original metadata file containing element.

Occurrences: 6909982
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: none
Description: UCSF ID, i.e. a unique ID assigned to the document by UCSF LTDL. We recommend that this ID be used whenever it is necessary to uniquely identify an IIT CDIP v. 1.0 record. Corresponds to value of ID attribute in tag, and to contents of element in element. However, due to a glitch in data preparation, the element is missing from 210 records in IIT CDIP 1.0. Therefore the element should be used as the source of the UCSF ID instead.

The UCSF ID also corresponds to the last part of the permalink URLs found in LTDL search results. For instance, in http://legacy.library.ucsf.edu/tid/mjj22d00 we see the UCSF ID mjj22d00. The UCSF ID is also the unique ID used in TREC qrels files.
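
For anyone who needs to go from an LTDL permalink back to the UCSF ID, here is a minimal sketch in Python. It assumes only what is stated above, namely that the ID is the last part of the permalink URL. The function name ucsf_id_from_permalink is made up for this illustration and is not part of any UCSF or TREC tooling.

```python
from urllib.parse import urlparse


def ucsf_id_from_permalink(url: str) -> str:
    """Return the last path segment of an LTDL permalink URL.

    Assumption: the UCSF ID is the final path component, as in
    http://legacy.library.ucsf.edu/tid/mjj22d00
    """
    # Drop any trailing slash so the last segment is never empty.
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(ucsf_id_from_permalink("http://legacy.library.ucsf.edu/tid/mjj22d00"))
```

The returned string can then be compared against the document IDs in a qrels file or against the ID stored in each record.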
************************************************************************ ****
************************************************************************ ****

IV. The Element and its Subelements

IV.A. The Element

Occurrences: 6909982
Parent:
Subcollections: ALL
Cardinality: 1
Attributes: ID. Value is unique ID for this document.
Description: The elements were sent by UCSF to IIT in July 2005. They contain an earlier version of a portion of the data in the elements. The information in the elements should largely be ignored in using this collection, since it is redundant with information in the record. One important exception is the element (see below).

In creating Version 1 of the IIT CDIP collection, we attempted to find for each record the corresponding record. The match was done by comparing the contents of the element of a record with the value of the ID attribute of a record. We created elements only for those cases where we had a matching and record.

IV.B. The Subelements of

Note that some subelements of have the attribute 't'. The meaning of this attribute varies from element to element. Within a particular element, the attribute is not used for all subcollections. I have not checked whether the attribute is always used within those collections that use it at all.

Many elements contain lists of names. Formatting of those names, and of lists of names, is not consistent between collections. It may or may not be consistent within collections - don't get your hopes up.

Occurrences: 6909982
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A very few records have an element in LTDLWOCR that corresponds to the element in , but in general there is no equivalent.
Description: Single letter code indicating from which subcollection (corporate source) the document originated:

   a = American Tobacco Company
   b = Brown & Williamson
   c = Council for Tobacco Research
   l = Lorillard
   p = Philip Morris
   r = RJ Reynolds
   t = Tobacco Institute

For more about the corporate sources, see http://legacy.library.ucsf.edu/about_the_collections.html. Subcollection appears to be the only information in the element that is not duplicated in some fashion in the element.

Occurrences: 6039228
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent:
Description: Title.

Occurrences: 6345912
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for person) or "o" (for organization).
LTDLWOCR Equivalent: and correspond to in LTDLWOCR. corresponds to in LTDLWOCR.
Description: Author or authors.

Occurrences: 6909982
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A very few records have an element in LTDLWOCR that corresponds to the element in , but in general there is no equivalent.
Description: When present, this element contains two strings separated by whitespace. The first string is 2 characters and indicates the corporate source:

   at = American Tobacco Company
   bw = Brown & Williamson
   ct = Council for Tobacco Research
   ll = Lorillard
   pm = Philip Morris
   rj = RJ Reynolds
   ti = Tobacco Institute

The first letter of the two character code is always the same as the single letter code in the element. The second string is the UCSF ID for the document and is always the same as the value of the ID attribute in the open tag of the element.

Occurrences: 6909982
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent:
Description: Document Date. There are a substantial number of errors and dummy values that appear in the contents of this element.

Occurrences: 1274713
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Attachment Group. I'm not sure what this is.

Occurrences: 36584
Parent:
Children: none
Attributes: See 14-June-05 DTD.
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: People or organizations attending (a meeting?).

Occurrences: 2128997
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Brands.
Occurrences: 9347262 Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for primary) or "o" (for secondary/other).
LTDLWOCR Equivalent: Primary Bates range (no attribute, or t="p") corresponds to element in LTDLWOCR. Secondary/other Bates ranges correspond to LTDL comments.
Description: One or more ranges of Bates numbers associated with a document. Bates numbers are unique alphanumeric identifiers traditionally stamped onto physical documents used in litigation, but today often assigned during scanning of documents. Each page of a document gets a unique identifier. Pages within the same document have a common prefix, with numbers that increase (usually in numeric order, but potentially in some alphanumeric order) from the first page to the last.

Some documents have more than one set of Bates numbers associated with them. I believe this results when:

   1) a document is submitted to the court and is stamped with a Bates number,
   2) a party to the lawsuit puts a copy of that stamped document in their files,
   3) a second lawsuit requires documents on various topics to be produced and the once-stamped document gets a new stamp when produced in the second lawsuit.

The attribute t encodes whether the element contains ???.

The formatting (including abbreviating) of the Bates number ranges varies across corporate sources. This, combined with the frequent association of multiple Bates number ranges with the same document, makes it difficult to parse the Bates number fields present in the data. This table attempts to summarize the variations in format:

   tobacco.health.usyd.edu.au/site/gateway/docs/pdf/Bates.pdf

(Note that this table refers to the metadata as present on the tobacco company websites, not the reformatted metadata from UCSF that is used in the IIT CDIP collection.) I suspect there are additional inconsistencies in the formatting beyond those discussed in Bates.pdf, particularly when the same document has multiple Bates number ranges associated with it.

Because of the problems with Bates number formatting, and presence of multiple Bates number ranges on a single document, not to mention the presence in the collection of multiple physical copies of the same logical document, I strongly recommend not using Bates numbers as identifiers of either documents or pages within documents.

Since documents are often assigned Bates numbers in order as they are taken from a box, file cabinet, etc., documents with nearby Bates numbers may have related content. This fact has been exploited by tobacco document researchers to find useful documents with poor or misleading metadata and OCR.

Occurrences: 1744575
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for person) or "o" (for organization).
LTDLWOCR Equivalent: (attribute is ignored)
Description: Copied. Names of people/organizations that got copies of this document.

Occurrences: 1113978
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Case ID. Identifies legal case during which the document was produced?

Occurrences: 296457
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Case name. Identifies legal case during which the document was produced?

Occurrences: 3913611
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Characteristics. Appears to be physical characteristics of document, including types of annotations present. Examples include whether document has marginalia, attachments, etc.

Occurrences: 358364
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Description. High level description of either the purpose or nature of the document. Short and highly variable in nature.

Occurrences: 1113977
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: none
Description: Doc Begin. First number of (primary?) Bates range. Should be redundant with part of
. Occurrences: 1113975 Parent:
Children: none Attributes: LTDLWOCR Equivalent: none Description: Doc End. Last number of (primary?) Bates range. Should be redundant with part of
.
Occurrences: 6688146
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Date loaded. Date on which document was first posted to the producing company's website.

Occurrences: 6909982
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent:
Description: Date Modified. Not sure what this means.

Occurrences: 522769
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Date Produced. Date the document was turned over to the Minnesota repository. Note that the description of this element at http://legacy.library.ucsf.edu/search_fields.html appears to be wrong.
Occurrences: 9359755
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for primary) or "o" (for secondary/other).
LTDLWOCR Equivalent: . Each element from has a corresponding element in , but elements in do not have attribute values.
Description: Document Type. Format or kind of document. The categorizations here, which vary from company to company, seem to be based on notions of both physical format and genre.

Occurrences: 162568
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: none
Description: Estimated Date. This contains the letter E when the original tobacco company metadata indicated that a date was estimated. (Of course, many dates are simply grossly wrong without any indication of trouble.)

Occurrences: 59752
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: none
Description: Ending Date. For documents that were associated with a date range rather than a single date, this is the last date of that range. In practice, usually contains a dummy value.

Occurrences: 2801817
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent:
Description: File. Contents are highly variable, including company-internal ID numbers, names, and descriptions.

Occurrences: 58544
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Grant Number. From the CTR website: 'Description: All contract, grant, application for grant, and special project numbers mentioned in the document. Format: Search by using a two-letter abbreviation ("CT" for contract, "GR" for grant, "AP" for application, and "SP" for special project) followed by a five-digit number.'

Occurrences: 3860799
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Litigation Usage. Identifies a case or cases in which the document was produced. Cases are identified by company-specific codes.

Occurrences: 5139005
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for person) or "o" (for organization).
LTDLWOCR Equivalent: . Specifically, or corresponds to element in . corresponds to element in . Note that and in both correspond to and in .
Description: Mentioned. Names mentioned in the document.

Occurrences: 454025
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for person) or "o" (for organization).
LTDLWOCR Equivalent: . Specifically, or corresponds to element in . corresponds to element in . Note that and in both correspond to and in .
Description: Noted. Names that appear in marginalia.

Occurrences: 97209
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Oklahoma Downgrades. My guess is that this is related to a legal case in Oklahoma involving assertions of privilege for certain documents. (See also .) The content is an empty string or "Yes".

Occurrences: 6764565
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent:
Description: Page Count. Appears to be number of pages in document. Unclear why some documents don't have this.

Occurrences: 401768
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Physical Attachment 1. Records information about physical grouping of documents with rubber bands, paper clips, etc. See http://www.pmdocs.com/navindex.asp

Occurrences: 78833
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: none
Description: Physical Attachment 2. Records information about physical grouping of documents with rubber bands, paper clips, etc. See http://www.pmdocs.com/navindex.asp

Occurrences: 1820583
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent:
Description: Production Box. Identifier of the box in which the physical copy of the document resides at the Minnesota repository.

Occurrences: 3917362
Parent:
Children: none
Attributes: t. The attribute 't' can take on the values "p" (for person) or "o" (for organization).
LTDLWOCR Equivalent: and correspond to in LTDLWOCR. corresponds to in LTDLWOCR.
Description: Recipients. Names of individuals and/or organizations who received the document (presumably as indicated by addresses on the document).

Occurrences: 7177
Parent:
Children: none
Attributes:
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Redacted. Documents from which portions protected under legal privilege were redacted.

Occurrences: 4973097
Parent:
Children: none
Attributes: Variable (see 14-June-05 DTD).
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Request Number. A code indicating at least one of the cases / requests for documents for which this document was produced.

Occurrences: 5325646
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Source.
Where within a corporate entity the document was found. Sometimes described in geographic terms, and sometimes in organizational terms.

Occurrences: 34479
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Special Collections. Identifies certain subsets of documents in which plaintiff or defendant took a special interest. See http://www.ctr-usa.org/ctr/index.wmt?tab=home&selection=home/fields and http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=docall#.

Occurrences: 1324007
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Date Shipped. I haven't found any description of this element, but my guess is that it is the date the physical document was shipped to the Minnesota repository.

Occurrences: 3596670
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Site. An alternate coding of the information in (Source).

Occurrences: 2334
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Unknown.

Occurrences: 10732
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Trial Exhibit. Identifies documents that were introduced by plaintiffs in certain cases.

Occurrences: 799061
Parent:
Children: none
Attributes: none
LTDLWOCR Equivalent: A comment includes element name and contents.
Description: Not described in 14-June-05 DTD but probably stands for Topics. Seems to contain controlled vocabulary categories from the list at http://www.rjrtdocs.com/rjrtdocs/index.wmt?tab=home&selection=home/rjrt_topics.

************************************************************************ ****
************************************************************************ ****

IV.
The Element and Its Subelements

Occurrences: 6910192
Parent:
Children: See below.
Attributes: none
Equivalent: ---
Description: See above.

Occurrences: 21
Parent:
Children: none
Attributes: none
Equivalent:
Description: See . The elements seem to be a glitch since in general the DS information is not recorded in LTDLWOCR elements.

Occurrences: 21
Parent:
Children: none
Attributes: none
Equivalent:
Description: See . The elements seem to be a glitch since in general the PV information is not recorded in LTDLWOCR elements.

Occurrences: 4575986
Parent:
Children: none
Attributes: none
Equivalent: or .
Description: See .

Occurrences: 6910192
Parent:
Children: none
Attributes: none
Equivalent: or
Description: See .

Occurrences: 1820586
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .

Occurrences: 1768030
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .

Occurrences: 934496
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .

Occurrences: 6910192
Parent:
Children: none
Attributes: none
Equivalent:
Description: See

Occurrences: 6910192
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .
Occurrences: 9360071
Parent:
Children: none
Attributes: none
Equivalent:
Description: See under .

Occurrences: 2647214
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .

Occurrences: 2132534
Parent:
Children: none
Attributes: none
Equivalent: or
Description: See and .

Occurrences: 3460698
Parent:
Children: none
Attributes: none
Equivalent: ; ; ; or
Description: See and .

Occurrences: 6794895
Parent:
Children: none
Attributes: none
Equivalent: none
Description: Output of optical character recognition (OCR) applied to the scanned document image. The OCR output is wrapped in a CDATA section since it may contain XML reserved characters. The string "pgNbr=" appearing at the beginning of a line appears to signal the end of each page.

Occurrences: 1744613
Parent:
Children: none
Attributes: none
Equivalent: , , or .
Description: See .

Occurrences: 6764775
Parent:
Children: none
Attributes: none
Equivalent:
Description: See .

Occurrences: 2980815
Parent:
Children: none
Attributes: none
Equivalent: or .
Description: See .

Occurrences: 6037066
Parent:
Children: none
Attributes: none
Equivalent:
Description: See

Occurrences: 6910192
Parent:
Children: none
Attributes: none
Equivalent: Value of ID attribute on .
Description: See , , and .

************************************************************************ ****
************************************************************************ ****

V. Additional Information

The UCSF metadata (both the and versions) were produced by reformatting metadata whose original fields varied considerably between corporate sources. There were undoubtedly many compromises and inconsistencies in that process.

The file dtds.html (which I will email to the TREC 2006 list) was originally sent by UCSF to IIT on 14-Jun-2005. While it is incomplete and has some obvious cut-and-paste typos, it provides interesting information on the subelements of the element.

Further description of the elements, though not, alas, in terms of the element names, is available at http://legacy.library.ucsf.edu/search_fields.html

Each of the tobacco company websites provides some information about the fields in their metadata:

1. American Tobacco Company (click on individual field names)
   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=atco#

2. Brown & Williamson (click on individual field names)
   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=bw#

3. Council for Tobacco Research:
   http://www.ctr-usa.org/ctr/index.wmt?tab=home&selection=home/fields

4. Lorillard:
   http://www.lorillarddocs.com/navindex.asp

5. Philip Morris
   http://www.pmdocs.com/navindex.asp

6. R. J. Reynolds (click on individual field names)
   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=docall#
   http://www.rjrtdocs.com/rjrtdocs/index.wmt?tab=home&selection=home/rjrt_about_site

7.
Tobacco Institute:
   http://www.tobaccoinstitute.com/index1.htm

Other resources:
   http://www.cdc.gov/tobacco/industrydocs/4babout.htm