From misclists1@daviddlewis.com Thu May 17 12:21:20 2007
Date: Thu, 17 May 2007 11:18:53 -0500
From: "Dave Lewis (address for public mailing lists)" <misclists1@daviddlewis.com>
To: trec-legal@umiacs.umd.edu
Subject: [Trec-legal] 17-May-07 update of description of IIT CDIP v. 1.0 /	TREC 2007 data

This version corrects some errors in the counts of elements, which  
were based on an early version of the data.  Thanks to Howard Turtle  
for catching this.

Dave


---------

Data Elements in Version 1 of the IIT Complex Document
Information Processing Test Collection (DRAFT)

David D. Lewis
David D. Lewis Consulting
Chicago, IL
cdipdoc20070514@DavidDLewis.com

Version: 17-May-2007

****************************************************************************
****************************************************************************

I. Introduction

I.A. Some Quick Advice for Text Retrieval Researchers

Each document in this collection is an XML-formatted element
called <record>.  The <docid> element contains a unique ID.

The records contain a daunting number of elements with a
variety of awkward properties (redundancy, inconsistent
formatting, complex semantics, etc.).  Text retrieval
researchers wishing to get a system running on this data
quickly may wish to use only a subset of the data.  Here are
some thoughts on a few quick and dirty approaches to try.

1. Text Only

Ignore the metadata and use the standard textual elements:

      <ot> : Body of the text (as OCR output)
      <ti> : Title of the document

The reason this isn't quite as easy as it sounds is that the
OCR output is sometimes *very* noisy.  There are huge
numbers of noise words, which may break indexing software
and conceivably degrade within-document term weighting
strategies.  You may wish to discard low-frequency strings
before indexing.
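
As a rough illustration of discarding low-frequency strings,
here is a minimal Python sketch.  The tokenizer, the function
names, and the document-frequency threshold of 2 are all
assumptions made for the example, not part of the collection
or of any particular indexing system.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts, min_doc_freq=2):
    """Return the tokens that occur in at least min_doc_freq texts."""
    doc_freq = Counter()
    for text in texts:
        # Count each token once per text (document frequency).
        doc_freq.update(set(tokenize(text)))
    return {tok for tok, df in doc_freq.items() if df >= min_doc_freq}


def filter_tokens(text, vocabulary):
    """Keep only the tokens of text that are in the vocabulary."""
    return [tok for tok in tokenize(text) if tok in vocabulary]
```

With a collection of this size the vocabulary pass would
normally be done in a streaming fashion; the threshold is
something to tune against the amount of OCR noise you see.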

2. High Quality, Text-Like Metadata

There are a number of metadata fields that contain high
quality, text-like content identifiers (e.g. lists of names
manually extracted from documents).  Using the following
elements would produce a compact, moderate-quality index
while avoiding OCR output, mysterious identifiers, etc.:

<LTDLWOCR>/<ti> : Title
<LTDLWOCR>/<au> : Authors (people, or unspecified)
<LTDLWOCR>/<ca> : Authors (organizations)
<A>/<at> : Attendees
<A>/<b> : Brands
<LTDLWOCR>/<pc> : Copied (people CC'd on a document)
<A>/<d> : Short description of the document.
<LTDLWOCR>/<no> : Organizational names mentioned in document.
<LTDLWOCR>/<np> : Personal or unspecified names
mentioned in document.
<LTDLWOCR>/<cr> : Organizations who received the document.
<LTDLWOCR>/<rc> : People who received the document.
<A>/<tp> : Controlled vocabulary categories.
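
A minimal Python sketch of pulling these fields out of one
parsed <record> follows.  It assumes the record has already
been parsed with the standard xml.etree.ElementTree module;
the function and constant names are hypothetical.  The
"copied" field is looked up as <pc> under <LTDLWOCR> and the
topics field as <tp> under <A>, following the per-element
descriptions later in this document.

```python
import xml.etree.ElementTree as ET

# Field names follow the per-element descriptions in this document.
LTDLWOCR_FIELDS = ("ti", "au", "ca", "pc", "no", "np", "cr", "rc")
A_FIELDS = ("at", "b", "d", "tp")


def metadata_text(record):
    """Concatenate the text-like metadata fields of one <record>."""
    parts = []
    ltdl = record.find("LTDLWOCR")
    a_elem = record.find(".//A")  # <A> sits inside <ucsf200507>
    for parent, names in ((ltdl, LTDLWOCR_FIELDS), (a_elem, A_FIELDS)):
        if parent is None:
            # Some records lack <A>; skip whatever is missing.
            continue
        for name in names:
            for child in parent.findall(name):
                if child.text and child.text.strip():
                    parts.append(child.text.strip())
    return " ".join(parts)
```

The result is a single string per record that can be handed
to whatever indexer you already use for plain text.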


3. And Beyond

Using OCR text plus metadata is the obvious next
step. Beyond that there's a huge range of other information
in the metadata.  Particularly intriguing are the many
levels of physical proximity (Bates numbers, file boxes,
etc.) of documents that can be suggestive of related
meanings.  Distinguishing documents by corporate source, or
site within a corporation, may be of interest to those
working on distributed search.  There is rich genre/format
information that may be helpful as well.

****************************************************************************
****************************************************************************

II. Structure of the Data

Each file in the IIT CDIP collection is an XML file that
contains a single <IITCDIPgp1> element. The <IITCDIPgp1>
element contains some documentation elements, followed by a
set of <record> elements.  The <record> elements are the
documents for this test collection.  <record> elements have
the following properties:

<record>
   Occurrences: 6910192
   Parent: <IITCDIPgp1>
   Subcollections: ALL
   Cardinality: 1
   Attributes: none
   Description: Each <record> element contains information on
a single tobacco document.  Most of the interesting
information is in two subelements of <record>: <A> and
<LTDLWOCR>.
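
Since the files hold millions of records, it is probably
wise to stream them rather than load a whole file.  The
following Python sketch uses the standard library's
iterparse for this; the function name iter_records is an
assumption for the example, and only the element names come
from the description above.

```python
import xml.etree.ElementTree as ET


def iter_records(source):
    """Yield <record> elements one at a time from an IIT CDIP XML file.

    source may be a filename or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root <IITCDIPgp1> element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            yield elem
            # Drop finished children so memory use stays bounded.
            root.clear()
```

Each yielded element should be used before asking for the
next one, since it is discarded afterwards.  Note that XML
parsers treat a colon in an element name (as in <NA:DS>) as
a namespace prefix, so this sketch assumes the prefix is
declared or that such files are handled separately.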

II.A. <A> vs. <LTDLWOCR>?

The information in the <A> and <LTDLWOCR> elements is
largely, but not completely, redundant. The
data in the <LTDLWOCR> elements was produced more recently
and fixes some known minor glitches with data in the <A>
elements.  On the other hand, some of the interesting data
in the <LTDLWOCR> elements is in XML comments, while all the
data in the <A> elements is in XML subelements.

Also, some data is unique to each.  Only the <A> elements
contain information on the corporate source of the data (see
<DS> or <PV>) and only the <LTDLWOCR> elements contain the
OCR text (see <ot>).
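
Most XML parsers silently drop comments, so the data held in
<LTDLWOCR> comments needs a little extra care.  The Python
sketch below (Python 3.8 or later) keeps comments while
parsing and returns their raw text.  The function names are
hypothetical, and the internal layout of the comments is not
specified here, so the sketch makes no attempt to parse it.

```python
import xml.etree.ElementTree as ET


def parse_with_comments(xml_text):
    """Parse XML text, keeping comments as nodes in the tree."""
    builder = ET.TreeBuilder(insert_comments=True)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(xml_text, parser=parser)


def comment_texts(element):
    """Return the stripped text of every comment under element."""
    return [c.text.strip() for c in element.iter(ET.Comment) if c.text]
```

Inspecting the output of comment_texts on a handful of
records is probably the quickest way to learn the actual
comment layout before writing a real parser for it.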

****************************************************************************
****************************************************************************


III. The <ucsf200507>, <path>, <md5>, and <docid> Elements

<ucsf200507>
   Occurrences: 6909982
   Parent: <record>
   Subcollections: ALL
   Cardinality: 1
   Attributes: none
   Description: We just wrap this around <A> as a reminder
that the <A> data is older than the <LTDLWOCR> data.  See
the description of <A> in the next section.  Note that
due to processing glitches, 210 records
in IIT CDIP v. 1.0 do not contain a <ucsf200507> element
and thus do not contain an <A> element.

<path>
   Occurrences: 6910192
   Parent: <ucsf200507>
   Subcollections: ALL
   Cardinality: 1
   Attributes: none
   Description: Pathname of original metadata file containing
<LTDLWOCR> element.

<md5>
   Occurrences: 6910192
   Parent: <ucsf200507>
   Subcollections: ALL
   Cardinality: 1
   Attributes: none
   Description: md5 checksum of original metadata file
containing <LTDLWOCR> element.

<docid>
   Occurrences: 6909982
   Parent: <ucsf200507>
   Subcollections: ALL
   Cardinality: 1
   Attributes: none
   Description: UCSF ID, i.e. a unique ID assigned to the
document by UCSF LTDL. We recommend that this ID
be used whenever it is necessary to uniquely identify
an IIT CDIP v. 1.0 record. Corresponds to the value of the
ID attribute in the <A> tag, and to the contents of the <tid>
element in the <LTDLWOCR> element.  However, due to a glitch
in data preparation, the <docid> element is missing from
210 records in IIT CDIP 1.0.  Therefore the <tid> element
should be used as the source of the UCSF ID instead.
The UCSF ID also corresponds to the last part of the
permalink URLs found in LTDL search results.  For instance,
in http://legacy.library.ucsf.edu/tid/mjj22d00 we see
the UCSF ID mjj22d00. The UCSF ID is also the unique
ID used in TREC qrels files.
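
A small Python sketch of this recommendation: read the UCSF
ID from <tid>, fall back to <docid> only if <tid> is somehow
empty, and build the permalink from it.  The function names
are assumptions for the example; the URL prefix is taken
from the permalink shown above.

```python
import xml.etree.ElementTree as ET

PERMALINK_PREFIX = "http://legacy.library.ucsf.edu/tid/"


def ucsf_id(record):
    """Return the UCSF ID of a <record>, preferring <tid>."""
    tid = record.findtext("LTDLWOCR/tid")
    if tid and tid.strip():
        return tid.strip()
    # Fallback only; <docid> is missing from some records.
    docid = record.findtext(".//docid")
    if docid and docid.strip():
        return docid.strip()
    return None


def permalink(record):
    """Return the LTDL permalink for a <record>, or None without an ID."""
    doc_id = ucsf_id(record)
    return PERMALINK_PREFIX + doc_id if doc_id else None
```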

****************************************************************************
****************************************************************************


IV. The <A> Element and its Subelements


IV.A.  The <A> Element

<A>
   Occurrences: 6909982
   Parent: <ucsf200507>
   Subcollections: ALL
   Cardinality: 1
   Attributes: ID. Value is unique ID for this document.
   Description: The <A> elements were sent by UCSF to IIT in
July 2005. They contain an earlier version of a portion of
the data in the <LTDLWOCR> elements.  The information in the
<A> elements should largely be ignored in using this
collection, since it is redundant with information in the
<LTDLWOCR> record.  One important exception is the <DS>
element (see below).
      In creating Version 1 of the IIT CDIP collection, we
attempted to find for each <LTDLWOCR> record the
corresponding <A> record.  The match was done by comparing
the contents of the <tid> element of an <LTDLWOCR> record
with the value of the ID attribute of an <A> record.  We
created <record> elements only for those cases where we had
a matching <LTDLWOCR> and <A> record.


IV.B. The Subelements of <A>

Note that some subelements of <A> have the attribute
't'. The meaning of this attribute varies from element to
element.  Within a particular element, the attribute is not
used for all subcollections.  I have not checked whether the
attribute is always used within those collections that use
it at all.

Many elements contain lists of names.  Formatting of those
names, and of lists of names, is not consistent between
collections.  It may or may not be consistent within
collections - don't get your hopes up.


<DS>
   Occurrences: 6909982
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A very few records have an <NA:DS>
element in LTDLWOCR that corresponds to the <DS> element in
<A>, but in general there is no equivalent.
   Description: Single letter code indicating from which
subcollection (corporate source) the document originated:
           a = American Tobacco Company
           b = Brown & Williamson
           c = Council for Tobacco Research
           l = Lorillard
           p = Philip Morris
           r = RJ Reynolds
           t = Tobacco Institute
For more about the corporate sources, see
      http://legacy.library.ucsf.edu/about_the_collections.html.
Subcollection appears to be the only information in the <A>
element that is not duplicated in some fashion in the
<LTDLWOCR> element.


<K>
   Occurrences: 6039228
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: <ti>
   Description: Title.

<L>
   Occurrences: 6345912
   Parent: <A>
   Children: none
   Attributes: t. The attribute 't' can take on the values
"p" (for person) or "o" (for organization).
   LTDLWOCR Equivalent: <L> and <L t="p"> correspond to <au>
in LTDLWOCR.  <L t="o"> corresponds to <ca> in LTDLWOCR.
   Description: Author or authors.

<PV>
   Occurrences: 6909982
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A very few records have an <NA:PV>
element in LTDLWOCR that corresponds to the <PV> element in
<A>, but in general there is no equivalent.
   Description: When present, this element contains two
strings separated by whitespace.  The first string is 2
characters and indicates the corporate source:
           at = American Tobacco Company
           bw = Brown & Williamson
           ct = Council for Tobacco Research
           ll = Lorillard
           pm = Philip Morris
           rj = RJ Reynolds
           ti = Tobacco Institute
The first letter of the two character code is always the
same as the single letter code in the <DS> element.  The
second string is the UCSF ID for the document and is always
the same as the value of the ID attribute in the open tag of
the <A> element.
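
The relationship between <PV>, <DS>, and the ID attribute can
be checked mechanically.  Here is a minimal Python sketch;
the code tables are copied from the lists above, while the
function names and the example values in any call are
assumptions for illustration.

```python
# Single-letter <DS> codes and two-letter <PV> codes, as listed above.
DS_CODES = {
    "a": "American Tobacco Company",
    "b": "Brown & Williamson",
    "c": "Council for Tobacco Research",
    "l": "Lorillard",
    "p": "Philip Morris",
    "r": "RJ Reynolds",
    "t": "Tobacco Institute",
}
PV_CODES = {
    "at": "American Tobacco Company",
    "bw": "Brown & Williamson",
    "ct": "Council for Tobacco Research",
    "ll": "Lorillard",
    "pm": "Philip Morris",
    "rj": "RJ Reynolds",
    "ti": "Tobacco Institute",
}


def parse_pv(pv_text):
    """Split <PV> contents into (source code, UCSF ID), or return None."""
    parts = pv_text.split()
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def pv_consistent(pv_text, ds_text, a_id):
    """Check <PV> against the <DS> code and the ID attribute of <A>."""
    parsed = parse_pv(pv_text)
    if parsed is None:
        return False
    code, doc_id = parsed
    return (
        code in PV_CODES
        and code[0] == ds_text.strip()
        and doc_id == a_id
    )
```

Running pv_consistent over a sample of records is a cheap
way to confirm that the "always the same" claims above hold
for the files you actually have.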


<YR>
   Occurrences: 6909982
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: <dd>
   Description: Document Date.  There are a substantial
number of errors and dummy values that appear in the
contents of this element.


<ag>
   Occurrences: 1274713
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Attachment Group.  I'm not sure what this is.

<at>
   Occurrences: 36584
   Parent: <A>
   Children: none
   Attributes: See 14-June-05 DTD.
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: People or organizations attending (a meeting?).

<b>
   Occurrences: 2128997
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Brands.


<br>
   Occurrences: 9347262
   Parent: <A>
   Children: none
   Attributes: t.  The attribute 't' can take on the values
"p" (for primary) or "o" (for secondary/other).
   LTDLWOCR Equivalent: Primary Bates range (no attribute, or
t="p") corresponds to <bt> element in LTDLWOCR.
Secondary/other Bates ranges correspond to LTDL comments.

   Description: One or more ranges of Bates number associated
with a document.  Bates numbers are unique alphanumeric
identifiers traditionally stamped onto physical documents
used in litigation, but today often assigned during scanning
of documents.  Each page of a document gets a unique
identifier.  Pages within the same document have a common
prefix, but numbers which are increasing (usually in numeric
order, but potentially in some alphanumeric order), from the
first page to the last.

      Some documents have more than one set of Bates numbers
associated with them.  I believe this results when: 1) a
document is submitted to the court and is stamped with a
Bates number, 2) a party to the lawsuit puts a copy of that
stamped document in their files, 3) a second lawsuit
requires documents on various topics to be produced and the
once-stamped document gets a new stamp when produced in the
second lawsuit.  The attribute t encodes whether the <br>
element contains the primary Bates range or a secondary/other
range.

      The formatting (including abbreviating) of the Bates number
ranges varies across corporate sources.  This, combined
with the frequent association of multiple Bates number
ranges with the same document, makes it difficult to parse
the Bates number fields present in the data.  This table
attempts to summarize the variations in format:

tobacco.health.usyd.edu.au/site/gateway/docs/pdf/Bates.pdf

(Note that this table refers to the metadata as present on
the tobacco company websites, not the reformatted metadata
from UCSF that is used in the IIT CDIP collection.)  I
suspect there are additional inconsistencies in the
formatting beyond those discussed in Bates.pdf, particularly
when the same document has multiple Bates number ranges
associated with it.

      Because of the problems with Bates number formatting,
and presence of multiple Bates number ranges on a single
document, not to mention the presence in the collection of
multiple physical copies of the same logical document, I
strongly recommend not using Bates numbers as identifiers
of either documents or pages within documents.

      Since documents are often assigned Bates numbers in
order as they are taken from a box, file cabinet,
etc., documents with nearby Bates numbers may have related
content.  This fact has been exploited by tobacco document
researchers to find useful documents with poor or misleading
metadata and OCR.


<c>
   Occurrences: 1744575
   Parent: <A>
   Children: none
   Attributes: t. The attribute 't' can take on the values
"p" (for person) or "o" (for organization).
   LTDLWOCR Equivalent: <pc> (attribute is ignored)
   Description: Copied. Names of people/organizations that
got copies of this document.

<ci>
   Occurrences: 1113978
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Case ID. Identifies legal case during which
the document was produced?

<cn>
   Occurrences: 296457
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Case name. Identifies legal case during which
the document was produced?

<co>
   Occurrences: 3913611
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Characteristics.  Appears to be physical
characteristics of document, including types of annotations
present. Examples include whether document has marginalia,
attachments, etc.

<d>
   Occurrences: 358364
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Description. High-level description of
either the purpose or nature of the document.  Short and
highly variable in nature.

<db>
   Occurrences: 1113977
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: none
   Description: Doc Begin.  First number of (primary?) Bates
range.  Should be redundant with part of <br>.

<de>
   Occurrences: 1113975
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: none
   Description: Doc End.  Last number of (primary?) Bates
range.  Should be redundant with part of <br>.

<dl>
   Occurrences: 6688146
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Date loaded.  Date on which document was
first posted to the producing company's website.

<dm>
   Occurrences: 6909982
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: <dl>
   Description: Date Modified.  Not sure what this means.

<dp>
   Occurrences: 522769
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Date Produced. Date the document was turned
over to the Minnesota repository.  Note that the description
of this element at
    http://legacy.library.ucsf.edu/search_fields.html
appears to be wrong.

<dt>
   Occurrences: 9359755
   Parent: <A>
   Children: none
   Attributes: t.  The attribute 't' can take on the values "p"
(for primary) or "o" (for secondary/other).
   LTDLWOCR Equivalent: <dt>.  Each <dt> element from <A> has
a corresponding <dt> element in <LTDLWOCR>, but <dt>
elements in <LTDLWOCR> do not have attribute values.
   Description: Document Type. Format or kind of document.
The categorizations here, which vary from company to
company, seem to be based on notions of both physical format
and genre.

<ed>
   Occurrences: 162568
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: none
   Description: Estimated Date.  This contains the letter E
when the original tobacco company metadata indicated that a
date was estimated.  (Of course, many dates are simply
grossly wrong without any indication of trouble.)

<eda>
   Occurrences: 59752
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: none
   Description: Ending Date.  For documents that were
associated with a date range rather than a single date, this
is the last date of that range. In practice, usually
contains a dummy value.

<f>
   Occurrences: 2801817
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: <fn>
   Description: File.  Contents are highly variable,
including company-internal ID numbers, names, and
descriptions.

<gn>
   Occurrences: 58544
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Grant Number. From the CTR website:
         'Description: All contract, grant, application for grant, and
special project numbers mentioned in the document.
          Format: Search by using a two-letter abbreviation ("CT" for
contract, "GR" for grant, "AP" for application, and "SP" for special
project) followed by a five-digit number.'


<lu>
   Occurrences: 3860799
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Litigation Usage.  Identifies a case or cases
in which the document was produced.  Cases are identified by
company-specific codes.

<m>
   Occurrences: 5139005
   Parent: <A>
   Children: none
   Attributes: t. The attribute 't' can take on the values "p"
(for person) or "o" (for organization).
   LTDLWOCR Equivalent: <np>.  Specifically, <m> or <m t="p">
corresponds to <np> element in <LTDLWOCR>.  <m t="o">
corresponds to <no> element in <LTDLWOCR>. Note that <m> and
<n> in <A> both correspond to <np> and <no> in <LTDLWOCR>.
   Description: Mentioned.  Names mentioned in the document.


<n>
   Occurrences: 454025
   Parent: <A>
   Children: none
   Attributes: t. The attribute 't' can take on the values
"p" (for person) or "o" (for organization).
   LTDLWOCR Equivalent: <np>.  Specifically, <n> or <n t="p">
corresponds to <np> element in <LTDLWOCR>.  <n t="o">
corresponds to <no> element in <LTDLWOCR>. Note that <m> and
<n> in <A> both correspond to <np> and <no> in <LTDLWOCR>.
   Description: Noted. Names that appear in marginalia.

<od>
   Occurrences: 97209
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Oklahoma Downgrades. My guess is that this is
related to a legal case in Oklahoma involving assertions of
privilege for certain documents. (See also <sc>.)  Contents
are an empty string or "Yes".


<p>
   Occurrences: 6764565
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: <pg>
   Description: Page Count.  Appears to be number of pages in
document. Unclear why some documents don't have this.



<pa1>
   Occurrences: 401768
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Physical Attachment 1.  Records information
about physical grouping of documents with rubber bands,
paper clips, etc. See http://www.pmdocs.com/navindex.asp

<pa2>
   Occurrences: 78833
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: none
   Description: Physical Attachment 2.  Records information
about physical grouping of documents with rubber bands,
paper clips, etc. See http://www.pmdocs.com/navindex.asp


<pb>
   Occurrences: 1820583
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: <bx>
   Description: Production Box.  Identifier of the box in
which the physical copy of the document resides at the
Minnesota repository.

<r>
   Occurrences: 3917362
   Parent: <A>
   Children: none
   Attributes: t. The attribute 't' can take on the values
"p" (for person) or "o" (for organization).
   LTDLWOCR Equivalent: <r> and <r t="p"> correspond to <rc>
in LTDLWOCR. <r t="o"> corresponds to <cr> in LTDLWOCR.
   Description: Recipients.  Names of individuals and/or
organizations who received the document (presumably as
indicated by addresses on the document).
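
The way the 't' attribute on the name-bearing <A> elements
(<L>, <c>, <m>, <n>, <r>) selects an <LTDLWOCR> element can
be summarized in a small lookup.  This Python sketch encodes
only the correspondences stated in this document; the
function and table names are hypothetical.

```python
# (A element name, kind) -> LTDLWOCR element name.
# kind "p" also covers elements with no 't' attribute.
A_TO_LTDLWOCR = {
    ("L", "p"): "au",
    ("L", "o"): "ca",
    ("m", "p"): "np",
    ("m", "o"): "no",
    ("n", "p"): "np",
    ("n", "o"): "no",
    ("r", "p"): "rc",
    ("r", "o"): "cr",
    # For <c> the attribute is ignored; everything maps to <pc>.
    ("c", "p"): "pc",
    ("c", "o"): "pc",
}


def ltdlwocr_equivalent(tag, t=None):
    """Return the LTDLWOCR element for an <A> name element, or None."""
    kind = "o" if t == "o" else "p"
    return A_TO_LTDLWOCR.get((tag, kind))
```

For example, ltdlwocr_equivalent("m", "o") gives "no", while
ltdlwocr_equivalent("L") gives "au".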

<re>
   Occurrences: 7177
   Parent: <A>
   Children: none
   Attributes:
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Redacted.  Documents from which portions
protected under legal privilege were redacted.

<rn>
   Occurrences: 4973097
   Parent: <A>
   Children: none
   Attributes: Variable (see 14-June-05 DTD).
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Request Number.  A code indicating at least
one of the cases / requests for documents for which this
document was produced.


<s>
   Occurrences: 5325646
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Source.  Where within a corporate entity the
document was found.  Sometimes described in geographic
terms, and sometimes in organizational terms.


<sc>
   Occurrences: 34479
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Special Collections.  Identifies certain
subsets of documents in which plaintiff or defendant took a
special interest.  See

   http://www.ctr-usa.org/ctr/index.wmt?tab=home&selection=home/fields

and

   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=docall#.



<sh>
   Occurrences: 1324007
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Date Shipped.  I haven't found any
description of this element, but my guess is that it is the
date the physical document was shipped to the Minnesota
repository.

<si>
   Occurrences: 3596670
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Site.  An alternate coding of the
information in <s> (Source).


<st>
   Occurrences: 2334
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Unknown.


<te>
   Occurrences: 10732
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Trial Exhibit.  Identifies documents that were
introduced by plaintiffs in certain cases.


<tp>
   Occurrences: 799061
   Parent: <A>
   Children: none
   Attributes: none
   LTDLWOCR Equivalent: A comment includes element name and
contents.
   Description: Not described in 14-June-05 DTD but probably
stands for Topics.  Seems to contain controlled vocabulary
categories from the list at

   http://www.rjrtdocs.com/rjrtdocs/index.wmt?tab=home&selection=home/rjrt_topics.



****************************************************************************
****************************************************************************

IV. The <LTDLWOCR> Element and Its Subelements

<LTDLWOCR>
   Occurrences: 6910192
   Parent: <record>
   Children: See below.
   Attributes: none
   <A> Equivalent: ---
   Description: See above.


<NA:DS>
   Occurrences: 21
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <DS>
   Description: See <DS>.  The <NA:DS> elements seem to be a
glitch since in general the DS information is not recorded
in LTDLWOCR elements.

<NA:PV>
   Occurrences: 21
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <PV>
   Description: See <PV>.  The <NA:PV> elements seem to be a
glitch since in general the PV information is not recorded
in LTDLWOCR elements.

<au>
   Occurrences: 4575986
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <L> or <L t="p">.
   Description: See <L>.

<bt>
   Occurrences: 6910192
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <br> or <br t="p">
   Description: See <br>.

<bx>
   Occurrences: 1820586
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <pb>
   Description: See <pb>.

<ca>
   Occurrences: 1768030
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <L t="o">
   Description: See <L>.

<cr>
   Occurrences: 934496
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <r t="o">
   Description: See <r>.

<dd>
   Occurrences: 6910192
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <YR>
   Description: See <YR>

<dl>
   Occurrences: 6910192
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <dm>
   Description: See <dm>.

<dt>
   Occurrences: 9360071
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <dt>
   Description: See <dt> under <A>.

<fn>
   Occurrences: 2647214
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <f>
   Description: See <f>.

<no>
   Occurrences: 2132534
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <m t="o"> or <n t="o">
   Description: See <m> and <n>.

<np>
   Occurrences: 3460698
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <m>; <n>; <m t="p">; or <n t="p">
   Description: See <m> and <n>.

<ot>
   Occurrences: 6794895
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: none
   Description: Output of optical character recognition (OCR)
applied to the scanned document image.  The OCR output is
wrapped as a CDATA since it may contain XML reserved
characters.  The string "pgNbr=" appearing at the beginning
of a line appears to signal the end of each page.
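
If the "pgNbr=" marker does behave as described, the OCR
text can be split into pages with a few lines of Python.
This is a sketch under that assumption; the function name is
hypothetical, and whatever follows "pgNbr=" on the marker
line is simply discarded.

```python
import re

# A line starting with "pgNbr=" is assumed to end a page.
PAGE_MARKER_RE = re.compile(r"^pgNbr=.*$", re.MULTILINE)


def split_pages(ocr_text):
    """Split <ot> contents into a list of non-empty page strings."""
    pages = PAGE_MARKER_RE.split(ocr_text)
    return [page.strip() for page in pages if page.strip()]
```

Note that pages with no recognized text are dropped, so the
length of the result need not match the page count in <pg>.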

<pc>
   Occurrences: 1744613
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <c>, <c t="p">, or <c t="o">.
   Description: See <c>.

<pg>
   Occurrences: 6764775
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <p>
   Description: See <p>.

<rc>
   Occurrences: 2980815
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <r> or <r t="p">.
   Description: See <r>.

<ti>
   Occurrences: 6037066
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: <K>
   Description: See <K>

<tid>
   Occurrences: 6910192
   Parent: <LTDLWOCR>
   Children: none
   Attributes: none
   <A> Equivalent: Value of ID attribute on <A>.
   Description: See <docid>, <A>, and <PV>.




****************************************************************************
****************************************************************************

V. Additional Information

The UCSF metadata (both the <A> and <LTDLWOCR> versions)
were produced by reformatting metadata whose original fields
varied considerably between corporate sources.  There were
undoubtedly many compromises and inconsistencies in that
process.

The file dtds.html (which I will email to the TREC 2006
list) was originally sent by UCSF to IIT on 14-Jun-2005.
While it is incomplete and has some obvious cut-and-paste
typos, it provides interesting information on the
subelements of the <A> element.

Further description of the elements, though not alas in
terms of the element names, is available at

    http://legacy.library.ucsf.edu/search_fields.html

Each of the tobacco company websites provides some
information about the fields in their metadata:

1. American Tobacco Company (click on individual field names)

   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=atco#

2. Brown & Williamson (click on individual field names)

   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=bw#

3. Council for Tobacco Research:

   http://www.ctr-usa.org/ctr/index.wmt?tab=home&selection=home/fields

4. Lorillard:

   http://www.lorillarddocs.com/navindex.asp

5. Philip Morris

   http://www.pmdocs.com/navindex.asp

6. R. J. Reynolds (click on individual field names)

   http://www.rjrtdocs.com/rjrtdocs/search.wmt?tab=search&PRODUCT=docall#

   http://www.rjrtdocs.com/rjrtdocs/index.wmt?tab=home&selection=home/rjrt_about_site

7. Tobacco Institute:

   http://www.tobaccoinstitute.com/index1.htm

Other resources:

   http://www.cdc.gov/tobacco/industrydocs/4babout.htm

****************************************************************************
****************************************************************************


