Identification and download helpers for EDRM Enron v2 dataset

Name	Last Modified (UTC)	Size	Description
alldoc-sdoc.csv.bz2	2010-06-23 19:49:53	30M	Mapping from EDRM docid to SDOC number (stored in email headers of PST format collection) for all emails
msg-sdoc.csv.bz2	2010-06-23 12:26:00	11M	Mapping from EDRM docid to SDOC number (stored in email headers of PST format collection) for canonical versions of deduplicated emails
docids-v2.csv.bz2	2010-07-05 15:03:38	20M	Official list of docids that are candidates for production
msg-uniqmsg.csv.bz2	2010-06-23 12:22:43	38M	Deduplication mapping from message docid to canonical version of message
uniqmsg.csv.bz2	2010-06-23 12:15:27	10M	List of canonical IDs of unique messages
DownloadPST.bat	2010-07-12 14:17:23	11K	Script to download PST version of EDRM Enron collection (obsolete)
DownloadXML.bat	2010-06-22 15:55:03	11K	Script to download XML version of EDRM Enron collection (obsolete)
edrmv2txt-v2.tar.bz2	2010-07-06 12:03:13	596M	A de-duplicated version of the text rendering of the EDRM Enron collection, containing only the canonical versions of emails and their attachments
edrmv2nativeattach.tar.bz2	2013-03-27 14:35:01	8G	The deduplicated attachments in native format
seed.csv	2010-06-23 10:26:44	898K	Seed sets for the TREC 2010 Legal Track Learning Task

These tools and data sets may help you to download the EDRM Enron Dataset v2 used by the TREC 2010 Legal Track.

The EDRM dataset (XML version) consists of 159 files that must be selected and downloaded. We provide a script to download the XML version. (NOTE: as of late 2012, distribution arrangements for the EDRM Enron collection have changed, and this download script no longer works. See the EDRM Enron dataset page for current details on how to download the collection.)
The EDRM dataset (PST version) consists of 153 files that must be selected and downloaded. We provide a script to download the PST version. (NOTE: as of late 2012, distribution arrangements for the EDRM Enron collection have changed, and this download script no longer works. See the EDRM Enron dataset page for current details on how to download the collection.)
The script uses wget, which is found on most Linux and Unix systems. For Windows, wget.exe is free and requires no installation; just save it in C:\Windows.
A de-duplicated version of the text-only portion of the XML version is available here (596MB total).
A de-duplicated version of the native attachments of the XML version is available here (8GB total).
The duplicate list is available as a list of canonical messages and a mapping from duplicate to canonical messages.
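As an illustration of how the duplicate mapping might be applied, the following Python sketch loads it into a dictionary and collapses a list of docids to their canonical versions. It assumes (this is not stated above) that each row of the decompressed mapping has two comma-separated columns, the message docid first and the canonical docid second. The helper names load_canonical_map and canonicalize are hypothetical.

```python
import csv
from typing import Dict, Iterable, List


def load_canonical_map(lines: Iterable[str]) -> Dict[str, str]:
    """Build {message docid: canonical docid} from CSV text lines.

    Assumption: two comma-separated columns, docid then canonical docid.
    """
    mapping = {}
    for rec in csv.reader(lines):
        if len(rec) >= 2:  # skip blank or short rows
            mapping[rec[0]] = rec[1]
    return mapping


def canonicalize(docids: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each docid by its canonical version and drop repeats.

    Docids missing from the mapping are kept unchanged; first-seen
    order is preserved.
    """
    seen = set()
    result = []
    for docid in docids:
        canonical = mapping.get(docid, docid)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

The lines can be supplied by any iterable of text, for example a file object returned by bz2.open("msg-uniqmsg.csv.bz2", "rt").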
The seed sets for the TREC 2010 Legal Track Learning Task are provided as a file with four columns:
- The XML document id of the containing email (augmented by the md5 sum, if the document is an attachment)
- The topic number (from 200 to 207)
- The assessment (1 = responsive; 0 = not responsive; -1 = not assessed; -2 = not assessed)
- The XML document id for the document
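The four-column layout above can be read with a few lines of Python. This is a minimal sketch, assuming the file is comma-separated; the names SeedRow, read_seeds and responsive_counts are made up for the example.

```python
import csv
from collections import Counter
from typing import Iterable, List, NamedTuple


class SeedRow(NamedTuple):
    email_docid: str  # docid of the containing email (+ md5 sum for attachments)
    topic: int        # topic number, 200 to 207
    assessment: int   # 1, 0, -1 or -2
    docid: str        # docid of the document itself


def read_seeds(lines: Iterable[str]) -> List[SeedRow]:
    """Parse seed rows from CSV text lines (comma delimiter is an assumption)."""
    rows = []
    for rec in csv.reader(lines):
        if len(rec) != 4:
            continue  # skip blank or malformed rows
        try:
            rows.append(SeedRow(rec[0], int(rec[1]), int(rec[2]), rec[3]))
        except ValueError:
            continue  # non-numeric topic/assessment, e.g. a header row
    return rows


def responsive_counts(rows: Iterable[SeedRow]) -> Counter:
    """Count documents assessed as responsive (assessment == 1) per topic."""
    return Counter(r.topic for r in rows if r.assessment == 1)
```

A typical call would pass an open file, for example read_seeds(open("seed.csv", newline="")).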
The Official List of document ids that are candidates for production:
- The first column is the XML document id of the containing email (augmented by the md5 sum, if the document is an attachment)
- The second column is the XML document id of the document itself. Submit this document id.
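Because the official list is distributed bz2-compressed, it can be read without unpacking it first. The sketch below collects the second-column ids (the ones to submit) into a set. It assumes comma-separated rows, and load_candidate_docids is a hypothetical name.

```python
import bz2
import csv
from typing import Set


def load_candidate_docids(path: str) -> Set[str]:
    """Return the second-column docids from a bz2-compressed CSV file.

    Assumption: comma-separated rows with at least two columns.
    """
    docids = set()
    # bz2.open in text mode decompresses on the fly.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as fh:
        for rec in csv.reader(fh):
            if len(rec) >= 2:
                docids.add(rec[1])
    return docids
```

A set makes it cheap to check, before submitting, that every id in a run belongs to the official list.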

For PST File Format Users ...

A mapping from document id to SDOC# for every canonical message. (SDOC# is found in the header of every message, but document id is not.)
A mapping from document id to SDOC# for every message, including duplicates.
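Since PST messages carry the SDOC# but not the document id, PST users will usually want the lookup in the opposite direction. The following sketch inverts the mapping. It assumes comma-separated rows with the document id first and the SDOC# second, and sdoc_to_docids is a hypothetical name.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def sdoc_to_docids(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Invert docid -> SDOC# rows into {SDOC#: [docid, ...]}.

    Assumption: comma-separated columns, docid first and SDOC# second.
    A list is kept per SDOC# because this sketch does not assume the
    mapping is one-to-one.
    """
    index = defaultdict(list)
    for rec in csv.reader(lines):
        if len(rec) >= 2:
            docid, sdoc = rec[0].strip(), rec[1].strip()
            index[sdoc].append(docid)
    return dict(index)
```

Looking up the SDOC# read from a message header with index.get(sdoc) then returns the matching document ids, or None if the SDOC# is not in the mapping.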