
es-enron

Public · 11 stars · 5 forks · 1 issue

Commits

List of commits on branch master.

- 176c876e4dae8da46f3078baec53c9fe91617ebd: Move dataset.tgz off GitHub (ycombinator, 8 years ago, unverified)
- 59cabfe72d48361880bb3db42ae089167cb043dc: Using _source_include instead of the obsolete fields (ycombinator, 9 years ago, unverified)
- 8fa499497494bf453e3bc881f720e2810f020aac: Merge branch 'master' of github.com:ycombinator/es-enron (ycombinator, 9 years ago, unverified)
- 41863ff9e15cd4ae7274d48f8e1e0d19c63b483b: Adding Console examples (ycombinator, 9 years ago, unverified)
- 490595d2f63eab46a9d415ce3c9c758ea3fbe8c9: Using keyword instead of not_analyzed (ycombinator, 9 years ago, unverified)
- fc51785875bc2931357fb46429673a5ae67eb5c4: Adding pre-requisite about git-lfs (ycombinator, 9 years ago, unverified)

README

Pre-requisite

Download dataset.tgz from here into the same folder where you cloned this repository.
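For example, from a shell in the cloned repository folder (a minimal sketch; <DATASET_URL> is a placeholder for the download link above, not a real address):

    # Fetch the dataset archive into the current (repository) folder.
    # <DATASET_URL> stands in for the download link referenced above.
    curl -L -o dataset.tgz "<DATASET_URL>"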

Preparation

The dataset.tgz file contains an archive of all Enron emails, de-duplicated and parsed into JSON files. Each JSON file in the archive represents one email message.
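To peek at the first message or two without unpacking the whole archive, you can stream the archive contents to stdout (a sketch; tar's -O flag writes extracted file contents to standard output):

    # Dump archive contents to stdout and keep only the first ~2 KB,
    # enough to see the fields of one parsed JSON message.
    tar -xOf dataset.tgz | head -c 2000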

The compressed dataset is 252 MB. Uncompressed into individual JSON files, it grows to 1.3 GB.

  1. Install Node.js, MySQL, and Elasticsearch. Make sure MySQL and Elasticsearch are running.

  2. Uncompress the archive.

     tar xvf dataset.tgz

  3. Load the emails into Elasticsearch.

     npm install   # if you haven't run this already
     ./load_into_es.sh

  4. Load the emails into MySQL (a quick verification sketch follows these steps).

     ./load_into_mysql.sh
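As a sanity check after both loads, something like the following should show non-zero document and row counts (a sketch: the _cat API is standard Elasticsearch, but the MySQL database and table names are assumptions; check load_into_mysql.sh for the real ones):

    # List Elasticsearch indices with their document counts.
    curl -s "localhost:9200/_cat/indices?v"

    # Count rows in MySQL; "enron.emails" is a hypothetical name,
    # not taken from the load script.
    mysql -e "SELECT COUNT(*) FROM enron.emails;"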

Appendix

The original Enron email dataset was taken from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz. It is an archive of all Enron emails in EML format, where each file represents one email message; some messages are duplicated across multiple files.

The parse_email_files.js script parses the original Enron email dataset into JSON files after de-duplicating the messages.
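A plausible way to regenerate those JSON files from scratch (the script invocation below is an assumption; consult parse_email_files.js for its actual expected inputs and outputs):

    # Download and unpack the original EML archive (URL as given above).
    curl -LO https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
    tar xvf enron_mail_20150507.tgz

    # Hypothetical invocation; the script may expect specific
    # input/output paths.
    npm install
    node parse_email_files.js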

The included dataset.tgz file is an archive of exactly these JSON files.