
color_analysis


Commits

List of commits on branch master.
c7951b166a05c6742b1277e66bdfc00e288b1ff9
sentence complexity spreadsheet
cchrchung committed 8 years ago

522268d522ed59eb595f71803a2b50a2a6188442
distribution entire corpus lists
cchrchung committed 8 years ago

cb530966fa648e965ee20e09ec7a33a94dcd5965
bug fix
cchrchung committed 8 years ago

b84c459f38b75d3c9a0eb6163128bc5b105d9709
ks test
cchrchung committed 8 years ago

3a3eb8718683f56a35131ae697ad1c29d09286e2
color count per decade
cchrchung committed 8 years ago

468f386f682cc36333345e812d567e647fd98753
bug fix
cchrchung committed 8 years ago

README


Color Analysis

This project is an analysis of color in 19th-century literature, done in concert with the Literary Lab at Stanford University. Please contact Irena Yamboliev (firstname[at]stanford.edu) or Kawin Ethayarajh ([firstname][at]cs.toronto.edu) if you'd like to contribute.

File List:

color_analysis.py
Run the main method to build the databases. This takes a long time and is currently spread over 22 threads. Then run merge_databases() to merge the newly created databases into a single one called color_analysis_merged.db.
storage.py
Contains methods for insertion into the smaller databases (one for each thread).
sentence.py
Extracts the salient features from each sentence.
book.py
Processes an entire book. Contains preprocessing methods and the parse_book method, which parses an entire book.
metadata.p
A pickled dictionary with the metadata (i.e. title, author, and year published) for each book. Indexed by file names ending in _tokenized.txt.
extended_colors.csv
List of all valid colors and their relevant properties. Colors may be added to the program's internal list ad hoc, but such bootstrapped colors are not written back to this CSV file, which was prepared manually.
schema.txt
Schema for the database. For more information, see storage.py.
test_sentences_tokenized.txt
Test sentences, tokenized, one per line. Most have colors.
test_sentences_tagged.txt
The test_sentences_tokenized.txt file after it has gone through the Senna tagger.
color_analysis_sample.db
A sample database created using the sentences in test_sentences_tokenized.txt and test_sentences_tagged.txt.
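
As a rough illustration of how metadata.p might be read, here is a minimal sketch. It relies only on what is stated above: the file is a pickled dictionary keyed by file names ending in _tokenized.txt. The helper names load_metadata and lookup are hypothetical and not part of this repository, and the shape of each value (title, author, year) is not specified here, so the sketch returns whatever object is stored. If the pickle was written under Python 2, passing encoding="latin1" to pickle.load may be necessary. Only unpickle files you trust.

```python
import pickle
from pathlib import Path


def load_metadata(path="metadata.p"):
    """Load the pickled metadata dictionary from disk (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def lookup(metadata, tokenized_file):
    """Return the metadata entry for a tokenized book file, or None if absent."""
    # Keys are file names ending in _tokenized.txt, so strip any directory part.
    return metadata.get(Path(tokenized_file).name)
```

For example, lookup(load_metadata(), "some_dir/some_book_tokenized.txt") would return the entry stored under some_book_tokenized.txt, if there is one.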

To be added:

color_analysis_merged.db
The database with all the relevant data (called 'merged' because the processing is multi-threaded, and so many databases are merged to form this one). Schema can be found in schema.txt.
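
The merge step described above could look roughly like the following sketch. It is not the merge_databases() function from color_analysis.py. It assumes the per-thread databases are SQLite files that share the same tables, and that rows from different files do not collide on primary keys. The function name merge_sqlite_databases and its parameters are hypothetical.

```python
import sqlite3


def merge_sqlite_databases(part_paths, merged_path):
    """Copy every table from each per-thread database into one merged file."""
    merged = sqlite3.connect(merged_path)
    try:
        for part_path in part_paths:
            # Attach the per-thread database under the alias "part".
            merged.execute("ATTACH DATABASE ? AS part", (part_path,))
            tables = merged.execute(
                "SELECT name, sql FROM part.sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            ).fetchall()
            for name, create_sql in tables:
                exists = merged.execute(
                    "SELECT 1 FROM main.sqlite_master "
                    "WHERE type = 'table' AND name = ?",
                    (name,),
                ).fetchone()
                if exists is None:
                    # Recreate the table in the merged database on first sight.
                    merged.execute(create_sql)
                # Append all rows from the attached table.
                merged.execute(
                    f'INSERT INTO main."{name}" SELECT * FROM part."{name}"'
                )
            # Commit before detaching; DETACH fails inside an open transaction.
            merged.commit()
            merged.execute("DETACH DATABASE part")
    finally:
        merged.close()
```

A call such as merge_sqlite_databases(list_of_part_files, "color_analysis_merged.db") would then produce the single merged file, under the assumptions stated above.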