
Code_Name_Jeeves

public
8 stars
0 forks
0 issues

Commits

List of commits on branch master.
62b85a801a243120e5f5a0049ca0c098fbb801d3

Provided clarifying comments and changed email codes.

nnyghtowl committed 11 years ago
41c0ac811ba410821ae7e772e2701a95aec937b5

First pass updating readme to descibe proejct.

nnyghtowl committed 11 years ago
372b211c24d3ad4ba6cadb51bf018a3ad4cf02e3

Created script a week ago to run grid-search for whole pipeline but couldn't get it to complete. Shelved in lieu of other priorities for the project.

nnyghtowl committed 11 years ago
0fd615e87eabe75e8f352a006bed4be6310e8ac6

Requirements file for project.

nnyghtowl committed 11 years ago
5ed9da7680042392b56561d9753c7b95969dcd59

Draft presentation, final presentation and screen capture.

nnyghtowl committed 11 years ago
8a7d8f84a67ec6b5fcef090a1f4446d4c9327df4

Updated graphs for presentations.

nnyghtowl committed 11 years ago

README


Code Name Jeeves

Zipfian Final Project Spring 2014

What is it?

Jeeves is an email natural language classifier that finds messages that need a meeting location defined.

I built this as my Zipfian final project because I want my computer to do more things for me. Why not, when it has all this data on me? I made a long wish list of items and then focused on getting my computer to read emails and classify the ones that need a meeting location defined. It then sends me a text whenever an email is classified as true.

Had there been enough time, I wanted the program to take the next step and recommend a couple of locations. Just getting the classifier working was plenty for the two weeks we had for the project.

As with spam filters, the two kinds of error do not cost the same. Here a false positive (getting a text for an email incorrectly classified as true) is more acceptable than a false negative (missing an email that needs a location).
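One common way to act on that preference is to lower the probability cutoff at which an email is flagged, trading some precision for recall. The sketch below is only an illustration: the probabilities and labels are made up, and the function names are hypothetical rather than taken from the project.

```python
def predict_needs_location(probabilities, threshold=0.5):
    """Flag an email when its predicted probability meets the threshold."""
    return [p >= threshold for p in probabilities]


def recall_and_precision(y_true, y_pred):
    """Return (recall, precision) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up example data: true labels and classifier probabilities.
y_true = [True, True, False, False, True]
probs = [0.9, 0.4, 0.35, 0.1, 0.6]

# The default cutoff misses the email scored 0.4 (a false negative).
default = recall_and_precision(y_true, predict_needs_location(probs, 0.5))

# A lower cutoff catches it, at the cost of one extra text (a false positive).
lenient = recall_and_precision(y_true, predict_needs_location(probs, 0.3))
```

With this toy data the lower cutoff recovers every email that needs a location while sending one unnecessary text, which is the trade-off described above.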

Main Structure:

  • Handle_email.py, common.py & twilio_views.py in the app folder are the main files that run the application.
    • Handle_email:

      • Main file that runs the full project pipeline from getting the email to sending a text
      • Call the check_email function to have the program check for new Gmail messages
      • A pickle file stores the last time the email was checked and is updated after new email is found
      • If there are new emails then the data is parsed out and cleaned
      • The modified tf-idf vectorizer is unpickled and applied to the email message to generate features
      • The modified logistic regression model is unpickled and the email features are run through the logistic regression for a prediction
      • If the prediction is true then a message is created
      • The message is sent through the jeeves_notifications function under the Twilio views
      • Also the message is passed to the unix say command so my computer says that a new email needs a meeting location defined
    • Common:

      • Holds common functions that are used for the project pipeline and for building out and testing the vectorizer and classifier
      • Pickle functions
      • API call to get new gmails
      • Function to clean data (strip whitespace and characters that are not UTF-8, etc.)
      • Store data in Postgres
      • Put data into Pandas dataframe
      • Generate dataset splits for X, y and cross validation
    • Twilio_views:

      • Passes the message through the Twilio API to turn it into a text message on my phone
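The flow described in the list above can be sketched roughly as follows. This is not the project's actual code: every function name here is hypothetical, the email is assumed to arrive as a dict with 'subject' and 'body' strings, and the vectorizer and model are assumed to expose scikit-learn style transform and predict methods. The Twilio wrapper expects an already constructed twilio.rest.Client and uses its messages.create call.

```python
import pickle
from datetime import datetime
from pathlib import Path


def load_pickle(path, default=None):
    """Return the unpickled object at `path`, or `default` if the file is missing."""
    path = Path(path)
    if not path.exists():
        return default
    with path.open("rb") as f:
        return pickle.load(f)


def save_pickle(path, obj):
    """Pickle `obj` to `path`, overwriting any previous contents."""
    with Path(path).open("wb") as f:
        pickle.dump(obj, f)


def clean_text(raw):
    """Drop bytes that are not valid UTF-8 and collapse runs of whitespace."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="ignore")
    return " ".join(raw.split())


def make_twilio_notifier(client, from_number, to_number):
    """Wrap a Twilio REST client as a notify(message) callable."""
    def notify(message):
        client.messages.create(body=message, from_=from_number, to=to_number)
    return notify


def handle_new_emails(emails, vectorizer, model, notify, state_path, now=None):
    """Classify each new email and notify for those predicted to need a location.

    `emails` is a list of dicts; `notify` is any callable taking a string.
    The last-checked timestamp is re-pickled only when new email was found.
    """
    flagged = []
    for email in emails:
        # Assumption: subject and body are joined with a space before vectorizing.
        text = clean_text(email["subject"]) + " " + clean_text(email["body"])
        features = vectorizer.transform([text])
        if model.predict(features)[0]:
            message = "Needs a meeting location: " + clean_text(email["subject"])
            notify(message)
            flagged.append(message)
    if emails:
        save_pickle(state_path, now or datetime.now())
    return flagged
```

In use, the vectorizer and model would come from load_pickle on whatever files they were saved to, and notify would come from make_twilio_notifier. Passing notify in as a plain callable keeps the classification step testable without sending real texts.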

More Information:

Check out nyghtowl.io for my blog posts on my progress developing Jeeves.

Future Plans:

  • Try Porter for stemming

  • Continue to code for producing the model:

    • Grid Search again to improve parameters
    • plot error rate and learning curves (regularization sprint)
    • explore other ways to apply k-fold
  • Other feature ideas / customization

    • Make date in body of text more informative
    • Python NLP - Regex library
    • length of thread (3+)
    • email from an email address / new or not

    • email address in contacts or on Linkedin...
  • Closing loop with a cronjob to automate email checking

  • Use EC2 or PiCloud (install the package, pass it a function and arguments, and it runs them on EC2) to run models, especially grid search

  • Continue to look at dimensionality reduction approaches

  • Turn pipeline sections into classes/objects?

  • Take model to another level

    • use partial fit (incremental learning) so the model can take in feedback on whether new results are correct (Adam gave this idea)
    • make the vectorizer/classifier result one feature, then add other engineered features and feed them all through another classifier
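The feedback idea in the list above amounts to incremental learning: updating the model one confirmed result at a time instead of retraining from scratch. As a rough sketch only, here is a hand-rolled logistic regression with a single-example update step. The class and its names are hypothetical and not part of the project. In scikit-learn, similar behavior is available through partial_fit on estimators such as SGDClassifier.

```python
import math


class OnlineLogisticRegression:
    """Minimal logistic regression updated one example at a time (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Return the predicted probability that x belongs to the positive class."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """Take one stochastic gradient step on the log loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Toy feature vector; the feedback says this kind of email does need a location.
model = OnlineLogisticRegression(n_features=2)
before = model.predict_proba([1.0, 0.0])
for _ in range(20):
    model.partial_fit([1.0, 0.0], 1)
after = model.predict_proba([1.0, 0.0])
```

Each confirmed or corrected prediction becomes one more partial_fit call, so the model drifts toward the user's feedback without needing the full training set again.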

Data Storage Structure:

Raw data stored in Postgres DB

~1000+ emails total; ~120 emails meet the target condition (the count may be smaller once threaded emails are accounted for)

Raw Data:

  • Message ID (message_id) = string / primary key
  • Thread ID (thread_id) = string
  • To (to_email) = string
  • From (from_email) = string
  • CC (cc) = string
  • Date Sent (date) = timestamp
  • Subject (subject) = string
  • Starred Email (starred) = boolean (true means it's starred)
  • Message Body (body) = text
  • Message Subject & Body (sub_body) = text
  • Email Owner (email_owner) = string (email source)
  • Box (email box) = string
  • Needs Location (target) = boolean (true means the email needs a meeting location; based on the labels 'Jeeves' & 'Starred')
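For working with these rows in Python, the fields above could be mirrored in a small record type. This is a sketch under assumptions: the class name is hypothetical, the `box` field name is a guess since the list does not give a clear column identifier for it, and `sub_body` is derived here by joining subject and body with a space, whereas the database stores it as its own column.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EmailRecord:
    """One raw email row; field names follow the column names listed above."""
    message_id: str   # primary key
    thread_id: str
    to_email: str
    from_email: str
    cc: str
    date: datetime    # date sent
    subject: str
    starred: bool     # True means the email is starred
    body: str
    email_owner: str  # email source
    box: str          # assumed name for the "Box" column
    target: bool      # True means the email needs a meeting location

    @property
    def sub_body(self):
        """Subject and body combined (assumed to be joined with a single space)."""
        return f"{self.subject} {self.body}"


# Made-up example row.
record = EmailRecord(
    message_id="m1",
    thread_id="t1",
    to_email="me@example.com",
    from_email="friend@example.com",
    cc="",
    date=datetime(2014, 4, 1, 12, 0),
    subject="Lunch",
    starred=True,
    body="Where should we meet?",
    email_owner="me@example.com",
    box="inbox",
    target=True,
)
```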

Support:

  • Any bugs about Markdown Preview please feel free to report here
  • And you are welcome to fork and submit pull requests

Copyright:

Copyright (c) 2014 Nyghtowl