
predictive-text


Commits

List of commits on branch master.
695b2b9b568f7900eea674dc1f62ddb26cd960a6

Add readme

mmattdean1 committed 5 years ago
25bc2588ce35ca5eaf37f883ba923abc5a407d34

Make demo interactive

mmattdean1 committed 5 years ago
457f81dbfec2ba5552ce18a2a941e1bb83ef944e

Read csv and predict using trie

mmattdean1 committed 5 years ago
9542ce98e77ddd5f5e7b36fd83555aa3c526339e

Implement prediction

mmattdean1 committed 5 years ago
883ad880c9483fa4b229085e1efb5da678efd007

Better config

mmattdean1 committed 5 years ago
df06895c52aaedb2d0540c7f46eac771702a0d8e

Add Trie type with insert

mmattdean1 committed 5 years ago

README


Here I implemented a prefix-trie to suggest predictive text options. The trie is populated using a chunk of Enron email data (100k emails).

You can run the example like this (assuming you have a recent version of node/yarn installed):

yarn
yarn tsc
node index.js

Next steps would be:

  • Use promises/await everywhere instead of callbacks
  • Serialize the populated tree and save it to disk (so it loads faster)
  • Wrap that into a library, with an interface to import data and "query" it
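
A rough sketch of how the first two steps might look, assuming the trie is stored as plain nested objects so it can go straight through JSON. The names here (`TrieNode`, `insert`, `countOf`, `serialize`, `deserialize`, `saveTrie`, `loadTrie`) are hypothetical and not taken from the repo; only the `fs` promises API calls are real Node functions.

```typescript
import { promises as fs } from "fs";

// Hypothetical node shape (not from the repo): plain objects, so the whole
// tree is directly JSON-serializable.
interface TrieNode {
  count: number; // how many times a word ending at this node was inserted
  children: { [ch: string]: TrieNode };
}

function insert(root: TrieNode, word: string): void {
  let node = root;
  for (const ch of word.split("")) {
    let next = node.children[ch];
    if (!next) {
      next = { count: 0, children: {} };
      node.children[ch] = next;
    }
    node = next;
  }
  node.count += 1;
}

// Look up how often an exact word was inserted (0 if it is absent).
function countOf(root: TrieNode, word: string): number {
  let node: TrieNode | undefined = root;
  for (const ch of word.split("")) {
    node = node.children[ch];
    if (!node) return 0;
  }
  return node.count;
}

// Pure helpers, kept separate from disk I/O so they are easy to test.
function serialize(root: TrieNode): string {
  return JSON.stringify(root);
}

function deserialize(text: string): TrieNode {
  return JSON.parse(text) as TrieNode;
}

// Promise-based wrappers: no callbacks, callers can simply `await` them.
async function saveTrie(root: TrieNode, file: string): Promise<void> {
  await fs.writeFile(file, serialize(root), "utf8");
}

async function loadTrie(file: string): Promise<TrieNode> {
  return deserialize(await fs.readFile(file, "utf8"));
}

// In-memory round trip; saveTrie/loadTrie do the same thing via a file.
const root: TrieNode = { count: 0, children: {} };
["hello", "help", "hello"].forEach((w) => insert(root, w));
const copy = deserialize(serialize(root));
console.log(countOf(copy, "hello"), countOf(copy, "help"));
```

JSON is the simplest option but not the most compact; whether it actually loads faster than rebuilding the trie from the CSV would need measuring.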

My notes

Initial thoughts

  • elasticsearch
  • train an ml model
  • from scratch → seems more fun - more actual coding

Dataset

https://www.kaggle.com/wcukierski/enron-email-dataset/

actually this one → https://data.world/brianray/enron-email-dataset

(first chunk only)

Data structure

https://www.futurice.com/blog/data-structures-for-fast-autocomplete/

→ trie with weight (frequency) at each leaf to order predictions

→ are there any other factors we should take into account when ordering?

→ how can we extend that for near matches?

will it fit in memory: https://stackoverflow.com/questions/22183005/whats-the-size-of-a-prefix-tree-trie-that-contains-all-the-english-words
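
A minimal sketch of the weighted-trie idea, not the repo's actual code. All names (`TrieNode`, `Trie`, `insert`, `predict`) are hypothetical. One assumption differs slightly from the note above: the frequency is kept on any node where a word ends, not only on leaves, because a word such as "the" can be a prefix of another word such as "they".

```typescript
// Hypothetical names throughout; a sketch, not the repo's implementation.
class TrieNode {
  children = new Map<string, TrieNode>();
  count = 0; // how many times a word ending at this node was inserted
}

class Trie {
  private root = new TrieNode();

  // Walk (or create) one node per character, then bump the count at the end.
  insert(word: string): void {
    let node = this.root;
    for (const ch of word.split("")) {
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.count += 1;
  }

  // Return up to `limit` words starting with `prefix`, most frequent first.
  predict(prefix: string, limit = 5): string[] {
    let node = this.root;
    for (const ch of prefix.split("")) {
      const next = node.children.get(ch);
      if (!next) return []; // nothing stored under this prefix
      node = next;
    }
    const found: Array<[string, number]> = [];
    const walk = (n: TrieNode, word: string): void => {
      if (n.count > 0) found.push([word, n.count]);
      n.children.forEach((child, ch) => walk(child, word + ch));
    };
    walk(node, prefix);
    // Sort by descending frequency, breaking ties alphabetically.
    found.sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
    return found.slice(0, limit).map((entry) => entry[0]);
  }
}

const trie = new Trie();
["the", "the", "the", "they", "they", "then", "meeting"].forEach((w) =>
  trie.insert(w)
);
console.log(trie.predict("th", 2)); // the two most frequent completions of "th"
```

Collecting every completion and sorting is fine for a sketch; with a large vocabulary it may be worth caching the top suggestions per node instead.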

Language etc.

Could do C++ for efficiency or Java for its nice stdlib, but building an interface would take a lot of effort

Could do Go → ruled out: not enough experience, don't want to be checking syntax all the time

Python/JS easiest → go with JS since it's easier to find nice libraries for the interface

autocomplete cli libraries:

if there's time could make a frontend and host it somewhere

Plan

  1. Implement the trie, populate it, save it somewhere
  2. Add a CLI → load the trie into memory, print the top x suggestions after each character typed
  3. Fuzzy matching (near matches)
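
For the fuzzy step, one possible approach (an assumption, not something the repo implements) is to walk the trie while carrying one row of a Levenshtein edit-distance table per node, pruning branches that can no longer match. The names `TrieNode`, `makeNode`, `insert` and `fuzzySearch` are hypothetical.

```typescript
// Hypothetical sketch: near-match lookup by edit distance over a trie.
interface TrieNode {
  word: string | null; // set when a complete word ends at this node
  children: { [ch: string]: TrieNode };
}

function makeNode(): TrieNode {
  return { word: null, children: {} };
}

function insert(root: TrieNode, word: string): void {
  let node = root;
  for (const ch of word.split("")) {
    let next = node.children[ch];
    if (!next) {
      next = makeNode();
      node.children[ch] = next;
    }
    node = next;
  }
  node.word = word;
}

// Return every stored word within `maxDist` edits of `target`, sorted.
function fuzzySearch(root: TrieNode, target: string, maxDist: number): string[] {
  const results: string[] = [];
  // Row for the empty prefix: distance from "" to each prefix of target.
  const firstRow: number[] = [];
  for (let i = 0; i <= target.length; i++) firstRow.push(i);

  const visit = (node: TrieNode, ch: string, prevRow: number[]): void => {
    // row[i] = edit distance between this node's prefix and target[0..i).
    const row = [prevRow[0] + 1];
    for (let i = 1; i <= target.length; i++) {
      const insertCost = row[i - 1] + 1;
      const deleteCost = prevRow[i] + 1;
      const replaceCost = prevRow[i - 1] + (target[i - 1] === ch ? 0 : 1);
      row.push(Math.min(insertCost, deleteCost, replaceCost));
    }
    if (node.word !== null && row[target.length] <= maxDist) {
      results.push(node.word);
    }
    // Prune: if every entry already exceeds maxDist, no descendant can match.
    if (Math.min(...row) <= maxDist) {
      for (const next of Object.keys(node.children)) {
        visit(node.children[next], next, row);
      }
    }
  };

  for (const ch of Object.keys(root.children)) {
    visit(root.children[ch], ch, firstRow);
  }
  return results.sort();
}

const root = makeNode();
["meeting", "meetings", "melting", "market"].forEach((w) => insert(root, w));
console.log(fuzzySearch(root, "meetng", 1)); // words within one edit of the typo
```

This matches whole words only; combining it with prefix prediction (fuzzy on the typed prefix, then completing) would be a further step.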