
word2vec

public · 0 stars · 0 forks · 28 issues

Commits

List of commits on branch master.
276e638f9ca49998a913488579ec8d963b3ad524 (committed 11 years ago)
updated the code so that it would run even with extremely large N (number of closest words), such as 5000

2c3c561da0c6cb417236a32306a90204f563203c (committed 11 years ago)
fixed bug: the counter in K-means was updated too frequently

2ca74fd9bbc1b26bff1d1e5146cd0b1f36e56aba (committed 11 years ago)
fixed typo: back of words -> bag of words

db4c5c6757cb2ddaf500e035e28c2eef2722225e (committed 11 years ago)
bugfix: the vector computation was done before the check for out-of-vocabulary words was performed, which resulted in accessing memory outside of the allocated block

b8de5c49e7dabaacfe5982e8bd1b85ca3055493b (committed 11 years ago)
changed makefile so that the .sh scripts become executable after 'make'

7070a0bb8327aab35b86c792c49b897c53f45f3b (committed 11 years ago)
distance -> cosine distance

README

The README file for this repository.

Tools for computing distributed representations of words

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using either the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following:

  • desired vector dimensionality
  • the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
  • training algorithm: hierarchical softmax and / or negative sampling
  • threshold for downsampling the frequent words
  • number of threads to use
  • the format of the output word vector file (text or binary)
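As a sketch, a training run covering the options above might look like the following. The flags shown (-cbow, -size, -window, -hs, -negative, -sample, -threads, -binary) are those accepted by the word2vec binary; the corpus and output paths are placeholders.

```shell
# Hedged example invocation; corpus.txt and vectors.bin are placeholder paths.
#   -cbow 1              CBOW architecture (0 selects Skip-gram)
#   -size 200            desired vector dimensionality
#   -window 8            context window size
#   -hs 0 -negative 25   negative sampling (set -hs 1 for hierarchical softmax)
#   -sample 1e-4         downsampling threshold for frequent words
#   -threads 12          number of threads to use
#   -binary 1            binary output format (0 for text)
./word2vec -train corpus.txt -output vectors.bin -cbow 1 -size 200 \
  -window 8 -hs 0 -negative 25 -sample 1e-4 -threads 12 -binary 1
```

The demo scripts in the repository use invocations of this shape, differing mainly in corpus and hyper-parameter choices.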

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.
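The interactive exploration ranks vocabulary words by cosine similarity to a query word. A minimal Python sketch of what the bundled distance tool computes, using a tiny hand-made vocabulary in place of real trained vectors:

```python
import math

# Toy stand-in for trained word2vec output (the text output format is one
# word per line followed by its vector components).
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, n=2):
    """Return the n vocabulary words closest to `word`, best first."""
    target = vectors[word]
    others = ((w, cosine(target, v)) for w, v in vectors.items() if w != word)
    return sorted(others, key=lambda pair: -pair[1])[:n]

print(nearest("king"))  # "queen" ranks above "apple"
```

The real tool does the same ranking over the full trained vocabulary, loading the vectors from the file produced by training.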

More information about the scripts is provided at https://code.google.com/p/word2vec/