
word_knn

public · 6 stars · 2 forks · 4 issues

Commits

List of commits on branch master.
- 56cb113a8f843eaafec6200ec5ed6e88876edf12 (unverified): add getting started notebook, closes #5 (rom1504, 4 years ago)
- b24d4f08fa03377fea6ac844d354e245a2f3b519 (unverified): Release 1.4.0 (rom1504, 4 years ago)
- d915b58fe7bf17d5714665788dd8cfe8de94c50d (unverified): move to pypi (rom1504, 4 years ago)
- fa886d834cecabc3aeaa85e61a9fa6a4b6011316 (verified): fix keep embeddings in from_nlpl (rom1504, 5 years ago)
- 61235d0dcd0d2a9e8e93cc55c6e86a53cafdce96 (verified): Release 1.3.0 (rom1504, 5 years ago)
- 9f71b2c9180f57727daee5b7e91e5404f3e30e70 (verified): add option to not keep embeddings in memory (only the knn index) (rom1504, 5 years ago)

README


word_knn


Quickly find the closest words to a given word, using an efficient knn index and word embeddings.

To get started quickly, you can try the Colab notebook.

Installation

First install Python 3, then:

pip install word_knn

Usage

Command line

Just run python -m word_knn --word "cat"

Details:

$ python -m word_knn --help
usage: python -m word_knn [-h] [--word WORD] [--count COUNT]
                   [--root_embeddings_dir ROOT_EMBEDDINGS_DIR]
                   [--embeddings_id EMBEDDINGS_ID] [--save_zip SAVE_ZIP]
                   [--serve SERVE]

Find closest words.

optional arguments:
  -h, --help            show this help message and exit
  --word WORD           word
  --count COUNT         number of nearest neighboors
  --root_embeddings_dir ROOT_EMBEDDINGS_DIR
                        dir to save embeddings
  --embeddings_id EMBEDDINGS_ID
                        word embeddings id from
                        http://vectors.nlpl.eu/repository/
  --save_zip SAVE_ZIP   save the zip (default false)
  --serve SERVE         serve http API to get nearest words
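
For example, to look up the 5 nearest neighbors of "dog" and keep the downloaded embeddings under ~/embeddings (the word and the paths here are just illustrative):

python -m word_knn --word "dog" --count 5 --root_embeddings_dir ~/embeddings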

Python interface

First go to http://vectors.nlpl.eu/repository/ and pick some embeddings. I recommend the Google News 2013 ones (id 1). For these embeddings, you will need about 15GB of disk space and 6GB of RAM.

You can also use id 0, which is smaller (and faster to download) but contains far fewer words.

You can then run this to get some closest words. This will automatically download and extract the embeddings.

from word_knn import from_nlpl
from pathlib import Path
home = str(Path.home())
closest_words = from_nlpl(home + "/embeddings", "0", False)
print(closest_words.closest_words("cat", 10))

The word dictionary, embeddings, and knn index are then cached, so the second run will be much faster.
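
As a minimal sketch building on the example above, the object returned by from_nlpl can serve many lookups in the same process; the queried words below are just examples:

from pathlib import Path
from word_knn import from_nlpl

home = str(Path.home())
# this call loads from the cache created by the first run instead of downloading again
closest_words = from_nlpl(home + "/embeddings", "0", False)
for word in ["cat", "dog", "car"]:
    print(word, closest_words.closest_words(word, 5))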

You can also download and extract the embeddings yourself:

mkdir -p ~/embeddings/0
cd ~/embeddings/0
wget http://vectors.nlpl.eu/repository/11/0.zip
unzip 0.zip
from pathlib import Path
from word_knn import from_csv_or_cache

home = str(Path.home())
closest_words = from_csv_or_cache(home + "/embeddings/0")
print(closest_words.closest_words("cat", 10))

Development

Prerequisites

Make sure you use Python >= 3.6 and up-to-date versions of pip and setuptools:

python --version
pip install -U pip setuptools

It is recommended to install word_knn in a new virtual environment. For example:

python3 -m venv word_knn_env
source word_knn_env/bin/activate
pip install -U pip setuptools
pip install word_knn

Using Pip

pip install word_knn

From Source

First, clone the word_knn repo on your local machine with

git clone https://github.com/rom1504/word_knn.git
cd word_knn
make install

To install development tools and test requirements, run

make install-dev

Test

To run unit tests in your current environment, run

make test

To run lint + unit tests in a fresh virtual environment, run

make venv-lint-test

Lint

To run black --check:

make lint

To auto-format the code using black:

make black