
word_knn

public · 6 stars · 2 forks · 4 issues

Commits

List of commits on branch master.
- 56cb113a8f843eaafec6200ec5ed6e88876edf12 (unverified): add getting started notebook, closes #5 (rom1504, 4 years ago)
- b24d4f08fa03377fea6ac844d354e245a2f3b519 (unverified): Release 1.4.0 (rom1504, 4 years ago)
- d915b58fe7bf17d5714665788dd8cfe8de94c50d (unverified): move to pypi (rom1504, 4 years ago)
- fa886d834cecabc3aeaa85e61a9fa6a4b6011316 (verified): fix keep embeddings in from_nlpl (rom1504, 5 years ago)
- 61235d0dcd0d2a9e8e93cc55c6e86a53cafdce96 (verified): Release 1.3.0 (rom1504, 5 years ago)
- 9f71b2c9180f57727daee5b7e91e5404f3e30e70 (verified): add option to not keep embeddings in memory (only the knn index) (rom1504, 5 years ago)

README


word_knn


Quickly find the closest words to a given word, using an efficient knn index and word embeddings.

To get started quickly, you can try the Colab notebook.

Installation

First install Python 3, then:

pip install word_knn

Usage

Command line

Just run python -m word_knn --word "cat"

Details:

$ python -m word_knn --help
usage: python -m word_knn [-h] [--word WORD] [--count COUNT]
                   [--root_embeddings_dir ROOT_EMBEDDINGS_DIR]
                   [--embeddings_id EMBEDDINGS_ID] [--save_zip SAVE_ZIP]
                   [--serve SERVE]

Find closest words.

optional arguments:
  -h, --help            show this help message and exit
  --word WORD           word
  --count COUNT         number of nearest neighboors
  --root_embeddings_dir ROOT_EMBEDDINGS_DIR
                        dir to save embeddings
  --embeddings_id EMBEDDINGS_ID
                        word embeddings id from
                        http://vectors.nlpl.eu/repository/
  --save_zip SAVE_ZIP   save the zip (default false)
  --serve SERVE         serve http API to get nearest words
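
For example, to look up the 5 nearest neighbors of "dog" and keep the downloaded embeddings under ~/embeddings (the word and the paths here are just illustrative):

python -m word_knn --word "dog" --count 5 --root_embeddings_dir ~/embeddings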

Python interface

First go to http://vectors.nlpl.eu/repository/ and pick some embeddings. I recommend the Google News 2013 ones (id 1). For these embeddings, you will need about 15GB of disk space and 6GB of RAM.

You can also use id 0, which is smaller (and faster to download) but contains far fewer words.

You can then run this to get some closest words. This will automatically download and extract the embeddings.

from word_knn import from_nlpl
from pathlib import Path
home = str(Path.home())
closest_words = from_nlpl(home + "/embeddings", "0", False)
print(closest_words.closest_words("cat", 10))

The word dictionary, embeddings, and knn index are then cached, so the second run will be much faster.
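
As a minimal sketch building on the example above, the object returned by from_nlpl can serve many lookups in the same process; the queried words below are just examples:

from pathlib import Path
from word_knn import from_nlpl

home = str(Path.home())
# this call loads from the cache created by the first run instead of downloading again
closest_words = from_nlpl(home + "/embeddings", "0", False)
for word in ["cat", "dog", "car"]:
    print(word, closest_words.closest_words(word, 5))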

You can also download and extract the embeddings yourself:

mkdir -p ~/embeddings/0
cd ~/embeddings/0
wget http://vectors.nlpl.eu/repository/11/0.zip
unzip 0.zip
from pathlib import Path
from word_knn import from_csv_or_cache

home = str(Path.home())
closest_words = from_csv_or_cache(home + "/embeddings/0")
print(closest_words.closest_words("cat", 10))

Development

Prerequisites

Make sure you use Python >= 3.6 and up-to-date versions of pip and setuptools:

python --version
pip install -U pip setuptools

It is recommended to install word_knn in a new virtual environment. For example:

python3 -m venv word_knn_env
source word_knn_env/bin/activate
pip install -U pip setuptools
pip install word_knn

Using Pip

pip install word_knn

From Source

First, clone the word_knn repo on your local machine with

git clone https://github.com/rom1504/word_knn.git
cd word_knn
make install

To install development tools and test requirements, run

make install-dev

Test

To run unit tests in your current environment, run

make test

To run lint + unit tests in a fresh virtual environment, run

make venv-lint-test

Lint

To run black --check:

make lint

To auto-format the code using black:

make black