Quickly find the closest words to a given word using an efficient k-nearest-neighbor (knn) search over word embeddings. It uses:
- faiss for an efficient knn implementation
- nlpl word embeddings for quality word embeddings
To get started quickly, try the colab notebook.
First install python3, then:
pip install word_knn
Just run python -m word_knn --word "cat"
Details:
$ python -m word_knn --help
usage: python -m word_knn [-h] [--word WORD] [--count COUNT]
                          [--root_embeddings_dir ROOT_EMBEDDINGS_DIR]
                          [--embeddings_id EMBEDDINGS_ID] [--save_zip SAVE_ZIP]
                          [--serve SERVE]
Find closest words.
optional arguments:
  -h, --help            show this help message and exit
  --word WORD           word
  --count COUNT         number of nearest neighboors
  --root_embeddings_dir ROOT_EMBEDDINGS_DIR
                        dir to save embeddings
  --embeddings_id EMBEDDINGS_ID
                        word embeddings id from
                        http://vectors.nlpl.eu/repository/
  --save_zip SAVE_ZIP   save the zip (default false)
  --serve SERVE         serve http API to get nearest words
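If you want to drive the command line from a script, a minimal sketch could look like the following. It only uses the --word and --count flags listed in the help output above; the helper names build_command and run_word_knn are hypothetical and not part of word_knn.

```python
import subprocess
import sys


def build_command(word, count):
    """Build the argv list for `python -m word_knn` (hypothetical helper).

    Only flags shown in the --help output are used.
    """
    return [sys.executable, "-m", "word_knn", "--word", word, "--count", str(count)]


def run_word_knn(word, count):
    """Run the CLI and return whatever it prints on stdout.

    Requires word_knn to be installed; the first call may download embeddings.
    """
    result = subprocess.run(
        build_command(word, count), check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling run_word_knn("cat", 5) needs the package installed in the same interpreter, and the format of the printed output is not assumed here.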
First go to http://vectors.nlpl.eu/repository/ and pick some embeddings.
I recommend the Google News 2013 one (id 1).
For these embeddings, you will need about 15GB of disk space and 6GB of RAM.
You can also use id 0, which is smaller (faster to download) but contains far fewer words.
You can then run this to get some closest words. This will automatically download and extract the embeddings.
from word_knn import from_nlpl
from pathlib import Path
home = str(Path.home())
closest_words = from_nlpl(home + "/embeddings", "0", False)
print(closest_words.closest_words("cat", 10))
The word dictionary, embeddings and knn index are then cached, so a second run will be much faster.
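To make the idea of "closest words" concrete, here is a toy brute-force search in plain Python. It is only an illustration, not how word_knn works internally: the library relies on faiss for the search, the vectors below are made up, and ranking by cosine similarity is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_closest(embeddings, word, count):
    """Return up to `count` (word, similarity) pairs, most similar first.

    Compares the query against every other word, which is what an
    index such as faiss avoids doing naively on large vocabularies.
    """
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:count]


# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "kitten": [0.95, 0.7, 0.05],
    "car": [0.1, 0.2, 0.9],
}

for neighbor, score in brute_force_closest(toy_embeddings, "cat", 2):
    print(neighbor, round(score, 3))
```

With real embeddings the vocabulary has hundreds of thousands of words or more, which is why a dedicated knn index is used instead of this linear scan.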
You can also download and extract the embeddings yourself:
mkdir -p ~/embeddings/0
cd ~/embeddings/0
wget http://vectors.nlpl.eu/repository/11/0.zip
unzip 0.zip
Then load them from the extracted directory:
from pathlib import Path
from word_knn import from_csv_or_cache
home = str(Path.home())
closest_words = from_csv_or_cache(home + "/embeddings/0")
print(closest_words.closest_words("cat", 10))
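If you are curious what such an embeddings file may look like, the sketch below parses the word2vec text format: a header line with the vocabulary size and the vector dimension, then one line per word followed by its vector components. That the extracted archive contains a file in this format is an assumption, and read_word2vec_text is a hypothetical helper, not part of word_knn.

```python
import io


def read_word2vec_text(lines):
    """Parse word2vec text format from an iterable of lines.

    Returns (dimension, {word: [floats]}). Lines whose length does not
    match the declared dimension are skipped.
    """
    iterator = iter(lines)
    header = next(iterator).split()
    dimension = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in iterator:
        parts = line.split()
        if len(parts) != dimension + 1:
            continue  # malformed or empty line
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return dimension, vectors


# Tiny in-memory sample standing in for a real file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
dimension, vectors = read_word2vec_text(sample)
print(dimension, sorted(vectors))
```

For a real file you would pass an open file handle instead of the StringIO object; the file name inside the archive is not assumed here.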
Make sure you use python>=3.6 and an up-to-date version of pip and setuptools:
python --version
pip install -U pip setuptools
It is recommended to install word_knn in a new virtual environment. For example:
python3 -m venv word_knn_env
source word_knn_env/bin/activate
pip install -U pip setuptools
pip install word_knn
First, clone the word_knn repo on your local machine with:
git clone https://github.com/rom1504/word_knn.git
cd word_knn
make install
To install development tools and test requirements, run
make install-dev
To run unit tests in your current environment, run
make test
To run lint + unit tests in a fresh virtual environment, run
make venv-lint-test
To run black --check:
make lint
To auto-format the code using black:
make black