
word_knn


Commits

List of commits on branch master:

cfdb2d1092676a9765d34d88da5e6a0110edd08d  make serve listen on all addresses (rom1504, verified, committed 5 years ago)
738fda86fe40f8e51a1a8d501ccfb1f6c32f2ca9  Release 1.1.1 (rom1504, verified, committed 5 years ago)
57f26c492fc3039af2268b793dc6998ad64f0324  add missing run dependencies (rom1504, verified, committed 5 years ago)
b5be6a75dc4f0afc59fe9b1493f589a25f71536b  Release 1.1.0 (rom1504, verified, committed 5 years ago)
39d9b7c3c976d1aa3c8eda9ecc6a3296028907aa  add simple http api (under --serve option) (rom1504, verified, committed 5 years ago)
00a85090499be490831ee06ce0ee1b528ec6196a  case fix (rom1504, verified, committed 5 years ago)

README


word_knn


Quickly find the closest words using an efficient knn index and word embeddings from http://vectors.nlpl.eu/repository/.

To get started quickly, you can try the colab notebook.

Installation

First install python3, then:

pip install word_knn

Usage

Command line

Just run python -m word_knn --word "cat"

Details:

$ python -m word_knn --help
usage: python -m word_knn [-h] [--word WORD] [--count COUNT]
                   [--root_embeddings_dir ROOT_EMBEDDINGS_DIR]
                   [--embeddings_id EMBEDDINGS_ID] [--save_zip SAVE_ZIP]
                   [--serve SERVE]

Find closest words.

optional arguments:
  -h, --help            show this help message and exit
  --word WORD           word
  --count COUNT         number of nearest neighbors
  --root_embeddings_dir ROOT_EMBEDDINGS_DIR
                        dir to save embeddings
  --embeddings_id EMBEDDINGS_ID
                        word embeddings id from
                        http://vectors.nlpl.eu/repository/
  --save_zip SAVE_ZIP   save the zip (default false)
  --serve SERVE         serve http API to get nearest words
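
For example, to print the 5 nearest neighbors of "dog" using the small embeddings with id 0 and a custom embeddings directory (the flag names come from the help output above; the directory path is only an illustration), you could run:

python -m word_knn --word "dog" --count 5 --embeddings_id 0 --root_embeddings_dir ~/embeddings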

Python interface

First go to http://vectors.nlpl.eu/repository/ and pick some embeddings. I recommend the Google News 2013 ones (id 1). For these embeddings, you will need about 15GB of disk space and 6GB of RAM.

You can also use id 0, which is smaller (faster to download) but contains far fewer words.

You can then run this to get some closest words. This will automatically download and extract the embeddings.

from pathlib import Path
from word_knn import from_nlpl

home = str(Path.home())
# first run: downloads and extracts the embeddings with id "0" into ~/embeddings and builds the knn index
closest_words = from_nlpl(home + "/embeddings", "0", False)
# print the 10 words closest to "cat"
print(closest_words.closest_words("cat", 10))

The word dictionary, embeddings and knn index are then cached, so the second run will be much faster.
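
As a minimal sketch of what a warm second run looks like (reusing the same ~/embeddings directory and embeddings id "0" as in the previous example), later calls rebuild the index from the cache and only have to answer queries:

from pathlib import Path
from word_knn import from_nlpl

home = str(Path.home())
# nothing is downloaded here: the cached dictionary, embeddings and knn index are reused
closest_words = from_nlpl(home + "/embeddings", "0", False)
for word in ["cat", "dog", "house"]:
    print(word, closest_words.closest_words(word, 5))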

You can also download and extract the embeddings yourself:

mkdir -p ~/embeddings/0
cd ~/embeddings/0
wget http://vectors.nlpl.eu/repository/11/0.zip
unzip 0.zip

Then build the knn index from the extracted csv (it is cached on later runs):

from pathlib import Path
from word_knn import from_csv_or_cache

home = str(Path.home())
closest_words = from_csv_or_cache(home + "/embeddings/0")

print(closest_words.closest_words("cat", 10))

Development

Prerequisites

Make sure you use python>=3.6 and an up-to-date version of pip and setuptools:

python --version
pip install -U pip setuptools

It is recommended to install word_knn in a new virtual environment. For example:

python3 -m venv word_knn_env
source word_knn_env/bin/activate
pip install -U pip setuptools
pip install word_knn

Using Pip

pip install word_knn

From Source

First, clone the word_knn repo on your local machine with

git clone https://github.com/rom1504/word_knn.git
cd word_knn
make install

To install development tools and test requirements, run

make install-dev

Test

To run unit tests in your current environment, run

make test

To run lint + unit tests in a fresh virtual environment, run

make venv-lint-test

Lint

To run black --check:

make lint

To auto-format the code using black:

make black