
word_knn


Commits

List of commits on branch master.
Verified
611460d9a6819c25239ad2a9c3a9575978773e03

Release 1.2.0

rrom1504 committed 5 years ago
Verified
6d27a2b1bee038f2a5f8c43b9fc7d61445ee6b70

add distance option in api and http api

rrom1504 committed 5 years ago
Verified
a60c99467cd45381a0bf33efd0b0c322c152b152

Release 1.1.3

rrom1504 committed 5 years ago
Verified
c5b4b92176403d9a201205d051380c429b64bdb5

check if word is in dict to avoid crashes

rrom1504 committed 5 years ago
Verified
194ff7e96456cb7c25c1fe0e426454c38d250b6d

add some instructions

rrom1504 committed 5 years ago
Verified
4932b7666cfd3107b3ef3db35fa05227dc13d9e2

Release 1.1.2

rrom1504 committed 5 years ago

README

The README file for this repository.

word_knn

Quickly find the closest words to a given word using an efficient knn index and word embeddings. Uses:

To get started quickly, try the colab notebook.

Installation

First install python3, then:

pip install word_knn

Usage

Command line

Just run python -m word_knn --word "cat"

Details :

$ python -m word_knn --help
usage: python -m word_knn [-h] [--word WORD] [--count COUNT]
                   [--root_embeddings_dir ROOT_EMBEDDINGS_DIR]
                   [--embeddings_id EMBEDDINGS_ID] [--save_zip SAVE_ZIP]
                   [--serve SERVE]

Find closest words.

optional arguments:
  -h, --help            show this help message and exit
  --word WORD           word
  --count COUNT         number of nearest neighboors
  --root_embeddings_dir ROOT_EMBEDDINGS_DIR
                        dir to save embeddings
  --embeddings_id EMBEDDINGS_ID
                        word embeddings id from
                        http://vectors.nlpl.eu/repository/
  --save_zip SAVE_ZIP   save the zip (default false)
  --serve SERVE         serve http API to get nearest words
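The flags above can also be driven from a script. The following is a minimal sketch, not part of word_knn: `build_command` and `closest_words_cli` are hypothetical helper names, only flags listed in the help text are used, and the format of the command's output is not assumed, so it is returned as raw text.

```python
import subprocess
import sys


def build_command(word, count=None, embeddings_id=None, root_embeddings_dir=None):
    """Build the argument list for `python -m word_knn` from the documented flags."""
    command = [sys.executable, "-m", "word_knn", "--word", word]
    if count is not None:
        command += ["--count", str(count)]
    if embeddings_id is not None:
        command += ["--embeddings_id", str(embeddings_id)]
    if root_embeddings_dir is not None:
        command += ["--root_embeddings_dir", str(root_embeddings_dir)]
    return command


def closest_words_cli(word, **options):
    """Run the command line tool and return its standard output as raw text."""
    result = subprocess.run(
        build_command(word, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout


# Example (requires word_knn to be installed; may download embeddings):
# print(closest_words_cli("cat", count=5))
```

Passing the arguments as a list avoids shell quoting issues for words that contain spaces or quotes.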

Python interface

First go to http://vectors.nlpl.eu/repository/ and pick some embeddings. I recommend the Google News 2013 embeddings (id 1), which need about 15GB of disk space and 6GB of RAM.

You can also use id 0, which is smaller (faster to download) but contains far fewer words.

You can then run this to get some closest words. This will automatically download and extract the embeddings.

from word_knn import from_nlpl
from pathlib import Path
home = str(Path.home())
closest_words = from_nlpl(home + "/embeddings", "0", False)
print(closest_words.closest_words("cat", 10))

The word dictionary, embeddings and knn index are then cached, so a second run will be much faster.
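To make "closest words" concrete, here is a toy brute-force sketch of a nearest-neighbour search over word vectors. It is not how word_knn is implemented (the library relies on an efficient knn index). The vectors are made up, and both cosine similarity as the measure and excluding the query word from the results are assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "dog": [0.7, 0.6, 0.0],
    "car": [0.0, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_closest(embeddings, word, count):
    """Return up to `count` words most similar to `word`, best first."""
    if word not in embeddings:
        return []  # unknown word: nothing to compare against
    query = embeddings[word]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:count]]


print(brute_force_closest(TOY_EMBEDDINGS, "cat", 2))  # ['kitten', 'dog']
```

This scan compares the query against every word, which is fine for four words but slow for millions. That cost is the reason to build a knn index once and cache it.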

You can also download and extract the embeddings yourself, then load them from the extracted directory:

mkdir -p ~/embeddings/0
cd ~/embeddings/0
wget http://vectors.nlpl.eu/repository/11/0.zip
unzip 0.zip

from pathlib import Path
from word_knn import from_csv_or_cache
home = str(Path.home())
closest_words = from_csv_or_cache(home + "/embeddings/0")

print(closest_words.closest_words("cat", 10))
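If you prefer to stay in Python, the manual download steps above can be sketched with the standard library. `download_and_extract` is a hypothetical helper, not part of word_knn. It mirrors `mkdir -p`, `wget` and `unzip`, and skips the download when the archive is already present.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, target_dir):
    """Download a zip archive into target_dir and extract it there."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    archive = target / url.rsplit("/", 1)[-1]  # file name from the URL, e.g. "0.zip"
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))  # like `wget`
    with zipfile.ZipFile(archive) as zip_file:
        zip_file.extractall(target)  # like `unzip`
    return target


# Example, using the URL from the shell commands above:
# download_and_extract(
#     "http://vectors.nlpl.eu/repository/11/0.zip",
#     Path.home() / "embeddings" / "0",
# )
```

The returned directory can then be passed, as a string, to `from_csv_or_cache`, as in the snippet above.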

Development

Prerequisites

Make sure you use python>=3.6 and an up-to-date version of pip and setuptools:

python --version
pip install -U pip setuptools

It is recommended to install word_knn in a new virtual environment. For example:

python3 -m venv word_knn_env
source word_knn_env/bin/activate
pip install -U pip setuptools
pip install word_knn

Using Pip

pip install word_knn

From Source

First, clone the word_knn repo to your local machine and install it with

git clone https://github.com/rom1504/word_knn.git
cd word_knn
make install

To install development tools and test requirements, run

make install-dev

Test

To run unit tests in your current environment, run

make test

To run lint + unit tests in a fresh virtual environment, run

make venv-lint-test

Lint

To run black --check:

make lint

To auto-format the code using black:

make black