Simple and fast implementation of a static on disk kv, in python
leveldb, rocksdb and lmdb all have issues for a static collections of key and values:
- slow to build (many hours) : 3h for rocksdb compared to 1h for this lib (for a 5B collections for 1 long and 2 float16)
- uses more space than necessary (100GB for rocksdb unlike 60GB)
- as fast as this much simpler lib: about 5k sample/s on nvme drive
What this lib does not support:
- non static collection
- variable length values and keys
pip install static_ondisk_kv
Checkout these examples:
from static_ondisk_kv import OnDiskKV
from tqdm import tqdm
import random
kv = OnDiskKV(file='/media/nvme/mybigfile', key_format="q", value_format="ee")
print("length", kv.length)
k = kv.get_key(100)
v = kv.get_value(100)
print(k)
print(v)
print(kv[k])
Creates an ondisk kv from file
using key_format
and value_format
for decoding.
Returns the key at position i.
Returns the value at position i.
Returns the value for the key k
sort parquet files of collection input_collection
by key_column
and writes to output_folder
read parquet of sorted input_collection
and writes to output_file
the key and values using format key_format
and value_format
Either locally, or in gitpod (do export PIP_USER=false
there)
Setup a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
to run tests:
pip install -r requirements-test.txt
then
make lint
make test
You can use make black
to reformat the code
python -m pytest -x -s -v tests -k "dummy"
to run a specific test