
static-ondisk-kv


Commits

Commits on branch main:

  • 1edf9f4d4fda068577bbb41a941292b6e074662c: Release 1.1.2 (rrom1504, 3 years ago)
  • d063baf6ddd3c91ecfef51c08c6df03a0dbcbc59: add space after key in error (rrom1504, 3 years ago)
  • 0127bef34e2fb4b3e882f91ad2839426a946b8f0: Release 1.1.1 (rrom1504, 3 years ago)
  • da49f98fdd8719fccba382f2b6f672c05ad37a47: simplify implementation of onDiskKV (rrom1504, 3 years ago)
  • b03e617feb791cc5452c772c1f360278280b8031: Release 1.1.0 (rrom1504, 3 years ago)
  • b1dc0fac074de2469251323cb574c2d92ff57138: Make OnDiskKV pickable. (rrom1504, 3 years ago)


static_ondisk_kv


A simple and fast implementation of a static on-disk key-value (kv) store, in Python.

Why this lib?

LevelDB, RocksDB and LMDB all have drawbacks for a static collection of keys and values:

  • slow to build (many hours): 3h for RocksDB compared to 1h for this lib (for a collection of 5B entries, each made of 1 long and 2 float16)
  • use more space than necessary (100GB for RocksDB versus 60GB for this lib)
  • only as fast as this much simpler lib: about 5k samples/s on an NVMe drive
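The 60GB figure matches a layout with no per-record overhead. The short calculation below is a sketch that assumes the "long" is an 8-byte integer, each float16 takes 2 bytes, and records are stored back to back; it uses the format characters of Python's struct module, and the variable names are illustrative only.

```python
import struct

# Assumption: "q" is an 8-byte integer and "e" a 2-byte float16, as in the
# struct module. The "<" prefix selects standard sizes with no padding.
KEY_FORMAT = "<q"
VALUE_FORMAT = "<ee"

record_size = struct.calcsize(KEY_FORMAT) + struct.calcsize(VALUE_FORMAT)
num_records = 5_000_000_000  # the 5B entries quoted above

total_bytes = record_size * num_records
print(record_size)        # bytes per record
print(total_bytes / 1e9)  # total size in GB (decimal)
```

Under these assumptions each record takes 12 bytes, so 5B records come to 60GB.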

What this lib does not support:

  • non-static collections
  • variable-length keys and values

Install

pip install static_ondisk_kv

Python examples

Check out this example:

from static_ondisk_kv import OnDiskKV

# Open an existing file; key_format and value_format describe how to decode it
kv = OnDiskKV(file='/media/nvme/mybigfile', key_format="q", value_format="ee")
print("length", kv.length)
k = kv.get_key(100)    # key at position 100
v = kv.get_value(100)  # value at position 100
print(k)
print(v)
print(kv[k])  # look up the value by key

API

OnDiskKV(file, key_format="q", value_format="ee")

Creates an on-disk kv from file, using key_format and value_format to decode the keys and values.

get_key(i)

Returns the key at position i.

get_value(i)

Returns the value at position i.

__getitem__(k)

Returns the value for the key k, as in kv[k].
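The README does not describe how lookups work internally. One common way to serve both positional access and key lookup from a static file is to store fixed-size records sorted by key, seek to `i * record_size` for position `i`, and binary-search the keys. The class below is a minimal sketch of that idea, not the library's actual implementation; `FixedRecordFile` and its little-endian, unpadded layout are assumptions made for illustration.

```python
import os
import struct
import tempfile


class FixedRecordFile:
    """Hypothetical reader for a file of fixed-size records sorted by key."""

    def __init__(self, path, key_format="q", value_format="ee"):
        # Assumption: little-endian records without padding ("<" prefix).
        self.key_struct = struct.Struct("<" + key_format)
        self.value_struct = struct.Struct("<" + value_format)
        self.record_size = self.key_struct.size + self.value_struct.size
        self.file = open(path, "rb")
        self.length = os.path.getsize(path) // self.record_size

    def get_key(self, i):
        # Record i starts at byte offset i * record_size.
        self.file.seek(i * self.record_size)
        return self.key_struct.unpack(self.file.read(self.key_struct.size))[0]

    def get_value(self, i):
        # The value follows the key inside the same record.
        self.file.seek(i * self.record_size + self.key_struct.size)
        return self.value_struct.unpack(self.file.read(self.value_struct.size))

    def __getitem__(self, key):
        # Binary search over the sorted keys: O(log n) seeks per lookup.
        lo, hi = 0, self.length
        while lo < hi:
            mid = (lo + hi) // 2
            if self.get_key(mid) < key:
                lo = mid + 1
            else:
                hi = mid
        if lo < self.length and self.get_key(lo) == key:
            return self.get_value(lo)
        raise KeyError(key)

    def close(self):
        self.file.close()


# Build a tiny sorted file to exercise the reader.
records = [(10, (0.5, 1.0)), (20, (1.5, 2.0)), (30, (2.5, 3.0))]
with tempfile.NamedTemporaryFile(delete=False) as f:
    for key, value in records:
        f.write(struct.pack("<q", key) + struct.pack("<ee", *value))
    path = f.name

kv = FixedRecordFile(path)
length = kv.length
middle_key = kv.get_key(1)
found = kv[30]
try:
    kv[25]
    missing_raises = False
except KeyError:
    missing_raises = True
kv.close()
os.remove(path)

print(length, middle_key, found, missing_raises)
```

Fixed-size records are what make this work without an index: every offset is computed rather than looked up, which is consistent with the lib not supporting variable-length keys and values.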

sort_parquet(input_collection, key_column, value_columns, output_folder)

Sorts the parquet files of the collection input_collection by key_column and writes the result to output_folder.

parquet_to_file(input_collection, key_column, value_columns, output_file, key_format, value_format)

Reads the parquet files of the sorted input_collection and writes the keys and values to output_file, encoded with key_format and value_format.
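To make the build step concrete without depending on parquet, the sketch below sorts in-memory (key, values) pairs and writes them as fixed-size binary records. It only illustrates the kind of output a step like parquet_to_file could produce; `write_sorted_records` is a hypothetical helper, and the little-endian, unpadded layout is an assumption rather than the library's documented file format.

```python
import os
import struct
import tempfile


def write_sorted_records(pairs, output_file, key_format="q", value_format="ee"):
    """Hypothetical helper: write (key, values) pairs, sorted by key, as fixed-size records."""
    # Assumption: one little-endian record per pair, key first, no padding.
    record = struct.Struct("<" + key_format + value_format)
    with open(output_file, "wb") as f:
        for key, values in sorted(pairs, key=lambda pair: pair[0]):
            f.write(record.pack(key, *values))
    return record.size


# Unsorted input: the helper sorts by key before writing.
pairs = [(30, (2.5, 3.0)), (10, (0.5, 1.0)), (20, (1.5, 2.0))]
path = os.path.join(tempfile.mkdtemp(), "records.bin")
record_size = write_sorted_records(pairs, path)

file_size = os.path.getsize(path)
with open(path, "rb") as f:
    first_record = struct.unpack("<qee", f.read(record_size))
os.remove(path)

print(record_size, file_size, first_record)
```

For collections too large for memory, the sort has to happen out of core, which is presumably why sorting is a separate step (sort_parquet) from writing.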

For development

Develop either locally or in Gitpod (run export PIP_USER=false there).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.