json2bin

public
2 stars
0 forks
0 issues

Commits

List of commits on branch main.
8086660b18bc154d0628cf4b078da5269dec226d

debug the performance of each thread

ccahya-wirawan committed 5 months ago
8f2a5ccc58e0fd1cda9b789c88f64c7d57daefe3

debug the performance of each thread

ccahya-wirawan committed 5 months ago
4faf34370a85657d5d879cc7672bf2ebf40dec8b

fixed probably the utf-8 issue

ccahya-wirawan committed 5 months ago
06a060619a30f05c5aaadf1d87dcc4bf2809d660

make configurable threads number

ccahya-wirawan committed 5 months ago
f51a96d44ea0ff300f1577fe236f2f2c3574bc0d

Cleanup code

ccahya-wirawan committed 5 months ago
c8af55e2f1689a6ba01dd700ff9aa4bf5be4f4c1

fixed counting newlines

ccahya-wirawan committed 5 months ago

README

The README file for this repository.

Json2bin

License: Apache 2.0

A fast, multithreaded converter from JSONL to RWKV binidx files, written in Rust.

[Figure: multithreading performance]

Installation

$ cargo install json2bin

Usage

$ json2bin -h
Json converter to RWKV binidx file format
Usage: json2bin [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Jsonlines file to read
  -o, --output-dir <OUTPUT_DIR>  Output directory for binidx files [default: -]
  -t, --thread <THREAD>          Number of threads [default: 8]
  -v, --verbose                  Verbosity
  -h, --help                     Print help
  -V, --version                  Print version

The following command converts the JSONL file src/sample.jsonl into the files src/sample.bin and src/sample.idx:

$ json2bin -i src/sample.jsonl

The output directory can be set with the argument "--output-dir <OUTPUT_DIR>" or "-o <OUTPUT_DIR>":

$ json2bin -i src/sample.jsonl -o output
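
The mapping from an input file to its two output files can be sketched in a few lines of Rust. This is only an illustration, not the code json2bin actually uses: the function name `binidx_paths` is hypothetical, and it assumes that the default output directory `-` means "next to the input file" and that `-o output` keeps the input's file stem.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Hypothetical helper: derive the `.bin` and `.idx` paths for a JSONL input.
/// `output_dir = None` is assumed to mean "same directory as the input".
fn binidx_paths(input: &Path, output_dir: Option<&Path>) -> (PathBuf, PathBuf) {
    // File name without its last extension, e.g. "sample" for "src/sample.jsonl".
    let stem = input.file_stem().unwrap_or(OsStr::new(""));
    let dir: &Path = match output_dir {
        Some(dir) => dir,
        None => input.parent().unwrap_or(Path::new("")),
    };
    // Append the extension instead of calling `with_extension`, so a stem
    // that itself contains dots (e.g. "wiki.en") is kept intact.
    let with_ext = |ext: &str| {
        let mut name = stem.to_os_string();
        name.push(".");
        name.push(ext);
        dir.join(name)
    };
    (with_ext("bin"), with_ext("idx"))
}
```

Under these assumptions, `binidx_paths(Path::new("src/sample.jsonl"), None)` yields `src/sample.bin` and `src/sample.idx`, matching the first example above, while passing `Some(Path::new("output"))` yields `output/sample.bin` and `output/sample.idx`.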

The default number of threads is 8; it can be changed with the argument "--thread" or "-t":

$ json2bin -i src/sample.jsonl -t 4
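
To give a rough idea of what a configurable thread count can mean for a line-oriented format such as JSONL, here is a minimal sketch using only the Rust standard library. It does not reflect how json2bin is implemented: `process_line` and `process_parallel` are hypothetical names, and the per-line work (taking the line's byte length) is a stand-in for real parsing and tokenization.

```rust
use std::thread;

/// Hypothetical stand-in for the per-document work; here just the byte length.
fn process_line(line: &str) -> usize {
    line.len()
}

/// Process `lines` on up to `threads` scoped worker threads, keeping input order.
fn process_parallel(lines: &[&str], threads: usize) -> Vec<usize> {
    let threads = threads.max(1);
    // Ceiling division so every line lands in exactly one chunk;
    // `.max(1)` avoids a zero chunk size when `lines` is empty.
    let chunk_size = ((lines.len() + threads - 1) / threads).max(1);
    thread::scope(|scope| {
        // Spawn all workers first, so they run concurrently before any join.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|line| process_line(line)).collect::<Vec<_>>()
                })
            })
            .collect();
        // Join in spawn order, which preserves the original line order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Splitting the input into contiguous chunks keeps the results in input order, which matters whenever an index has to refer to documents in the order they were written.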

Performance comparison

We converted a 19 GB English Wikipedia dump (20231101.en) from JSONL to binidx format on an Apple M2 machine. The Rust json2bin ran with 7 threads and was about 70 times faster than the Python json2binidx:

  • The Python json2binidx: 1:01:45 or 5.13MB/s
  • This Rust json2bin: 52.64s or 360.86MB/s
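
The speedup and throughput figures can be checked with a short calculation. The sketch below uses hypothetical helper names and assumes that 19 GB means 19,000 MB; the actual file is evidently slightly smaller, which is why the recomputed throughput differs from the reported one in the first decimal place.

```rust
/// Convert an h:mm:ss wall-clock duration into seconds.
fn hms_to_secs(hours: u64, minutes: u64, seconds: u64) -> f64 {
    (hours * 3600 + minutes * 60 + seconds) as f64
}

/// How many times faster the fast run is than the slow run.
fn speedup(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

/// Throughput in MB/s for a file of `size_mb` megabytes.
fn throughput_mb_s(size_mb: f64, secs: f64) -> f64 {
    size_mb / secs
}
```

With the reported timings, `hms_to_secs(1, 1, 45)` gives 3705 seconds, so `speedup(3705.0, 52.64)` is roughly 70.4, consistent with the "70 times faster" claim. `throughput_mb_s(19_000.0, 52.64)` comes out at about 360.9 MB/s and `throughput_mb_s(19_000.0, 3705.0)` at about 5.13 MB/s.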