GitXplorerGitXplorer
j

gibbon

public
7 stars
2 forks
0 issues

Commits

List of commits on branch master.
Unverified
b786153748b537445b224e5777ce7616a4aaaf3d

Version 0.1.3

jjohshoff committed 5 years ago
Unverified
00cc9a13a7f01dd1e5d3f91ba8586d3b04d88137

Rename test and add explanation

jjohshoff committed 5 years ago
Verified
00018e23a2fd5c83fe1192dc3a7d849effdb73e4

Merge pull request #3 from That3Percent/master

jjohshoff committed 5 years ago
Unverified
81e5afbd18a850cdb86fadef8fbb9ff67397f807

Fix corner case with aligned read

jjohshoff committed 5 years ago
Unverified
1c32af2023eecd3905b9dd01395c9176950fde38

Added panic test case

TThat3Percent committed 5 years ago
Unverified
327197855c52b864a366ea46878fe7f433b8da18

Publish new version

jjohshoff committed 5 years ago

README

The README file for this repository.

This is an incomplete implementation of the memory format of Facebook's Gorilla database in Rust, as described in Gorilla: A Fast, Scalable, In-Memory Time Series Database.

Example

There is an example at examples/csv_to_packed.rs. Run as follows:

cargo run --release --example csv_to_packed

This will read the file examples/test_data.csv and compress it in memory. It's not a very interesting file, but replacing it with your favorite data will show compression ratio and speed differences between compressed and uncompressed reads.

There are also examples in the test code in the modules.

Implementation details

The Gorilla Paper leaves some details out:

  • The number of significant bits when compressing doubles are stored in a 6 bits, giving a max value of 63. The key thing to notice is that only 63 values are actually needed: 1 through 64. I solve this by storing M - 1, where M is the number of significant bits ([MEANING64] in code). Another implementation stores it by storing M & 63 and resolving it at read time. Either solution is fine. The former optimizes for read speed and the latter for write speed.
  • The number of leading zeros is stored in 5 bits, which gives a maximum of 31 leading zeros. There is nothing preventing significant bits from having leading zeros, though, so we just use 31 if it's 31 or higher. ([LEADING31] in code)
  • Leading number in previous XOR. Are we storing that or the XOR itself? If the former, the window will keep the same if we reuse it, if not it might shrink as new data comes in. Unsure about the best solution. ([XORORLEADING] in code)
  • IntStream writes the number plus a bias so that the resulting number is always a non-negative number. This makes it fast to encode and decode without branching or being dependent on hardware representation of numbers. The initial version was not as smart and took about twice as long to decode.

Further work

  • Measure and optimize performance
  • Resolve open questions in Implementation details
  • Implement the rest of the paper
  • Investigate whether Rust's Write and Read traits could be used instead of hand rolled traits
  • Better naming:
    • Stream can now mean both bit-stream and compressed stream
    • Writer can refer both to the Writer trait and its impls or to a "compressor"