datafusion-gpu
Commits

List of commits on branch main:

  • Add run command in the README — ggabotechs, a month ago (a5f9ebcb)
  • Add readme — ggabotechs, a month ago (8aadfb11)
  • Change random table from numbers to types — ggabotechs, a month ago (18ee7c08)
  • Support more types on plane operation — ggabotechs, a month ago (4812f635)
  • Support plane sums for cubecl — ggabotechs, a month ago (23496399)
  • Fix conditional cuda — ggabotechs, a month ago (15582928)

README

Datafusion GPU

This repo showcases wiring up Apache DataFusion with GPU execution runtimes in order to speed up heavy computations.

Objective

The main objective is not to provide a wide set of execution nodes or math functions that can run on the GPU, but to try out different technologies for running a single aggregation function and see the benefits and drawbacks of each approach.

For that, two approaches were followed:

Compiling compute kernels at runtime with CubeCL

This approach uses https://github.com/tracel-ai/cubecl to write kernels directly in Rust, which then get compiled down to different backends, like CUDA or WGPU.

Example here
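CubeCL's plane operations reduce a value across all lanes of a plane (CUDA's warp, typically 32 lanes wide). As a rough illustration of the semantics such a kernel relies on, here is a CPU sketch — this is not CubeCL's actual API; `PLANE_DIM` and `plane_sum_cpu` are made-up names:

```rust
// CPU sketch of plane-sum semantics: after the operation, every lane
// in a plane holds the sum of all values in that plane. On the GPU this
// happens in registers across threads; here we just chunk a slice.
const PLANE_DIM: usize = 32; // typical plane/warp width on CUDA hardware

fn plane_sum_cpu(values: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(values.len());
    for plane in values.chunks(PLANE_DIM) {
        let sum: f32 = plane.iter().sum();
        // every lane of the plane observes the same reduced value
        out.extend(std::iter::repeat(sum).take(plane.len()));
    }
    out
}

fn main() {
    let input: Vec<f32> = (0..64).map(|i| i as f32).collect();
    let reduced = plane_sum_cpu(&input);
    // lanes 0..32 hold 0+1+...+31 = 496, lanes 32..64 hold 32+...+63 = 1520
    assert_eq!(reduced[0], 496.0);
    assert_eq!(reduced[63], 1520.0);
}
```

After a plane sum, a kernel typically has one lane per plane write its value out, leaving a much smaller array to reduce in a second pass.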

Advantages

  • Write the kernel once, and use it for any datatype and several GPU technologies
  • Use Rust for writing the kernel, no need to learn hardware-specific languages

Disadvantages

  • Small ecosystem, lack of documentation, lack of examples, immature technology
  • Bad performance (this could be on me)
  • Bugs? (got my laptop bricked several times trying to run some kernels)
  • Certain abstractions are very tailored to working with tensors rather than 1-d arrays

Writing CUDA kernels by hand and feeding them data with cudarc

This approach uses https://github.com/coreylowman/cudarc with some handwritten CUDA kernels. The library allows copying buffers to the GPU and scheduling launches of the compiled handwritten kernels over them.

Example here
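The handwritten kernel used in the benchmarks below is described as a shared-memory sum. The core pattern is a tree reduction: each step halves the number of active threads, adding the upper half of the buffer onto the lower half. A CPU sketch of that pattern, assuming a power-of-two block size (names are illustrative, not the repo's code):

```rust
// Tree reduction over a block-sized buffer, mirroring what each CUDA
// block does in shared memory: at every step, thread i adds element
// i + stride into element i, and the stride halves until one value remains.
fn block_reduce_sum(block: &mut [f32]) -> f32 {
    assert!(block.len().is_power_of_two());
    let mut stride = block.len() / 2;
    while stride > 0 {
        for i in 0..stride {
            block[i] += block[i + stride]; // on the GPU: one thread per i
        }
        stride /= 2; // on the GPU, a __syncthreads() sits between steps
    }
    block[0]
}

fn main() {
    let mut block: Vec<f32> = (1..=8).map(|i| i as f32).collect();
    assert_eq!(block_reduce_sum(&mut block), 36.0); // 1+2+...+8
}
```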

Advantages

  • More control over the kernel code that gets executed on the GPU
  • Good performance
  • Wide CUDA ecosystem

Disadvantages

  • Works only on CUDA devices
  • Making a kernel work for different datatypes is not supported out of the box
  • Needs knowledge about writing CUDA code
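One common workaround for the datatype limitation is to instantiate the kernel source per type before compiling it at runtime (cudarc can compile CUDA source through NVRTC). A minimal, hypothetical sketch of such templating — the template string and function names here are made up, not taken from the repo:

```rust
// Hypothetical kernel template: "{T}" is substituted with a concrete
// CUDA type name before the source is handed to a runtime compiler.
const SUM_KERNEL_TEMPLATE: &str = r#"
extern "C" __global__ void sum_{T}(const {T}* data, {T}* out, int n) {
    // ... shared-memory reduction body ...
}
"#;

// Produce a compilable kernel source for one concrete datatype.
fn instantiate_kernel(cuda_type: &str) -> String {
    SUM_KERNEL_TEMPLATE.replace("{T}", cuda_type)
}

fn main() {
    let src = instantiate_kernel("float");
    assert!(src.contains("void sum_float(const float* data"));
    let src = instantiate_kernel("double");
    assert!(src.contains("sum_double"));
}
```

Each generated source still has to be compiled and cached separately, which is exactly the per-type bookkeeping that CubeCL's generics handle for you.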

Results

Given the following conditions:

  • Measured on a g4dn.xlarge AWS instance with 4 vCPUs and a T4 GPU
  • In-memory table called types with 1000000 entries with the following schema:
+--------+-------+-----+
| string | float | int |
+--------+-------+-----+
  • Datafusion runtime with two sum aggregation function variants:
    • sum_cubecl: using a plane reduction kernel written with CubeCL
    • sum_cudarc: using a handwritten CUDA kernel based on a shared memory algorithm
  • Code run with the following command: cargo run --release --features cuda -- -l 1000000
+-------------------------------------+----------------+
| Query                               | Execution time |
+-------------------------------------+----------------+
| SELECT sum(float) FROM types        | ~7.5ms         |
| SELECT sum_cudarc(float) FROM types | ~2ms           |
| SELECT sum_cubecl(float) FROM types | ~440ms         |
+-------------------------------------+----------------+