q4f16-gemm-gemv-benchmark

This repository compares the available open-source GEMM / GEMV kernels that use a mixed-precision int4 / fp16 scheme with per-group weight quantization.
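
To illustrate the scheme being benchmarked, here is a minimal PyTorch sketch (not taken from this repository) of per-group int4 weight quantization followed by dequantize-then-matmul in fp16. The symmetric quantization and group size of 128 are illustrative assumptions, and real kernels pack two int4 values per byte rather than storing int8:

import torch

def quantize_per_group(w, group_size=128):
    # w: (n, k) fp16 weight, quantized along k in groups of group_size.
    n, k = w.shape
    w = w.reshape(n, k // group_size, group_size).float()
    # Symmetric int4: one fp16 scale per group, values clamped to [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q, scales.half()

def dequant_matmul(a, q, scales):
    # a: (m, k) fp16 activations; returns C = A * B^T of shape (m, n).
    n = q.shape[0]
    b = (q.float() * scales.float()).reshape(n, -1).half()
    return a @ b.t()

a = torch.randn(1, 8192, dtype=torch.float16, device="cuda")
w = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")
q, s = quantize_per_group(w)
c = dequant_matmul(a, q, s)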

Available implementations

The implementations compared are:

- baseline: PyTorch fp16 * fp16 matmul (no quantization)
- gptqforllama: GPTQ-for-LLaMa CUDA kernel
- exllama: exllama CUDA kernel
- autogptq-triton, autogptq-cuda-old, autogptq-cuda: AutoGPTQ Triton and CUDA kernels

Results

On A100-SXM4-80GB + Intel Xeon Platinum 8275CL CPU + CUDA 11.7/11.8 (should be rerun in Docker):

m  n     k     implementation     act_order  Time (ms/op)  Max mem (MB)
1  8192  8192  baseline           True       0.0937        177.6845
1  8192  8192  gptqforllama       True       0.2038        69.8450
1  8192  8192  exllama            False      0.0681        34.9143
1  8192  8192  exllama            True       0.0675        34.9471
1  8192  8192  autogptq-triton    True       0.3990        69.8450
1  8192  8192  autogptq-cuda-old  False      0.0831        71.9585
1  8192  8192  autogptq-cuda      True       0.1546        69.8778

On RTX 4090 + AMD Ryzen 9 7950X CPU + CUDA 11.8:

TODO

On A10G + AMD EPYC 7R32 CPU + CUDA 11.8 (Docker):

m  n     k     implementation     act_order  Time (ms/op)  Max mem (MB)
1  8192  8192  baseline           True       0.2891        177.6845
1  8192  8192  gptqforllama       True       0.1746        69.8450
1  8192  8192  autogptq-triton    True       0.2963        69.8450
1  8192  8192  autogptq-cuda-old  False      0.0979        71.9585
1  8192  8192  autogptq-cuda      True       0.1483        69.8778
1  8192  8192  exllama            False      0.0842        34.9143
1  8192  8192  exllama            True       0.0839        34.9471

Run the benchmark

Shapes: A is (m, k) and B is (n, k); the benchmark computes C = A * B^T, giving an (m, n) output.
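
As a reference for the baseline numbers, here is a minimal sketch of how the fp16 matmul could be timed with CUDA events (an assumption about the methodology; run_benchmark.py is the authoritative script):

import torch

m, n, k = 1, 8192, 8192
a = torch.randn(m, k, dtype=torch.float16, device="cuda")
b = torch.randn(n, k, dtype=torch.float16, device="cuda")

# Warm up before timing so one-time setup costs are excluded.
for _ in range(10):
    c = a @ b.t()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    c = a @ b.t()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / iters:.4f} ms/op")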

It can be a good idea to first lock the GPU clock frequency; see https://github.com/NVIDIA/cutlass/issues/430#issuecomment-1069535238.
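
For example, on device 0 (the clock value is illustrative and GPU-dependent):

sudo nvidia-smi -i 0 --lock-gpu-clocks=1410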

Run exllama in the exllama environment:

CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --exllama-path ../exllama --act-order yes

Run gptqforllama in the gptqforllama environment:

CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --gptqforllama-path ../GPTQ-for-LLaMa --act-order yes

Run AutoGPTQ (specify --autogptq-implem {triton, cuda-old, cuda}):

CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --autogptq-path ../AutoGPTQ/ --autogptq-implem triton --act-order yes

Run the PyTorch fp16 * fp16 baseline:

CUDA_VISIBLE_DEVICES=0 python run_benchmark.py --m 1 --n 8192 --k 8192 --group_size 128 --baseline

Run all benchmarks

Following https://stackoverflow.com/a/61737404, build the image:

docker build -f Dockerfile --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) -t container-q4f16 .

and then run it:

docker run --gpus device=0 -it --rm container-q4f16:latest /bin/bash run.sh