heggenhougen 2 hours ago
DeepSeek-R1: all 256 MoE experts stacked into a 524,288 x 7,168 matrix. 78.9x throughput vs cuBLAS, 98.7% energy reduction, 5,294 effective TFLOPS. Operator build time 0.11 seconds.
Llama 4 Scout: MoE FFN weights, 81.7x throughput, 98.8% energy reduction.
Mixtral 8x22B: 55.1x throughput across all 56 MoE layers, 98.2% energy reduction.
Qwen3-235B-A22B: 22.4x throughput, 95.5% energy reduction.
Llama 4 Maverick: 20.7x throughput, 81.5% energy reduction.
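For context on the DeepSeek-R1 shape quoted above: 524,288 rows divided across 256 experts implies 2,048 rows per expert slice. A minimal sketch of that arithmetic (the per-expert slice size is inferred from the totals, not stated by the authors):

```python
# Hypothetical sketch: how 256 stacked MoE expert slices could produce a
# 524,288 x 7,168 matrix. Slice height is inferred as 524,288 / 256 = 2,048;
# the library's actual memory layout may differ.
NUM_EXPERTS = 256
TOTAL_ROWS = 524_288
HIDDEN_DIM = 7_168

rows_per_expert = TOTAL_ROWS // NUM_EXPERTS
print(rows_per_expert)                      # 2048
print(NUM_EXPERTS * rows_per_expert)        # 524288
print((TOTAL_ROWS, HIDDEN_DIM))             # (524288, 7168)
```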
Each result is SHA-256 verified against a normalized output hash. The same hash has been reproduced independently by the University of Miami on NVIDIA B200, AMD MI300X, Intel Xeon, and Apple M4 Pro hardware, with the results published on Zenodo in December 2025.
The library requires no model retraining, quantization, or hardware changes; it operates directly on the weight matrices.
We are happy to answer questions about methodology, the hardware counters, or anything else.
rolv.ai