heggenhougen 2 hours ago
DeepSeek-R1: all 256 MoE experts stacked into a 524,288 x 7,168 matrix. 78.9x throughput vs cuBLAS, 98.7% energy reduction, 5,294 effective TFLOPS. Operator build time 0.11 seconds.
Llama 4 Scout: MoE FFN weights, 81.7x throughput, 98.8% energy reduction.
Mixtral 8x22B: 55.1x throughput across all 56 MoE layers, 98.2% energy reduction.
Qwen3-235B-A22B: 22.4x throughput, 95.5% energy reduction.
Llama 4 Maverick: 20.7x throughput, 81.5% energy reduction.
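For context on the DeepSeek-R1 shape quoted above: 524,288 rows divided across 256 experts implies 2,048 rows per expert slice. A minimal sketch of that arithmetic (the per-expert slice size is inferred from the totals, not stated by the authors):

```python
# Hypothetical sketch: how 256 stacked MoE expert slices could produce a
# 524,288 x 7,168 matrix. Slice height is inferred as 524,288 / 256 = 2,048;
# the library's actual memory layout may differ.
NUM_EXPERTS = 256
TOTAL_ROWS = 524_288
HIDDEN_DIM = 7_168

rows_per_expert = TOTAL_ROWS // NUM_EXPERTS
print(rows_per_expert)                      # 2048
print(NUM_EXPERTS * rows_per_expert)        # 524288
print((TOTAL_ROWS, HIDDEN_DIM))             # (524288, 7168)
```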
Each result is SHA-256 verified against a normalized output hash. The same hash has been reproduced independently by the University of Miami on NVIDIA B200, AMD MI300X, Intel Xeon, and Apple M4 Pro hardware, with the results published on Zenodo in December 2025.
The library requires no model retraining, quantization, or hardware changes; it operates directly on the weight matrices.
We are happy to answer questions about methodology, the hardware counters, or anything else.
rolv.ai