fumishiki 3 hours ago
The pitch: &a * &b for matmul, fuse!(x.sin().powf(2.0); x) for GPU kernel fusion, loss.backward() for autodiff. The same code runs on CPU, Vulkan, CUDA, or AMD; you switch backends with a single Cargo feature flag.
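For anyone unfamiliar with how &a * &b can mean matmul in Rust: you implement std::ops::Mul on tensor references. Here's a minimal, self-contained sketch of that mechanism (the Tensor type and naive triple loop are mine for illustration; the actual library presumably dispatches to BLAS or GPU kernels):

    use std::ops::Mul;

    // Minimal dense 2-D tensor: row-major data plus shape.
    struct Tensor {
        data: Vec<f32>,
        rows: usize,
        cols: usize,
    }

    // Implementing Mul on references is what makes `&a * &b` legal Rust;
    // here it performs a matmul rather than an element-wise multiply.
    impl<'a, 'b> Mul<&'b Tensor> for &'a Tensor {
        type Output = Tensor;
        fn mul(self, rhs: &'b Tensor) -> Tensor {
            assert_eq!(self.cols, rhs.rows, "shape mismatch");
            let mut out = vec![0.0; self.rows * rhs.cols];
            for i in 0..self.rows {
                for k in 0..self.cols {
                    let a = self.data[i * self.cols + k];
                    for j in 0..rhs.cols {
                        out[i * rhs.cols + j] += a * rhs.data[k * rhs.cols + j];
                    }
                }
            }
            Tensor { data: out, rows: self.rows, cols: rhs.cols }
        }
    }

    fn main() {
        let a = Tensor { data: vec![1.0, 2.0, 3.0, 4.0], rows: 2, cols: 2 };
        let b = Tensor { data: vec![5.0, 6.0, 7.0, 8.0], rows: 2, cols: 2 };
        let c = &a * &b; // matmul via operator overloading
        println!("{:?} ({}x{})", c.data, c.rows, c.cols);
    }

Taking references rather than values also keeps the operands alive after the multiply, which is presumably why the API is &a * &b instead of a * b: the autodiff tape behind loss.backward() still needs them for the backward pass.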
Benchmarks vs PyTorch on a GH200: eager-mode training is 4–6× faster across all batch sizes, and the CUDA Graph path beats PyTorch at batch sizes ≥ 128. All reproducible via cd benchmarks && bash run.sh.