fumishiki 3 hours ago
The pitch: &a * &b for matmul, fuse!(x.sin().powf(2.0); x) for GPU kernel fusion, loss.backward() for autodiff. The same code runs on CPU, Vulkan, CUDA, or AMD; you switch backends with a single Cargo feature flag.
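For anyone unfamiliar with how &a * &b can mean matmul in Rust: you implement std::ops::Mul on tensor references. Here's a minimal, self-contained sketch of that mechanism (the Tensor type and naive triple loop are mine for illustration; the actual library presumably dispatches to BLAS or GPU kernels):

    use std::ops::Mul;

    // Minimal dense 2-D tensor: row-major data plus shape.
    struct Tensor {
        data: Vec<f32>,
        rows: usize,
        cols: usize,
    }

    // Implementing Mul on references is what makes `&a * &b` legal Rust;
    // here it performs a matmul rather than an element-wise multiply.
    impl<'a, 'b> Mul<&'b Tensor> for &'a Tensor {
        type Output = Tensor;
        fn mul(self, rhs: &'b Tensor) -> Tensor {
            assert_eq!(self.cols, rhs.rows, "shape mismatch");
            let mut out = vec![0.0; self.rows * rhs.cols];
            for i in 0..self.rows {
                for k in 0..self.cols {
                    let a = self.data[i * self.cols + k];
                    for j in 0..rhs.cols {
                        out[i * rhs.cols + j] += a * rhs.data[k * rhs.cols + j];
                    }
                }
            }
            Tensor { data: out, rows: self.rows, cols: rhs.cols }
        }
    }

    fn main() {
        let a = Tensor { data: vec![1.0, 2.0, 3.0, 4.0], rows: 2, cols: 2 };
        let b = Tensor { data: vec![5.0, 6.0, 7.0, 8.0], rows: 2, cols: 2 };
        let c = &a * &b; // matmul via operator overloading
        println!("{:?} ({}x{})", c.data, c.rows, c.cols);
    }

Taking references rather than values also keeps the operands alive after the multiply, which is presumably why the API is &a * &b instead of a * b: the autodiff tape behind loss.backward() still needs them for the backward pass.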
Benchmarks vs PyTorch on a GH200: eager-mode training is 4–6× faster across all batch sizes, and the CUDA Graph path beats PyTorch at batch sizes ≥ 128. All reproducible via cd benchmarks && bash run.sh.