
Unweight: We compressed an LLM 22% without sacrificing quality

Posted by subset | 2 hours ago | 1 comment

ttd 2 hours ago

I love these optimization tales. Memory throughput bottlenecks (extremely common, perhaps more so than they seem) are my favorite to tackle - there are frequently some juicy optimizations that apply there.

Do model weights have any spatial locality that can be exploited? If so, there are some more general pre-compression techniques that might be interesting to try, e.g. bitshuffle is one I've worked with (https://github.com/kiyo-masui/bitshuffle).
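To illustrate the idea (this is a rough byte-level sketch in plain NumPy, not the actual bitshuffle library, which transposes at the bit level and is much faster): if neighboring weights vary smoothly, regrouping the bytes so that byte 0 of every float comes first, then byte 1, and so on, puts the near-constant sign/exponent bytes next to each other, which a generic compressor can exploit. The synthetic "weights" array here is an assumption standing in for real model data.

```python
import zlib
import numpy as np

# Synthetic stand-in for weights with spatial locality: a slow random walk.
rng = np.random.default_rng(0)
weights = np.cumsum(rng.normal(scale=1e-3, size=65536)).astype(np.float32)

raw = weights.tobytes()

# Byte-shuffle: view the floats as bytes, then transpose so all the
# 0th bytes come first, then all the 1st bytes, etc. The high bytes
# (sign/exponent) barely change between neighbors, so they form long
# runs that compress well.
shuffled = weights.view(np.uint8).reshape(-1, 4).T.tobytes()

print("raw compressed:     ", len(zlib.compress(raw, 6)))
print("shuffled compressed:", len(zlib.compress(shuffled, 6)))
```

On data like this the shuffled layout typically compresses noticeably smaller; on weights with no locality it may not help at all, which is why it's worth measuring first.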

Another fun fact: in some scenarios (it depends a lot on CPU and memory characteristics), gzip+memcpy+gunzip can be faster end-to-end than just memcpy. I forget where I first heard this, but my familiarity comes from the blosc compression library.
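The shape of that pipeline looks roughly like this (a toy sketch using zlib; in practice the win only appears with very fast codecs like LZ4/blosclz on bandwidth-bound hardware, and Python's zlib overhead will usually lose - the point is the structure, not the timings):

```python
import time
import zlib

# Hypothetical payload: 16 MiB of highly compressible data.
data = bytes(range(256)) * (16 * 1024 * 1024 // 256)

# Baseline: plain copy (stands in for memcpy).
t0 = time.perf_counter()
copy = bytes(data)
t_copy = time.perf_counter() - t0

# Pipeline: compress once, move the much smaller buffer, decompress
# at the destination. Fewer bytes cross the memory bus.
t0 = time.perf_counter()
packed = zlib.compress(data, 1)
moved = bytes(packed)        # the "memcpy" now touches far less data
unpacked = zlib.decompress(moved)
t_pipe = time.perf_counter() - t0

print(f"copy: {t_copy * 1e3:.1f} ms, "
      f"compress+copy+decompress: {t_pipe * 1e3:.1f} ms")
```

Whether the pipeline wins comes down to whether (de)compression throughput exceeds the memory bandwidth you save, which is exactly the trade blosc is designed around.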