Show HN: Efficient LLM Architectures for 32GB RAM (Ternary and Sparse Inference)
Posted by fatihturker | 3 hours ago | 1 comment
fatihturker 3 hours ago
One question I'm interested in exploring:
If models become heavily compressed and streamed from SSD, where do people think the real bottleneck moves to — storage bandwidth, memory bandwidth, or kernel efficiency?
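One way to frame the question is a back-of-envelope bandwidth comparison: if every weight must be touched once per generated token, whichever channel the weights flow through sets the ceiling. The sketch below uses illustrative assumptions only (a 7B-parameter model at ~1.58 bits/weight for ternary quantization, and laptop-class NVMe and DRAM bandwidth figures), not measurements of any real system:

```python
# Back-of-envelope bottleneck estimate for streaming a compressed model
# from SSD during inference. All hardware numbers are assumptions.

def tokens_per_second(bytes_per_token: float, bandwidth_bytes_per_s: float) -> float:
    """Tokens/s if the given channel is the only bottleneck."""
    return bandwidth_bytes_per_s / bytes_per_token

# Assumed model: 7B parameters at ~1.58 bits/weight (ternary),
# with a naive decode step that reads every weight once per token.
model_bytes = 7e9 * 1.58 / 8  # ~1.38 GB

# Assumed hardware (hypothetical laptop-class figures):
ssd_bw = 3.5e9   # NVMe sequential read, ~3.5 GB/s
dram_bw = 50e9   # DRAM bandwidth, ~50 GB/s

ssd_bound = tokens_per_second(model_bytes, ssd_bw)    # ~2.5 tok/s
dram_bound = tokens_per_second(model_bytes, dram_bw)  # ~36 tok/s

print(f"SSD-bound:  {ssd_bound:.1f} tok/s")
print(f"DRAM-bound: {dram_bound:.1f} tok/s")
```

Under these assumed numbers the SSD ceiling sits more than an order of magnitude below the DRAM ceiling, which suggests storage bandwidth dominates first unless caching or activation sparsity cuts the bytes read per token; once the hot weights fit in RAM, the bottleneck shifts to memory bandwidth and then kernel efficiency.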