Paper Link
Sehoon Kim et al (UC Berkeley)
Introduction
This paper proposes a pseudo-PTQ method that accounts for the weight distribution of LLMs and their outliers.
Motivation
- Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks.
- Deploying LLMs for inference has been a significant challenge due to their unprecedented resource requirements.
- This has forced existing deployment frameworks to use multi-GPU inference pipelines, or to use smaller and less performant models.
- They demonstrate that the main bottleneck for inference with LLMs is memory bandwidth, specifically for single-batch inference.
Contribution
Sensitivity-based Non-Uniform Quantization
- In LLaMA-7B, the weight distribution clearly demonstrates a non-uniform pattern.
- In LLM quantization, uniformly spacing the quantized values is sub-optimal because the weight distribution in neural networks is typically non-uniform.
- The main advantage of uniform quantization is fast and efficient reduced-precision computation, but this does not lead to end-to-end latency improvement in memory-bound LLM inference.
- While quantization introduces errors or perturbations in each layer, we need to minimize the overall perturbation with respect to the final loss term, rather than focusing on individual layers.
- The k-means centroids are placed closer to the values that are more sensitive with respect to the final loss, rather than treating all weight values equally.
- A Taylor series expansion is used to analyze how the model output changes in response to perturbations in the parameters W, which yields a per-weight sensitivity measure.
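The sensitivity-weighted centroid idea above can be sketched as a 1-D weighted k-means, where each weight's pull on its centroid is scaled by a per-weight sensitivity score (here just a generic array; the function name and arguments are illustrative, not the paper's actual API):

```python
import numpy as np

def sensitivity_weighted_kmeans(w, sensitivity, n_bits=3, n_iter=50):
    """Illustrative sketch: 1-D k-means over flattened weights, where each
    weight's contribution to its centroid is scaled by a per-weight
    sensitivity score. Centroids drift toward high-sensitivity values
    instead of treating all weights equally."""
    k = 2 ** n_bits
    w = w.ravel().astype(np.float64)
    s = sensitivity.ravel().astype(np.float64)
    # Initialize centroids uniformly over the weight range.
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # Sensitivity-weighted mean: sensitive weights dominate.
                centroids[j] = np.average(w[mask], weights=s[mask] + 1e-12)
    # Return the codebook and the de-quantized (snapped) weights.
    return centroids, centroids[assign]
```

With uniform sensitivity this reduces to ordinary k-means; concentrating sensitivity on a few weights pulls centroids toward those values.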
Dense and Sparse Decomposition
- A method to filter out outliers from the weight matrix W by performing a very simple, yet effective, decomposition of the weight matrix into a dense matrix (D) and a sparse matrix (S).
- The sparse part is obtained by identifying the outliers in a given layer and removing them from the weight matrix.
- The remainder is a dense matrix that can be quantized much more effectively thanks to its significantly reduced range of values, which in some cases shrinks by more than 10×.
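A minimal sketch of this decomposition, assuming outliers are selected by a magnitude percentile (the threshold choice here is illustrative, not the paper's exact criterion):

```python
import numpy as np

def dense_sparse_decompose(W, outlier_pct=0.5):
    """Illustrative dense-and-sparse split: move the top outlier_pct%
    of weights by magnitude into a sparse matrix S (kept full precision),
    leaving a dense remainder D with a much tighter value range that
    quantizes more effectively."""
    threshold = np.percentile(np.abs(W), 100.0 - outlier_pct)
    mask = np.abs(W) > threshold
    S = np.where(mask, W, 0.0)   # sparse outlier part, full precision
    D = np.where(mask, 0.0, W)   # dense part with reduced dynamic range
    return D, S                  # W is reconstructed exactly as D + S
```

Since W = D + S exactly, only the dense part passes through the quantizer, while the small sparse part can be stored in a compressed sparse format.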
Experiments
- A PTQ-like framework that enables near-lossless compression to ultra-low precision, down to 3-bit.
- Consistent performance improvements across different model sizes compared to GPTQ and AWQ.
Conclusion
- Nope