Paper Link
Sehoon Kim et al (UC Berkeley)
Introduction
This paper proposes a pseudo-PTQ method that accounts for the weight distribution of LLMs and their outliers.
Motivation
- Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks.
- Deploying LLMs for inference has been a significant challenge due to their unprecedented resource requirements.
- This has forced existing deployment frameworks to use multi-GPU inference pipelines, or to use smaller and less performant models.
- They demonstrate that the main bottleneck for inference with LLMs is memory bandwidth, specifically for single-batch inference.
Contribution
Sensitivity-based Non-Uniform Quantization
- In LLaMA-7B, the weight distribution clearly demonstrates a non-uniform pattern.
- In LLM quantization, uniformly spacing the quantized values is sub-optimal because the weight distribution in neural networks is typically non-uniform.
- The main advantage of uniform quantization is fast and efficient reduced-precision computation, but this does not lead to end-to-end latency improvement in memory-bound LLM inference.
- While quantization introduces errors or perturbations in each layer, we need to minimize the overall perturbation with respect to the final loss term, rather than focusing on individual layers.
- The k-means centroids are placed closer to the values that are more sensitive with respect to the final loss, rather than treating all weight values equally.
- A Taylor series expansion is used to analyze how the model output changes in response to perturbations in the parameters W, which yields a per-weight sensitivity measure.
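The sensitivity-weighted centroid idea above can be sketched as a 1-D weighted k-means, where each weight's pull on its centroid is scaled by a per-weight sensitivity score (here just a generic array; the function name and arguments are illustrative, not the paper's actual API):

```python
import numpy as np

def sensitivity_weighted_kmeans(w, sensitivity, n_bits=3, n_iter=50):
    """Illustrative sketch: 1-D k-means over flattened weights, where each
    weight's contribution to its centroid is scaled by a per-weight
    sensitivity score. Centroids drift toward high-sensitivity values
    instead of treating all weights equally."""
    k = 2 ** n_bits
    w = w.ravel().astype(np.float64)
    s = sensitivity.ravel().astype(np.float64)
    # Initialize centroids uniformly over the weight range.
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # Sensitivity-weighted mean: sensitive weights dominate.
                centroids[j] = np.average(w[mask], weights=s[mask] + 1e-12)
    # Return the codebook and the de-quantized (snapped) weights.
    return centroids, centroids[assign]
```

With uniform sensitivity this reduces to ordinary k-means; concentrating sensitivity on a few weights pulls centroids toward those values.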
Dense and Sparse Decomposition
- A method to filter out outliers from the weight matrix W by performing a very simple, yet effective, decomposition of the weight matrix into a dense matrix (D) and a sparse matrix (S).
- The sparse part is obtained by identifying the outliers in a given layer and removing them from the weight matrix.
- The remainder is a dense matrix that can be quantized much more effectively thanks to its significantly reduced range of values, which in some cases shrinks by more than 10×.
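A minimal sketch of this decomposition, assuming outliers are selected by a magnitude percentile (the threshold choice here is illustrative, not the paper's exact criterion):

```python
import numpy as np

def dense_sparse_decompose(W, outlier_pct=0.5):
    """Illustrative dense-and-sparse split: move the top outlier_pct%
    of weights by magnitude into a sparse matrix S (kept full precision),
    leaving a dense remainder D with a much tighter value range that
    quantizes more effectively."""
    threshold = np.percentile(np.abs(W), 100.0 - outlier_pct)
    mask = np.abs(W) > threshold
    S = np.where(mask, W, 0.0)   # sparse outlier part, full precision
    D = np.where(mask, 0.0, W)   # dense part with reduced dynamic range
    return D, S                  # W is reconstructed exactly as D + S
```

Since W = D + S exactly, only the dense part passes through the quantizer, while the small sparse part can be stored in a compressed sparse format.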
Experiments
- A PTQ-like framework that enables near-lossless compression to ultra-low precision, down to 3-bit.
- Consistent performance improvements across different model sizes compared to GPTQ and AWQ.
Conclusion
- Nope