
New Method MR-GPTQ Boosts 4-Bit LLM Performance

Meet MR-GPTQ, the new quantization method that's making 4-bit weight formats faster and more accurate in large language models. This breakthrough could revolutionize LLM inference.


Researchers have developed a new quantization method, Micro-Rotated-GPTQ (MR-GPTQ), to improve the performance of 4-bit weight formats in large language models. The work, detailed in a paper by Ameya Godbole, Yuhang Song, Abhishek Gupta, and Priyank Jaini, aims to overcome the challenges of using microscaling formats such as NVFP4 and MXFP4.

The team evaluated various mathematical transformations, including the Discrete Cosine Transform and the Discrete Sine Transform, on the weights of the Llama-3-8B model. They found that existing methods struggle with formats like NVFP4 and MXFP4 due to design limitations. To address this, they introduced MR-GPTQ, which is tailored to the specific properties of these formats.
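
To illustrate the general idea behind rotation-based quantization, the sketch below applies a block-wise Hadamard rotation to groups of weights before rounding them to a 4-bit grid, then undoes the rotation. This is only a minimal illustration of the concept, not the authors' MR-GPTQ algorithm: the block size, the symmetric INT4 grid, and the plain round-to-nearest step are assumptions made for the example.

```python
# Minimal sketch: block-wise rotation before 4-bit quantization.
# Not the authors' MR-GPTQ; block size, INT4 grid, and round-to-nearest are assumptions.
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_group(x: np.ndarray) -> np.ndarray:
    """Symmetric round-to-nearest quantization of one group onto a 4-bit integer grid."""
    scale = np.abs(x).max() / 7.0 + 1e-12          # symmetric 4-bit range: integers in [-7, 7]
    return np.clip(np.round(x / scale), -7, 7) * scale

def quantize_rows(W: np.ndarray, block: int = 32, rotate: bool = True) -> np.ndarray:
    """Quantize each row group-by-group, optionally rotating each group first."""
    H = hadamard(block)
    out = np.empty_like(W)
    for r in range(W.shape[0]):
        for c in range(0, W.shape[1], block):
            w = W[r, c:c + block]
            if rotate:
                out[r, c:c + block] = H.T @ quantize_group(H @ w)  # rotate, quantize, rotate back
            else:
                out[r, c:c + block] = quantize_group(w)
    return out

# Toy comparison: rotation tends to help when a few weights are much larger than the rest,
# because spreading the outliers shrinks the per-group scale.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 128))
W[:, 5] *= 20.0                                    # inject an outlier column
print("error without rotation:", np.mean(np.abs(W - quantize_rows(W, rotate=False))))
print("error with rotation   :", np.mean(np.abs(W - quantize_rows(W, rotate=True))))
```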

The new algorithm achieved significant speedups, reaching up to 3.6x on NVIDIA B200 GPUs and 6x on RTX 5090 GPUs, while matching or even exceeding the accuracy of current state-of-the-art methods. The effectiveness of the transformations varied between the two formats, with NVFP4 generally yielding better scores than MXFP4. The study also highlighted the potential of microscaling 4-bit floating-point formats to revolutionize LLM inference, given recent hardware advances.
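
For readers unfamiliar with microscaling, the sketch below shows a simplified MXFP4-style encoding in which a block of 32 values shares a single power-of-two scale and each value is rounded to the nearest 4-bit (E2M1) magnitude. It follows the general shape of the OCP Microscaling format but omits bit packing and special-value handling, and it is not the paper's implementation.

```python
# Simplified MXFP4-style microscaling: a block of 32 values shares one power-of-two
# scale, and each value is rounded to the nearest FP4 (E2M1) magnitude.
# Illustration only; real MX encodings also specify bit layouts and special values.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes
BLOCK = 32

def quantize_mx_block(x: np.ndarray):
    """Return (fp4_values, shared_scale) for one block of 32 floats."""
    amax = np.abs(x).max()
    # Shared scale is a power of two; 2 is the exponent of the largest E2M1 normal (6.0).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - E2M1_GRID[None, :]), axis=1)  # nearest grid point
    return np.sign(x) * E2M1_GRID[idx], scale

def dequantize_mx_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

# Round-trip one block and report the worst-case error.
rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK)
q, s = quantize_mx_block(x)
print("max abs error:", np.abs(x - dequantize_mx_block(q, s)).max())
```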

The paper 'Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization' presents MR-GPTQ, a novel quantization algorithm that overcomes challenges in using 4-bit weight formats. By achieving substantial speedups and maintaining high accuracy, MR-GPTQ paves the way for more efficient and powerful large language models.
