Applying KL Divergence in LLM Quantization
Report Highlights
Applying KL Divergence in LLM Quantization is a report that guides you through the quantization of Large Language Models and the quantification of the information lost in this process.
You will learn about the advantages and disadvantages of LLM quantization through the following subjects:
- What is LLM quantization?
- Model compression and LLM quantization;
- What is KL divergence? The KL divergence formula;
- KL Divergence in LLM Quantization — example;
- Conclusions;
- And much more.
So, learn how to apply different statistical tools and see how trustworthy your LLM can be.
Beyond this report, you will see that we have a lot of content and tools for you. Expand your learning skills by registering for an Omni Calculator Account today. It is fast, easy, and it allows you to create, edit, and share calculators. You can also access your previously used tools in the blink of an eye.
You’ve probably heard the word “quantization” a lot, mainly in the context of quantum mechanics. But calm down: when we talk about LLM quantization, we do not mean we are doing quantum computation. The word quantum describes small, discrete packets of a quantity. Thus, LLM quantization means we reduce the numerical precision of an LLM’s weights and activations, shrinking the model size and reducing memory usage when running it.
This process allows running larger models on GPUs with less VRAM, speeding up inference, reducing power consumption, and improving compatibility with a wider range of hardware. This method also means moving from high-precision data — usually FP32 or FP16 — to a lower-precision format, such as 8- or 4-bit integers. So, in practice, we are essentially representing each value with fewer bits.
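To get a feel for those savings, here is a minimal back-of-the-envelope sketch in Python. The 7-billion-parameter model size is just an illustrative assumption, and the figures count weights only, ignoring activations and runtime overhead:

```python
# Rough weight-memory footprint of a hypothetical 7B-parameter model at different precisions.
PARAMS = 7e9  # illustrative parameter count (assumption, not a figure from this report)

bits_per_value = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for fmt, bits in bits_per_value.items():
    gigabytes = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes (decimal GB)
    print(f"{fmt}: ~{gigabytes:.1f} GB of weights")
```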
This technique is widely applied across areas such as signal processing, big data compression, and machine learning. There are multiple methods for quantizing an LLM; the most popular include absolute max quantization, affine quantization, activation-aware weight quantization, SmoothQuant, and GPTQ.

The absolute max quantization method uses the vector's maximum absolute value for symmetric scaling, prioritizing speed and simplicity. In contrast, affine quantization offers asymmetric scaling with a zero-point offset to better preserve the data distribution.
If you want to address the specific fragility of large language models, you can use Activation-Aware Weight Quantization (AWQ). This method identifies and protects the most critical weight channels based on activation magnitude, enabling high-accuracy 4-bit compression without hardware overhead. Meanwhile, SmoothQuant mathematically migrates the quantization difficulty from erratic activation outliers to the more stable weight vectors, facilitating efficient 8-bit inference for both weights and activations.
Finally, GPTQ uses second-order Hessian information to perform a layer-wise, one-shot weight compression that minimizes output error, preserving accuracy even at 3-bit precision.
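Since the worked example later in this report covers absolute max (symmetric) quantization, here is a minimal NumPy sketch of the affine (asymmetric) variant described above. The function names and the sample vector are our own illustrative choices, not any library's API, and a real implementation would also handle edge cases such as constant tensors:

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Asymmetric (affine) quantization: map [min(x), max(x)] onto an unsigned integer grid."""
    qmin, qmax = 0, 2**num_bits - 1                   # e.g., [0, 255] for 8 bits
    scale = (x.max() - x.min()) / (qmax - qmin)       # real-valued step size of the grid
    zero_point = int(round(qmin - x.min() / scale))   # integer that represents the real value 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.3, -1.1, 4.7, 2.0, -0.4], dtype=np.float32)
q, s, z = affine_quantize(x)
print(q)                              # integers in [0, 255]
print(affine_dequantize(q, s, z))     # close to x, but not identical: precision was lost
```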
🙋 You can deeply understand the concepts behind floating-point (FP32) and binary numbers by using our floating-point calculator and binary calculator. Moreover, if you want to learn more about LLMs, check out our articles dedicated to:
As we pointed out, there are different techniques for model compression. Let us consider an example of quantization using the absolute max method. Suppose that you ask the following question to your favorite AI: Why is the sky blue?
To provide an answer, the LLM will convert your prompt into numerical vectors, known as embeddings. The embeddings will enter one of the LLM’s layers and be processed to find a proper answer. Let us consider that the FP32 embedding of the word blue is:
From this vector, we can see that its absolute maximum is AM=5.2, which is the absolute value of the vector's largest-magnitude entry. Now, we can quantize this vector to INT8. The corresponding INT8 numbers of an embedding are in the range [−127,127].
The quantization procedure can be done by calculating the scaling factor, whose equation is:

$$S = \frac{\text{INT8}_{\max}}{AM}$$

where INT8max=127. Therefore, we find that S=18.41. Now, we multiply every entry of our initial vector by the scaling factor and round the results to the nearest integer. Thus, we find the following INT8 embedding:
You can see that the rounding process introduces some precision loss into this mapping. This effect is expected after the quantization procedure. The advantage is that we can now process a proper answer to the prompt using fewer computational resources.
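The whole procedure fits in a few lines of Python. Since the example embedding is not reproduced here, the vector below is a hypothetical stand-in, but the steps (absolute maximum, scaling factor, rounding) are exactly the ones described above:

```python
import numpy as np

# Hypothetical FP32 "embedding" standing in for the example vector
x = np.array([0.8, -2.4, 4.1, 1.5, -3.3], dtype=np.float32)

INT8_MAX = 127
abs_max = np.max(np.abs(x))                 # AM, the absolute maximum of the vector
scale = INT8_MAX / abs_max                  # scaling factor S
q = np.round(x * scale).astype(np.int8)     # quantized INT8 embedding

x_restored = q / scale                      # dequantize to inspect the precision loss
print(q)                                    # e.g., [  25  -74  127   46 -102]
print(x - x_restored)                       # small rounding errors: the information lost
```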
One relevant question to raise is how much information we lose due to any quantization method. We will address this topic in the next sections of this report.
The Kullback-Leibler divergence, or KL divergence, is a statistical metric that measures the difference between two probability distributions. This metric is positive if the distributions differ, and zero if they are equal. The KL divergence formula is such that:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x)\,\ln\!\left(\frac{P(x)}{Q(x)}\right)$$
where:
- P(x) — True probability distribution;
- Q(x) — Approximate probability distribution; and
- DKL — Kullback-Leibler divergence.
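As a quick sanity check of the formula, here is a minimal Python sketch of the discrete KL divergence. It uses the natural logarithm, matching the worked example later in this report, so the result is expressed in nats:

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats (natural logarithm)."""
    # Terms with P(x) = 0 contribute nothing; Q(x) = 0 with P(x) > 0 would make it infinite.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0 -> identical distributions
print(kl_divergence([0.9, 0.1], [0.6, 0.4]))   # ~0.226 -> positive when distributions differ
```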
In the context of LLM quantization, the KL divergence is relevant to:
- Calibrate activations to find the optimal scaling factors during post-training quantization (PTQ), as sketched after this list; and
- Evaluate output degradation by comparing the token probabilities of the original model versus the quantized model.
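To illustrate the first point, here is a toy Python sketch of KL-based calibration, loosely inspired by the entropy-calibration idea: try several clipping thresholds for the activations, quantize with each candidate, and keep the threshold whose quantized distribution diverges least from the original. The synthetic activations, bin count, and candidate grid are all illustrative assumptions, not a production recipe:

```python
import numpy as np

def quantize_dequantize(x, clip, num_steps=127):
    """Symmetric quantization with saturation at the clipping threshold `clip`."""
    scale = clip / num_steps
    return np.clip(np.round(x / scale), -num_steps, num_steps) * scale

def kl_between_samples(a, b, bins, eps=1e-10):
    """Approximate KL divergence between the empirical distributions of two samples."""
    p, _ = np.histogram(a, bins=bins)
    q, _ = np.histogram(b, bins=bins)
    p = p / p.sum() + eps   # small epsilon avoids log(0) and division by zero
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

# Synthetic calibration activations: mostly small values plus a few large outliers.
rng = np.random.default_rng(0)
acts = np.concatenate([rng.normal(0.0, 1.0, 10_000), rng.normal(0.0, 8.0, 50)])

bins = np.linspace(acts.min(), acts.max(), 201)
candidates = np.linspace(1.0, np.abs(acts).max(), 20)   # candidate clipping thresholds
best = min(candidates, key=lambda c: kl_between_samples(acts, quantize_dequantize(acts, c), bins))
print(f"Chosen clipping threshold: {best:.2f}")
```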
As we can see, the KL divergence can be a useful metric for benchmarking. Remember that benchmarks are the ultimate test of an LLM’s capability, and they only consider the final outcome of a prompt, that is, whether the model got the answer right or wrong.
So, computing the KL divergence during the quantization process can be highly relevant, as it serves as a predictive, diagnostic, and optimization tool that directly influences benchmark scores.

🙋 Learn more about probability and distributions with our probability calculator and normal distribution calculator. Feel free to also access our articles on these subjects:
Let us see in detail an example of the KL divergence in LLM quantization. In this case, we will compare two LLMs, an FP16 and an INT4, to find the answer to the following prompt: What is the largest ocean on Earth?
Suppose that the FP16 LLM gives us the following probabilities for each possible answer:
- Pacific: 0.90
- Atlantic: 0.05
- Indian: 0.03
- (Others): 0.02
And the probabilities for each possible answer generated by the INT4 are:
- Pacific: 0.75
- Atlantic: 0.15
- Indian: 0.05
- (Others): 0.05
Now, we can substitute these values into the formula for the KL divergence. In order to make the calculation clear, we present the KL divergence term for the Pacific Ocean in detail:

$$P(\text{Pacific})\,\ln\!\left(\frac{P(\text{Pacific})}{Q(\text{Pacific})}\right) = 0.90 \times \ln\!\left(\frac{0.90}{0.75}\right) \approx 0.164$$
Thus, the total KL divergence is such that:

$$D_{KL}(P \parallel Q) = 0.90\,\ln\frac{0.90}{0.75} + 0.05\,\ln\frac{0.05}{0.15} + 0.03\,\ln\frac{0.03}{0.05} + 0.02\,\ln\frac{0.02}{0.05} \approx 0.164 - 0.055 - 0.015 - 0.018 \approx 0.076$$
The negative terms in the previous equation indicate that the quantized model is overestimating the wrong answers relative to the original LLM. In such cases, Q(x)>P(x) and ln(P(x)/Q(x))<0.
Since the KL divergence is nonzero, it means that part of the information was lost in the quantization process and that the distributions are not identical. We can verify that the original model was highly confident that the correct answer was Pacific (90%), whereas the INT4 model was less confident (75%). This drop in confidence had a heavier impact on the KL divergence.
Moreover, the INT4 model increased its confidence in the wrong answers, like “Atlantic”. This effect shows that the quantized model is more likely to produce erroneous or off-target answers, often described as hallucinations, than the original one.
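To double-check the arithmetic of this example, here is a short, self-contained Python sketch that reproduces the numbers above:

```python
import math

# Token probabilities for "What is the largest ocean on Earth?"
p_fp16 = [0.90, 0.05, 0.03, 0.02]   # original FP16 model: Pacific, Atlantic, Indian, others
q_int4 = [0.75, 0.15, 0.05, 0.05]   # quantized INT4 model

terms = [p * math.log(p / q) for p, q in zip(p_fp16, q_int4)]
print([round(t, 3) for t in terms])   # [0.164, -0.055, -0.015, -0.018]
print(round(sum(terms), 3))           # 0.076 nats -> nonzero, so some information was lost
```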
This report introduced the main concepts behind LLM quantization and how this process can be evaluated using statistical metrics such as the Kullback-Leibler divergence.
As discussed, quantizing LLMs offers benefits, such as enabling larger models to run on GPUs with less VRAM, increasing inference speed, reducing power consumption, and enhancing hardware compatibility. However, this process also introduces challenges, including potential information loss and hallucinations. Consequently, quantization can significantly impact model performance on benchmarks.
We hope this journey helped you understand LLM quantization and the KL divergence from a new perspective.
This article was written by João Rafael Lucio dos Santos and reviewed by Steven Wooding.