What is Quantization?

Quantization in AI is the process of converting a model's weights and activations from high-precision numbers (such as 32-bit floating point) to lower-precision formats (such as 8-bit integers) so the model uses less memory and runs faster. It is a core model compression technique used to deploy large neural networks on resource-constrained hardware like phones, laptops, and edge devices, often with minimal accuracy loss.

Instead of storing every parameter as a 32-bit floating-point number, a quantized model stores it in an 8-bit, 4-bit, or even lower-precision format. Because weight storage scales linearly with bit width, this single change can shrink a model by 2–8x relative to float32 (2x at 16 bits, 4x at 8 bits, 8x at 4 bits) and usually speeds up inference, making it practical to run large models on phones, laptops, browsers, and embedded devices.

How quantization works

Each weight is originally stored as a high-precision floating-point number, but most of that precision is rarely needed. Quantization maps the original range of values onto a smaller set of representable levels. In post-training quantization (PTQ), a fully trained model is converted once, typically by scaling the float weights so they fit into a narrower integer range. A simple linear mapping of the form quantized = round(weight / scale) + zero_point does most of the work, and the same scale and zero_point are used to dequantize values back to floats during inference: weight ≈ scale * (quantized - zero_point).
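The linear mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the helper names (compute_qparams, quantize, dequantize), the unsigned 8-bit target range, and the sample weights are all assumptions made for the example.

```python
def compute_qparams(w_min, w_max, num_bits=8):
    """Return (scale, zero_point) mapping [w_min, w_max] onto unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    w_min, w_max = min(w_min, 0.0), max(w_max, 0.0)
    # Fall back to 1.0 if every weight is zero, to avoid dividing by zero later.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - w_min / scale)
    return scale, zero_point


def quantize(w, scale, zero_point, num_bits=8):
    """quantized = round(weight / scale) + zero_point, clamped to the integer range."""
    q = round(w / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Approximate the original float from its integer code."""
    return scale * (q - zero_point)


# Hypothetical weights, chosen only for illustration.
weights = [-0.62, -0.1, 0.0, 0.37, 0.91]
scale, zero_point = compute_qparams(min(weights), max(weights))
codes = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in codes]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", max_error, "half a step:", scale / 2)
```

As long as no value is clamped, the round-trip error of each weight stays within half a quantization step (scale / 2), which is why choosing a tight range matters.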

For example, an 8-bit integer can only represent 256 distinct values, so a layer whose weights originally span [-1.0, 1.0] in float32 must bucket them into 256 evenly spaced levels, each about 0.0078 apart. The closer those levels are tuned to the actual weight distribution, the less accuracy is lost. When PTQ alone loses too much accuracy, quantization-aware training (QAT) simulates rounding errors during fine-tuning so the model adapts to the noise, often recovering nearly all of the original accuracy.
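The 256-level example can be checked with a small sketch. The function below snaps a value to the nearest level and returns it as a float again, which is the kind of "fake quantization" a QAT forward pass simulates. The name fake_quantize and its defaults are assumptions for illustration, and the gradient handling real QAT needs (commonly a straight-through estimator) is not shown.

```python
def fake_quantize(w, w_min=-1.0, w_max=1.0, num_bits=8):
    """Snap w to the nearest of 2**num_bits evenly spaced levels in [w_min, w_max]."""
    levels = 2 ** num_bits
    step = (w_max - w_min) / (levels - 1)
    w = max(w_min, min(w_max, w))          # clip values outside the range
    index = round((w - w_min) / step)      # integer level index, 0 .. levels - 1
    return w_min + index * step            # back to a float on the quantization grid


step = 2.0 / 255                           # spacing of the 256 levels across [-1.0, 1.0]

# Sweep the range densely and count how many distinct outputs survive.
samples = [-1.0 + i * 0.0002 for i in range(10001)]
distinct_levels = {fake_quantize(w) for w in samples}

print("step size:", step)
print("distinct levels:", len(distinct_levels))
print("0.3 becomes:", fake_quantize(0.3))
```

Every input ends up on one of the 256 grid points, so the worst-case rounding error inside the range is half a step.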

Why it matters

Quantization is what lets a multi-billion-parameter model fit into a few gigabytes of RAM and run at usable speeds on a laptop. For instance, a 7-billion-parameter model needs about 28 GB for its weights in float32 but only about 3.5 GB at 4 bits. It cuts energy use, reduces server costs, and unlocks on-device AI for privacy-sensitive or offline use cases. It also interacts with hardware: many modern GPUs, NPUs, and CPUs include dedicated low-precision (INT8, and in some cases INT4) matrix units, so a quantized model can run several times faster than the same model in float32 when the hardware and kernels support it.
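The memory claim is simple arithmetic: weight storage is the parameter count times the bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only, ignoring activations, caches, and the small overhead of storing scales and zero points.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical model size, chosen for illustration

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take 28 GB at 32 bits, 14 GB at 16 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits.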

Key types

  • Post-training quantization (PTQ): Converts an already-trained model. Cheapest option, small accuracy drop.
  • Quantization-aware training (QAT): Simulates quantization during training so weights adapt. Better accuracy, requires extra compute.
  • Dynamic quantization: Quantizes weights ahead of time and quantizes activations at runtime, computing their scale on the fly from the values actually seen. Useful for NLP models with variable sequence lengths.
  • Weight-only quantization: Stores weights in 4-bit or lower and dequantizes them on the fly, while activations stay in higher precision. Common for serving large language models.
  • GPTQ, AWQ, GGUF: GPTQ and AWQ are popular algorithms for low-bit (often 4-bit) LLM quantization that use different schemes to preserve accuracy. GGUF is a file format for distributing quantized models.
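As a rough illustration of weight-only quantization, the sketch below stores weights as 4-bit signed integers with one scale per small group and dequantizes them on demand. It is a generic group-wise scheme written for this example, not the GPTQ or AWQ algorithm and not the GGUF on-disk layout. The function names, the group size of 4, and the sample weights are assumptions, and real systems generally use larger groups.

```python
def quantize_group_int4(group):
    """Symmetric 4-bit quantization of one group: integer codes in [-8, 7] plus a scale."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in group]
    return codes, scale


def quantize_weight_only(weights, group_size=4):
    """Quantize a flat list of weights group by group, each with its own scale."""
    return [
        quantize_group_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize_weight_only(packed):
    """Rebuild approximate float weights from (codes, scale) pairs."""
    return [q * scale for codes, scale in packed for q in codes]


# Hypothetical weights: one group of small values, one group of large values.
weights = [0.02, -0.01, 0.03, 0.015, 1.2, -0.9, 0.4, -1.1]

packed = quantize_weight_only(weights, group_size=4)
restored = dequantize_weight_only(packed)

# For comparison: a single scale shared by all eight weights.
single_codes, single_scale = quantize_group_int4(weights)

print("per-group codes:", [codes for codes, _ in packed])
print("single-scale codes:", single_codes)
```

With one shared scale, the four small weights all collapse to code 0, while per-group scales keep them distinguishable. That is one reason low-bit schemes track scales at a finer granularity than the whole tensor.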

Quantization has become a default step in the AI deployment pipeline. Tools such as PyTorch's torch.quantization, NVIDIA TensorRT, and ONNX Runtime build these techniques into production stacks, letting teams trade a small amount of accuracy for substantial gains in speed, memory, and cost.

Frequently Asked Questions

Does quantization reduce AI accuracy?
It can, but usually only slightly. Aggressive quantization to 4-bit can cause noticeable drops on harder tasks, while 8-bit quantization typically preserves accuracy within about 1% of the original model. Quantization-aware training helps recover most of any lost accuracy.
What is the difference between quantization and pruning?
Quantization reduces the precision of each individual number, while pruning removes entire weights or neurons that contribute little. They are complementary compression techniques, and combining them can yield even smaller, faster models.
Why is 4-bit quantization popular for LLMs?
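Large language models have billions of parameters, so weight memory scales almost directly with bit width: halving the bit width roughly halves memory use, and going from 16-bit to 4-bit cuts it to about a quarter. That can let a model that would otherwise need an 80GB GPU fit on a single 24GB consumer GPU. Algorithms like GPTQ and AWQ tune the quantization to preserve quality at 4 bits, and formats like GGUF package the resulting low-bit weights for distribution.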
Can quantization be undone?
No. Quantization is a lossy mapping, so the original full-precision weights cannot be perfectly recovered from the quantized model alone. However, the dequantized values used at inference time are usually close enough that downstream outputs are very similar to those of the original model, especially at 8 bits.