Quantization in AI is a model compression technique that reduces the numerical precision of a neural network's weights and activations. Instead of storing every parameter as a 32-bit floating-point number, quantized models use 8-bit, 4-bit, or even lower-precision formats. Because storage and memory traffic scale with the number of bits per value, this single change can shrink a model by roughly 4x at 8 bits and 8x at 4 bits relative to float32, and it usually speeds up inference, making it practical to run large models on phones, laptops, browsers, and embedded devices.
How quantization works
Each weight is originally stored as a high-precision floating-point value, but most of that precision is rarely needed. Quantization maps the original range of values onto a smaller set of representable levels. In post-training quantization (PTQ), a fully trained model is converted once, typically by scaling the float weights so they fit into a narrower integer range. A simple linear mapping of the form quantized = round(weight / scale) + zero_point does most of the work, with the result clamped to the integer range. The same scale and zero_point recover an approximate float value during inference: weight ≈ scale * (quantized - zero_point).
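The linear mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular library implements it: the helper names (choose_qparams, quantize, dequantize), the unsigned 8-bit range [0, 255], and the sample weights are all invented for the example.

```python
from typing import List, Tuple


def choose_qparams(values: List[float], qmin: int = 0, qmax: int = 255) -> Tuple[float, int]:
    """Pick scale and zero_point so the value range maps onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, qmin
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> List[int]:
    """quantized = round(weight / scale) + zero_point, clamped to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """weight ~= scale * (quantized - zero_point)."""
    return [scale * (x - zero_point) for x in q]


# Hypothetical weights, chosen only for illustration.
weights = [-0.62, -0.1, 0.0, 0.27, 0.91]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {qi:3d} -> {r:+.4f}")
```

Each restored value differs from the original by at most half of scale, which is the rounding error that the rest of this article refers to.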
For example, an 8-bit integer can only represent 256 distinct values, so a layer whose weights originally span [-1.0, 1.0] in float32 must bucket them into 256 evenly spaced levels, about 0.0078 apart (2/255). The closer those levels are tuned to the actual weight distribution, the less accuracy is lost. When PTQ alone loses too much accuracy, quantization-aware training (QAT) simulates rounding errors during fine-tuning so the model adapts to the noise, often recovering nearly all of the original accuracy.
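To make the bucketing concrete, the sketch below computes the spacing between levels for the [-1.0, 1.0] example at several bit widths, and rounds a value to its nearest level. The round trip (quantize, then immediately dequantize) is the kind of rounding noise that QAT simulates in the forward pass. The function names step_size and fake_quantize are assumptions made for this illustration, and the sketch leaves out the training loop and gradient handling that real QAT needs.

```python
def step_size(lo: float, hi: float, bits: int) -> float:
    """Spacing between adjacent levels when [lo, hi] is covered by 2**bits levels."""
    levels = 2 ** bits
    # 2**bits levels leave (2**bits - 1) gaps between them.
    return (hi - lo) / (levels - 1)


def fake_quantize(w: float, lo: float, hi: float, bits: int) -> float:
    """Snap w to the nearest representable level and return it as a float."""
    step = step_size(lo, hi, bits)
    index = round((w - lo) / step)
    index = max(0, min(2 ** bits - 1, index))  # clamp to the valid level indices
    return lo + index * step


for bits in (8, 4, 2):
    step = step_size(-1.0, 1.0, bits)
    snapped = fake_quantize(0.3, -1.0, 1.0, bits)
    print(f"{bits}-bit: {2 ** bits} levels, step {step:.5f}, 0.3 -> {snapped:.5f}")
```

The worst-case rounding error for an in-range value is half a step, so every bit removed roughly doubles it.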
Why it matters
Quantization is what lets a multi-billion-parameter model fit into a few gigabytes of RAM and respond at interactive speeds on a laptop CPU. It cuts energy use, reduces server costs, and unlocks on-device AI for privacy-sensitive or offline use cases. It also interacts with hardware: many modern GPUs, NPUs, and CPUs include dedicated low-precision matrix instructions (INT8 widely, INT4 on some), so on hardware with that support a quantized model can run several times faster than the same model in float32.
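The memory claim is simple arithmetic: weight storage is the parameter count times the bits per weight. The sketch below works it through for a hypothetical 7-billion-parameter model. The function name weight_memory_gb and the model size are assumptions for illustration, and the figures cover weights only, ignoring activations, caches, and the small overhead of stored scales and zero points.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7_000_000_000  # hypothetical 7-billion-parameter model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weight_memory_gb(params, bits):5.1f} GB")
```

Under these assumptions the weights take 28 GB in float32 but 3.5 GB at 4 bits, which is the difference between needing a server and fitting on a laptop.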
Key types
- Post-training quantization (PTQ): Converts an already-trained model without retraining. Cheapest option, usually with a small accuracy drop at 8 bits and a larger one at lower bit widths.
- Quantization-aware training (QAT): Simulates quantization during training so weights adapt. Better accuracy, requires extra compute.
- Dynamic quantization: Quantizes weights ahead of time and quantizes activations on the fly at inference, computing their scale from the values actually observed. Useful for NLP models with variable sequence lengths.
- Weight-only quantization: Stores weights in 4-bit or lower precision and dequantizes them on the fly, while activations typically stay in floating point. Common for serving large language models.
- GPTQ, AWQ, GGUF: GPTQ and AWQ are popular algorithms for 4-bit LLM quantization that use different strategies to preserve accuracy. GGUF is a file format, used by llama.cpp, for storing and distributing quantized models.
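As a rough illustration of the weight-only approach in the list above, the sketch below stores weights as 4-bit integers with one scale per small group and dequantizes them just before use. It uses plain round-to-nearest, which is deliberately simpler than GPTQ or AWQ and is not a stand-in for either. The function names, the group size of 4, and the sample weights are all assumptions chosen for the example.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (4-bit integers in [-8, 7], per-group scale)


def quantize_group_int4(group: List[float]) -> Group:
    """Symmetric 4-bit quantization of one group of weights."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in group]
    return q, scale


def quantize_weights(weights: List[float], group_size: int = 4) -> List[Group]:
    """Split weights into groups and quantize each with its own scale."""
    return [quantize_group_int4(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def dequantize_weights(groups: List[Group]) -> List[float]:
    """Rebuild approximate float weights from integers and scales."""
    out: List[float] = []
    for q, scale in groups:
        out.extend(scale * x for x in q)
    return out


def dot_weight_only(groups: List[Group], activations: List[float]) -> float:
    """Dequantize on the fly, then compute in floating point."""
    weights = dequantize_weights(groups)
    return sum(w * a for w, a in zip(weights, activations))


# Hypothetical weights: one group of small values, one group of large values.
weights = [0.02, -0.01, 0.03, 0.00, 1.20, -0.80, 0.40, -1.10]
activations = [1.0, 2.0, -1.0, 0.5, 0.25, -0.5, 1.0, 2.0]

groups = quantize_weights(weights, group_size=4)
exact = sum(w * a for w, a in zip(weights, activations))
approx = dot_weight_only(groups, activations)
print(f"exact dot product:  {exact:+.4f}")
print(f"4-bit weights only: {approx:+.4f}")
```

Giving each group its own scale is the design choice that matters here: with a single scale for all eight weights, the small values in the first group would all round to zero.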
Quantization has become a default step in the AI deployment pipeline. Tools such as PyTorch's torch.quantization, NVIDIA TensorRT, and ONNX Runtime build these techniques into production stacks, letting teams trade a small amount of accuracy for substantial gains in speed, memory, and cost.