Quantization

Supporting Technique

Quantization is a technique that reduces the precision of the numbers used to represent a model’s parameters.

Quantization is used in scenarios where reducing model size and increasing inference speed are critical, such as deploying machine learning models on edge devices or mobile phones. It helps optimize models to run efficiently on hardware with limited computational resources.

Quantization works by mapping the continuous values of the model’s parameters to a finite set of discrete values, typically reducing the bit-width of the numbers. This process can significantly reduce the memory footprint and computational requirements of the model without substantially compromising its accuracy.
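As a minimal sketch of this mapping, the NumPy functions below implement affine quantization. The function names, the 8-bit default, and the per-tensor range are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def quantize(weights: np.ndarray, num_bits: int = 8):
    """Map continuous float weights onto a discrete integer grid (affine quantization)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The scale stretches the integer grid over the observed float range
    # (assumes the weights span a nonzero range).
    scale = float(weights.max() - weights.min()) / (qmax - qmin)
    # The zero point shifts the grid so the float minimum lands on qmin.
    zero_point = int(round(qmin - float(weights.min()) / scale))
    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8), scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the quantized integers."""
    return scale * (q.astype(np.float32) - zero_point)
```

The rounding step in quantize is where precision is lost: dequantize recovers only an approximation of the original values, which is why quantization trades a small amount of accuracy for a smaller, faster model.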

For example, a model with 32-bit floating-point weights can be quantized to use 8-bit integers, reducing the model size by a factor of four and potentially increasing inference speed.
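Continuing that example, the sketch below applies symmetric 8-bit quantization to an illustrative array of one million float32 weights to make the four-fold size reduction concrete; the array size and variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal(1_000_000).astype(np.float32)

# Symmetric 8-bit quantization: scale the largest magnitude onto 127,
# so the integer grid is centered on zero and no zero point is needed.
scale = np.abs(weights_fp32).max() / 127
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

print(f"fp32: {weights_fp32.nbytes:,} bytes")  # 4,000,000 bytes
print(f"int8: {weights_int8.nbytes:,} bytes")  # 1,000,000 bytes -> 4x smaller

# Dequantize to check how much accuracy the round trip loses.
recovered = weights_int8.astype(np.float32) * scale
print(f"max abs error: {np.abs(weights_fp32 - recovered).max():.4f}")
```

Dropping the zero point by centering the integer grid on zero is a common simplification for weight tensors, whose values are typically distributed roughly symmetrically around zero.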

Alias
Model Quantization, Parameter Quantization
Related terms
Compression, Model Optimization, Inference Acceleration