Monday, May 25, 2026

LLM Quantizations

Quantization in Large Language Models (LLMs) is a compression technique that reduces a model's memory footprint and computational requirements by converting its numerical values (weights and activations) from high precision to lower precision. It acts like resizing a massive image into a smaller file while preserving most of its quality.

Why is Quantization Used?

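  • Memory Reduction: Model weights have to fit in VRAM (or system RAM). Converting a model from 16-bit floating-point (FP16) to 4-bit integers (INT4) cuts weight storage by roughly 75%, slightly less once per-block scale factors are counted, which is often enough to fit a large model onto standard consumer hardware.
  • Faster Inference: Text generation is frequently limited by memory bandwidth, so smaller weights are read from memory faster. Hardware with native low-precision arithmetic can also do the math faster. Both effects usually shorten response times.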
  • Accessibility: It allows developers and researchers to run capable AI models locally on laptops or less expensive hardware.

Common Data Formats

LLMs are usually trained in higher-precision formats such as FP32, FP16, or BF16. Quantization maps these values to lower-precision formats:

  • FP16 / BF16 (16-bit): Standard sizes where parameters occupy 2 bytes of memory.
  • INT8 (8-bit): Parameters occupy 1 byte.
  • INT4 (4-bit): Parameters occupy half a byte. This is the highest compression of the three, but it also carries the greatest risk of losing accuracy.
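
The bytes-per-parameter figures above translate directly into a rough memory estimate. The sketch below is only back-of-the-envelope arithmetic: the 7-billion parameter count is an assumed example, and the result covers the weights alone (no scale factors, KV cache, or activations).

```python
# Rough weight-storage estimate at different precisions.
# The parameter count used below is an illustrative assumption.
BITS_PER_PARAM = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Size of the weights alone in GiB (ignores scales, KV cache, activations)."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>10}: {weight_memory_gib(num_params, bits):6.2f} GiB")
```

Real quantized files are somewhat larger than this estimate because they also store scale factors and usually keep a few layers at higher precision.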

How it Works

At its core, quantization maps a broad range of floating-point values onto a much smaller, discrete set of values. For example, instead of storing 0.123456789 as a 4-byte FP32 number, the model stores a small integer together with a shared scale factor, and an approximation of the original value is recovered from the two.
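
As a minimal sketch of that mapping, the code below uses symmetric "absmax" round-to-nearest quantization with a single scale. This is one simple scheme chosen for illustration, not the method of any particular library; the function names and sample values are made up.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using one shared scale (symmetric, round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [qi * scale for qi in q]


weights = [0.123456789, -0.5, 0.3, 0.031]           # made-up example values
q, scale = quantize_absmax(weights, bits=8)
restored = dequantize(q, scale)

for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.9f} -> {qi:+4d} -> {r:+.9f}  (error {abs(w - r):.6f})")
```

Each restored value differs from the original by at most half the scale, which is why fewer bits (a larger scale) mean a coarser approximation.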

Two main approaches are used to achieve this:

  • Weight-Only Quantization: The model's static weights are converted to a lower precision format to save space. During generation, they are temporarily converted back to high precision to compute the response.
  • Weight and Activation Quantization: Both the weights and the activations (the intermediate values computed as the model processes text) are quantized. This allows faster low-precision arithmetic but requires specialized software and hardware support.
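
The difference between the two approaches can be sketched on a single dot product. The values, bit widths (INT4 weights for the weight-only path, INT8 for both operands in the second path), and helper name are assumptions chosen for illustration; real kernels work on whole matrices with per-group scales.

```python
def quantize(values, bits):
    """Symmetric round-to-nearest quantization with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / qmax
    return [round(v / scale) for v in values], scale


weights = [0.42, -0.17, 0.08, -0.31]       # one row of a weight matrix (made up)
activations = [1.5, -2.0, 0.25, 0.75]      # input vector for one token (made up)

# Full-precision reference result.
reference = sum(w * x for w, x in zip(weights, activations))

# Weight-only: store INT4 weights, dequantize on the fly, multiply in floating point.
qw4, w4_scale = quantize(weights, bits=4)
weight_only = sum((q * w4_scale) * x for q, x in zip(qw4, activations))

# Weight and activation: quantize both to INT8, accumulate in integers, rescale once.
qw8, w8_scale = quantize(weights, bits=8)
qx8, x8_scale = quantize(activations, bits=8)
int_accumulator = sum(a * b for a, b in zip(qw8, qx8))   # integer-only arithmetic
w8a8 = int_accumulator * w8_scale * x8_scale

print(f"reference   = {reference:.4f}")
print(f"weight-only = {weight_only:.4f}")
print(f"W8A8        = {w8a8:.4f}")
```

In the weight-only path the saving is in storage and memory traffic, since the multiplication still happens in floating point. In the second path the inner loop itself runs on integers, which is where the extra speed comes from on hardware with fast integer matrix units.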

Popular Quantization Methods

Several algorithms and file formats help preserve model quality during compression:

  • GGUF (GPT-Generated Unified Format): The successor to GGML, a format named after its author, Georgi Gerganov. GGUF is a file format used by llama.cpp and many desktop applications. It lets a model run on a standard CPU and optionally offload some layers to a GPU.
  • GPTQ (Generative Post-Training Quantization): A post-training method that compresses weights down to 3 or 4 bits while keeping accuracy loss small.
  • AWQ (Activation-aware Weight Quantization): A technique that uses activation magnitudes to identify the most important weights and protects them during quantization, which retains quality well.

To try these formats for local deployment, browse community repositories such as the Hugging Face model hub, which hosts pre-quantized, ready-to-run versions of popular LLMs.

Summary of Quantization Algorithms

| Precision | Algorithm / Format | Core Mathematical Approach | Best Used For |
|---|---|---|---|
| 4-Bit | Q4_K_M (Quantization 4 bit, K-Quant, Medium) | Block-wise mixed linear quantization | GGUF/llama.cpp CPU & Mac inference |
| 4-Bit | IQ4_NL (Importance Quantization 4 bit, Non Linear) | Non-linear grid mapping via Importance Matrix | Maximizing accuracy in small 4-bit models |
| 4-Bit | GPTQ (Generative Post Training Quantization) | Second-order optimization using Hessian matrices | Fast, static GPU inference |
| 4-Bit | AWQ (Activation Aware Weight Quantization) | Activation-aware scaling protecting top 1% weights | High-throughput GPU serving (vLLM) |
| 4-Bit | NF4 (Normal float 4 bit) | Quantile distribution for normally distributed data | Resource-efficient QLoRA fine-tuning |
| 4-Bit | SpQR (Sparse Quantized Representation) | Outlier isolation (FP16) + base weight compression | Extreme accuracy retention at low bits |
| 4-Bit | QuIP / QuIP-Sharp (Quantization with Incoherence Processing) | Random orthogonal transformations to smooth outliers | Highly stable ultra-low bit pushes |
| 8-Bit | LLM.int8() | Vector-wise separation of extreme outlier channels | Standard zero-shot Hugging Face loading |
| 8-Bit | SmoothQuant | Mathematical migration of difficulty from activation to weight | Fast INT8 matrix multiplication on GPUs |
| 8-Bit | Q8_0 | Uniform, symmetric block-wise linear quantization | Baseline GGUF inference with near-zero loss |
| 8-Bit | FP8 (E4M3 / E5M2; E = exponent bits, M = mantissa bits) | Dynamic floating-point exponent/mantissa splitting | Native hardware acceleration (H100/Blackwell) |
| 16-Bit | FP16 (Floating Point 16 bit) | 1 sign, 5 exponent, 10 mantissa down-casting | Standard half-precision consumer GPU inference |
| 16-Bit | BF16 (Brain Floating Point 16 bit) | 1 sign, 8 exponent, 7 mantissa down-casting | Training and inference without overflow risks |
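
Several rows in the summary describe block-wise quantization, where each small block of weights gets its own scale instead of sharing one scale across the whole tensor. The sketch below illustrates why that helps when a tensor contains an outlier. It is a simplified illustration, not the actual GGUF implementation: the block size of 32 is assumed to mirror Q8_0, and the weights and function names are made up.

```python
def quantize_block(block, bits=8):
    """Symmetric round-to-nearest quantization of one block with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) for v in block], scale


def quantize_blockwise(values, block_size, bits=8):
    """Split values into fixed-size blocks and quantize each block separately."""
    return [
        quantize_block(values[start:start + block_size], bits)
        for start in range(0, len(values), block_size)
    ]


def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (integers, scale) pairs."""
    return [q * scale for qs, scale in blocks for q in qs]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights: 63 small values plus one large outlier at the end.
weights = [0.01 * ((i % 7) - 3) for i in range(63)] + [8.0]

per_tensor = quantize_blockwise(weights, block_size=len(weights))  # one shared scale
per_block = quantize_blockwise(weights, block_size=32)             # one scale per 32 values

err_tensor = mean_abs_error(weights, dequantize_blockwise(per_tensor))
err_block = mean_abs_error(weights, dequantize_blockwise(per_block))

print(f"per-tensor error: {err_tensor:.6f}")
print(f"per-block error:  {err_block:.6f}")
```

With a single scale, the outlier stretches the range so far that every small weight rounds to zero. With per-block scales, only the block containing the outlier is affected, at the cost of storing one extra scale per block.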
