LLM Quantizations

Monday, May 25, 2026

Quantization in Large Language Models (LLMs) is a compression technique that reduces a model's memory footprint and computational requirements by converting its numerical values (weights and, optionally, activations) from a high-precision format to a lower-precision one. It is similar to saving a large image as a smaller file while preserving most of its quality.

Why is Quantization Used?

Memory Reduction: Model weights take up significant VRAM. Converting a model from 16-bit floating point (FP16) to 4-bit integers (INT4) reduces weight storage by up to 75%, which often lets large models fit on standard consumer hardware.
Faster Inference: Smaller weights mean less data to move from memory, which is usually the bottleneck during text generation. Hardware with native low-precision support can also perform the arithmetic itself faster.
Accessibility: It allows developers and researchers to run capable AI models locally on laptops or less expensive hardware.
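
The 75% figure follows from simple arithmetic on bits per parameter. A rough sketch is shown here; the 7-billion parameter count and the helper names are illustrative assumptions, and real quantized files are somewhat larger because they also store scale factors.

```python
# Approximate weight-storage size of a model at different precisions.
# The parameter count below is an illustrative assumption, not a specific model.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(num_params: int, bits: int) -> float:
    """Size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits / 8 / 1e9


def reduction_vs(baseline_bits: int, target_bits: int) -> float:
    """Fractional size reduction when moving from baseline to target precision."""
    return 1 - target_bits / baseline_bits


params = 7_000_000_000
for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_size_gb(params, bits):.1f} GB")
print(f"FP16 -> INT4 reduction: {reduction_vs(16, 4):.0%}")
```

Under these assumptions the FP16 weights need about 14 GB and the INT4 weights about 3.5 GB, before accounting for activations and the key-value cache.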

Common Data Formats

LLMs are usually trained in higher-precision formats such as FP32, FP16, or BF16. Quantization maps these values to lower-precision formats:

FP16 / BF16 (16-bit): The usual baseline, where each parameter occupies 2 bytes of memory.
INT8 (8-bit): Each parameter occupies 1 byte.
INT4 (4-bit): Each parameter occupies half a byte. This yields the highest compression of the three but introduces some risk of losing accuracy.
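
FP16 and BF16 both occupy 2 bytes, but they split those bits differently: FP16 uses 5 exponent and 10 mantissa bits, while BF16 uses 8 exponent and 7 mantissa bits. The sketch below illustrates the trade-off using only the standard library. The helper names are hypothetical, and BF16 is emulated by truncating an FP32 value, which is a simplification (real hardware usually rounds to nearest).

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16_truncated(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding of x."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))[0]


# Precision: FP16 has more mantissa bits, so it lands closer to 0.1.
print("FP16(0.1) =", to_fp16(0.1))
print("BF16(0.1) =", to_bf16_truncated(0.1))

# Range: BF16 shares the FP32 exponent range, so 70000.0 is representable
# (coarsely), while FP16 tops out at 65504.
print("BF16(70000.0) =", to_bf16_truncated(70000.0))
try:
    to_fp16(70000.0)
except (OverflowError, struct.error):
    print("70000.0 does not fit in FP16")
```

In short, FP16 is more precise for small values, while BF16 trades precision for a much wider range.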

How it Works

At its core, quantization maps a broad, continuous range of floating-point numbers onto a smaller, discrete set of values. For example, instead of storing 0.123456789 at full precision (4 bytes in FP32), the model stores the nearest value from a small set of representable levels, typically as a low-bit integer together with a shared scale factor.
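
One of the simplest schemes that fits this description is symmetric "absmax" quantization, sketched below. It is an illustration of the rounding idea rather than the method any particular library uses, and the function names are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using one shared scale (symmetric, absmax)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:                              # all-zero input: any scale works
        scale = 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [qi * scale for qi in q]


weights = [0.123456789, -0.6, 0.25, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print("integers:", q)                             # [16, -76, 32, 127]
print("restored:", [f"{r:.4f}" for r in restored])
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost paid for storing one byte per weight instead of four.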

Two main approaches are used to achieve this:

Weight-Only Quantization: The model's static weights are stored in a lower-precision format to save space. During generation, they are temporarily converted back (dequantized) to higher precision for the computation.
Weight and Activation Quantization: Both the weights and the activations (the intermediate values computed as the model processes text) are quantized. This can be faster, but it requires hardware and software support for low-precision arithmetic.
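
The difference between the two approaches can be sketched on a single dot product. This is a toy illustration with hypothetical function names and made-up numbers; production kernels work on whole matrices and typically use per-channel or per-group scales.

```python
def quantize_absmax(values, bits=8):
    """Symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale


def dot_weight_only(q_w, w_scale, x):
    """Weight-only: dequantize each weight to float, then do float math."""
    return sum((qw * w_scale) * xi for qw, xi in zip(q_w, x))


def dot_weight_and_activation(q_w, w_scale, x):
    """Weights and activations: integer accumulation, one rescale at the end."""
    q_x, x_scale = quantize_absmax(x)
    acc = sum(qw * qx for qw, qx in zip(q_w, q_x))   # pure integer arithmetic
    return acc * w_scale * x_scale


weights = [0.4, -0.6, 0.25, 1.0]
activations = [1.5, -0.3, 2.0, 0.7]
exact = sum(w * x for w, x in zip(weights, activations))

q_w, w_scale = quantize_absmax(weights)
print("exact:           ", exact)
print("weight-only:     ", dot_weight_only(q_w, w_scale, activations))
print("weight+activation:", dot_weight_and_activation(q_w, w_scale, activations))
```

In the second function the inner loop touches only integers, which is what lets hardware with integer matrix units run it quickly. The price is an extra source of rounding error from the activations.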

Popular Quantization Methods

Several formats and algorithms help preserve model quality during compression:

GGUF (commonly expanded as GPT-Generated Unified Format; the successor to the older GGML format): A file format used by llama.cpp and many desktop applications. It is designed for running models on a standard CPU, with the option of offloading some layers to a GPU.
GPTQ (Generative Post-Training Quantization): A post-training method that compresses weights down to 3 or 4 bits while keeping accuracy loss small.
AWQ (Activation-aware Weight Quantization): A technique that uses activation statistics to identify the small fraction of weights that matter most to the output and protects them during quantization, which retains quality well.
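
The activation-aware idea behind AWQ can be pictured with a toy example: a weight that looks small may still matter a great deal if its input channel carries large activations. Scaling that weight up before rounding, and scaling the matching activation down by the same factor, leaves the exact result unchanged but reduces the rounding damage. The numbers and the hand-picked scale below are assumptions for illustration; the actual method searches for per-channel scales using calibration data.

```python
def fake_quantize(values, bits=4):
    """Quantize to signed integers with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


weights = [0.8, -0.5, 0.03, 0.3]
activations = [0.1, -0.2, 10.0, 0.05]          # channel 2 carries large activations
exact = dot(weights, activations)

# Plain rounding: the small weight on the important channel collapses to zero.
plain_error = abs(dot(fake_quantize(weights), activations) - exact)

# Activation-aware scaling: multiply the important weight by s and divide its
# activation by s. The exact product is unchanged, but the weight survives rounding.
s = [1.0, 1.0, 4.0, 1.0]                       # hand-picked, hypothetical scales
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [x / si for x, si in zip(activations, s)]
scaled_error = abs(dot(fake_quantize(scaled_w), scaled_x) - exact)

print(f"error without scaling: {plain_error:.4f}")
print(f"error with scaling:    {scaled_error:.4f}")
```

In this toy case the output error drops by more than a factor of ten, even though only one of the four weights was treated specially.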

To explore and utilize these different quantization formats for local deployment, you can check out community-driven repositories such as Hugging Face Models to find optimized, ready-to-run versions of popular LLMs.

Summary of Quantization Algorithms

Precision	Algorithm / Format	Core Mathematical Approach	Best Used For
4-Bit	Q4_K_M (Quantization 4 bit, K-Quant, Medium)	Block-wise mixed linear quantization	GGUF/llama.cpp CPU & Mac inference
	IQ4_NL (Importance Quantization 4 bit, Non Linear)	Non-linear grid mapping via Importance Matrix	Maximizing accuracy in small 4-bit models
	GPTQ (Generative Post Training Quantization)	Second-order optimization using Hessian matrices	Fast, static GPU inference
	AWQ (Activation Aware Weight Quantization)	Activation-aware scaling protecting top 1% weights	High-throughput GPU serving (vLLM)
	NF4 (Normal float 4 bit)	Quantile distribution for normally distributed data	Resource-efficient QLoRA fine-tuning
	SpQR (Sparse Quantized Representation)	Outlier isolation (FP16) + base weight compression	Extreme accuracy retention at low bits
	QuIP / QuIP-Sharp (Quantization with Incoherence Processing)	Random orthogonal transformations to smooth outliers	Highly stable ultra-low bit pushes
8-Bit	LLM.int8()	Vector-wise quantization with extreme outlier channels kept in FP16	Loading models in 8-bit through Hugging Face Transformers (bitsandbytes)
	SmoothQuant	Mathematical migration of difficulty from activation to weight	Fast INT8 matrix multiplication on GPUs
	Q8_0	Uniform, symmetric block-wise linear quantization	Baseline GGUF inference with near-zero loss
	FP8 (E4M3 / E5M2; E = exponent bits, M = mantissa bits)	8-bit floating point with a 4/3 or 5/2 exponent/mantissa split	Native hardware acceleration (H100/Blackwell)
16-Bit	FP16 (Floating Point 16 bit)	1 sign, 5 exponent, 10 mantissa down-casting	Standard half-precision consumer GPU inference
16-Bit	BF16 (Brain Floating Point 16 bit)	1 sign, 8 exponent, 7 mantissa down-casting	Training and inference with FP32-like range (much lower overflow risk)
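
The "block-wise" entries in the table share one idea: instead of a single scale for a whole tensor, each small block of weights gets its own. The sketch below follows the general shape of Q8_0 (blocks of 32 values, 8-bit integers, one 16-bit scale per block). It is a simplified illustration with hypothetical function names, and it keeps the scale as a regular Python float rather than packing it as FP16.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_blocks(values, block_size=BLOCK_SIZE):
    """Quantize each block to 8-bit integers with its own absmax scale."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0
        blocks.append((scale, [round(v / scale) for v in block]))
    return blocks


def dequantize_blocks(blocks):
    """Flatten the blocks back into approximate float values."""
    return [q * scale for scale, qs in blocks for q in qs]


def bits_per_weight(block_size=BLOCK_SIZE, weight_bits=8, scale_bits=16):
    """Effective storage cost, including the per-block scale."""
    return (block_size * weight_bits + scale_bits) / block_size


# Two blocks with very different magnitudes. A per-block scale keeps the
# small-valued block precise even though the other block holds large values.
values = [0.001 * i for i in range(32)] + [1.0 * i for i in range(32)]
blocks = quantize_blocks(values)
restored = dequantize_blocks(blocks)

print("blocks:", len(blocks))
print("bits per weight:", bits_per_weight())
print("max error in small block:", max(abs(v - r) for v, r in zip(values[:32], restored[:32])))
```

With a single scale for all 64 values, every entry in the first block would round to zero. The per-block scale avoids that at a cost of half a bit per weight.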

RS Chandras Tech Blog | AI, ML, Agentic AI
