Tuesday, May 12, 2026

Norms

We frequently encounter terms such as L1 normalization, L2 regularization, and so on. These terms all derive from the mathematical concept of a norm (a measure of a vector's magnitude), which is also where "normalization" gets its name. The Lp norms are the most commonly used family for normalization, although many others exist.
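
As a quick illustration, here is a minimal sketch (using NumPy, with an arbitrary example vector) of what normalizing by the L1 versus the L2 norm actually does:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])   # arbitrary example vector

# L1 normalization: the absolute values of the result sum to 1
x_l1 = x / np.linalg.norm(x, 1)

# L2 normalization: the result has Euclidean length 1
x_l2 = x / np.linalg.norm(x, 2)

print(np.sum(np.abs(x_l1)))   # 1.0
print(np.linalg.norm(x_l2))   # 1.0
```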


| Norm / Concept | Mathematical Definition | Core Idea | Behavior / Intuition | Common Uses in ML / AI | Notes |
|---|---|---|---|---|---|
| L0 "Norm" | ‖x‖₀ = #{i : xᵢ ≠ 0} | Counts nonzero elements | Measures sparsity directly | Feature selection, sparse coding, compressed sensing | Not a true mathematical norm |
| L1 Norm | ‖x‖₁ = Σᵢ \|xᵢ\| | Sum of absolute values | Encourages sparsity | Lasso, MAE, sparse ML, NLP | Robust to outliers |
| L2 Norm | ‖x‖₂ = √(Σᵢ xᵢ²) | Euclidean length | Smooth penalty, geometric distance | Ridge, embeddings, neural nets, FAISS | Most common norm in ML |
| L∞ Norm (Max / Supremum Norm) | ‖x‖∞ = maxᵢ \|xᵢ\| | Largest absolute component | Worst-case magnitude control | Robust optimization, adversarial ML | Focuses only on the maximum value |
| General Lp Norm | ‖x‖ₚ = (Σᵢ \|xᵢ\|ᵖ)^(1/p) | Generalized norm family | Controls geometry and smoothness | Optimization theory, ML mathematics | L1/L2/L∞ are special cases |
| Frobenius Norm | ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) | L2 norm for matrices | Measures total matrix "energy" | Deep learning, matrix factorization | Very common in linear algebra |
| Nuclear Norm | ‖A‖∗ = Σᵢ σᵢ(A) | Sum of singular values | Encourages low-rank matrices | Recommender systems, matrix completion | Convex surrogate for matrix rank |
| Spectral Norm | ‖A‖₂ = σmax(A) | Largest singular value | Controls maximum amplification | GAN stabilization, deep learning | Used in spectral normalization |
| Max-Norm Constraint | ‖w‖₂ ≤ c | Restricts weight magnitude | Prevents exploding weights | Neural network regularization | Common in older deep learning methods |
| Elastic Net Penalty | λ₁‖w‖₁ + λ₂‖w‖₂² | Combines L1 + L2 | Sparsity + stability | Regression, feature selection | Useful with correlated features |
| Group Norms / Group Lasso | Σ_g ‖w_g‖₂ | Regularizes feature groups | Selects groups instead of individual features | Structured sparsity | Used in advanced feature engineering |
| Cosine Normalization (related concept) | cos(θ) = (x·y) / (‖x‖₂ ‖y‖₂) | Compares vector direction | Ignores vector magnitude | Embeddings, RAG, semantic search | Usually paired with L2 normalization |
| Huber Loss (hybrid L1/L2 behavior) | Lδ(a) = a²/2 if \|a\| ≤ δ, else δ(\|a\| − δ/2) | L2 near zero, L1 for large errors | Robust yet smooth optimization | Regression, deep learning | Handles outliers better than MSE |
| Total Variation (TV) Norm | TV(x) = Σᵢ \|xᵢ₊₁ − xᵢ\| | Measures signal/image smoothness | Preserves edges while denoising | Image processing, diffusion models | Common in computer vision |
| Operator Norm | ‖A‖ = sup_{x≠0} ‖Ax‖ / ‖x‖ | Maximum transformation strength | Measures matrix amplification | Numerical analysis, stability theory | General framework for matrix norms |
| Mahalanobis Distance (norm-related) | d(x, y) = √((x−y)ᵀ Σ⁻¹ (x−y)) | Distance accounting for covariance | Scale-aware distance metric | Anomaly detection, Gaussian models | Generalizes Euclidean distance |
| Energy Norms | E(x) = xᵀAx | Measures system "energy" | Physics-inspired optimization | PDEs, scientific ML | Common in numerical optimization |
| Path Norms | Σ_paths Πᵢ \|wᵢ\| | Measures neural network complexity | Capacity control in deep nets | Theoretical deep learning | Research-oriented concept |
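
Most of the quantities in the table can be computed directly with NumPy. Here is a minimal sketch (the vector, the matrix, and the choice p = 3 are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.0, 4.0])       # arbitrary example vector

l0   = np.count_nonzero(x)                # L0 "norm": number of nonzero entries
l1   = np.linalg.norm(x, 1)               # L1: sum of absolute values
l2   = np.linalg.norm(x, 2)               # L2: Euclidean length
linf = np.linalg.norm(x, np.inf)          # L∞: largest absolute component
lp   = np.sum(np.abs(x) ** 3) ** (1 / 3)  # general Lp norm with p = 3

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])                # arbitrary example matrix

fro  = np.linalg.norm(A, 'fro')           # Frobenius: √(Σ Aᵢⱼ²)
spec = np.linalg.norm(A, 2)               # spectral: largest singular value
nuc  = np.linalg.norm(A, 'nuc')           # nuclear: sum of singular values
```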


The L1 and L2 norms are by far the most frequently encountered in machine learning; here is a side-by-side summary of how they are used:


| Category | L1 Usage / Concept | L2 Usage / Concept | Key Intuition / Effect | Common Algorithms / Systems |
|---|---|---|---|---|
| General norm definition | Absolute-value-based magnitude | Euclidean-length-based magnitude | Measures vector "size" differently | Optimization, ML, statistics |
| Mathematical formula | ‖x‖₁ = Σᵢ \|xᵢ\| | ‖x‖₂ = √(Σᵢ xᵢ²) | L1 = linear growth, L2 = squared growth | Foundational mathematics |
| Regularization | L1 regularization (Lasso) | L2 regularization (Ridge / weight decay) | Prevents overfitting | Regression, neural networks |
| Regularization penalty | λ Σᵢ \|wᵢ\| | λ Σᵢ wᵢ² | Penalizes large weights | ML optimization |
| Effect on weights | Creates sparse models | Smoothly shrinks weights | Feature selection vs. stability | Sparse ML vs. deep learning |
| Feature selection | Strongly used | Rarely used | Zeroes out unimportant features | Lasso, sparse models |
| Distance metrics | Manhattan distance | Euclidean distance | Different geometry | KNN, clustering |
| Distance formula | d(x, y) = Σᵢ \|xᵢ − yᵢ\| | d(x, y) = √(Σᵢ (xᵢ − yᵢ)²) | L1 = grid movement, L2 = straight-line distance | Vector search |
| Outlier handling | More robust | Sensitive to outliers | Squaring amplifies large errors | Robust statistics |
| Loss functions | MAE loss | MSE loss | Regression error measurement | Forecasting, regression |
| Loss formula | MAE = (1/n) Σ \|y − ŷ\| | MSE = (1/n) Σ (y − ŷ)² | Linear vs. quadratic penalty | Neural nets, regression |
| Optimization behavior | Non-smooth gradients | Smooth gradients | Optimization difficulty differs | Gradient descent |
| Vector normalization | Absolute values sum to 1 | Vector length becomes 1 | Scale standardization | Embeddings, NLP |
| Normalization formula | x̂ = x / ‖x‖₁ | x̂ = x / ‖x‖₂ | Makes vectors comparable | Semantic search |
| Embeddings & RAG | Sometimes used | Extremely common | Similarity computations | FAISS, vector DBs |
| Cosine similarity | Rarely used directly | Usually paired with L2 normalization | Direction-only comparison | RAG retrieval |
| Sparse machine learning | Core idea | Less important | Efficient sparse representations | Sparse coding |
| Optimization geometry | Diamond-shaped constraint | Circular/spherical constraint | Affects solution structure | Convex optimization |
| Deep learning | Occasional | Extremely common | Weight-decay regularization | CNNs, Transformers |
| Optimizer usage | Rare | AdamW weight decay | Stabilizes training | Modern LLM training |
| Computer vision | Sharper reconstructions | Smoother reconstructions | Different image characteristics | GANs, diffusion |
| NLP | Sparse bag-of-words systems | Dense embeddings | Sparse vs. dense representations | Transformers |
| Reinforcement learning | Occasionally used | Frequently used | Stabilization and constraints | Policy optimization |
| Statistics / Bayesian view | Laplace prior | Gaussian prior | Different assumptions about data | Bayesian ML |
| Signal processing | Sparse recovery | Energy minimization | Compression vs. smoothness | Fourier, denoising |
| KNN | Manhattan KNN | Euclidean KNN | Different neighborhood geometry | Classification |
| Clustering | Robust clustering variants | Standard k-means | Distance-driven grouping | Unsupervised learning |
| SVM | L1-regularized SVM | L2-regularized SVM | Margin regularization | Classification |
| Linear regression | Lasso regression | Ridge regression | Sparse vs. stable regression | Predictive modeling |
| Logistic regression | Sparse classifier | Smooth classifier | Regularized classification | Binary classification |
| Elastic Net | Contributes the L1 term | Contributes the L2 term | Sparsity + stability | High-dimensional ML |
| Autoencoders | Sparse autoencoders | Weight regularization | Representation learning | Deep learning |
| PCA | Rarely used | Fundamentally L2-based | Variance maximization | Dimensionality reduction |
| Transformers / LLMs | Minimal usage | Heavy usage | Stable large-scale training | GPT-style models |
| Core intuition | "Keep only the important things" | "Keep everything controlled" | Sparsity vs. smoothness | Foundational ML principle |
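
To see the sparsity-versus-shrinkage contrast from the table in practice, here is a minimal sketch using scikit-learn's Lasso and Ridge on synthetic data (the data shape, random seed, and alpha values are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [5.0, -3.0, 2.0]              # only 3 of the 20 features matter
y = X @ w_true + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty

# L1 drives irrelevant coefficients exactly to zero;
# L2 only shrinks them toward zero.
print("Lasso nonzero coefficients:", np.count_nonzero(lasso.coef_))  # close to 3
print("Ridge nonzero coefficients:", np.count_nonzero(ridge.coef_))  # all 20
```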
