We frequently encounter terms like L1 normalization and L2 regularization. These terms come from the mathematical concept of a norm, a measure of a vector's size (normalization is in turn named after norms), and the Lp norms are the most common family of norms, although there are many others. The table below surveys the norms and norm-like concepts that appear most often in ML; a short NumPy sketch computing several of them follows the table.
| Norm / Concept | Mathematical Definition | Core Idea | Behavior / Intuition | Common Uses in ML / AI | Notes |
|---|---|---|---|---|---|
| L0 “Norm” | ‖x‖₀ = #{i : xᵢ ≠ 0} | Counts nonzero elements | Measures sparsity directly | Feature selection, Sparse coding, Compressed sensing | Not a true mathematical norm |
| L1 Norm | ‖x‖₁ = Σᵢ \|xᵢ\| | Sum of absolute values | Encourages sparsity | Lasso, MAE, Sparse ML, NLP | Robust to outliers |
| L2 Norm | ‖x‖₂ = √(Σᵢ xᵢ²) | Euclidean length | Smooth penalty, geometric distance | Ridge, Embeddings, Neural nets, FAISS | Most common norm in ML |
| L∞ Norm (Max Norm / Supremum Norm) | ‖x‖∞ = maxᵢ \|xᵢ\| | Largest absolute component | Worst-case magnitude control | Robust optimization, Adversarial ML | Focuses only on the maximum value |
| General Lp Norm | ‖x‖ₚ = (Σᵢ \|xᵢ\|ᵖ)^(1/p) | Generalized norm family | Controls geometry and smoothness | Optimization theory, ML mathematics | L1/L2/L∞ are special cases |
| Frobenius Norm | ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) | L2 norm for matrices | Measures total matrix energy | Deep learning, Matrix factorization | Very common in linear algebra |
| Nuclear Norm | ‖A‖∗ = Σᵢ σᵢ | Sum of singular values | Encourages low-rank matrices | Recommendation systems, Matrix completion | Convex approximation to matrix rank |
| Spectral Norm | ‖A‖₂ = σₘₐₓ(A) | Largest singular value | Controls maximum amplification | GAN stabilization, Deep learning | Used in spectral normalization |
| Max Norm Constraint | ‖w‖₂ ≤ c | Restricts weight magnitude | Prevents exploding weights | Neural network regularization | Common in older deep learning methods |
| Elastic Net Penalty | λ₁‖w‖₁ + λ₂‖w‖₂² | Combines L1 + L2 | Sparsity + stability | Regression, Feature selection | Useful with correlated features |
| Group Norms / Group Lasso | Σg ‖w_g‖₂ | Regularizes feature groups | Selects groups instead of individual features | Structured sparsity | Used in advanced feature engineering |
| Cosine Normalization (Related Concept) | cos(θ) = (x·y) / (‖x‖₂ ‖y‖₂) | Compares vector direction | Ignores vector magnitude | Embeddings, RAG, Semantic search | Usually combined with L2 normalization |
| Huber Loss (Hybrid L1/L2 Behavior) | Lδ(a) = ½a² if \|a\| ≤ δ, else δ(\|a\| − δ/2) | Quadratic (L2) near zero, linear (L1) for large errors | Robust + smooth optimization | Regression, Deep learning | Handles outliers better than MSE |
| Total Variation (TV) Norm | TV(x) = Σᵢ \|xᵢ₊₁ − xᵢ\| | Measures signal/image smoothness | Preserves edges while denoising | Image processing, Diffusion models | Common in computer vision |
| Operator Norm | ‖A‖ = sup₍x≠0₎ ‖Ax‖ / ‖x‖ | Maximum transformation strength | Measures matrix amplification | Numerical analysis, Stability theory | General framework for matrix norms |
| Mahalanobis Distance (Norm-related) | d(x,y) = √((x−y)ᵀΣ⁻¹(x−y)) | Distance accounting for covariance | Scale-aware distance metric | Anomaly detection, Gaussian models | Generalizes Euclidean distance |
| Energy Norms | ‖x‖_A = √(xᵀAx), A positive definite | Measures system energy | Physics-inspired optimization | PDEs, Scientific ML | Common in numerical optimization |
| Path Norms | Σₚₐₜₕₛ Πᵢ \|wᵢ\| | Measures neural network complexity | Capacity control in deep nets | Theoretical deep learning | Research-oriented concept |
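To make these definitions concrete, here is a minimal NumPy sketch computing the most common of the norms above (the vector `x` and matrix `A` are arbitrary illustrative values, not from any particular dataset):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Vector norms
l0 = np.count_nonzero(x)          # "L0 norm": number of nonzeros -> 3
l1 = np.linalg.norm(x, 1)         # L1: 3 + 4 + 0 + 1 = 8
l2 = np.linalg.norm(x, 2)         # L2: sqrt(9 + 16 + 0 + 1) ≈ 5.10
linf = np.linalg.norm(x, np.inf)  # L∞: largest absolute component = 4

# Matrix norms
fro = np.linalg.norm(A, 'fro')    # Frobenius: sqrt of sum of squared entries
nuc = np.linalg.norm(A, 'nuc')    # Nuclear: sum of singular values
spec = np.linalg.norm(A, 2)       # Spectral: largest singular value
```

Note the overloading of `ord=2`: NumPy computes the Euclidean length for a 1-D input but the largest singular value (the spectral norm) for a 2-D input.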
L1 and L2 norms are the ones most frequently encountered in machine learning. Here is a summary of how they compare across common settings; a sketch contrasting L1 and L2 regularization follows the table:
| Category | L1 Usage / Concept | L2 Usage / Concept | Key Intuition / Effect | Common Algorithms / Systems |
|---|---|---|---|---|
| General Norm Definition | Absolute-value based magnitude | Euclidean-length based magnitude | Measures vector “size” differently | Optimization, ML, Statistics |
| Mathematical Formula | ‖x‖₁ = Σᵢ \|xᵢ\| | ‖x‖₂ = √(Σᵢ xᵢ²) | L1 = linear growth, L2 = squared growth | Foundational mathematics |
| Regularization | L1 Regularization (Lasso) | L2 Regularization (Ridge / Weight Decay) | Prevent overfitting | Regression, Neural Networks |
| Regularization Penalty | λ Σᵢ \|wᵢ\| | λ Σᵢ wᵢ² | Penalize large weights | ML optimization |
| Effect on Weights | Creates sparse models | Smoothly shrinks weights | Feature selection vs stability | Sparse ML vs Deep Learning |
| Feature Selection | Strongly used | Rarely used | Zeroes unimportant features | Lasso, Sparse models |
| Distance Metrics | Manhattan Distance | Euclidean Distance | Different geometry | KNN, Clustering |
| Distance Formula | d(x,y) = Σᵢ \|xᵢ−yᵢ\| | d(x,y) = √(Σᵢ (xᵢ−yᵢ)²) | L1 = grid movement, L2 = straight-line distance | Vector search |
| Outlier Handling | More robust | Sensitive to outliers | Squaring amplifies large errors | Robust statistics |
| Loss Functions | MAE Loss | MSE Loss | Regression error measurement | Forecasting, Regression |
| Loss Formula | MAE = (1/n) Σᵢ \|yᵢ−ŷᵢ\| | MSE = (1/n) Σᵢ (yᵢ−ŷᵢ)² | Linear vs quadratic penalty | Neural nets, Regression |
| Optimization Behavior | Non-smooth gradients | Smooth gradients | Optimization difficulty differs | Gradient descent |
| Vector Normalization | Sum of absolute values becomes 1 | Vector length becomes 1 | Scale standardization | Embeddings, NLP |
| Normalization Formula | xₙₒᵣₘ = x / ‖x‖₁ | xₙₒᵣₘ = x / ‖x‖₂ | Makes vectors comparable | Semantic search |
| Embeddings & RAG | Sometimes used | Extremely common | Similarity computations | FAISS, Vector DBs |
| Cosine Similarity | Rarely used directly | Usually paired with L2 normalization | Direction-only comparison | RAG retrieval |
| Sparse Machine Learning | Core idea | Less important | Efficient sparse representations | Sparse coding |
| Optimization Geometry | Diamond-shaped constraint | Circular/spherical constraint | Affects solution structure | Convex optimization |
| Deep Learning | Occasional | Extremely common | Weight decay regularization | CNNs, Transformers |
| Optimizer Usage | Rare | AdamW weight decay | Stabilizes training | Modern LLM training |
| Computer Vision | Sharper reconstruction | Smoother reconstruction | Different image characteristics | GANs, Diffusion |
| NLP | Sparse bag-of-words systems | Dense embeddings | Sparse vs dense representations | Transformers |
| Reinforcement Learning | Occasionally used | Frequently used | Stabilization and constraints | Policy optimization |
| Statistics / Bayesian View | Laplace prior | Gaussian prior | Different assumptions about data | Bayesian ML |
| Signal Processing | Sparse recovery | Energy minimization | Compression vs smoothness | Fourier, Denoising |
| KNN | Manhattan KNN | Euclidean KNN | Different neighborhood geometry | Classification |
| Clustering | Robust clustering variants | Standard K-Means | Distance-driven grouping | Unsupervised learning |
| SVM | L1 regularized SVM | L2 regularized SVM | Margin regularization | Classification |
| Linear Regression | Lasso Regression | Ridge Regression | Sparse vs stable regression | Predictive modeling |
| Logistic Regression | Sparse classifier | Smooth classifier | Regularized classification | Binary classification |
| Elastic Net | Contributes the L1 penalty | Contributes the L2 penalty | Sparsity + stability | High-dimensional ML |
| Autoencoders | Sparse autoencoders | Weight regularization | Representation learning | Deep learning |
| PCA | Rarely used | Fundamentally L2-based | Variance maximization | Dimensionality reduction |
| Transformers / LLMs | Minimal usage | Heavy usage | Stable large-scale training | GPT-style models |
| Core Intuition | “Keep only important things” | “Keep everything controlled” | Sparsity vs smoothness | Foundational ML principle |
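The sparsity-versus-shrinkage contrast in the table is easy to demonstrate. The following sketch fits Lasso (L1) and Ridge (L2) on the same data and counts the weights driven exactly to zero (the synthetic data and `alpha=1.0` penalty strength are arbitrary choices for illustration, so exact counts will vary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem: 20 features, only 5 carry signal
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: λ Σ |wᵢ|
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: λ Σ wᵢ²

# L1 zeroes out uninformative weights; L2 only shrinks them toward zero
print("Lasso zero weights:", np.sum(lasso.coef_ == 0))
print("Ridge zero weights:", np.sum(ridge.coef_ == 0))
```

On a run like this, Lasso typically sets many of the 15 uninformative weights exactly to zero, while Ridge leaves all 20 small but nonzero: precisely the “keep only important things” versus “keep everything controlled” intuition from the last row of the table.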