In linear algebra, a norm is a function that assigns a non-negative length or size to each vector in a vector space, and assigns zero only to the zero vector. In machine learning, norms are used extensively in both loss functions (measuring the error between predictions and targets) and regularization (penalizing model complexity to prevent overfitting).

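The general formula for the $L_p$ norm of a vector $x \in \mathbb{R}^n$, with $p \ge 1$, is:

$$\lVert x \rVert_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$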

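As a rough illustration of the general $L_p$ formula, $\left(\sum_i |x_i|^p\right)^{1/p}$, the sketch below computes it directly with NumPy. The helper name `lp_norm` and the example vector are assumptions made for this illustration, not part of the original text.

```python
import numpy as np

def lp_norm(x, p):
    """Compute the L_p norm (sum_i |x_i|^p)^(1/p) for p >= 1.

    Hypothetical helper written for this example.
    """
    if p < 1:
        raise ValueError("p must be >= 1 for this to be a valid norm")
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

x = np.array([3.0, -4.0])
l1 = lp_norm(x, 1)  # |3| + |-4| = 7
l2 = lp_norm(x, 2)  # sqrt(9 + 16) = 5
```

In practice, `np.linalg.norm(x, ord=p)` should give the same values for a 1-D array, so a hand-written helper like this is mainly useful for seeing the formula at work.
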
1. The L1 Norm (Manhattan Norm)

The $L_1$ norm is the sum of the absolute values of the vector components:

$$\lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|$$

Advantages & Properties

  • Sparsity (Feature Selection): When used as a regularization term (e.g., in LASSO Regression), the $L_1$ norm penalizes the absolute value of every non-zero weight. Because its unit ball is a "diamond" (in 2D), the contours of the loss function often touch the penalty region at a corner, where one or more coordinates are exactly zero. This drives the weights of less important features to exactly 0, performing automatic feature selection.
  • Robustness to Outliers: When used as a loss function (Mean Absolute Error, or MAE), the $L_1$ norm penalizes each error linearly in its magnitude. Unlike squared error, it is not dominated by a few massive outliers, which makes it more robust on noisy datasets.
  • Interpretability: Because it yields sparse models, the final model is often easier to interpret.
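
One way to see the sparsity effect is through soft-thresholding, the per-coordinate minimizer of $\tfrac{1}{2}(w - z)^2 + \lambda |w|$ (the proximal operator of the $L_1$ penalty). The sketch below is an illustration only: the function name `soft_threshold`, the weight vector, and the value of `lam` are assumptions, and a full LASSO solver would apply a step like this repeatedly rather than once.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries with |z| <= lam become 0.

    Hypothetical helper: the proximal operator of lam * |w|.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Illustrative weights: two large, two small.
w = np.array([2.0, -0.3, 0.05, -1.5])
w_l1 = soft_threshold(w, lam=0.5)

# The small weights are set to exactly zero; the large ones shrink by 0.5.
n_nonzero = np.count_nonzero(w_l1)
```

With these assumed values, the two small weights are zeroed and the two large ones survive with reduced magnitude, which is the feature-selection behavior described above.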

2. The L2 Norm (Euclidean Norm)

The $L_2$ norm is the standard notion of straight-line distance from the origin:

$$\lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Advantages & Properties

  • Mathematical Convenience & Differentiability: The squared $L_2$ norm is smooth and continuously differentiable everywhere (the unsquared norm is non-differentiable only at the origin). This makes analytical solutions (like the Normal Equation in Linear Regression) possible and keeps gradients well behaved for backpropagation.
  • Prevents Over-reliance on Single Features: When used as a regularization term (Ridge Regression / Weight Decay), the $L_2$ norm penalizes large weights heavily but rarely drives any weight exactly to zero. Instead, it shrinks all weights toward zero, encouraging the model to use all features a little bit rather than relying heavily on just one. This handles multicollinearity well.
  • Strong Penalty for Large Errors: When used as a loss function (Mean Squared Error, or MSE), squaring the errors heavily punishes large mistakes while forgiving tiny ones. This is ideal when large errors are unacceptable.
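
The closed-form solution that the $L_2$ penalty makes possible can be sketched as $w = (X^\top X + \lambda I)^{-1} X^\top y$. The example below is a minimal illustration under stated assumptions: the function name `ridge_fit`, the synthetic data, and the choice `lam=10.0` are all invented here, and no intercept term is fitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w.

    Hypothetical helper; lam = 0 reduces to ordinary least squares.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, 0.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # unregularized fit
w_ridge = ridge_fit(X, y, lam=10.0)  # L2-regularized fit

# The ridge weights have a smaller overall norm than the unregularized ones.
shrunk = np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

On data like this, the ridge weights come out smaller in overall norm but generally not exactly zero, in contrast to the sparse solutions an $L_1$ penalty tends to produce.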

3. L1 vs. L2: A Geometric Summary

| Feature | L1 Norm ($\lVert x \rVert_1$) | L2 Norm ($\lVert x \rVert_2$) |
| :--- | :--- | :--- |
| Shape (2D Space) | Diamond | Circle |
| Regularization Name | LASSO | Ridge (Weight Decay) |
| Loss Function Name | Mean Absolute Error (MAE) | Mean Squared Error (MSE) |
| Effect on Weights | Drives many to exactly 0 (Sparse) | Shrinks all towards 0 (Dense) |
| Robust to Outliers? | Yes (as a Loss Function) | No (Squares the outliers) |
| Differentiability | Non-differentiable at 0 | Differentiable everywhere (squared form) |
| Best Used When… | You need feature selection / Interpretability | You want overall stability and all features are somewhat useful |
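
The outlier-robustness contrast can be checked numerically. For a constant prediction, the value minimizing MSE is the mean of the targets, while the value minimizing MAE is their median. The sketch below finds both by brute-force search over a grid; the target values and the grid are assumptions chosen for illustration.

```python
import numpy as np

# Four ordinary targets and one large outlier (illustrative values).
targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions on a grid with step 0.01.
candidates = np.linspace(0.0, 100.0, 10001)

# Error of each candidate against every target, averaged per candidate.
diffs = targets[None, :] - candidates[:, None]
mse = (diffs ** 2).mean(axis=1)
mae = np.abs(diffs).mean(axis=1)

best_mse = candidates[np.argmin(mse)]  # close to the mean of the targets
best_mae = candidates[np.argmin(mae)]  # close to the median of the targets
```

Here the single outlier pulls the MSE-optimal prediction far away from the bulk of the data, while the MAE-optimal prediction stays at the middle value.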

For more details on how these are applied in regression, see Ridge Regression.