Linear Regression
| Type |
|---|
| Regression |
Prediction:
Linear regression assumes the target is a weighted sum (linear combination) of the input features.
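In symbols, with the bias absorbed by a constant feature $x_0 = 1$ (a common convention):

$$\hat{y} = w^T x = \sum_{j} w_j x_j$$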
Loss:
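A standard choice is the squared (L2) loss; scaling constants such as $\tfrac{1}{2}$ or $\tfrac{1}{n}$ vary between texts:

$$L(w) = \tfrac{1}{2}\sum_{i=1}^{n}\big(w^T x_i - y_i\big)^2$$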
Why use squared loss?
- Makes the function convex
- Prevents errors from cancelling out
- Penalizes large errors
- Easily differentiable
Closed Form Solution (Normal Equation)
L2 Loss (squared loss):
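In matrix form, with one common convention (the $\tfrac{1}{2}$ factor only rescales the gradient):

$$L(w) = \tfrac{1}{2}(Xw - y)^T (Xw - y)$$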
To find the minimizer, we solve $\frac{\partial L(w)}{\partial w} = 0$ for $w$.
Solution:
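Assuming $X^T X$ is invertible, the standard derivation gives:

$$\frac{\partial L(w)}{\partial w} = X^T (Xw - y) = 0 \quad\Longrightarrow\quad w = (X^T X)^{-1} X^T y$$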
Linear Regression has a closed form solution because the loss function is convex and quadratic.
For small to medium feature sets, the closed-form solution is typically preferred since it’s exact and direct. For very large feature sets or when $X^T X$ isn’t invertible, gradient descent becomes more practical.
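As a minimal illustration, here is a NumPy sketch of the normal equation on synthetic data (the data and variable names are my own, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a bias column of ones plus 3 features
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Normal equation: solve (X^T X) w = X^T y rather than forming an explicit inverse
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# If X^T X is singular (e.g., duplicated or collinear features), the pseudo-inverse
# still returns a least-squares solution
w_pinv = np.linalg.pinv(X) @ y
```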
IMPORTANT NOTE: Only a handful of ML problems have closed-form solutions, such as Linear Regression, Ridge Regression (Linear Regression with L2 regularization), Linear Discriminant Analysis, some simple Bayesian models, etc. Most problems do NOT have a closed-form solution, hence we need to solve them using iterative optimization algorithms like gradient descent. Examples of problems without closed-form solutions: Logistic Regression, Neural Networks, Support Vector Machines, etc.
Gradient Descent
Update weights using ($\alpha$ is the learning rate):
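Repeating until a stopping criterion is met (e.g., the change in the loss becomes small), the standard update is:

$$w \leftarrow w - \alpha \nabla L(w)$$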
where the loss function $L(w)$ is defined as (matrix form):
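Using the same convention as in the closed-form section:

$$L(w) = \tfrac{1}{2}(Xw - y)^T (Xw - y)$$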
or (component-wise):
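(summing over the $n$ training examples, with $x_i$ the $i$-th row of $X$):

$$L(w) = \tfrac{1}{2}\sum_{i=1}^{n}\big(w^T x_i - y_i\big)^2$$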
Note: $\nabla L(w) = \frac{\partial L(w)}{\partial w}$
Gradient Descent Options:
- (Batch) Gradient Descent (calculate for all of the training set X at once)
- Can be very slow
- Datasets may not fit in memory
- Doesn’t allow online (on-the-fly) model updates
- Guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces
- Stochastic Gradient Descent (shuffle X and evaluate one row at a time, check for stopping condition)
- Allow for online update with new examples
- Has high variance, which causes the objective function to fluctuate heavily
- Mini Batch Gradient Descent (use mini batches instead of evaluating one row at a time, check for stopping condition)
- Reduces the variance of the parameter/model updates, which can lead to more stable convergence
- Can make use of highly optimized matrix operations
- Adaptive learning rates (see AdaGrad algorithm)
Note: The closed form solution can be very slow on datasets with many rows, since computing $X^T X$ requires a full pass over all of them (and inverting it scales with the cube of the number of features)
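To make the three variants above concrete, here is a minimal NumPy sketch (the learning rate, epoch count, and the absence of an explicit stopping condition are illustrative simplifications, not prescriptions from these notes):

```python
import numpy as np

def grad(X, y, w):
    """Gradient of the mean squared-error loss (1/2n) * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, alpha=0.1, epochs=500, batch_size=None, seed=0):
    """batch_size=None -> batch GD, 1 -> stochastic GD, k -> mini-batch GD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    batch_size = batch_size or n
    for _ in range(epochs):
        idx = rng.permutation(n)                  # shuffle rows each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= alpha * grad(X[batch], y[batch], w)
    return w

# Synthetic data: bias column + 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=200)

w_batch = gradient_descent(X, y)                 # batch GD
w_sgd   = gradient_descent(X, y, batch_size=1)   # stochastic GD (noisier updates)
w_mini  = gradient_descent(X, y, batch_size=32)  # mini-batch GD
```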
L1/L2 Regularization (LASSO/Ridge Regression)
- L2:
- sum-of-squares error + $\alpha w^Tw$
- Can calculate the closed form solution (given below, after this list)
- Can use it with gradient descent
- Do not generally regularize the bias term
- Handles multicollinearity well
- L1:
- sum-of-squares error + $\alpha \lVert w \rVert_1$
- Drives many coefficients to exactly zero, so the non-zero coefficients indicate important features (built-in feature selection)
- Must use iterative methods (no closed form solution)
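For reference, the L2 (ridge) closed-form solution referred to above simply adds $\alpha I$ to the normal equation (with the bias term typically left unpenalized):

$$w = (X^T X + \alpha I)^{-1} X^T y$$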
Use L1 when:
- You have many features and suspect most are irrelevant
- You want interpretable models with feature selection
- You need to identify the most important predictors
Use L2 when:
- Most features are potentially useful
- You want stable coefficient estimates
- Computational efficiency is important
- Features are highly correlated
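A small scikit-learn comparison of the two penalties (the `alpha` values here are arbitrary illustrative choices, and scikit-learn's scaling of the penalty differs slightly from the $\alpha$ used above):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients, rarely exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives irrelevant coefficients to exactly zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))     # typically zero for the 7 irrelevant features
```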