Regression
Linear Regression
Prediction:
\[f_w(x) = y = w^T x + \beta = w_1 x + w_0\]Loss:
\[L(x,y,w) = (f_w(x) - y)^2\]Why use squared loss?
- Prevents errors from cancelling out
- Penalizes large errors
- Easily differentiable
Closed Form Solution (Normal Equation)
L2 Loss (squared loss):
\[L_2 = \frac{1}{N} \sum_i (w^T x_i - y_i)^2\]we have (solve $\frac{\partial L(w)}{\partial w} = 0$),
\[\frac{\partial L(w)}{\partial w} = \frac{2}{N}\left(X^T X w - X^T y\right)\]Solution:
\[w^* = (X^T X)^{-1} X^T y\]
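A minimal numpy sketch of the normal equation above; the toy data and the column of ones for the bias term are illustrative assumptions.

```python
import numpy as np

# Toy data: N examples, d features (illustrative)
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 7.0])

# Prepend a column of ones so w_0 acts as the bias term
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Closed form: w* = (X^T X)^{-1} X^T y
# np.linalg.solve is used instead of an explicit inverse for numerical stability
w_star = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

print(w_star)        # learned weights [w_0, w_1, w_2]
print(X_b @ w_star)  # predictions on the training data
```

Linear Regression with Gradient Descent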
Update weights using ($\alpha$ is the learning rate):
\[w := w - \alpha \cdot \nabla L(w)\]where the gradient $\nabla L(w)$ is (matrix form):
\[\nabla L(w) = \frac{2}{N} X^T(Xw - y)\]or (component-wise):
\[\nabla L(w) = \frac{2}{N} \sum_{i=1}^{N} (w^T x_i - y_i)\,x_i\]Note: $\nabla L(w) = \frac{\partial L(w)}{\partial w}$
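A minimal batch gradient descent sketch for the update rule above; the learning rate, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression (bias handling omitted)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)  # grad L(w) = (2/N) X^T (Xw - y)
        w = w - alpha * grad                  # w := w - alpha * grad L(w)
    return w

# Toy data: y is exactly 2*x1 + 1*x2, so w should approach [2.0, 1.0]
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = X @ np.array([2.0, 1.0])
print(gradient_descent(X, y))
```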
Gradient Descent Options:
- Gradient Descent (batch: compute the gradient over the entire training set X at once)
  - Can be very slow
  - Datasets may not fit in memory
  - Doesn't allow online (on-the-fly) model updates
  - Guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces
- Stochastic Gradient Descent (shuffle X and evaluate one row at a time, check for a stopping condition)
  - Allows for online updates with new examples
  - Updates have high variance, which causes the objective function to fluctuate heavily
- Mini-batch Gradient Descent (use mini-batches instead of evaluating one row at a time, check for a stopping condition)
  - Reduces the variance of the parameter updates, which can lead to more stable convergence
  - Can make use of highly optimized matrix operations
- Adaptive learning rates (see the AdaGrad algorithm)
Note: the closed-form solution can be very slow on large datasets, since forming $X^TX$ costs $O(Nd^2)$ and inverting it costs $O(d^3)$; gradient descent is preferred when $N$ or the number of features $d$ is large.
L1/L2 Regularization (LASSO/Ridge Regression)
- L2:
  - sum-of-squares error + $\alpha w^Tw$
  - Has a closed-form solution (a sketch follows the usage lists below)
  - Can be used with gradient descent
  - Do not generally regularize the bias term
  - Handles multicollinearity well
- L1:
  - sum-of-squares error + $\alpha \Vert w \Vert_1$
  - Produces sparse solutions: many coefficients become exactly zero, and the remaining non-zero coefficients indicate important features
  - Must use iterative methods (no closed-form solution)
Use L1 when:
- You have many features and suspect most are irrelevant
- You want interpretable models with feature selection
- You need to identify the most important predictors
Use L2 when:
- Most features are potentially useful
- You want stable coefficient estimates
- Computational efficiency is important
- Features are highly correlated
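A minimal numpy sketch of the ridge (L2) closed form, $w^* = (X^TX + \alpha I)^{-1}X^Ty$, with the bias term left unregularized; the $\alpha$ value and toy data are illustrative assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    """Ridge regression: w* = (X^T X + alpha * I)^{-1} X^T y, bias not penalized."""
    N, d = X.shape
    X_b = np.hstack([np.ones((N, 1)), X])  # prepend a column of ones for the bias
    penalty = alpha * np.eye(d + 1)
    penalty[0, 0] = 0.0                    # do not regularize the bias term
    return np.linalg.solve(X_b.T @ X_b + penalty, X_b.T @ y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 7.0])
print(ridge_closed_form(X, y, alpha=0.1))
```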
Polynomial Regression
Classification
Linear Classification Algorithms
$w$ is the normal vector to the hyperplane that separates points into the positive and negative classes.
Least-Square Classification
Very similar to linear regression. The same principles apply: assign {-1, 1} to the two classes, then solve the least-squares optimization. The decision boundary is always $f(x)=0$, so the prediction uses the sign of $f(x)$: \(\begin{cases} f(x) \geq 0 & \text{decide } y_1 \\ f(x) < 0 & \text{decide } y_2 \end{cases}\) (A minimal sketch appears after the list of cons below.)
2 classes: 
3 classes (example why it might perform badly):
For >2 classes, use one-hot-encoding.
Cons:
- Sensitive to outliers (squared error penalizes large mistakes heavily)
- Not ideal for classification (treats it as regression)
- Can give predictions outside [-1, +1] range
- Often outperformed by logistic regression or SVM
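A minimal numpy sketch of two-class least-squares classification as described above; the toy data set is an illustrative assumption.

```python
import numpy as np

# Toy 2D data: two classes labeled +1 and -1 (illustrative)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class +1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# Fit exactly like linear regression on the {-1, +1} targets
X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)     # normal equation

# Predict with the sign of f(x) = w^T x; the decision boundary is f(x) = 0
preds = np.sign(X_b @ w)
print(preds)  # recovers the training labels on this separable toy set
```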
Brain Teaser: You can also try to train two different linear models and interpret the results.
Linear Discriminant Analysis / Fisher’s Linear Discriminant
Good Illustration: An illustrative introduction to Fisher’s Linear Discriminant - Thalles’ blog
Find an orientation along which the projected samples are well separated. 
Fisher’s Linear Discriminant
- We want the projected class means to be as far apart as possible, while the denominator (the total within-class scatter of the projected samples) stays small: \(\arg\max_{w} \frac{\vert m_{+} - m_{-} \vert ^2}{s_{+}^2 + s_{-}^2}\) where $s_+^2$ (and analogously $s_-^2$) is the within-class scatter of the projected samples (almost like a variance), \(s_{+}^2 = \sum_{x \in C_{+}} (\mathbf{w}^T \mathbf{x} - m_{+})^2\) and $m_+$ is the projected class mean, \(m_+ = \frac{1}{N_+} \sum_{x \in C_+} \mathbf{w}^T \mathbf{x} = \mathbf{w}^T \mathbf{m}_+\) Define the following matrices: \(S_+ = \sum_{x \in C_+} (\mathbf{x} - \mathbf{m}_+)(\mathbf{x} - \mathbf{m}_+)^T\) \(S_W = S_+ + S_-\) \(S_B = (\mathbf{m}_+ - \mathbf{m}_-)(\mathbf{m}_+ - \mathbf{m}_-)^T\) Using these we obtain \(J(\mathbf{w}) = \frac{\vert m_+ - m_- \vert ^2}{s_+^2 + s_-^2} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}\) $J$ is maximized when (take the derivative with respect to $\mathbf{w}$ and set it to zero) \((\mathbf{w}^T S_B \mathbf{w})\,S_W \mathbf{w} = (\mathbf{w}^T S_W \mathbf{w})\,S_B \mathbf{w}\) Notice:
- $\mathbf{w}^T S_W \mathbf{w}$ and $\mathbf{w}^T S_B \mathbf{w}$ are scalars, thus we get \(\alpha S_W \mathbf{w} = \beta S_B \mathbf{w}\) If we take $\lambda = \alpha / \beta$ we get $S_B \mathbf{w} = \lambda S_W \mathbf{w}$, which is a generalized eigenvalue problem: \(S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}\) From the generalized eigenvalue equation, we derive that the optimal $\mathbf{w}$ is: \(\mathbf{w} \propto S_W^{-1}(\mathbf{m}_+ - \mathbf{m}_-)\)
[!NOTE] Yes but how?! Notice that $(\mathbf{m}_+ - \mathbf{m}_-)^T \mathbf{w}$ in $S_B \mathbf{w} = (\mathbf{m}_+ - \mathbf{m}_-)(\mathbf{m}_+ - \mathbf{m}_-)^T \mathbf{w}$ is a scalar. That means the equation looks like $S_B \mathbf{w} = c \cdot (\mathbf{m}_+ - \mathbf{m}_-)$, i.e., $S_B \mathbf{w}$ is always proportional to $(\mathbf{m}_+ - \mathbf{m}_-)$ regardless of what $\mathbf{w}$ is! Plugging this into the generalized eigenvalue equation gives $\lambda \mathbf{w} = S_W^{-1} S_B \mathbf{w} \propto S_W^{-1}(\mathbf{m}_+ - \mathbf{m}_-)$, and since only the direction of $\mathbf{w}$ matters, $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_+ - \mathbf{m}_-)$.
To summarize, Fisher’s Discriminant,
- Gives the linear function with the maximum ratio of between-class scatter to within-class scatter
- The problem, e.g. classification, has been reduced from a d-dimensional problem to a more manageable one-dimensional problem.
- Can be extended to multiclass classification
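A minimal numpy sketch of the two-class Fisher direction $w \propto S_W^{-1}(\mathbf{m}_+ - \mathbf{m}_-)$; the toy data and the midpoint threshold are illustrative assumptions.

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Return w proportional to S_W^{-1} (m_+ - m_-)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter S_W = S_+ + S_-
    S_pos = (X_pos - m_pos).T @ (X_pos - m_pos)
    S_neg = (X_neg - m_neg).T @ (X_neg - m_neg)
    S_W = S_pos + S_neg
    return np.linalg.solve(S_W, m_pos - m_neg)

# Toy 2D data (illustrative)
rng = np.random.default_rng(0)
X_pos = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X_neg = rng.normal([-2.0, -2.0], 0.5, size=(50, 2))

w = fisher_direction(X_pos, X_neg)
# Project onto w; classify by comparing to the midpoint of the projected class means
threshold = 0.5 * ((X_pos @ w).mean() + (X_neg @ w).mean())
print(w, threshold)
print(np.mean((X_pos @ w) > threshold), np.mean((X_neg @ w) < threshold))  # both should be 1.0
```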
Support Vector Machines
Rosenblatt's Perceptron (not very popular right now?)
Logistic Regression (usually performs better than the algorithms above)
Where can linear classifiers fail?

k-NN for Estimation and Prediction
You can use k-NN for estimating missing values (imputation) or for prediction.
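A minimal numpy sketch of k-NN regression, which can be used both for prediction and for filling in a missing value; $k$ and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest rows
    return y_train[nearest].mean()                     # mean for regression / imputation

# Toy data (illustrative); for classification, take a majority vote instead of the mean
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 3.0], [4.0, 4.5]])
y_train = np.array([1.0, 1.5, 3.0, 4.2])
print(knn_predict(X_train, y_train, np.array([2.5, 2.0]), k=2))
```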
Other
Bayesian Decision Methods
Given observation x, the decision is based on posterior probability:
- Decide $y_1$, if $P(y_1 \vert x) > P(y_2 \vert x)$
- Decide $y_2$, if $P(y_2 \vert x) > P(y_1 \vert x)$
Note that $P(y_1 \vert x) = \frac{P(x \vert y_1)\,P(y_1)}{P(x)}$ and $P(y_2 \vert x) = \frac{P(x \vert y_2)\,P(y_2)}{P(x)}$, so the probability $P(x)$ does not matter in our decision (it is the same for both).
Probability of error: \(P(error \vert x) = \begin{cases} P(y_1 \vert x), & \text{if decide } y_2 \text{ but } y_1 \text{ is true} \\ P(y_2 \vert x), & \text{if decide } y_1 \text{ but } y_2 \text{ is true} \end{cases}\)
The goal is to minimize the error (for a single instance): \(P(error \vert x) = \min[P(y_1 \vert x), P(y_2 \vert x)]\)
Minimizing average error: \(P(error) = \int_{-\infty}^{\infty} P(error, x)\,dx = \int_{-\infty}^{\infty} P(error \vert x)p(x)\,dx\)
Generalizing for more classes:
- Feature vector $x = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$: allows the use of more than one feature
- $y_1, y_2, \ldots, y_c$: finite set of $c$ states of nature, i.e., categories (can be more than two)
- $\alpha_1, \alpha_2, \ldots, \alpha_a$: finite set of $a$ possible actions
- $\lambda(\alpha_i \vert y_j)$: loss function, describing the loss incurred for taking action $\alpha_i$ when the state of nature is $y_j$
- $P(y_i)$: prior probability that the state of nature is $y_i$
- $p(x \vert y_i)$: state-conditional probability density of $x$
The expected loss, or conditional risk, of taking action $\alpha_i$ is: \(R(\alpha_i\vert x) = \sum_{j=1}^{c} \lambda(\alpha_i\vert y_j)P(y_j \vert x)\)
Choose $\alpha(x)$ that minimizes overall risk:
\(R = \int R(\alpha(x) \vert x)\,p(x)\,dx\) \(R(\alpha_i \vert x) = \sum_{j=1}^{c} \lambda(\alpha_i \vert y_j)P(y_j \vert x)\) \(\alpha^* = \arg\min_{\alpha_i} R(\alpha_i \vert x)\)
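A minimal sketch of the minimum-risk decision rule above; the priors, likelihood values, and loss matrix are made-up illustrative assumptions.

```python
import numpy as np

# Illustrative two-class, two-action example at a single observed x
priors = np.array([0.6, 0.4])        # P(y_1), P(y_2)
likelihoods = np.array([0.2, 0.7])   # p(x | y_1), p(x | y_2)
loss = np.array([[0.0, 1.0],         # lambda(alpha_i | y_j): rows are actions,
                 [2.0, 0.0]])        # columns are states of nature

# Posterior P(y_j | x) via Bayes' rule; P(x) is just the normalizer
posterior = likelihoods * priors
posterior /= posterior.sum()

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | y_j) P(y_j | x)
risk = loss @ posterior
best_action = np.argmin(risk)        # alpha* = argmin_i R(alpha_i | x)
print(posterior, risk, best_action)
```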
