3. Gradients
- Loss derivative: n-dimensional, where each axis is a parameter of the model that we can tweak.
- We try to find the global minimum, i.e. where the mean loss is the least.
Closed Form for MSE Regression
We try to minimize the MSE. Since the mean is just the sum times a constant, we can equivalently minimize the loss sum:
$L(w) = \|Xw - y\|^2$
where $y$ = output vector, $w$ = weights vector, and $X$ = input matrix. Therefore, for the optimal weights: $w^* = \arg\min_w \|Xw - y\|^2$
Minima always have a slope of 0, so we are looking for $w^*$ s.t. $\nabla_w L(w^*) = 0$:
$w^* = (X^\top X)^{-1} X^\top y$
which is the normal equation we already know 🥳.
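A minimal NumPy sketch of the normal equation on synthetic data (the data, shapes, and true weights are assumptions for illustration; in practice `np.linalg.solve` or `lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

# Toy data: y = 2*x1 + 3*x2 plus a little noise (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)

# Normal equation: w* = (X^T X)^{-1} X^T y, solved as a linear system.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

The recovered `w_star` should be close to the true weights `[2, 3]` up to the noise level.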
Gradient Descent
It is not always possible to find the minimum in closed form. Therefore: iteratively move towards the minimum by stepping in the direction of the negative gradient, where $\eta$ is the step size:
$w_{t+1} = w_t - \eta \nabla L(w_t)$
Usually: interrupt when the progress distance is low: $\|w_{t+1} - w_t\| < \epsilon$
Justification:
We know that $w_{t+1} = w_t + \eta d$, where $d$ is some normalized vector. We estimate the loss at $w_{t+1}$ using a first-order Taylor expansion (valid because $w_{t+1}$ is in the vicinity of $w_t$ for sufficiently small $\eta$):
$L(w_t + \eta d) = L(w_t) + \eta \nabla L(w_t)^\top d + O(\eta^2)$
For small $\eta$, we neglect higher-order remainders:
$L(w_t + \eta d) \approx L(w_t) + \eta \nabla L(w_t)^\top d$
We now try to find a $d$ of length 1 (we control the step length with $\eta$) that minimizes $\nabla L(w_t)^\top d$. By Cauchy-Schwarz:
$|\nabla L(w_t)^\top d| \le \|\nabla L(w_t)\| \, \|d\| = \|\nabla L(w_t)\|$
Equality only holds if $d$ and $\nabla L(w_t)$ point along the same direction; the inner product is minimized when they point in opposite directions. Hence:
$d = -\frac{\nabla L(w_t)}{\|\nabla L(w_t)\|}$
Plugging into $w_{t+1} = w_t + \eta d$:
$w_{t+1} = w_t - \eta \frac{\nabla L(w_t)}{\|\nabla L(w_t)\|}$
We skip the normalization to save computation and not overshoot minima.
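The update rule and stopping criterion above can be sketched as follows (the quadratic example loss, step size, and tolerance are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Plain gradient descent: step against the gradient until progress stalls."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - eta * grad(w)
        if np.linalg.norm(w_new - w) < eps:  # stop: ||w_{t+1} - w_t|| is small
            return w_new
        w = w_new
    return w

# Example: minimize L(w) = ||w - c||^2 with gradient 2*(w - c); minimum is at c.
c = np.array([1.0, -2.0])
w_min = gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0])
```

Note that the unnormalized gradient makes the steps shrink automatically as the gradient vanishes near the minimum.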
Optimizer
Goal: Large steps in flat areas, small steps in high curvature ones, dampened oscillations
Larger step sizes make convergence faster but may lead to oscillation on ill-conditioned problems (steep in one direction, flat in another).
Momentum: combine new step with weighted previous step.
Adaptive methods (Adam, AdaGrad, RMSProp): different step size for each entry $w_i$. Intuitively: the entries $i$ whose gradients have already been large get a smaller step size, e.g.
$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{s_{t,i}} + \epsilon} \, g_{t,i}$
where $s_{t,i}$ accumulates past squared gradients and $\epsilon$ prevents division by 0.
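The two ideas can be sketched as single update steps (a minimal sketch; the hyperparameter values are common defaults, not prescribed by these notes):

```python
import numpy as np

def momentum_step(w, g, v, eta=0.05, beta=0.9):
    """Momentum: blend the new gradient with the weighted previous step."""
    v = beta * v + g                 # accumulated gradient history
    return w - eta * v, v

def rmsprop_step(w, g, s, eta=0.01, rho=0.9, eps=1e-8):
    """RMSProp-style adaptive step: entries with large past gradients step less."""
    s = rho * s + (1 - rho) * g**2              # running mean of squared gradients
    return w - eta * g / (np.sqrt(s) + eps), s  # eps prevents division by 0
```

Both are called in a loop, threading the state (`v` or `s`) through the iterations; Adam combines the two ideas.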
Mini Batching
Compute each gradient update on a small random subset (mini-batch) of the training data instead of the full set, cycling through the data continuously.
Pros:
- Less memory (although same as calculating gradients for subsets first and then averaging)
- Converges faster
- Variance helps escape poor local minima
Cons:
- Never settles (can be solved with optimizer)
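A minimal mini-batch SGD sketch for MSE regression (the synthetic data, batch size, and step size are assumptions for illustration):

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield one epoch of shuffled mini-batch index arrays."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

# Synthetic linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -1.0, 0.5])

# Mini-batch SGD: one gradient step per batch, several passes over the data.
w = np.zeros(3)
for epoch in range(50):
    for idx in minibatch_indices(len(X), batch_size=32, rng=rng):
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the batch
        w -= 0.05 * grad
```

Reshuffling each epoch is what injects the gradient variance mentioned above.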
Convexity
Every 1-D slice of the function curves upwards (or is flat) everywhere, i.e. roughly looks like a quadratic or linear function with no bumps or dips.
The more convex the loss, the better for learning: a convex function has no suboptimal local minima to get stuck in.
**Conditions** (equivalent; must hold for all $x, y$ in the domain):
- $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for $\lambda \in [0, 1]$
  - Pick any two inputs $x$ and $y$
  - The line between $f(x)$ and $f(y)$ is $\lambda f(x) + (1 - \lambda) f(y)$
  - For any input between $x$ and $y$, i.e. $\lambda x + (1 - \lambda) y$, the loss must be below this line
- $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
  - The graph of the function lies above all its tangent planes
  - We check where we would land if we moved from $x$ towards $y$ along the tangent plane
- $\nabla^2 f(x)$ is PSD
  - The Hessian $\nabla^2 f$ contains the 2nd derivatives along its diagonal.
Strong convexity: no linear parts (the curvature is bounded below by a positive constant, $\nabla^2 f \succeq m I$ for some $m > 0$).
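The PSD condition can be checked numerically; for the MSE regression loss from above, the Hessian is $2 X^\top X$, which is always PSD (a minimal sketch with assumed random data):

```python
import numpy as np

# For MSE regression, L(w) = ||Xw - y||^2 has Hessian 2 X^T X,
# which is always PSD -- so the MSE loss is convex in w.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)              # real eigenvalues (H is symmetric)
is_convex = bool(np.all(eigvals >= -1e-10))  # PSD up to numerical tolerance
```

If the smallest eigenvalue is strictly positive, the loss is even strongly convex in $w$.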