3. Gradients
- Loss derivative: n-dimensional, where each axis is a parameter of the model that we can tweak.
- We try to find the global minimum, i.e. where the mean loss is the least.
Closed Form for MSE Regression
We try to minimize the MSE. Since the mean is just the sum times a constant, we can equivalently minimize the loss sum:
$L(w) = \|Xw - y\|^2$
where $y$ = output vector, $w$ = weights vector, and $X$ = input matrix. Therefore, for the optimal weights: $w^* = \arg\min_w \|Xw - y\|^2$
Minima always have a slope of 0, so we are looking for $w^*$ s.t. $\nabla_w L(w^*) = 0$:
$w^* = (X^\top X)^{-1} X^\top y$
which is the normal equation we already know 🥳.
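A minimal NumPy sketch of the normal equation on synthetic data (the data, shapes, and true weights are assumptions for illustration; in practice `np.linalg.solve` or `lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

# Toy data: y = 2*x1 + 3*x2 plus a little noise (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)

# Normal equation: w* = (X^T X)^{-1} X^T y, solved as a linear system.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

The recovered `w_star` should be close to the true weights `[2, 3]` up to the noise level.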
Gradient Descent
It is not always possible to find the minimum in closed form. Therefore: iteratively move towards the minimum by stepping in the direction of the negative gradient, where $\eta$ is the step size:
$w_{t+1} = w_t - \eta \nabla L(w_t)$
Usually: interrupt when the progress distance is low: $\|w_{t+1} - w_t\| < \epsilon$
Justification:
We know that $w_{t+1} = w_t + \eta d$, where $d$ is some normalized vector. We estimate the loss at $w_{t+1}$ using a first-order Taylor expansion (valid because $w_{t+1}$ is in the vicinity of $w_t$ for sufficiently small $\eta$):
$L(w_t + \eta d) = L(w_t) + \eta \nabla L(w_t)^\top d + O(\eta^2)$
For small $\eta$, we neglect higher-order remainders:
$L(w_t + \eta d) \approx L(w_t) + \eta \nabla L(w_t)^\top d$
We now try to find a $d$ of length 1 (we control the step length with $\eta$) that minimizes $\nabla L(w_t)^\top d$. By Cauchy-Schwarz:
$|\nabla L(w_t)^\top d| \le \|\nabla L(w_t)\| \, \|d\| = \|\nabla L(w_t)\|$
Equality only holds if $d$ and $\nabla L(w_t)$ point along the same direction; the inner product is minimized when they point in opposite directions. Hence:
$d = -\frac{\nabla L(w_t)}{\|\nabla L(w_t)\|}$
Plugging into $w_{t+1} = w_t + \eta d$:
$w_{t+1} = w_t - \eta \frac{\nabla L(w_t)}{\|\nabla L(w_t)\|}$
We skip the normalization to save computation and not overshoot minima.
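The update rule and stopping criterion above can be sketched as follows (the quadratic example loss, step size, and tolerance are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Plain gradient descent: step against the gradient until progress stalls."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - eta * grad(w)
        if np.linalg.norm(w_new - w) < eps:  # stop: ||w_{t+1} - w_t|| is small
            return w_new
        w = w_new
    return w

# Example: minimize L(w) = ||w - c||^2 with gradient 2*(w - c); minimum is at c.
c = np.array([1.0, -2.0])
w_min = gradient_descent(lambda w: 2 * (w - c), w0=[0.0, 0.0])
```

Note that the unnormalized gradient makes the steps shrink automatically as the gradient vanishes near the minimum.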
Optimizer
Goal: Large steps in flat areas, small steps in high curvature ones, dampened oscillations
Larger step sizes make convergence faster but may lead to oscillation on ill-conditioned problems (steep in one direction, flat in another).
Momentum: combine new step with weighted previous step.
Adaptive methods (Adam, AdaGrad, RMSProp): different step size for each entry $w_i$. Intuitively: the entries $i$ whose gradients have already been large get a smaller step size, e.g.
$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{s_{t,i}} + \epsilon} \, g_{t,i}$
where $s_{t,i}$ accumulates past squared gradients and $\epsilon$ prevents division by 0.
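The two ideas can be sketched as single update steps (a minimal sketch; the hyperparameter values are common defaults, not prescribed by these notes):

```python
import numpy as np

def momentum_step(w, g, v, eta=0.05, beta=0.9):
    """Momentum: blend the new gradient with the weighted previous step."""
    v = beta * v + g                 # accumulated gradient history
    return w - eta * v, v

def rmsprop_step(w, g, s, eta=0.01, rho=0.9, eps=1e-8):
    """RMSProp-style adaptive step: entries with large past gradients step less."""
    s = rho * s + (1 - rho) * g**2              # running mean of squared gradients
    return w - eta * g / (np.sqrt(s) + eps), s  # eps prevents division by 0
```

Both are called in a loop, threading the state (`v` or `s`) through the iterations; Adam combines the two ideas.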
Mini Batching
Compute each gradient update on a small random subset (mini-batch) of the training data instead of the full set, cycling through the data continuously.
Pros:
- Less memory (although same as calculating gradients for subsets first and then averaging)
- Converges faster
- Variance helps escape poor local minima
Cons:
- Never settles (can be solved with optimizer)
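A minimal mini-batch SGD sketch for MSE regression (the synthetic data, batch size, and step size are assumptions for illustration):

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield one epoch of shuffled mini-batch index arrays."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

# Synthetic linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -1.0, 0.5])

# Mini-batch SGD: one gradient step per batch, several passes over the data.
w = np.zeros(3)
for epoch in range(50):
    for idx in minibatch_indices(len(X), batch_size=32, rng=rng):
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the batch
        w -= 0.05 * grad
```

Reshuffling each epoch is what injects the gradient variance mentioned above.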
Convexity
Every 1-D slice of the function curves upwards (or is flat) everywhere, i.e. roughly looks like a quadratic or linear function with no bumps or dips.
The more convex the loss, the better for learning: a convex function has no suboptimal local minima to get stuck in.
**Conditions** (equivalent; must hold for all $x, y$ in the domain):
- $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for $\lambda \in [0, 1]$
  - Pick any two inputs $x$ and $y$
  - The line between $f(x)$ and $f(y)$ is $\lambda f(x) + (1 - \lambda) f(y)$
  - For any input between $x$ and $y$, i.e. $\lambda x + (1 - \lambda) y$, the loss must be below this line
- $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$
  - The graph of the function lies above all its tangent planes
  - We check where we would land if we moved from $x$ towards $y$ along the tangent plane
- $\nabla^2 f(x)$ is PSD
  - The Hessian $\nabla^2 f$ contains the 2nd derivatives along its diagonal.
Strong convexity: no linear parts (the curvature is bounded below by a positive constant, $\nabla^2 f \succeq m I$ for some $m > 0$).
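The PSD condition can be checked numerically; for the MSE regression loss from above, the Hessian is $2 X^\top X$, which is always PSD (a minimal sketch with assumed random data):

```python
import numpy as np

# For MSE regression, L(w) = ||Xw - y||^2 has Hessian 2 X^T X,
# which is always PSD -- so the MSE loss is convex in w.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
H = 2 * X.T @ X
eigvals = np.linalg.eigvalsh(H)              # real eigenvalues (H is symmetric)
is_convex = bool(np.all(eigvals >= -1e-10))  # PSD up to numerical tolerance
```

If the smallest eigenvalue is strictly positive, the loss is even strongly convex in $w$.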