3. Gradients

  • Gradient of the loss: an n-dimensional vector of partial derivatives, where each axis is a parameter of the model that we can tweak.
  • We try to find the global minimum, i.e. where the mean loss is the least.

Closed Form for MSE Regression

We try to minimize the MSE. Since the mean is just the sum scaled by a constant, we can equivalently minimize the summed loss:

L(w) = ‖Xw − y‖²

where y = output vector, w = weights vector, and X = input matrix. Minima always have a slope of 0, so for the optimal weights we are looking for w s.t.

∇L(w) = 2Xᵀ(Xw − y) = 0  ⟹  XᵀXw = Xᵀy  ⟹  w = (XᵀX)⁻¹Xᵀy,

which is the normal equation we already know 🥳.
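A minimal sketch of the normal equation in NumPy (the data here is synthetic and noise-free, chosen only for illustration; solving the linear system is preferred over inverting XᵀX explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # input matrix: 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])     # hypothetical ground-truth weights
y = X @ true_w                          # noise-free outputs for illustration

# Normal equation: solve XᵀX w = Xᵀy instead of computing (XᵀX)⁻¹ directly
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

On noise-free data this recovers the true weights exactly (up to floating-point error).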

Gradient Descent

It is not always possible to find a minimum in closed form. Therefore: iteratively move towards a minimum by stepping in the direction of the negative gradient, where η is the step size:

w ← w − η∇L(w)

Usually: stop when the progress per step is small, i.e. ‖w_new − w_old‖ < ε.
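The update rule and stopping criterion above can be sketched as follows (a toy implementation for the mean squared loss; the step size and tolerance are arbitrary choices, not prescribed by the notes):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, eps=1e-10, max_iter=10_000):
    """Minimize L(w) = ||Xw - y||^2 / n by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # ∇L(w)
        w_new = w - eta * grad                  # step against the gradient
        if np.linalg.norm(w_new - w) < eps:     # stop when progress is small
            break
        w = w_new
    return w_new
```

On well-conditioned data this converges to the same solution as the normal equation.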

Optimizer

Goal: large steps in flat areas, small steps in high-curvature ones, and dampened oscillations.

Larger step sizes make convergence faster but may lead to oscillation with ill-conditioned gradients.

Momentum: combine the new gradient with a weighted copy of the previous step: v ← βv + ∇L(w), then w ← w − ηv.
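A sketch of a single momentum (heavy-ball) update; the default η and β values are common choices, not fixed by the notes:

```python
import numpy as np

def momentum_step(w, grad, v, eta=0.01, beta=0.9):
    """One momentum update: blend the new gradient with the previous step."""
    v = beta * v + grad    # weighted previous step plus new gradient
    w = w - eta * v        # move along the accumulated direction
    return w, v
```

The accumulated direction v smooths out oscillating gradient components while reinforcing consistent ones.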

Adaptive methods (Adam, AdaGrad, RMSProp): a different step size for each entry wᵢ. Intuitively: the entries i whose gradients were already large get a smaller step size, by dividing each entry's step by a running average of its squared gradients, √vᵢ + ε, with a small

ε to prevent division by 0.
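As one concrete instance of an adaptive method, here is a sketch of a single Adam update (the hyperparameter defaults follow the original Adam paper; the bias-correction terms are part of Adam, not of the notes above):

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-entry step sizes from running gradient moments."""
    m = b1 * m + (1 - b1) * grad        # running mean of gradients (momentum)
    v = b2 * v + (1 - b2) * grad**2     # running mean of squared gradients
    m_hat = m / (1 - b1**t)             # bias correction for early steps
    v_hat = v / (1 - b2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # eps prevents division by 0
    return w, m, v
```

Entries with consistently large squared gradients get a large vᵢ and hence a smaller effective step, matching the intuition above.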

Mini Batching

In each step, compute the gradient on only a (random) subset of the training data and adjust the weights with it.

Pros:

  • Less memory per step (although equivalent to calculating gradients for the subsets first and then averaging)
  • Converges faster
  • Variance helps escape poor local minima

Cons:

  • Never settles at the minimum (can be mitigated with an optimizer)
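A sketch of mini-batch SGD for the mean squared loss; batch size, step size, and epoch count are illustrative choices:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=16, eta=0.05, epochs=200, seed=0):
    """Mini-batch SGD for the mean squared loss, reshuffling each epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # new random order each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient on subset
            w -= eta * grad
    return w
```

Only one batch needs to be in memory at a time, and the per-batch gradient noise is exactly the variance that can help escape poor local minima.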

Convexity

Every slice through the loss surface roughly looks like either a quadratic or a linear function, i.e. it curves upwards (or is flat) everywhere and never bends downwards.

The more convex the loss, the better for learning (less likely to get stuck in local minima).

Conditions for convexity of f, for all inputs x₁, x₂:

  1. f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂) for λ ∈ [0, 1]
    • Pick any two inputs x₁ and x₂.
    • The line between f(x₁) and f(x₂) is λf(x₁) + (1 − λ)f(x₂).
    • For any input between x₁ and x₂, i.e. λx₁ + (1 − λ)x₂, the loss must be below this line.
    • Equivalent first-order condition: f(x₂) ≥ f(x₁) + ∇f(x₁)ᵀ(x₂ − x₁), i.e. the graph of the function lies above all its tangent planes: we check where we would land if we moved from x₁ in the direction of x₂ − x₁.
  2. The Hessian ∇²f(x) is PSD
    • The Hessian contains the 2nd derivatives along its diagonal (and the mixed partial derivatives off the diagonal).
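For a quadratic f(w) = wᵀAw the Hessian is the constant matrix A + Aᵀ, so condition 2 can be checked directly with an eigenvalue decomposition (the small tolerance guards against floating-point round-off; the function name is my own):

```python
import numpy as np

def is_convex_quadratic(A):
    """Check convexity of f(w) = wᵀAw via its Hessian A + Aᵀ being PSD."""
    H = A + A.T                                        # constant Hessian
    return bool(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # all eigenvalues ≥ 0
```

For example, f(w) = w₁² + w₂² (A = I) is convex, while the saddle f(w) = w₁² − w₂² is not.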

Strong convexity: no linear parts, i.e. the curvature is bounded below by some m > 0 everywhere.