Regression
Loss Functions
Def: Characterize how “bad” a prediction is. Of the form $L(r)$, where $r = y - \hat{y}$ is the residual.
- Square/L2 loss: $L(r) = \frac{1}{2} r^2$
  - $\frac{1}{2}$ is a constant factor and therefore scales all losses equally. It is typically included b/c it cancels when differentiating, giving the clean derivative $L'(r) = r$.
  - Very sensitive to outliers since the residual is squared.
- Absolute/L1 loss: $L(r) = |r|$
  - Not differentiable at zero.
- Huber loss: $L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{for } |r| \le \delta \\ \delta \left( |r| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$
  - Uses the square loss for $|r| \le \delta$ and a linear loss for the rest.
  - For the graph to be smooth (continuously differentiable):
    - The slope of the linear parts must match the derivative of the square part at the transition: $\frac{d}{dr} \frac{1}{2} r^2 = r$, which equals $\delta$ at $|r| = \delta$, so the linear parts have slope $\delta$.
    - The y-coordinate must match: the square part gives $\frac{1}{2}\delta^2$ at the transition, but $\delta |r|$ alone would give $\delta^2$ there, so we have to subtract $\frac{1}{2}\delta^2$.
  - Advantage: Less sensitive to outliers while still differentiable.
- Asymmetric loss (diff. than graphic): $L_\tau(r) = \begin{cases} \tau \, |r| & \text{when over-shooting } (r < 0) \\ (1 - \tau) \, |r| & \text{when under-shooting } (r > 0) \end{cases}$ with $\tau \in (0, 1)$
  - Like absolute loss, but with two different slopes on the two sides of the y-axis.
  - Higher $\tau$ corresponds to a steeper graph when over-shooting, lower $\tau$ to a steeper graph when under-shooting.
For multiple datapoints: calculate the average loss (e.g. mean squared error or mean Huber loss).
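The losses above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation; the function names and the $\tau$-parameterization of the asymmetric loss are my own.

```python
import numpy as np

def square_loss(r):
    # 1/2 r^2; the 1/2 makes the derivative simply r
    return 0.5 * np.asarray(r, dtype=float) ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # square loss inside |r| <= delta, linear (slope delta) outside;
    # subtracting delta/2 makes the two pieces meet continuously
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def asymmetric_loss(r, tau=0.5):
    # slope tau when over-shooting (r < 0), 1 - tau when under-shooting
    r = np.asarray(r, dtype=float)
    return np.where(r < 0, tau * np.abs(r), (1.0 - tau) * np.abs(r))

# mean loss over multiple datapoints, e.g. MSE:
residuals = np.array([1.0, -2.0])
mse = np.mean(square_loss(residuals))  # mean of 0.5 and 2.0 -> 1.25
```

Note how at $|r| = \delta$ both Huber branches return the same value ($\frac{1}{2}\delta^2$), which is exactly the continuity condition from the bullet points.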
Gradient
- Loss gradient: n-dimensional, where each axis is a parameter of the model that we can tweak.
- We try to find the global minimum, i.e. the parameter values where the mean loss is smallest.
Closed Form for MSE Regression
Let $y^{(i)}$ be the $i$-th output and $X$ the design matrix whose $i$-th row is the input $\mathbf{x}^{(i)}$.
The optimal weights are defined by: $\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)} \right)^2 = \arg\min_{\mathbf{w}} \frac{1}{n} \| X\mathbf{w} - \mathbf{y} \|^2$
Since minima always have a slope of 0, we are looking for $\mathbf{w}$ s.t. $\nabla_{\mathbf{w}} \frac{1}{n} \| X\mathbf{w} - \mathbf{y} \|^2 = \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y}) = \mathbf{0}$,
i.e. $X^\top X \mathbf{w} = X^\top \mathbf{y} \;\Rightarrow\; \mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$, which is the normal equation we already know 🥳.
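The closed form can be checked numerically; a small sketch with made-up data (solving the linear system $X^\top X \mathbf{w} = X^\top \mathbf{y}$ rather than forming the inverse, which is the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))        # design matrix, one row per datapoint
true_w = np.array([1.0, 2.0, 3.0])
y = X @ true_w                      # noiseless targets for illustration

# normal equation: X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless targets, the recovered `w_star` matches `true_w` up to floating-point error; with noisy targets it would instead be the least-squares fit.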