1. Loss Functions

Def: A loss function characterizes how “bad” a prediction is.

Regression

Of the form $L(y, \hat{y}) = \rho(r)$, where $r = y - \hat{y}$ is the residual.

  • Square/L2 loss: $\rho(r) = \frac{1}{2}r^2$
    • $\frac{1}{2}$ is a constant factor and therefore scales all losses equally. It is typically included b/c it cancels when differentiating: $\rho'(r) = r$.
    • Very sensitive to outliers since the residual is squared
  • Absolute/L1 loss: $\rho(r) = |r|$
    • Not differentiable at zero.
  • Huber loss: $\rho_\delta(r) = \frac{1}{2}r^2$ for $|r| \le \delta$, and $\delta|r| - \frac{1}{2}\delta^2$ otherwise
    • Uses the square loss for $|r| \le \delta$ and a linear loss for the rest
    • For the graph to be smooth (continuously differentiable):
      • The first derivative must match at the transition: the linear parts get slope $\delta$, since the square part has $\rho'(r) = r$ there
      • The y-coordinate must match: $\frac{1}{2}\delta^2$ for the square part, but the linear part alone gives $\delta \cdot \delta = \delta^2$, so we have to subtract $\frac{1}{2}\delta^2$
    • Advantage: Less sensitive to outliers while still differentiable
  • Asymmetric loss (diff than graphic): e.g. $\rho_\tau(r) = -\tau r$ for $r < 0$ and $(1-\tau)\,r$ for $r \ge 0$, with $\tau \in (0, 1)$
    • Like absolute loss, but with two different slopes on both sides of the y-axis.
    • Higher $\tau$ corresponds to a steeper graph when over-shooting, lower $\tau$ to a steeper graph when under-shooting.
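
The four regression losses above can be sketched in NumPy; this is a minimal sketch assuming the residual convention $r = y - \hat{y}$ and hypothetical default values for the hyperparameters `delta` and `tau`:

```python
import numpy as np

def square_loss(r):
    # the factor 1/2 cancels when differentiating: d/dr (r^2 / 2) = r
    return 0.5 * r**2

def absolute_loss(r):
    # not differentiable at r = 0
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # square loss near zero, linear beyond |r| = delta;
    # subtracting delta^2 / 2 makes the two pieces meet continuously
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)

def asymmetric_loss(r, tau=0.7):
    # two different slopes on the two sides of zero; with r = y - y_hat,
    # over-shooting (r < 0) gets slope tau (assumed convention)
    return np.where(r < 0, -tau * r, (1 - tau) * r)

# Huber matches the square loss exactly at the transition point:
print(huber_loss(np.array([1.0]), delta=1.0))  # [0.5]
```

The continuity of the Huber loss can be checked numerically by evaluating both pieces at $|r| = \delta$; both give $\frac{1}{2}\delta^2$.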

For multiple datapoints: Calculate the average loss (e.g. mean squared error/MSE or mean Huber loss). Usually denoted by $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} \rho(y_i - \hat{y}_i)$.
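
As a small worked example with hypothetical numbers, the MSE is just the average of the per-point square losses:

```python
import numpy as np

# hypothetical targets and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# residuals are 0.5, 0.0, 1.0, so the square losses are 0.25, 0.0, 1.0
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 1.25 / 3 ≈ 0.4167
```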

Classification

Let $f(x)$ be the output of the fitted function.

Metric for Evaluation (0-1-Loss)

$L_{0\text{-}1}(y, f(x)) = \mathbb{1}[y \, f(x) \le 0]$, i.e. a prediction counts as wrong whenever $\operatorname{sign}(f(x)) \ne y$ (labels $y \in \{-1, +1\}$).
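
In code, the 0-1 loss simply counts sign disagreements; this sketch assumes labels $y \in \{-1, +1\}$ and the margin convention $z = y \cdot f(x)$:

```python
import numpy as np

def zero_one_loss(y, fx):
    # 1 whenever sign(f(x)) disagrees with the label y, i.e. margin y*f(x) <= 0
    return (y * fx <= 0).astype(float)

y = np.array([1, -1, 1])
fx = np.array([0.3, 0.5, -2.0])
print(zero_one_loss(y, fx))  # [0. 1. 1.]
```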

Surrogate Loss

The 0-1 loss is not differentiable. Therefore: surrogate loss for training.

  • Linear loss: $L(z) = -z$, with margin $z = y \, f(x)$
    • Drawback: does not work if data is unbalanced.
  • Exponential loss: $L(z) = e^{-z}$
    • Drawback: too sensitive to outliers (function values explode)
  • Logistic loss: $L(z) = \log(1 + e^{-z})$
    • Asymptotes: $-z$ for $z \to -\infty$ and $0$ for $z \to +\infty$
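
The three surrogate losses, written as functions of the margin $z = y \, f(x)$ under the same $\pm 1$ label convention, can be sketched as:

```python
import numpy as np

def linear_loss(z):
    # decreases without bound: every extra unit of margin is rewarded equally
    return -z

def exponential_loss(z):
    # explodes for badly misclassified points (large negative margin)
    return np.exp(-z)

def logistic_loss(z):
    # log(1 + e^{-z}); approaches -z for z -> -inf and 0 for z -> +inf
    return np.log1p(np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
# the logistic loss stays strictly above both of its asymptotes:
assert np.all(logistic_loss(z) > np.maximum(-z, 0.0))
```

`np.log1p(np.exp(-z))` is used instead of `np.log(1 + np.exp(-z))` for better numerical accuracy when $e^{-z}$ is small.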

For unbalanced data, the exp loss (left) works better than the linear loss (right) because it does not equally reward correctly predicted datapoints.

The exponential loss (red) is too sensitive to the teal outlier. The logistic loss (blue) is better.