1. Loss Functions

Def: A loss function characterizes how “bad” a prediction is.

Regression

Of the form $L(y, \hat{y}) = \rho(r)$, where $r = y - \hat{y}$ is the residual.

  • Square/L2 loss: $\rho(r) = \frac{1}{2}r^2$
    • $\frac{1}{2}$ is a constant factor and therefore scales all losses equally. It is typically included b/c it cancels when differentiating: $\rho'(r) = r$.
    • Very sensitive to outliers since the residual is squared
  • Absolute/L1 loss: $\rho(r) = |r|$
    • Not differentiable at zero.
  • Huber loss: $\rho_\delta(r) = \frac{1}{2}r^2$ for $|r| \le \delta$, and $\delta|r| - \frac{1}{2}\delta^2$ otherwise
    • Uses the square loss for $|r| \le \delta$ and a linear loss for the rest
    • For the graph to be smooth (continuously differentiable):
      • The first derivative must match at the transition: the linear parts get slope $\delta$, since the square part has $\rho'(r) = r$ there
      • The y-coordinate must match: $\frac{1}{2}\delta^2$ for the square part, but the linear part alone gives $\delta \cdot \delta = \delta^2$, so we have to subtract $\frac{1}{2}\delta^2$
    • Advantage: Less sensitive to outliers while still differentiable
  • Asymmetric loss (diff than graphic): e.g. $\rho_\tau(r) = -\tau r$ for $r < 0$ and $(1-\tau)\,r$ for $r \ge 0$, with $\tau \in (0, 1)$
    • Like absolute loss, but with two different slopes on both sides of the y-axis.
    • Higher $\tau$ corresponds to a steeper graph when over-shooting, lower $\tau$ to a steeper graph when under-shooting.
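
The four regression losses above can be sketched in NumPy; this is a minimal sketch assuming the residual convention $r = y - \hat{y}$ and hypothetical default values for the hyperparameters `delta` and `tau`:

```python
import numpy as np

def square_loss(r):
    # the factor 1/2 cancels when differentiating: d/dr (r^2 / 2) = r
    return 0.5 * r**2

def absolute_loss(r):
    # not differentiable at r = 0
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # square loss near zero, linear beyond |r| = delta;
    # subtracting delta^2 / 2 makes the two pieces meet continuously
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)

def asymmetric_loss(r, tau=0.7):
    # two different slopes on the two sides of zero; with r = y - y_hat,
    # over-shooting (r < 0) gets slope tau (assumed convention)
    return np.where(r < 0, -tau * r, (1 - tau) * r)

# Huber matches the square loss exactly at the transition point:
print(huber_loss(np.array([1.0]), delta=1.0))  # [0.5]
```

The continuity of the Huber loss can be checked numerically by evaluating both pieces at $|r| = \delta$; both give $\frac{1}{2}\delta^2$.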

For multiple datapoints: Calculate the average loss (e.g. mean squared error/MSE or mean Huber loss). Usually denoted by $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} \rho(y_i - \hat{y}_i)$.
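
As a small worked example with hypothetical numbers, the MSE is just the average of the per-point square losses:

```python
import numpy as np

# hypothetical targets and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# residuals are 0.5, 0.0, 1.0, so the square losses are 0.25, 0.0, 1.0
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 1.25 / 3 ≈ 0.4167
```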

Classification

Let $f(x)$ be the output of the fitted function.

Metric for Evaluation (0-1-Loss)

$L_{0\text{-}1}(y, f(x)) = \mathbb{1}[y \, f(x) \le 0]$, i.e. a prediction counts as wrong whenever $\operatorname{sign}(f(x)) \ne y$ (labels $y \in \{-1, +1\}$).
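
In code, the 0-1 loss simply counts sign disagreements; this sketch assumes labels $y \in \{-1, +1\}$ and the margin convention $z = y \cdot f(x)$:

```python
import numpy as np

def zero_one_loss(y, fx):
    # 1 whenever sign(f(x)) disagrees with the label y, i.e. margin y*f(x) <= 0
    return (y * fx <= 0).astype(float)

y = np.array([1, -1, 1])
fx = np.array([0.3, 0.5, -2.0])
print(zero_one_loss(y, fx))  # [0. 1. 1.]
```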

Surrogate Loss

The 0-1 loss is not differentiable. Therefore: surrogate loss for training.

  • Linear loss: $L(z) = -z$, with margin $z = y \, f(x)$
    • Drawback: does not work if data is unbalanced.
  • Exponential loss: $L(z) = e^{-z}$
    • Drawback: too sensitive to outliers (function values explode)
  • Logistic loss: $L(z) = \log(1 + e^{-z})$
    • Asymptotes: $-z$ for $z \to -\infty$ and $0$ for $z \to +\infty$
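
The three surrogate losses, written as functions of the margin $z = y \, f(x)$ under the same $\pm 1$ label convention, can be sketched as:

```python
import numpy as np

def linear_loss(z):
    # decreases without bound: every extra unit of margin is rewarded equally
    return -z

def exponential_loss(z):
    # explodes for badly misclassified points (large negative margin)
    return np.exp(-z)

def logistic_loss(z):
    # log(1 + e^{-z}); approaches -z for z -> -inf and 0 for z -> +inf
    return np.log1p(np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
# the logistic loss stays strictly above both of its asymptotes:
assert np.all(logistic_loss(z) > np.maximum(-z, 0.0))
```

`np.log1p(np.exp(-z))` is used instead of `np.log(1 + np.exp(-z))` for better numerical accuracy when $e^{-z}$ is small.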

For unbalanced data, the exp loss (left) works better than the linear loss (right) because it does not equally reward correctly predicted datapoints.

The exponential loss (red) is too sensitive to the teal outlier. The logistic loss (blue) is better.