7. NN

$z$: pre-activation value; $v$: post-activation value

ReLU

  • weights going into a neuron decide how its inputs are weighted
  • bias decides where the linear part starts (the position of the kink)
  • weight going out of the neuron decides whether the linear part points up or down + how steep it is

Combining several such units yields a piecewise linear function
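A minimal sketch of this for a one-hidden-layer ReLU network with scalar input and output. The function names and all parameter values are made up for illustration and are not taken from the slides:

```python
def relu(z):
    return max(0.0, z)

def relu_net(x, w_in, b, w_out):
    """Computes sum_j w_out[j] * relu(w_in[j] * x + b[j])."""
    return sum(wo * relu(wi * x + bj) for wi, bj, wo in zip(w_in, b, w_out))

# Made-up parameters: three hidden units with kinks at x = 0, 1, 2.
w_in = [1.0, 1.0, 1.0]
b = [0.0, -1.0, -2.0]      # unit j switches on at x = -b[j] / w_in[j]
w_out = [1.0, -2.0, 2.0]   # sign: slope bends up or down; magnitude: by how much

# The resulting function has slope 0, 1, -1, 1 on the four segments.
for x in [-1.0, 0.5, 1.5, 3.0]:
    print(x, relu_net(x, w_in, b, w_out))
```

Each hidden unit contributes one kink, so adding units adds segments to the piecewise linear function.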

Any continuous function on a hypercube like $[0,1]^d$ can be uniformly approximated this way (the difference gets arbitrarily small with an increasing number of hidden units)

Increasing depth: Each segment of next layer calls previous layer as subroutine

great graphic in slides/script

Backpropagation

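Use the chain rule to move from back to front (output layer to input layer), computing the gradient of the loss w.r.t. each weight and bias, then update weights and biases in the direction of the negative gradient

Via chain rule: each stage reuses the gradients already computed for the stages after it.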

Sigmoid derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
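A minimal sketch of backpropagation on a tiny network (one sigmoid hidden unit, one linear output, squared loss), checked against a finite difference. The network shape, the loss, and all numeric values are assumptions for illustration. It uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so the derivative can be computed from the post-activation value alone:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    z1 = w1 * x + b1      # pre-activation of the hidden unit
    v1 = sigmoid(z1)      # post-activation of the hidden unit
    z2 = w2 * v1 + b2     # linear output unit
    return z1, v1, z2

def loss(x, y, w1, b1, w2, b2):
    return 0.5 * (forward(x, w1, b1, w2, b2)[2] - y) ** 2

def backward(x, y, w1, b1, w2, b2):
    """Gradients of the squared loss, computed from the output back to the input."""
    z1, v1, z2 = forward(x, w1, b1, w2, b2)
    d_z2 = z2 - y                     # dL/dz2
    d_w2, d_b2 = d_z2 * v1, d_z2
    d_v1 = d_z2 * w2                  # reuses d_z2 from the later stage
    d_z1 = d_v1 * v1 * (1.0 - v1)     # sigmoid'(z1) = v1 * (1 - v1)
    d_w1, d_b1 = d_z1 * x, d_z1
    return [d_w1, d_b1, d_w2, d_b2]

# Made-up example values
x, y = 0.5, 1.0
params = [0.3, -0.1, 0.8, 0.2]        # w1, b1, w2, b2
grads = backward(x, y, *params)

# Finite-difference check of dL/dw1
eps = 1e-6
bumped = [params[0] + eps] + params[1:]
numeric = (loss(x, y, *bumped) - loss(x, y, *params)) / eps
print(grads[0], numeric)

# One gradient-descent step: move against the gradient
lr = 0.1
new_params = [p - lr * g for p, g in zip(params, grads)]
print(loss(x, y, *params), loss(x, y, *new_params))
```

The two printed gradient values should agree closely, and the loss after the step should be lower than before.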

Weight-Space Symmetries

  • You can permute the units of a hidden layer (each together with its incoming and outgoing weights). The network computes the same function.
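A small sketch of this symmetry, assuming a network with one hidden tanh layer. The helper `mlp` and all weights are made up for illustration. Reordering the hidden units, each with its incoming weights, bias, and outgoing weight, leaves the output unchanged up to floating-point rounding:

```python
import math

def mlp(x, W1, b1, w2, b2):
    """One hidden tanh layer; W1 holds one row of incoming weights per hidden unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(wo * h for wo, h in zip(w2, hidden)) + b2

# Made-up parameters: 2 inputs, 3 hidden units, 1 output
W1 = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
b1 = [0.1, -0.2, 0.3]
w2 = [1.0, -2.0, 0.5]
b2 = 0.05

# Apply the same permutation to the rows of W1, to b1, and to w2
perm = [2, 0, 1]
W1_p = [W1[i] for i in perm]
b1_p = [b1[i] for i in perm]
w2_p = [w2[i] for i in perm]

x = [0.7, -1.2]
original = mlp(x, W1, b1, w2, b2)
permuted = mlp(x, W1_p, b1_p, w2_p, b2)
print(original, permuted)
```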

Initialization

  • 0 is bad b/c the gradient is 0 everywhere

  • Using certain activation functions like ReLU can help avoid vanishing gradients

  • Scale error signal by

Expected values/Variance proof

Learning Rate

  • Often use decaying learning rate scheduler, e.g. piecewise constant
  • Monitor ratio of weight change to weight magnitude
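Both points can be sketched in a few lines. The function names, the step boundaries, and the learning-rate values below are hypothetical choices for illustration:

```python
import math

def piecewise_constant_lr(step, boundaries, values):
    """Return values[i] on the i-th interval defined by the sorted step boundaries.

    values must have one more entry than boundaries.
    """
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]

def l2_norm(vec):
    return math.sqrt(sum(a * a for a in vec))

def update_ratio(weights, updates):
    """Ratio of weight change to weight magnitude, ||dw|| / ||w||."""
    return l2_norm(updates) / l2_norm(weights)

# Hypothetical schedule: 0.1 until step 1000, 0.01 until step 2000, then 0.001
boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]

weights = [1.0, -2.0, 2.0]
grads = [0.3, 0.0, -0.4]
for step in [0, 1500, 5000]:
    lr = piecewise_constant_lr(step, boundaries, values)
    updates = [-lr * g for g in grads]
    print(step, lr, update_ratio(weights, updates))
```

A ratio that is very large suggests the step size is too aggressive; a ratio near zero suggests the weights are barely moving.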

Regularization

  • Regularization (weight decay):
    • LASSO (L1) or Ridge (L2) penalty on the vectorized weights (write all weights into one huge vector)
    • Multiply the weights by a constant slightly lower than 1 during each iteration (weight decay)
  • Early stopping: stop once the validation error stops decreasing, e.g. by saving checkpoints and keeping the one with the lowest validation error
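A minimal sketch of both ideas, assuming plain SGD. The function names, the `patience` parameter, and the example numbers are hypothetical. For plain SGD, shrinking the weights by a constant factor each iteration matches a gradient step on the loss plus a Ridge penalty $\frac{\lambda}{2}\lVert w \rVert^2$:

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """Shrink each weight by (1 - lr * weight_decay), then take the usual gradient step."""
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * g for w, g in zip(weights, grads)]

def early_stopping_epoch(val_errors, patience):
    """Return the epoch with the lowest validation error seen before stopping.

    Training stops once `patience` epochs have passed without improvement.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                                 # stop; restore the best checkpoint
    return best_epoch

# Made-up example values
decayed = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)
print(decayed)

val_errors = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
best = early_stopping_epoch(val_errors, patience=2)
print(best)
```

In the example, training stops at epoch 4 and never reaches the later drop to 0.5, which is the usual trade-off when picking a small patience.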

Dropout

  • Some units only activate for very specific training examples
  • During training: don’t use all units for each mini-batch. Keep each unit active only with a certain probability $p$
  • During inference: use all units, but multiply the weights by $p$
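A small sketch of the two modes, assuming a keep probability `keep_prob` (a name and value chosen here for illustration). Scaling the activations at inference is equivalent to scaling each unit's outgoing weights. Averaged over many random masks, the training-time activations match the scaled inference activations in expectation:

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training: keep each unit independently with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_inference(activations, keep_prob):
    """Inference: keep every unit but scale by keep_prob."""
    return [keep_prob * a for a in activations]

rng = random.Random(0)
keep_prob = 0.8                      # assumed value for illustration
activations = [1.0, 2.0, 3.0, 4.0]

# Monte Carlo estimate of the expected training-time activations
n_samples = 20000
mean_train = [0.0] * len(activations)
for _ in range(n_samples):
    for i, a in enumerate(dropout_train(activations, keep_prob, rng)):
        mean_train[i] += a / n_samples

expected = dropout_inference(activations, keep_prob)
print(mean_train)
print(expected)
```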

Why it’s allowed:

Batch Normalization

  • Normalize the inputs to a layer according to minibatch statistics, i.e. s.t. the outputs of the mini-batch are normalized to zero mean and unit variance

During training:

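  1. Input: activation values of one unit across the minibatch, e.g. $x_1, \dots, x_m$ in minibatch $B$
  2. Learnable parameters: $\gamma, \beta$
  3. Batch normalization: $y_i = \mathrm{BN}_{\gamma,\beta}(x_i)$ by
    1. Mean: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$
    2. Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$
    3. Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\epsilon$ is a very small value to avoid div. by zero
    4. Scale and shift: $y_i = \gamma \hat{x}_i + \beta$
  4. Output: $y_i$ for every $x_i$ in the minibatch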

During inference:

  • Use running averages of the minibatch mean and variance collected during training instead of the statistics of the current batch
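A minimal sketch of batch normalization for a single unit, covering both the training and the inference path. The class name, the `eps` and `momentum` defaults, and the exponential form of the running average are assumptions for illustration; the notes do not specify them:

```python
import math

class BatchNormUnit:
    """Batch normalization for one unit (one scalar activation per example)."""

    def __init__(self, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = 1.0, 0.0              # learnable scale and shift
        self.eps, self.momentum = eps, momentum       # assumed defaults
        self.running_mean, self.running_var = 0.0, 1.0

    def forward_train(self, xs):
        m = len(xs)
        mean = sum(xs) / m
        var = sum((x - mean) ** 2 for x in xs) / m
        # Exponential running averages, used later at inference time
        self.running_mean += self.momentum * (mean - self.running_mean)
        self.running_var += self.momentum * (var - self.running_var)
        return [
            self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
            for x in xs
        ]

    def forward_inference(self, x):
        x_hat = (x - self.running_mean) / math.sqrt(self.running_var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNormUnit()
out = bn.forward_train([2.0, 4.0, 6.0, 8.0])   # made-up minibatch
print(out)
print(bn.running_mean, bn.running_var)
print(bn.forward_inference(5.0))
```

With `gamma = 1` and `beta = 0`, the training output has mean 0 and variance close to 1. Inference works on a single example because it no longer depends on batch statistics.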

Benefits:

  • Larger step sizes
  • Initialization less relevant
  • Avoid overfitting