7. NN
z: pre-activation value; v: post-activation value
ReLU
- weights going into a neuron decide how its inputs are weighted
- bias decides where the linear part starts (the position of the kink)
- weight going out of a neuron decides whether the linear part points up or down and how steep it is
Combining several such units yields a piecewise linear function.
Any continuous function on a hypercube like $[0,1]^d$ can be uniformly approximated this way (the maximum difference gets arbitrarily small with an increasing number of hidden units)
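The ReLU construction above can be sketched in a few lines. This is only an illustration under stated assumptions, not the lecture's own construction: the input is 1-D, every incoming weight is fixed to 1, the target `math.sin` on $[0, \pi]$ is an arbitrary choice, and the helper names (`fit_relu_interpolant`, `net`, `max_error`) are made up for this example.

```python
import math

def relu(z):
    return max(0.0, z)

def fit_relu_interpolant(f, a, b, n_units):
    """One hidden ReLU layer that interpolates f at n_units + 1 equally spaced knots."""
    knots = [a + (b - a) * k / n_units for k in range(n_units + 1)]
    values = [f(t) for t in knots]
    slopes = [(values[k + 1] - values[k]) / (knots[k + 1] - knots[k])
              for k in range(n_units)]
    # Hidden unit k computes relu(1 * x + bias_k); the bias places the kink at knots[k].
    biases = [-knots[k] for k in range(n_units)]
    # Outgoing weight k is the change of slope at knot k:
    # its sign says up or down, its size says how steep.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_units)]
    out_bias = values[0]
    return biases, out_weights, out_bias

def net(x, biases, out_weights, out_bias):
    return out_bias + sum(w * relu(x + b) for w, b in zip(out_weights, biases))

def max_error(f, a, b, n_units, n_test=1000):
    """Largest absolute error on a fine test grid (a proxy for the uniform error)."""
    params = fit_relu_interpolant(f, a, b, n_units)
    xs = [a + (b - a) * i / n_test for i in range(n_test + 1)]
    return max(abs(f(x) - net(x, *params)) for x in xs)

errors = {n: max_error(math.sin, 0.0, math.pi, n) for n in (4, 16, 64)}
for n, err in errors.items():
    print(n, err)
```

With more hidden units the maximum error on the test grid shrinks, which is the uniform-approximation statement in one dimension.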
Increasing depth: Each segment of next layer calls previous layer as subroutine
great graphic in slides/script
Backpropagation
Use the chain rule to move from back to front (output layer to input layer), updating weights and biases in the direction of the negative gradient
Via chain rule: the gradient at each layer reuses the error signal already computed for the layer after it.
Sigmoid derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
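A small sketch of the backward pass may help here. It assumes one hidden sigmoid layer, a linear output unit, and the squared loss $\frac{1}{2}(\hat{y} - y)^2$ on a single example; none of these choices come from the notes, and all names are hypothetical. The finite-difference value at the end is only a sanity check of one gradient entry.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Forward pass: pre-activations z, post-activations v, and the output y_hat."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    v = [sigmoid(zj) for zj in z]
    y_hat = sum(w * vj for w, vj in zip(w2, v)) + b2
    return z, v, y_hat

def loss(x, y, W1, b1, w2, b2):
    return 0.5 * (forward(x, W1, b1, w2, b2)[2] - y) ** 2

def backward(x, y, W1, b1, w2, b2):
    """Backward pass: start at the loss and reuse each error signal one layer earlier."""
    z, v, y_hat = forward(x, W1, b1, w2, b2)
    delta_out = y_hat - y                          # dL/dy_hat
    grad_w2 = [delta_out * vj for vj in v]
    grad_b2 = delta_out
    # Error signal at the hidden pre-activations, using sigma'(z) = v * (1 - v).
    delta_hidden = [delta_out * w * vj * (1.0 - vj) for w, vj in zip(w2, v)]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hidden]
    grad_b1 = list(delta_hidden)
    return grad_W1, grad_b1, grad_w2, grad_b2

rng = random.Random(0)
x, y = [0.5, -1.2], 0.7
W1 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
b1 = [rng.gauss(0, 1) for _ in range(3)]
w2 = [rng.gauss(0, 1) for _ in range(3)]
b2 = 0.1

grad_W1, grad_b1, grad_w2, grad_b2 = backward(x, y, W1, b1, w2, b2)

# Finite-difference check of one hidden-layer weight.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[1][0] += eps
W1_minus = [row[:] for row in W1]
W1_minus[1][0] -= eps
numeric = (loss(x, y, W1_plus, b1, w2, b2) - loss(x, y, W1_minus, b1, w2, b2)) / (2 * eps)

# One gradient-descent step on the output weights: move against the gradient.
lr = 0.1
w2_new = [w - lr * g for w, g in zip(w2, grad_w2)]
```

The hidden-layer gradient is built from `delta_out`, so the output layer has to be processed first; that is the back-to-front order.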
Weight-Space Symmetries
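- You can permute the hidden units of a layer (reorder their incoming weights and biases together with the matching outgoing weights). Leads to the same result.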
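The symmetry can be checked numerically. The sketch below assumes a single hidden ReLU layer with a linear output; the sizes, the permutation `perm`, and all names are arbitrary illustration choices.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def forward(x, W1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

rng = random.Random(1)
n_in, n_hidden = 3, 4
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
b2 = 0.3
x = [0.2, -1.0, 0.7]

# Reorder the hidden units: rows of W1, entries of b1, and entries of w2 move together.
perm = [2, 0, 3, 1]
W1_p = [W1[j] for j in perm]
b1_p = [b1[j] for j in perm]
w2_p = [w2[j] for j in perm]

out = forward(x, W1, b1, w2, b2)
out_p = forward(x, W1_p, b1_p, w2_p, b2)

# Every ordering of the hidden units gives the same function.
n_equivalent = math.factorial(n_hidden)
```

The two outputs agree up to floating-point rounding, so a hidden layer with `n_hidden` units has `n_hidden!` weight settings that compute the same function.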
Initialization
- Initializing all weights to 0 is bad b/c the gradient is 0 everywhere
- Using certain activation functions like ReLU can help avoid
- Scale error signal by
Expected values/Variance proof
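To see why the scale of the initial weights matters, the sketch below pushes one random input through a stack of ReLU layers and reports the mean squared activation at the end. Everything specific here is an assumption for illustration and not taken from these notes: biases are zero, weights are Gaussian, and the "he" entry uses the common He scaling with standard deviation $\sqrt{2 / n_{\text{in}}}$.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def mean_square_after(depth, width, std, rng):
    """Push one random input through `depth` ReLU layers with N(0, std^2) weights."""
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        v = [relu(sum(w * vi for w, vi in zip(row, v))) for row in W]
    return sum(vi * vi for vi in v) / width

rng = random.Random(0)
width, depth = 100, 5
scales = {
    "zero": 0.0,                     # all weights 0: every activation is 0
    "small": 0.01,                   # signal shrinks layer by layer
    "he": math.sqrt(2.0 / width),    # assumed He scaling: signal size roughly preserved
    "large": 1.0,                    # signal blows up layer by layer
}
result = {name: mean_square_after(depth, width, std, rng) for name, std in scales.items()}
for name, value in result.items():
    print(name, value)
```

With zero weights every hidden activation is exactly 0, so nothing propagates. The small and large scales shrink or inflate the signal by a constant factor per layer, while the assumed $\sqrt{2 / n_{\text{in}}}$ scaling keeps it at a comparable size.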
Learning Rate
- Often use a decaying learning rate schedule, e.g. piecewise constant
- Monitor ratio of weight change to weight magnitude
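Both points can be sketched briefly. The boundaries and rates below are made-up example values, and the function names `piecewise_constant_lr` and `update_ratio` are hypothetical helpers rather than any library's API.

```python
import math

def piecewise_constant_lr(step, boundaries, values):
    """Return values[k], where k is the number of boundaries already reached."""
    k = sum(step >= b for b in boundaries)
    return values[k]

def update_ratio(weights, updates):
    """Norm of the weight change divided by the norm of the weights."""
    change = math.sqrt(sum(u * u for u in updates))
    magnitude = math.sqrt(sum(w * w for w in weights))
    return change / magnitude

# Example schedule: 0.1 for steps 0..999, 0.01 for steps 1000..1999, 0.001 afterwards.
boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]
lrs = [piecewise_constant_lr(s, boundaries, values) for s in (0, 999, 1000, 2500)]

# Example monitoring value for one update.
ratio = update_ratio([3.0, 4.0], [0.003, 0.004])
```

A ratio that is very large suggests the steps are too aggressive; a ratio that is tiny suggests the weights barely move.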
Regularization
- Regularization (weight decay):
  - LASSO or Ridge with vectorized weights (write all weights into one huge vector)
  - Multiply the weights by a constant slightly lower than 1 in each iteration (weight decay)
- Early stopping: stop once the validation error stops decreasing, e.g. by saving checkpoints and restoring the one with the lowest validation error
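The link between the Ridge penalty and weight decay, and a simple early-stopping rule, can be sketched as follows. The sketch assumes plain gradient descent (the equivalence does not carry over unchanged to adaptive optimizers), and the `patience` parameter is an assumption for illustration, not something the notes specify.

```python
def sgd_step_with_l2(w, grad, lr, lam):
    """Gradient step on loss + (lam / 2) * ||w||^2."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_with_decay(w, grad, lr, lam):
    """Shrink the weights by a factor slightly below 1, then take the plain step."""
    decay = 1.0 - lr * lam
    return [decay * wi - lr * gi for wi, gi in zip(w, grad)]

def early_stopping(val_errors, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation errors."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            # A checkpoint would be saved here.
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

w = [1.0, -2.0, 0.5]
grad = [0.1, 0.3, -0.2]
step_l2 = sgd_step_with_l2(w, grad, lr=0.1, lam=0.01)
step_decay = sgd_step_with_decay(w, grad, lr=0.1, lam=0.01)

best_epoch, stop_epoch = early_stopping([0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5], patience=3)
```

Here training stops at epoch 5 and the checkpoint from epoch 2 is the one to restore. The later value 0.5 is never seen, which shows the trade-off in choosing `patience`.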
Dropout
- Some units only activate for very specific training examples
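- During training: don’t use all units for each mini-batch. Keep each unit active only with a certain probability $p$
- During inference: use all units, but multiply the weights by $p$
Why it’s allowed: in expectation over the random masks, a unit receives the same input during training as it does at inference with the weights scaled by $p$.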
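The expectation argument can be checked with a small simulation. The sketch below looks at the input one downstream unit receives from four upstream units; the activations, weights, and keep probability `p_keep` are arbitrary example values and the function names are hypothetical.

```python
import random

def dropout_train(activations, p_keep, rng):
    """Keep each unit independently with probability p_keep, zero it otherwise."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def next_input_train(activations, weights, p_keep, rng):
    """Input to a downstream unit for one random dropout mask."""
    kept = dropout_train(activations, p_keep, rng)
    return sum(w * a for w, a in zip(weights, kept))

def next_input_inference(activations, weights, p_keep):
    """Use all units, with the outgoing weights multiplied by p_keep."""
    return sum(p_keep * w * a for w, a in zip(weights, activations))

rng = random.Random(0)
activations = [0.5, 1.5, 2.0, 0.1]
weights = [1.0, -0.5, 0.25, 2.0]
p_keep = 0.8

n_samples = 20000
train_mean = sum(
    next_input_train(activations, weights, p_keep, rng) for _ in range(n_samples)
) / n_samples
inference_value = next_input_inference(activations, weights, p_keep)
```

Averaged over many masks, the training-time input approaches the inference-time value. Many libraries instead divide by the keep probability during training (inverted dropout) and leave the weights untouched at inference; the expectation matches in the same way.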
Batch Normalization
- Normalize the inputs of a layer according to minibatch statistics, i.e. s.t. the outputs across the mini-batch have zero mean and unit variance
During training:
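- Input: activation values of one unit across the minibatch, e.g. $x_1, \dots, x_m$ in minibatch $B$
- Learnable parameters: $\gamma, \beta$
- Batch normalization: map each $x_i$ to $y_i$ by
  - Mean: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$
  - Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$
  - Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ where $\epsilon$ is a very small value to avoid div. by zero
  - Scale and shift: $y_i = \gamma \hat{x}_i + \beta$
- Output: $y_1, \dots, y_m$ in the minibatch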
During inference:
- Use running averages of the minibatch mean and variance collected during training, in place of the minibatch statistics
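The training and inference behaviour can be put side by side in a short sketch for a single unit. The class name `BatchNormUnit`, the exponential moving average with `momentum=0.9`, and the default `eps` are assumptions for illustration; the notes only say that running averages are used.

```python
import math

class BatchNormUnit:
    """Batch normalization for the activations of one unit (illustrative sketch)."""

    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5, momentum=0.9):
        self.gamma = gamma            # learnable scale
        self.beta = beta              # learnable shift
        self.eps = eps                # avoids division by zero
        self.momentum = momentum      # assumed weight of the old running value
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, xs):
        """Normalize with the statistics of the current minibatch."""
        m = len(xs)
        mean = sum(xs) / m
        var = sum((x - mean) ** 2 for x in xs) / m
        # Update the running averages that inference will use later.
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        return [self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
                for x in xs]

    def forward_inference(self, x):
        """Normalize a single value with the stored running averages."""
        x_hat = (x - self.running_mean) / math.sqrt(self.running_var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNormUnit(gamma=2.0, beta=3.0)
batch = [1.0, 2.0, 3.0, 4.0]
for _ in range(200):
    ys = bn.forward_train(batch)

out_mean = sum(ys) / len(ys)
out_var = sum((y - out_mean) ** 2 for y in ys) / len(ys)
at_inference = bn.forward_inference(2.5)
```

After the scale and shift, the outputs across the minibatch have mean `beta` and standard deviation close to `gamma`. At inference a single input can be processed because no minibatch statistics are needed.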
Benefits:
- Larger step sizes
- Initialization less relevant
- Helps avoid overfitting (mild regularization effect)