7. NN

$z$: pre-activation value, $v$: post-activation value

ReLU

weights going into a neuron decide how its inputs are weighted
bias decides where the linear part starts (position of the kink)
weight going out of a neuron decides whether the linear part points up or down and how steep it is

$⟹$ combining several such units yields a piecewise linear function

Any continuous function on a hypercube like $[0, 1]^{d}$ can be uniformly approximated this way (the max. difference gets arbitrarily small with increasing number of units $n$)
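A minimal sketch of the approximation claim in one dimension, assuming $n$ counts the hidden ReLU units: each unit contributes one kink, the bias places it, and the outgoing weight sets the change in slope. The function names, the equally spaced knots, and the target $\sin(2πx)$ are illustrative choices, not taken from the lecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_relu_interpolant(f, n):
    """One-hidden-layer ReLU net (n units) interpolating f at n + 1 equally spaced knots in [0, 1]."""
    knots = np.linspace(0.0, 1.0, n + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope of each of the n segments
    out_weights = np.diff(slopes, prepend=0.0)  # outgoing weight = change of slope at the kink
    biases = -knots[:-1]                        # unit k switches on at x = knots[k]
    offset = values[0]                          # output bias

    def net(x):
        x = np.asarray(x, dtype=float)
        hidden = relu(x[..., None] + biases)    # incoming weights are all fixed to 1
        return offset + hidden @ out_weights

    return net

def target(x):
    return np.sin(2.0 * np.pi * x)  # arbitrary example function (assumption)

xs = np.linspace(0.0, 1.0, 2001)
for n in (4, 16, 64):
    max_err = np.max(np.abs(fit_relu_interpolant(target, n)(xs) - target(xs)))
    print(n, max_err)  # the maximum error shrinks as n grows
```

Here the weights are set by hand rather than learned; the point is only that such weights exist.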

Increasing depth: Each segment of next layer calls previous layer as subroutine

great graphic in slides/script
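The depth remark can be illustrated with a standard toy construction (not necessarily the one in the slides): a "tent" layer built from two ReLU units, composed with itself. Every additional layer reuses the previous function on each of its two segments, so the number of linear pieces doubles. All names below are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    """Two ReLU units: rises from 0 to 1 on [0, 0.5], falls back to 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    """Apply the same tent layer `depth` times in sequence."""
    for _ in range(depth):
        x = tent_layer(x)
    return x

def count_linear_pieces(y):
    """Count segments on a uniform grid by counting sign changes of the slope."""
    slope_sign = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(slope_sign[1:] != slope_sign[:-1]))

xs = np.linspace(0.0, 1.0, 2**12 + 1)  # grid contains every kink for depth <= 12
for depth in (1, 2, 3, 4):
    print(depth, count_linear_pieces(deep_tent(xs, depth)))
```

A single hidden layer would need roughly one unit per kink to produce the same sawtooth, whereas here each layer adds only two units.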

Backpropagation: use the chain rule to move from back to front, updating weights and biases in the direction of the negative gradient

Via chain rule: the gradient at each stage reuses the gradients already computed for the stages after it.

Sigmoid derivative: $σ' (z) = σ (z) (1 - σ (z))$
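A minimal backpropagation sketch for one hidden sigmoid layer with a linear output and squared loss. The architecture, the loss, the step size, and all names are assumptions chosen for illustration. Note how `delta1` is computed from `delta2`, which is the reuse of later-stage gradients that the chain rule gives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """Return the loss and the gradients w.r.t. all parameters for one example."""
    # Forward pass: z = pre-activation, v = post-activation
    z1 = W1 @ x + b1
    v1 = sigmoid(z1)
    z2 = W2 @ v1 + b2                      # linear output layer
    loss = 0.5 * np.sum((z2 - y) ** 2)

    # Backward pass: start at the output, move towards the input
    delta2 = z2 - y                        # dL/dz2
    grad_W2 = np.outer(delta2, v1)
    grad_b2 = delta2
    # dL/dz1 reuses delta2; sigmoid'(z1) = v1 * (1 - v1)
    delta1 = (W2.T @ delta2) * v1 * (1.0 - v1)
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1
    return loss, (grad_W1, grad_b1, grad_W2, grad_b2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

loss, (gW1, gb1, gW2, gb2) = forward_backward(x, y, W1, b1, W2, b2)
lr = 0.01  # hypothetical step size
# Gradient descent: step against the gradient
W1, b1, W2, b2 = W1 - lr * gW1, b1 - lr * gb1, W2 - lr * gW2, b2 - lr * gb2
```

Comparing one entry of a gradient against a finite difference of the loss is a cheap way to check such a derivation.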

Expected values/Variance proof

Regularization (weight decay):
- LASSO or Ridge with vectorized weights (write all weights into one huge vector)
- Multiply weights by a constant slightly below 1 in each iteration (weight decay)
Early stopping: stop once the validation error stops decreasing, e.g. by saving checkpoints and keeping the one with the lowest validation error
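The two bullet points on Ridge and weight decay describe the same update when plain gradient descent is used: a gradient step on the loss plus $\frac{λ}{2} \lVert w \rVert^{2}$ equals shrinking $w$ by the factor $(1 - ηλ)$ and then taking the usual step. A small sketch, where the step size `lr` ($η$) and the penalty strength `lam` ($λ$) are assumed names and values:

```python
import numpy as np

def sgd_step_l2(w, grad_loss, lr, lam):
    """Gradient step on loss + (lam / 2) * ||w||^2 (Ridge penalty on the weight vector)."""
    return w - lr * (grad_loss + lam * w)

def sgd_step_weight_decay(w, grad_loss, lr, lam):
    """Shrink the weights by a factor slightly below 1, then take the unpenalized step."""
    return (1.0 - lr * lam) * w - lr * grad_loss

w = np.array([0.5, -1.0, 2.0])
grad_loss = np.array([0.1, 0.3, -0.2])  # made-up gradient of the data loss
print(sgd_step_l2(w, grad_loss, lr=0.1, lam=0.01))
print(sgd_step_weight_decay(w, grad_loss, lr=0.1, lam=0.01))
```

The equivalence is specific to this plain update rule; with optimizers that rescale the gradient, the two variants generally differ.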

Dropout: some units only activate for very specific training examples
During training: don’t use all units for each mini-batch; keep each unit active only with a certain probability $p$
During inference: use all units, but multiply the weights by $p$

Why it’s allowed:

$$E [z] = E \left[ \sum_{i} w_{i} v_{i} S_{i} \right] = \sum_{i} w_{i} v_{i} \underbrace{E [S_{i}]}_{= p} = p \cdot (w^{T} v) = (pw)^{T} v$$

where $S_{i} \sim \mathrm{Ber} (p)$, $S_{i} = \begin{cases} 0 & \text{if “dropped out”} \\ 1 & \text{otherwise} \end{cases}$ ($w_{i}, v_{i}$ const. $\forall i$)
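The expectation argument can be checked numerically: averaging the pre-activation over many random dropout masks should approach the value obtained with all units active and weights scaled by $p$. The concrete numbers for `w`, `v`, `p` and the sample count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                      # probability that a unit stays active
w = np.array([0.5, -1.0, 2.0, 0.3, -0.7])    # weights into the next unit (made up)
v = np.array([1.0, 0.2, -0.5, 1.5, 0.8])     # post-activation values (made up)

# Training: one Bernoulli(p) mask entry S_i per unit and per sample
masks = rng.random(size=(200_000, w.size)) < p
z_train = (masks * (w * v)).sum(axis=1)      # z = sum_i w_i v_i S_i for each mask

# Inference: all units active, weights multiplied by p
z_inference = (p * w) @ v

print(z_train.mean(), z_inference)           # the two values should be close
```

This only matches the expectation of the pre-activation $z$; after a nonlinearity the scaled network is an approximation of the average over masks.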

Normalize inputs in layer according to mini-batch statistics, i.e. s.t. the activations of each unit across the mini-batch have zero mean and unit variance (before the learnable scale and shift)

During training:

Input: activation values of one unit across the mini-batch, i.e. $v_{i} \; \forall i \in S$ for mini-batch $S$
Learnable parameters: $γ, β$
Batch normalization: $\bar{v} = \mathrm{BN} (v; γ, β)$ by
1. Mean: $μ_{S} = \frac{1}{|S|} \sum_{i \in S} v_{i}$
2. Variance: $σ_{S}^{2} = \frac{1}{|S|} \sum_{i \in S} (v_{i} - μ_{S})^{2}$
3. Normalize: $\hat{v}_{i} = \frac{v_{i} - μ_{S}}{\sqrt{σ_{S}^{2} + ϵ}}$ where $ϵ$ is a very small value to avoid division by zero
4. Scale and shift: $\bar{v}_{i} = γ \hat{v}_{i} + β$
Output: $\bar{v}_{i} \; \forall i \in S$ for the mini-batch $S$
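The four training-time steps translate almost line by line into NumPy. This is a sketch of the forward pass only; the function name, the array layout (rows = examples in the mini-batch, columns = units) and the default `eps` are assumptions.

```python
import numpy as np

def batch_norm_train(v, gamma, beta, eps=1e-5):
    """Batch normalization forward pass over one mini-batch.

    v: array of shape (batch_size, n_units); statistics are taken per unit.
    """
    mu = v.mean(axis=0)                       # 1. mean over the mini-batch
    var = v.var(axis=0)                       # 2. variance (divides by |S|)
    v_hat = (v - mu) / np.sqrt(var + eps)     # 3. normalize
    return gamma * v_hat + beta               # 4. scale and shift

rng = np.random.default_rng(0)
v = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # made-up activations
out = batch_norm_train(v, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))           # roughly 0 and 1 per unit
```

With $γ = 1$ and $β = 0$ the output is just the normalized activation; other values let the network undo or rescale the normalization where that helps.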

During inference:

Benefits: