4. Data

  • Samples are picked from some unknown distribution plus some noise: $y_i = g(x_i) + \epsilon_i$,

where $g$ is the unknown target function and $\epsilon_i$ is noise with expectation $\mathbb{E}[\epsilon_i] = 0$.
Bias and variance
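
A standard way to make these two terms precise (stated here for the sampling model above, with $\mathrm{Var}(\epsilon) = \sigma^2$): the expected squared error of a learned predictor $\hat{f}$ at a point $x$ decomposes as

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(g(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Simple models tend toward high bias and low variance; complex models toward low bias and high variance.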

K-fold Cross Validation

  • Split the training data into $k$ subsets (folds)
  • Train $k$ models: hold back one fold as the validation set, train on the others
  • Average the validation errors
  • Pick the hyperparameters with the lowest average validation error
  • Retrain on the entire training set
  • Evaluate on the held-out test data for an unbiased error estimate

Best: $k = n$ (leave-one-out), b/c each training set would be very similar to the full training data. Typical in practice: 5 or 10.
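The steps above can be sketched as follows. This is a minimal illustration, assuming a toy 1-D regression problem where the hyperparameter being tuned is the polynomial degree; the data-generating function `sin(3x)` and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy training data (assumed setup): noisy samples of an unknown function.
n = 100
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.normal(size=n)

def kfold_cv_error(x, y, degree, k=5):
    """Average validation MSE over k folds for a polynomial fit of given degree."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                          # held-back fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)          # train on the rest
        pred = np.polyval(coefs, x[val])
        errors.append(np.mean((pred - y[val]) ** 2))            # validation error
    return np.mean(errors)                                      # average over folds

# Pick the hyperparameter (degree) with the lowest average validation error.
cv_errors = {d: kfold_cv_error(x, y, d) for d in range(1, 10)}
best_degree = min(cv_errors, key=cv_errors.get)

# Finally, retrain on the entire training set with the chosen hyperparameter.
final_coefs = np.polyfit(x, y, best_degree)
print("best degree:", best_degree)
```

The test-set evaluation from the last bullet is omitted here; it would use data never touched during the cross-validation loop.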

Limit Model Complexity

Limit the search space of possible functions by:

  • Restrict the polynomial degree
  • LASSO (L1 regularization): limit the L1 norm, $\|w\|_1 \le t$, for some (unknown) $t$
  • Ridge (L2 regularization): limit the L2 norm, $\|w\|_2 \le t$, for some (unknown) $t$

Key distinction: LASSO → sparse (feature selection), Ridge → small but nonzero weights.
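
A small experiment makes the distinction concrete. This is a sketch under assumed settings: a made-up dataset where only 2 of 6 features matter, ridge solved in closed form, and lasso solved by coordinate descent with soft-thresholding (one standard way to optimize the L1-penalized objective).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): only the first 2 of 6 features matter.
n, d = 200, 6
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(n, d))
y = X @ true_w + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge: w = (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward 0 by t; exactly 0 inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, iters=200):
    """Lasso via coordinate descent on (1/2)||y - Xw||^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ w + X[:, j] * w[j]      # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return w

w_ridge = ridge(X, y, lam=10.0)
w_lasso = lasso(X, y, lam=50.0)
print("ridge:", np.round(w_ridge, 2))   # all weights shrunk, but nonzero
print("lasso:", np.round(w_lasso, 2))   # irrelevant weights driven to exactly 0
```

Lasso sets the four irrelevant weights to exactly zero (feature selection), while ridge only shrinks them toward zero.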