9. Clustering

Hierarchical Clustering

Represent each cluster by a single point (center)
Assign each point to its closest center $μ_{j} \in \mathbb{R}^{d}$
- Induces a Voronoi partition
- For two centers: equivalent to a max-margin boundary between the two centers

Goal: pick centers to minimize the sum of squared distances from each point to its closest center (NP-hard ☠️)

minimize $\hat{R} (μ) := \hat{R} (μ_{1}, \dots, μ_{k}) = \sum_{i = 1}^{n} \min_{j \in \{1, \dots, k\}} ∥ x_{i} - μ_{j} ∥_{2}^{2}$
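A minimal sketch of evaluating the objective above, assuming points and centers are given as tuples of floats. The function name `kmeans_loss` and the toy data are made up for illustration.

```python
def kmeans_loss(points, centers):
    """Sum over all points of the squared Euclidean distance to the closest center."""
    total = 0.0
    for x in points:
        # Inner min over centers, matching the min over j in the objective.
        total += min(sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in centers)
    return total


# Toy data (illustrative only): two pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# One center per pair: every point is at squared distance 0.25 from its center.
print(kmeans_loss(points, [(0.0, 0.5), (4.0, 0.5)]))  # 1.0

# A center on every data point drives the loss to zero.
print(kmeans_loss(points, points))  # 0.0
```

The second call shows why the loss alone cannot be used to choose the number of clusters: it keeps shrinking as centers are added.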

Problems:

Determining the number of clusters is difficult
- Trivial solution: put a center at every data point (zero loss).
Cannot model clusters of arbitrary shape well
- Euclidean distance favors spherical clusters

In practice: converges very quickly
Worst case (adversarially constructed inputs): can take exponential time

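Alternate two steps until the assignments stop changing:

Assign each point to its closest center: $z_{i} \leftarrow \arg\min_{j \in \{1, \dots, k\}} ∥ x_{i} - μ_{j}^{(t - 1)} ∥_{2}^{2}$
Move each center to the mean of its assigned points: $μ_{j}^{(t)} \leftarrow \frac{1}{n_{j}} \sum_{i : z_{i} = j} x_{i}$, where $n_{j}$ is the number of points with $z_{i} = j$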

$⟶$ converges to local optimum
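The two-step iteration (commonly known as Lloyd's algorithm) could be sketched as follows. This is an illustrative pure-Python version, not a tuned implementation: the names `sq_dist`, `assign`, and `lloyd` are made up here, and keeping the old center when a cluster becomes empty is just one of several possible conventions.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(points, centers):
    """Assignment step: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in points]


def lloyd(points, init_centers, max_iter=100):
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        z = assign(points, centers)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, zi in zip(points, z) if zi == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its old center
        if new_centers == centers:  # nothing moved, so assignments are stable
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Good initialization: one center near each pair of points.
print(lloyd(points, [(0.0, 0.0), (4.0, 0.0)]))
# ([(0.0, 0.5), (4.0, 0.5)], [0, 0, 1, 1])

# Bad initialization: both centers in the left pair. The iteration still
# converges, but to a worse local optimum (loss 16 instead of 1).
print(lloyd(points, [(0.0, 0.0), (0.0, 1.0)]))
# ([(2.0, 0.0), (2.0, 1.0)], [0, 1, 0, 1])
```

The second run illustrates why the result depends so heavily on the initial centers.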

Initialization difficult:

Adaptive Seeding with K-Means++:

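Start with a uniformly random data point as the first center: $μ_{1}^{(0)} := x_{i}, i \sim \mathrm{Uniform} (\{1, \dots, n\})$
Add centers 2 to $k$ one at a time, each sampled with probability proportional to the squared distance to the closest already-selected center. Given $μ_{1 : j}^{(0)}$, pick $μ_{j + 1}^{(0)} := x_{i}$ where $\mathrm{Prob} (i) \propto \min_{l \in \{1, \dots, j\}} ∥ x_{i} - μ_{l}^{(0)} ∥_{2}^{2}$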
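The seeding rule above might look like this in code. This is a sketch using only the standard library; the name `kmeanspp_seed` and the optional `rng` argument are assumptions made for illustration.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeanspp_seed(points, k, rng=None):
    rng = rng or random.Random()
    # First center: a data point chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its closest selected center.
        # Already-selected points get weight 0 and are not drawn again.
        weights = [min(sq_dist(x, c) for c in centers) for x in points]
        # Note: random.choices raises ValueError if all weights are zero,
        # which happens when k exceeds the number of distinct points.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
seeds = kmeanspp_seed(points, 2, random.Random(0))
print(seeds)  # two distinct data points; far-apart pairs are the most likely
```

The returned seeds would then serve as the initial centers $μ_{1 : k}^{(0)}$ for the iterative procedure.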

Heuristic quality measure
- “Diminishing returns” of the loss function in the number of clusters (provably concave)
- Pick $k$ s.t. increasing $k$ leads to negligible decrease in loss
Regularization (favor “simple” models with few parameters by penalizing complex models)
Information theoretic basis ()
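As a rough illustration of the first two approaches, the sketch below selects $k$ from a precomputed list of losses. Everything here is assumed for the example: the function names, the threshold `tol`, the linear penalty `lam * k` as the form of regularization, and the loss values themselves, which are invented numbers rather than measurements.

```python
def pick_k_elbow(losses, tol=0.1):
    """losses[i] is the loss with k = i + 1 clusters.

    Returns the smallest k for which adding one more cluster reduces the
    loss by at most tol times the k = 1 loss.
    """
    for i in range(len(losses) - 1):
        if losses[i] - losses[i + 1] <= tol * losses[0]:
            return i + 1
    return len(losses)


def pick_k_regularized(losses, lam):
    """Returns the k minimizing loss(k) + lam * k (assumed penalty form)."""
    return min(range(1, len(losses) + 1), key=lambda k: losses[k - 1] + lam * k)


# Invented loss curve for k = 1..5 with a clear bend at k = 3.
losses = [100.0, 40.0, 12.0, 10.0, 9.0]
print(pick_k_elbow(losses))             # 3
print(pick_k_regularized(losses, 5.0))  # 3
```

Both rules depend on a hand-chosen constant (`tol` or `lam`), so they shift the difficulty of choosing $k$ to choosing that constant.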

Validation sets usually don’t work: more centers $⟹$ every point, including validation data, is closer to some center, so the validation loss also keeps decreasing as $k$ grows