9. Clustering
Hierarchical Clustering
- Build a tree representing, always merging branches with close distance
Partitioning Approaches
- Build a graph between nodes
- Cut longest edges
Model-Based Approaches
K-means
- Represent each cluster by a single point (center)
- Assign points to closest center
- Induces Voronoi partition
- For two: equivalent to max-margin between two centers
- Assumes poins are in Euclidean space
- Represent clusters as centers
- Each point is assigned to closest center
Goal: Pick centers to minimize the sum of squared distances (NP hard ☠️)
Problem:
- Determining the # clusters difficult
- Trivial way to just put center at ever data point.
- Cannot model clusters of arbitrary shape well
- Euclidean distance favors spheres
K-Means Algorithm (Lloyd’s Heuristic)
In practice: converges very quickly If trying hard: converges in exponential time
- Initialize cluster centers
- While not converged
- Assign each point to closest center
something with z and two steps
converges to local optimum
Initialization difficult:
- Too far away: outliers
- Gausian probability: prefers large clusters
Adaptive Seeding with K-Means++:
- Start with random data point as center
- Add centers 2 to k randomly, proportionally to squared distance to closest selected center. Given , pick where
Model Selections
- Heuristic quality measure
- “Diminishing returns” on the loss functions by # clusters (provenly concave)
- Pick s.t. increasing leads to negligible decrease in loss
- Regularization (favor “simple” models with few parameters by penalizing complex models)
- Information theoretic basis ()
Validation usually sets don’t work: Higher density centers centers closer, even to validation data