9. Clustering

Hierarchical Clustering

  • Build a tree representing, always merging branches with close distance

Partitioning Approaches

  • Build a graph between nodes
  • Cut longest edges

Model-Based Approaches

K-means

  1. Represent each cluster by a single point (center)
  2. Assign points to closest center
    • Induces Voronoi partition
    • For two: equivalent to max-margin between two centers
  • Assumes poins are in Euclidean space
  • Represent clusters as centers
  • Each point is assigned to closest center

Goal: Pick centers to minimize the sum of squared distances (NP hard ☠️)

Problem:

  • Determining the # clusters difficult
    • Trivial way to just put center at ever data point.
  • Cannot model clusters of arbitrary shape well
    • Euclidean distance favors spheres

K-Means Algorithm (Lloyd’s Heuristic)

In practice: converges very quickly If trying hard: converges in exponential time

  1. Initialize cluster centers
  2. While not converged
    1. Assign each point to closest center

something with z and two steps

converges to local optimum

Initialization difficult:

  • Too far away: outliers
  • Gausian probability: prefers large clusters

Adaptive Seeding with K-Means++:

  1. Start with random data point as center
  2. Add centers 2 to k randomly, proportionally to squared distance to closest selected center. Given , pick where

Model Selections

  1. Heuristic quality measure
    • “Diminishing returns” on the loss functions by # clusters (provenly concave)
    • Pick s.t. increasing leads to negligible decrease in loss
  2. Regularization (favor “simple” models with few parameters by penalizing complex models)
  3. Information theoretic basis ()

Validation usually sets don’t work: Higher density centers centers closer, even to validation data