Locally Adaptive Clustering (LAC)

Type
Clustering

Locally Adaptive Clustering is an extension of traditional centroid-based clustering (such as K-Means) that addresses the fact that a feature's relevance can vary from one region of the data to another. While standard clustering treats all features equally, LAC recognizes that some features are more relevant to specific clusters than others.

The Core Problem: The “Curse of Dimensionality”

In high-dimensional datasets, many features are often irrelevant or “noisy” for certain groups of points.

  • Global Weighting: Standard algorithms apply the same importance to a feature across the entire dataset.
  • Local Relevance: In reality, “Feature A” might be very important for identifying Cluster 1, but completely irrelevant for Cluster 2.

How LAC Works

LAC identifies clusters by associating a weight with each dimension, and these weights are local to each individual cluster.

  • Distance Metric: Instead of using standard Euclidean distance, LAC uses a weighted distance. The distance from a point $x$ to a cluster center $c_j$ is modified by that cluster's weight vector $w_j$ (a short code sketch follows this list):

$$d_{w}(x, c_j) = \sqrt{\sum_{i=1}^{D} w_{ji} \, (x_i - c_{ji})^2}$$

  • Weight Assignment: Dimensions with small variances (where data points are tightly packed) are assigned higher weights, as they are more “reliable” for defining that specific cluster. Dimensions with high variance (spread out/noisy) are assigned lower weights.
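
As a minimal sketch of this weighted distance in NumPy (the function name `weighted_distance` is illustrative, not from any particular LAC library):

```python
import numpy as np

def weighted_distance(x, center, weights):
    """Weighted Euclidean distance d_w(x, c_j) from the formula above.

    x, center, weights: 1-D arrays of length D, where `weights` is the
    weight vector w_j belonging to the cluster centered at `center`.
    """
    return np.sqrt(np.sum(weights * (x - center) ** 2))
```

Note that each cluster carries its own weight vector, so the same point can be near one center and far from another even when the unweighted Euclidean distances are equal.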

The Algorithm Steps

LAC typically follows an iterative process similar to K-Means or EM (a runnable sketch follows the steps):

  1. Initialization: Start with initial cluster centers and equal weights for all dimensions.
  2. Partitioning: Assign each data point to the cluster with the smallest weighted distance.
  3. Updating Centers: Recalculate the cluster centroids based on the points assigned to them.
  4. Updating Weights: Adjust the weights for each cluster. If the points in Cluster $j$ are very close together along Dimension $i$, increase $w_{ji}$; the weights within each cluster are kept under a normalization constraint so they cannot all shrink to zero.
  5. Iteration: Repeat until the cluster assignments and weights stabilize.
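
Putting the five steps together, here is a hedged sketch of the whole loop in Python. The exponential weight update `exp(-spread / h)`, with `h` as an assumed bandwidth parameter, is one common way to realize step 4; published LAC variants differ in the exact update and normalization, so treat this as an illustration of the scheme rather than a reference implementation:

```python
import numpy as np

def lac(X, k, h=1.0, max_iter=100, tol=1e-6, seed=0):
    """Sketch of the iterative LAC loop outlined above.

    X: (n, D) data matrix; k: number of clusters; h: bandwidth controlling
    how sharply the weights favor low-variance dimensions (assumed knob).
    Returns (labels, centers, weights), with weights of shape (k, D).
    """
    rng = np.random.default_rng(seed)
    n, D = X.shape
    # Step 1 -- Initialization: random data points as centers, equal weights.
    centers = X[rng.choice(n, size=k, replace=False)]
    weights = np.full((k, D), 1.0 / D)

    for _ in range(max_iter):
        # Step 2 -- Partitioning: nearest center under each cluster's own
        # weighted distance (sqrt omitted: it does not change the argmin).
        sq = (X[:, None, :] - centers[None, :, :]) ** 2   # (n, k, D)
        dists = (sq * weights[None, :, :]).sum(axis=-1)   # (n, k)
        labels = dists.argmin(axis=1)

        new_centers = centers.copy()
        new_weights = weights.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old center/weights for an empty cluster
            # Step 3 -- Updating centers: plain centroid of assigned points.
            new_centers[j] = members.mean(axis=0)
            # Step 4 -- Updating weights: dimensions where the cluster is
            # tight (small average squared spread) get larger weights.
            spread = ((members - new_centers[j]) ** 2).mean(axis=0)
            w = np.exp(-spread / h)
            new_weights[j] = w / w.sum()  # normalize: weights sum to 1

        # Step 5 -- Iteration: stop once centers and weights stabilize.
        converged = (np.abs(new_centers - centers).max() < tol and
                     np.abs(new_weights - weights).max() < tol)
        centers, weights = new_centers, new_weights
        if converged:
            break

    # Final partitioning with the converged centers and weights.
    sq = (X[:, None, :] - centers[None, :, :]) ** 2
    labels = (sq * weights[None, :, :]).sum(axis=-1).argmin(axis=1)
    return labels, centers, weights
```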

Key Advantages

  • Subspace Clustering: LAC can find clusters that exist in different subspaces of the data (e.g., Cluster A is defined by height and weight, while Cluster B is defined by age and income; a small synthetic demo follows this list).
  • Noise Robustness: By assigning low weights to irrelevant features locally, the algorithm is less likely to be “distracted” by noise in those dimensions.
  • Discovery of Elongated Shapes: Unlike K-Means, which looks for spherical blobs, LAC can find clusters that are elongated along specific axes.
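
Continuing the `lac` sketch above, here is a tiny demo on hypothetical synthetic data (not from the source) where each cluster is tight in a different pair of dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Cluster A: tight in dimensions 0-1, noisy in 2-3; Cluster B: the reverse.
a = np.hstack([rng.normal(0.0, 0.05, (100, 2)), rng.uniform(-1, 1, (100, 2))])
b = np.hstack([rng.uniform(-1, 1, (100, 2)), rng.normal(1.0, 0.05, (100, 2))])
X = np.vstack([a, b])

labels, centers, weights = lac(X, k=2, h=0.1)
# Each row of `weights` should concentrate its mass on that cluster's
# tight (informative) dimensions and nearly ignore the noisy ones.
print(weights.round(2))
```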

Real-World Applications

  • Genomics: Different sets of genes might be responsible for different types of diseases; LAC can identify which genes are relevant for which specific disease cluster.
  • Document Categorization: In a large set of articles, “sports” words are relevant for one cluster, while “finance” words are relevant for another. LAC adapts the weights to focus on the vocabulary that matters for each category.