Vote utilisateur: 0 / 5

k-means is a kind of clustering algorithms, which belong to the family of unsupervised machine learning models. It aims at finding $k$ groups of similar data (clusters) in an unlabeled multidimensional dataset.

## The k-means minimization problem

Let $(x_1, ..., x_n)$ be a set of $n$ observations with $x_i \in \mathbb{R}^{d}$, for $1 \leq i \leq n$. The aim of the k-means algorithms is to find a disjoint partition $S={S_1, ..., S_k }$ of the $n$ observations into $k \leq n$ clusters, minimizing $D$ the within-cluster distance to center: $$D(S) = \sum_{i=1}^k \sum_{x \in S_i} | x - \mu_i |^2$$ where $\mu_i$ is the $i$-th cluster center (i.e. the arithmetic mean of the cluster observations): $\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$, for $1 \leq i \leq n$.

Unfortunately, finding the exact solution of this problem is very tough (NP-hard) and a local minimum is generally sought using a heuristic.

## The algorithm

Here is a simple description of the algorithm taken from the book "Data Science from Scratch" by Joel Grus (O'Reilly):

1. Start with a set of k-means, which are $k$ points in $d$-dimensional space.
2. Assign each point to the mean to which it is closest.
3. If no point’s assignment has changed, stop and keep the clusters.
4. If some point’s assignment has changed, recompute the means and return to step 2.

This algorithm is an iterative refinement procedure. In his book "Python Data Science Handbook" (O'Reilly), Jake VanderPlas refers to this algorithm as kind of Expectation–Maximization (E–M). Since step 1 is the algorithm initialization and step 3 the stopping criteria, we can see that the algorithm consists in only two alternating steps:

step 2. is the Expectation:

"updating our expectation of which cluster each point belongs to".

step 4. is the Maximization:

"maximizing some fitness function that defines the location of the cluster centers".

This is described with more details in the following link.

An interesting geometrical interpretation is that step 2 corresponds to partitioning the observations according to the Voronoi diagram generated by the centers computed previously (either on step 1 or 4). This is also why the standard k-means algorithm is also called Lloyd's algorithm, which is a Voronoi iteration method for finding evenly spaced sets of points in subsets of Euclidean spaces.

### Voronoi diagram

Let us have a look at the Voronoi diagram generated by the $k$ means.

As in Jake VanderPlas' book, we generate some fake observation data using scikit-learn 2-dimensional blobs, in order to easily plot them.

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.datasets.samples_generator import make_blobs

k = 20
n = 1000
X, _ = make_blobs(n_samples=n, centers=k, cluster_std=0.70, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=5);

The k-means implementation from scikit-learn is used in the following, in order to compute the clusters and the centers:

#### Iteration #1

A single E-M iteration is performed in the following:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k, init='random', n_init=1, random_state=0, max_iter=1)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)  # cluster index for each observation
centers = kmeans.cluster_centers_  # cluster center coordinates
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=100, alpha=0.5)

<matplotlib.collections.PathCollection at 0x7fbf04d7da90>

Now we compute the Voronoi diagram generated by the centers with scipy and observe that data is partitioned according to this diagram.

from scipy.spatial import Voronoi, voronoi_plot_2d
vor = Voronoi(centers)
voronoi_plot_2d(vor)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
plt.show()

#### At "convergence"

kmeans = KMeans(n_clusters=k, init='random', n_init=1, random_state=0, max_iter=1000)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)  # cluster index for each observation
centers = kmeans.cluster_centers_  # cluster center coordinates
vor = Voronoi(centers)
voronoi_plot_2d(vor)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
plt.show()

Note that once the fit method from scikit-learn has been called, the predict one will just assign the given points to the closest centers computed in the fitting step.

### Initialization

The initialization step 1 is very important for the algorithm to perform well, especially because only a local optimum is reached at the end, which strongly depends on the starting point. The initialization method used above is the Random Partition, which randomly assigns a cluster to each observation, thus placing all means close to the center of the data set. As stated by Jake VanderPlas:

"[...] although the E–M procedure is guaranteed to improve the result in each step, there is no assurance that it will lead to the global best solution. For example, if we use a different random seed in our simple procedure, the particular starting guesses lead to poor results."

Some methods have been proposed to initialize the means in order to guarentee that the algorithm will reach a "correct" solution, such as K-means++, which is the default initialization method of sklearn.cluster.KMeans:

kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)  # cluster index for each observation
centers = kmeans.cluster_centers_  # cluster center coordinates
vor = Voronoi(centers)
voronoi_plot_2d(vor)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=5, cmap='summer')
plt.show()

Also, the method is usually performed several times (default is 10 times in scikit_learn) with different centroid seeds (different initializations). Only the best result is returned, in terms of within-cluster sum of squared $D(S)$, also called inertia.

print(kmeans.inertia_)

838.5283322649027

This inertia here is significantly better than the one we would get by using a single run of k-means (instead of 10) with a random partition initialization:

kmeans = KMeans(n_clusters=k, init='random', n_init=1, random_state=0)
kmeans.fit(X)
print(kmeans.inertia_)

1133.214871399982

### Computational considerations

Data sets are characterized by their cardinality $n$ and dimensionality $d$, among other things...

When $d$ is "large", one may deal with a lot of zeros. An example of sparse observations can be found in this case of text documents clustering by topics using a bag-of-words approach: same word occurences tend to be relatively infrequent throughout the corpus, which means that a lot of zeros would be found in the full numpy array representation. This is why a [compressed sparse row](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format) matrix format is used in this case. In that case a special implementation of k-means algorithm is used in scikit-learn. This allow to save some memory but also probably save some computational time.

When the data is not sparse, one may use the Elkan version , which uses the triangle inequality to accelerate the algorithm. The author states in his paper:

Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number $k$ of clusters increases. For $k \geq 20$ it is many times faster than the best previously known accelerated k-means method.

However the algorithm (algorithm="elkan") does not support sparse data in scikit-learn, but this is the default version when dealing with dense data, as we did above.

When dealing with large a large cardinality $n$, then the mini-batch k-means clustering should be used. Here is the paper introducing this algorithm that scales well when dealing with large data sets. This works for both sparse anf full input data format. As for the k-means algorithm, the initialization is run several times (10 by default), but not the algorithm, which is run only once.

Also, note that the scikit-learn implementation of k-means may be run in parallel, running each instance of k-means (algo + instance of initialization) on a thread.