The k-means algorithm is used to find the centres of k clusters within a set of vectors. These can then be given classification labels; optionally, the resulting classification clusters can be used as the basis for some other algorithm such as Nearest Neighbour.

The algorithm works iteratively. Initially, k points are distributed randomly within the vector space to form the initial cluster centres, then:

  1. each input item is assigned to the cluster to whose centre it has the lowest mean distance;
  2. the cluster centres are moved so they have the lowest mean distance to the points within their respective clusters (i.e. into the geometric centres);
  3. the points 1. and 2. are repeated until the algorithm converges and the cluster centres stop moving.

The main drawbacks with k-means are:

  • because the distance is minimized over all dimensions, k-means only works well where the clusters are close to spherical. It performs poorly with elongated clusters where the variance is much higher in some dimensions than in others. If the space cannot be warped to give a relatively constant variance over all dimensions before the algorithm starts, expectation maximization is likely to give better results provided that the other constraints it places on the data are met;
  • you have to get the value of k (a hyperparameter) right before you start. If you use the algorithm with k=3 on data that is actually arranged in 4 clearly defined, dense clusters, one of the centres will end up half way between two of the clusters. One answer to this problem lies with the G-means algorithm, which tries out progressively larger values of k until one is found where the items around each centre are arranged in a Gaussian distribution. Like expectation maximization but unlike standard k-means, G-means assumes a Gaussian distribution around the cluster centres.
  • it is not guaranteed that k-means will always converge to the optimal solution. Different solutions can emerge with different random initial values.

See also k-medoids and k-medians, which follow the same basic principle but use different distance measures, as well as spherical k-means, where the direction but not the magnitude of vectors is used for clustering.

Lloyd's algorithm
has functional building block
has input data type
IDT_Vector of quantitative variables
has internal model
has output data type
ODT_Vector of quantitative variables ODT_Classification
has learning style
has parametricity
PRM_Nonparametric with hyperparameter(s) PRM_Nonparametric
has relevance
sometimes supports
ALG_Nearest Neighbour
mathematically similar to
ALG_Expectation maximization