This blog post implements a basic k-means clustering algorithm, which can be applied to either scalar (1-d) data or 2-d data (x and y components). Graphs of the clustered data and of the algorithm's convergence (as measured by the changes in cluster membership of the data samples between consecutive iterations) are displayed below. For a more general (and better performing) k-means calculator, see the much more recent blog post here.
The cluster centres (or centroids) are initialised using the k-means++ algorithm, as proposed by David Arthur and Sergei Vassilvitskii in 2007.
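As a rough illustration of the idea behind k-means++ seeding, here is a minimal Python sketch. It is not the code running on this page: the names `sq_dist` and `kmeans_pp_init` are made up for this example, and samples are assumed to be tuples of floats (length one for scalar data, length two for 2-d data). The first centroid is drawn uniformly from the data, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids with k-means++ style seeding (hypothetical helper)."""
    rng = rng or random.Random()
    # First centroid: chosen uniformly at random from the samples.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from every sample to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every sample coincides with a centroid; fall back to a uniform draw.
            centroids.append(rng.choice(points))
        else:
            # Next centroid: drawn with probability proportional to d2.
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    print(kmeans_pp_init(data, 2, random.Random(0)))
```

Samples that are far from every existing centroid are the most likely to be picked next, which tends to spread the initial centroids across the data.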
Please enter the numbers in the text areas below, either one number per line or two comma-separated numbers per line. There must be no newline after the last number.
Alternatively, you can choose to load a CSV file, which must contain either a single column of numbers (for scalar input) or two comma-separated columns of numbers. The first line can be a comment line, starting with the character #.
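For readers who want to prepare or check their data offline, the input format described above could be parsed along the following lines. This is only a sketch: `parse_samples` is a hypothetical helper, not the parser used by this page, and unlike the page it tolerates blank lines and a trailing newline.

```python
def parse_samples(text):
    """Parse lines of 'x' or 'x,y' into a list of tuples of floats."""
    samples = []
    for i, line in enumerate(text.splitlines()):
        line = line.strip()
        if i == 0 and line.startswith("#"):
            continue  # optional comment line at the top of the file
        if not line:
            continue  # assumption: skip blank lines rather than reject them
        # float() raises ValueError on anything that is not a number.
        fields = [float(f) for f in line.split(",")]
        if len(fields) not in (1, 2):
            raise ValueError(f"line {i + 1}: expected 1 or 2 numbers")
        samples.append(tuple(fields))
    # All rows must have the same dimensionality (all scalar or all 2-d).
    if len({len(s) for s in samples}) > 1:
        raise ValueError("mixed 1-d and 2-d rows")
    return samples

if __name__ == "__main__":
    print(parse_samples("# x,y\n1.0,2.0\n3.5,4.5"))
```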
To perform the k-means clustering, please enter the number of clusters and the number of iterations in the appropriate fields, then press the button labelled "Perform k-means clustering" below - the results will populate the textareas below labelled "Output" and "Centroid values". The "Output" textarea will list the sample values and the cluster/centroid index each sample belongs to, while the "Centroid values" textarea will list the centroid index and the value of the centroids (or cluster centres).
Note that the k-means algorithm can converge to a local minimum, and also exhibit degeneracy, whereby one of the clusters has no members. Should these scenarios occur, simply re-run the algorithm, by clicking on the button labelled "Perform k-means clustering".
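The iteration loop, the convergence measure and the re-run strategy can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation behind this page: the function names are hypothetical, samples are tuples of floats, and the restart helper uses plain random seeding rather than k-means++ to keep the example short. The `changes` list records how many samples switched cluster in each iteration (the first entry always equals the number of samples, because nothing is assigned beforehand), and `empty` lists the indices of any clusters left without members.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations):
    """Run a fixed number of k-means iterations from the given centroids."""
    centroids = list(centroids)
    labels = [-1] * len(points)  # -1 means "not yet assigned"
    changes = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every sample.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Convergence measure: number of samples that switched cluster.
        changes.append(sum(a != b for a, b in zip(labels, new_labels)))
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    # Degenerate clusters: centroid indices that no sample is assigned to.
    empty = [j for j in range(len(centroids)) if j not in labels]
    return labels, centroids, changes, empty

def kmeans_with_restarts(points, k, iterations, max_tries=10, rng=None):
    """Re-run from fresh seeds until no cluster is empty (or tries run out)."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        seeds = rng.sample(points, k)  # plain random seeding, not k-means++
        result = kmeans(points, seeds, iterations)
        if not result[3]:
            break
    return result  # may still be degenerate if max_tries was exhausted

if __name__ == "__main__":
    data = [(0.0,), (0.2,), (10.0,), (10.2,)]
    labels, cents, changes, empty = kmeans(data, [(0.0,), (10.0,)], 5)
    print(labels, cents, changes, empty)
```

Detecting a poor local minimum automatically is harder than detecting an empty cluster. One common approach, not shown here, is to run several restarts and keep the result with the smallest total within-cluster squared distance.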
An example two-dimensional dataset has been loaded, with three clusters of 200 samples - the number of iterations is set to ten. Pressing the "Perform k-means clustering" button can result in a local minimum being reached, which will be easy to spot in the Cluster Visualisation display. In this case, simply re-run the algorithm.
[Chart: Cluster Visualisation. Axis labels: Value, Samples]
[Chart: Algorithm convergence. Axis labels: Change, Iteration number]