SciStatCalc: October 2020

Sunday, 4 October 2020

Cluster Number for K-means algorithm

This blog post plots the total within-cluster sum of squares (WSS) against the number of clusters for the k-means algorithm. Examining how the total WSS decreases as the number of clusters increases gives an intuition for how many clusters the dataset requires. The elbow method can be used: the elbow is the cluster number beyond which adding further clusters no longer decreases the total WSS significantly. For a k-means calculator with a 3d plot display, see the post "k-means calculator for arbitrary sized vectors".
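As a rough illustration of the quantity being plotted, the sketch below computes the total WSS for each cluster number from 1 up to a maximum. It is not the code behind this page: the function names (`kmeans`, `total_wss`, `wss_curve`) are made up for the example, and the seeding simply picks k distinct samples at random rather than using k-means++.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd iterations; seeds are k distinct samples (an assumption)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def total_wss(points, labels, centroids):
    """Sum of squared distances from each sample to its own centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def wss_curve(points, max_k, seed=0):
    """Total WSS for k = 1 .. max_k, one k-means run per k."""
    rng = random.Random(seed)
    curve = []
    for k in range(1, max_k + 1):
        labels, centroids = kmeans(points, k, rng)
        curve.append(total_wss(points, labels, centroids))
    return curve

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    for k, wss in enumerate(wss_curve(data, 4), start=1):
        print(k, wss)
```

Plotting the returned list against k gives the elbow graph. Because each k uses a single run, a poor local minimum can produce a bump in the curve; running several times per k and keeping the smallest WSS is one way to smooth this out.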

The cluster centres (or centroids) are initialised using a variant of the k-means++ algorithm, as proposed by David Arthur and Sergei Vassilvitskii in 2007.
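For readers unfamiliar with k-means++, here is a minimal sketch of the standard seeding step: the first centroid is a uniformly random sample, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The variant used on this page is not described, so this should be read as the textbook version only; `kmeans_pp_init` is a hypothetical name.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng=None):
    """Standard k-means++ seeding (not necessarily this page's variant)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of samples")
    rng = rng or random.Random()
    # First centroid: a sample chosen uniformly at random.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from every sample to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every sample coincides with a centroid; fall back to uniform.
            centroids.append(list(rng.choice(points)))
        else:
            # Far-away samples are proportionally more likely to be picked.
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
    print(kmeans_pp_init(data, 3, random.Random(1)))
```

Samples that are already centroids have weight zero, so they are not drawn again unless all remaining weights are zero.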

Please enter lines of comma-separated numbers in the text areas below. There must be no trailing comma after the last number in each line, and no new line after the last sample. The maximum number of clusters must also be entered in the appropriate field.

Alternatively, you can load a CSV file, which must contain only comma-separated numbers.
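The input rules above can be checked before clustering. The sketch below is one possible parser for text in that format; `parse_samples` is a hypothetical helper, not the page's own code, and the requirement that every line has the same number of values is an assumption added here.

```python
def parse_samples(text):
    """Parse lines of comma-separated numbers into a list of float vectors."""
    if text == "":
        raise ValueError("no samples were entered")
    if text.endswith("\n"):
        raise ValueError("there must be no new line after the last sample")
    samples = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.rstrip().endswith(","):
            raise ValueError(f"line {number}: trailing comma")
        # float() raises ValueError for empty or non-numeric fields.
        samples.append([float(field) for field in line.split(",")])
    # Assumption: all samples must have the same dimension.
    if len({len(sample) for sample in samples}) != 1:
        raise ValueError("all samples must have the same number of values")
    return samples

if __name__ == "__main__":
    print(parse_samples("1,2,3\n4.5,5,6"))
```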

To perform the k-means clustering with the number of clusters varying from 1 to the specified maximum, simply press the button labelled "Perform k-means over multiple cluster numbers" below. The results for the maximum cluster number will populate the text areas below labelled "Label and data sample" and "Label and Centroid values". Most importantly, a graph plotting the variation of total WSS with cluster number will be updated.

An example three-dimensional dataset has been loaded, with three clusters of 200 samples, as guidance. The maximum number of clusters is set to 6, so that one can visualise the elbow of the graph at cluster number 3.

Input

Friday, 2 October 2020

k-means calculator for arbitrary sized vectors

This blog post implements a basic k-means clustering algorithm, which can be applied to vectors of arbitrary size. Once the calculation button is clicked, the clusters are displayed graphically in a 3d plot (using a great library from Plotly, where you can zoom and turn the plot, amongst other handy capabilities). This is useful for visualising the spatial clustering of vectors of size 1 to 3 inclusive - the centroids are highlighted in black, while each cluster has its own colour.
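As an illustration of what a basic k-means implementation involves, the sketch below alternates the two standard steps (assign each sample to its nearest centroid, then move each centroid to the mean of its members) until the labels stop changing. It works for vectors of any size. The names `assign`, `update` and `kmeans` are invented for this example, and the seeding picks k distinct samples at random rather than using k-means++.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each sample with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the samples assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))  # empty cluster: keep old centroid
    return new_centroids

def kmeans(points, k, max_iter=100, rng=None):
    """Basic k-means; seeds are k distinct random samples (an assumption)."""
    rng = rng or random.Random()
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # no sample changed cluster, so the run has converged
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = kmeans(data, 2, rng=random.Random(0))
    print(labels, centroids)
```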

The cluster centres (or centroids) are initialised using a variant of the k-means++ algorithm, as proposed by David Arthur and Sergei Vassilvitskii in 2007.

Alternatively, you can load a CSV file, which must contain only comma-separated numbers.

To perform the k-means clustering, please enter the number of clusters in the appropriate field, then press the button labelled "Perform k-means clustering" below - the results will populate the text areas below labelled "Label and data sample" and "Label and Centroid values". In addition, a 3d plot will appear.

Note that the k-means algorithm can converge to a local minimum, and can also exhibit degeneracy, whereby one of the clusters has no members, although the k-means++ initialisation should somewhat mitigate this. Should either scenario occur, simply re-run the algorithm by clicking on the button labelled "Perform k-means clustering".
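Outside the calculator, the same re-running can be automated. The sketch below is one way to do it, assuming a clustering function `fit(points, k)` that returns `(labels, centroids)`; that interface and the name `best_of_restarts` are hypothetical. Runs with an empty cluster are discarded, and of the remaining runs the one with the smallest total within-cluster sum of squares is kept.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def has_empty_cluster(labels, k):
    """True if fewer than k distinct labels were actually used."""
    return len(set(labels)) < k

def best_of_restarts(fit, points, k, restarts=10):
    """Run fit several times; return (wss, labels, centroids) of the best run.

    fit is any callable fit(points, k) -> (labels, centroids); this interface
    is an assumption of the sketch. Returns None if every run is degenerate.
    """
    best = None
    for _ in range(restarts):
        labels, centroids = fit(points, k)
        if has_empty_cluster(labels, k):
            continue  # degenerate run: at least one cluster has no members
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if best is None or wss < best[0]:
            best = (wss, labels, centroids)
    return best
```

Keeping the run with the lowest total sum of squares does not guarantee the global minimum, but it makes a poor local minimum less likely to be the reported result.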

An example three-dimensional dataset has been loaded, with three clusters of 200 samples, as guidance.

Input

Enter number of clusters (k value):-

Label and data sample:-

Label and Centroid values:-
