Friday, 24 January 2014

k-means clustering calculator

This blog post implements a basic k-means clustering algorithm, which can be applied to either scalar (1-d) data or 2-d data (x and y components). Graphs of the clustered data and of the algorithm's convergence (measured by the number of data samples that change cluster membership between consecutive iterations) are displayed below.

The cluster centres (or centroids) are initialised using the k-means++ algorithm, as proposed by David Arthur and Sergei Vassilvitskii in 2007.
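As a rough illustration of the k-means++ seeding idea, here is a minimal Python sketch. It is not the code behind this calculator: the function name `kmeans_pp_init`, the representation of samples as tuples (scalars as 1-tuples) and the optional `rng` argument are all assumptions made for the example.

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids from `points` (a list of tuples) by k-means++ seeding."""
    rng = rng or random.Random()
    # First centroid: chosen uniformly at random from the samples.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each sample to its nearest already-chosen centroid.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids)
              for pt in points]
        if sum(d2) == 0:
            # Every sample coincides with a chosen centroid: fall back to a uniform pick.
            centroids.append(rng.choice(points))
        else:
            # Next centroid: sampled with probability proportional to d2.
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```

Samples far from the centroids chosen so far are more likely to be picked next, which tends to spread the initial centres across the data.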

Please enter the numbers in the text areas below - either one number per line or two comma-separated numbers per line. There must be no new line after the last number.

Alternatively, you can choose to load a CSV file, which must contain either a single column of numbers (for scalar input) or two comma-separated columns of numbers (for 2-d input). The first line can be a comment line, starting with the character #.
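The input format described above could be parsed along the following lines. This is only a sketch: `parse_samples` is a hypothetical helper, not the calculator's own parser, and unlike the calculator it silently skips blank lines rather than rejecting a trailing new line.

```python
def parse_samples(text):
    """Parse one number, or two comma-separated numbers, per line into tuples of floats."""
    samples = []
    for i, line in enumerate(text.splitlines()):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored in this sketch
        if i == 0 and line.startswith("#"):
            continue  # optional comment line at the top of a CSV file
        fields = [float(f) for f in line.split(",")]
        if len(fields) not in (1, 2):
            raise ValueError(f"line {i + 1}: expected 1 or 2 numbers, got {len(fields)}")
        samples.append(tuple(fields))
    # Scalar and 2-d samples must not be mixed in the same input.
    if len({len(s) for s in samples}) > 1:
        raise ValueError("all lines must have the same number of columns")
    return samples
```

Scalar samples come back as 1-tuples, so the same distance calculation can be used for both scalar and 2-d data.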

To perform the k-means clustering, please enter the number of clusters and the number of iterations in the appropriate fields, then press the button labelled "Perform k-means clustering" below - the results will populate the textareas below labelled "Output" and "Centroid values". The "Output" textarea will list the sample values and the cluster/centroid index each sample belongs to, while the "Centroid values" textarea will list the centroid index and the value of the centroids (or cluster centres).

Note that the k-means algorithm can converge to a local minimum, and also exhibit degeneracy, whereby one of the clusters has no members. Should these scenarios occur, simply re-run the algorithm.
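For readers who want to see the iteration itself, the following is a minimal Python sketch of the standard k-means (Lloyd's) loop, with convergence tracked as the number of samples that change cluster between iterations. The function `kmeans` and its signature are assumptions made for illustration and are not the code running on this page; the initial centroids are passed in by the caller.

```python
def kmeans(points, centroids, iterations):
    """Run k-means; return (labels, centroids, membership changes per iteration)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [tuple(c) for c in centroids]
    labels = [None] * len(points)
    changes = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
        # Convergence measure: how many samples switched cluster.
        changes.append(sum(a != b for a, b in zip(labels, new_labels)))
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty (degenerate) cluster keeps its old centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
        if changes[-1] == 0:
            break  # no sample changed cluster, so further iterations change nothing
    return labels, centroids, changes
```

A cluster index that never appears in `labels` indicates the degenerate case mentioned above, and re-running with a different initialisation is the simplest remedy.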



Input




Enter number of clusters (k value):-

Enter number of iterations:-





Output:-
Centroid values:-



[Figure: Cluster Visualisation (axes: Value, Samples)]
[Figure: Algorithm convergence (axes: Change, Iteration number)]


Sunday, 5 January 2014

Chi-squared Test for Independence

In this blog post I will discuss Pearson's Chi-squared test for independence, using an example.

Pearson's Chi-squared test for independence is applied to outcomes that are arranged in a tabular form and tests for independence between two factors. The rows correspond to the levels for one factor and the columns to the levels for the other factor. The Null Hypothesis is that the two factors are independent.

For example, we could have two rows corresponding to gender (Female and Male), and three columns corresponding to the products purchased from a supermarket (Product A, Product B and Product C). The six entries are the numbers of people of each gender who bought each product. The Null Hypothesis is that the proportion of people who bought each of the three products is independent of their gender. If the Null Hypothesis holds, the proportion of females who bought Product A equals the proportion of all shoppers (males and females) who bought Product A, the proportion of males who bought Product A equals that same overall proportion, and so on for the other products. To elaborate on this further, we can represent the outcomes in the table below.


Gender    Product A    Product B    Product C
Female    $a$          $b$          $c$
Male      $d$          $e$          $f$

For the Null Hypothesis to hold, the proportion of females who purchased Product A would be equal to the proportion of both males and females who purchased Product A out of the total population:-

$\Large \frac{a}{a+b+c}=\frac{a+d}{a+b+c+d+e+f}$

Rearranging the above equation, we obtain the expected value of $a$ as

$\Large \hat{a}=\frac{(a+d)(a+b+c)}{a+b+c+d+e+f}$

and the proportion of males who purchased Product A would be equal to the proportion of both males and females who purchased Product A out of the total population:-

$\Large \frac{d}{d+e+f}=\frac{a+d}{a+b+c+d+e+f}$

Rearranging the above equation in a similar fashion, we obtain the expected value of $d$ as

$\Large \hat{d}=\frac{(a+d)(d+e+f)}{a+b+c+d+e+f}$

Going through all the entries, the expected value of each entry turns out to be the product of the entry's column sum and the entry's row sum divided by the total population.
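The rule above (row sum times column sum, divided by the total) can be sketched in a few lines of Python. The function name `expected_counts` and the counts in `observed` are made up purely for illustration.

```python
def expected_counts(observed):
    """Expected counts under independence: row sum * column sum / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# Hypothetical counts (rows: Female, Male; columns: Product A, B, C).
observed = [[20, 30, 50],
            [30, 30, 40]]
expected = expected_counts(observed)
```

Here both row sums are 100 and the column sums are 50, 60 and 90, so each row of `expected` is 25, 30, 45.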

Once all the expected values are computed, the chi-squared test statistic is obtained by taking, for each entry, the squared difference between the observed outcome and its expected value, dividing it by the expected value, and summing the results over all entries. Mathematically, this can be represented as


$\Large \chi^2=\sum_{m=1}^{r}\sum_{n=1}^{c}\frac{(O(m,n)-E(m,n))^2}{E(m,n)}$

where $r$ is the number of rows, $c$ is the number of columns, $O(m,n)$ is the outcome in row $m$, column $n$ and $E(m,n)$ the corresponding expected value. The number of degrees of freedom for this statistic is $(r-1)(c-1)$; for the table above this is $(2-1)(3-1)=2$. Based on the chi-squared statistic and the degrees of freedom we can calculate the p-value, and if this value is less than a significance level of our choice (a commonly used value is 0.05) we reject the Null Hypothesis that the proportion of people who bought each of the three products is independent of their gender.
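To make the procedure concrete, here is a minimal Python sketch that computes the statistic and degrees of freedom for a table of counts. The function name and the counts are hypothetical, and all expected values are assumed to be positive. For the p-value the sketch relies on the fact that, with exactly 2 degrees of freedom, the chi-squared upper tail probability is $e^{-x/2}$; other table sizes would need a statistics library.

```python
import math

def chi_squared_independence(observed):
    """Return (statistic, degrees of freedom) for Pearson's test of independence."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    for m, row in enumerate(observed):
        for n, o in enumerate(row):
            e = row_sums[m] * col_sums[n] / total  # E(m,n), assumed > 0
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(col_sums) - 1)
    return stat, dof

# Hypothetical counts (rows: Female, Male; columns: Product A, B, C).
observed = [[20, 30, 50],
            [30, 30, 40]]
stat, dof = chi_squared_independence(observed)
# Closed form valid only for 2 degrees of freedom (as in a 2x3 table).
p_value = math.exp(-stat / 2) if dof == 2 else None
reject = p_value is not None and p_value < 0.05
```

For these made-up counts the statistic is about 3.11 and the p-value about 0.21, so at the 0.05 level the Null Hypothesis of independence would not be rejected.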