Saturday, 21 September 2013

Critical value of W statistic for Shapiro-Wilk test


One of the tests in SciStatCalc is the Shapiro-Wilk test for Gaussianity/Normality. Below is a table of the number of samples against the critical value of the statistic $W$ for this test, derived from Table 10-3 in Appendix D, from the document that is accessible at the following website

http://www.epa.gov/osw/hazard/correctiveaction/resources/guidance/sitechar/gwstats/unified-guid-app.pdf


The alpha level is 5%, for a two-tailed test. The $W$ statistic needs to be greater than the critical value for the null hypothesis that the samples are drawn from a Gaussian distribution not to be rejected. Sample values up to 50 are shown. 

Number or Samples W(critical)
3 0.767
4 0.748
5 0.762
6 0.788
7 0.803
8 0.818
9 0.829
10 0.842
11 0.850
12 0.859
13 0.866
14 0.874
15 0.881
16 0.887
17 0.892
18 0.897
19 0.901
20 0.905
21 0.908
22 0.911
23 0.914
24 0.916
25 0.918
26 0.920
27 0.923
28 0.924
29 0.926
30 0.927
31 0.929
32 0.930
33 0.931
34 0.933
35 0.934
36 0.935
37 0.936
38 0.938
39 0.939
40 0.940
41 0.941
42 0.942
43 0.943
44 0.944
45 0.945
46 0.945
47 0.946
48 0.947
49 0.947
50 0.947

Tuesday, 17 September 2013

Numerical Estimate of the Inverse Error Function

The function that numerically approximates the inverse error (erf) function $[\large \frac{2}{\sqrt{\pi}}\int_0^xe^{-t^2}dt]^{-1}$ is captured in the C function below - it's a fragment of the source code of my (now free) iOS App SciStatCalc. Note the one iteration of Halley's method to refine the result.

Feel free to use it, if you wish to do so.

double inverf(double x)
{   
    double w,p;
    
    w = -log((1.00-x)*(1.00+x));
    
    if ( w < 5.000000 ) {
        w = w - 2.500000;
        p= 2.81022636e-08;
        p= 3.43273939e-07 + p*w;
        p= -3.5233877e-06 + p*w;
        p = -4.39150654e-06 + p*w;
        p = 0.00021858087 + p*w;
        p = -0.00125372503 + p*w;
        p = -0.00417768164 + p*w;
        p = 0.246640727 + p*w;
        p = 1.50140941 + p*w;
    } else
    {
        w = sqrtf(w) - 3.000000;
        p =  -0.000200214257;
        p = 0.000100950558 + p*w;
        p = 0.00134934322 + p*w;
        p = -0.00367342844 + p*w;
        p = 0.00573950773 + p*w;
        p = -0.0076224613 + p*w;
        p = 0.00943887047 + p*w;
        p = 1.00167406 + p*w;
        p = 2.83297682 + p*w;
    }
    
    double res_ra = p*x; // assign to rational estimate variable
    
    // Halley's method to refine estimate of inverse erf
    double res_hm = 0;
    double fx = (erf(res_ra) - x);
    double df = (2.0/sqrt(M_PI))*exp(-(res_ra * res_ra));
    double d2f = -2 * res_ra * df;
    
    res_hm = res_ra - (2*fx*df)/((2*df*df) - (fx*d2f));
    return res_hm;
}

The above function can be used to estimate the inverse cumulative distribution function (quantile) of the Gaussian distribution (mu is the mean, and sig the standard deviation of the distribution), as below:-

double gauss_icdf(double p, double mu, double sig)
{
    double res;
    res = (inverf((2*p) - 1) *sig*sqrt(2.0)) + mu;
    return res;
}


Monday, 16 September 2013

SciStatCalc is now free

The price of SciStatCalc (what an unwieldy name!) has been reduced by a 100% on the iTunes App Store - bringing advanced statistical functions and numerous statistical tests to the masses for free! 

Friday, 6 September 2013

Correlations

In this post, two types of Correlations will be discussed, the non-parametric correlation called Spearman's Rank Correlation, and the parametric correlation called the Pearson Correlation. 

Spearman's Rank Correlation

Spearman's Rank Correlation $\rho$ is a non-parametric measure of the relationship between two variables, and tests whether these variables have a monotonic relationship.

Suppose two populations, with the same number of samples, have a monotonic relationship. If these two populations were paired together, and the pairs sorted by the magnitude of the first element in ascending order, the second elements would be ordered by magnitude in ascending order too. Conversely if the first element of the pair was used to order the pairs in descending order, the second element of the pair would be ordered in descending order too. This relationship would hold true if we  ordered the second element of the pair, and examined the ordering of the first element.  Thus, each of the pair of samples would be ordered in the same direction.

If two populations are anti-monotonic with respect to each other, then if these two populations were paired together, and the pairs sorted by the magnitude of the first element in ascending order, the second elements would be ordered by magnitude in descending order. Conversely if the first element of the pair was used to order the pairs in descending order, the second element of the pair would be ordered in  ascending order. This relationship would hold true if we ordered the second element of the pair, and examined the ordering of the first element.  Thus, each of the pair of samples would be ordered in the opposite direction.

Spearman's Rank Correlation is bounded in value, between $-1$ and $+1$, for when the variables are related in either an anti-monotonic or monotonic manner with each other respectively, and is given by the following equation:-

$\Large \rho=\frac{\sum_{i=1}^{n}{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$ (Eq. 1)

where $x_i$ and $y_i$ are the result of converting $n$ (raw) samples $x[i]$ and $y[i]$ to their ranks respectively, and $\bar{x}$ and $\bar{y}$ are the mean of the ranks $x_i$ and $y_i$ respectively.

If there are no duplicate values of the raw samples ($x[i]$ and $y[i]$), i.e. no tied ranks, the above formula (Eq. 1) simplifies to

$\Large \rho=1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$ (Eq. 2)

where $d_i=x_i-y_i$. A derivation of this follows.

When there are no ties

$\sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}(y_i-\bar{y})^2$ (Eq. 3)

Also, $\bar{x}=\bar{y}$ regardless of whether there are ties or not. 


Expanding the left hand side of (Eq. 3), we have

$\large \sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)$

$\large =\frac{n(n+1)(n+2)}{6}-\frac{n(n+1)^2}{2}+\frac{n(n+1)^2}{4}$

$\large =\frac{n(n^2-1)}{12}$ (Eq. 4)

where we have used the identities 

$\large \sum_{i=1}^{n}x_i^2=1^2+2^2+...+n^2=\frac{n(n+1)(n+2)}{6}$

and 

$\large \sum_{i=1}^{n}x_i=1+2+...+n=\frac{n(n+1)}{2}$

Thus, the denominator of (Eq. 1) is $\large \frac{n(n^2-1)}{12}$.

Examining the numerator of (Eq. 1), we can expand this to 

$\sum_{i=1}^{n}(x_i-\bar{x})^2$ - $\sum_{i=1}^{n}(x_i-\bar{x})^2$ + $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$


Using (Eq. 3) , we can further expand the numerator  to 

$\sum_{i=1}^{n}(x_i-\bar{x})^2$ - $0.5\sum_{i=1}^{n}(x_i-\bar{x})^2$ $0.5\sum_{i=1}^{n}(y_i-\bar{y})^2$  + $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

Dividing the expanded numerator by the denominator, (Eq. 1) becomes 

$\Large \rho=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x}-y_i+\bar{y})^2}{2\sum_{i=1}^{n}(x_i-\bar{x})^2}$

Substituting (Eq. 4), we have

$\Large \rho=1-\frac{6\sum_{i=1}^{n}(x_i-y_i)^2}{n(n^2-1)}$

If we denote $d_i=x_i-y_i$, we end up with

$\Large \rho=1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$

which is the simplified expression for $\rho$ when no ties are present.

As Spearman's Rank Correlation is a non-parametric test, it will be robust to the presence of outliers in the datasets.

Spearman's Rank Correlation - A worked example

Consider the raw sample pairs ($x[i]$,$y[i]$) as below

  1 , 1
  5 , 4
  3 , 2
  3 , 3

Their ranks ($x_i$,$y_i$) are on the right:-
                           
($x[i]$,$y[i]$)    ($x_i$,$y_i$)      ($x_i$-$\bar{x}$)     ($y_i$-$\bar{y}$)           
  1 , 1         1, 1          -1.5         -1.5
  5 , 4         4, 4           1.5           1.5
  3 , 2      2.5, 2              0          -0.5
  3 , 3      2.5, 3              0           0.5

The numerator of $\rho$ is ($-1.5 \times -1.5 + 1.5 \times 1.5 + 0 \times -0.5 + 0 \times 0.5$), which is $4.5$. The denominator of $\rho$ is $\sqrt{(2.25 + 2.25 + 0 + 0) \times (2.25 + 2.25 + 0.25 + 0.25)}$ which is $4.743416$.  Thus $\rho$ is $4.5/4.743416$, which is $0.948683$.

There is a Spearman Rank Correlation Calculator in this blog, which can be found here.

Pearson Correlation

The Pearson Correlation is the parametric equivalent of the Spearman Rank Correlation, and is given by the equation

$\Large r=\frac{\sum_{i=1}^{n}{(x[i]-\bar{x})(y[i]-\bar{y})}}{\sqrt{\sum_{i=1}^{n}(x[i]-\bar{x})^2\sum_{i=1}^{n}(y[i]-\bar{y})^2}}$

where $x[i]$ and $y[i]$ are the raw samples from the two populations that are to be correlated against, and $\bar{x}$ and $\bar{y}$ their respective means.

Like the Spearman Rank Correlation, the Pearson Correlation has a minimum of $-1$ and a maximum of +1. Unlike the Spearman Rank Correlation, it is sensitive to the value of the sample, and will be $+1$ only if the two populations are exactly linearly related with positive gradient, and $-1$ only if the two populations are exactly linearly related with negative gradient.  

It is worth noting that the square of the Pearson Correlation ($r^2$) is the Coefficient of Determination, that was encountered in the Linear Regression post, in this blog. There is a Pearson Correlation Calculator in this blog, which can be found here

Thursday, 5 September 2013

Wilcoxon Signed Rank Test Table


Below is a table of the number of samples against the critical value of the statistic $W$ for the Wilcoxon Signed Rank Test. The alpha level is 5%, for a two-tailed test.  Sample values up to 25 are shown - for more than 25 samples, calculating the z-score and the resulting p-value will yield accurate enough results. 



Number or Samples W(critical)
6 0
7 2
8 4
9 6
10 8
11 11
12 14
13 17
14 21
15 25
16 30
17 35
18 40
19 46
20 52
21 58
22 65
23 73
24 81
25 89

Wednesday, 4 September 2013

Linear Regression

The linear regression model is a fairly simple model, nevertheless it is used in many applications. 

What follows is a fairly mathematical derivation of the linear regression parameters.

Let us denote the observations as $y[n]$ and $x[n]$ for $n=0,1,..,N-1$ for the model 

$y[n] = ax[n]+b$

We assume that we can measure both $x[n]$ and $y[n]$ - the unknowns are $a$ and $b$, the gradient and intercept respectively. We can use the least squares error criterion to find estimates for $a$ and $b$, from data $x[n]$ and $y[n]$.

Let  $e[n] = y[n] - ax[n]-b$ be the error. We can represent the samples $e[n]$, $x[n]$ and $y[n]$ as length $N$ column vectors $\underline{e}$,$\underline{x}$ and $\underline{y}$ respectively.

Thus the error sum of squares (which can be regarded as a cost function $C$ to minimise over) will be

$\underline{e}^T\underline{e}=(\underline{y}-a\underline{x}-b\underline{1})^T(\underline{y}-a\underline{x}-b\underline{1}$).

where $\underline{1}$ is a column vector with all elements set to 1.

Expanding this cost function results in:-

$C=\underline{y}^T\underline{y}-2a\underline{x}^T\underline{y}-2b\underline{y}^T\underline{1}+a^2(\underline{x}^T\underline{x})+2ab(\underline{x}^T\underline{1})+b^2(\underline{1}^T\underline{1})$

Differentiating $C$ w.r.t. to $a$ and $b$:-

$\frac{\partial C}{\partial a}=2a(\underline{x}^T\underline{x})+2b(\underline{x}^T\underline{1})-2\underline{x}^T\underline{y}$

$\frac{\partial C}{\partial b}=2a(\underline{x}^T\underline{1})+2b(\underline{1}^T\underline{1})-2\underline{y}^T\underline{1}$

In order to estimate $\hat{a}$ and $\hat{b}$ (the least squares estimates of $a$ and $a$ respectively) we need to set the above two partial differential equations to 0, and solve for $a$ and $b$. We need to solve the following matrix equation

$\left[\begin{array}{cc}\underline{x}^T\underline{x}&\underline{x}^T\underline{1}\\ \underline{x}^T\underline{1}&\underline{1}^T\underline{1}\end{array}\right]$$\left[\begin{array}{c}\hat{a}\\ \hat{b}\end{array}\right]$ = $\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$

Inverting the matrix and pre multiplying both sides of the above matrix equation by the inverse results in:

$\left[\begin{array}{c}\hat{a} \\ \hat{b} \end{array}\right]$ = $\frac{1}{N(\underline{x}^T\underline{x})-(\underline{x}^T\underline{1})^2}$$\left[\begin{array}{cc}\underline{1}^T\underline{1}&-\underline{x}^T\underline{1}\\ -\underline{x}^T\underline{1}&\underline{x}^T\underline{x}\end{array}\right]$$\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$

If we revert to a more explicit notation in terms of the samples, we have

$\hat{a}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[N(\sum_{n=0}^{N-1}x[n]y[n])-(\sum_{n=0}^{N-1}y[n])(\sum_{n=0}^{N-1}x[n])]$

$\hat{b}=\frac{1}{N(\sum_{n=0}^{N-1}(x[n])^2)-(\sum_{n=0}^{N-1}x[n])^2}[(\sum_{n=0}^{N-1}(x[n])^2)(\sum_{n=0}^{N-1}y[n])-(\sum_{n=0}^{N-1}x[n])(\sum_{n=0}^{N-1}x[n]y[n])]$

where the denominator of the fraction in the above two equations is the matrix determinant.

Coefficient of Determination

Once we have calculated $\hat{a}$ and $\hat{b}$, we can see how well the linear model accounts for the data. This is achieved by finding the Coefficient of Determination ($r^2$). This is the ratio of the residual sum of squares ($rss$) to the total sum of squares ($tss$),

$\Large r^2=\frac{rss}{tss}$

and has a maximum value of 1, which is attained when the model is exactly linear. The larger $r^2$ is, the better the linear model fits the data. We can think of $r^2$ as the proportion of variation in data explained by the linear model, or the "goodness of fit" of the linear model.

The total sum of squares is given by

$tss=\sum_{n=0}^{N-1}(y[n]-\bar{y})^2$

where

$\bar{y}=\frac{1}{N}\sum_{n=0}^{N-1}y[n]$ is the mean of $y$.

The residual sum of squares ($rss$) is the total sum of squares ($tss$) minus the sum of squared errors ($sse$)

where

$sse=\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2$

If the data perfectly fits the linear model, $sse$ will equal 0, so that $tss$ will equal $rss$ - the coefficient of determination $r^2$ will be $1$.

Summarising, we have

$\Large r^2=\frac{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2-\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2}{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2}$

A worked example

Now, let us consider the following example:-

$x=[1,2,3,4]$, $y=[1.5,6,7,8]$

The matrix determinant will be $(4\times (1 + 2^2+3^2+4^2))-(1+2+3+4)^2=20$.

The least squares estimate of $a$ is

$\hat{a}=[4\times((1\times 1.5) + (2\times 6) + (3\times 7) + (4\times 8)) \\ -( (1.5+6+7+8)\times(1+2+3+4))]/20 \\ =2.05$

The least squares estimate of $b$ is

$\hat{b}=[(1+2^2+3^2+4^2)\times(1.5+6+7+8)-((1+2+3+4)\times((1\times 1.5) \\+ (2\times 6) + (3\times 7) +(4\times 8)))]/20 \\ = (675 - 665)/20 \\ =0.5$

The $tss$ is

$tss=(1.5-5.625)^2+(6-5.625)^2+(7-5.625)^2+(8-5.625)^2=24.688$

where the average of $y$ is $5.625$.

The $sse$ is

$(1.5 - (2.05\times 1) - 0.5)^2+(6 - (2.05\times 2) - 0.5)^2\\+(7 - (2.05\times 3) - 0.5)^2+(8 - (2.05\times 4) - 0.5)^2=3.675$

So that the $rss$ is $24.688-3.675=21.013$.

The coefficient of determination is

$r^2=21.013/24.688=0.8511$

This result is quite high, and is fairly indicative of a linear model.

There is a Linear Regression Calculator in this blog, which can be found here

Tuesday, 3 September 2013

Wilcoxon Signed Rank Test

The Wilcoxon Signed Rank test is a non-parametric test for paired/matched samples, and can be regarded as the non-parametric counterpart to the Paired Student's  t-test.  A worked example follows.

Suppose we have two groups of matched samples:

Group 1: 1,2,4,5,6,7,8,3

Group 2: 2,5,1,3,9,10,3,3

We first find the difference between these matched samples (Group 1 sample - Group 2 sample)

Group 1     Group 2      Difference
1                  2                     -1    
2                  5                     -3
4                  1                    +3
5                  3                    +2
6                  9                     -3
7                10                     -3
8                  3                      5
3                  3                      0

We discard the case where there is 0 difference (the last pair), so that the sample size $N_r$ is $7$. Then, we sort the difference by absolute value and rank them, and highlight where there are ties (i.e. the same absolute values of the difference).

Rank      Difference
  1               -1
  2              +2
  3               -3
  4               -3
  5               -3
  6              +3
  7              +5

The ranks for the ties are averaged out, resulting in

Rank      Difference
  1               -1
  2              +2
  4.5            -3
  4.5            -3
  4.5            -3
  4.5           +3
  7              +5

Next we sum the ranks for the negative difference (which is $14.5$), and the sum of the positive difference can be found by subtracting 14.5 from the total rank ($7\times 8 \times 0.5=28$), which is $13.5$.

The value of $W$ (the Wilcoxon Signed-Rank statistic) is the lower of the (positive/negative) sum of the ranks, hence $13.5$.

We need to find the critical value of $W$, $Wcrit$, as there are only 7 samples, so that a z-score will be highly inaccurate. We do this by referring to a table for a two-tailed test at 5% level, and obtain $Wcrit=2$. As $W$ is greater than $Wcrit$, the result is not significant.

Even though the z-score will be inaccurate (on SciStatCalc, if the number of samples exceeds 25, the z-score is considered accurate enough, and the $Wcrit$ value is no longer used, to limit the size of the table storing $Wcrit$), it is nevertheless worth evaluating this as an exercise. The z-score is given by equation
$\Large \frac{W-\mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

The $\mu$ value is $\frac{Nr\times (N_r+1)}{4}$, and so will be $7\times 8\times 0.25 = 14$ for our example.

The $\sigma$ is given by $\sqrt{\frac{N_r(N_r+1)(2N_r+1)}{24}}$, and will be $\sqrt{7\times 8 \times 15 \times (1/24)}=5.9169$.

The z-score is thus $(13.5 - 14)/5.9169=-0.084515$, resulting in a p-value of 0.93, which is considerably larger than  the critical 0.05 value. The result is not significant by quite some margin.

Below is a screenshot of SciStatCalc with the results of the Wilcoxon signed rank test.

Wilcoxon Signed Rank test ios iphone app SciStatCalc


  

Monday, 2 September 2013

Mann-Whitney U test

The Mann-Whitney U (MWU) test is a non-parametric statistical test that is widely used in many fields (e.g in medical statistics and psychology), and tests whether two groups (a.k.a datasets or populations) are the same (the null hypothesis) against the alternative hypothesis that they are not so.

The following assumptions need to be made:

  • The independent variable is the two groups. 
  • The dependent variable are the observations which are either continuous or ordinal (i.e. they can be sorted by some criterion, such as a level).
  • All observations from both groups are independent of one another. For example, if patients' blood pressure is being measured, each observation should come from a distinct patient. 
  • The data need not be normally distributed.

The MWU test is more robust than the Unpaired Student's t-test for data that is not normally distributed. We can think of the Unpaired Student's t-test as the parametric counterpart to the MWU test.

The basic idea is that that the data for each group is ranked (from 1,2,..) in ascending order of magnitude, regardless of which group the data belongs to. If the populations/groups are very different, say Group 2 has samples much larger than that of population 1, then the sum of the ranks for Group 2 will be much larger than that for Group 1. A U statistic is calculated, which can be intuitively thought of as the difference between the total rank and the larger of the group rank sum. The smaller this difference, the greater the disparity in rank sums between the two groups, and the less likely it is that the population rank differences are due to chance.

Consider Group 1 with $N1$ samples, and Group 2 with $N2$ samples.

Let $T1$ be the sum of ranks for Group 1, and $T2$ be the sum of ranks for Group 2.

We denote $TX$ as the larger of $T1$ and $T2$, and $NX$ as the number of samples of the group with the larger sum of ranks.

The U-statistic is given by:-

$\large U=(N1 \times N2) + \frac{NX(NX+1)}{2} - TX$

The value of $U$ is compared against a critical value $Ucrit$, and if $U$ is less than the critical value, the result is considered significant (note that many tests have a statistic that has to exceed a critical value for the result to be considered significant - the MWU test is almost the odd one out!).

The value of $Ucrit$ is found out using a U-test table, and is determined by the values of $N1$ and $N2$ and the confidence level (say 5%), and whether we want a two-tailed test or a one-tailed one.

If the values of $N1$ and $N2$ are high, we can use the z-test, as the U statistic can be reasonably well approximated by a Normal/Gaussian distribution  with mean $\frac{N1N2}{2}$ and variance $\frac{N1N2(N1+N2+1)}{12}$.

If there are tied ranks, the variance needs to be slightly modified to

$\Large \frac{N1N2[(N1+N2+1)-\frac{\sum_{i=1}^{L}(t_i^3-t_i)}{(N1+N2)(N1+N2-1)}]}{12}$,

where $L$ is the number of groups of ties, and $t_i$ the number of ties for the $ith$ group.


Worked Example

Suppose we have two groups, with the following samples :

Group 1: 1,4,6,7,8,3,2,1

Group 2: 3,3,3,8,10,16,18,70,30

So $N1=8$, $N2=9$.

We first rank all the data as follows (ignore ties for the moment- they are highlighted in blue):-

Rank        Data            Group
 1               1                  1
 2               1                  1
 3               2                  1
 4               3                  2
 5               3                  2
 6               3                  2
 7               3                  1
 8               4                  1
 9               6                  1
10              7                  1
11              8                  1
12              8                  2
13            10                  2
14            16                  2
15            18                  2
16            30                  2
17            70                  2

We deal with the ties by calculating the average of the ranks for the ties:-

Rank        Data            Group

 1.5            1                  1
 1.5            1                  1
 3               2                  1
 5.5            3                  2
 5.5            3                  2
 5.5            3                  2
 5.5            3                  1
 8               4                  1
 9               6                  1
10              7                  1
11.5           8                  1
11.5           8                  2
13            10                  2
14            16                  2
15            18                  2
16            30                  2
17            70                  2

Lets colour code by Group membership:-

Rank        Data            Group

 1.5            1                  1
 1.5            1                  1
 3               2                  1
 5.5            3                  2
 5.5            3                  2
 5.5            3                  2
 5.5            3                  1
 8               4                  1
 9               6                  1
10              7                  1
11.5           8                  1
11.5           8                  2
13            10                  2
14            16                  2
15            18                  2
16            30                  2
17            70                  2

The sum of ranks for Group 1 is:-
1.5 + 1.5 + 3 + 5.5 + 8 + 9 + 10 + 11.5 = 50

We can easily calculate the sum of ranks for Group 2. As we have 17 samples ($N1+N2$), the total of the ranks for the entire data is $\frac{(N1 + N2)(N1+N2+1)}{2}$, which is $17\times 18 \times 0.5$, i.e. 153.

So the Group 2 sum of ranks is $153-50=103$, which is greater than that of Group 1.

Thus $NX=9$, and $TX=103$, i.e. the number of samples and sum of ranks for the larger sum of ranks  Group (i.e. Group 2).

We are now in a position to calculate U as follows:-

$U=(9 \times 8) + \frac{9(9+1)}{2} - 103 = 14$

Suppose we are interested in an alpha of 5%, for a two-tail test - going through a table (say one at http://www.lesn.appstate.edu/olson/stat_directory/Statistical%20procedures/Mann_Whitney%20U%20Test/Mann-Whitney%20Table.pdf), for  $N1=8$ and $N2=9$, we find that $Ucrit=15$.

As $U<Ucrit$ we reject the null hypothesis that the groups are the same - the result is significant, but only just!

Next we come to the z-score calculation - which will be inaccurate as we only have 8 and 9 samples for Group 1 and 2 respectively.

We need to transform $U$ to a z-score, which will be a standard normal distribution with zero mean and unity variance.

For SciStatCalc, the following technique is used (which does not account for the ties for a more conservative result - also the results from GNU Octave were used to baseline against). We take the sum of ranks for Group 1 ($50$), and subtract $N1\times (N1+N2+1)\times(0.5) =8(18)(0.5)=72$, resulting in $-22$. We need to normalise this by the standard deviation, which is $\sqrt{\frac{N1\times N2\times (N1+N2+1)}{12}}=\sqrt{\frac{9\times 8\times (18)}{{12}}}=10.3923$.

So, the z-score is $-22/10.3923$, which is $-2.11695$. The p-value for this z-score is $0.034264$, which is less than our critical value of $0.05$, vindicating that our result is indeed marginally significant.

If the correction to the variance based on ties were to be applied, the standard deviation would have been $\sqrt{\frac{9\times 8\times (18 - (72/(16 \times 17)))}{{12}}}=10.3156$, because there are three groups of ties with 2, 4 and 2 ties, leading to $(4^3) - 4 + (2^3) -2 + (2^3) - 2 = 72$, so that the correction term is $72/((9 + 8)\times(9+8-1))$. The z-score would then be $-22/10.31526=-2.1327$, corresponding to a p-value of $0.03295$, which is in agreement with the results obtained from using R. Nevertheless, the p-value is still less than our 5% level, so the result is still significant.

Below is a screenshot of the Mann-Whitney U-test on SciStatCalc:-



There is a Mann Whitney U-test Calculator in this blog, which can be found here.

Calculating the p-value for a z-score

To calculate the p-value for a z-score, you need to take the absolute value of the z-score, and calculate the area under a standard Gaussian/Normal distribution (mean 0, variance 1) from $abs(zscore)$ to infinity - i.e. calculate the area under the tail. Given the symmetric nature of the standard Normal distribution, you would end up with the same result from -infinity to $-abs(zscore)$. For a two-tailed result you need to multiply this result by 2.

A final note - the MWU test is also known as the Wilcoxon rank sum test.