Saturday, 2 November 2013

Home

This blog is dedicated predominantly to various aspects of probability and statistics, with some aspects of digital signal processing and machine learning. This blog reflects my interests and specialties outside of my work as an FPGA Engineer.

The blog arose from my need to create a support website for an iOS App I developed back in 2013, hence the name "SciStatCalc", which is the name of my App. I opted for the path of least resistance, judging that a blogging site would be the easiest route to setting up my support website. Whilst my activity on the App has waned (there is a drop-down menu "SciStatCalc" that gives a timeline for the evolution of the App), this blog has taken on a life of its own and will hopefully remain a continual work in progress.

The blog includes

  1. Statistical hypothesis testing, with online implementations of many commonly used tests
  2. The calculation of the Cumulative Distribution Function (CDF) and Quantiles (the inverse CDF) of various probability distributions.
  3. Data visualisation tools, including graph, histogram and CDF plotting, with zoom facility and multiple data series plot capability.
  4. Fast Fourier Transform (FFT) calculator.
  5. k-means Clustering calculator.

You can navigate through this blog by a variety of means, including using the above drop-down menus.

The FFT is used as the basis of my online Power Spectral Density Estimator, which can be found on another Engineering-centric blog I have created, dspfpgatools.blogspot.co.uk.

Statistics Theory

There are a few blog posts on theory, including What is hypothesis testing?, as well as tables of Density Equations and CDFs for a variety of distributions.

Statistical Test Procedures

So far the blog includes descriptions and step-by-step guides to the following statistical tests:-

  1. Shapiro-Wilk test for Normality
  2. Bartlett's test for equality/homogeneity of variances for three or more groups.
  3. Mann-Whitney U test: a non-parametric test for two independent groups of data.
  4. Wilcoxon Signed Rank test: a non-parametric test for two matched groups of data.
  5. Unpaired Student's t test: a parametric test for two independent groups of data.
  6. Paired Student's t test: a parametric test for two matched groups of data.
  7. Linear Regression test: includes a matrix-based derivation of the Ordinary Least Squares algorithm - as applied to a single dataset.
  8. Pearson (Product moment) Correlation: a parametric correlation test for two matched datasets/groups
  9. Spearman Rank Correlation: a non-parametric rank based correlation test for two matched datasets/groups

Online Calculators

There are JavaScript-based calculators, which can be grouped into six categories:-

  1. CDF and Quantile Calculators for a variety of Probability Density Functions (PDF) and Probability Mass functions (PMF).
  2. Statistical test calculators
  3. Critical value calculators
  4. Medical Diagnostic calculator
  5. Digital Signal Processing (DSP) calculator
  6. Machine learning calculators

CDF and Quantile Calculators

For the PDFs and PMFs, you need to fill in all the relevant parameter fields. For the PDFs you must fill in any two of the following three fields: Lower Limit, Upper Limit and Probability - pressing the calculate button will result in the single missing field being filled in. As for the PMFs, you must fill in one of the following two fields: Upper Limit and Probability - the missing field will be updated.
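The "fill in two fields, get the third" behaviour for a PDF can be sketched as follows. This is only an illustration in Python for the Gaussian case, not the JavaScript used by the calculators; the function name `solve_missing_field` and its arguments are hypothetical, and the probability is assumed to mean P(Lower Limit ≤ X ≤ Upper Limit) = CDF(Upper Limit) - CDF(Lower Limit).

```python
from statistics import NormalDist


def solve_missing_field(dist, lower=None, upper=None, probability=None):
    """Return (lower, upper, probability) given exactly two of the three.

    Hypothetical helper: probability is taken to be
    P(lower <= X <= upper) = CDF(upper) - CDF(lower).
    """
    filled = sum(v is not None for v in (lower, upper, probability))
    if filled != 2:
        raise ValueError("exactly two of the three fields must be filled in")

    if probability is None:
        # Both limits known: the probability is a difference of two CDF values.
        return lower, upper, dist.cdf(upper) - dist.cdf(lower)

    # One limit is missing: work out the CDF value it must have, then invert.
    if upper is None:
        target = dist.cdf(lower) + probability
    else:
        target = dist.cdf(upper) - probability
    if not 0.0 < target < 1.0:
        raise ValueError("no finite limit gives the requested probability")

    limit = dist.inv_cdf(target)  # quantile (inverse CDF)
    if upper is None:
        return lower, limit, probability
    return limit, upper, probability
```

For example, `solve_missing_field(NormalDist(), lower=float("-inf"), probability=0.975)` returns an upper limit close to 1.96, which is also the critical z-score for a two-tailed test at the 5% significance level.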

Online CDF and Quantile Calculators for the following PDFs have been implemented:-

  1. Gaussian Distribution: includes error and inverse error function calculator near the top of the blog post.
  2. Log-normal Distribution
  3. Gamma Distribution: includes evaluation of the Gamma function ($\Gamma(x)$) near the bottom of the blog post.
  4. Student's t-Distribution
  5. Beta Distribution
  6. F Distribution (also known as Snedecor's F)
  7. Chi-Squared Distribution
  8. Exponential Distribution
  9. Logistic Distribution
  10. Laplace Distribution
  11. Cauchy Distribution (also known as the Cauchy-Lorentz Distribution)
  12. Rayleigh Distribution
  13. Weibull Distribution

For the Gaussian/Normal, Student-t, F and Chi-squared distributions, (1 - probability) and 2$\times$(1 - probability) are calculated as well - this is useful for calculating the one- and two-tailed probabilities associated with various Statistical Tests. The Gaussian Distribution is used for calculating the p-value from the z-score, whilst the Student-t distribution is used for the (parametric) Student's t-test. The F distribution is used for many tests, ANOVA being one of the most widely known.
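As a small illustration of the two extra quantities for the Gaussian case, the sketch below turns a z-score into one- and two-tailed p-values. It is written in Python rather than the calculators' JavaScript, the name `tail_probabilities` is hypothetical, and taking the absolute value of the z-score (so that the upper tail is always used) is an assumption.

```python
from statistics import NormalDist


def tail_probabilities(z):
    """Return (one_tail, two_tail) p-values for a z-score.

    Assumption: the tail is measured from |z| upwards on the standard
    Gaussian, i.e. one_tail = 1 - CDF(|z|) and two_tail = 2 * (1 - CDF(|z|)).
    """
    probability = NormalDist().cdf(abs(z))
    one_tail = 1.0 - probability
    two_tail = 2.0 * (1.0 - probability)
    return one_tail, two_tail
```

A z-score of 1.96 gives a one-tailed probability of roughly 0.025 and a two-tailed probability of roughly 0.05.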

Online CDF and Quantile Calculators for the following PMFs have been implemented:-

  1. Hypergeometric Distribution
  2. Binomial Distribution
  3. Pascal Distribution (also known as the Negative Binomial Distribution)
  4. Poisson Distribution

All the CDF and Quantile Calculators have plots of the PDF/PMF encompassing the limits specified by the user. The upshot of this is that you can investigate the effect of varying various parameters of a particular distribution on the shape of that distribution, whilst keeping the limits the same.

Statistical Tests Calculators

Calculators for the following Statistical Tests have been implemented:-

  1. Shapiro-Wilk Test
  2. Levene's Test
  3. Bartlett's Test
  4. Two-Sample Kolmogorov-Smirnov Test
  5. Chi-Squared Test for Independence
  6. Linear Regression
  7. Logistic Regression
  8. Pearson Correlation
  9. Spearman Rank Correlation
  10. Mann-Whitney U Test
  11. Wilcoxon Signed Rank Test
  12. Unpaired Student's t Test: includes option of implementing Welch's test for unequal variances.
  13. Paired Student's t Test
  14. Fisher's Exact Test (2 $\times$ 2 contingency table)
  15. Barnard's Test (2 $\times$ 2 contingency table)
  16. McNemar's Test (2 $\times$ 2 contingency table)
  17. Cochran's Q Test
  18. Kruskal-Wallis Test: applicable to three or more groups
  19. One-way ANOVA Test: applicable to three or more groups - also includes post-hoc analysis for a significant result
  20. Two-way ANOVA Test with replication: applicable to three or more groups, examining the effect of two independent variables and the interaction between them
  21. Two-way ANOVA Test without replication: applicable to three or more groups, examining the effect of two independent variables

The Statistical Calculators have been designed for ease of use, with the aim of yielding useful results with minimal effort on the part of the user. All the calculators take in raw data as inputs, which can be entered directly in the relevant textareas as comma separated numbers.

Alternatively you can load in a CSV file by pressing the "Choose File" button - the calculators can parse out a comment line (starting with character #, for example), if this occurs as the first line in the CSV file. In addition, there are either histograms or scatter plots for many of the tests. The purpose of these forms of data visualisation is two-fold: (i) to yield useful information not present in, say, the p-value of a test, (ii) to act as a sanity check on the p-value calculated.
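The handling of a leading comment line can be sketched roughly as below. This is a Python illustration under stated assumptions, not the calculators' actual parser: the function name `parse_csv_numbers` is hypothetical, only a first line starting with the comment character is skipped, and all remaining values are pooled into one list regardless of how they are split across lines.

```python
def parse_csv_numbers(text, comment_char="#"):
    """Parse comma separated numbers, skipping a leading comment line.

    Assumptions: only the first line may be a comment, and every
    remaining non-empty field is a number belonging to one dataset.
    """
    lines = text.splitlines()
    if lines and lines[0].lstrip().startswith(comment_char):
        lines = lines[1:]  # drop the comment line

    values = []
    for line in lines:
        for field in line.split(","):
            field = field.strip()
            if field:  # ignore empty fields, e.g. from a trailing comma
                values.append(float(field))
    return values
```

For instance, the text `"# heights in cm\n1.5, 2, 3.25\n4\n"` parses to the four numbers 1.5, 2.0, 3.25 and 4.0.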

For tests that require three or more datasets (such as the ANOVA tests and the Kruskal-Wallis test), dynamic textboxes are used: clicking on a link adds an extra text entry field. This gives a lot of flexibility in terms of the number of datasets the user wishes to process.

Critical Value Calculators

So far three critical value calculators have been implemented, whereby the relevant values are calculated based on the user specified significance level.

  1. Critical z-score Calculator
  2. Critical t value Calculator
  3. Critical F value Calculator

Medical Diagnostic Calculator

So far a single calculator has been implemented.

  1. Evidence Based Medicine (EBM) Systematic Review Calculator

DSP Calculator

So far a single calculator has been implemented.

  1. FFT Calculator

Machine learning Calculators

So far three calculators have been implemented: two for k-means clustering and one for cosine similarity.

  1. k-means Clustering Calculator
  2. k-means Clustering for arbitrary size vectors
  3. Cosine similarity calculator.

Data Visualisation

In addition, there are Data Visualisation tools (all with Zoom capabilities, and capable of representing multiple datasets) implementing the following functions:-

  1. Online Graph plotter: You can select between line and scatter plots for representing multiple datasets. In addition you can select subsets of the datasets to plot by entering dataset/group indices in two fields of a table labelled X-axis group index and Y-axis group index. Using the X-axis field and Y-axis field, it is possible to plot 2-D data as a scatter plot - this could be useful for a basic visual cluster analysis.
  2. Online Histogram plotter
  3. Online Empirical Cumulative Distribution Function (CDF) plotter
  4. Online Quantile-Quantile (Q-Q) plotter for the Gaussian Distribution

For each dataset/group, the Histogram and CDF plotters display a summary statistics table containing the following information:-

  1. Number of samples
  2. Minimum
  3. Maximum
  4. Mean
  5. Geometric Mean (if all samples are greater than zero)
  6. Harmonic Mean (if all samples are greater than zero)
  7. Variance
  8. Standard Deviation
  9. Median
  10. Skewness
  11. Excess Kurtosis
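The entries of the summary statistics table can be computed along the following lines. This is a Python sketch, not the plotters' own code: the function name `summary_statistics` is hypothetical, and the choice of estimators is an assumption (variance and standard deviation with the n - 1 divisor, skewness and excess kurtosis from central moments with the n divisor). The table produced by the plotters may use different conventions.

```python
import math
import statistics


def summary_statistics(samples):
    """Summary statistics for one dataset (needs at least two samples)."""
    n = len(samples)
    mean = statistics.fmean(samples)

    # Central moments with divisor n (assumed convention for shape measures).
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n

    summary = {
        "n": n,
        "minimum": min(samples),
        "maximum": max(samples),
        "mean": mean,
        "variance": statistics.variance(samples),        # divisor n - 1
        "standard_deviation": statistics.stdev(samples),  # divisor n - 1
        "median": statistics.median(samples),
        # Shape measures are undefined when every sample is identical.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else math.nan,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else math.nan,
    }

    # Geometric and harmonic means only when all samples are greater than zero.
    if all(x > 0 for x in samples):
        summary["geometric_mean"] = statistics.geometric_mean(samples)
        summary["harmonic_mean"] = statistics.harmonic_mean(samples)
    return summary
```

For the dataset 1, 2, 3, 4, 5 this gives a mean and median of 3, a variance of 2.5, zero skewness and an excess kurtosis of -1.3; for a dataset containing zero or negative values the geometric and harmonic means are simply omitted.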

If you are a researcher in the medical field, a data scientist or statistician, or work in the social sciences, such as Psychology, I hope that you find some of the entries in this blog useful and interesting. If you are simply curious about probability and statistics, you are more than welcome.

All the online calculators are free to use, and the JavaScript source code is accessible for the curious. Some testing has been performed on most of the CDF/Quantile Calculators, benchmarking against results generated by GNU Octave (I have endeavoured to achieve double precision accuracy). As regards the Statistical Test Calculators, I have checked my results against other online Calculators available on the web where possible, and against R. However, I assume no responsibility for the accuracy of the results - use the calculators at your own risk.

Logistic Regression Calculator and ROC Curve Plotter

This blog post implements a Logistic Regression calculator for a binary output. Consider a binary outcome response variable \(Y\...