Before starting to work with a data set, you have to figure out what kind of data it is. Goodness of fit tests report if your data is normal or matches with expected data. Then you are able to pick the best statistical test to run on your data. Different data types have different goodness of fit tests and each test is used to draw different conclusions.
[1] Understand what a goodness of fit test is and how to apply it.
[2] Be able identify which test is most appropriate for your data set.
# G test
library(DescTools)
library(curl)
## Using libcurl 8.1.2 with LibreSSL/2.8.3
# Anderson Darling test
library(nortest)
Like the Chi-squared test in Module 14, the G test allows you to determine if there is a significant difference in proportions of a categorical variable to the theoretical expectation. Chi-squared test give approximately the same results as G test but you can perform more elaborate statistical analyses with G test results. Downside is that less people are familiar with the G test compared to the Chi-squared test. As they give similar results, it’s best to pick one method that is most applicable to your data set rather than using both methods. Pick one and stick with it.
Both G test and Chi-squared Test data needs to fulfill certain assumptions:
It’s more beneficial to use the G Test over the Chi-Squared Test if:
Null Hypothesis: The observed data has the same
proportions as the expected theoretical population.
Alternate Hypothesis: The observed data doesn’t have
the same proportions as the expected theoretical population.
[1] Understand when it is appropriate to use the G test.
[2] Be able to run the G test by hand and through an R package.
O = Observed
E = Expected
I was out in the field and collected data about 1200 Irises. I noted the color, size, species, and location of each flower. From a glance I see four flower colors that are roughly equally but I want to test statistically there actually is proportionally equal amounts of each flower color.
library(curl)
f <- curl("https://raw.githubusercontent.com/YYangEmily/Chisquare-repo/main/Iris.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## X Flower.color Flower.size Flower.species Flower.location
## 1 1 purple Small Japanese West
## 2 2 purple Large Japanese South
## 3 3 white Large Siberian West
## 4 4 blue Large Siberian East
## 5 5 purple Large Japanese South
## 6 6 blue Large Bearded West
table(d$Flower.color) # Count how many of each color
##
## blue orange purple white
## 121 59 628 392
# Find the G value
G <- 2 * ((121 * log(121/300)) + (59 * log(59/300)) + (628 * log(628/300)) + (392 * log(392/300))) #log function in R is equivalent to ln
G # Use G value to find p-value
## [1] 725.9525
pchisq(725.9525, df=3, lower.tail = FALSE) # Finding the p-value
## [1] 4.947761e-157
We got an G value of 725.95 and the degree of freedom is options-1 = 3. Using those we calculated a p-value of 4.9e-157 which is below the critical p-value = 0.05. We can reject the null hypothesis and conclude that the flower color proportions of the Irises observed are not equally distributed.
library(DescTools)
#perform the G-test
GTest(x = c(121, 59, 628, 392), #observed values
p = c(1/4, 1/4, 1/4, 1/4)) #expected proportions
##
## Log likelihood ratio (G-test) goodness of fit test
##
## data: c(121, 59, 628, 392)
## G = 725.95, X-squared df = 3, p-value < 2.2e-16
Once again, the p-value is below 0.05. We can reject the null hypothesis and conclude that the flower color proportions of the Irises I observed are not equally distributed.
I want to see if the data fits my expectation that there are equal proportions of large and small flowers.
table(d$Flower.size) # Count how many of each size
##
## Large Small
## 577 623
G <- 2 * ((577 * log(577/600)) + (623 * log(623/600)))
G
## [1] 1.763765
pchisq(1.763765, df=1, lower.tail = FALSE) # Finding the p-value
## [1] 0.1841556
GTest(x = c(577, 623), # observed values
p = c(1/2, 1/2), # expected proportions
correct = "none") # no correction
##
## Log likelihood ratio (G-test) goodness of fit test
##
## data: c(577, 623)
## G = 1.7638, X-squared df = 1, p-value = 0.1842
As you can see this time the p-value = 0.1842 which is above the critical p-value of 0.05 and we fail to reject the null hypothesis. There are equal proportions of each flower size.
Perform the G test with following columns in the iris data set:
Flower.location and Flower.species.
In addition to goodness of fit test you can also use the G test for Independence. For example I want to test if flower color and flower size are independent variables.
Null Hypothesis: The two test variables are
independent.
Alternate Hypothesis: The two
variables are dependent.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
d$Flower.color <- as.factor(d$Flower.color)
d$Flower.size <- as.factor(d$Flower.size)
d %>%
count(Flower.color, Flower.size)
## Flower.color Flower.size n
## 1 blue Large 57
## 2 blue Small 64
## 3 orange Large 37
## 4 orange Small 22
## 5 purple Large 301
## 6 purple Small 327
## 7 white Large 182
## 8 white Small 210
obs <- rbind(c(64, 22, 327, 210), # Small
c(57, 37, 301, 182)) # Large
GTest(obs, # Compare small vs large
correct = "none") # no correction
##
## Log likelihood ratio (G-test) test of independence without correction
##
## data: obs
## G = 5.5756, X-squared df = 3, p-value = 0.1342
We find a p-value of 0.1342 which is above the critical p-value of 0.05 and fail to reject the null hypothesis. We conclude that the two variables are independent.
We can use the Kolmorgorov-Smirnov Test as a way to reject null hypotheses. By using this statistical method, we are aiming to prove that the two data sets we are working with do not come from the same distribution.
The most common way to reject the null is with a student’s t-test, however, this can only be accurately used when we know our data set is normal. If we are using data that we aren’t sure about, or know isn’t of a normal distribution, it can be trickier to reject this hypothesis. This test is more effective with larger data sets (n ≥ 50) when compared to the Shapiro-Wilks test which is only effective for smaller data sets.
If we are running a one-sample K-S Test:
If we are running a two-sample K-S Test:
Why is it important to disprove the null? Failing to reject a null hypothesis means there is no sufficient evidence for the expected or the observed results we see in our studies.
[1] Understand what a Kolmogorov-Smirnov Test is and how to interpret the results.
[2] Learn what conditions are ideal for a K-S Test.
F0(x) = the total observed frequency distribution of a random sample
F0(x) = k⁄n where k = the number of observations and n is the total number of predicted observations
Fr(x) = the theoretical frequency distribution
D = the critical value between 0 and 1 indicating the magnitude of the observed difference in distributions. Values closer to 1 indicate high difference, while values closer to 0 indicate high likeness. Unlike a p-value which has a set threshold for whether something is significant or not–although this is also not necessarily the be-all-end-all of significance!–the D value produced from the K-S test is relative to each individual distribution. Different data sets will produce different D values!
If the calculated value is less than the critical value D we must accept the null hypothesis
If the calculated value is greater than the critical value we can reject the null hypothesis
Let’s do an example!
We are zoologists studying turtles and we want to conduct an experiment to see if different species of turtle prefer to eat cucumbers or lettuce when presented with both. We will be including 5 different species and counting how many turtles go for each food choice first in each population (n = 150).
# create matrix with 5 columns and 2 rows
data <- matrix(c(16, 12, 1, 22, 11, 14, 18, 29, 8, 19), nrow = 2, ncol = 5, byrow = TRUE)
# specify the column names and row names of matrix
colnames(data) <- c('A.marmorata ','C.fimbriata ','C.picta ','C.flavomarginata ', 'C.amboinensis')
rownames(data) <- c('Cucumber','Lettuce')
# assign to table
final = as.table(data)
# display
final
## A.marmorata C.fimbriata C.picta C.flavomarginata C.amboinensis
## Cucumber 16 12 1 22 11
## Lettuce 14 18 29 8 19
H0 = There is no difference among turtle species with respect to their choice of cucumber over lettuce.
H1 = There is a difference among turtle species with respect to their choice of cucumber over lettuce
Let’s say we predict 18 turtles from each species choose cucumber and make a new table with our prediction.
#appending our original table
final_new <- final
#assigning a new row name and filling in the data using the list function
final_new <- rbind(final_new, 'Predictions' = list(18, 18, 18, 18, 18))
#ready to print!
final_new
## A.marmorata C.fimbriata C.picta C.flavomarginata C.amboinensis
## Cucumber 16 12 1 22 11
## Lettuce 14 18 29 8 19
## Predictions 18 18 18 18 18
calculations <- matrix(c(16, 18, '16/150', '18/150', '2/150', 12, 18, '28/150', '36/150', '8/150', 1, 18, '29/150', '54/150', '25/150', 22, 18, '51/150', '72/150', '21/150', 11, 18, '62/150', '90/150', '28/150'), nrow = 5, ncol = 5, byrow = TRUE)
colnames(calculations) <- c('Observed ','Predictions ','F0(X)', 'Fr(X)', '|F0(X) -Fr(X)|')
rownames(calculations) <- c('A.marmorata ','C.fimbriata ','C.picta ','C.flavomarginata ', 'C.amboinensis')
table = as.table(calculations)
table
## Observed Predictions F0(X) Fr(X) |F0(X) -Fr(X)|
## A.marmorata 16 18 16/150 18/150 2/150
## C.fimbriata 12 18 28/150 36/150 8/150
## C.picta 1 18 29/150 54/150 25/150
## C.flavomarginata 22 18 51/150 72/150 21/150
## C.amboinensis 11 18 62/150 90/150 28/150
Now it’s time to analyze! To find our critical D value we need to look at our table and find the maximum difference between F0(x) and Fr(x). We can see that our maximum value is that of C. amboinensis at 28/150. Our critical D value that we will use to compare our values to is 0.1867. Now we can look at our table and choose a value of alpha to work with. Say we are 5% certain we are going to make a Type 1 error, i.e. detecting a difference under the assumption of the null, we can use our table to choose our critical value.
#denote the total number of turtles
n <- 150
#grab our formula
D <- 1.36/sqrt(n)
#calculate!
D
## [1] 0.1110435
Our critical value for our data set at an alpha level of 5% is 0.11, which is smaller than our critical value of 0.1867. Because our value is greater than the critical value, we can reject the null and conclude that there is difference among turtle species with their respect to choosing cucumber over lettuce for their snack of choice!
Assuming an alpha level of 0.01, what is the critical D value for our sample of turtles? Can we still reject the null?
#denote the total number of turtles
n <- 150
#grab our formula
D <- 1.63/sqrt(n)
#calculate!
D
## [1] 0.1330889
YES! 0.133 ≤ 0.1867, therefore we can still reject the null.
Now let’s learn the easy way! Our KS Test is in base R so we can just go ahead and use it as is.
First, let’s make two data sets, one random, and one normal.
#make this example reproducible
set.seed(0)
#generate two data sets. We are using rpois() to ensure a non-normal distribution, and rnorm() to ensure a normalized one.
data1 <- rpois(n=50, lambda=5)
data2 <- rnorm(100)
#using our pre-made function, we are going to fill both arguments with the data sets to ensure a comparison
ks.test(data1, data2)
##
## Exact two-sample Kolmogorov-Smirnov test
##
## data: data1 and data2
## D = 0.95, p-value = 1.481e-13
## alternative hypothesis: two-sided
To visually represent why D is interpreted the way it is, let’s take a look at this graph.
Here we can see that D is showing us the maximum distance between 2 graphs. If the two data sets come from the exact same distribution, our D value (i.e. distance between the graphs) would be small (theoretically 0). If we have data coming from two very distributions our maximum distance between our graphs is large (maximum of 1).
Objectives
Learn about the Anderson Darling Equation and understand how it can be used in R.
Go through the Anderson Darling Test step-by-step and how to interpret the results.
Use the ad.test() built in function with Poisson functions.
The Anderson Darling test is a type of the goodness of fit test used to determine if a sample of data comes follows a specific kind of distribution, more commonly normal distribution. It is a modification of the Kolmogorov-Smirnov test that focuses more on whether or not the tails follow normal distribution, rather than the peak.
It is important to note that the length of the sample data must be greater than 7, but is advised to be greater than 20. The smaller the sample of data, the more likely it is to get false results.
Hypotheses
The equation below gives you the test statistic for your specified distribution.
Once the test statistic is calculated, you can use it to determine whether we can reject the null hypothesis or fail to reject the null. This is done by comparing it to a critical value that is determined by the kind of distribution you are testing for. In this case, we will be testing for normality, so the critical value is 0.787.
The critical value is published in the following article: Stephens, M. A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons, Journal of the American Statistical Association, 69, pp. 730-737.
If the test statistic is less then 0.787, we can fail to reject the null hypothesis, meaning that the data follows normal distribution. If the critical value is greater than 0.787, the null hypothesis is rejected and the sample of data does not follow normal distribution.
However, the smaller the sample of data, the more likely you are to get false readings of normality. Therefore, it is best to use the p-value as your indicator in most scenarios or to confirm your results of the test statistic.
You will start by modifying the test statistic using the following equation: Z = AD(1.0 + 0.75/n + 2.25/n^2). From here, the p-value can be calculated based on an equation that is determined by the modified test statistic which we will discuss below. Once we calculate the p-value, we can determine whether we can reject the null hypothesis or fail to reject the null.
The test rejects the null hypothesis of normality when the p-value is less than or equal to the alpha value of 0.05 and accepts the null hypothesis (fails to reject the null) if it is greater than the alpha value of 0.05.
Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.
set.seed(1)
d <- chickwts
head(d)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
a <- d$weight #assign the weight variable to a vector
a
## [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
## [20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340
## [39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
## [58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332
The several equations below are derived from the formula depicted above and taken from ad.test(), the built in R function we will use later in this module:
x <- sort(a) #a is the data set that you want to test, sort() puts the data in ascending order, which is necessary to compute the anderson darling test statistic
x
## [1] 108 124 136 140 141 143 148 153 158 160 168 169 171 179 181 193 199 203 206
## [20] 213 216 217 222 226 227 229 230 242 243 244 248 248 250 257 257 258 260 260
## [39] 263 267 271 271 283 295 297 303 309 315 316 318 318 320 322 325 327 329 332
## [58] 334 339 340 341 344 352 359 368 379 380 390 392 404 423
n <- length(x) #determine the length of the sample of data
n #n is 71
## [1] 71
logp1 <- pnorm((x - mean(x))/sd(x), log.p = TRUE) #F(Xi) is the cumulative distribution and pnorm() allows us to calculate the cumulative probability for each data point using the mean and standard deviation, which are used to define normal distribution, log.p() gives each of these probabilities as a log()
logp1
## [1] -3.69751578 -3.23621401 -2.91425020 -2.81145133 -2.78610263 -2.73582529
## [7] -2.61257413 -2.49279405 -2.37646196 -2.33088919 -2.15404421 -2.13254785
## [13] -2.08995895 -1.92495515 -1.88503299 -1.65651337 -1.54924135 -1.48027517
## [19] -1.42987556 -1.31664473 -1.26996936 -1.25465553 -1.17990637 -1.12227085
## [25] -1.10815964 -1.08029194 -1.06653474 -0.91049018 -0.89822839 -0.88607892
## [31] -0.83859602 -0.83859602 -0.81551781 -0.73816842 -0.73816842 -0.72754756
## [37] -0.70662323 -0.70662323 -0.67602336 -0.63667261 -0.59895033 -0.59895033
## [43] -0.49524221 -0.40503449 -0.39124772 -0.35193782 -0.31560764 -0.28213146
## [49] -0.27682060 -0.26642363 -0.26642363 -0.25632230 -0.24651179 -0.23233062
## [55] -0.22322578 -0.21439463 -0.20165026 -0.19348172 -0.17417117 -0.17049452
## [61] -0.16687823 -0.15638537 -0.13090592 -0.11140167 -0.08980016 -0.06811938
## [67] -0.06638164 -0.05091853 -0.04821581 -0.03438676 -0.01936601
logp2 <- pnorm(-(x - mean(x))/sd(x), log.p = TRUE) #using pnorm() to find -F(Xn-i+1) for each data point
logp2
## [1] -0.02509734 -0.04010605 -0.05577140 -0.06200060 -0.06364405 -0.06703815
## [7] -0.07617449 -0.08629742 -0.09747900 -0.10226450 -0.12331408 -0.12616987
## [13] -0.13203789 -0.15768627 -0.16466718 -0.21171298 -0.23877643 -0.25822044
## [19] -0.27356710 -0.31202001 -0.32967173 -0.33571621 -0.36716912 -0.39384247
## [25] -0.40072540 -0.41475270 -0.42189818 -0.51471143 -0.52305102 -0.53148614
## [31] -0.56619222 -0.56619222 -0.58413168 -0.65006583 -0.65006583 -0.65989093
## [37] -0.67985032 -0.67985032 -0.71056934 -0.75300304 -0.79714864 -0.79714864
## [43] -0.94013087 -1.09947409 -1.12766823 -1.21511414 -1.30691239 -1.40313347
## [49] -1.41960507 -1.45292366 -1.45292366 -1.48674473 -1.52107064 -1.57351109
## [55] -1.60910906 -1.64521975 -1.70035188 -1.73775378 -1.83353865 -1.85308852
## [61] -1.87276993 -1.93260586 -2.09801545 -2.24979674 -2.45473259 -2.72035988
## [67] -2.74534199 -3.00287964 -3.05607935 -3.38722770 -3.95390325
h <- (2 * seq(1:n) - 1) * (logp1 + rev(logp2)) #summation part of the equation, rev() computes the function described in logp2 of each point of data in descending order, rather than ascending
A <- -n - mean(h) #computing the test statistic by combining the first half of the equation with the second half
A # The test statistic is computed
## [1] 0.4648545
The test statistic is less than the critical value of 0.787 meaning
we can fail to reject the null - meaning the sample of data follows
normal distribution. Because the sample of data is on the smaller side,
71, we can use the p-value to confirm our results.
# Z = AD(1.0 + 0.75/n + 2.25/n^2)
AA <- (1 + 0.75/n + 2.25/n^2) * A
AA #The modified test statistic is computed
## [1] 0.4699724
The following formulas are taken from Agostino and Stephen’s Goodness of Fit Techniques:
Use this if AA > 10
pval <- 3.7e-24
Use this if AA < 0.2
pval <- 1 - exp(-13.436 + 101.14
* AA - 223.73 * AA^2)
Use this if AA < 0.34
pval <- 1 - exp(-8.318 + 42.796
* AA - 59.938 * AA^2)
Use this if AA < 0.6
pval <- exp(0.9177 - 4.279 * AA -
1.38 * AA^2)
Use this if AA < 10
pval <- exp(1.2937 - 5.709 * AA +
0.0186 * AA^2)
pval <- exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
pval #The p-value is computed
## [1] 0.2470614
It can be concluded that chick weight data does follow normal distribution because the p-value is greater than 0.05.
Determine the test statistic (A), modified test statistic (AA), and p-value of a sample of the chickwts data set. Is this sample of the data set originating from normal distribution? How do you know this?
set.seed(1)
b <- sample(a, size=71, replace=TRUE, prob=NULL) #I chose 71 because that is the length of the initial data set
b
## [1] 216 392 179 171 320 141 258 303 244 153 108 423 423 171 295 199 158 283 226
## [20] 340 257 250 257 295 283 339 295 248 283 392 303 226 168 230 267 141 160 334
## [39] 169 271 318 283 283 181 339 318 243 193 181 271 250 318 199 334 244 316 148
## [58] 318 404 368 303 171 140 179 320 258 327 260 344 193 230
x <- sort(b)
n <- length(x)
logp1 <- pnorm((x - mean(x))/sd(x), log.p = TRUE)
logp2 <- pnorm(-(x - mean(x))/sd(x), log.p = TRUE)
h <- (2 * seq(1:n) - 1) * (logp1 + rev(logp2))
A <- -n - mean(h)
A
## [1] 0.5180958
The test statistic is less than the critical value of 0.787 meaning we can fail to reject the null - meaning the sample of data follows normal distribution. Let’s confirm this with our p-value!
AA <- (1 + 0.75/n + 2.25/n^2) * A
AA
## [1] 0.5237999
pval <- exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
pval
## [1] 0.1822702
The p-value is greater than 0.05, so the test fails to reject the null hypothesis meaning the sample of chick weight data follows normal distribution.
library(nortest) #make sure your package is loaded
set.seed(1) #Use the set.seed() function to allow the data to repeatable
ndistribution <- rnorm(100, 5, 1) #The rnorm() function generates a random sample that is normally distributed
ndistribution
## [1] 4.373546 5.183643 4.164371 6.595281 5.329508 4.179532 5.487429 5.738325
## [9] 5.575781 4.694612 6.511781 5.389843 4.378759 2.785300 6.124931 4.955066
## [17] 4.983810 5.943836 5.821221 5.593901 5.918977 5.782136 5.074565 3.010648
## [25] 5.619826 4.943871 4.844204 3.529248 4.521850 5.417942 6.358680 4.897212
## [33] 5.387672 4.946195 3.622940 4.585005 4.605710 4.940687 6.100025 5.763176
## [41] 4.835476 4.746638 5.696963 5.556663 4.311244 4.292505 5.364582 5.768533
## [49] 4.887654 5.881108 5.398106 4.387974 5.341120 3.870637 6.433024 6.980400
## [57] 4.632779 3.955865 5.569720 4.864945 7.401618 4.960760 5.689739 5.028002
## [65] 4.256727 5.188792 3.195041 6.465555 5.153253 7.172612 5.475510 4.290054
## [73] 5.610726 4.065902 3.746367 5.291446 4.556708 5.001105 5.074341 4.410479
## [81] 4.431331 4.864821 6.178087 3.476433 5.593946 5.332950 6.063100 4.695816
## [89] 5.370019 5.267099 4.457480 6.207868 6.160403 5.700214 6.586833 5.558486
## [97] 3.723408 4.426735 3.775387 4.526599
ad.test(ndistribution) #that was easy!
##
## Anderson-Darling normality test
##
## data: ndistribution
## A = 0.16021, p-value = 0.9471
Because the p-value is greater than 0.05 and the test statistic is less than the critical value of 0.787, it can be determined that the data follows normal distribution and that the null hypothesis has failed being rejected.
hist(ndistribution, main = "Histogram Showing Normal Distribution", prob = TRUE, col="blue")
lines(density(ndistribution), col = 4, lwd = 2)
Create a data set using a poisson function that is not normally distributed. Use the ad.test() to confirm that the data is not normally distributed. Plot the data using a histogram to visualize your results.
set.seed(1)
nodistribution <- rpois(100, 5) #rpois() can be used to create a sample that is not normally distributed
nodistribution
## [1] 4 4 5 8 3 8 9 6 6 2 3 3 6 4 7 5 6 11 4 7 9 3 6 3 4
## [26] 4 1 4 8 4 5 5 5 3 7 6 7 2 6 4 7 6 7 5 5 7 1 5 6 6
## [51] 5 7 4 3 2 2 4 5 6 4 8 4 5 4 6 3 5 7 2 8 4 7 4 4 5
## [76] 8 7 4 7 9 4 6 4 4 6 3 6 2 3 3 3 2 6 8 7 7 5 4 7 5
ad.test(nodistribution)
##
## Anderson-Darling normality test
##
## data: nodistribution
## A = 1.3077, p-value = 0.002035
Because the p-value is less than or equal to 0.05 and the test statistic is greater than the critical value of 0.787, it can be determined that the data is different than normal and the null hypothesis is rejected.
hist(nodistribution, main= "Histogram Showing No Normal Distribution", prob = TRUE, col="red")
lines(density(nodistribution), col = 4, lwd = 2)
ad.test(rnorm(100, mean = 5, sd = 3))
##
## Anderson-Darling normality test
##
## data: rnorm(100, mean = 5, sd = 3)
## A = 0.61673, p-value = 0.1057
The data set follows normal distribution because the p-value >0.05 and the test statistic is less than the critical value of 0.787.
ad.test(rpois(100,5))
##
## Anderson-Darling normality test
##
## data: rpois(100, 5)
## A = 2.7062, p-value = 7.252e-07
The data set does not follow normal distribution because the p-value <0.05 and the test statistic is greater than the critical value of 0.787.
Objectives:
[1] Learn the Shapiro-Wilk Function and how it is derived.
[2] Use the shapiro.test() function in R and interpret the results.
Like the other Goodness of Fit tests, the Shapiro-Wilk test is a test of normality that is used to determine whether a sample came from a normal distribution. The S-W test compares the actual SD of data to the computed SD from the slope of a QQ plot for the data and looks at its ratio. If the data are sampled from a normal distribution, the values should be really similar and the ratio will be close to 1.0.
To perform the S–W test for normality, assume that the sample is composed of n independent and identically distributed observations (x1,x2,…xn) from a normal distribution with unspecified mean and variance. If x[1],x[2],…x[n] represents the n observations arranged in ascending sequence, the test statistic is:
The denominator is essentially the sum of squares equation which calculates variances while the numerator is the estimate slope of the Q-Q plot. The a values are constants generated from the means, variances and co-variances of the order statistics of a sample size (n) from a normal distribution. Ideally, the slope of the Q-Q plot should be equal to the standard deviation if it is normally distributed so if you were to square this value, then it should equal to variance.
In order to calculate the W statistic, here is the following procedure:
If the test statistic is smaller than the critical threshold then the assumption of a normal distribution is rejected.
Let’s go through with a very small list of values to see how the W statistic can be calculated. We will work to a 95% degree of confidence (0.05).
example <- c(20, 20, 21, 26, 43, 43, 54, 54, 55) # list of numbers in chronological order
hist(example) # looking at the data
qqnorm(example)
qqline(example, col = "dodgerblue4", lwd = 2)
# using coefficients for n=9 to calculate the numerator
0.5888*(55-20) + 0.3244*(54-20) + 0.1976*(54-21) + 0.0947*(43-26)
## [1] 39.7683
# calculating the standard deviation
sd(example)
## [1] 15.52417
# calculating the W statistic
((39.7683)^2) / ((9-1)*15.52^2)
## [1] 0.8207306
Our calculated value of 0.8207 is less than the critical threshold for n=9 (0.829) which means that our data is not normally distributed.
As you can see, the Shapiro-Wilk Test can be really difficult to calculate, especially if you have more values, so there’s a simple command in R to conduct this test!
shapiro.test(example)
##
## Shapiro-Wilk normality test
##
## data: example
## W = 0.82098, p-value = 0.03535
Although the W statistic is pretty close to 1, it is under the critical threshold and the p-value is under 0.05 which reinforces the idea that this data is not normally distributed.
library(ggpubr)
## Loading required package: ggplot2
library(dplyr)
set.seed(123)
normdata <- rnorm(35, mean = 25, sd = 3) # making a normally distributed graph
hist(normdata)
qqnorm(normdata)
qqline(normdata, col = "dodgerblue4", lwd = 2)
As we can see, the data set we created follows the guidelines of being normally distributed data but let’s see how to test its normality with the Shapiro-Wilk test.
shapiro.test(normdata)
##
## Shapiro-Wilk normality test
##
## data: normdata
## W = 0.98495, p-value = 0.9027
Here, the W statistic is really close to 1 and the p-value is greater than 0.05 so the null hypothesis is not rejected.
ToothGrowth is a data set in R showing the length of teeth of guinea pigs at three Vitamin C dosage levels(0.5, 1 and 2mg) with two delivery methods(orange juice or ascorbic acid). The column ‘len’ shows the tooth length. Test the normality of the distribution of these guinea pigs’ teeth lengths.
data("ToothGrowth")
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
hist(ToothGrowth$len)
qqnorm(ToothGrowth$len)
qqline(ToothGrowth$len, col = "dodgerblue4", lwd = 2) #looks fairly normal but might have some difference
shapiro.test(ToothGrowth$len)
##
## Shapiro-Wilk normality test
##
## data: ToothGrowth$len
## W = 0.96743, p-value = 0.1091
Looks to be very normally distributed!
goodness_of_fit <- c(4.942893, 5.503608, 6.085769, 4.309046, 3.715401, 5.046726, 4.764293, 4.457112, 4.566690, 4.350528, 5.726751, 6.151912, 5.992160, 4.570487, 6.238304, 4.720654, 6.757903, 5.560746, 4.547216, 4.167957, 3.833429, 3.934409, 3.436218, 6.156537, 5.832047, 4.772671, 5.266137, 4.623297, 7.441365, 4.204661, 4.945123)
goodness_of_fit
## [1] 4.942893 5.503608 6.085769 4.309046 3.715401 5.046726 4.764293 4.457112
## [9] 4.566690 4.350528 5.726751 6.151912 5.992160 4.570487 6.238304 4.720654
## [17] 6.757903 5.560746 4.547216 4.167957 3.833429 3.934409 3.436218 6.156537
## [25] 5.832047 4.772671 5.266137 4.623297 7.441365 4.204661 4.945123
For the following challenge, you will be using the “goodness_of_fit” data set that was created above. Determine which type of Goodness of Fit Tests we have described above would be best to use to test for normality. Compare these tests along with one other test that may not work as well.
First, we need to determine how many data points there are. You can count them by hand or use the length() function:
length(goodness_of_fit)
## [1] 31
The two tests that remain that are able to be used with data sets below 50 are the Anderson Darling test and the Shapiro-Wilk Test. Let’s compare their results using the built in R functions!
ad.test(goodness_of_fit)
##
## Anderson-Darling normality test
##
## data: goodness_of_fit
## A = 0.43394, p-value = 0.283
shapiro.test(goodness_of_fit)
##
## Shapiro-Wilk normality test
##
## data: goodness_of_fit
## W = 0.9651, p-value = 0.3952
Both p-values are relatively close being only 0.1122 points away from one another. Although the values are close, the Shapiro Wilk test is more powerful than the Anderson Darling test and would be more accurate in this scenario. For example, the p-value being higher for the Shapiro Wilk test indicates that it has more significance.
The p-value for each test is greater than 0.05 meaning that the null hypothesis can be accepted meaning the data set comes from normal distribution. This makes sense because it was created using the rnorm() function!
Now, let’s compare it with the Kolmorgorov-Smirnov Test:
ks.test(goodness_of_fit, "rnorm")
##
## Exact one-sample Kolmogorov-Smirnov test
##
## data: goodness_of_fit
## D = 2.2584, p-value = 3.331e-16
## alternative hypothesis: two-sided
The p-value appears to be far below 0.05, which would mean that the null hypothesis is rejected and that this test deems the data set to be normally distributed. However, our D value is above 1, meaning the test did not run properly as we have too few samples. This test is very sensitive to sample size and can be easily thrown into disarray by incorrect sample sizing. This is an example as to why it is important to select the right goodness of fit test when analyzing data.
G test
https://www.statology.org/g-test/
https://www.statstest.com/g-test/
https://www.biostathandbook.com/gtestgof.html
https://www.biostathandbook.com/gtestind.html
https://www.r-bloggers.com/2022/05/calculate-the-p-value-from-chi-square-statistic-in-r/
https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/02%3A_Tests_for_Nominal_Variables/2.04%3A_GTest_of_Goodness-of-Fit
Kolmogorov-Smirnov
https://www.tutorialspoint.com/statistics/kolmogorov_smirnov_test.htm
https://towardsdatascience.com/understanding-kolmogorov-smirnov-ks-tests-for-data-drift-on-profiled-data-5c8317796f78
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
https://www.researchgate.net/post/Interpret_P_and_D_values_in_Two-Sample_Kolmogorov-Smirnov_Test
https://www.statisticshowto.com/kolmogorov-smirnov-test/
https://blogs.sas.com/content/iml/2019/05/15/kolmogorov-d-statistic.html
https://www.sciencedirect.com/topics/earth-and-planetary-sciences/kolmogorov-smirnov-test
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html
https://www.statology.org/kolmogorov-smirnov-test-r/
Anderson Darling
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm
https://www.statology.org/anderson-darling-test-r/
https://variation.com/wp-content/distribution_analyzer_help/hs140.htm
https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/chickwts
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=d973dfd21d5c3ee5b5860191ee0de9635111dda1
https://search.r-project.org/CRAN/refmans/nortest/html/ad.test.html
https://www.statisticshowto.com/anderson-darling-test/
https://welcometochickenlandia.com/blog/the-age-old-practice-of-cold-brooding-baby-chicks/
https://www.jstor.org/stable/pdf/2286009.pdf?refreqid=fastly-default%3A97956dda812cf4fdf7108dba8f38485c&ab_segments=&origin=&initiator=&acceptTC=1
Shapiro-Wilk
http://www.statistics4u.info/fundstat_eng/ee_shapiro_wilk_test.html#
http://blog.excelmasterseries.com/2015/05/how-to-create-completely-automated_4.html
https://www.graphpad.com/guides/prism/latest/statistics/stat_choosing_a_normality_test.htm
https://www.sciencedirect.com/topics/mathematics/wilk-test
http://collbiom.up.lublin.pl/pdf/cb41_211.pdf