The objective of this module is to continue our discussion of simple linear regression analysis and to understand how regression partitions the variance in the response variable among different sources: that explained by the regression model itself (and, as we’ll see later in our discussion of multivariate regression, by the different factors in that model) versus the left-over “error” or residual variance. We also go through how to calculate the standard errors for our various regression coefficients and for the predicted values of our response variable based on our regression model, which, as we have seen, are returned by the
lm()
function. Finally, we also briefly discuss ways to transform non-normally distributed data to make them more appropriate for analysis using linear regression.
In our linear models, we can separate or “partition” the total variation in our y variable (the sum of squares of y, or SSY) into that explained by our model (the regression sum of squares, or SSR) and that which is left over as “error” (the error sum of squares, or SSE): \(SSY\) = \(SSR\) + \(SSE\).
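Written out in terms of the observed values \(y_i\), the fitted values \(\hat{y}_i\) from the model, and the grand mean \(\bar{y}\), this decomposition is \(\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2\), i.e., each observation’s total deviation from the mean is split into the part accounted for by the regression plus the residual.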
Graphically…
Let’s make sure we have our zombie survivor data loaded…
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
Now, we’ll run a straightforward bivariate regression model (as in Module
12) and, using the raw data (which are duplicated in the
$model
data structure within our model object), we will
calculate the various sums of squares of our variables and identify the
numbers of degrees of freedom associated with each source of variation.
This allows us to generate the ANOVA table for our model, which is a
summary of how variance is partitioned among different sources.
m <- lm(data = d, height ~ weight)
SSY <- sum((m$model$height - mean(m$model$height))^2) # height - mean(height)
SSY
## [1] 18558.61
SSR <- sum((m$fitted.values - mean(m$model$height))^2) # predicted height - mean height
SSR
## [1] 12864.82
SSE <- sum((m$model$height - m$fitted.values)^2) # height - predicted height
SSE
## [1] 5693.785
From here, we can calculate the variance in each of these components, typically referred to as the mean square, by dividing each sum of squares by its corresponding degrees of freedom (recall that a variance can be thought of as an average “sum of squares”).
The number of degrees of freedom for the regression sum of squares (SSR) is equal to the number of predictor variables, which in this case is one (given our regression equation, we need to know only one piece of information, the value of our predictor variable, in order to calculate the predicted value of the response variable). The number of degrees of freedom for the error sum of squares (SSE) is equal to \(n-2\). This is because we need to estimate two parameters (\(\beta_0\) and \(\beta_1\)) from our data before we can calculate the error sum of squares. Finally, the number of degrees of freedom for the total sum of squares (SSY) is \(n-1\)… we need to estimate one parameter from our data (the mean value of y) before we can calculate SSY.
df_regression <- 1
df_error <- 998
df_y <- 999
MSR <- SSR/df_regression
MSE <- SSE/df_error
MSY <- SSY/df_y
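Note that rather than hard-coding these degrees of freedom, we could just as easily derive them from the data stored in our model object; a quick sketch (here, n is just a convenience name for the number of observations, which is 1000 for our zombie survivors):
n <- nrow(m$model)  # number of observations used to fit the model
df_regression <- 1  # one predictor variable
df_error <- n - 2  # two parameters (beta0 and beta1) estimated from the data
df_y <- n - 1  # one parameter (the mean of y) estimated from the data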
The last item we need to calculate is the F ratio, the ratio of the variance explained by the regression model to the remaining, unexplained variance: MSR/MSE.
fratio <- MSR/MSE
fratio
## [1] 2254.931
Together, the values we have calculated above form the main entries in the ANOVA Table for our regression.
| Source | Sum of Squares | Degrees of Freedom | Mean Squares (SS/df) | F Ratio |
| --- | --- | --- | --- | --- |
| Regression | SSR = 12864.82 | 1 | MSR = 12864.82 | 2254.93 |
| Error | SSE = 5693.79 | \(n-2\) = 998 | MSE = 5.7052 | |
| Total | SSY = 18558.61 | \(n-1\) = 999 | MSY = 18.57719 | |
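If you like, you can collect the values we calculated by hand into a small data frame that mirrors this table (a minimal sketch using the objects created above; anova_table is just an illustrative name):
anova_table <- data.frame(Source = c("Regression", "Error", "Total"), Sum.Sq = c(SSR, SSE, SSY),
    Df = c(df_regression, df_error, df_y), Mean.Sq = c(MSR, MSE, MSY), F.value = c(fratio, NA, NA))
anova_table  # hand-built version of the ANOVA table above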
We can test the overall significance of our regression model by
evaluating our F ratio test statistic against the F distribution, taking
into account the number of degrees of freedom in each. The F
distribution is a continuous probability distribution, defined for \(x \geq 0\) and governed by two parameters, df1 and df2. The critical value above which we would reject the idea that the variance in our two sources is comparable is given by qf(p, df1, df2), where p is \(1-\alpha\) and df1 and df2 are the degrees of freedom in the sources being compared (regression versus error).
curve(df(x, df1 = 1, df2 = 1), col = "green", lty = 3, lwd = 2, xlim = c(0, 10),
main = "Some Example F Distributions\n(vertical line shows critical value for df1=1,df2=998)",
ylab = "f(x)", xlab = "x")
curve(df(x, df1 = 2, df2 = 2), col = "blue", lty = 3, lwd = 2, add = TRUE)
curve(df(x, df1 = 4, df2 = 4), col = "red", lty = 3, lwd = 2, add = TRUE)
curve(df(x, df1 = 8, df2 = 6), col = "purple", lty = 3, lwd = 2, add = TRUE)
curve(df(x, df1 = 1, df2 = 998), col = "black", lwd = 3, add = TRUE)
legend("top", c("df1=1,df2=1", "df1=2,df2=2", "df1=4,df2=4", "df1=8,df2=6",
"df1=1,df2=998"), lty = 3, lwd = 2, col = c("green", "blue", "red", "purple",
"black"), bty = "n", cex = 0.75)
fcrit <- qf(p = 0.95, df1 = 1, df2 = 998)
fcrit
## [1] 3.850793
abline(v = fcrit)
abline(h = 0)
polygon(cbind(c(fcrit, seq(from = fcrit, to = 10, length.out = 1000), 10), c(0,
    df(seq(from = fcrit, to = 10, length.out = 1000), df1 = 1, df2 = 998), 0)),
    border = "black", col = "grey")
For our data, then, the value for the F ratio well exceeds this critical value.
Alternatively, we can use the following formulation to directly estimate a p value associated with our value for the F ratio:
1 - pf(q = fratio, df1 = 1, df2 = 998)
## [1] 0
… and we see that the p value associated with this high of an F ratio is infinitesimally small.
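Note that the 1 - pf(...) formulation can lose precision when the p value is extremely small; we can instead ask for the upper tail directly (and, if we want to see just how small the p value is, request it on the log scale):
pf(q = fratio, df1 = 1, df2 = 998, lower.tail = FALSE)  # upper-tail probability directly
pf(q = fratio, df1 = 1, df2 = 998, lower.tail = FALSE, log.p = TRUE)  # log(p) stays finite even when p underflows to zero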
As usual, R can handle all of the
calculations above easily with a built-in function. The
aov()
function, like the lm()
function,
returns a model object that we can use summary()
on to look
at the results we want. Alternatively, we can run the function
summary.aov()
using the model object resulting from
lm()
as an argument. In either case, the results returned
are the same as we calculated by hand above.
a <- aov(data = d, height ~ weight)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 12865 12865 2255 <2e-16 ***
## Residuals 998 5694 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 12865 12865 2255 <2e-16 ***
## Residuals 998 5694 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall that the results returned by summary()
of our
regression model also show the coefficient of determination, or the
“R-squared value”, which we defined above as the fraction of the total
variation explained by the model. We can calculate this value directly
from our ANOVA table as simply SSR/SSY. The correlation coefficient,
\(\rho\), between our response and
predictor variable is simply the square root of this value (taking the sign of the slope, which here is positive).
rsquared <- SSR/SSY
rsquared
## [1] 0.6931998
rho <- sqrt(rsquared)
rho
## [1] 0.8325862
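Because the slope of our regression is positive, this should match what cor() returns for the raw variables:
cor(d$weight, d$height)  # Pearson correlation; equal to sqrt(R-squared) when the slope is positive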
Recall that lm()
returned the standard errors associated
with each of the various components of our regression model, i.e., the
slope and intercept and each predicted value of y. We
can calculate standard errors directly to show how
R is deriving them.
The formula for the standard error of the regression slope, \(\beta_1\), is: \(SE_{\beta_1} = \sqrt{\frac{MSE}{SS_X}}\), where \(SS_X = \sum (x_i - \bar{x})^2\) is the sum of squares of our predictor variable.
Using our data…
SSX <- sum((m$model$weight - mean(m$model$weight))^2)
SEbeta1 <- sqrt(MSE/SSX)
SEbeta1
## [1] 0.004106858
The standard error of the intercept, \(\beta_0\), is calculated as: \(SE_{\beta_0} = \sqrt{\frac{MSE \sum x_i^2}{n \, SS_X}}\), where \(n\) is our sample size (here, 1000).
SEbeta0 <- sqrt((MSE * sum(m$model$weight^2))/(1000 * SSX))
SEbeta0
## [1] 0.5958147
Finally, the standard error of each predicted value of y is calculated as: \(SE_{\hat{y}_i} = \sqrt{MSE \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_X} \right)}\).
SEyhat <- sqrt(MSE * (1/1000 + (m$model$weight - mean(m$model$weight))^2/SSX))
head(SEyhat)  # just the first 6 values
## [1] 0.08978724 0.07620966 0.08414480 0.09533986 0.08904151 0.08341218
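These per-observation standard errors should also match what predict() reports when we ask for standard errors of the fitted values:
head(predict(m, se.fit = TRUE)$se.fit)  # standard errors of the fitted values from predict()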
Again, these same standard errors for \(\beta_0\) and \(\beta_1\) are exactly what are returned by
the lm()
function.
summary(m)
##
## Call:
## lm(formula = height ~ weight, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1519 -1.5206 -0.0535 1.5167 9.4439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.565446 0.595815 66.41 <2e-16 ***
## weight 0.195019 0.004107 47.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.389 on 998 degrees of freedom
## Multiple R-squared: 0.6932, Adjusted R-squared: 0.6929
## F-statistic: 2255 on 1 and 998 DF, p-value: < 2.2e-16
So far, we’ve derived a bunch of summary statistics describing our model and we’ve looked at ways of testing whether those summary statistics are significantly different from zero. That is, we have been testing hypotheses of the form \(H_0: \beta = 0\) versus \(H_A: \beta \neq 0\) for our regression coefficients, asking whether our model explains significantly more variation than expected by chance.
What we haven’t done yet, however, is to critically check our model fit in other ways… in particular, we haven’t checked whether two key assumptions of linear modeling are met: that our residuals (or errors) are normally distributed, and that the variance in our y values is constant across the range of x values.
We can investigate our residuals as one way of assessing model fit.
Calculate the residuals from the regression of zombie height on weight and plot these in relation to weight (the x variable). There are lots of ways to do this quickly.
m <- lm(data = d, height ~ weight)
plot(x = d$weight, y = m$residuals)
# or
e <- resid(m)
plot(x = d$weight, y = e)
Now, plot a histogram of your residuals… ideally they are normally distributed!
hist(e, xlim = c(-4 * sd(e), 4 * sd(e)), breaks = 20, main = "Histogram of Residuals")
An additional way to quickly examine your residuals is to use the
plot()
function with your model as an argument. This prints
out 4 plots that each tell you something.
plot(m$model$weight, m$residuals)
par(mfrow = c(2, 2))
plot(m)
The first plot, of residuals versus fitted values, should, like the plot of residuals versus x, show no structure: we hope to see residuals spread evenly around a horizontal line, with no distinct pattern. The second plot is a Q-Q plot of the standardized residuals against the quantiles expected under normality; if the residuals are normally distributed, the points should fall on roughly a straight line. The third (“scale-location”) plot graphs the square root of the absolute standardized residuals against the fitted values and shows whether the residuals are spread equally across the range of predictions; a horizontal line with equally spread points is good, while an increase or decrease in spread would indicate that the error variance changes with the predicted value. The fourth plot (residuals versus leverage) highlights whether there are particular observations that strongly influence the results; in particular, we look to see whether there are cases that fall in the upper or lower right portion of the plot.
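If you want to work with the standardized residuals used in the second and third of these plots directly, you can extract them with rstandard():
head(rstandard(m))  # standardized residuals for the fitted model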
We can also do a QQ plot of our residuals:
qqnorm(m$residuals)
Perhaps a more helpful QQ plot can be found in the {car} package. The
function qqPlot()
provides a trend line and confidence
intervals that allow us to see exactly which points make the sample fall
outside of normality (if any). Let’s take a look:
library(car)
## Loading required package: carData
qqPlot(m$residuals)
## [1] 954 799
Finally, there are a number of tests for normality that we can run within the R framework and using other packages. A Shapiro-Wilk normality test is perhaps the most widely used, where a low p value indicates deviation from normality (technically, the test’s W statistic measures how closely the ordered residuals follow the straight line we would expect under normality in a Q-Q plot).
s <- shapiro.test(m$residuals)
s
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.99713, p-value = 0.07041
As you can see, although there are some points at the higher quantiles that suggest non-normality, the Shapiro-Wilk test does not reject normality (p > 0.05), so our use of parametric statistics should be ok.
Some other popular tests for normality include:
- ad.test() - the Anderson-Darling test (in the {nortest} package)
- stat0032.MartinezIglewicz() - the Martinez-Iglewicz test
- lillie.test() - the Lilliefors (Kolmogorov-Smirnov) test (also in {nortest})
- dagoTest() - the D’Agostino test (in the {fBasics} package)
For a good discussion/demonstration of the relative power of each of these tests (meaning the probability that the test will correctly reject the null hypothesis of normality when it is false) at different sample sizes, check this out, especially the tables on 670-8 and 670-9, and the plots on 670-10. This can help you better understand which test is best for a given sample size, and how much faith to put in these tests given your sample!
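As a quick example, the Anderson-Darling test can be run directly on our residuals (this assumes you have the {nortest} package installed):
library(nortest)
ad.test(m$residuals)  # Anderson-Darling normality test; interpret the p value as for shapiro.test()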
Load in the “KamilarAndCooper.csv” dataset and develop a linear model to look at the relationship between “weaning age” and “female body mass”. You will probably need to look at the data and variable names again to find the appropriate variables to examine.
- Fit the model using the lm() function. Are the regression coefficients estimated under a simple linear model statistically significantly different from zero?
- Construct the ANOVA table for your model by running lm() and then looking at summary.aov(lm()).
- Run the plot() command on the result of lm() and examine the 4 plots produced. Again, based on examination of the residuals and the results of a Shapiro-Wilk test, does it look like your model has good fit?

f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/KamilarAndCooperData.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## Scientific_Name Family Genus Species
## 1 Allenopithecus_nigroviridis Cercopithecidae Allenopithecus nigroviridis
## 2 Allocebus_trichotis Cercopithecidae Allocebus trichotis
## 3 Alouatta_belzebul Atelidae Alouatta belzebul
## 4 Alouatta_caraya Atelidae Alouatta caraya
## 5 Alouatta_guariba Atelidae Alouatta guariba
## 6 Alouatta_palliata Atelidae Alouatta palliata
## Brain_Size_Species_Mean Brain_Size_Female_Mean Brain_size_Ref
## 1 58.02 53.70 Isler et al 2008
## 2 NA NA
## 3 52.84 51.19 Isler et al 2008
## 4 52.63 47.80 Isler et al 2008
## 5 51.70 49.08 Isler et al 2008
## 6 49.88 48.04 Isler et al 2008
## Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
## 1 6130 3180 1.928
## 2 92 84 1.095
## 3 7270 5520 1.317
## 4 6525 4240 1.539
## 5 5800 4550 1.275
## 6 7150 5350 1.336
## Mass_Ref MeanGroupSize AdultMales AdultFemale AdultSexRatio
## 1 Isler et al 2008 NA NA NA NA
## 2 Smith and Jungers 1997 1.00 1.00 1.0 NA
## 3 Isler et al 2008 7.00 1.00 1.0 1.00
## 4 Isler et al 2008 8.00 2.30 3.3 1.43
## 5 Isler et al 2008 6.53 1.37 2.2 1.61
## 6 Isler et al 2008 12.00 2.90 6.3 2.17
## Social_Organization_Ref
## 1
## 2 Kappeler 1997
## 3 Campbell et al 2007
## 4 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 5 Campbell et al 2007
## 6 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1 NA NA 106.15 276.0 1.01
## 2 NA NA NA NA 1.00
## 3 NA NA NA NA NA
## 4 337.62 187 323.16 243.6 1.01
## 5 NA NA NA NA NA
## 6 684.37 186 495.60 300.0 1.02
## Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1 Jones et al. 2009 -0.17 1574.0 25.2 1517.8
## 2 -16.59 1902.3 20.3 1388.2
## 3 -6.80 1643.5 24.9 1286.6
## 4 Jones et al. 2009 -20.34 1166.4 22.9 1193.1
## 5 -21.13 1332.3 19.6 1225.7
## 6 Jones et al. 2009 6.95 1852.6 23.7 1300.0
## PET_Mean_mm Climate_Ref HomeRange_km2 HomeRangeRef DayLength_km
## 1 1589.4 Jones et al. 2009 NA NA
## 2 1653.7 Jones et al. 2009 NA NA
## 3 1549.8 Jones et al. 2009 NA NA
## 4 1404.9 Jones et al. 2009 NA 0.40
## 5 1332.2 Jones et al. 2009 0.03 Jones et al. 2009 NA
## 6 1633.9 Jones et al. 2009 0.19 Jones et al. 2009 0.32
## DayLengthRef Territoriality Fruit Leaves Fauna DietRef1
## 1 NA NA
## 2 NA NA
## 3 NA 57.3 19.1 0.0 Campbell et al. 2007
## 4 Nunn et al. 2003 NA 23.8 67.7 0.0 Campbell et al. 2007
## 5 NA 5.2 73.0 0.0 Campbell et al. 2007
## 6 Nunn et al. 2003 0.6506 33.1 56.4 0.0 Campbell et al. 2007
## Canine_Dimorphism Canine_Dimorphism_Ref Feed Move Rest Social
## 1 2.210 Plavcan & Ruff 2008 NA NA NA NA
## 2 NA NA NA NA NA
## 3 1.811 Plavcan & Ruff 2008 13.75 18.75 57.30 10.00
## 4 1.542 Plavcan & Ruff 2008 15.90 17.60 61.60 4.90
## 5 1.783 Plavcan & Ruff 2008 18.33 14.33 64.37 3.00
## 6 1.703 Plavcan & Ruff 2008 17.94 12.32 66.14 3.64
## Activity_Budget_Ref
## 1
## 2
## 3 Campbell et al. 2007
## 4 Campbell et al. 2007
## 5 Campbell et al. 2007
## 6 Campbell et al. 2007
plot(data = d, WeaningAge_d ~ Body_mass_female_mean)
model <- lm(data = d, WeaningAge_d ~ Body_mass_female_mean)
summary(model)
##
## Call:
## lm(formula = WeaningAge_d ~ Body_mass_female_mean, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -720.39 -115.48 -54.05 58.11 471.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.016e+02 2.012e+01 10.02 <2e-16 ***
## Body_mass_female_mean 2.013e-02 1.927e-03 10.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 183.4 on 114 degrees of freedom
## (97 observations deleted due to missingness)
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4845
## F-statistic: 109.1 on 1 and 114 DF, p-value: < 2.2e-16
plot(model)
qqPlot(model$residuals)
## 97 44
## 48 24
s <- shapiro.test(model$residuals)
s
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.86291, p-value = 5.825e-09
Recall that for linear regression modeling to be appropriate, two important conditions need to be met: [1] our variables (and the error variance in our variables) should be normally distributed and [2] there should be homogeneity of variance in our response variable around the range of our predictor variable.
In many cases, these conditions may not be met… for example, the continuous metric data we have may not, in fact, be normally distributed. Which data points are pushing this dataset into non-normality? Can we justify their exclusion to achieve normality if we’re interested in understanding life history variation in the relationship between female body size and age at weaning? Thankfully, we can often apply some kind of mathematical transformation to our data to change their distribution to more closely approximate the normal without excluding what may initially appear to be outliers.
The logarithmic or “log” transformation (where we take the log value of each data point) is often applied to positive numeric variables with heavy skew to dramatically reduce the overall range of the data and bring extreme observations closer to a measure of centrality. The logarithm of a number is the power to which you must raise a base value (e.g., \(e\) for the natural logarithm) in order to obtain that number. This is one of a family of “power transformations”, other examples of which include the square root transformation and the reciprocal (or multiplicative inverse) transformation.
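To see what this looks like in practice, compare the distribution of female body mass in the Kamilar and Cooper dataset (loaded above) before and after log transformation; a quick sketch:
par(mfrow = c(1, 2))
hist(d$Body_mass_female_mean, main = "raw", xlab = "female body mass")  # heavily right-skewed
hist(log(d$Body_mass_female_mean), main = "log transformed", xlab = "log(female body mass)")  # much more symmetric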
Return to the “KamilarAndCooper.csv” dataset you were looking at above and log transform both of your variables and then run a simple bivariate linear model. Do you notice a difference between these results and those obtained using untransformed variables?
d$logWeaningAge <- log(d$WeaningAge_d)
d$logFemaleBodyMass <- log(d$Body_mass_female_mean)
plot(data = d, logWeaningAge ~ logFemaleBodyMass)
model <- lm(data = d, logWeaningAge ~ logFemaleBodyMass)
summary(model)
##
## Call:
## lm(formula = logWeaningAge ~ logFemaleBodyMass, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10639 -0.32736 0.00848 0.32214 1.11010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7590 0.2196 8.011 1.08e-12 ***
## logFemaleBodyMass 0.4721 0.0278 16.983 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4532 on 114 degrees of freedom
## (97 observations deleted due to missingness)
## Multiple R-squared: 0.7167, Adjusted R-squared: 0.7142
## F-statistic: 288.4 on 1 and 114 DF, p-value: < 2.2e-16
plot(model)
qqPlot(model$residuals)
## 44 213
## 24 116
s <- shapiro.test(model$residuals)
s
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.99367, p-value = 0.8793
The following graphs show some other common numerical transformations that are often useful for changing a variable’s distribution to more closely approximate the normal.
par(mfrow = c(1, 2))
a <- 2
b <- 2
# log x
x <- seq(from = 0, to = 100, length.out = 1000)
y <- a + b * log(x)
plot(x, y, type = "l", main = "untransformed")
plot(log(x), y, type = "l", main = "log(x)")
# log y
x <- seq(from = 0, to = 10, length.out = 1000)
y <- exp(a + b * x)
plot(x, y, type = "l", main = "untransformed")
plot(x, log(y), type = "l", main = "log(y)")
# asymptotic
x <- seq(from = 1, to = 100, length.out = 100)
y <- (a * x)/(1 + b * x)
plot(x, y, type = "l", main = "untransformed")
plot(1/x, y, type = "l", main = "1/x")
# reciprocal
x <- seq(from = 1, to = 100, length.out = 100)
y <- a + b/x
plot(x, y, type = "l", main = "untransformed")
plot(1/x, y, type = "l", main = "1/x")
# power
x <- seq(from = 1, to = 100, length.out = 100)
y <- a * x^b
plot(x, y, type = "l", main = "untransformed")
plot(x^b, y, type = "l", main = "x^b")
# exp
x <- seq(from = 1, to = 10, length.out = 100)
y <- a * exp(b * x)
plot(x, y, type = "l", main = "untransformed")
plot(x, log(y), type = "l", main = "log(y)")
Uniformly transforming a dataset is the most common way of attempting to squeeze your data into matching the normal distribution so that you can use parametric statistics. This is a very common practice, and there is much written on the topic, including its drawbacks.
For a good discussion of common transformations, see R in Action Chapter 8.5 (Corrective measures), and please do note A caution concerning transformations (p. 200). When trying out transformations, keep your normality tests handy so that you can retest the transformed data and see whether the transformation achieved its intended purpose.
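For example, having created the log-transformed variables above, we can re-run the Shapiro-Wilk test on the raw and transformed versions of a variable to see whether the transformation helped:
shapiro.test(d$WeaningAge_d)  # raw response variable
shapiro.test(d$logWeaningAge)  # log-transformed version; compare the W statistics and p values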