In this module, we extend our simple linear regression and ANOVA models to cases where we have more than one predictor variable. The same approach can be used with combinations of continuous and categorical predictors. If all of our predictor variables are continuous, we typically describe this as “multiple regression”. If our predictors are a combination of continuous and categorical variables, we typically describe this as “analysis of covariance”, or ANCOVA.
Multiple linear regression and ANCOVA are pretty straightforward generalizations of the simple linear regression and ANOVA approach that we have considered previously (i.e., Model I regressions with our parameters estimated using the criterion of OLS). In multiple linear regression and ANCOVA, we are looking to model a response variable in terms of more than one predictor variable so we can evaluate the effects of several different explanatory variables. When we do multiple linear regression, we are, essentially, looking at the relationship between each of two or more continuous predictor variables and a continuous response variable while holding the effect of all other predictor variables constant. When we do ANCOVA, we are effectively looking at the relationship between one or more continuous predictor variables and a continuous response variable within each of one or more categorical groups.
To characterize this using formulas analogous to those we have used before…
Simple Bivariate Linear Model:
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Multivariate Linear Model:
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \epsilon_i\]
We thus now need to estimate multiple \(\beta\) coefficients, one for the intercept plus one for each predictor variable in our model, and, instead of a “line” of best fit, we are determining a multidimensional surface of best fit. The criterion we typically use for estimating best fit is analogous to the one we’ve used before, i.e., ordinary least squares, where we want to minimize the multidimensional squared deviation between our observed and predicted values:
\[\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left(Y_i - [\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki}]\right)^2\]
Estimating our set of coefficients for multiple regression is then done using matrix algebra, which we don’t need to explore thoroughly (but see the “digression” below). The lm() function will implement this process for us directly.
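For the curious, here is a minimal sketch, using a small made-up simulated dataset (not the one we build below), of the matrix-algebra solution to the normal equations, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), that underlies OLS estimation. lm() arrives at the same estimates (internally it uses a QR decomposition rather than inverting \(X^\top X\) directly).
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(100)  # simulated response
X <- cbind(1, x1, x2)  # design matrix with a leading column of 1s for the intercept
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y  # solution to the normal equations
cbind(beta.hat, coef(lm(y ~ x1 + x2)))  # the two sets of estimates match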
NOTE: Besides the least squares approach to parameter estimation, we might also use other approaches, including maximum likelihood and Bayesian approaches (which we will explore superficially later). These are beyond the scope of what we will discuss in this class, but the objective is the same… to use a particular criterion to estimate parameter values describing a relationship between our predictor and response variables as well as measures of uncertainty in those parameter values. It is worth noting that for many common population parameters (such as the mean) as well as for regression slopes and/or factor effects in certain linear models, the OLS and ML estimators turn out to be the same when the assumptions behind OLS are met (i.e., normally distributed variables and normally distributed error terms). When response variables or residuals are not normally distributed (e.g., where we have binary or categorical response variables), we use ML or Bayesian approaches rather than OLS for parameter estimation.
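As a quick illustration of that last point (a minimal sketch with simulated, made-up variables x and y), fitting the same Gaussian linear model by OLS with lm() and by maximum likelihood with glm() returns identical coefficient estimates:
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
coef(lm(y ~ x))  # OLS estimates
coef(glm(y ~ x, family = gaussian))  # ML estimates... same values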
Let’s work some examples!
We will start by constructing a dataset ourselves of some correlated random normal continuous variables. The following bit of code will let us do that. See also this post. First, we define a matrix of correlations among our variables (you can play with the values in this matrix, but it must be symmetric):
R = matrix(cbind(1, 0.8, -0.5, 0, 0.8, 1, -0.3, 0.3, -0.5, -0.3, 1, 0.6, 0,
0.3, 0.6, 1), nrow = 4)
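A couple of optional sanity checks on R: for the Cholesky step we use below to work, the matrix should be symmetric and positive definite (all eigenvalues greater than zero).
isSymmetric(R)  # should be TRUE
all(eigen(R)$values > 0)  # should also be TRUE, i.e., R is positive definite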
Second, let’s generate a dataset of random normal variables where each has a defined mean and standard deviation and then bundle these into a matrix (“M”) and a dataframe (“orig”):
n <- 1000
k <- 4
M <- NULL
V <- NULL
mu <- c(15, 40, 5, 23) # vector of variable means
s <- c(5, 20, 4, 15) # vector of variable SDs
for (i in 1:k) {
V <- rnorm(n, mu[i], s[i])
M <- cbind(M, V)
}
M <- matrix(M, nrow = n, ncol = k)
orig <- as.data.frame(M)
names(orig) = c("Y", "X1", "X2", "X3")
head(orig)
## Y X1 X2 X3
## 1 9.574545 65.47289 10.687693 22.747349
## 2 16.204971 79.75803 8.739305 37.776902
## 3 15.576722 86.46733 4.795702 67.004664
## 4 16.204521 68.48138 1.301661 64.672690
## 5 7.602543 71.90371 9.387753 41.728566
## 6 20.480895 33.07346 2.960964 8.525107
cor(orig)  # variables are essentially uncorrelated
## Y X1 X2 X3
## Y 1.000000000 0.013294649 0.004717308 -0.01477745
## X1 0.013294649 1.000000000 -0.001162035 -0.00133139
## X2 0.004717308 -0.001162035 1.000000000 -0.01448358
## X3 -0.014777451 -0.001331390 -0.014483578 1.00000000
plot(orig) # does quick bivariate plots for each pair of variables; using `pairs(orig)` would do the same
Now, let’s normalize and standardize our variables by subtracting the relevant means and dividing by the standard deviation. This converts them to Z scores from a standard normal distribution.
NOTE how we use the apply() and sweep() functions here. Cool, eh?
ms <- apply(orig, 2, FUN = "mean") # returns a vector of means, where we are taking this across dimension 2 of the array 'orig'
ms
## Y X1 X2 X3
## 15.105998 39.998477 4.965911 23.276681
sds <- apply(orig, 2, FUN = "sd")
sds
## Y X1 X2 X3
## 4.922904 19.720467 3.903137 15.691867
normalized <- sweep(orig, 2, STATS = ms, FUN = "-") # 2nd dimension is columns, removing array of means, function = subtract
normalized <- sweep(normalized, 2, STATS = sds, FUN = "/") # 2nd dimension is columns, scaling by array of sds, function = divide
head(normalized) # now a dataframe of Z scores
## Y X1 X2 X3
## 1 -1.12361582 1.2917755 1.46594453 -0.03373288
## 2 0.22323672 2.0161567 0.96675954 0.92405970
## 3 0.09561921 2.3563769 -0.04360831 2.78666549
## 4 0.22314547 1.4443322 -0.93879617 2.63805513
## 5 -1.52419265 1.6178742 1.13289455 1.17588852
## 6 1.09181433 -0.3511591 -0.51367586 -0.94007767
M <- as.matrix(normalized) # redefine M as our matrix of normalized variables
With apply() we apply a function to the specified margin of an array or matrix, and with sweep() we then perform whatever function is specified on all of the elements of the array along the given margin.
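As an aside, base R’s scale() function does the same centering and scaling in a single step; a quick check that it matches our apply()/sweep() result:
zscores <- scale(orig)  # centers each column on its mean and divides by its sd
all.equal(as.numeric(zscores), as.numeric(as.matrix(normalized)))  # should be TRUE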
Next, we take the Cholesky decomposition of our correlation matrix and then multiply our normalized data matrix by the decomposition matrix to yield a transformed dataset with the specified correlation among variables. The Cholesky decomposition factors certain symmetric (positive definite) matrices into the product of an upper triangular matrix and its transpose, such that:
\[R = U^\top U\]
U = chol(R)
newM = M %*% U
new = as.data.frame(newM)
names(new) = c("Y", "X1", "X2", "X3")
cor(new)  # note that this correlation matrix is what we were aiming for!
## Y X1 X2 X3
## Y 1.0000000000 0.8028696 -0.4954014 0.0004037263
## X1 0.8028695762 1.0000000 -0.2986878 0.2993721423
## X2 -0.4954013573 -0.2986878 1.0000000 0.5971177919
## X3 0.0004037263 0.2993721 0.5971178 1.0000000000
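We can also verify the decomposition itself: U is upper triangular, and multiplying its transpose by itself should give us back our original correlation matrix.
all.equal(t(U) %*% U, R)  # recovers R, i.e., R = t(U) %*% U... should be TRUE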
plot(orig)
plot(new) # note the axis scales; using `pairs(new)` would plot the same
Finally, we can scale these back out to the means and standard deviations of our original random variables.
df <- sweep(new, 2, STATS = sds, FUN = "*")  # multiply by the original standard deviations...
df <- sweep(df, 2, STATS = ms, FUN = "+")  # ... and then add back the original means
head(df)
## Y X1 X2 X3
## 1 9.574545 37.55654 12.8616327 47.070549
## 2 16.204971 67.37607 9.0485737 57.261283
## 3 15.576722 69.38831 6.1675303 68.318009
## 4 16.204521 60.60865 2.3559847 51.183798
## 5 7.602543 35.09539 12.7507842 58.158520
## 6 20.480895 53.06833 0.9028463 6.522947
cor(df)
## Y X1 X2 X3
## Y 1.0000000000 0.8028696 -0.4954014 0.0004037263
## X1 0.8028695762 1.0000000 -0.2986878 0.2993721423
## X2 -0.4954013573 -0.2986878 1.0000000 0.5971177919
## X3 0.0004037263 0.2993721 0.5971178 1.0000000000
plot(df) # note the change to the axis scales; using `pairs(df)` would produce the same plot
Now, we have a dataframe, df, comprising correlated random variables in our original units!
Let’s explore this dataset first with single and then with multivariate regression.
Start off by making some bivariate scatterplots in {ggplot2}. Then, using simple linear regression as implemented with lm(), how does the response variable (\(Y\)) vary with each predictor variable (\(X1\), \(X2\), \(X3\))? Are the \(\beta_1\) coefficients significant? How much of the variation in \(Y\) does each predictor explain in a simple bivariate linear model?
library(ggplot2)
require(gridExtra)
## Loading required package: gridExtra
g1 <- ggplot(data = df, aes(x = X1, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
g2 <- ggplot(data = df, aes(x = X2, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
g3 <- ggplot(data = df, aes(x = X3, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
grid.arrange(g1, g2, g3, ncol = 3)
m1 <- lm(data = df, formula = Y ~ X1)
summary(m1)
##
## Call:
## lm(formula = Y ~ X1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0967 -1.9954 -0.0438 1.8376 10.1349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.140026 0.208995 34.16 <2e-16 ***
## X1 0.199157 0.004681 42.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 998 degrees of freedom
## Multiple R-squared: 0.6446, Adjusted R-squared: 0.6442
## F-statistic: 1810 on 1 and 998 DF, p-value: < 2.2e-16
m2 <- lm(data = df, formula = Y ~ X2)
summary(m2)
##
## Call:
## lm(formula = Y ~ X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7767 -2.6678 0.0539 2.7524 14.1840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2191 0.2195 83.02 <2e-16 ***
## X2 -0.6269 0.0348 -18.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.278 on 998 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2447
## F-statistic: 324.6 on 1 and 998 DF, p-value: < 2.2e-16
m3 <- lm(data = df, formula = Y ~ X3)
summary(m3)
##
## Call:
## lm(formula = Y ~ X3, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7501 -3.3017 0.0467 3.2451 19.6751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.510e+01 2.799e-01 53.952 <2e-16 ***
## X3 1.275e-04 9.993e-03 0.013 0.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.925 on 998 degrees of freedom
## Multiple R-squared: 1.63e-07, Adjusted R-squared: -0.001002
## F-statistic: 0.0001627 on 1 and 998 DF, p-value: 0.9898
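To answer the last of the questions above, we can pull the coefficient of determination out of each bivariate model object in one go:
# proportion of the variation in Y explained by each predictor on its own
sapply(list(m1 = m1, m2 = m2, m3 = m3), function(x) summary(x)$r.squared)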
In simple linear regression, \(Y\) has a significant, positive relationship with \(X1\), a significant negative relationship with \(X2\), and no significant bivariate relationship with \(X3\).
Now let’s move on to doing actual multiple regression. To review, with multiple regression, we are looking to model a response variable in terms of two or more predictor variables so we can evaluate the effect of several explanatory variables.
Using lm() and formula notation, we can fit a model with all three predictor variables. The + sign is used to add additional predictors to our model.
m <- lm(data = df, formula = Y ~ X1 + X2 + X3)
coef(m)
## (Intercept) X1 X2 X3
## 9.42305696 0.19402584 -0.24030088 -0.03799875
summary(m)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6321 -1.8096 -0.0481 1.6659 7.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.423057 0.255370 36.900 < 2e-16 ***
## X1 0.194026 0.005557 34.915 < 2e-16 ***
## X2 -0.240301 0.033720 -7.126 1.98e-12 ***
## X3 -0.037999 0.008414 -4.516 7.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.599 on 996 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.7212
## F-statistic: 862.3 on 3 and 996 DF, p-value: < 2.2e-16
# let's check if our residuals are random normal...
plot(fitted(m), residuals(m))
hist(residuals(m))
qqnorm(residuals(m))
What does this output tell us? First off, the result of the omnibus F test tells us that the overall model is significant; using these three variables, we explain significantly more of the variation in the response variable, \(Y\), than we would using a model with just an intercept, i.e., predicting \(Y\) as mean(\(Y\)).
For a multiple regression model, we calculate the F statistic as follows:
\[F = \frac{R^2 \times (n - p - 1)}{(1 - R^2) \times p}\]
Where:
\(R^2\) = multiple R squared value
\(n\) = number of data points
\(p\) = number of parameters estimated from the data (i.e., the number of \(\beta\) coefficients, not including the intercept)
f <- (summary(m)$r.squared * (nrow(df) - (ncol(df) - 1) - 1))/((1 - summary(m)$r.squared) *
(ncol(df) - 1))
f
## [1] 862.3273
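The p-value reported for the omnibus test can then be recovered from the F distribution with \(p\) numerator and \(n - p - 1\) denominator degrees of freedom:
pf(f, df1 = ncol(df) - 1, df2 = nrow(df) - (ncol(df) - 1) - 1, lower.tail = FALSE)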
Second, looking at the summary() output, we see that the \(\beta\) coefficient for each of our predictor variables (including \(X3\)) is significant. That is, each predictor is significant even when the effects of the other predictor variables are held constant. Recall that in the simple linear regression, the \(\beta\) coefficient for \(X3\) was not significant.
Third, we can interpret our \(\beta\) coefficients as we did in simple linear regression… for each change of one unit in a particular predictor variable (holding the other predictors constant), our predicted value of the response variable changes by \(\beta\) units.
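For example, using hypothetical predictor values close to the variable means we simulated, a predicted value built by hand from the coefficients matches what predict() returns:
new.obs <- data.frame(X1 = 40, X2 = 5, X3 = 23)  # hypothetical values for each predictor
sum(coef(m) * c(1, unlist(new.obs)))  # beta0 + beta1*X1 + beta2*X2 + beta3*X3
predict(m, newdata = new.obs)  # same predicted value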
Load up the “zombies.csv” dataset again and run a linear model of height as a function of both weight and age. Is the overall model significant? Are both predictor variables significant when the other is controlled for?
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall21/zombies.csv")
z <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(z)
## id first_name last_name gender height weight zombies_killed
## 1 1 Sarah Little Female 62.88951 132.0872 2
## 2 2 Mark Duncan Male 67.80277 146.3753 5
## 3 3 Brandon Perez Male 72.12908 152.9370 1
## 4 4 Roger Coleman Male 66.78484 129.7418 5
## 5 5 Tammy Powell Female 64.71832 132.4265 4
## 6 6 Anthony Green Male 71.24326 152.5246 1
## years_of_education major age
## 1 1 medicine/nursing 17.64275
## 2 3 criminal justice administration 22.58951
## 3 1 education 21.91276
## 4 6 energy studies 18.19058
## 5 3 logistics 21.10399
## 6 4 energy studies 21.48355
m <- lm(data = z, height ~ weight + age)
summary(m)
##
## Call:
## lm(formula = height ~ weight + age, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2278 -1.1782 -0.0574 1.1566 5.4117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.763388 0.470797 67.47 <2e-16 ***
## weight 0.163107 0.002976 54.80 <2e-16 ***
## age 0.618270 0.018471 33.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.64 on 997 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8553
## F-statistic: 2952 on 2 and 997 DF, p-value: < 2.2e-16
We can use the same linear modeling approach to do analysis of covariance, where we have a continuous response variable and a combination of continuous and categorical predictor variables. Let’s return to our “zombies.csv” dataset and now include one continuous and one categorical variable in our model… we want to predict height as a function of age and gender, and we want to evaluate the predictors using Type II sums of squares (via the Anova() function from the {car} package). What is our model formula?
library(car)
## Loading required package: carData
m <- lm(data = z, formula = height ~ gender + age)
summary(m)
##
## Call:
## lm(formula = height ~ gender + age, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1909 -1.7173 0.1217 1.7670 7.6746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.74251 0.56869 82.19 <2e-16 ***
## genderMale 4.00224 0.16461 24.31 <2e-16 ***
## age 0.94091 0.02777 33.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 997 degrees of freedom
## Multiple R-squared: 0.6361, Adjusted R-squared: 0.6354
## F-statistic: 871.5 on 2 and 997 DF, p-value: < 2.2e-16
m.aov <- Anova(m, type = "II")
m.aov
## Anova Table (Type II tests)
##
## Response: height
## Sum Sq Df F value Pr(>F)
## gender 4003.9 1 591.15 < 2.2e-16 ***
## age 7775.6 1 1148.01 < 2.2e-16 ***
## Residuals 6752.7 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fitted(m), residuals(m))
hist(residuals(m))
qqnorm(residuals(m))
How do we interpret these results?
The omnibus F test is significant
Both predictors are significant
Controlling for age, being male adds 4 inches to predicted height when compared to being female.
We can write two equations for the relationship between height on the one hand and age and gender on the other:
For females, height = 46.74251 + 0.94091 x age
In this case, females are the first level of our “gender” factor and thus do not have additional regression coefficients associated with them.
For males, height = 46.74251 + 4.00224 + 0.94091 x age
Here, the additional 4.00224 added to the intercept term is the coefficient associated with genderMale
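We can check these two equations against predict() for, say, a hypothetical 20-year-old survivor of each gender:
predict(m, newdata = data.frame(age = 20, gender = c("Female", "Male")))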
library(ggplot2)
p <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
scale_color_manual(values = c("goldenrod", "blue"))
p <- p + geom_abline(slope = m$coefficients[3], intercept = m$coefficients[1],
color = "goldenrod4")
p <- p + geom_abline(slope = m$coefficients[3], intercept = m$coefficients[1] +
m$coefficients[2], color = "darkblue")
p
Note that this model is based on all of the data collectively… we are not doing separate linear models for males and females, which could result in different slopes and intercepts for each sex. Below, we will explore a model where we posit an interaction between age and sex, which would require estimation of four separate parameters (i.e., both a slope and an intercept for males and females, rather than, as above, different intercepts for males and females but the same slope for each sex).
Using the confint() function on our ANCOVA model reveals the confidence intervals for each of the coefficients in our multiple regression, just as it did for simple regression.
m <- lm(data = z, formula = height ~ age + gender)
summary(m)
##
## Call:
## lm(formula = height ~ age + gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1909 -1.7173 0.1217 1.7670 7.6746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.74251 0.56869 82.19 <2e-16 ***
## age 0.94091 0.02777 33.88 <2e-16 ***
## genderMale 4.00224 0.16461 24.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 997 degrees of freedom
## Multiple R-squared: 0.6361, Adjusted R-squared: 0.6354
## F-statistic: 871.5 on 2 and 997 DF, p-value: < 2.2e-16
confint(m, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 45.6265330 47.8584809
## age 0.8864191 0.9954081
## genderMale 3.6792172 4.3252593
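Under the hood, these are just the familiar intervals of the form estimate ± (critical t value) × (standard error); a quick sketch for the age coefficient:
est <- summary(m)$coefficients["age", "Estimate"]
se <- summary(m)$coefficients["age", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = m$df.residual) * se  # matches the confint() row for age above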
Similarly, using predict() allows us to determine confidence intervals for the predicted mean response and prediction intervals for individual responses for a given combination of predictor variables.
What is the estimated mean height, in inches, for a 29 year old male who has survived the zombie apocalypse?
What is the 95% confidence interval around this mean height?
What is the 95% prediction interval for the individual heights of 29 year old male survivors?
ci <- predict(m, newdata = data.frame(age = 29, gender = "Male"), interval = "confidence",
level = 0.95)
ci
## fit lwr upr
## 1 78.03124 77.49345 78.56903
pi <- predict(m, newdata = data.frame(age = 29, gender = "Male"), interval = "prediction",
level = 0.95)
pi
## fit lwr upr
## 1 78.03124 72.89597 83.16651
So far, we have only considered the joint main effects of multiple predictors on a response variable, but often there are interactive effects between our predictors. An interactive effect is an additional change in the response that occurs because of particular combinations of predictors or because the relationship of one continuous variable to a response is contingent on a particular level of a categorical variable. We explored the former case a bit when we looked at ANOVAs involving two discrete predictors. Now, we’ll consider the latter case… is there an interactive effect of sex AND age on height in our population of zombie apocalypse survivors?
Using formula notation, it is easy for us to consider interactions between predictors. The colon (:) operator allows us to specify particular interactions we want to consider; we can use the asterisk (*) operator to specify a full model, i.e., all single-term predictors plus their interactions.
m <- lm(data = z, height ~ age + gender + age:gender)  # or, equivalently, height ~ age * gender (see below)
summary(m)
##
## Call:
## lm(formula = height ~ age + gender + age:gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7985 -1.6973 0.1189 1.7662 7.9473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.18107 0.79839 60.348 <2e-16 ***
## age 0.86913 0.03941 22.053 <2e-16 ***
## genderMale 1.15975 1.12247 1.033 0.3018
## age:genderMale 0.14179 0.05539 2.560 0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 996 degrees of freedom
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6374
## F-statistic: 586.4 on 3 and 996 DF, p-value: < 2.2e-16
m <- lm(data = z, height ~ age * gender)
summary(m)
##
## Call:
## lm(formula = height ~ age * gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7985 -1.6973 0.1189 1.7662 7.9473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.18107 0.79839 60.348 <2e-16 ***
## age 0.86913 0.03941 22.053 <2e-16 ***
## genderMale 1.15975 1.12247 1.033 0.3018
## age:genderMale 0.14179 0.05539 2.560 0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 996 degrees of freedom
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6374
## F-statistic: 586.4 on 3 and 996 DF, p-value: < 2.2e-16
coefficients(m)
## (Intercept) age genderMale age:genderMale
## 48.1810741 0.8691284 1.1597481 0.1417928
Here, when we allow an interaction, the main effect of gender is no longer significant, but there is a significant interaction effect between gender and age.
If we want to visualize this…
female height = 48.1810741 + 0.8691284 * age
male height = 48.1810741 + 1.1597481 + 0.8691284 * age + 0.1417928 * age
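We can pull these sex-specific intercepts and slopes out of the coefficient vector directly, which is also what the plotting code below uses:
b <- coef(m)
# intercept and slope for females (the reference level of gender)
c(intercept = unname(b["(Intercept)"]), slope = unname(b["age"]))
# intercept and slope for males (adding the genderMale and age:genderMale terms)
c(intercept = unname(b["(Intercept)"] + b["genderMale"]), slope = unname(b["age"] + b["age:genderMale"]))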
library(ggplot2)
library(gridExtra)
p1 <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
scale_color_manual(values = c("goldenrod", "blue"))
p1 <- p1 + geom_abline(slope = m$coefficients[2], intercept = m$coefficients[1],
color = "goldenrod4")
p1 <- p1 + geom_abline(slope = m$coefficients[2] + m$coefficients[4], intercept = m$coefficients[1] +
m$coefficients[3], color = "darkblue")
p1
p2 <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
    scale_color_manual(values = c("goldenrod", "blue")) + geom_smooth(method = "lm",
    aes(color = factor(gender)), fullrange = TRUE)  # fullrange is an argument to geom_smooth(), not an aesthetic
grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
Load in the “KamilarAndCooper.csv” dataset we have used previously
Reduce the dataset to the following variables: Family, Brain_Size_Female_Mean, Body_mass_female_mean, MeanGroupSize, DayLength_km, HomeRange_km2, and Move
Fit a Model I least squares multiple linear regression model using log(HomeRange_km2) as the response variable and log(Body_mass_female_mean), log(Brain_Size_Female_Mean), MeanGroupSize, and Move as predictor variables, and view a model summary.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall21/KamilarAndCooperData.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(d)
## Scientific_Name Family Genus Species
## 1 Allenopithecus_nigroviridis Cercopithecidae Allenopithecus nigroviridis
## 2 Allocebus_trichotis Cercopithecidae Allocebus trichotis
## 3 Alouatta_belzebul Atelidae Alouatta belzebul
## 4 Alouatta_caraya Atelidae Alouatta caraya
## 5 Alouatta_guariba Atelidae Alouatta guariba
## 6 Alouatta_palliata Atelidae Alouatta palliata
## Brain_Size_Species_Mean Brain_Size_Female_Mean Brain_size_Ref
## 1 58.02 53.70 Isler et al 2008
## 2 NA NA
## 3 52.84 51.19 Isler et al 2008
## 4 52.63 47.80 Isler et al 2008
## 5 51.70 49.08 Isler et al 2008
## 6 49.88 48.04 Isler et al 2008
## Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
## 1 6130 3180 1.928
## 2 92 84 1.095
## 3 7270 5520 1.317
## 4 6525 4240 1.539
## 5 5800 4550 1.275
## 6 7150 5350 1.336
## Mass_Ref MeanGroupSize AdultMales AdultFemale AdultSexRatio
## 1 Isler et al 2008 NA NA NA NA
## 2 Smith and Jungers 1997 1.00 1.00 1.0 NA
## 3 Isler et al 2008 7.00 1.00 1.0 1.00
## 4 Isler et al 2008 8.00 2.30 3.3 1.43
## 5 Isler et al 2008 6.53 1.37 2.2 1.61
## 6 Isler et al 2008 12.00 2.90 6.3 2.17
## Social_Organization_Ref
## 1
## 2 Kappeler 1997
## 3 Campbell et al 2007
## 4 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 5 Campbell et al 2007
## 6 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1 NA NA 106.15 276.0 1.01
## 2 NA NA NA NA 1.00
## 3 NA NA NA NA NA
## 4 337.62 187 323.16 243.6 1.01
## 5 NA NA NA NA NA
## 6 684.37 186 495.60 300.0 1.02
## Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1 Jones et al. 2009 -0.17 1574.0 25.2 1517.8
## 2 -16.59 1902.3 20.3 1388.2
## 3 -6.80 1643.5 24.9 1286.6
## 4 Jones et al. 2009 -20.34 1166.4 22.9 1193.1
## 5 -21.13 1332.3 19.6 1225.7
## 6 Jones et al. 2009 6.95 1852.6 23.7 1300.0
## PET_Mean_mm Climate_Ref HomeRange_km2 HomeRangeRef DayLength_km
## 1 1589.4 Jones et al. 2009 NA NA
## 2 1653.7 Jones et al. 2009 NA NA
## 3 1549.8 Jones et al. 2009 NA NA
## 4 1404.9 Jones et al. 2009 NA 0.40
## 5 1332.2 Jones et al. 2009 0.03 Jones et al. 2009 NA
## 6 1633.9 Jones et al. 2009 0.19 Jones et al. 2009 0.32
## DayLengthRef Territoriality Fruit Leaves Fauna DietRef1
## 1 NA NA
## 2 NA NA
## 3 NA 57.3 19.1 0.0 Campbell et al. 2007
## 4 Nunn et al. 2003 NA 23.8 67.7 0.0 Campbell et al. 2007
## 5 NA 5.2 73.0 0.0 Campbell et al. 2007
## 6 Nunn et al. 2003 0.6506 33.1 56.4 0.0 Campbell et al. 2007
## Canine_Dimorphism Canine_Dimorphism_Ref Feed Move Rest Social
## 1 2.210 Plavcan & Ruff 2008 NA NA NA NA
## 2 NA NA NA NA NA
## 3 1.811 Plavcan & Ruff 2008 13.75 18.75 57.30 10.00
## 4 1.542 Plavcan & Ruff 2008 15.90 17.60 61.60 4.90
## 5 1.783 Plavcan & Ruff 2008 18.33 14.33 64.37 3.00
## 6 1.703 Plavcan & Ruff 2008 17.94 12.32 66.14 3.64
## Activity_Budget_Ref
## 1
## 2
## 3 Campbell et al. 2007
## 4 Campbell et al. 2007
## 5 Campbell et al. 2007
## 6 Campbell et al. 2007
d <- select(d, Brain_Size_Female_Mean, Family, Body_mass_female_mean, MeanGroupSize,
DayLength_km, HomeRange_km2, Move)
Look at the estimated regression coefficients for the fitted model and interpret them. Are any of them statistically significant? What can you infer about the relationship between the response and predictors?
Report and interpret the coefficient of determination and the outcome of the omnibus F test.
Examine the residuals… are they normally distributed?
What happens if you remove the “Move” term?
m <- lm(data = d, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
MeanGroupSize + Move)
summary(m)
##
## Call:
## lm(formula = log(HomeRange_km2) ~ log(Body_mass_female_mean) +
## log(Brain_Size_Female_Mean) + MeanGroupSize + Move, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7978 -0.6473 -0.0038 0.8807 2.1598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.955853 1.957472 -3.553 0.000865 ***
## log(Body_mass_female_mean) 0.315276 0.468439 0.673 0.504153
## log(Brain_Size_Female_Mean) 0.614460 0.591100 1.040 0.303771
## MeanGroupSize 0.034026 0.009793 3.475 0.001095 **
## Move 0.025916 0.019559 1.325 0.191441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.127 on 48 degrees of freedom
## (160 observations deleted due to missingness)
## Multiple R-squared: 0.6359, Adjusted R-squared: 0.6055
## F-statistic: 20.95 on 4 and 48 DF, p-value: 4.806e-10
plot(m$residuals)
qqnorm(m$residuals)
shapiro.test(m$residuals)
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.96517, p-value = 0.1242
m <- lm(data = d, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
MeanGroupSize)
summary(m)
##
## Call:
## lm(formula = log(HomeRange_km2) ~ log(Body_mass_female_mean) +
## log(Brain_Size_Female_Mean) + MeanGroupSize, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7982 -0.7310 -0.0140 0.8386 3.1926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.354845 1.176046 -4.553 1.45e-05 ***
## log(Body_mass_female_mean) -0.181627 0.311382 -0.583 0.560972
## log(Brain_Size_Female_Mean) 1.390536 0.398840 3.486 0.000721 ***
## MeanGroupSize 0.030433 0.008427 3.611 0.000473 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.228 on 103 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.6672
## F-statistic: 71.85 on 3 and 103 DF, p-value: < 2.2e-16
plot(m$residuals)
qqnorm(m$residuals)
shapiro.test(m$residuals) # no significant deviation from normal
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.99366, p-value = 0.9058
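One caveat if you want to formally compare the models with and without Move: lm() dropped different sets of rows due to missingness (48 vs. 103 residual degrees of freedom), so a direct comparison requires refitting both models on the same observations. A hedged sketch, using a complete-cases subset:
d.cc <- na.omit(d[, c("HomeRange_km2", "Body_mass_female_mean", "Brain_Size_Female_Mean",
    "MeanGroupSize", "Move")])
m.full <- lm(data = d.cc, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
    MeanGroupSize + Move)
m.reduced <- update(m.full, . ~ . - Move)
anova(m.reduced, m.full)  # partial F test for dropping Move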