In this module, we extend our simple linear regression and ANOVA models to cases where we have more than one predictor variable. The same approach can be used with combinations of continuous and categorical predictors. If all of our predictor variables are continuous, we typically describe this as “multiple regression”. If our predictors are a combination of continuous and categorical variables, we typically describe this as “analysis of covariance”, or ANCOVA.
Multiple linear regression and ANCOVA are pretty straightforward generalizations of the simple linear regression and ANOVA approach that we have considered previously (i.e., Model I regressions with our parameters estimated using the criterion of OLS). In multiple linear regression and ANCOVA, we are looking to model a response variable in terms of more than one predictor variable so we can evaluate the effects of several different explanatory variables. When we do multiple linear regression, we are, essentially, looking at the relationship between each of two or more continuous predictor variables and a continuous response variable while holding the effect of all other predictor variables constant. When we do ANCOVA, we are effectively looking at the relationship between one or more continuous predictor variables and a continuous response variable within each of one or more categorical groups.
To characterize this using formulas analogous to those we have used before…
Simple Bivariate Linear Model:
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Multivariate Linear Model:
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \epsilon_i\]
We thus now need to estimate multiple \(\beta\) coefficients, one for the intercept plus one for each predictor variable in our model, and, instead of a “line” of best fit, we are determining a multidimensional surface of best fit. The criterion we typically use for estimating best fit is analogous to the one we’ve used before, i.e., ordinary least squares, where we want to minimize the multidimensional squared deviation between our observed and predicted values:
\[\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left(Y_i - [\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki}]\right)^2\]
Estimating our set of coefficients for multiple regression is then done using matrix algebra, which we don’t need to explore thoroughly (but see the “digression” below). The lm() function will implement this process for us directly.
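For the curious, here is a minimal sketch, using a small made-up simulated dataset (not the one we build below), of the matrix-algebra solution to the normal equations, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), that underlies OLS estimation. lm() arrives at the same estimates (internally it uses a QR decomposition rather than inverting \(X^\top X\) directly).
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(100)  # simulated response
X <- cbind(1, x1, x2)  # design matrix with a leading column of 1s for the intercept
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y  # solution to the normal equations
cbind(beta.hat, coef(lm(y ~ x1 + x2)))  # the two sets of estimates match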
NOTE: Besides the least squares approach to parameter estimation, we might also use other approaches, including maximum likelihood and Bayesian approaches (which we will explore superficially later). These are beyond the scope of what we will discuss in this class, but the objective is the same… to use a particular criterion to estimate parameter values describing a relationship between our predictor and response variables as well as measures of uncertainty in those parameter values. It is worth noting that for many common population parameters (such as the mean) as well as for regression slopes and/or factor effects in certain linear models, the OLS and ML estimators turn out to be the same when the assumptions behind OLS are met (i.e., normally distributed variables and normally distributed error terms). When response variables or residuals are not normally distributed (e.g., where we have binary or categorical response variables), we use ML or Bayesian approaches rather than OLS for parameter estimation.
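As a quick illustration of that last point (a minimal sketch with simulated, made-up variables x and y), fitting the same Gaussian linear model by OLS with lm() and by maximum likelihood with glm() returns identical coefficient estimates:
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
coef(lm(y ~ x))  # OLS estimates
coef(glm(y ~ x, family = gaussian))  # ML estimates... same values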
Let’s work some examples!
We will start by constructing a dataset ourselves of some correlated random normal continuous variables. The following bit of code will let us do that. See also this post. First, we define a matrix of correlations among our variables (you can play with the values in this matrix, but it must be symmetric):
R = matrix(cbind(1, 0.8, -0.5, 0, 0.8, 1, -0.3, 0.3, -0.5, -0.3, 1, 0.6, 0,
0.3, 0.6, 1), nrow = 4)
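A couple of optional sanity checks on R: for the Cholesky step we use below to work, the matrix should be symmetric and positive definite (all eigenvalues greater than zero).
isSymmetric(R)  # should be TRUE
all(eigen(R)$values > 0)  # should also be TRUE, i.e., R is positive definite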
Second, let’s generate a dataset of random normal variables where each has a defined mean and standard deviation and then bundle these into a matrix (“M”) and a dataframe (“orig”):
n <- 1000
k <- 4
M <- NULL
V <- NULL
mu <- c(15, 40, 5, 23) # vector of variable means
s <- c(5, 20, 4, 15) # vector of variable SDs
for (i in 1:k) {
V <- rnorm(n, mu[i], s[i])
M <- cbind(M, V)
}
M <- matrix(M, nrow = n, ncol = k)
orig <- as.data.frame(M)
names(orig) = c("Y", "X1", "X2", "X3")
head(orig)
## Y X1 X2 X3
## 1 9.574545 65.47289 10.687693 22.747349
## 2 16.204971 79.75803 8.739305 37.776902
## 3 15.576722 86.46733 4.795702 67.004664
## 4 16.204521 68.48138 1.301661 64.672690
## 5 7.602543 71.90371 9.387753 41.728566
## 6 20.480895 33.07346 2.960964 8.525107
cor(orig)  # variables are essentially uncorrelated
## Y X1 X2 X3
## Y 1.000000000 0.013294649 0.004717308 -0.01477745
## X1 0.013294649 1.000000000 -0.001162035 -0.00133139
## X2 0.004717308 -0.001162035 1.000000000 -0.01448358
## X3 -0.014777451 -0.001331390 -0.014483578 1.00000000
plot(orig) # does quick bivariate plots for each pair of variables; using `pairs(orig)` would do the same
Now, let’s normalize and standardize our variables by subtracting the relevant means and dividing by the standard deviation. This converts them to Z scores from a standard normal distribution.
NOTE how we use the apply() and sweep() functions here. Cool, eh?
ms <- apply(orig, 2, FUN = "mean") # returns a vector of means, where we are taking this across dimension 2 of the array 'orig'
ms
## Y X1 X2 X3
## 15.105998 39.998477 4.965911 23.276681
sds <- apply(orig, 2, FUN = "sd")
sds
## Y X1 X2 X3
## 4.922904 19.720467 3.903137 15.691867
normalized <- sweep(orig, 2, STATS = ms, FUN = "-") # 2nd dimension is columns, removing array of means, function = subtract
normalized <- sweep(normalized, 2, STATS = sds, FUN = "/") # 2nd dimension is columns, scaling by array of sds, function = divide
head(normalized) # now a dataframe of Z scores
## Y X1 X2 X3
## 1 -1.12361582 1.2917755 1.46594453 -0.03373288
## 2 0.22323672 2.0161567 0.96675954 0.92405970
## 3 0.09561921 2.3563769 -0.04360831 2.78666549
## 4 0.22314547 1.4443322 -0.93879617 2.63805513
## 5 -1.52419265 1.6178742 1.13289455 1.17588852
## 6 1.09181433 -0.3511591 -0.51367586 -0.94007767
M <- as.matrix(normalized) # redefine M as our matrix of normalized variables
With apply() we apply a function to the specified margin of an array or matrix, and with sweep() we then perform whatever function is specified on all of the elements of the array along the given margin.
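As an aside, base R’s scale() function does the same centering and scaling in a single step; a quick check that it matches our apply()/sweep() result:
zscores <- scale(orig)  # centers each column on its mean and divides by its sd
all.equal(as.numeric(zscores), as.numeric(as.matrix(normalized)))  # should be TRUE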
Next, we take the Cholesky decomposition of our correlation matrix and then multiply our normalized data matrix by the decomposition matrix to yield a transformed dataset with the specified correlation among variables. The Cholesky decomposition factors certain symmetric (positive definite) matrices into the product of an upper triangular matrix and its transpose, such that:
\[R = U^\top U\]
U = chol(R)
newM = M %*% U
new = as.data.frame(newM)
names(new) = c("Y", "X1", "X2", "X3")
cor(new)  # note that this correlation matrix is what we were aiming for!
## Y X1 X2 X3
## Y 1.0000000000 0.8028696 -0.4954014 0.0004037263
## X1 0.8028695762 1.0000000 -0.2986878 0.2993721423
## X2 -0.4954013573 -0.2986878 1.0000000 0.5971177919
## X3 0.0004037263 0.2993721 0.5971178 1.0000000000
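We can also verify the decomposition itself: U is upper triangular, and multiplying its transpose by itself should give us back our original correlation matrix.
all.equal(t(U) %*% U, R)  # recovers R, i.e., R = t(U) %*% U... should be TRUE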
plot(orig)
plot(new) # note the axis scales; using `pairs(new)` would plot the same
Finally, we can scale these back out to the means and standard deviations of our original random variables.
df <- sweep(new, 2, STATS = sds, FUN = "*")  # multiply by the original standard deviations...
df <- sweep(df, 2, STATS = ms, FUN = "+")  # ... and then add back the original means
head(df)
## Y X1 X2 X3
## 1 9.574545 37.55654 12.8616327 47.070549
## 2 16.204971 67.37607 9.0485737 57.261283
## 3 15.576722 69.38831 6.1675303 68.318009
## 4 16.204521 60.60865 2.3559847 51.183798
## 5 7.602543 35.09539 12.7507842 58.158520
## 6 20.480895 53.06833 0.9028463 6.522947
cor(df)
## Y X1 X2 X3
## Y 1.0000000000 0.8028696 -0.4954014 0.0004037263
## X1 0.8028695762 1.0000000 -0.2986878 0.2993721423
## X2 -0.4954013573 -0.2986878 1.0000000 0.5971177919
## X3 0.0004037263 0.2993721 0.5971178 1.0000000000
plot(df) # note the change to the axis scales; using `pairs(df)` would produce the same plot
Now, we have a dataframe, df, comprising correlated random variables in our original units!
Let’s explore this dataset first with single and then with multivariate regression.
Start off by making some bivariate scatterplots in {ggplot2}. Then, using simple linear regression as implemented with lm(), how does the response variable (\(Y\)) vary with each predictor variable (\(X1\), \(X2\), \(X3\))? Are the \(\beta_1\) coefficients significant? How much of the variation in \(Y\) does each predictor explain in a simple bivariate linear model?
library(ggplot2)
require(gridExtra)
## Loading required package: gridExtra
g1 <- ggplot(data = df, aes(x = X1, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
g2 <- ggplot(data = df, aes(x = X2, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
g3 <- ggplot(data = df, aes(x = X3, y = Y)) + geom_point() + geom_smooth(method = "lm",
formula = y ~ x)
grid.arrange(g1, g2, g3, ncol = 3)
m1 <- lm(data = df, formula = Y ~ X1)
summary(m1)
##
## Call:
## lm(formula = Y ~ X1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0967 -1.9954 -0.0438 1.8376 10.1349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.140026 0.208995 34.16 <2e-16 ***
## X1 0.199157 0.004681 42.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 998 degrees of freedom
## Multiple R-squared: 0.6446, Adjusted R-squared: 0.6442
## F-statistic: 1810 on 1 and 998 DF, p-value: < 2.2e-16
m2 <- lm(data = df, formula = Y ~ X2)
summary(m2)
##
## Call:
## lm(formula = Y ~ X2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7767 -2.6678 0.0539 2.7524 14.1840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2191 0.2195 83.02 <2e-16 ***
## X2 -0.6269 0.0348 -18.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.278 on 998 degrees of freedom
## Multiple R-squared: 0.2454, Adjusted R-squared: 0.2447
## F-statistic: 324.6 on 1 and 998 DF, p-value: < 2.2e-16
m3 <- lm(data = df, formula = Y ~ X3)
summary(m3)
##
## Call:
## lm(formula = Y ~ X3, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7501 -3.3017 0.0467 3.2451 19.6751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.510e+01 2.799e-01 53.952 <2e-16 ***
## X3 1.275e-04 9.993e-03 0.013 0.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.925 on 998 degrees of freedom
## Multiple R-squared: 1.63e-07, Adjusted R-squared: -0.001002
## F-statistic: 0.0001627 on 1 and 998 DF, p-value: 0.9898
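To answer the last of the questions above, we can pull the coefficient of determination out of each bivariate model object in one go:
# proportion of the variation in Y explained by each predictor on its own
sapply(list(m1 = m1, m2 = m2, m3 = m3), function(x) summary(x)$r.squared)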
In simple linear regression, \(Y\) has a significant, positive relationship with \(X1\), a significant negative relationship with \(X2\), and no significant bivariate relationship with \(X3\).
Now let’s move on to doing actual multiple regression. To review, with multiple regression, we are looking to model a response variable in terms of two or more predictor variables so we can evaluate the effect of several explanatory variables.
Using lm() and formula notation, we can fit a model with all three predictor variables. The + sign is used to add additional predictors to our model.
m <- lm(data = df, formula = Y ~ X1 + X2 + X3)
coef(m)
## (Intercept) X1 X2 X3
## 9.42305696 0.19402584 -0.24030088 -0.03799875
summary(m)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6321 -1.8096 -0.0481 1.6659 7.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.423057 0.255370 36.900 < 2e-16 ***
## X1 0.194026 0.005557 34.915 < 2e-16 ***
## X2 -0.240301 0.033720 -7.126 1.98e-12 ***
## X3 -0.037999 0.008414 -4.516 7.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.599 on 996 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.7212
## F-statistic: 862.3 on 3 and 996 DF, p-value: < 2.2e-16
# let's check if our residuals are random normal...
plot(fitted(m), residuals(m))
hist(residuals(m))
qqnorm(residuals(m))
What does this output tell us? First off, the result of the omnibus F test tells us that the overall model is significant; using these three variables, we explain significantly more of the variation in the response variable, \(Y\), than we would using a model with just an intercept, i.e., predicting \(Y\) as mean(\(Y\)).
For a multiple regression model, we calculate the F statistic as follows:
\[F = \frac{R^2 \times (n - p - 1)}{(1 - R^2) \times p}\]
Where:
\(R^2\) = multiple R squared value
\(n\) = number of data points
\(p\) = number of parameters estimated from the data (i.e., the number of \(\beta\) coefficients, not including the intercept)
f <- (summary(m)$r.squared * (nrow(df) - (ncol(df) - 1) - 1))/((1 - summary(m)$r.squared) *
(ncol(df) - 1))
f
## [1] 862.3273
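The p-value reported for the omnibus test can then be recovered from the F distribution with \(p\) numerator and \(n - p - 1\) denominator degrees of freedom:
pf(f, df1 = ncol(df) - 1, df2 = nrow(df) - (ncol(df) - 1) - 1, lower.tail = FALSE)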
Second, looking at the summary() output, we see that the \(\beta\) coefficient for each of our predictor variables (including \(X3\)) is significant. That is, each predictor is significant even when the effects of the other predictor variables are held constant. Recall that in the simple linear regression, the \(\beta\) coefficient for \(X3\) was not significant.
Third, we can interpret our \(\beta\) coefficients as we did in simple linear regression… for each change of one unit in a particular predictor variable (holding the other predictors constant), our predicted value of the response variable changes by \(\beta\) units.
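For example, using hypothetical predictor values close to the variable means we simulated, a predicted value built by hand from the coefficients matches what predict() returns:
new.obs <- data.frame(X1 = 40, X2 = 5, X3 = 23)  # hypothetical values for each predictor
sum(coef(m) * c(1, unlist(new.obs)))  # beta0 + beta1*X1 + beta2*X2 + beta3*X3
predict(m, newdata = new.obs)  # same predicted value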
Load up the “zombies.csv” dataset again and run a linear model of height as a function of both weight and age. Is the overall model significant? Are both predictor variables significant when the other is controlled for?
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall21/zombies.csv")
z <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(z)
## id first_name last_name gender height weight zombies_killed
## 1 1 Sarah Little Female 62.88951 132.0872 2
## 2 2 Mark Duncan Male 67.80277 146.3753 5
## 3 3 Brandon Perez Male 72.12908 152.9370 1
## 4 4 Roger Coleman Male 66.78484 129.7418 5
## 5 5 Tammy Powell Female 64.71832 132.4265 4
## 6 6 Anthony Green Male 71.24326 152.5246 1
## years_of_education major age
## 1 1 medicine/nursing 17.64275
## 2 3 criminal justice administration 22.58951
## 3 1 education 21.91276
## 4 6 energy studies 18.19058
## 5 3 logistics 21.10399
## 6 4 energy studies 21.48355
m <- lm(data = z, height ~ weight + age)
summary(m)
##
## Call:
## lm(formula = height ~ weight + age, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2278 -1.1782 -0.0574 1.1566 5.4117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.763388 0.470797 67.47 <2e-16 ***
## weight 0.163107 0.002976 54.80 <2e-16 ***
## age 0.618270 0.018471 33.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.64 on 997 degrees of freedom
## Multiple R-squared: 0.8555, Adjusted R-squared: 0.8553
## F-statistic: 2952 on 2 and 997 DF, p-value: < 2.2e-16
We can use the same linear modeling approach to do analysis of covariance, where we have a continuous response variable and a combination of continuous and categorical predictor variables. Let’s return to our “zombies.csv” dataset and now include one continuous and one categorical variable in our model… we want to predict height as a function of age and gender, and we want to evaluate the predictors using Type II sums of squares (via the Anova() function from the {car} package). What is our model formula?
library(car)
## Loading required package: carData
m <- lm(data = z, formula = height ~ gender + age)
summary(m)
##
## Call:
## lm(formula = height ~ gender + age, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1909 -1.7173 0.1217 1.7670 7.6746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.74251 0.56869 82.19 <2e-16 ***
## genderMale 4.00224 0.16461 24.31 <2e-16 ***
## age 0.94091 0.02777 33.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 997 degrees of freedom
## Multiple R-squared: 0.6361, Adjusted R-squared: 0.6354
## F-statistic: 871.5 on 2 and 997 DF, p-value: < 2.2e-16
m.aov <- Anova(m, type = "II")
m.aov
## Anova Table (Type II tests)
##
## Response: height
## Sum Sq Df F value Pr(>F)
## gender 4003.9 1 591.15 < 2.2e-16 ***
## age 7775.6 1 1148.01 < 2.2e-16 ***
## Residuals 6752.7 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fitted(m), residuals(m))
hist(residuals(m))
qqnorm(residuals(m))
How do we interpret these results?
The omnibus F test is significant
Both predictors are significant
Controlling for age, being male adds 4 inches to predicted height when compared to being female.
We can write two equations for the relationship between height on the one hand and age and gender on the other:
For females, height = 46.74251 + 0.94091 x age
In this case, females are the first level of our “gender” factor and thus do not have additional regression coefficients associated with them.
For males, height = 46.74251 + 4.00224 + 0.94091 x age
Here, the additional 4.00224 added to the intercept term is the coefficient associated with genderMale
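We can check these two equations against predict() for, say, a hypothetical 20-year-old survivor of each gender:
predict(m, newdata = data.frame(age = 20, gender = c("Female", "Male")))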
library(ggplot2)
p <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
scale_color_manual(values = c("goldenrod", "blue"))
p <- p + geom_abline(slope = m$coefficients[3], intercept = m$coefficients[1],
color = "goldenrod4")
p <- p + geom_abline(slope = m$coefficients[3], intercept = m$coefficients[1] +
m$coefficients[2], color = "darkblue")
p
Note that this model is based on all of the data collectively… we are not doing separate linear models for males and females, which could result in different slopes and intercepts for each sex. Below, we will explore a model where we posit an interaction between age and sex, which would require estimation of four separate parameters (i.e., both a slope and an intercept for males and females, rather than, as above, different intercepts for males and females but the same slope for each sex).
Using the confint() function on our ANCOVA model reveals the confidence intervals for each of the coefficients in our multiple regression, just as it did for simple regression.
m <- lm(data = z, formula = height ~ age + gender)
summary(m)
##
## Call:
## lm(formula = height ~ age + gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1909 -1.7173 0.1217 1.7670 7.6746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.74251 0.56869 82.19 <2e-16 ***
## age 0.94091 0.02777 33.88 <2e-16 ***
## genderMale 4.00224 0.16461 24.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 997 degrees of freedom
## Multiple R-squared: 0.6361, Adjusted R-squared: 0.6354
## F-statistic: 871.5 on 2 and 997 DF, p-value: < 2.2e-16
confint(m, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 45.6265330 47.8584809
## age 0.8864191 0.9954081
## genderMale 3.6792172 4.3252593
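Under the hood, these are just the familiar intervals of the form estimate ± (critical t value) × (standard error); a quick sketch for the age coefficient:
est <- summary(m)$coefficients["age", "Estimate"]
se <- summary(m)$coefficients["age", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = m$df.residual) * se  # matches the confint() row for age above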
Similarly, using predict() allows us to determine confidence intervals for the predicted mean response and prediction intervals for individual responses for a given combination of predictor variables.
What is the estimated mean height, in inches, for a 29 year old male who has survived the zombie apocalypse?
What is the 95% confidence interval around this mean height?
What is the 95% prediction interval for the individual heights of 29 year old male survivors?
ci <- predict(m, newdata = data.frame(age = 29, gender = "Male"), interval = "confidence",
level = 0.95)
ci
## fit lwr upr
## 1 78.03124 77.49345 78.56903
pi <- predict(m, newdata = data.frame(age = 29, gender = "Male"), interval = "prediction",
level = 0.95)
pi
## fit lwr upr
## 1 78.03124 72.89597 83.16651
So far, we have only considered the joint main effects of multiple predictors on a response variable, but often there are interactive effects between our predictors. An interactive effect is an additional change in the response that occurs because of particular combinations of predictors or because the relationship of one continuous variable to a response is contingent on a particular level of a categorical variable. We explored the former case a bit when we looked at ANOVAs involving two discrete predictors. Now, we’ll consider the latter case… is there an interactive effect of sex AND age on height in our population of zombie apocalypse survivors?
Using formula notation, it is easy for us to consider interactions between predictors. The colon (:) operator allows us to specify particular interactions we want to consider; we can use the asterisk (*) operator to specify a full model, i.e., all single-term predictors plus their interactions.
m <- lm(data = z, height ~ age + gender + age:gender)  # or, equivalently, height ~ age * gender (see below)
summary(m)
##
## Call:
## lm(formula = height ~ age + gender + age:gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7985 -1.6973 0.1189 1.7662 7.9473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.18107 0.79839 60.348 <2e-16 ***
## age 0.86913 0.03941 22.053 <2e-16 ***
## genderMale 1.15975 1.12247 1.033 0.3018
## age:genderMale 0.14179 0.05539 2.560 0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 996 degrees of freedom
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6374
## F-statistic: 586.4 on 3 and 996 DF, p-value: < 2.2e-16
m <- lm(data = z, height ~ age * gender)
summary(m)
##
## Call:
## lm(formula = height ~ age * gender, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7985 -1.6973 0.1189 1.7662 7.9473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.18107 0.79839 60.348 <2e-16 ***
## age 0.86913 0.03941 22.053 <2e-16 ***
## genderMale 1.15975 1.12247 1.033 0.3018
## age:genderMale 0.14179 0.05539 2.560 0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 996 degrees of freedom
## Multiple R-squared: 0.6385, Adjusted R-squared: 0.6374
## F-statistic: 586.4 on 3 and 996 DF, p-value: < 2.2e-16
coefficients(m)
## (Intercept) age genderMale age:genderMale
## 48.1810741 0.8691284 1.1597481 0.1417928
Here, when we allow an interaction, the main effect of gender is no longer significant, but there is a significant interaction effect between gender and age.
If we want to visualize this…
female height = 48.1810741 + 0.8691284 * age
male height = 48.1810741 + 1.1597481 + 0.8691284 * age + 0.1417928 * age
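We can pull these sex-specific intercepts and slopes out of the coefficient vector directly, which is also what the plotting code below uses:
b <- coef(m)
# intercept and slope for females (the reference level of gender)
c(intercept = unname(b["(Intercept)"]), slope = unname(b["age"]))
# intercept and slope for males (adding the genderMale and age:genderMale terms)
c(intercept = unname(b["(Intercept)"] + b["genderMale"]), slope = unname(b["age"] + b["age:genderMale"]))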
library(ggplot2)
library(gridExtra)
p1 <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
scale_color_manual(values = c("goldenrod", "blue"))
p1 <- p1 + geom_abline(slope = m$coefficients[2], intercept = m$coefficients[1],
color = "goldenrod4")
p1 <- p1 + geom_abline(slope = m$coefficients[2] + m$coefficients[4], intercept = m$coefficients[1] +
m$coefficients[3], color = "darkblue")
p1
p2 <- ggplot(data = z, aes(x = age, y = height)) + geom_point(aes(color = factor(gender))) +
    scale_color_manual(values = c("goldenrod", "blue")) + geom_smooth(method = "lm",
    aes(color = factor(gender)), fullrange = TRUE)  # fullrange is an argument to geom_smooth(), not an aesthetic
grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
Load in the “KamilarAndCooper.csv” dataset we have used previously
Reduce the dataset to the following variables: Family, Brain_Size_Female_Mean, Body_mass_female_mean, MeanGroupSize, DayLength_km, HomeRange_km2, and Move
Fit a Model I least squares multiple linear regression model using log(HomeRange_km2) as the response variable and log(Body_mass_female_mean), log(Brain_Size_Female_Mean), MeanGroupSize, and Move as predictor variables, and view a model summary.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall21/KamilarAndCooperData.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(d)
## Scientific_Name Family Genus Species
## 1 Allenopithecus_nigroviridis Cercopithecidae Allenopithecus nigroviridis
## 2 Allocebus_trichotis Cercopithecidae Allocebus trichotis
## 3 Alouatta_belzebul Atelidae Alouatta belzebul
## 4 Alouatta_caraya Atelidae Alouatta caraya
## 5 Alouatta_guariba Atelidae Alouatta guariba
## 6 Alouatta_palliata Atelidae Alouatta palliata
## Brain_Size_Species_Mean Brain_Size_Female_Mean Brain_size_Ref
## 1 58.02 53.70 Isler et al 2008
## 2 NA NA
## 3 52.84 51.19 Isler et al 2008
## 4 52.63 47.80 Isler et al 2008
## 5 51.70 49.08 Isler et al 2008
## 6 49.88 48.04 Isler et al 2008
## Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
## 1 6130 3180 1.928
## 2 92 84 1.095
## 3 7270 5520 1.317
## 4 6525 4240 1.539
## 5 5800 4550 1.275
## 6 7150 5350 1.336
## Mass_Ref MeanGroupSize AdultMales AdultFemale AdultSexRatio
## 1 Isler et al 2008 NA NA NA NA
## 2 Smith and Jungers 1997 1.00 1.00 1.0 NA
## 3 Isler et al 2008 7.00 1.00 1.0 1.00
## 4 Isler et al 2008 8.00 2.30 3.3 1.43
## 5 Isler et al 2008 6.53 1.37 2.2 1.61
## 6 Isler et al 2008 12.00 2.90 6.3 2.17
## Social_Organization_Ref
## 1
## 2 Kappeler 1997
## 3 Campbell et al 2007
## 4 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## 5 Campbell et al 2007
## 6 van Schaik et al. 1999; Kappeler and Pereira 2003; Nunn & van Schaik 2000
## InterbirthInterval_d Gestation WeaningAge_d MaxLongevity_m LitterSz
## 1 NA NA 106.15 276.0 1.01
## 2 NA NA NA NA 1.00
## 3 NA NA NA NA NA
## 4 337.62 187 323.16 243.6 1.01
## 5 NA NA NA NA NA
## 6 684.37 186 495.60 300.0 1.02
## Life_History_Ref GR_MidRangeLat_dd Precip_Mean_mm Temp_Mean_degC AET_Mean_mm
## 1 Jones et al. 2009 -0.17 1574.0 25.2 1517.8
## 2 -16.59 1902.3 20.3 1388.2
## 3 -6.80 1643.5 24.9 1286.6
## 4 Jones et al. 2009 -20.34 1166.4 22.9 1193.1
## 5 -21.13 1332.3 19.6 1225.7
## 6 Jones et al. 2009 6.95 1852.6 23.7 1300.0
## PET_Mean_mm Climate_Ref HomeRange_km2 HomeRangeRef DayLength_km
## 1 1589.4 Jones et al. 2009 NA NA
## 2 1653.7 Jones et al. 2009 NA NA
## 3 1549.8 Jones et al. 2009 NA NA
## 4 1404.9 Jones et al. 2009 NA 0.40
## 5 1332.2 Jones et al. 2009 0.03 Jones et al. 2009 NA
## 6 1633.9 Jones et al. 2009 0.19 Jones et al. 2009 0.32
## DayLengthRef Territoriality Fruit Leaves Fauna DietRef1
## 1 NA NA
## 2 NA NA
## 3 NA 57.3 19.1 0.0 Campbell et al. 2007
## 4 Nunn et al. 2003 NA 23.8 67.7 0.0 Campbell et al. 2007
## 5 NA 5.2 73.0 0.0 Campbell et al. 2007
## 6 Nunn et al. 2003 0.6506 33.1 56.4 0.0 Campbell et al. 2007
## Canine_Dimorphism Canine_Dimorphism_Ref Feed Move Rest Social
## 1 2.210 Plavcan & Ruff 2008 NA NA NA NA
## 2 NA NA NA NA NA
## 3 1.811 Plavcan & Ruff 2008 13.75 18.75 57.30 10.00
## 4 1.542 Plavcan & Ruff 2008 15.90 17.60 61.60 4.90
## 5 1.783 Plavcan & Ruff 2008 18.33 14.33 64.37 3.00
## 6 1.703 Plavcan & Ruff 2008 17.94 12.32 66.14 3.64
## Activity_Budget_Ref
## 1
## 2
## 3 Campbell et al. 2007
## 4 Campbell et al. 2007
## 5 Campbell et al. 2007
## 6 Campbell et al. 2007
d <- select(d, Brain_Size_Female_Mean, Family, Body_mass_female_mean, MeanGroupSize,
DayLength_km, HomeRange_km2, Move)
Look at the estimated regression coefficients for the fitted model and interpret them. Are any of them statistically significant? What can you infer about the relationship between the response and predictors?
Report and interpret the coefficient of determination and the outcome of the omnibus F test.
Examine the residuals… are they normally distributed?
What happens if you remove the “Move” term?
m <- lm(data = d, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
MeanGroupSize + Move)
summary(m)
##
## Call:
## lm(formula = log(HomeRange_km2) ~ log(Body_mass_female_mean) +
## log(Brain_Size_Female_Mean) + MeanGroupSize + Move, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7978 -0.6473 -0.0038 0.8807 2.1598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.955853 1.957472 -3.553 0.000865 ***
## log(Body_mass_female_mean) 0.315276 0.468439 0.673 0.504153
## log(Brain_Size_Female_Mean) 0.614460 0.591100 1.040 0.303771
## MeanGroupSize 0.034026 0.009793 3.475 0.001095 **
## Move 0.025916 0.019559 1.325 0.191441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.127 on 48 degrees of freedom
## (160 observations deleted due to missingness)
## Multiple R-squared: 0.6359, Adjusted R-squared: 0.6055
## F-statistic: 20.95 on 4 and 48 DF, p-value: 4.806e-10
plot(m$residuals)
qqnorm(m$residuals)
shapiro.test(m$residuals)
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.96517, p-value = 0.1242
m <- lm(data = d, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
MeanGroupSize)
summary(m)
##
## Call:
## lm(formula = log(HomeRange_km2) ~ log(Body_mass_female_mean) +
## log(Brain_Size_Female_Mean) + MeanGroupSize, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7982 -0.7310 -0.0140 0.8386 3.1926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.354845 1.176046 -4.553 1.45e-05 ***
## log(Body_mass_female_mean) -0.181627 0.311382 -0.583 0.560972
## log(Brain_Size_Female_Mean) 1.390536 0.398840 3.486 0.000721 ***
## MeanGroupSize 0.030433 0.008427 3.611 0.000473 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.228 on 103 degrees of freedom
## (106 observations deleted due to missingness)
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.6672
## F-statistic: 71.85 on 3 and 103 DF, p-value: < 2.2e-16
plot(m$residuals)
qqnorm(m$residuals)
shapiro.test(m$residuals) # no significant deviation from normal
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.99366, p-value = 0.9058
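One caveat if you want to formally compare the models with and without Move: lm() dropped different sets of rows due to missingness (48 vs. 103 residual degrees of freedom), so a direct comparison requires refitting both models on the same observations. A hedged sketch, using a complete-cases subset:
d.cc <- na.omit(d[, c("HomeRange_km2", "Body_mass_female_mean", "Brain_Size_Female_Mean",
    "MeanGroupSize", "Move")])
m.full <- lm(data = d.cc, log(HomeRange_km2) ~ log(Body_mass_female_mean) + log(Brain_Size_Female_Mean) +
    MeanGroupSize + Move)
m.reduced <- update(m.full, . ~ . - Move)
anova(m.reduced, m.full)  # partial F test for dropping Move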