Preliminaries
For this module, you will need to install the following packages:
- {curl}
- {MASS}
- {candisc}
- {ggplot2}
- {klaR}
- {mvnormtest}
- {car}
Objectives
The objective of this module is to introduce discriminant analysis in general, and to describe and work through examples of three different types of discriminant analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and canonical discriminant analysis (CDA).
Introduction
What is Discriminant Analysis?
Discriminant analysis is used when the objective of an analysis is to assess how well observations can be classified into groups, including a priori groups. This differs from a similar statistical tool, principal component analysis (PCA), which focuses on overall variation in the data without regard for any particular grouping of the observations.
In this way, discriminant analysis allows us to build classifiers: through this type of analysis, we can build a predictive model of group membership and then apply it to new cases to see into which group each new case falls. As opposed to regression techniques, discriminant analysis outputs a group label rather than a real-valued prediction.
Furthermore, discriminant analysis can be linear or it can fit other curves; it can also be two- or multi-dimensional, with either a line or a plane separating the categories, depending on the number of dimensions. We will see some examples of how this works, and the kinds of visuals that can be produced, as part of this tutorial.
While discriminant analysis parallels multiple regression analysis in some ways, one important difference is that regression analysis works with a continuous dependent variable, while discriminant analysis must use a discrete dependent variable.
A simple example of discriminant analysis would be to consider the following situation: An academic advisor splits a cohort of students into three groups: 1) those who graduate and do not plan to go on to graduate school, 2) those who graduate and go on to graduate school, and 3) those who drop out of college. A variety of data could be collected for each group during their time in college, before knowing which group they will fall into at the end. After graduation for that cohort, all the students will fall into one of these categories. Discriminant analysis could then be used to determine which variable(s) that were collected work as the best predictors of the students’ educational outcomes.
Linear Discriminant Analysis (LDA)
Just like PCA or the other discriminant analyses covered below, the goal is to see whether certain quantitative variables can describe a categorical variable. If so, when we plot the results, we should be able to see reasonably clear separation of the categories. LDA uses the data to divide the predictor space into categorical regions with linear boundaries. Unlike PCA, LDA orders its dimensions by how much separation between the categories each one achieves, maximizing the difference between clusters while minimizing their overlap.
PCA can be thought of as an “unsupervised” algorithm that ignores class labels and searches for the directions (principal components) that maximize variance. LDA, in contrast, is “supervised”: it computes the directions (linear discriminants) that maximize separation between the known classes. LDA tends to work well when there are multiple classes and the number of samples per class is relatively small. In contrast with QDA, LDA assumes that the covariance matrix of the independent variables is the same for every class.
Here we have the blue and green clusters projected onto the x-axis; if future points behave according to the fitted probability density functions, each should be classified into a single cluster.
The LDA returns output giving scores that define how heavily each variable is weighted in the discriminant function. These scores are calculated with this equation:
\[ \delta_i (X) = X^T \Sigma^{-1} \mu_i - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln(\pi_i)\]
- \(\delta_i\) is the discriminant score for class \(i\)
- \(X\) is the matrix of independent variables
- \(\mu_i\) is a vector containing the means of each variable for class \(i\)
- \(\Sigma\) is the covariance matrix of the variables (assumed to be the same for all classes)
- \(\pi_i\) is the prior probability that an observation belongs to class \(i\)
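To make the formula concrete, here is a minimal sketch of computing the discriminant scores by hand for a two-class problem. The data here are made up (not the dataset used below), and `delta()` is just an illustrative helper:

```r
set.seed(1)
# hypothetical two-class training data: 10 observations x 2 variables each
x1 <- matrix(rnorm(20, mean = 0), ncol = 2)  # class 1
x2 <- matrix(rnorm(20, mean = 3), ncol = 2)  # class 2
mu1 <- colMeans(x1); mu2 <- colMeans(x2)     # class mean vectors
# pooled covariance matrix: LDA assumes one Sigma shared by all classes
S <- ((nrow(x1) - 1) * cov(x1) + (nrow(x2) - 1) * cov(x2)) /
  (nrow(x1) + nrow(x2) - 2)
Sinv <- solve(S)
prior <- c(0.5, 0.5)                         # equal prior probabilities
# discriminant score delta_i(x) for one new observation x
delta <- function(x, mu, p) {
  as.numeric(t(x) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(p))
}
x_new <- c(2.5, 2.5)
scores <- c(delta(x_new, mu1, prior[1]), delta(x_new, mu2, prior[2]))
which.max(scores)  # x_new is assigned to the class with the larger score
```

Since `x_new` sits close to the class 2 mean, its class 2 score comes out larger.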
But why would we ever do this by hand when we have the lovely lda() function in the {MASS} R package?!
LDA CHALLENGE
Packages we need:
library(curl)
library(MASS)
library(ggplot2)
We can use the Kamilar and Cooper dataset to take a look at how this works. For the sake of simplicity, our version of this dataset only includes the primate families Atelidae and Cebidae plus two genera of Cercopithecidae (macaques and mandrills). The dataset also includes five numeric variables: mean species brain size, mean female brain size, mean male body mass, mean female body mass, and mass dimorphism. Furthermore, we removed any species within these families that was missing values for any of the variables. Let’s load this data in first:
f <- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/kamcoop.simplified.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(d)
## Scientific_Name Family Genus Species
## 1 Callithrix_pygmaea Cebidae Callithrix pygmaea
## 2 Callithrix_jacchus Cebidae Callithrix jacchus
## 3 Callithrix_penicillata Cebidae Callithrix penicillata
## 4 Saguinus_fuscicollis Cebidae Saguinus fuscicollis
## 5 Callithrix_argentata Cebidae Callithrix argentata
## 6 Saguinus_niger Cebidae Saguinus niger
## Brain_Size_Species_Mean Brain_Size_Female_Mean Body_mass_male_mean
## 1 4.17 4.33 110.0
## 2 7.24 7.05 318.0
## 3 7.32 7.38 343.8
## 4 7.94 8.03 404.6
## 5 7.95 8.36 330.0
## 6 9.48 9.62 380.0
## Body_mass_female_mean Mass_Dimorphism
## 1 122.0 0.902
## 2 322.0 0.988
## 3 311.7 1.103
## 4 396.6 1.020
## 5 360.0 0.917
## 6 370.0 1.027
summary(d)
## Scientific_Name Family Genus
## Alouatta_belzebul : 1 Atelidae :12 Macaca :13
## Alouatta_caraya : 1 Cebidae :24 Saguinus : 7
## Alouatta_guariba : 1 Cercopithecidae:15 Alouatta : 6
## Alouatta_palliata : 1 Cebus : 5
## Alouatta_pigra : 1 Ateles : 4
## Alouatta_seniculus: 1 Callithrix: 4
## (Other) :45 (Other) :12
## Species Brain_Size_Species_Mean Brain_Size_Female_Mean
## geoffroyi : 2 Min. : 4.17 Min. : 4.33
## albifrons : 1 1st Qu.: 12.34 1st Qu.: 12.29
## apella : 1 Median : 63.98 Median : 62.14
## arachnoides: 1 Mean : 58.46 Mean : 56.46
## arctoides : 1 3rd Qu.: 91.83 3rd Qu.: 86.40
## argentata : 1 Max. :153.88 Max. :146.63
## (Other) :44
## Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
## Min. : 110.0 Min. : 122 Min. :0.871
## 1st Qu.: 624.6 1st Qu.: 639 1st Qu.:1.032
## Median : 5680.0 Median : 3518 Median :1.226
## Mean : 5679.8 Mean : 4006 Mean :1.267
## 3rd Qu.: 8202.5 3rd Qu.: 6485 3rd Qu.:1.398
## Max. :34400.0 Max. :12800 Max. :2.688
##
str(d) # There are 3 different families, this is probably what we will use for the analysis
## 'data.frame': 51 obs. of 9 variables:
## $ Scientific_Name : Factor w/ 51 levels "Alouatta_belzebul",..: 19 17 18 43 16 48 45 49 46 44 ...
## $ Family : Factor w/ 3 levels "Atelidae","Cebidae",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Genus : Factor w/ 13 levels "Alouatta","Aotus",..: 6 6 6 12 6 12 12 12 12 12 ...
## $ Species : Factor w/ 50 levels "albifrons","apella",..: 40 21 38 17 5 31 26 33 27 18 ...
## $ Brain_Size_Species_Mean: num 4.17 7.24 7.32 7.94 7.95 ...
## $ Brain_Size_Female_Mean : num 4.33 7.05 7.38 8.03 8.36 ...
## $ Body_mass_male_mean : num 110 318 344 405 330 ...
## $ Body_mass_female_mean : num 122 322 312 397 360 ...
## $ Mass_Dimorphism : num 0.902 0.988 1.103 1.02 0.917 ...
plot(d$Family, d$Body_mass_male_mean) #plot family against male body mass so we can start to see what differences we may be looking for
This simple boxplot shows what we will be looking for in the LDA: in terms of male body mass, Cercopithecidae and Atelidae are close to one another, while species within Cebidae are a bit smaller.
Now we’re going to look for this difference in body mass (male and female) using an LDA. We will do this by simply running the lda() function from the {MASS} package.
kamcoop.lda <- lda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d) #categorical dependent variable is family
kamcoop.lda
## Call:
## lda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d)
##
## Prior probabilities of groups:
## Atelidae Cebidae Cercopithecidae
## 0.2352941 0.4705882 0.2941176
##
## Group means:
## Body_mass_male_mean Body_mass_female_mean
## Atelidae 7894.583 6616.2000
## Cebidae 1113.621 942.2208
## Cercopithecidae 11213.687 6820.7333
##
## Coefficients of linear discriminants:
## LD1 LD2
## Body_mass_male_mean 0.0001091874 0.0004160871
## Body_mass_female_mean -0.0007126010 -0.0006123181
##
## Proportion of trace:
## LD1 LD2
## 0.9245 0.0755
The output gives: the prior probabilities of the groups (the proportion of observations in each group before fitting), the group means, the coefficients of the linear discriminants (the weights that actually discriminate between the groups), and the proportion of trace, which is the percentage of between-group separation achieved by each discriminant function (here roughly 92% and 8%).
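If you want to verify the proportion of trace yourself, it can be recovered from the singular values stored in the fitted object. A self-contained sketch using R’s built-in iris data (rather than our kamcoop model):

```r
library(MASS)
# fit an LDA with three classes, so two linear discriminants exist
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)
# fit$svd holds the singular values; squaring and normalizing them
# reproduces the "Proportion of trace" line printed with the model
prop.table(fit$svd^2)
```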
Now we can generate predictions from the fitted LDA and store them in an object by passing the model to the predict() function. The number of predicted values will correspond to the number of rows in the processed data.
lda.predict <- predict(kamcoop.lda)
lda.predict
## $class
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Cercopithecidae
## [21] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [25] Cercopithecidae Cebidae Cebidae Cebidae
## [29] Cebidae Cebidae Cebidae Cebidae
## [33] Cebidae Atelidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Atelidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Atelidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
##
## $posterior
## Atelidae Cebidae Cercopithecidae
## 1 0.0003401910 9.985793e-01 0.001080556
## 2 0.0005170006 9.979535e-01 0.001529514
## 3 0.0004944403 9.980079e-01 0.001497706
## 4 0.0006008810 9.976594e-01 0.001739715
## 5 0.0005695940 9.977927e-01 0.001637707
## 6 0.0005672634 9.977720e-01 0.001660756
## 7 0.0008313975 9.969330e-01 0.002235615
## 8 0.0006813717 9.974199e-01 0.001898767
## 9 0.0009440186 9.966033e-01 0.002452714
## 10 0.0008265242 9.969545e-01 0.002218965
## 11 0.0007689420 9.970221e-01 0.002208925
## 12 0.0006888250 9.973453e-01 0.001965889
## 13 0.0012144544 9.958130e-01 0.002972548
## 14 0.0009120509 9.966217e-01 0.002466270
## 15 0.0011599079 9.958161e-01 0.003024039
## 16 0.0017507576 9.938694e-01 0.004379867
## 17 0.0036193459 9.889666e-01 0.007414008
## 18 0.0011691551 9.956853e-01 0.003145514
## 19 0.0009590634 9.962439e-01 0.002796995
## 20 0.4290728510 6.050376e-02 0.510423390
## 21 0.1793044387 1.898460e-02 0.801710964
## 22 0.3656774322 1.987355e-01 0.435587061
## 23 0.1860623006 3.754918e-01 0.438445871
## 24 0.4585216246 4.367976e-02 0.497798618
## 25 0.4528980582 7.020209e-02 0.476899849
## 26 0.1170722469 6.512987e-01 0.231629015
## 27 0.0179987783 9.406955e-01 0.041305693
## 28 0.0173410302 9.480793e-01 0.034579622
## 29 0.0266849672 9.163905e-01 0.056924484
## 30 0.0390526327 8.094128e-01 0.151534570
## 31 0.0422509177 8.852036e-01 0.072545480
## 32 0.0247185679 9.227838e-01 0.052497635
## 33 0.1101921924 5.601168e-01 0.329690974
## 34 0.4678085311 9.858515e-02 0.433606316
## 35 0.3843191182 2.077976e-02 0.594901118
## 36 0.3415407910 1.629742e-01 0.495485041
## 37 0.4100832796 3.881187e-02 0.551104851
## 38 0.2652295227 7.227040e-03 0.727543437
## 39 0.4360041049 5.050641e-05 0.563945389
## 40 0.1580120318 9.053494e-02 0.751453026
## 41 0.7844391252 1.209391e-03 0.214351484
## 42 0.4709515941 3.645727e-04 0.528683833
## 43 0.5531257216 5.449115e-04 0.446329367
## 44 0.8327248616 1.531785e-04 0.167121960
## 45 0.7807387443 7.851417e-04 0.218476114
## 46 0.2221499844 1.442925e-02 0.763420769
## 47 0.9163544015 1.079450e-05 0.083634804
## 48 0.7297299422 2.725182e-04 0.269997540
## 49 0.9119560783 3.576075e-05 0.088008161
## 50 0.6846662311 1.892034e-07 0.315333580
## 51 0.0002806603 1.483931e-06 0.999717856
##
## $x
## LD1 LD2
## 1 2.15977336 0.060895255
## 2 2.03996414 0.024977753
## 3 2.05012097 0.042019676
## 4 1.99625973 0.015331966
## 5 2.01419555 0.006702712
## 6 2.01252891 0.021383885
## 7 1.90629370 -0.028006549
## 8 1.96300883 -0.009372415
## 9 1.87196813 -0.049444912
## 10 1.90840336 -0.029642017
## 11 1.92003236 0.022244958
## 12 1.95568731 0.010891077
## 13 1.80248368 -0.084981051
## 14 1.87626086 -0.018363736
## 15 1.80660276 -0.035422136
## 16 1.68366487 -0.047086628
## 17 1.48704196 -0.172339993
## 18 1.79894905 -0.009063233
## 19 1.84807248 0.046682304
## 20 -0.79702543 -0.211050571
## 21 -1.10258806 0.896016015
## 22 -0.37434761 -0.282913674
## 23 -0.07428043 0.208568065
## 24 -0.90506511 -0.265214193
## 25 -0.74748749 -0.316726099
## 26 0.27205282 0.009615968
## 27 0.95005969 0.010777016
## 28 0.98707482 -0.107495921
## 29 0.83032403 -0.026753052
## 30 0.57456127 0.493785278
## 31 0.71079578 -0.178960295
## 32 0.85721567 -0.034905569
## 33 0.17636982 0.355463396
## 34 -0.63042452 -0.438500307
## 35 -1.14039823 0.057863270
## 36 -0.44724832 -0.113362984
## 37 -0.94187477 -0.088950286
## 38 -1.44842928 0.566617812
## 39 -3.03755898 0.242137164
## 40 -0.58336407 0.855549871
## 41 -1.97287476 -1.179530436
## 42 -2.41906214 0.022619074
## 43 -2.28970027 -0.262610351
## 44 -2.58898906 -1.318999043
## 45 -2.11068333 -1.136812774
## 46 -1.21286551 0.705448192
## 47 -3.32204921 -1.819990911
## 48 -2.46669917 -0.852981509
## 49 -2.95417086 -1.839039387
## 50 -4.76203306 -0.282623489
## 51 -3.13054627 6.565552811
These predictions can also help us assess the accuracy of the LDA by cross-tabulating the predicted classes (families based on the predicted values) against the actual families from the data.
lda.predict.class <- predict(kamcoop.lda,
newdata=d[,c(7,8)]
)$class #find prediction values of class
lda.predict.class #gives the classifications that are in the predictions
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Cercopithecidae
## [21] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [25] Cercopithecidae Cebidae Cebidae Cebidae
## [29] Cebidae Cebidae Cebidae Cebidae
## [33] Cebidae Atelidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Atelidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Atelidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
#determine how well the model fits by comparing predictions to the actual families
table(lda.predict.class, d[,2])
##
## lda.predict.class Atelidae Cebidae Cercopithecidae
## Atelidae 6 0 3
## Cebidae 0 24 3
## Cercopithecidae 6 0 9
This is a cross-tabulation of the predicted families (rows) against the actual families (columns); it shows how many “Atelidae” were predicted to be “Atelidae”, and so on. Based on our table, the model is okay but not great: only 6 of the 12 Atelidae were correctly classified (the other 6 were misclassified as Cercopithecidae), and 6 of the 15 Cercopithecidae were misclassified (3 as Atelidae and 3 as Cebidae).
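Overall accuracy is simply the sum of the table’s diagonal divided by the total number of observations; for our table that is (6 + 24 + 9) / 51, about 76%. A self-contained sketch of the same calculation on the built-in iris data:

```r
library(MASS)
# fit an LDA on two sepal measurements and predict back onto the data
fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
pred <- predict(fit)$class
# rows are predicted classes, columns are the actual classes
tab <- table(Predicted = pred, Actual = iris$Species)
accuracy <- sum(diag(tab)) / sum(tab)  # proportion correctly classified
accuracy
```

Note that predicting back onto the training data, as here and in the tutorial, gives an optimistic estimate; cross-validation (e.g. `lda(..., CV = TRUE)`) gives a fairer one.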
Visualize LDA
One way we can visualize the output of the LDA is to plot the predicted linear discriminants for each family on stacked histograms.
ldahist(lda.predict$x[,1], g = d$Family)
ldahist(lda.predict$x[,2], g = d$Family)
As we saw earlier, Cercopithecidae has the most variance and is spread all over the place on the first discriminant, so we can expect that it won’t group together well in a scatter plot. On the second discriminant, all three families pile up toward the left, so they will probably overlap a bit in the scatter plot as well.
Making a scatter plot:
newdata <- data.frame(Family = d[,2], lda = lda.predict$x) #make a dataframe with the family names and predicted LDs
newdata
## Family lda.LD1 lda.LD2
## 1 Cebidae 2.15977336 0.060895255
## 2 Cebidae 2.03996414 0.024977753
## 3 Cebidae 2.05012097 0.042019676
## 4 Cebidae 1.99625973 0.015331966
## 5 Cebidae 2.01419555 0.006702712
## 6 Cebidae 2.01252891 0.021383885
## 7 Cebidae 1.90629370 -0.028006549
## 8 Cebidae 1.96300883 -0.009372415
## 9 Cebidae 1.87196813 -0.049444912
## 10 Cebidae 1.90840336 -0.029642017
## 11 Cebidae 1.92003236 0.022244958
## 12 Cebidae 1.95568731 0.010891077
## 13 Cebidae 1.80248368 -0.084981051
## 14 Cebidae 1.87626086 -0.018363736
## 15 Cebidae 1.80660276 -0.035422136
## 16 Cebidae 1.68366487 -0.047086628
## 17 Cebidae 1.48704196 -0.172339993
## 18 Cebidae 1.79894905 -0.009063233
## 19 Cebidae 1.84807248 0.046682304
## 20 Atelidae -0.79702543 -0.211050571
## 21 Atelidae -1.10258806 0.896016015
## 22 Atelidae -0.37434761 -0.282913674
## 23 Atelidae -0.07428043 0.208568065
## 24 Atelidae -0.90506511 -0.265214193
## 25 Atelidae -0.74748749 -0.316726099
## 26 Cercopithecidae 0.27205282 0.009615968
## 27 Cebidae 0.95005969 0.010777016
## 28 Cebidae 0.98707482 -0.107495921
## 29 Cebidae 0.83032403 -0.026753052
## 30 Cercopithecidae 0.57456127 0.493785278
## 31 Cebidae 0.71079578 -0.178960295
## 32 Cebidae 0.85721567 -0.034905569
## 33 Cercopithecidae 0.17636982 0.355463396
## 34 Cercopithecidae -0.63042452 -0.438500307
## 35 Cercopithecidae -1.14039823 0.057863270
## 36 Cercopithecidae -0.44724832 -0.113362984
## 37 Cercopithecidae -0.94187477 -0.088950286
## 38 Cercopithecidae -1.44842928 0.566617812
## 39 Cercopithecidae -3.03755898 0.242137164
## 40 Cercopithecidae -0.58336407 0.855549871
## 41 Atelidae -1.97287476 -1.179530436
## 42 Cercopithecidae -2.41906214 0.022619074
## 43 Cercopithecidae -2.28970027 -0.262610351
## 44 Atelidae -2.58898906 -1.318999043
## 45 Atelidae -2.11068333 -1.136812774
## 46 Cercopithecidae -1.21286551 0.705448192
## 47 Atelidae -3.32204921 -1.819990911
## 48 Atelidae -2.46669917 -0.852981509
## 49 Atelidae -2.95417086 -1.839039387
## 50 Cercopithecidae -4.76203306 -0.282623489
## 51 Cercopithecidae -3.13054627 6.565552811
library(ggplot2)
ggplot(newdata) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5)
The families are somewhat on top of each other, so let’s try adding another variable, mass dimorphism, from the same dataset:
kamcoop.lda.2 <- lda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Mass_Dimorphism, data = d) #categorical dependent variable is family
lda.predict.2 <- predict(kamcoop.lda.2)
lda.predict.2
## $class
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Cercopithecidae Atelidae Cercopithecidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Cercopithecidae Cebidae Cebidae
## [33] Cercopithecidae Atelidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Atelidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Atelidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
##
## $posterior
## Atelidae Cebidae Cercopithecidae
## 1 1.477516e-06 9.999983e-01 2.707484e-07
## 2 1.011385e-05 9.999870e-01 2.869293e-06
## 3 5.827367e-05 9.999107e-01 3.104874e-05
## 4 2.032249e-05 9.999729e-01 6.787852e-06
## 5 3.772918e-06 9.999955e-01 7.215359e-07
## 6 2.104817e-05 9.999716e-01 7.335173e-06
## 7 9.863534e-06 9.999880e-01 2.147339e-06
## 8 6.466605e-06 9.999922e-01 1.353236e-06
## 9 6.215873e-06 9.999927e-01 1.071590e-06
## 10 8.126629e-06 9.999902e-01 1.657814e-06
## 11 2.844193e-04 9.995041e-01 2.114447e-04
## 12 4.959365e-05 9.999293e-01 2.107350e-05
## 13 5.208518e-06 9.999941e-01 7.316696e-07
## 14 4.536749e-05 9.999385e-01 1.611665e-05
## 15 7.650989e-05 9.998947e-01 2.881898e-05
## 16 3.587390e-04 9.994512e-01 1.900477e-04
## 17 8.601006e-05 9.998960e-01 1.798848e-05
## 18 3.589728e-04 9.994064e-01 2.346014e-04
## 19 3.567732e-03 9.905437e-01 5.888552e-03
## 20 5.051472e-01 6.523237e-03 4.883296e-01
## 21 8.330024e-02 3.037484e-05 9.166694e-01
## 22 5.375877e-01 5.536599e-02 4.070464e-01
## 23 2.159895e-01 6.152515e-03 7.778580e-01
## 24 5.389498e-01 5.331601e-03 4.557186e-01
## 25 5.711225e-01 1.308015e-02 4.157973e-01
## 26 3.082461e-01 8.719978e-02 6.045541e-01
## 27 1.414057e-01 5.809160e-01 2.776783e-01
## 28 3.614377e-02 9.301642e-01 3.369202e-02
## 29 1.542144e-01 5.965021e-01 2.492836e-01
## 30 6.033347e-02 1.015470e-03 9.386511e-01
## 31 7.243274e-02 8.732434e-01 5.432388e-02
## 32 1.362293e-01 6.541424e-01 2.096283e-01
## 33 1.310767e-01 3.166082e-03 8.657572e-01
## 34 6.373236e-01 3.979355e-02 3.228829e-01
## 35 3.569428e-01 6.119489e-04 6.424452e-01
## 36 4.345348e-01 1.537974e-02 5.500855e-01
## 37 4.311229e-01 2.158437e-03 5.667186e-01
## 38 1.752746e-01 4.790270e-05 8.246775e-01
## 39 3.879168e-01 1.236134e-06 6.120820e-01
## 40 6.285453e-02 6.058919e-05 9.370849e-01
## 41 8.838837e-01 5.994420e-04 1.155169e-01
## 42 4.363092e-01 1.104061e-05 5.636798e-01
## 43 5.533482e-01 2.699007e-05 4.466248e-01
## 44 8.916639e-01 3.594499e-05 1.083001e-01
## 45 8.694343e-01 2.752187e-04 1.302904e-01
## 46 1.252067e-01 4.931946e-05 8.747439e-01
## 47 9.410597e-01 1.674786e-06 5.893866e-02
## 48 7.811075e-01 3.291880e-05 2.188596e-01
## 49 9.497654e-01 1.095331e-05 5.022364e-02
## 50 6.436569e-01 5.111265e-09 3.563431e-01
## 51 5.275722e-04 8.960915e-07 9.994715e-01
##
## $x
## LD1 LD2
## 1 3.4592684 -0.295870934
## 2 2.9226842 -0.105683080
## 3 2.4001611 0.222725251
## 4 2.7273353 -0.033559621
## 5 3.2226688 -0.317006815
## 6 2.7125022 -0.007040944
## 7 2.9675781 -0.281310464
## 8 3.0771011 -0.285362578
## 9 3.1151717 -0.412794871
## 10 3.0246864 -0.314508464
## 11 1.9618025 0.361322318
## 12 2.4728962 0.080061523
## 13 3.1885721 -0.540271141
## 14 2.5209846 -0.034912842
## 15 2.3839923 -0.023473655
## 16 1.9543191 0.122567087
## 17 2.4413069 -0.422882283
## 18 1.9234313 0.262852728
## 19 1.2214459 0.759854917
## 20 -1.1514635 -0.126271981
## 21 -2.3837613 1.310176731
## 22 -0.6055386 -0.179522183
## 23 -1.1494847 0.794611421
## 24 -1.1973563 -0.229873630
## 25 -0.9691959 -0.285411593
## 26 -0.4965037 0.510102850
## 27 0.1607580 0.652712850
## 28 0.7206121 0.251991426
## 29 0.1744247 0.519476388
## 30 -1.4928380 1.744650847
## 31 0.5661014 0.066481462
## 32 0.2347643 0.498019434
## 33 -1.2787374 1.191216397
## 34 -0.6697185 -0.474588899
## 35 -1.7385381 0.182138643
## 36 -0.9431740 0.107373363
## 37 -1.4292438 0.028914341
## 38 -2.3303087 0.727050896
## 39 -3.2642843 -0.239025362
## 40 -2.1893216 1.564629940
## 41 -1.5827300 -1.619761780
## 42 -2.7259116 -0.262854221
## 43 -2.4958442 -0.542461948
## 44 -2.2655981 -1.818310244
## 45 -1.7899681 -1.568688893
## 46 -2.2983278 1.010626825
## 47 -2.9353760 -2.426298375
## 48 -2.3769707 -1.257303037
## 49 -2.4514468 -2.440368008
## 50 -4.5840202 -1.256648346
## 51 -2.7589070 4.832508556
newdata.2 <- data.frame(Family = d[,2], lda = lda.predict.2$x)
newdata.2 # still only 2 LDs: lda() returns min(number of predictors, number of groups - 1) discriminants, here min(3, 2) = 2
## Family lda.LD1 lda.LD2
## 1 Cebidae 3.4592684 -0.295870934
## 2 Cebidae 2.9226842 -0.105683080
## 3 Cebidae 2.4001611 0.222725251
## 4 Cebidae 2.7273353 -0.033559621
## 5 Cebidae 3.2226688 -0.317006815
## 6 Cebidae 2.7125022 -0.007040944
## 7 Cebidae 2.9675781 -0.281310464
## 8 Cebidae 3.0771011 -0.285362578
## 9 Cebidae 3.1151717 -0.412794871
## 10 Cebidae 3.0246864 -0.314508464
## 11 Cebidae 1.9618025 0.361322318
## 12 Cebidae 2.4728962 0.080061523
## 13 Cebidae 3.1885721 -0.540271141
## 14 Cebidae 2.5209846 -0.034912842
## 15 Cebidae 2.3839923 -0.023473655
## 16 Cebidae 1.9543191 0.122567087
## 17 Cebidae 2.4413069 -0.422882283
## 18 Cebidae 1.9234313 0.262852728
## 19 Cebidae 1.2214459 0.759854917
## 20 Atelidae -1.1514635 -0.126271981
## 21 Atelidae -2.3837613 1.310176731
## 22 Atelidae -0.6055386 -0.179522183
## 23 Atelidae -1.1494847 0.794611421
## 24 Atelidae -1.1973563 -0.229873630
## 25 Atelidae -0.9691959 -0.285411593
## 26 Cercopithecidae -0.4965037 0.510102850
## 27 Cebidae 0.1607580 0.652712850
## 28 Cebidae 0.7206121 0.251991426
## 29 Cebidae 0.1744247 0.519476388
## 30 Cercopithecidae -1.4928380 1.744650847
## 31 Cebidae 0.5661014 0.066481462
## 32 Cebidae 0.2347643 0.498019434
## 33 Cercopithecidae -1.2787374 1.191216397
## 34 Cercopithecidae -0.6697185 -0.474588899
## 35 Cercopithecidae -1.7385381 0.182138643
## 36 Cercopithecidae -0.9431740 0.107373363
## 37 Cercopithecidae -1.4292438 0.028914341
## 38 Cercopithecidae -2.3303087 0.727050896
## 39 Cercopithecidae -3.2642843 -0.239025362
## 40 Cercopithecidae -2.1893216 1.564629940
## 41 Atelidae -1.5827300 -1.619761780
## 42 Cercopithecidae -2.7259116 -0.262854221
## 43 Cercopithecidae -2.4958442 -0.542461948
## 44 Atelidae -2.2655981 -1.818310244
## 45 Atelidae -1.7899681 -1.568688893
## 46 Cercopithecidae -2.2983278 1.010626825
## 47 Atelidae -2.9353760 -2.426298375
## 48 Atelidae -2.3769707 -1.257303037
## 49 Atelidae -2.4514468 -2.440368008
## 50 Cercopithecidae -4.5840202 -1.256648346
## 51 Cercopithecidae -2.7589070 4.832508556
library(ggplot2)
ggplot(newdata.2) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5)
There is a little bit of differentiation between Cebidae and the other two families. This makes sense because Cebidae includes capuchins and squirrel monkeys, which are pretty small in comparison to atelids and cercopithecids.
What about brain size???
kamcoop.lda.3 <- lda(Family ~ Brain_Size_Species_Mean, data = d)
lda.predict.3 <- predict(kamcoop.lda.3)
lda.predict.3
## $class
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Cebidae
## [21] Cebidae Cebidae Cebidae Cebidae
## [25] Cebidae Atelidae Atelidae Atelidae
## [29] Atelidae Atelidae Atelidae Atelidae
## [33] Atelidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [45] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [49] Cercopithecidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
##
## $posterior
## Atelidae Cebidae Cercopithecidae
## 1 0.008226368 9.903126e-01 0.001461073
## 2 0.010653790 9.873143e-01 0.002031893
## 3 0.010725694 9.872249e-01 0.002049406
## 4 0.011299404 9.865103e-01 0.002190308
## 5 0.011308901 9.864984e-01 0.002192658
## 6 0.012858399 9.845584e-01 0.002583172
## 7 0.013097723 9.842576e-01 0.002644716
## 8 0.013163749 9.841745e-01 0.002661750
## 9 0.013185830 9.841467e-01 0.002667453
## 10 0.013589570 9.836382e-01 0.002772185
## 11 0.014714144 9.822174e-01 0.003068482
## 12 0.015138391 9.816797e-01 0.003181955
## 13 0.015665934 9.810098e-01 0.003324312
## 14 0.017014998 9.792905e-01 0.003694472
## 15 0.022697858 9.719606e-01 0.005341565
## 16 0.023753296 9.705850e-01 0.005661731
## 17 0.032500727 9.590346e-01 0.008464633
## 18 0.043035056 9.448171e-01 0.012147881
## 19 0.046359986 9.402681e-01 0.013371865
## 20 0.252417014 6.181191e-01 0.129463892
## 21 0.268142171 5.902817e-01 0.141576100
## 22 0.275328500 5.773659e-01 0.147305586
## 23 0.287032581 5.560505e-01 0.156916916
## 24 0.289667241 5.512021e-01 0.159130665
## 25 0.319068403 4.956984e-01 0.185233174
## 26 0.408524291 3.008613e-01 0.290614399
## 27 0.419145287 2.723404e-01 0.308514301
## 28 0.423310488 2.604484e-01 0.316241072
## 29 0.426606489 2.506722e-01 0.322721315
## 30 0.441565086 1.997379e-01 0.358696979
## 31 0.442096357 1.976067e-01 0.360296916
## 32 0.450643727 1.547979e-01 0.394558364
## 33 0.453100642 1.319265e-01 0.414972877
## 34 0.446589812 7.082441e-02 0.482585776
## 35 0.438345110 5.383540e-02 0.507819487
## 36 0.436409582 5.092624e-02 0.512664178
## 37 0.424079176 3.710301e-02 0.538817818
## 38 0.418048282 3.224167e-02 0.549710045
## 39 0.406122250 2.479976e-02 0.569077990
## 40 0.398335224 2.104392e-02 0.580620851
## 41 0.390802139 1.801453e-02 0.591183327
## 42 0.370398246 1.193723e-02 0.617664519
## 43 0.359404469 9.586537e-03 0.631008994
## 44 0.354780799 8.742214e-03 0.636476987
## 45 0.348610446 7.728807e-03 0.643660746
## 46 0.346122618 7.353572e-03 0.646523810
## 47 0.303656933 3.087003e-03 0.693256063
## 48 0.291845994 2.401758e-03 0.705752249
## 49 0.290431983 2.329795e-03 0.707238222
## 50 0.166746549 9.546390e-05 0.833157987
## 51 0.148650155 5.156367e-05 0.851298281
##
## $x
## LD1
## 1 -2.1058802
## 2 -1.9867871
## 3 -1.9836837
## 4 -1.9596323
## 5 -1.9592444
## 6 -1.8998918
## 7 -1.8913574
## 8 -1.8890299
## 9 -1.8882540
## 10 -1.8742887
## 11 -1.8374358
## 12 -1.8242463
## 13 -1.8083414
## 14 -1.7699367
## 15 -1.6353266
## 16 -1.6139907
## 17 -1.4658031
## 18 -1.3311930
## 19 -1.2951159
## 20 -0.3326727
## 21 -0.2841821
## 22 -0.2620703
## 23 -0.2259932
## 24 -0.2178468
## 25 -0.1255205
## 26 0.2143022
## 27 0.2713273
## 28 0.2961545
## 29 0.3171025
## 30 0.4361956
## 31 0.4416266
## 32 0.5614955
## 33 0.6367531
## 34 0.9133440
## 35 1.0297216
## 36 1.0529972
## 37 1.1841160
## 38 1.2415289
## 39 1.3478205
## 40 1.4137679
## 41 1.4758360
## 42 1.6387647
## 43 1.7248841
## 44 1.7609612
## 45 1.8090640
## 46 1.8284602
## 47 2.1640158
## 48 2.2602213
## 49 2.2718591
## 50 3.4736523
## 51 3.7017525
newdata.3 <- data.frame(Family = d[,2], lda = lda.predict.2$x) # note: kamcoop.lda.3 has a single predictor, so lda.predict.3$x contains only LD1; to keep a two-dimensional scatter plot we reuse the body mass + dimorphism discriminants from lda.predict.2
newdata.3
## Family lda.LD1 lda.LD2
## 1 Cebidae 3.4592684 -0.295870934
## 2 Cebidae 2.9226842 -0.105683080
## 3 Cebidae 2.4001611 0.222725251
## 4 Cebidae 2.7273353 -0.033559621
## 5 Cebidae 3.2226688 -0.317006815
## 6 Cebidae 2.7125022 -0.007040944
## 7 Cebidae 2.9675781 -0.281310464
## 8 Cebidae 3.0771011 -0.285362578
## 9 Cebidae 3.1151717 -0.412794871
## 10 Cebidae 3.0246864 -0.314508464
## 11 Cebidae 1.9618025 0.361322318
## 12 Cebidae 2.4728962 0.080061523
## 13 Cebidae 3.1885721 -0.540271141
## 14 Cebidae 2.5209846 -0.034912842
## 15 Cebidae 2.3839923 -0.023473655
## 16 Cebidae 1.9543191 0.122567087
## 17 Cebidae 2.4413069 -0.422882283
## 18 Cebidae 1.9234313 0.262852728
## 19 Cebidae 1.2214459 0.759854917
## 20 Atelidae -1.1514635 -0.126271981
## 21 Atelidae -2.3837613 1.310176731
## 22 Atelidae -0.6055386 -0.179522183
## 23 Atelidae -1.1494847 0.794611421
## 24 Atelidae -1.1973563 -0.229873630
## 25 Atelidae -0.9691959 -0.285411593
## 26 Cercopithecidae -0.4965037 0.510102850
## 27 Cebidae 0.1607580 0.652712850
## 28 Cebidae 0.7206121 0.251991426
## 29 Cebidae 0.1744247 0.519476388
## 30 Cercopithecidae -1.4928380 1.744650847
## 31 Cebidae 0.5661014 0.066481462
## 32 Cebidae 0.2347643 0.498019434
## 33 Cercopithecidae -1.2787374 1.191216397
## 34 Cercopithecidae -0.6697185 -0.474588899
## 35 Cercopithecidae -1.7385381 0.182138643
## 36 Cercopithecidae -0.9431740 0.107373363
## 37 Cercopithecidae -1.4292438 0.028914341
## 38 Cercopithecidae -2.3303087 0.727050896
## 39 Cercopithecidae -3.2642843 -0.239025362
## 40 Cercopithecidae -2.1893216 1.564629940
## 41 Atelidae -1.5827300 -1.619761780
## 42 Cercopithecidae -2.7259116 -0.262854221
## 43 Cercopithecidae -2.4958442 -0.542461948
## 44 Atelidae -2.2655981 -1.818310244
## 45 Atelidae -1.7899681 -1.568688893
## 46 Cercopithecidae -2.2983278 1.010626825
## 47 Atelidae -2.9353760 -2.426298375
## 48 Atelidae -2.3769707 -1.257303037
## 49 Atelidae -2.4514468 -2.440368008
## 50 Cercopithecidae -4.5840202 -1.256648346
## 51 Cercopithecidae -2.7589070 4.832508556
ggplot(newdata.3) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5)
So, with all of these variables, we can say that body mass and brain size distinguish Cebidae from the other two families, but Atelidae and Cercopithecidae are too similar to each other to be separated cleanly.
Quadratic Discriminant Analysis (QDA)
Just like ANOVA, linear discriminant analysis (LDA) assumes equal covariance of the input variables across classes. If the data fail to meet this criterion, observations are more likely to be assigned to the class showing the greatest variability. This type of bias is the reason quadratic discriminant analysis (QDA) was invented. In other words, LDA and QDA are both classification techniques, but QDA is less conservative in that it allows a different covariance matrix for each class. Another key difference between LDA and QDA is that the former produces a linear decision boundary while the latter leads to a quadratic decision boundary.
In the standard illustration comparing the two methods:
- Top row: when the covariance matrices really are the same in the data, LDA and QDA lead to the same decision boundaries.
- Bottom row: when the covariance matrices are different, LDA performs poorly as its assumption becomes invalid, while QDA classifies much better.
Both LDA and QDA can be derived from the same probabilistic models that model the class conditional distribution of the data \(P(X|y=k)\) for each class \(k\). QDA provides an alternative approach by assuming that each class has its own covariance matrix \(\Sigma_k\).
In order to calculate the quadratic boundary we use the following equation:
\[\delta_k(x) = \log \pi_k - \frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\]
- \(\delta_k(x)\) is the discriminant score for class \(k\)
- \(x\) is the vector of independent variable values for an observation
- \(\mu_k\) is a vector containing the means of each variable for class \(k\)
- \(\Sigma_k\) is the covariance matrix of the variables for class \(k\) (unlike in LDA, each class gets its own)
- \(\pi_k\) is the prior probability that an observation belongs to class \(k\)
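To make the formula concrete, here is a minimal base-R sketch that computes \(\delta_k(x)\) by hand for two simulated classes and classifies a new point. Everything here (class_a, class_b, delta, x_new) is illustrative and not part of the module's dataset.

```r
# Hand-computing the QDA discriminant score for two simulated classes.
set.seed(42)
class_a <- matrix(rnorm(100, mean = 0, sd = 1), ncol = 2)  # 50 obs, 2 vars
class_b <- matrix(rnorm(100, mean = 3, sd = 2), ncol = 2)

delta <- function(x, X, prior) {
    mu <- colMeans(X)  # class mean vector (mu_k)
    S <- cov(X)        # class-specific covariance matrix (Sigma_k)
    as.numeric(log(prior) - 0.5 * log(det(S)) -
        0.5 * t(x - mu) %*% solve(S) %*% (x - mu))
}

x_new <- c(2.5, 2.5)  # a new observation to classify
scores <- c(A = delta(x_new, class_a, 0.5), B = delta(x_new, class_b, 0.5))
names(which.max(scores))  # QDA assigns x_new to the class with the larger score
```

Note that the class-specific covariance matrix `S` is recomputed inside `delta()` for each class; this is exactly the extra flexibility that separates QDA from LDA, where a single pooled covariance matrix would be used for both classes.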
Let’s do a little exercise to see how QDA can be applied to our simplified Kamilar and Cooper dataset and how it differs from LDA.
Packages we need:
library(curl)
library(MASS)
library(ggplot2)
library(klaR)
library(car)
## Loading required package: carData
QDA CHALLENGE
Similar to what we did for LDA, we will now use QDA to try to determine which variable or combination of variables gives us the best predictive model for determining to which of the three families of primates an individual belongs. Our data were already loaded earlier in the module, but here it is again in case you need it.
f <- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/kamcoop.simplified.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
Now let’s take a more expansive look at our data to see if our other variables differ between groups.
plot(d$Family, d$Body_mass_female_mean, main = "Mean Female Body Mass", ylab = "Body Mass (g)")
plot(d$Family, d$Brain_Size_Species_Mean, main = "Mean Species Brain Size", ylab = "Brain Size (g)")
plot(d$Family, d$Brain_Size_Female_Mean, main = "Mean Female Brain Size", ylab = "Brain Size (g)")
plot(d$Family, d$Mass_Dimorphism, main = "Sexual Dimorphism in Mass", ylab = "Sexual Dimorphism")
Another way we can visualize the variation in our data is by using the scatterplotMatrix() function from the {car} package.
scatterplotMatrix(~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism | Family, data = d, legend = TRUE, smooth = list(method = gamLine), diagonal = TRUE, plot.points = TRUE)
Most of our variables appear to be noticeably different between families, but let's investigate further with QDA to see if one is a better predictor than the others. Instead of lda(), here we need to use the qda() function, which also comes from the {MASS} package. We can start with a model that includes all of our variables and simplify later if need be.
kamcoop.qda <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism, data = d) #categorical dependent variable is family
kamcoop.qda
## Call:
## qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean +
## Brain_Size_Female_Mean + Mass_Dimorphism, data = d)
##
## Prior probabilities of groups:
## Atelidae Cebidae Cercopithecidae
## 0.2352941 0.4705882 0.2941176
##
## Group means:
## Body_mass_male_mean Body_mass_female_mean
## Atelidae 7894.583 6616.2000
## Cebidae 1113.621 942.2208
## Cercopithecidae 11213.687 6820.7333
## Brain_Size_Species_Mean Brain_Size_Female_Mean
## Atelidae 80.56833 78.77833
## Cebidae 23.94333 23.60667
## Cercopithecidae 95.98533 91.16600
## Mass_Dimorphism
## Atelidae 1.230583
## Cebidae 1.083250
## Cercopithecidae 1.588533
QDA Partition Plots
Next we can use the partimat() function from the {klaR} package which allows us to plot the results of the QDA. This will create a figure for every possible combination of two variables from our data.
partimat(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism, data = d, method="qda")
Think of each of these plots as a different perspective on our data. The different colored regions outline each classification area: an observation that falls within a region is predicted to belong to that class. The apparent error rate at the top of each figure informs us about how many observations fall outside their predicted region.
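For reference, the apparent error rate on each panel is just the proportion of training observations the rule misclassifies. A small hand-built sketch of the arithmetic (the confusion matrix here is hypothetical, not taken from our output):

```r
# Apparent error rate = misclassified / total, read off a confusion matrix.
# This 3x3 matrix is made up purely to demonstrate the calculation.
confusion <- matrix(c(12, 0, 1,
                      0, 24, 0,
                      0,  0, 14),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = c("A", "B", "C"),
                                    actual = c("A", "B", "C")))
apparent_error <- 1 - sum(diag(confusion)) / sum(confusion)
apparent_error  # 1 misclassification out of 51 observations
```

Because it is computed on the same data used to fit the model, the apparent error rate is optimistic; cross-validation gives a more honest estimate.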
QDA Predictions
The predict() function works identically to the way we used it for LDA. We will again use this function to create a confusion matrix for our data.
qda.predict <- predict(kamcoop.qda)
qda.predict
## $class
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Atelidae Atelidae Atelidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Cercopithecidae Cebidae Cebidae
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
##
## $posterior
## Atelidae Cebidae Cercopithecidae
## 1 3.858812e-63 1.000000e+00 1.423939e-23
## 2 1.259996e-39 1.000000e+00 1.824540e-21
## 3 9.278062e-19 1.000000e+00 3.991246e-19
## 4 9.109489e-32 1.000000e+00 1.268353e-20
## 5 5.155527e-55 1.000000e+00 6.465711e-22
## 6 5.076162e-31 1.000000e+00 1.796671e-19
## 7 8.740848e-47 1.000000e+00 1.779306e-21
## 8 4.101006e-50 1.000000e+00 3.599706e-21
## 9 3.917562e-55 1.000000e+00 6.067798e-22
## 10 1.516864e-49 1.000000e+00 2.435858e-21
## 11 4.253817e-11 1.000000e+00 2.466262e-16
## 12 5.213397e-23 1.000000e+00 1.920585e-17
## 13 2.887668e-63 1.000000e+00 1.011983e-21
## 14 2.563074e-28 1.000000e+00 2.817876e-18
## 15 1.633880e-25 1.000000e+00 1.938416e-16
## 16 5.984879e-15 1.000000e+00 5.494075e-15
## 17 2.729108e-34 1.000000e+00 8.618578e-15
## 18 5.098725e-14 1.000000e+00 5.264144e-11
## 19 1.226205e-07 9.999994e-01 4.312588e-07
## 20 1.000000e+00 8.517466e-267 5.097335e-10
## 21 1.000000e+00 0.000000e+00 7.060414e-17
## 22 9.999980e-01 4.130779e-118 1.965400e-06
## 23 9.999979e-01 4.768397e-186 2.134411e-06
## 24 1.000000e+00 1.593256e-267 8.593283e-09
## 25 9.999966e-01 2.388903e-205 3.376183e-06
## 26 1.123581e-02 6.396755e-50 9.887642e-01
## 27 7.198739e-10 9.997299e-01 2.701147e-04
## 28 3.509460e-11 9.996593e-01 3.407163e-04
## 29 6.084442e-09 9.995180e-01 4.820043e-04
## 30 1.210549e-21 1.367158e-107 1.000000e+00
## 31 3.508459e-12 9.999983e-01 1.711234e-06
## 32 1.861134e-09 9.999925e-01 7.449099e-06
## 33 7.636107e-08 2.624225e-112 9.999999e-01
## 34 9.806719e-14 7.100828e-80 1.000000e+00
## 35 9.596781e-04 2.337239e-281 9.990403e-01
## 36 4.922369e-04 1.612531e-69 9.995078e-01
## 37 7.113299e-05 7.834125e-154 9.999289e-01
## 38 1.847042e-04 0.000000e+00 9.998153e-01
## 39 1.358517e-05 0.000000e+00 9.999864e-01
## 40 1.433789e-14 0.000000e+00 1.000000e+00
## 41 9.999914e-01 4.040575e-210 8.642646e-06
## 42 3.777099e-04 0.000000e+00 9.996223e-01
## 43 5.241684e-08 0.000000e+00 9.999999e-01
## 44 9.975638e-01 9.231261e-230 2.436202e-03
## 45 1.000000e+00 7.197630e-297 2.355335e-10
## 46 1.658995e-11 0.000000e+00 1.000000e+00
## 47 9.816954e-01 0.000000e+00 1.830460e-02
## 48 9.999987e-01 1.453697e-259 1.278303e-06
## 49 9.999903e-01 1.200042e-287 9.693333e-06
## 50 2.226455e-47 0.000000e+00 1.000000e+00
## 51 0.000000e+00 0.000000e+00 1.000000e+00
This doesn’t tell us much. Let’s create a matrix like we did with LDA.
qda.predict.class <- predict(kamcoop.qda,
newdata=d[,c(5,6,7,8,9)]
)$class
qda.predict.class
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Atelidae Atelidae Atelidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Cercopithecidae Cebidae Cebidae
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class, d[,2])
##
## qda.predict.class Atelidae Cebidae Cercopithecidae
## Atelidae 12 0 0
## Cebidae 0 24 0
## Cercopithecidae 0 0 15
Our model predicted the correct family 100% of the time. This appears to be more accurate than the LDA model, which used mean male and female body mass only. However, we should also test simpler models to see if we get similar results. As we learned this semester, simpler is better when it comes to modeling!
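Rather than eyeballing the confusion matrix, overall accuracy can be computed in one line. A sketch using the built-in iris data as a stand-in (our kamcoop objects work the same way):

```r
# Accuracy = proportion of observations whose predicted class matches the
# true class. iris stands in here for our primate dataset.
library(MASS)
iris.qda <- qda(Species ~ Sepal.Length + Petal.Length, data = iris)
accuracy <- mean(predict(iris.qda)$class == iris$Species)
accuracy  # a value of 1 would mean every observation was classified correctly
```

Keep in mind that accuracy on the training data, like the confusion matrices above, can overstate how well the model would classify new observations.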
Let’s test some other modeling options!
Removing Mass_Dimorphism:
kamcoop.qda1 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean, data = d)
qda.predict.class1 <- predict(kamcoop.qda1,
newdata=d[,c(5,6,7,8)]
)$class
qda.predict.class1
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Atelidae Atelidae Atelidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Cercopithecidae Cebidae Cebidae
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class1, d[,2])
##
## qda.predict.class1 Atelidae Cebidae Cercopithecidae
## Atelidae 12 0 0
## Cebidae 0 24 0
## Cercopithecidae 0 0 15
Still looking good!
Now let’s try without Mass_Dimorphism and Brain_Size_Female_Mean:
kamcoop.qda2 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean, data = d)
qda.predict.class2 <- predict(kamcoop.qda2,
newdata=d[,c(5,7,8)]
)$class
qda.predict.class2
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Atelidae Atelidae Atelidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Cercopithecidae Cebidae Cebidae
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class2, d[,2])
##
## qda.predict.class2 Atelidae Cebidae Cercopithecidae
## Atelidae 12 0 0
## Cebidae 0 24 0
## Cercopithecidae 0 0 15
Still 100%!
Now we try with just Male and Female body mass:
kamcoop.qda3 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d)
qda.predict.class3 <- predict(kamcoop.qda3,
newdata=d[,c(7,8)]
)$class
qda.predict.class3
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Atelidae
## [21] Cercopithecidae Atelidae Atelidae Atelidae
## [25] Atelidae Cercopithecidae Cebidae Cebidae
## [29] Cebidae Atelidae Cebidae Cebidae
## [33] Atelidae Atelidae Atelidae Atelidae
## [37] Atelidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [45] Atelidae Cercopithecidae Atelidae Atelidae
## [49] Atelidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class3, d[,2])
##
## qda.predict.class3 Atelidae Cebidae Cercopithecidae
## Atelidae 11 0 6
## Cebidae 0 24 0
## Cercopithecidae 1 0 9
And now we see we are starting to lose accuracy.
One more combination, just for the sake of being thorough, without body mass or sexual dimorphism:
kamcoop.qda4 <- qda(Family ~ Brain_Size_Species_Mean + Brain_Size_Female_Mean, data = d)
qda.predict.class4 <- predict(kamcoop.qda4,
newdata=d[,c(5,6)]
)$class
qda.predict.class4
## [1] Cebidae Cebidae Cebidae Cebidae
## [5] Cebidae Cebidae Cebidae Cebidae
## [9] Cebidae Cebidae Cebidae Cebidae
## [13] Cebidae Cebidae Cebidae Cebidae
## [17] Cebidae Cebidae Cebidae Cebidae
## [21] Atelidae Atelidae Atelidae Cebidae
## [25] Cebidae Cebidae Cebidae Cebidae
## [29] Atelidae Atelidae Cebidae Atelidae
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Atelidae Atelidae Cercopithecidae
## [41] Atelidae Atelidae Cercopithecidae Cercopithecidae
## [45] Atelidae Cercopithecidae Cercopithecidae Atelidae
## [49] Cercopithecidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class4, d[,2])
##
## qda.predict.class4 Atelidae Cebidae Cercopithecidae
## Atelidae 6 2 4
## Cebidae 3 22 1
## Cercopithecidae 3 0 10
Not very good!
It looks as though our best and simplest model is:
Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean
As such, we can use this model and QDA to predict how an unknown sample should be classified into one of our three Families of primates.
Canonical Discriminant Analysis (CDA)
Canonical discriminant analysis (CDA) is a multivariate statistical technique that identifies differences among groups of treatments (the independent variables) and shows us the relationships between the various (dependent) variables within those groups. CDA determines the best way to discriminate the treatment groups from each other using quantitative measurements of all the variables within each group.
Running the CDA yields several uncorrelated canonical discriminant functions (CDFs). These linear combinations of the variables best separate the original treatment group means relative to within-group variation. A lack of correlation between the CDFs allows each one to pull out a completely new dimension of information from the groups and variables. CDF1 captures the maximum possible variation among groups, therefore yielding the greatest degree of group differences. CDF1 and CDF2 are completely uncorrelated, so CDF2 will reflect differences not shown in CDF1. Similarly, CDF3 is uncorrelated with both CDF1 and CDF2, and reflects its own unique differences between groups.
Time to load in some data! To all my true crime lovers out there…
Load in the USA Arrests dataset, which shows number of arrests per 100,000 residents for assault, murder, and burglary in each of the 50 US states in 1973 as well as the percentage of residents living in urban areas in each state. (Note: This dataset has been modified for the purpose of this module.)
library(curl)
df<- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/USArrests_data_NEW.csv")
crimes<- read.csv(df, header = T, sep = ",", stringsAsFactors = T)
head(crimes)
## State Murder Assault UrbanPop Burglary
## 1 Alabama 13.2 236 58 148.4
## 2 Alaska 10.0 263 48 311.5
## 3 Arizona 8.1 294 80 217.0
## 4 Arkansas 8.8 190 50 136.5
## 5 California 9.0 276 91 284.2
## 6 Colorado 7.9 204 78 270.9
Let’s assign our 50 states to 6 different geographical regions for the purpose of this analysis.
crimes$region <- "temp" #creating a new variable name
crimes$region[row(data.frame(levels(crimes$State))) %in% c(7, 8, 19, 20, 21, 29, 30, 32, 38, 39, 45)] <- "Northeast"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(1, 4, 9, 10, 18, 24, 33, 40, 42, 46, 48)] <- "Southeast"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(3, 31, 36, 43)] <- "Southwest"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(5, 6, 12, 26, 28, 37, 44, 47, 50)] <- "West"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(13, 14, 15, 16, 17, 22, 23, 25, 27, 34, 35, 41, 49)] <- "Midwest"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(2, 11)] <- "Noncontiguous"
crimes$region <- as.factor(crimes$region)
levels(crimes$region)
## [1] "Midwest" "Noncontiguous" "Northeast" "Southeast"
## [5] "Southwest" "West"
Data exploration: What are we looking at?
Let’s explore our data a bit and view each variable separately across the 6 regions.
par(mfrow = c(2,2))
par(xaxt = "n") #this just removes the built in x-axis labels so that we can insert our own and put them on an angle
lablist<-as.vector(c("MW", "NC", "NE", "SE", "SW", "W")) #our new x-axis labels
plot(data= crimes, Murder ~ region, xlab= "Region")
axis(1, at=seq(1, 6, by=1), labels = FALSE)
text(seq(1, 6, by=1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE) #this code is just for the positioning of our new x-axis labels; par("usr")[3] is the bottom of the plotting region
plot(data= crimes, Assault ~ region, xlab= "Region")
axis(1, at=seq(1, 6, by=1), labels = FALSE)
text(seq(1, 6, by=1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)
plot(data= crimes, UrbanPop ~ region, xlab= "Region")
axis(1, at=seq(1, 6, by=1), labels = FALSE)
text(seq(1, 6, by=1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)
plot(data= crimes, Burglary ~ region, xlab= "Region")
axis(1, at=seq(1, 6, by=1), labels = FALSE)
text(seq(1, 6, by=1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)
We should probably test for normality and see what we're working with. Since we have a multivariate dataset, we can use the Shapiro-Wilk multivariate normality test (conveniently, there is a built-in function for this, mshapiro.test(), in the {mvnormtest} package).
Let’s create a matrix that has all of our dependent variables:
library(mvnormtest)
crimes_matrix<- cbind(crimes$Murder, crimes$Assault, crimes$UrbanPop, crimes$Burglary)
#The mshapiro.test() requires that the dependent variables be in rows and the independent variables be in columns, the function t() solves this
crimes_matrix_t<- t(crimes_matrix)
mshapiro.test(crimes_matrix_t)
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.93822, p-value = 0.01147
The p-value from this test is < 0.05, which means that, technically, our data are not multivariate normal. Normality and homogeneity assumptions are not always considered absolute prerequisites for CDA, so for the purpose of this module we will continue with this dataset. However, note that in most cases you should try to normalize your data beforehand!
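One common way to "normalize beforehand" is a log transform, which pulls right-skewed variables toward normality. A toy demonstration with simulated lognormal data (none of these objects come from our crimes dataset):

```r
# Log-transforming right-skewed data usually improves the Shapiro-Wilk result.
set.seed(1)
x <- rlnorm(50, meanlog = 0, sdlog = 1)  # strongly right-skewed sample
shapiro.test(x)$p.value       # small p-value: normality rejected
shapiro.test(log(x))$p.value  # much larger p-value after the log transform
```

The CDA challenge at the end of this module uses exactly this trick on the Wine data.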
Next Steps: Running a MANOVA
In order to investigate between-group differences of multivariate data, we can use multivariate analysis of variance, or MANOVA. Remember, ANOVA only looks at 1 response/dependent variable and tests for the difference in means across two or more groups. MANOVA extends this analysis to several dependent variables and tests for the difference in two or more vectors of means. For the purpose of the CDA, a MANOVA is the most appropriate approach. CDA then shows visual descriptions of differences between the groups that a simple MANOVA does not.
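As a compact illustration of the MANOVA-then-ANOVA workflow described above, here is a sketch on the built-in iris data (a stand-in for our crimes matrix):

```r
# Two response variables tested jointly across three groups, then followed
# up with one univariate ANOVA per response.
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit)      # Pillai's trace: joint multivariate test
summary.aov(fit)  # univariate follow-ups, one per response variable
```

The joint test tells us whether the group mean vectors differ at all; the univariate follow-ups tell us which individual responses drive that difference, exactly as we do with the crimes data below.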
colnames(crimes_matrix)= c("Murder", "Assault", "Urban Population", "Burglary") #setting column names in our matrix now will allow for better visualization with our CDA plot
crimes_manova<- manova(data= crimes, crimes_matrix ~ region)
summary(crimes_manova)
## Df Pillai approx F num Df den Df Pr(>F)
## region 5 1.2876 4.1774 20 176 9.527e-08 ***
## Residuals 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Great! We can see that there is a significant difference in the means of our variables across the 6 regions. To look at the differences more closely, let’s quickly run an ANOVA summary…
summary.aov(crimes_manova)
## Response Murder :
## Df Sum Sq Mean Sq F value Pr(>F)
## region 5 417.68 83.536 7.1806 5.477e-05 ***
## Residuals 44 511.87 11.634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Assault :
## Df Sum Sq Mean Sq F value Pr(>F)
## region 5 91313 18262.5 3.2271 0.01441 *
## Residuals 44 249000 5659.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Urban Population :
## Df Sum Sq Mean Sq F value Pr(>F)
## region 5 1959.0 391.81 2.0752 0.08661 .
## Residuals 44 8307.4 188.80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Burglary :
## Df Sum Sq Mean Sq F value Pr(>F)
## region 5 62749 12549.7 3.7338 0.006647 **
## Residuals 44 147889 3361.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference in the mean number of murders, assaults, and burglaries across the regions but not in the percentage of people living in urban areas.
CDA Time!
library(candisc)
## Loading required package: heplots
##
## Attaching package: 'candisc'
## The following object is masked from 'package:stats':
##
## cancor
crimes_CDA<- candisc(crimes_manova)
summary(crimes_CDA)
##
## Canonical Discriminant Analysis for region:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.695113 2.27990 1.598 71.54378 71.544
## 2 0.405419 0.68186 1.598 21.39682 92.941
## 3 0.176408 0.21419 1.598 6.72145 99.662
## 4 0.010655 0.01077 1.598 0.33795 100.000
##
## Class means:
##
## Can1 Can2
## Midwest 0.39690 0.0082801
## Noncontiguous 1.70210 -1.9205945
## Northeast 0.27187 1.2616902
## Southeast -2.45823 -0.4275676
## Southwest 0.11105 -0.0409363
## West 1.67131 -0.5864507
##
## std coefficients:
## Can1 Can2
## Murder -1.38225 -0.41149
## Assault -0.22115 0.83969
## Urban Population 0.33252 0.66148
## Burglary 1.19140 -1.30648
The summary() function with our CDA object shows us the standardized canonical coefficients for the first two dimensions across our dependent variables. The canonical coefficients indicate the relative contribution of each variable to distinguishing between groups. Each canonical dimension is most strongly influenced by the variable(s) with the greatest absolute value coefficients.
We can see that CDF1 is most strongly influenced by the Murder and Burglary variables while CDF2 is most strongly influenced by Assault and Burglary.
Let’s plot this sucker!
par(mfrow = c(1,1))
heplot(crimes_CDA, term = "region")
## Vector scale factor set to 3.503351
Overall, we can see that the mean number of arrests for murder and assault in 1973 is most strongly associated with the southeast region of the United States. Arrests for burglary are most strongly associated with the noncontiguous states, Hawaii and Alaska. Remember that our ANOVA showed that the percentage of people living in urban areas was not significantly different across the regions; accordingly, UrbanPop shows only a weak association with any of the regions in our CDA plot. In addition, the further away variables are from the “Error” circle, the stronger the association. Therefore, we see that the northeast, midwest, and southwest regions are not strongly associated with any of our response variables.
CDA CHALLENGE
Load in the built-in R dataset “Wine” in the {candisc} package, which contains the results of a chemical analysis of three types of wine, Barolo, Grignolino, and Barbera (all red wine), grown in a specific area of Italy.
data("Wine")
head(Wine)
## Cultivar Alcohol MalicAcid Ash AlcAsh Mg Phenols Flav NonFlavPhenols
## 1 barolo 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28
## 2 barolo 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26
## 3 barolo 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30
## 4 barolo 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24
## 5 barolo 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39
## 6 barolo 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34
## Proa Color Hue OD Proline
## 1 2.29 5.64 1.04 3.92 1065
## 2 1.28 4.38 1.05 3.40 1050
## 3 2.81 5.68 1.03 3.17 1185
## 4 2.18 7.80 0.86 3.45 1480
## 5 1.82 4.32 1.04 2.93 735
## 6 1.97 6.75 1.05 2.85 1450
levels(Wine$Cultivar) #these are the categories of our independent variable... the three types of wine
## [1] "barolo" "grignolino" "barbera"
Since this dataset is also not normally distributed, we are going to use the log transformation of the most normally distributed variables: Alcohol, Alkalinity of Ash, and Color.
par(mfrow = c(1,2))
hist(log(Wine$Alcohol))
qqnorm(log(Wine$Alcohol))
hist(log(Wine$AlcAsh))
qqnorm(log(Wine$AlcAsh))
hist(log(Wine$Color))
qqnorm(log(Wine$Color))
Plot these log transformed variables against the Cultivar variable.
par(mfrow = c(2,2))
par(xaxt = "s") #since we removed the x-axis above, we need this code to add them back in
plot(data = Wine, log(Alcohol) ~ Cultivar)
plot(data = Wine, log(AlcAsh) ~ Cultivar)
plot(data = Wine, log(Color) ~ Cultivar)
Create a log transformed matrix containing these variables.
wine_matrix<- cbind(Wine$Alcohol, Wine$AlcAsh, Wine$Color)
log_wine_matrix<- log(wine_matrix) #log transforming the matrix
Name the columns of the matrix and run the appropriate MANOVA to use for this CDA analysis. What does the MANOVA tell us? How can you tell which variables are significantly different across the three types of wine?
colnames(log_wine_matrix) = c("Alcohol", "Alkalinity of Ash", "Color")
log_wine_manova<- manova(data = Wine, log_wine_matrix ~ Cultivar)
summary(log_wine_manova)
## Df Pillai approx F num Df den Df Pr(>F)
## Cultivar 2 1.1529 78.946 6 348 < 2.2e-16 ***
## Residuals 175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(log_wine_manova)
## Response Alcohol :
## Df Sum Sq Mean Sq F value Pr(>F)
## Cultivar 2 0.42355 0.211774 135.51 < 2.2e-16 ***
## Residuals 175 0.27350 0.001563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Alkalinity of Ash :
## Df Sum Sq Mean Sq F value Pr(>F)
## Cultivar 2 1.6350 0.81748 37.316 3.176e-14 ***
## Residuals 175 3.8337 0.02191
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Color :
## Df Sum Sq Mean Sq F value Pr(>F)
## Cultivar 2 23.948 11.9741 154.81 < 2.2e-16 ***
## Residuals 175 13.536 0.0773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Run the CDA and investigate which variables most strongly influence the first two CDFs, then plot it.
log_wine_CDA<- candisc(log_wine_manova)
summary(log_wine_CDA)
##
## Canonical Discriminant Analysis for Cultivar:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.69347 2.26235 1.4123 72.688 72.688
## 2 0.45948 0.85006 1.4123 27.312 100.000
##
## Class means:
##
## Can1 Can2
## barolo 1.2609 1.043183
## grignolino -1.8302 0.030076
## barbera 1.1573 -1.326732
##
## std coefficients:
## Can1 Can2
## Alcohol 0.56268 0.60585
## Alkalinity of Ash -0.06216 -0.65614
## Color 0.65306 -0.68300
par(mfrow= c(1,1))
heplot(log_wine_CDA, term = "Cultivar")
## Vector scale factor set to 10.17842
Let’s add some color!
myColors <- hsv((c(0,120,240) + 80)/360,s = 0.9,v = 0.8,0.7) #creating a vector for the colors
plot(log_wine_CDA, col = myColors, pch = rep(16,20), xpd = T)
## Vector scale factor set to 5.693
Thanks for joining us! We hope you enjoyed the module and learned a few things about discriminant analysis!