Preliminaries

For this module, you will need to install the following packages: {curl}, {MASS}, and {ggplot2}.

Objectives

The objective of this module is to introduce discriminant analysis in general as well as describe and go through examples using three different types of discriminant analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and canonical discriminant analysis (CDA).

Introduction

What is Discriminant Analysis?

Discriminant analysis is used when the objective is to assess how adequately observations can be classified into groups, including a priori groups. This differs from a similar statistical tool, principal component analysis (PCA), which focuses on overall variation in the data without regard for any specific grouping of the observations.

In this way, discriminant analysis allows us to build classifiers: through this type of analysis, we can build a predictive model of group membership that we can then apply to new cases, to see into which group each new case falls. As opposed to regression techniques, discriminant analysis produces group labels rather than real-valued output.

Furthermore, discriminant analysis can be linear or it can fit other curves; it can also be two- or multi-dimensional, with either a line or a plane separating the categories, depending on the number of dimensions. We will see some examples of how this works and the kinds of visuals that can be produced as part of this tutorial.

While discriminant analysis parallels multiple regression analysis in some ways, one important difference is that regression analysis works with a continuous dependent variable, while discriminant analysis must use a discrete dependent variable.
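To make that contrast concrete, here is a minimal sketch (using R's built-in iris data so it runs standalone; the chosen variables are just for illustration):

```r
library(MASS)  # provides lda()

# Regression predicts a continuous value...
fit_lm <- lm(Sepal.Length ~ Petal.Length, data = iris)
head(predict(fit_lm))        # numeric predictions

# ...while discriminant analysis predicts a discrete group label
fit_lda <- lda(Species ~ Petal.Length, data = iris)
head(predict(fit_lda)$class) # factor levels (group labels)
```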

A simple example of discriminant analysis would be to consider the following situation: An academic advisor splits a cohort of students into three groups: 1) those who graduate and do not plan to go on to graduate school, 2) those who graduate and go on to graduate school, and 3) those who drop out of college. A variety of data could be collected for each group during their time in college, before knowing which group they will fall into at the end. After graduation for that cohort, all the students will fall into one of these categories. Discriminant analysis could then be used to determine which of the collected variables work as the best predictors of the students’ educational outcomes.

Linear Discriminant Analysis (LDA)

As with PCA, the goal is to see if certain qualities can describe a categorical variable. If so, when we plot these analyses, we should be able to see somewhat clear separation of the categories. LDA uses the data to divide the predictor space into categorical regions with linear boundaries. Compared to PCA, LDA orders dimensions based on how much separation each achieves, maximizing the differences between clusters and minimizing their overlap.

A PCA can be thought of as an “unsupervised” algorithm that ignores class labels and searches for the directions (principal components) that maximize variance. LDA, however, can be thought of as “supervised”: it computes the directions (linear discriminants) that maximize the separation between different classes. LDA tends to be the better choice when there are multiple classes and the number of samples per class is relatively small. Also, in comparison with QDA, LDA assumes that the covariance of the independent variables is the same for each class.
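The supervised/unsupervised distinction is easy to see in code. A quick sketch on the built-in iris data (not the dataset used below): prcomp() never sees the class labels, while lda() needs them on the left-hand side of the formula.

```r
library(MASS)

pc <- prcomp(iris[, 1:4], scale. = TRUE)  # unsupervised: labels never passed in
ld <- lda(Species ~ ., data = iris)       # supervised: Species drives the axes

head(pc$x[, 1:2], 3)    # first two principal components (max variance)
head(predict(ld)$x, 3)  # linear discriminants (max class separation)
```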

Here the blue and green clusters lie along the x-axis, so if future points behave according to the probability density function, they should be classified into a single cluster.

The LDA returns an output giving scores that define how much weight each variable contributes to the discriminant function. These scores are calculated with this equation:

\[ \delta_i (X) = X^T \Sigma^{-1} \mu_i - \frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln(\pi_i)\]
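As a quick illustration, this score can be computed by hand. Here is a minimal sketch with made-up numbers (the mean vector, pooled covariance matrix, and prior are toy values, not from any real dataset):

```r
# Toy inputs for a single class i
mu    <- c(2, 3)                       # class mean vector (mu_i)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2)  # pooled covariance, shared across classes
pi_i  <- 0.4                           # prior probability of class i
X     <- c(2.5, 2.8)                   # new observation to score

Sinv  <- solve(Sigma)                  # Sigma^-1
delta <- t(X) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(pi_i)
as.numeric(delta)  # classify X into whichever class has the largest delta
```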

But why would we ever do this by hand when we have the lovely lda() function in the {MASS} R package?!

LDA CHALLENGE

Packages we need:

library(curl)
library(MASS)
library(ggplot2)

We can use the Kamilar and Cooper dataset to take a look at how this works. For the sake of simplicity, this version of the dataset only includes the primate families Atelidae, Cebidae, and two genera of Cercopithecidae (macaques and mandrills). The dataset also includes five numeric variables: mean species brain size, mean female brain size, mean male body mass, mean female body mass, and mass dimorphism. Furthermore, we removed any species within these families that was missing values for any of the variables. Let’s load this data in first:

f <- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/kamcoop.simplified.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(d)
##          Scientific_Name  Family      Genus     Species
## 1     Callithrix_pygmaea Cebidae Callithrix     pygmaea
## 2     Callithrix_jacchus Cebidae Callithrix     jacchus
## 3 Callithrix_penicillata Cebidae Callithrix penicillata
## 4   Saguinus_fuscicollis Cebidae   Saguinus fuscicollis
## 5   Callithrix_argentata Cebidae Callithrix   argentata
## 6         Saguinus_niger Cebidae   Saguinus       niger
##   Brain_Size_Species_Mean Brain_Size_Female_Mean Body_mass_male_mean
## 1                    4.17                   4.33               110.0
## 2                    7.24                   7.05               318.0
## 3                    7.32                   7.38               343.8
## 4                    7.94                   8.03               404.6
## 5                    7.95                   8.36               330.0
## 6                    9.48                   9.62               380.0
##   Body_mass_female_mean Mass_Dimorphism
## 1                 122.0           0.902
## 2                 322.0           0.988
## 3                 311.7           1.103
## 4                 396.6           1.020
## 5                 360.0           0.917
## 6                 370.0           1.027
summary(d)
##            Scientific_Name             Family          Genus   
##  Alouatta_belzebul : 1     Atelidae       :12   Macaca    :13  
##  Alouatta_caraya   : 1     Cebidae        :24   Saguinus  : 7  
##  Alouatta_guariba  : 1     Cercopithecidae:15   Alouatta  : 6  
##  Alouatta_palliata : 1                          Cebus     : 5  
##  Alouatta_pigra    : 1                          Ateles    : 4  
##  Alouatta_seniculus: 1                          Callithrix: 4  
##  (Other)           :45                          (Other)   :12  
##         Species   Brain_Size_Species_Mean Brain_Size_Female_Mean
##  geoffroyi  : 2   Min.   :  4.17          Min.   :  4.33        
##  albifrons  : 1   1st Qu.: 12.34          1st Qu.: 12.29        
##  apella     : 1   Median : 63.98          Median : 62.14        
##  arachnoides: 1   Mean   : 58.46          Mean   : 56.46        
##  arctoides  : 1   3rd Qu.: 91.83          3rd Qu.: 86.40        
##  argentata  : 1   Max.   :153.88          Max.   :146.63        
##  (Other)    :44                                                 
##  Body_mass_male_mean Body_mass_female_mean Mass_Dimorphism
##  Min.   :  110.0     Min.   :  122         Min.   :0.871  
##  1st Qu.:  624.6     1st Qu.:  639         1st Qu.:1.032  
##  Median : 5680.0     Median : 3518         Median :1.226  
##  Mean   : 5679.8     Mean   : 4006         Mean   :1.267  
##  3rd Qu.: 8202.5     3rd Qu.: 6485         3rd Qu.:1.398  
##  Max.   :34400.0     Max.   :12800         Max.   :2.688  
## 
str(d) # There are 3 different families, this is probably what we will use for the analysis
## 'data.frame':    51 obs. of  9 variables:
##  $ Scientific_Name        : Factor w/ 51 levels "Alouatta_belzebul",..: 19 17 18 43 16 48 45 49 46 44 ...
##  $ Family                 : Factor w/ 3 levels "Atelidae","Cebidae",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Genus                  : Factor w/ 13 levels "Alouatta","Aotus",..: 6 6 6 12 6 12 12 12 12 12 ...
##  $ Species                : Factor w/ 50 levels "albifrons","apella",..: 40 21 38 17 5 31 26 33 27 18 ...
##  $ Brain_Size_Species_Mean: num  4.17 7.24 7.32 7.94 7.95 ...
##  $ Brain_Size_Female_Mean : num  4.33 7.05 7.38 8.03 8.36 ...
##  $ Body_mass_male_mean    : num  110 318 344 405 330 ...
##  $ Body_mass_female_mean  : num  122 322 312 397 360 ...
##  $ Mass_Dimorphism        : num  0.902 0.988 1.103 1.02 0.917 ...
plot(d$Family, d$Body_mass_male_mean) #plot family against male body mass so we can start to see what differences we may be looking for 

This simple boxplot just shows what we will be looking for in the LDA; in terms of male body mass, Cercopithecidae and Atelidae are close to one another, but species within Cebidae are a bit smaller.

Now we’re going to look for this difference in body mass (male and female) using an LDA. We will do this by simply running the lda() function from the {MASS} package.

kamcoop.lda <- lda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d) #categorical dependent variable is family 
kamcoop.lda
## Call:
## lda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d)
## 
## Prior probabilities of groups:
##        Atelidae         Cebidae Cercopithecidae 
##       0.2352941       0.4705882       0.2941176 
## 
## Group means:
##                 Body_mass_male_mean Body_mass_female_mean
## Atelidae                   7894.583             6616.2000
## Cebidae                    1113.621              942.2208
## Cercopithecidae           11213.687             6820.7333
## 
## Coefficients of linear discriminants:
##                                 LD1           LD2
## Body_mass_male_mean    0.0001091874  0.0004160871
## Body_mass_female_mean -0.0007126010 -0.0006123181
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9245 0.0755

The output gives: the prior probabilities of groups (the proportion of the data that fell in each group before the analysis), the group means, the coefficients of the linear discriminants (the weights that actually discriminate between the groups), and the proportion of trace, which is the percentage of separation achieved by each discriminant function (about 92% and 8% here).
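The proportion of trace is derived from the singular values that lda() stores in the fitted object. A quick sketch on the built-in iris data (so it runs without the downloaded file) shows the calculation:

```r
library(MASS)

fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
# Squared singular values, normalised, reproduce the "Proportion of trace" line
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)
```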

Now we can generate predictions based on the fitted LDA and store them in an object by passing the model to the predict() function. The number of predicted values corresponds to the number of rows in the processed data.

lda.predict <- predict(kamcoop.lda)
lda.predict
## $class
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Cercopithecidae
## [21] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [25] Cercopithecidae Cebidae         Cebidae         Cebidae        
## [29] Cebidae         Cebidae         Cebidae         Cebidae        
## [33] Cebidae         Atelidae        Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Atelidae        Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
## 
## $posterior
##        Atelidae      Cebidae Cercopithecidae
## 1  0.0003401910 9.985793e-01     0.001080556
## 2  0.0005170006 9.979535e-01     0.001529514
## 3  0.0004944403 9.980079e-01     0.001497706
## 4  0.0006008810 9.976594e-01     0.001739715
## 5  0.0005695940 9.977927e-01     0.001637707
## 6  0.0005672634 9.977720e-01     0.001660756
## 7  0.0008313975 9.969330e-01     0.002235615
## 8  0.0006813717 9.974199e-01     0.001898767
## 9  0.0009440186 9.966033e-01     0.002452714
## 10 0.0008265242 9.969545e-01     0.002218965
## 11 0.0007689420 9.970221e-01     0.002208925
## 12 0.0006888250 9.973453e-01     0.001965889
## 13 0.0012144544 9.958130e-01     0.002972548
## 14 0.0009120509 9.966217e-01     0.002466270
## 15 0.0011599079 9.958161e-01     0.003024039
## 16 0.0017507576 9.938694e-01     0.004379867
## 17 0.0036193459 9.889666e-01     0.007414008
## 18 0.0011691551 9.956853e-01     0.003145514
## 19 0.0009590634 9.962439e-01     0.002796995
## 20 0.4290728510 6.050376e-02     0.510423390
## 21 0.1793044387 1.898460e-02     0.801710964
## 22 0.3656774322 1.987355e-01     0.435587061
## 23 0.1860623006 3.754918e-01     0.438445871
## 24 0.4585216246 4.367976e-02     0.497798618
## 25 0.4528980582 7.020209e-02     0.476899849
## 26 0.1170722469 6.512987e-01     0.231629015
## 27 0.0179987783 9.406955e-01     0.041305693
## 28 0.0173410302 9.480793e-01     0.034579622
## 29 0.0266849672 9.163905e-01     0.056924484
## 30 0.0390526327 8.094128e-01     0.151534570
## 31 0.0422509177 8.852036e-01     0.072545480
## 32 0.0247185679 9.227838e-01     0.052497635
## 33 0.1101921924 5.601168e-01     0.329690974
## 34 0.4678085311 9.858515e-02     0.433606316
## 35 0.3843191182 2.077976e-02     0.594901118
## 36 0.3415407910 1.629742e-01     0.495485041
## 37 0.4100832796 3.881187e-02     0.551104851
## 38 0.2652295227 7.227040e-03     0.727543437
## 39 0.4360041049 5.050641e-05     0.563945389
## 40 0.1580120318 9.053494e-02     0.751453026
## 41 0.7844391252 1.209391e-03     0.214351484
## 42 0.4709515941 3.645727e-04     0.528683833
## 43 0.5531257216 5.449115e-04     0.446329367
## 44 0.8327248616 1.531785e-04     0.167121960
## 45 0.7807387443 7.851417e-04     0.218476114
## 46 0.2221499844 1.442925e-02     0.763420769
## 47 0.9163544015 1.079450e-05     0.083634804
## 48 0.7297299422 2.725182e-04     0.269997540
## 49 0.9119560783 3.576075e-05     0.088008161
## 50 0.6846662311 1.892034e-07     0.315333580
## 51 0.0002806603 1.483931e-06     0.999717856
## 
## $x
##            LD1          LD2
## 1   2.15977336  0.060895255
## 2   2.03996414  0.024977753
## 3   2.05012097  0.042019676
## 4   1.99625973  0.015331966
## 5   2.01419555  0.006702712
## 6   2.01252891  0.021383885
## 7   1.90629370 -0.028006549
## 8   1.96300883 -0.009372415
## 9   1.87196813 -0.049444912
## 10  1.90840336 -0.029642017
## 11  1.92003236  0.022244958
## 12  1.95568731  0.010891077
## 13  1.80248368 -0.084981051
## 14  1.87626086 -0.018363736
## 15  1.80660276 -0.035422136
## 16  1.68366487 -0.047086628
## 17  1.48704196 -0.172339993
## 18  1.79894905 -0.009063233
## 19  1.84807248  0.046682304
## 20 -0.79702543 -0.211050571
## 21 -1.10258806  0.896016015
## 22 -0.37434761 -0.282913674
## 23 -0.07428043  0.208568065
## 24 -0.90506511 -0.265214193
## 25 -0.74748749 -0.316726099
## 26  0.27205282  0.009615968
## 27  0.95005969  0.010777016
## 28  0.98707482 -0.107495921
## 29  0.83032403 -0.026753052
## 30  0.57456127  0.493785278
## 31  0.71079578 -0.178960295
## 32  0.85721567 -0.034905569
## 33  0.17636982  0.355463396
## 34 -0.63042452 -0.438500307
## 35 -1.14039823  0.057863270
## 36 -0.44724832 -0.113362984
## 37 -0.94187477 -0.088950286
## 38 -1.44842928  0.566617812
## 39 -3.03755898  0.242137164
## 40 -0.58336407  0.855549871
## 41 -1.97287476 -1.179530436
## 42 -2.41906214  0.022619074
## 43 -2.28970027 -0.262610351
## 44 -2.58898906 -1.318999043
## 45 -2.11068333 -1.136812774
## 46 -1.21286551  0.705448192
## 47 -3.32204921 -1.819990911
## 48 -2.46669917 -0.852981509
## 49 -2.95417086 -1.839039387
## 50 -4.76203306 -0.282623489
## 51 -3.13054627  6.565552811

These predictions can also help us assess the accuracy of the LDA by creating a table comparing the predicted classes (families based on the predicted values) against the actual families in the data.

lda.predict.class <- predict(kamcoop.lda,
                       newdata=d[,c(7,8)]
                       )$class #find prediction values of class
lda.predict.class #gives the classifications that are in the predictions
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Cercopithecidae
## [21] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [25] Cercopithecidae Cebidae         Cebidae         Cebidae        
## [29] Cebidae         Cebidae         Cebidae         Cebidae        
## [33] Cebidae         Atelidae        Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Atelidae        Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
#determine how well the model fits by comparing predictions to the actual families
table(lda.predict.class, d[,2])
##                  
## lda.predict.class Atelidae Cebidae Cercopithecidae
##   Atelidae               6       0               3
##   Cebidae                0      24               3
##   Cercopithecidae        6       0               9

This is a cross-tabulation of the real families against the predicted families: it shows how many “Atelidae” were predicted to be “Atelidae”, and so on. Based on our table, the model is okay but not great. Only 6/12 Atelidae were correctly classified, and 6 Atelidae were misclassified as Cercopithecidae.
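From the diagonal of that table we can compute the overall accuracy. The counts below are hard-coded from the cross-tabulation above (rows = predicted, columns = actual); note that lda() also accepts CV = TRUE to produce leave-one-out cross-validated predictions, which give a less optimistic estimate.

```r
# Confusion matrix copied from the table above (rows = predicted, cols = actual)
conf <- matrix(c(6,  0, 3,
                 0, 24, 3,
                 6,  0, 9),
               nrow = 3, byrow = TRUE,
               dimnames = list(predicted = c("Atelidae", "Cebidae", "Cercopithecidae"),
                               actual    = c("Atelidae", "Cebidae", "Cercopithecidae")))

accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 39 of 51 species correctly classified, about 0.765
```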

Visualize LDA

One way we can visualize the output of the LDA is to plot the predicted linear discriminants for each family on stacked histograms.

ldahist(lda.predict$x[,1], g = d$Family)

ldahist(lda.predict$x[,2], g = d$Family)

As we saw earlier, Cercopithecidae has the most variance and is spread widely along the first discriminant, so we can expect that it won’t group together that well in a scatter plot. On the second discriminant, all the families shift to the left, so they will probably overlap a bit in the scatter plot.

Making a scatter plot:

newdata <- data.frame(Family = d[,2], lda = lda.predict$x) #make a dataframe with the family names and predicted LDs
newdata
##             Family     lda.LD1      lda.LD2
## 1          Cebidae  2.15977336  0.060895255
## 2          Cebidae  2.03996414  0.024977753
## 3          Cebidae  2.05012097  0.042019676
## 4          Cebidae  1.99625973  0.015331966
## 5          Cebidae  2.01419555  0.006702712
## 6          Cebidae  2.01252891  0.021383885
## 7          Cebidae  1.90629370 -0.028006549
## 8          Cebidae  1.96300883 -0.009372415
## 9          Cebidae  1.87196813 -0.049444912
## 10         Cebidae  1.90840336 -0.029642017
## 11         Cebidae  1.92003236  0.022244958
## 12         Cebidae  1.95568731  0.010891077
## 13         Cebidae  1.80248368 -0.084981051
## 14         Cebidae  1.87626086 -0.018363736
## 15         Cebidae  1.80660276 -0.035422136
## 16         Cebidae  1.68366487 -0.047086628
## 17         Cebidae  1.48704196 -0.172339993
## 18         Cebidae  1.79894905 -0.009063233
## 19         Cebidae  1.84807248  0.046682304
## 20        Atelidae -0.79702543 -0.211050571
## 21        Atelidae -1.10258806  0.896016015
## 22        Atelidae -0.37434761 -0.282913674
## 23        Atelidae -0.07428043  0.208568065
## 24        Atelidae -0.90506511 -0.265214193
## 25        Atelidae -0.74748749 -0.316726099
## 26 Cercopithecidae  0.27205282  0.009615968
## 27         Cebidae  0.95005969  0.010777016
## 28         Cebidae  0.98707482 -0.107495921
## 29         Cebidae  0.83032403 -0.026753052
## 30 Cercopithecidae  0.57456127  0.493785278
## 31         Cebidae  0.71079578 -0.178960295
## 32         Cebidae  0.85721567 -0.034905569
## 33 Cercopithecidae  0.17636982  0.355463396
## 34 Cercopithecidae -0.63042452 -0.438500307
## 35 Cercopithecidae -1.14039823  0.057863270
## 36 Cercopithecidae -0.44724832 -0.113362984
## 37 Cercopithecidae -0.94187477 -0.088950286
## 38 Cercopithecidae -1.44842928  0.566617812
## 39 Cercopithecidae -3.03755898  0.242137164
## 40 Cercopithecidae -0.58336407  0.855549871
## 41        Atelidae -1.97287476 -1.179530436
## 42 Cercopithecidae -2.41906214  0.022619074
## 43 Cercopithecidae -2.28970027 -0.262610351
## 44        Atelidae -2.58898906 -1.318999043
## 45        Atelidae -2.11068333 -1.136812774
## 46 Cercopithecidae -1.21286551  0.705448192
## 47        Atelidae -3.32204921 -1.819990911
## 48        Atelidae -2.46669917 -0.852981509
## 49        Atelidae -2.95417086 -1.839039387
## 50 Cercopithecidae -4.76203306 -0.282623489
## 51 Cercopithecidae -3.13054627  6.565552811
library(ggplot2)
ggplot(newdata) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5)

They are plotted somewhat on top of each other, so let’s try adding another variable, mass dimorphism, from the same dataset:

kamcoop.lda.2 <- lda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Mass_Dimorphism, data = d) #categorical dependent variable is family 
lda.predict.2 <- predict(kamcoop.lda.2)
lda.predict.2
## $class
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Cercopithecidae Atelidae        Cercopithecidae Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Cercopithecidae Cebidae         Cebidae        
## [33] Cercopithecidae Atelidae        Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Atelidae        Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
## 
## $posterior
##        Atelidae      Cebidae Cercopithecidae
## 1  1.477516e-06 9.999983e-01    2.707484e-07
## 2  1.011385e-05 9.999870e-01    2.869293e-06
## 3  5.827367e-05 9.999107e-01    3.104874e-05
## 4  2.032249e-05 9.999729e-01    6.787852e-06
## 5  3.772918e-06 9.999955e-01    7.215359e-07
## 6  2.104817e-05 9.999716e-01    7.335173e-06
## 7  9.863534e-06 9.999880e-01    2.147339e-06
## 8  6.466605e-06 9.999922e-01    1.353236e-06
## 9  6.215873e-06 9.999927e-01    1.071590e-06
## 10 8.126629e-06 9.999902e-01    1.657814e-06
## 11 2.844193e-04 9.995041e-01    2.114447e-04
## 12 4.959365e-05 9.999293e-01    2.107350e-05
## 13 5.208518e-06 9.999941e-01    7.316696e-07
## 14 4.536749e-05 9.999385e-01    1.611665e-05
## 15 7.650989e-05 9.998947e-01    2.881898e-05
## 16 3.587390e-04 9.994512e-01    1.900477e-04
## 17 8.601006e-05 9.998960e-01    1.798848e-05
## 18 3.589728e-04 9.994064e-01    2.346014e-04
## 19 3.567732e-03 9.905437e-01    5.888552e-03
## 20 5.051472e-01 6.523237e-03    4.883296e-01
## 21 8.330024e-02 3.037484e-05    9.166694e-01
## 22 5.375877e-01 5.536599e-02    4.070464e-01
## 23 2.159895e-01 6.152515e-03    7.778580e-01
## 24 5.389498e-01 5.331601e-03    4.557186e-01
## 25 5.711225e-01 1.308015e-02    4.157973e-01
## 26 3.082461e-01 8.719978e-02    6.045541e-01
## 27 1.414057e-01 5.809160e-01    2.776783e-01
## 28 3.614377e-02 9.301642e-01    3.369202e-02
## 29 1.542144e-01 5.965021e-01    2.492836e-01
## 30 6.033347e-02 1.015470e-03    9.386511e-01
## 31 7.243274e-02 8.732434e-01    5.432388e-02
## 32 1.362293e-01 6.541424e-01    2.096283e-01
## 33 1.310767e-01 3.166082e-03    8.657572e-01
## 34 6.373236e-01 3.979355e-02    3.228829e-01
## 35 3.569428e-01 6.119489e-04    6.424452e-01
## 36 4.345348e-01 1.537974e-02    5.500855e-01
## 37 4.311229e-01 2.158437e-03    5.667186e-01
## 38 1.752746e-01 4.790270e-05    8.246775e-01
## 39 3.879168e-01 1.236134e-06    6.120820e-01
## 40 6.285453e-02 6.058919e-05    9.370849e-01
## 41 8.838837e-01 5.994420e-04    1.155169e-01
## 42 4.363092e-01 1.104061e-05    5.636798e-01
## 43 5.533482e-01 2.699007e-05    4.466248e-01
## 44 8.916639e-01 3.594499e-05    1.083001e-01
## 45 8.694343e-01 2.752187e-04    1.302904e-01
## 46 1.252067e-01 4.931946e-05    8.747439e-01
## 47 9.410597e-01 1.674786e-06    5.893866e-02
## 48 7.811075e-01 3.291880e-05    2.188596e-01
## 49 9.497654e-01 1.095331e-05    5.022364e-02
## 50 6.436569e-01 5.111265e-09    3.563431e-01
## 51 5.275722e-04 8.960915e-07    9.994715e-01
## 
## $x
##           LD1          LD2
## 1   3.4592684 -0.295870934
## 2   2.9226842 -0.105683080
## 3   2.4001611  0.222725251
## 4   2.7273353 -0.033559621
## 5   3.2226688 -0.317006815
## 6   2.7125022 -0.007040944
## 7   2.9675781 -0.281310464
## 8   3.0771011 -0.285362578
## 9   3.1151717 -0.412794871
## 10  3.0246864 -0.314508464
## 11  1.9618025  0.361322318
## 12  2.4728962  0.080061523
## 13  3.1885721 -0.540271141
## 14  2.5209846 -0.034912842
## 15  2.3839923 -0.023473655
## 16  1.9543191  0.122567087
## 17  2.4413069 -0.422882283
## 18  1.9234313  0.262852728
## 19  1.2214459  0.759854917
## 20 -1.1514635 -0.126271981
## 21 -2.3837613  1.310176731
## 22 -0.6055386 -0.179522183
## 23 -1.1494847  0.794611421
## 24 -1.1973563 -0.229873630
## 25 -0.9691959 -0.285411593
## 26 -0.4965037  0.510102850
## 27  0.1607580  0.652712850
## 28  0.7206121  0.251991426
## 29  0.1744247  0.519476388
## 30 -1.4928380  1.744650847
## 31  0.5661014  0.066481462
## 32  0.2347643  0.498019434
## 33 -1.2787374  1.191216397
## 34 -0.6697185 -0.474588899
## 35 -1.7385381  0.182138643
## 36 -0.9431740  0.107373363
## 37 -1.4292438  0.028914341
## 38 -2.3303087  0.727050896
## 39 -3.2642843 -0.239025362
## 40 -2.1893216  1.564629940
## 41 -1.5827300 -1.619761780
## 42 -2.7259116 -0.262854221
## 43 -2.4958442 -0.542461948
## 44 -2.2655981 -1.818310244
## 45 -1.7899681 -1.568688893
## 46 -2.2983278  1.010626825
## 47 -2.9353760 -2.426298375
## 48 -2.3769707 -1.257303037
## 49 -2.4514468 -2.440368008
## 50 -4.5840202 -1.256648346
## 51 -2.7589070  4.832508556
newdata.2 <- data.frame(Family = d[,2], lda = lda.predict.2$x) 
newdata.2 #still only 2 LDs despite 3 predictors: the number of discriminants is min(number of groups - 1, number of predictors)
##             Family    lda.LD1      lda.LD2
## 1          Cebidae  3.4592684 -0.295870934
## 2          Cebidae  2.9226842 -0.105683080
## 3          Cebidae  2.4001611  0.222725251
## 4          Cebidae  2.7273353 -0.033559621
## 5          Cebidae  3.2226688 -0.317006815
## 6          Cebidae  2.7125022 -0.007040944
## 7          Cebidae  2.9675781 -0.281310464
## 8          Cebidae  3.0771011 -0.285362578
## 9          Cebidae  3.1151717 -0.412794871
## 10         Cebidae  3.0246864 -0.314508464
## 11         Cebidae  1.9618025  0.361322318
## 12         Cebidae  2.4728962  0.080061523
## 13         Cebidae  3.1885721 -0.540271141
## 14         Cebidae  2.5209846 -0.034912842
## 15         Cebidae  2.3839923 -0.023473655
## 16         Cebidae  1.9543191  0.122567087
## 17         Cebidae  2.4413069 -0.422882283
## 18         Cebidae  1.9234313  0.262852728
## 19         Cebidae  1.2214459  0.759854917
## 20        Atelidae -1.1514635 -0.126271981
## 21        Atelidae -2.3837613  1.310176731
## 22        Atelidae -0.6055386 -0.179522183
## 23        Atelidae -1.1494847  0.794611421
## 24        Atelidae -1.1973563 -0.229873630
## 25        Atelidae -0.9691959 -0.285411593
## 26 Cercopithecidae -0.4965037  0.510102850
## 27         Cebidae  0.1607580  0.652712850
## 28         Cebidae  0.7206121  0.251991426
## 29         Cebidae  0.1744247  0.519476388
## 30 Cercopithecidae -1.4928380  1.744650847
## 31         Cebidae  0.5661014  0.066481462
## 32         Cebidae  0.2347643  0.498019434
## 33 Cercopithecidae -1.2787374  1.191216397
## 34 Cercopithecidae -0.6697185 -0.474588899
## 35 Cercopithecidae -1.7385381  0.182138643
## 36 Cercopithecidae -0.9431740  0.107373363
## 37 Cercopithecidae -1.4292438  0.028914341
## 38 Cercopithecidae -2.3303087  0.727050896
## 39 Cercopithecidae -3.2642843 -0.239025362
## 40 Cercopithecidae -2.1893216  1.564629940
## 41        Atelidae -1.5827300 -1.619761780
## 42 Cercopithecidae -2.7259116 -0.262854221
## 43 Cercopithecidae -2.4958442 -0.542461948
## 44        Atelidae -2.2655981 -1.818310244
## 45        Atelidae -1.7899681 -1.568688893
## 46 Cercopithecidae -2.2983278  1.010626825
## 47        Atelidae -2.9353760 -2.426298375
## 48        Atelidae -2.3769707 -1.257303037
## 49        Atelidae -2.4514468 -2.440368008
## 50 Cercopithecidae -4.5840202 -1.256648346
## 51 Cercopithecidae -2.7589070  4.832508556
library(ggplot2)
ggplot(newdata.2) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5) 

There is a little bit of differentiation between Cebidae and the other two families. This makes sense because Cebidae includes capuchins and squirrel monkeys, which are pretty small in comparison to atelids and cercopithecids.

What about brain size???

kamcoop.lda.3 <- lda(Family ~ Brain_Size_Species_Mean, data = d) 
lda.predict.3 <- predict(kamcoop.lda.3)
lda.predict.3
## $class
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Cebidae        
## [21] Cebidae         Cebidae         Cebidae         Cebidae        
## [25] Cebidae         Atelidae        Atelidae        Atelidae       
## [29] Atelidae        Atelidae        Atelidae        Atelidae       
## [33] Atelidae        Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [45] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [49] Cercopithecidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
## 
## $posterior
##       Atelidae      Cebidae Cercopithecidae
## 1  0.008226368 9.903126e-01     0.001461073
## 2  0.010653790 9.873143e-01     0.002031893
## 3  0.010725694 9.872249e-01     0.002049406
## 4  0.011299404 9.865103e-01     0.002190308
## 5  0.011308901 9.864984e-01     0.002192658
## 6  0.012858399 9.845584e-01     0.002583172
## 7  0.013097723 9.842576e-01     0.002644716
## 8  0.013163749 9.841745e-01     0.002661750
## 9  0.013185830 9.841467e-01     0.002667453
## 10 0.013589570 9.836382e-01     0.002772185
## 11 0.014714144 9.822174e-01     0.003068482
## 12 0.015138391 9.816797e-01     0.003181955
## 13 0.015665934 9.810098e-01     0.003324312
## 14 0.017014998 9.792905e-01     0.003694472
## 15 0.022697858 9.719606e-01     0.005341565
## 16 0.023753296 9.705850e-01     0.005661731
## 17 0.032500727 9.590346e-01     0.008464633
## 18 0.043035056 9.448171e-01     0.012147881
## 19 0.046359986 9.402681e-01     0.013371865
## 20 0.252417014 6.181191e-01     0.129463892
## 21 0.268142171 5.902817e-01     0.141576100
## 22 0.275328500 5.773659e-01     0.147305586
## 23 0.287032581 5.560505e-01     0.156916916
## 24 0.289667241 5.512021e-01     0.159130665
## 25 0.319068403 4.956984e-01     0.185233174
## 26 0.408524291 3.008613e-01     0.290614399
## 27 0.419145287 2.723404e-01     0.308514301
## 28 0.423310488 2.604484e-01     0.316241072
## 29 0.426606489 2.506722e-01     0.322721315
## 30 0.441565086 1.997379e-01     0.358696979
## 31 0.442096357 1.976067e-01     0.360296916
## 32 0.450643727 1.547979e-01     0.394558364
## 33 0.453100642 1.319265e-01     0.414972877
## 34 0.446589812 7.082441e-02     0.482585776
## 35 0.438345110 5.383540e-02     0.507819487
## 36 0.436409582 5.092624e-02     0.512664178
## 37 0.424079176 3.710301e-02     0.538817818
## 38 0.418048282 3.224167e-02     0.549710045
## 39 0.406122250 2.479976e-02     0.569077990
## 40 0.398335224 2.104392e-02     0.580620851
## 41 0.390802139 1.801453e-02     0.591183327
## 42 0.370398246 1.193723e-02     0.617664519
## 43 0.359404469 9.586537e-03     0.631008994
## 44 0.354780799 8.742214e-03     0.636476987
## 45 0.348610446 7.728807e-03     0.643660746
## 46 0.346122618 7.353572e-03     0.646523810
## 47 0.303656933 3.087003e-03     0.693256063
## 48 0.291845994 2.401758e-03     0.705752249
## 49 0.290431983 2.329795e-03     0.707238222
## 50 0.166746549 9.546390e-05     0.833157987
## 51 0.148650155 5.156367e-05     0.851298281
## 
## $x
##           LD1
## 1  -2.1058802
## 2  -1.9867871
## 3  -1.9836837
## 4  -1.9596323
## 5  -1.9592444
## 6  -1.8998918
## 7  -1.8913574
## 8  -1.8890299
## 9  -1.8882540
## 10 -1.8742887
## 11 -1.8374358
## 12 -1.8242463
## 13 -1.8083414
## 14 -1.7699367
## 15 -1.6353266
## 16 -1.6139907
## 17 -1.4658031
## 18 -1.3311930
## 19 -1.2951159
## 20 -0.3326727
## 21 -0.2841821
## 22 -0.2620703
## 23 -0.2259932
## 24 -0.2178468
## 25 -0.1255205
## 26  0.2143022
## 27  0.2713273
## 28  0.2961545
## 29  0.3171025
## 30  0.4361956
## 31  0.4416266
## 32  0.5614955
## 33  0.6367531
## 34  0.9133440
## 35  1.0297216
## 36  1.0529972
## 37  1.1841160
## 38  1.2415289
## 39  1.3478205
## 40  1.4137679
## 41  1.4758360
## 42  1.6387647
## 43  1.7248841
## 44  1.7609612
## 45  1.8090640
## 46  1.8284602
## 47  2.1640158
## 48  2.2602213
## 49  2.2718591
## 50  3.4736523
## 51  3.7017525
newdata.3 <- data.frame(Family = d[,2], lda = lda.predict.2$x) 
newdata.3 
##             Family    lda.LD1      lda.LD2
## 1          Cebidae  3.4592684 -0.295870934
## 2          Cebidae  2.9226842 -0.105683080
## 3          Cebidae  2.4001611  0.222725251
## 4          Cebidae  2.7273353 -0.033559621
## 5          Cebidae  3.2226688 -0.317006815
## 6          Cebidae  2.7125022 -0.007040944
## 7          Cebidae  2.9675781 -0.281310464
## 8          Cebidae  3.0771011 -0.285362578
## 9          Cebidae  3.1151717 -0.412794871
## 10         Cebidae  3.0246864 -0.314508464
## 11         Cebidae  1.9618025  0.361322318
## 12         Cebidae  2.4728962  0.080061523
## 13         Cebidae  3.1885721 -0.540271141
## 14         Cebidae  2.5209846 -0.034912842
## 15         Cebidae  2.3839923 -0.023473655
## 16         Cebidae  1.9543191  0.122567087
## 17         Cebidae  2.4413069 -0.422882283
## 18         Cebidae  1.9234313  0.262852728
## 19         Cebidae  1.2214459  0.759854917
## 20        Atelidae -1.1514635 -0.126271981
## 21        Atelidae -2.3837613  1.310176731
## 22        Atelidae -0.6055386 -0.179522183
## 23        Atelidae -1.1494847  0.794611421
## 24        Atelidae -1.1973563 -0.229873630
## 25        Atelidae -0.9691959 -0.285411593
## 26 Cercopithecidae -0.4965037  0.510102850
## 27         Cebidae  0.1607580  0.652712850
## 28         Cebidae  0.7206121  0.251991426
## 29         Cebidae  0.1744247  0.519476388
## 30 Cercopithecidae -1.4928380  1.744650847
## 31         Cebidae  0.5661014  0.066481462
## 32         Cebidae  0.2347643  0.498019434
## 33 Cercopithecidae -1.2787374  1.191216397
## 34 Cercopithecidae -0.6697185 -0.474588899
## 35 Cercopithecidae -1.7385381  0.182138643
## 36 Cercopithecidae -0.9431740  0.107373363
## 37 Cercopithecidae -1.4292438  0.028914341
## 38 Cercopithecidae -2.3303087  0.727050896
## 39 Cercopithecidae -3.2642843 -0.239025362
## 40 Cercopithecidae -2.1893216  1.564629940
## 41        Atelidae -1.5827300 -1.619761780
## 42 Cercopithecidae -2.7259116 -0.262854221
## 43 Cercopithecidae -2.4958442 -0.542461948
## 44        Atelidae -2.2655981 -1.818310244
## 45        Atelidae -1.7899681 -1.568688893
## 46 Cercopithecidae -2.2983278  1.010626825
## 47        Atelidae -2.9353760 -2.426298375
## 48        Atelidae -2.3769707 -1.257303037
## 49        Atelidae -2.4514468 -2.440368008
## 50 Cercopithecidae -4.5840202 -1.256648346
## 51 Cercopithecidae -2.7589070  4.832508556
ggplot(newdata.3) + geom_point(aes(lda.LD1, lda.LD2, colour = Family), size = 2.5) 

So with all of these variables, we can say that body mass and brain size distinguish Cebidae well from the other two families, but Atelidae and Cercopithecidae are too similar to each other to be separated reliably.

Quadratic Discriminant Analysis (QDA)

Just like ANOVA, linear discriminant analysis (LDA) assumes equal variance across groups for our input variables. If the data fail to meet this criterion, observations are more likely to be assigned to the class showing the greatest variability. This type of bias is the reason quadratic discriminant analysis (QDA) was developed. In other words, LDA and QDA are both classification techniques, but QDA is less restrictive in that it allows a different covariance matrix for each class. Another key difference between LDA and QDA is that the former produces a linear decision boundary while the latter leads to a quadratic decision boundary.

Both LDA and QDA can be derived from the same probabilistic model for the class-conditional distribution of the data \(P(X|y=k)\) for each class \(k\); QDA differs in assuming that each class has its own covariance matrix \(\Sigma_k\).
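To see why the equal-covariance assumption matters, here is a minimal sketch on hypothetical simulated data (not our primate data): the two classes below differ in both mean and spread, so QDA's class-specific covariance matrices fit the boundary better than LDA's single pooled one.

```r
library(MASS)

set.seed(1)
n <- 200
a <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))      # tight class
b <- mvrnorm(n, mu = c(2, 2), Sigma = 9 * diag(2))  # much more variable class
sim <- data.frame(x1 = c(a[, 1], b[, 1]),
                  x2 = c(a[, 2], b[, 2]),
                  class = factor(rep(c("A", "B"), each = n)))

# Resubstitution accuracy of each classifier
acc <- function(fit) mean(predict(fit)$class == sim$class)
acc(lda(class ~ x1 + x2, data = sim))  # straight-line boundary
acc(qda(class ~ x1 + x2, data = sim))  # curved boundary around the tight class
```

With unequal covariances like these, the quadratic boundary typically recovers more of the tight class than the line can.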

To classify an observation \(x\), we evaluate the following quadratic discriminant function for each class \(k\) and assign \(x\) to the class with the largest value:

\[\delta_k(x) = \log \pi_k - \frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\]
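The equation above can be evaluated by hand with nothing more than class means, class covariances, and priors. A sketch using the built-in iris data as a stand-in for any grouped dataset (this is not the Kamilar and Cooper data):

```r
library(MASS)

X <- as.matrix(iris[, c("Sepal.Length", "Petal.Length")])
y <- iris$Species

delta_k <- function(x, k) {
  Xk   <- X[y == k, ]
  pi_k <- mean(y == k)   # prior: the class proportion in the sample
  mu_k <- colMeans(Xk)   # class mean vector
  S_k  <- cov(Xk)        # class-specific covariance matrix (Sigma_k)
  as.numeric(log(pi_k) - 0.5 * log(det(S_k)) -
    0.5 * t(x - mu_k) %*% solve(S_k) %*% (x - mu_k))
}

# Score one observation against each class and pick the largest delta
x0 <- X[51, ]   # the first versicolor flower
deltas <- sapply(levels(y), function(k) delta_k(x0, k))
names(which.max(deltas))
```

qda() and its predict() method perform an equivalent computation internally (via a different factorization), so the hand-computed class agrees with the fitted model's prediction.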

Let’s do a little exercise to see how QDA can be applied to our simplified Kamilar and Cooper dataset and how it differs from LDA.

Packages we need:

library(curl)
library(MASS)
library(ggplot2)
library(klaR)
library(car)
## Loading required package: carData

QDA CHALLENGE

Similar to what we did for LDA, we will now use QDA to determine which variable or combination of variables gives us the best predictive model for which of the three families of primates an individual belongs to. Our data were already loaded earlier in the module, but here they are again in case you need them.

f <- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/kamcoop.simplified.2.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = TRUE)

Now let’s take a more expansive look at our data to see if our other variables differ between groups.

plot(d$Family, d$Body_mass_female_mean, main = "Mean Female Body Mass", ylab = "Body Mass (g)")

plot(d$Family, d$Brain_Size_Species_Mean, main = "Mean Species Brain Size", ylab = "Brain Size (g)")

plot(d$Family, d$Brain_Size_Female_Mean, main = "Mean Female Brain Size", ylab = "Brain Size (g)")

plot(d$Family, d$Mass_Dimorphism, main = "Sexual Dimorphism in Mass", ylab = "Sexual Dimorphism")

Another way we can visualize the variation in our data is by using the scatterplotMatrix() function from the {car} package.

scatterplotMatrix(~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism | Family, data = d, legend = TRUE, smooth = list(method = gamLine), diagonal = TRUE, plot.points = TRUE)

Most of our variables appear to be noticeably different between families, but let's investigate further with QDA to see if one is a better predictor than the others. Instead of lda(), here we need the qda() function, which also comes from the {MASS} package. We can start with a model that includes all of our variables and simplify later if need be.

kamcoop.qda <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism, data = d) #categorical dependent variable is family 
kamcoop.qda
## Call:
## qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + 
##     Brain_Size_Female_Mean + Mass_Dimorphism, data = d)
## 
## Prior probabilities of groups:
##        Atelidae         Cebidae Cercopithecidae 
##       0.2352941       0.4705882       0.2941176 
## 
## Group means:
##                 Body_mass_male_mean Body_mass_female_mean
## Atelidae                   7894.583             6616.2000
## Cebidae                    1113.621              942.2208
## Cercopithecidae           11213.687             6820.7333
##                 Brain_Size_Species_Mean Brain_Size_Female_Mean
## Atelidae                       80.56833               78.77833
## Cebidae                        23.94333               23.60667
## Cercopithecidae                95.98533               91.16600
##                 Mass_Dimorphism
## Atelidae               1.230583
## Cebidae                1.083250
## Cercopithecidae        1.588533

QDA Partition Plots

Next we can use the partimat() function from the {klaR} package which allows us to plot the results of the QDA. This will create a figure for every possible combination of two variables from our data.

partimat(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean + Mass_Dimorphism, data = d, method="qda")

Think of each of these plots as a different perspective on our data. The differently colored regions outline each classification area: an observation that falls within a region is predicted to belong to that region's class. The apparent error rate at the top of each figure tells us what proportion of observations fall outside their predicted region.
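The apparent error rate partimat() prints is simply the fraction of training observations misclassified by a model fit on that pair of variables. A minimal sketch of the same computation, using the built-in iris data as a stand-in for one two-variable panel:

```r
library(MASS)

# Fit a QDA on a single pair of predictors, as in one partimat() panel
fit <- qda(Species ~ Sepal.Width + Petal.Width, data = iris)

# Apparent error rate: share of training points whose predicted class
# differs from their true class
aper <- mean(predict(fit)$class != iris$Species)
aper
```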

QDA Predictions

The predict() function works identically to the way we used it for LDA. We will again use this function to create a confusion matrix for our data.

qda.predict <- predict(kamcoop.qda)
qda.predict
## $class
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Atelidae        Atelidae        Atelidae        Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Cercopithecidae Cebidae         Cebidae        
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
## 
## $posterior
##        Atelidae       Cebidae Cercopithecidae
## 1  3.858812e-63  1.000000e+00    1.423939e-23
## 2  1.259996e-39  1.000000e+00    1.824540e-21
## 3  9.278062e-19  1.000000e+00    3.991246e-19
## 4  9.109489e-32  1.000000e+00    1.268353e-20
## 5  5.155527e-55  1.000000e+00    6.465711e-22
## 6  5.076162e-31  1.000000e+00    1.796671e-19
## 7  8.740848e-47  1.000000e+00    1.779306e-21
## 8  4.101006e-50  1.000000e+00    3.599706e-21
## 9  3.917562e-55  1.000000e+00    6.067798e-22
## 10 1.516864e-49  1.000000e+00    2.435858e-21
## 11 4.253817e-11  1.000000e+00    2.466262e-16
## 12 5.213397e-23  1.000000e+00    1.920585e-17
## 13 2.887668e-63  1.000000e+00    1.011983e-21
## 14 2.563074e-28  1.000000e+00    2.817876e-18
## 15 1.633880e-25  1.000000e+00    1.938416e-16
## 16 5.984879e-15  1.000000e+00    5.494075e-15
## 17 2.729108e-34  1.000000e+00    8.618578e-15
## 18 5.098725e-14  1.000000e+00    5.264144e-11
## 19 1.226205e-07  9.999994e-01    4.312588e-07
## 20 1.000000e+00 8.517466e-267    5.097335e-10
## 21 1.000000e+00  0.000000e+00    7.060414e-17
## 22 9.999980e-01 4.130779e-118    1.965400e-06
## 23 9.999979e-01 4.768397e-186    2.134411e-06
## 24 1.000000e+00 1.593256e-267    8.593283e-09
## 25 9.999966e-01 2.388903e-205    3.376183e-06
## 26 1.123581e-02  6.396755e-50    9.887642e-01
## 27 7.198739e-10  9.997299e-01    2.701147e-04
## 28 3.509460e-11  9.996593e-01    3.407163e-04
## 29 6.084442e-09  9.995180e-01    4.820043e-04
## 30 1.210549e-21 1.367158e-107    1.000000e+00
## 31 3.508459e-12  9.999983e-01    1.711234e-06
## 32 1.861134e-09  9.999925e-01    7.449099e-06
## 33 7.636107e-08 2.624225e-112    9.999999e-01
## 34 9.806719e-14  7.100828e-80    1.000000e+00
## 35 9.596781e-04 2.337239e-281    9.990403e-01
## 36 4.922369e-04  1.612531e-69    9.995078e-01
## 37 7.113299e-05 7.834125e-154    9.999289e-01
## 38 1.847042e-04  0.000000e+00    9.998153e-01
## 39 1.358517e-05  0.000000e+00    9.999864e-01
## 40 1.433789e-14  0.000000e+00    1.000000e+00
## 41 9.999914e-01 4.040575e-210    8.642646e-06
## 42 3.777099e-04  0.000000e+00    9.996223e-01
## 43 5.241684e-08  0.000000e+00    9.999999e-01
## 44 9.975638e-01 9.231261e-230    2.436202e-03
## 45 1.000000e+00 7.197630e-297    2.355335e-10
## 46 1.658995e-11  0.000000e+00    1.000000e+00
## 47 9.816954e-01  0.000000e+00    1.830460e-02
## 48 9.999987e-01 1.453697e-259    1.278303e-06
## 49 9.999903e-01 1.200042e-287    9.693333e-06
## 50 2.226455e-47  0.000000e+00    1.000000e+00
## 51 0.000000e+00  0.000000e+00    1.000000e+00

This doesn’t tell us much. Let’s create a matrix like we did with LDA.

qda.predict.class <- predict(kamcoop.qda,
                       newdata=d[,c(5,6,7,8,9)]
                       )$class
qda.predict.class
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Atelidae        Atelidae        Atelidae        Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Cercopithecidae Cebidae         Cebidae        
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class, d[,2])
##                  
## qda.predict.class Atelidae Cebidae Cercopithecidae
##   Atelidae              12       0               0
##   Cebidae                0      24               0
##   Cercopithecidae        0       0              15

Our model predicted the correct family 100% of the time. This appears to be more accurate than the LDA model using mean male and female body mass only. However, we should also test simpler models to see if we get similar results. As we learned this semester, simpler is better when it comes to modeling!

Let’s test some other modeling options!

Removing Mass_Dimorphism:

kamcoop.qda1 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean + Brain_Size_Female_Mean, data = d) 
qda.predict.class1 <- predict(kamcoop.qda1,
                       newdata=d[,c(5,6,7,8)]
                       )$class
qda.predict.class1
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Atelidae        Atelidae        Atelidae        Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Cercopithecidae Cebidae         Cebidae        
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class1, d[,2])
##                   
## qda.predict.class1 Atelidae Cebidae Cercopithecidae
##    Atelidae              12       0               0
##    Cebidae                0      24               0
##    Cercopithecidae        0       0              15

Still looking good!

Now let’s try without Mass_Dimorphism and Brain_Size_Female_Mean:

kamcoop.qda2 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean, data = d) 
qda.predict.class2 <- predict(kamcoop.qda2,
                       newdata=d[,c(5,7,8)]
                       )$class
qda.predict.class2
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Atelidae        Atelidae        Atelidae        Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Cercopithecidae Cebidae         Cebidae        
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class2, d[,2])
##                   
## qda.predict.class2 Atelidae Cebidae Cercopithecidae
##    Atelidae              12       0               0
##    Cebidae                0      24               0
##    Cercopithecidae        0       0              15

Still 100%!

Now we try with just Male and Female body mass:

kamcoop.qda3 <- qda(Family ~ Body_mass_male_mean + Body_mass_female_mean, data = d) 
qda.predict.class3 <- predict(kamcoop.qda3,
                       newdata=d[,c(7,8)]
                       )$class
qda.predict.class3
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Atelidae       
## [21] Cercopithecidae Atelidae        Atelidae        Atelidae       
## [25] Atelidae        Cercopithecidae Cebidae         Cebidae        
## [29] Cebidae         Atelidae        Cebidae         Cebidae        
## [33] Atelidae        Atelidae        Atelidae        Atelidae       
## [37] Atelidae        Cercopithecidae Cercopithecidae Cercopithecidae
## [41] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [45] Atelidae        Cercopithecidae Atelidae        Atelidae       
## [49] Atelidae        Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class3, d[,2])
##                   
## qda.predict.class3 Atelidae Cebidae Cercopithecidae
##    Atelidae              11       0               6
##    Cebidae                0      24               0
##    Cercopithecidae        1       0               9

And now we see we are starting to lose accuracy.

One more combination, just for the sake of being thorough, without body mass or sexual dimorphism:

kamcoop.qda4 <- qda(Family ~ Brain_Size_Species_Mean + Brain_Size_Female_Mean, data = d) 
qda.predict.class4 <- predict(kamcoop.qda4,
                       newdata=d[,c(5,6)]
                       )$class
qda.predict.class4
##  [1] Cebidae         Cebidae         Cebidae         Cebidae        
##  [5] Cebidae         Cebidae         Cebidae         Cebidae        
##  [9] Cebidae         Cebidae         Cebidae         Cebidae        
## [13] Cebidae         Cebidae         Cebidae         Cebidae        
## [17] Cebidae         Cebidae         Cebidae         Cebidae        
## [21] Atelidae        Atelidae        Atelidae        Cebidae        
## [25] Cebidae         Cebidae         Cebidae         Cebidae        
## [29] Atelidae        Atelidae        Cebidae         Atelidae       
## [33] Cercopithecidae Cercopithecidae Cercopithecidae Cercopithecidae
## [37] Cercopithecidae Atelidae        Atelidae        Cercopithecidae
## [41] Atelidae        Atelidae        Cercopithecidae Cercopithecidae
## [45] Atelidae        Cercopithecidae Cercopithecidae Atelidae       
## [49] Cercopithecidae Cercopithecidae Cercopithecidae
## Levels: Atelidae Cebidae Cercopithecidae
table(qda.predict.class4, d[,2])
##                   
## qda.predict.class4 Atelidae Cebidae Cercopithecidae
##    Atelidae               6       2               4
##    Cebidae                3      22               1
##    Cercopithecidae        3       0              10

Not very good!

It looks as though our best and simplest model is:

Family ~ Body_mass_male_mean + Body_mass_female_mean + Brain_Size_Species_Mean

As such, we can use this model and QDA to predict how an unknown sample should be classified into one of our three Families of primates.
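One caveat: the 100% figures above are resubstitution accuracies, meaning the models were scored on the same observations they were fit to, which is optimistic. qda() (and lda()) in {MASS} can do leave-one-out cross-validation via the CV = TRUE argument; here is a sketch on the built-in iris data, since the workflow for our d object would be identical.

```r
library(MASS)

# With CV = TRUE, qda() returns leave-one-out predictions ($class, $posterior)
# instead of a fitted model object
fit_cv <- qda(Species ~ ., data = iris, CV = TRUE)

# Cross-validated confusion matrix and overall accuracy
cv_tab <- table(predicted = fit_cv$class, actual = iris$Species)
cv_acc <- sum(diag(cv_tab)) / sum(cv_tab)
cv_acc
```

If the cross-validated accuracy stays high, we can be more confident the model will classify an unknown sample correctly.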

Canonical Discriminant Analysis (CDA)

Canonical discriminant analysis (CDA) is a multivariate statistical technique that identifies differences among groups of treatments (the independent variables) and shows us the relationships between the various (dependent) variables within those groups. CDA determines the best way to discriminate the treatment groups from each other using quantitative measurements of all the variables within each group.

Running the CDA yields several uncorrelated canonical discriminant functions (CDFs): linear combinations of the variables that best separate the original treatment group means relative to within-group variation. Because the CDFs are uncorrelated, each one pulls out a completely new dimension of information from the groups and variables. CDF1 captures the maximum possible variation among groups, therefore yielding the greatest degree of group differences. CDF2 is uncorrelated with CDF1, so it reflects differences not shown in CDF1. Similarly, CDF3 is uncorrelated with both CDF1 and CDF2 and reflects its own unique differences between groups.
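This pairwise uncorrelatedness can be checked numerically. As a sketch, the discriminant scores from an LDA on the built-in iris data share the same property (three groups give two discriminant axes):

```r
library(MASS)

# Discriminant scores for a three-group problem: two discriminant functions
scores <- predict(lda(Species ~ ., data = iris))$x

# Successive discriminant functions are uncorrelated (zero up to rounding)
cor(scores[, "LD1"], scores[, "LD2"])
```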

Time to load in some data! To all my true crime lovers out there…

Load in the USA Arrests dataset, which shows number of arrests per 100,000 residents for assault, murder, and burglary in each of the 50 US states in 1973 as well as the percentage of residents living in urban areas in each state. (Note: This dataset has been modified for the purpose of this module.)

library(curl)
df<- curl("https://raw.githubusercontent.com/laurabw/AN597_GroupProject_LLMC/master/USArrests_data_NEW.csv")
crimes<- read.csv(df, header = T, sep = ",", stringsAsFactors = T)
head(crimes)
##        State Murder Assault UrbanPop Burglary
## 1    Alabama   13.2     236       58    148.4
## 2     Alaska   10.0     263       48    311.5
## 3    Arizona    8.1     294       80    217.0
## 4   Arkansas    8.8     190       50    136.5
## 5 California    9.0     276       91    284.2
## 6   Colorado    7.9     204       78    270.9

Let’s assign our 50 states to 6 different geographical regions for the purpose of this analysis.

crimes$region <- "temp" #creating a new variable name

crimes$region[row(data.frame(levels(crimes$State))) %in% c(7, 8, 19, 20, 21, 29, 30, 32, 38, 39, 45)] <- "Northeast"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(1, 4, 9, 10, 18, 24, 33, 40, 42, 46, 48)] <- "Southeast"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(3, 31, 36, 43)] <- "Southwest"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(5, 6, 12, 26, 28, 37, 44, 47, 50)] <- "West"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(13, 14, 15, 16, 17, 22, 23, 25, 27, 34, 35, 41, 49)] <- "Midwest"
crimes$region[row(data.frame(levels(crimes$State))) %in% c(2, 11)] <- "Noncontiguous"
crimes$region <- as.factor(crimes$region)
levels(crimes$region)
## [1] "Midwest"       "Noncontiguous" "Northeast"     "Southeast"    
## [5] "Southwest"     "West"

Data exploration: What are we looking at?

Let’s explore our data a bit and view each variable separately across the 6 regions.

par(mfrow = c(2,2))
par(xaxt = "n") #this just removes the built in x-axis labels so that we can insert our own and put them on an angle
lablist<-as.vector(c("MW", "NC", "NE", "SE", "SW", "W")) #our new x-axis labels

plot(data = crimes, Murder ~ region, xlab = "Region")
axis(1, at = seq(1, 6, by = 1), labels = FALSE)
text(seq(1, 6, by = 1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE) #par("usr")[3] is the bottom of the plotting region, so this positions our new x-axis labels just below the axis

plot(data = crimes, Assault ~ region, xlab = "Region")
axis(1, at = seq(1, 6, by = 1), labels = FALSE)
text(seq(1, 6, by = 1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)

plot(data = crimes, UrbanPop ~ region, xlab = "Region")
axis(1, at = seq(1, 6, by = 1), labels = FALSE)
text(seq(1, 6, by = 1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)

plot(data = crimes, Burglary ~ region, xlab = "Region")
axis(1, at = seq(1, 6, by = 1), labels = FALSE)
text(seq(1, 6, by = 1), par("usr")[3] - 0.2, labels = lablist, srt = 45, pos = 1, xpd = TRUE)

We should probably test for normality and see what we're working with. Since we have a multivariate dataset, we can use the Shapiro-Wilk multivariate normality test (conveniently, there is a ready-made function for this, mshapiro.test(), in the {mvnormtest} package).

Let’s create a matrix that has all of our dependent variables:

library(mvnormtest)
crimes_matrix<- cbind(crimes$Murder, crimes$Assault, crimes$UrbanPop, crimes$Burglary)

#mshapiro.test() requires that the variables be in rows and the observations be in columns; the transpose function t() takes care of this
crimes_matrix_t<- t(crimes_matrix)  
mshapiro.test(crimes_matrix_t)
## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.93822, p-value = 0.01147

The p-value from this test is < 0.05, which means that, technically, our data are not multivariate normal. Normality and homogeneity assumptions are not always considered absolute prerequisites for CDA, so for the purpose of this module we will continue with this dataset. However, note that in most cases you should try to transform your data toward normality beforehand!
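One common remedy is to log transform right-skewed variables and re-check normality afterward. A sketch on hypothetical simulated data, using base R's univariate shapiro.test() (the same statistic that mshapiro.test() generalizes to the multivariate case):

```r
set.seed(20)
x <- rlnorm(100, meanlog = 4, sdlog = 1)  # hypothetical right-skewed variable

# Shapiro-Wilk p-values before and after the log transformation
p_raw <- shapiro.test(x)$p.value       # raw data: strongly non-normal
p_log <- shapiro.test(log(x))$p.value  # log(x) is exactly normal by construction
c(raw = p_raw, logged = p_log)
```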

Next Steps: Running a MANOVA

In order to investigate between-group differences in multivariate data, we can use multivariate analysis of variance, or MANOVA. Remember, ANOVA looks at only one response (dependent) variable and tests for a difference in its means across two or more groups. MANOVA extends this analysis to several dependent variables and tests for a difference between two or more vectors of means. A MANOVA is the appropriate starting point for CDA; the CDA then provides a visual description of the between-group differences that the MANOVA alone does not.

colnames(crimes_matrix)= c("Murder", "Assault", "Urban Population", "Burglary") #setting column names in our matrix now will allow for better visualization with our CDA plot
crimes_manova<- manova(data= crimes, crimes_matrix ~ region)
summary(crimes_manova)
##           Df Pillai approx F num Df den Df    Pr(>F)    
## region     5 1.2876   4.1774     20    176 9.527e-08 ***
## Residuals 44                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Great! We can see that there is a significant difference in the means of our variables across the 6 regions. To look at the differences more closely, let’s quickly run an ANOVA summary…

summary.aov(crimes_manova)
##  Response Murder :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## region       5 417.68  83.536  7.1806 5.477e-05 ***
## Residuals   44 511.87  11.634                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Assault :
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## region       5  91313 18262.5  3.2271 0.01441 *
## Residuals   44 249000  5659.1                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Urban Population :
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## region       5 1959.0  391.81  2.0752 0.08661 .
## Residuals   44 8307.4  188.80                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Burglary :
##             Df Sum Sq Mean Sq F value   Pr(>F)   
## region       5  62749 12549.7  3.7338 0.006647 **
## Residuals   44 147889  3361.1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant difference in the mean number of murders, assaults, and burglaries across the regions but not in the percentage of people living in urban areas.

CDA Time!

library(candisc)
## Loading required package: heplots
## 
## Attaching package: 'candisc'
## The following object is masked from 'package:stats':
## 
##     cancor
crimes_CDA<- candisc(crimes_manova)
summary(crimes_CDA)
## 
## Canonical Discriminant Analysis for region:
## 
##     CanRsq Eigenvalue Difference  Percent Cumulative
## 1 0.695113    2.27990      1.598 71.54378     71.544
## 2 0.405419    0.68186      1.598 21.39682     92.941
## 3 0.176408    0.21419      1.598  6.72145     99.662
## 4 0.010655    0.01077      1.598  0.33795    100.000
## 
## Class means:
## 
##                   Can1       Can2
## Midwest        0.39690  0.0082801
## Noncontiguous  1.70210 -1.9205945
## Northeast      0.27187  1.2616902
## Southeast     -2.45823 -0.4275676
## Southwest      0.11105 -0.0409363
## West           1.67131 -0.5864507
## 
##  std coefficients:
##                      Can1     Can2
## Murder           -1.38225 -0.41149
## Assault          -0.22115  0.83969
## Urban Population  0.33252  0.66148
## Burglary          1.19140 -1.30648

The summary() function with our CDA object shows us the standardized canonical coefficients for the first two dimensions across our dependent variables. The canonical coefficients yield relative information on each variable in distinguishing between groups. Each canonical dimension is most strongly influenced by the variable(s) with the greatest absolute value coefficients.

We can see that CDF1 is most strongly influenced by the Murder and Burglary variables while CDF2 is most strongly influenced by Assault and Burglary.

Let’s plot this sucker!

par(mfrow = c(1,1))
heplot(crimes_CDA, term = "region") 

## Vector scale factor set to  3.503351

Overall, we can see that the mean numbers of arrests for murder and assault in 1973 are most strongly associated with the southeast region of the United States. Arrests for burglary are most strongly associated with the noncontiguous states, Hawaii and Alaska. Remember that our ANOVA showed that the percentage of people living in urban areas was not significantly different across the regions; accordingly, UrbanPop is only weakly associated with any of the regions in our CDA plot. In addition, the further variables extend from the "Error" circle, the stronger the association. We also see that the northeast, midwest, and southwest regions are not strongly associated with any of our response variables.

CDA CHALLENGE

Load in the built-in R dataset “Wine” in the {candisc} package, which contains the results of a chemical analysis of three types of wine, Barolo, Grignolino, and Barbera (all red wine), grown in a specific area of Italy.

data("Wine")
head(Wine)
##   Cultivar Alcohol MalicAcid  Ash AlcAsh  Mg Phenols Flav NonFlavPhenols
## 1   barolo   14.23      1.71 2.43   15.6 127    2.80 3.06           0.28
## 2   barolo   13.20      1.78 2.14   11.2 100    2.65 2.76           0.26
## 3   barolo   13.16      2.36 2.67   18.6 101    2.80 3.24           0.30
## 4   barolo   14.37      1.95 2.50   16.8 113    3.85 3.49           0.24
## 5   barolo   13.24      2.59 2.87   21.0 118    2.80 2.69           0.39
## 6   barolo   14.20      1.76 2.45   15.2 112    3.27 3.39           0.34
##   Proa Color  Hue   OD Proline
## 1 2.29  5.64 1.04 3.92    1065
## 2 1.28  4.38 1.05 3.40    1050
## 3 2.81  5.68 1.03 3.17    1185
## 4 2.18  7.80 0.86 3.45    1480
## 5 1.82  4.32 1.04 2.93     735
## 6 1.97  6.75 1.05 2.85    1450
levels(Wine$Cultivar) #these are the categories of our independent variable... the three types of wine
## [1] "barolo"     "grignolino" "barbera"

Since this dataset is also not normally distributed, we are going to log transform the variables that are closest to normally distributed: Alcohol, Alkalinity of Ash (AlcAsh), and Color.

par(mfrow = c(1,2))
hist(log(Wine$Alcohol))
qqnorm(log(Wine$Alcohol))

hist(log(Wine$AlcAsh))
qqnorm(log(Wine$AlcAsh))

hist(log(Wine$Color))
qqnorm(log(Wine$Color))
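If you would rather not rely on eyeballing histograms and Q-Q plots alone, the Shapiro-Wilk test (shapiro.test()) gives a formal check, where a small p-value rejects normality. A minimal sketch on simulated right-skewed data; with the real data you would pass, e.g., Wine$Color and log(Wine$Color) instead:

```r
# Shapiro-Wilk normality test: a small p-value rejects normality.
# Simulated log-normal data stands in for a right-skewed variable
# like Color; its log is exactly normal by construction.
set.seed(1)
x <- rlnorm(200, meanlog = 1, sdlog = 1)
p_raw <- shapiro.test(x)$p.value      # expect tiny: raw data is skewed
p_log <- shapiro.test(log(x))$p.value # expect much larger: log is normal
c(raw = p_raw, log = p_log)
```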

Plot these log transformed variables against the Cultivar variable.

par(mfrow = c(2,2))
par(xaxt = "s") #since we removed the x-axes above, this adds them back in
plot(data = Wine, log(Alcohol) ~ Cultivar) 
plot(data = Wine, log(AlcAsh) ~ Cultivar)
plot(data = Wine, log(Color) ~ Cultivar)

Create a log transformed matrix containing these variables.

wine_matrix<- cbind(Wine$Alcohol, Wine$AlcAsh, Wine$Color)
log_wine_matrix<- log(wine_matrix) #log transforming the matrix

Name the columns of the matrix and run the appropriate MANOVA to use for this CDA. What does the MANOVA tell us? How can you tell which variables are significantly different across the three types of wine?

colnames(log_wine_matrix) = c("Alcohol", "Alkalinity of Ash", "Color")
log_wine_manova<- manova(data = Wine, log_wine_matrix ~ Cultivar)
summary(log_wine_manova)
##            Df Pillai approx F num Df den Df    Pr(>F)    
## Cultivar    2 1.1529   78.946      6    348 < 2.2e-16 ***
## Residuals 175                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(log_wine_manova)
##  Response Alcohol :
##              Df  Sum Sq  Mean Sq F value    Pr(>F)    
## Cultivar      2 0.42355 0.211774  135.51 < 2.2e-16 ***
## Residuals   175 0.27350 0.001563                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Alkalinity of Ash :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Cultivar      2 1.6350 0.81748  37.316 3.176e-14 ***
## Residuals   175 3.8337 0.02191                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Color :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Cultivar      2 23.948 11.9741  154.81 < 2.2e-16 ***
## Residuals   175 13.536  0.0773                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Run the CDA and investigate which variables most strongly influence the first two CDFs, then plot it.

log_wine_CDA<- candisc(log_wine_manova)
summary(log_wine_CDA)
## 
## Canonical Discriminant Analysis for Cultivar:
## 
##    CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.69347    2.26235     1.4123  72.688     72.688
## 2 0.45948    0.85006     1.4123  27.312    100.000
## 
## Class means:
## 
##               Can1      Can2
## barolo      1.2609  1.043183
## grignolino -1.8302  0.030076
## barbera     1.1573 -1.326732
## 
##  std coefficients:
##                       Can1     Can2
## Alcohol            0.56268  0.60585
## Alkalinity of Ash -0.06216 -0.65614
## Color              0.65306 -0.68300
par(mfrow= c(1,1))
heplot(log_wine_CDA, term = "Cultivar")

## Vector scale factor set to  10.17842
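As an aside, the Percent column in the CDA summary is nothing mysterious: it is simply each eigenvalue’s share of the total. A quick check using the eigenvalues printed above:

```r
# Percent = 100 * eigenvalue / sum(eigenvalues)
eig <- c(2.26235, 0.85006)  # eigenvalues from summary(log_wine_CDA)
pct <- 100 * eig / sum(eig)
round(pct, 3) # matches the Percent column: 72.688 27.312
```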

Let’s add some color!

myColors <- hsv((c(0,120,240) + 80)/360,s = 0.9,v = 0.8,0.7) #creating a vector for the colors
plot(log_wine_CDA, col = myColors, pch = rep(16,20), xpd = T)
## Vector scale factor set to 5.693

Thanks for joining us! We hope you enjoyed the module and learned a few things about discriminant analysis!