Linear and Quadratic Discriminant Analysis

Preliminaries

Packages

# data handling
library(curl)
library(tidyverse)
# discriminant analysis
library(MASS)
# data visualization
library(ggplot2)
library(plotly)
library(geometry)
library(mulgar)
library(ggpubr)

Introduction

What is Discriminant Analysis?
Discriminant analysis is a type of statistical analysis used to predict the probability of belonging to a given class or category based on one or more predictor variables. We will be covering two different types of discriminant analysis: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
Implementing Discriminant Analysis
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is used to find a linear combination of features that separates different classes in a data set. LDA works by calculating the means and covariance matrices for each data class, assuming a common covariance matrix between all classes. When effectively utilized, LDA finds the linear combination of variables that provide the best possible separation between the groups.
Properties and Assumptions of LDA
LDA operates by projecting a feature space (a data set with \(n\) dimensions) onto a smaller space of dimension \(g\), where \(g \le n-1\), without losing class information. Each class is summarized by a covariance matrix that measures how each variable or feature relates to the others within that class.
An LDA model comprises the statistical properties calculated for the data in each class; where there are multiple variables, these properties are calculated over the multivariate Gaussian (normal) distribution. The model assumes the following:

- the input data set has a normal distribution
- the data set is linearly separable, meaning LDA can draw a decision boundary that separates the data points
- each class has the same covariance matrix

- Note: These assumptions carry the limitation that LDA may not perform well in very high-dimensional feature spaces. A rough way to eyeball the assumptions in R is sketched below.
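This helper is purely illustrative (the function name and arguments are ours, not from any package): it runs a Shapiro-Wilk test on each variable within each class and returns each class's covariance matrix so they can be compared by eye.

# Informal assumption checks: `dat` is a data frame of numeric predictors,
# `cls` is the class factor (illustrative helper, not a formal test suite)
check_lda_assumptions <- function(dat, cls) {
  by(dat, cls, function(sub) {
    list(
      # Shapiro-Wilk p-values per variable; very small values suggest non-normality
      shapiro_p = apply(sub, 2, function(v) shapiro.test(v)$p.value),
      # class covariance matrix; compare these across classes for rough equality
      cov_mat = cov(sub)
    )
  })
}

Once the simulated lda_data below exists, this could be called as check_lda_assumptions(lda_data[, 1:3], lda_data$geno).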
Covariance Matrix
The pooled covariance structure is built from two pieces. The between-group sum of squares matrix, \(B\), is defined as:
\[ B = \sum_{k=1}^{g}n_k \, (\bar{X}_k - \bar{X})(\bar{X}_k - \bar{X})^\top \]
This measures the differences between the class means \(\bar{X}_k\) and the mean of the whole data set, \(\bar{X}\). The within-group sum of squares matrix,
\[ W = \sum_{k = 1}^g \sum_{i = 1}^{n_k} \, (X_{ki} - \bar{X}_k)(X_{ki} - \bar{X}_k)^\top \]
measures the variation of the observations \(X_{ki}\) around each class mean, \(\bar{X}_k\).
The linear discriminant space is generated by computing the eigenvectors of \(W^{-1}B\); this is the \(g\)-dimensional space in which the group means are most separated with respect to the pooled covariance. For each class, we compute:
\[ \delta_k(x) = \left(x - \frac{1}{2}\mu_k\right)^\top W^{-1} \mu_k + \log{\pi_k} \]
where \(\pi_k\) is a prior probability for class \(k\). The LDA classifier rule is to assign a new observation to the class with the largest value of \(\delta_k(x)\).
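Written as a small R sketch (illustrative only; lda() in {MASS} handles this, along with the projection itself, internally), the rule scores each class and picks the largest \(\delta_k(x)\):

# Minimal sketch of the LDA rule above.
# `mu` holds one class mean per row, `W` is the pooled within-class covariance,
# and `prior` holds the class priors; these names are ours, not MASS's.
lda_classify <- function(x, mu, W, prior) {
  W_inv <- solve(W)
  delta <- sapply(seq_len(nrow(mu)), function(k) {
    drop((x - 0.5 * mu[k, ]) %*% W_inv %*% mu[k, ]) + log(prior[k])
  })
  which.max(delta)  # index of the class with the largest discriminant score
}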
Let's start by simulating some data. Linear discriminant analysis works best with normal, continuous variables with low collinearity. Considering this, let's imagine a study where a geneticist wants to classify a monkey into one of three genotypes based on the transcription levels of certain genes:
# First, let's set a seed for reproducibility,
set.seed(647)

# then set n as the number of observations to simulate.
n <- 300

# Now we can simulate the transcription frequencies.
# Let's assume that the classification is set up as such:
# Genotype A: higher expression of gene X
# Genotype B: higher expression of gene Y
# Genotype C: higher expression of gene Z
geno <- sample(c(0, 1, 2), size = n, replace = TRUE)

# now the gene expressions
gene_x <- rnorm(n, mean = 5 + (2 * (geno == 0)), sd = 1)
# produces a distribution that is higher in genotype A individuals
gene_y <- rnorm(n, mean = 5 + (2 * (geno == 1)), sd = 1)
# produces a distribution that is higher in genotype B individuals
gene_z <- rnorm(n, mean = 5 + (2 * (geno == 2)), sd = 1)
# produces a distribution that is higher in genotype C individuals

# Now, let's store this simulated data
lda_data <- data.frame(
  gene_x = gene_x,
  gene_y = gene_y,
  gene_z = gene_z,
  geno = factor(geno, labels = c("A", "B", "C"))
)
# and lastly, check the first few rows to make sure everything is right
head(lda_data)
| gene_x   | gene_y   | gene_z   | geno |
|----------|----------|----------|------|
| 6.200588 | 4.518938 | 2.835381 | A    |
| 5.933105 | 7.303336 | 5.784455 | B    |
| 6.971770 | 5.305419 | 5.695243 | A    |
| 6.231134 | 6.795563 | 4.026929 | B    |
| 6.881066 | 5.391924 | 6.790874 | A    |
| 5.117153 | 4.388053 | 7.003496 | C    |
Challenge 1 - Linear Discriminant Analysis with 2 Classes
Linear discriminant analysis reduces the number of dimensions by creating a new axis while simultaneously considering two criteria:
- On the new axis, we want to maximize the distance, \(d = \mu_1 - \mu_2\), between the means of each class
- On the new axis, we want to minimize the spread, \(s^2\), in each class
When we have two classes, we can accomplish this by maximizing the ratio:
\[ \frac{(\mu_1 - \mu_2)^2}{s^2_1 + s^2_2} = \frac{d^2}{s^2_1 + s^2_2} \]
For an example, let's start by calculating the values we need. The means are easy, but we have to make sure to distinguish between genotype and gene.
mu1 <- colMeans(lda_data[lda_data$geno == "A", 1:2])
mu2 <- colMeans(lda_data[lda_data$geno == "B", 1:2])
Things get a little tricky when it comes to spread. Because LDA assumes that the data all share a covariance matrix, \(W\), we can't just use the spread of the raw data. The covariance matrix is computed internally by lda(), but it can be computed manually like this:
\[ W = \frac{S_1(n_1 - 1) + S_2(n_2 - 1)}{(n_1 + n_2) - 2} \]
where
\(W\) = the pooled covariance matrix
\(S_1\) = covariance of class 1
\(S_2\) = covariance of class 2
\(n_1\) = size of class 1
\(n_2\) = size of class 2
S1 <- cov(lda_data[lda_data$geno == "A", 1:2])
# covariance of class A in gene x and gene y
S2 <- cov(lda_data[lda_data$geno == "B", 1:2])
# covariance of class B in gene x and gene y
n1 <- sum(lda_data$geno == "A")
# size of class A
n2 <- sum(lda_data$geno == "B")
# size of class B
cov_pooled <- ((n1 - 1)*S1 + (n2 - 1)*S2) / (n1 + n2 - 2)

cov_pooled
gene_x gene_y
gene_x 0.90394338 -0.04142668
gene_y -0.04142668 0.99434430
Division isn't defined for matrices, so to calculate the weights that serve as coefficients for the linear boundary, we have to invert the covariance matrix \(W\):
cov_inv <- solve(cov_pooled)
and use matrix multiplication, %*%, to multiply the inverse matrix, \(W^{-1}\), by the difference between the two means, a.k.a. the distance between the classes. This is effectively an expression of

\[ \frac{d^2}{s^2_1 + s^2_2} \]

using \(W^{-1}\) in place of the sum of spreads, giving the vector of "weights" \(w\):

\[ w = W^{-1} d \]
d <- (mu1 - mu2)
w <- as.vector(cov_inv %*% d)
# as.numeric() drops the 1x1 matrix result to a plain number for later arithmetic
b <- as.numeric(-0.5 * t(mu1 + mu2) %*% cov_inv %*% d)
Let's plot this boundary and see where the best place to split the classes is.
# Filter out geno C data
lda_data_xy <- lda_data[lda_data$geno != "C", ]

# Create a grid to visualize the boundary
x_vals <- seq(min(lda_data_xy$gene_x), max(lda_data_xy$gene_x),
              length.out = 100)
y_vals <- (-w[1] * x_vals - b) / w[2] # Solving for y
boundary <- data.frame(gene_x = x_vals, gene_y = as.vector(y_vals))
ggplot(lda_data_xy, aes(x = gene_x, y = gene_y, color = geno)) +
geom_point() +
geom_line(data = boundary, aes(x = x_vals, y = y_vals),
color = "black", linetype = "dashed") +
labs(title = "LDA Boundary for gene_x vs gene_y") +
coord_fixed() +
theme_bw()
Now let's project these onto a lower dimension. First, we should make sure our LDA vector is normalized, a.k.a. made into a unit vector:
w_unit <- w/sqrt(sum(w^2))
Now, center the data around the global mean:
X <- as.matrix(lda_data_xy[, c("gene_x", "gene_y")])

# Center around the global mean
X_centered <- scale(X, center = TRUE, scale = FALSE)
and finally, project the data onto the discriminant axis.
# Projection of each point onto the LDA axis
lda_proj <- X_centered %*% w_unit

lda_data_xy$lda_proj <- as.vector(lda_proj)
Now let's see this in context:
proj_coords <- X_centered %*% w_unit %*% t(w_unit)

global_mean <- colMeans(X)
proj_coords <- sweep(proj_coords, 2, global_mean, FUN = "+")

# Build a dataframe for plotting
proj_df <- data.frame(
  gene_x_proj = proj_coords[, 1],
  gene_y_proj = proj_coords[, 2],
  geno = lda_data_xy$geno
)
# Add the projected points to the plot
ggplot() +
geom_point(data = lda_data_xy,
aes(x = gene_x, y = gene_y, color = geno),
alpha = 0.6, size = 2) +
geom_point(data = proj_df,
aes(x = gene_x_proj,
y = gene_y_proj,
color = geno),
shape = 4,
size = 2.5) +
geom_segment(aes(x = lda_data_xy$gene_x,
y = lda_data_xy$gene_y,
xend = proj_df$gene_x_proj,
yend = proj_df$gene_y_proj,
color = lda_data_xy$geno),
alpha = 0.4) +
labs(title = "Projection of Points onto LDA Axis",
x = "gene_x",
y = "gene_y") +
coord_fixed() +
theme_bw()
And finally, let's see both the boundary and the discriminant axis together:
ggplot() +
geom_point(data = lda_data_xy,
aes(x = gene_x,
y = gene_y,
color = geno),
alpha = 0.6, size = 2) +
geom_point(data = proj_df,
aes(x = gene_x_proj,
y = gene_y_proj,
color = geno),
shape = 4,
size = 2.5) +
geom_segment(aes(x = lda_data_xy$gene_x,
y = lda_data_xy$gene_y,
xend = proj_df$gene_x_proj,
yend = proj_df$gene_y_proj,
color = lda_data_xy$geno),
alpha = 0.4) +
geom_line(data = boundary,
aes(x = x_vals,
y = y_vals),
color = "black",
linetype = "dashed") +
labs(title = "LDA Boundary and Axis",
x = "gene_x",
y = "gene_y") +
coord_fixed() +
theme_bw()
And we can see that the boundary line dividing the groups is perpendicular to the LDA axis, which shows the direction of separation.
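We can sanity-check this perpendicularity numerically: the dashed line satisfies \(w^\top x + b = 0\), so a direction vector along it, \((1, -w_1/w_2)\), should have a dot product of zero with \(w\). This snippet is just a check, not part of the analysis itself:

# direction along the dashed boundary in (gene_x, gene_y) space
boundary_dir <- c(1, -w[1] / w[2])

# dot product with the LDA weight vector; zero up to rounding error,
# confirming the boundary is perpendicular to the discriminant axis
sum(boundary_dir * w)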
Challenge 2 - LDA with More than 2 Categories
So you might be wondering, if \(d\) is the difference between the means of each category, how do we use this formula with more than 2 categories? We can’t just use the distance between any two.
Well, we use a central point, \(C\), located at the centroid of all the data. So instead of maximizing the distance between two classes, we try to maximize the distance of each class mean from \(C\) while still minimizing the scatter within each category.
Instead of
\[ \frac{(d)^2}{s^2_1 + s^2_2} \]
we use
\[ \frac{d_1^2 + d_2^2 + ... + d_n^2}{s_1^2 + s_2^2 + ... + s_n^2} \]
where
\(n\) = the number of classes
\(d_n^2\) is the squared distance between a given class mean and the central point, C
\(s_n^2\) is the spread of a given class
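To make this concrete with the simulated data, here is a small stand-alone illustration of that ratio for gene_x and gene_y (the object names are ours, and lda() does not compute it this way internally, but the idea is the same):

# grand centroid C of the data in (gene_x, gene_y)
X2 <- as.matrix(lda_data[, c("gene_x", "gene_y")])
C <- colMeans(X2)

# squared distance of each class centroid from C
d2 <- sapply(levels(lda_data$geno), function(k) {
  sum((colMeans(X2[lda_data$geno == k, , drop = FALSE]) - C)^2)
})

# spread within each class
s2 <- sapply(levels(lda_data$geno), function(k) {
  sum(apply(X2[lda_data$geno == k, , drop = FALSE], 2, var))
})

# the multi-class criterion: larger values mean better-separated classes
sum(d2) / sum(s2)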
Let's use the lda() function from the {MASS} package to check out LDA with multiple classes.
lda_model <- lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)

# Check model output
print(lda_model)
Call:
lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)
Prior probabilities of groups:
A B C
0.2933333 0.3500000 0.3566667
Group means:
gene_x gene_y gene_z
A 7.095949 5.085835 4.974837
B 5.027599 7.129011 5.118581
C 5.083457 5.167254 7.087386
Coefficients of linear discriminants:
LD1 LD2
gene_x 0.1357291 0.7295128
gene_y -0.7969376 -0.3420057
gene_z 0.6540238 -0.5538144
Proportion of trace:
LD1 LD2
0.5313 0.4687
In the output we see:

- the prior probabilities, which by default are simply the proportions of observations in each class
- the group means, which are the mean expression of each gene within each genotype
- the coefficients of the linear discriminants, which represent the directions of separation and are the eigenvectors of \(W^{-1}B\) (sketched below); the higher the absolute value of a coefficient, the more discriminative power that variable has
- the proportion of trace, which is the proportion of between-group variance explained by each linear discriminant
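As a rough by-hand check of that last point about eigenvectors (this is not how lda() is implemented, and the object names are ours), we can build \(B\) and \(W\) from the simulated data and look at the leading eigenvectors of \(W^{-1}B\); up to sign and scaling, they point in the same directions as LD1 and LD2 above:

X_all <- as.matrix(lda_data[, c("gene_x", "gene_y", "gene_z")])
xbar <- colMeans(X_all)

B <- matrix(0, 3, 3)
W <- matrix(0, 3, 3)
for (k in levels(lda_data$geno)) {
  Xk <- X_all[lda_data$geno == k, , drop = FALSE]
  xbar_k <- colMeans(Xk)
  B <- B + nrow(Xk) * tcrossprod(xbar_k - xbar) # between-group sum of squares
  W <- W + (nrow(Xk) - 1) * cov(Xk)             # within-group sum of squares
}

# the first two eigenvectors of W^{-1}B span the discriminant space
eigen(solve(W) %*% B)$vectors[, 1:2]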
These values inform the predictions made by projecting the data onto a lower-dimensional space. Luckily, {MASS} provides a predict() method for lda objects, so we don't have to do it by hand again.
lda_pred <- predict(lda_model, lda_data)

# The LDA projection scores for each observation
lda_data$LD1 <- lda_pred$x[,1]
lda_data$LD2 <- lda_pred$x[,2]
Now, let's visualize this:
# Create a grid of points to evaluate the decision boundary
grid <- expand.grid(LD1 = seq(min(lda_data$LD1) - 1,
                              max(lda_data$LD1) + 1,
                              length.out = 100),
                    LD2 = seq(min(lda_data$LD2) - 1,
                              max(lda_data$LD2) + 1,
                              length.out = 100))

# Placeholder for the gene_x, gene_y, gene_z transformation
grid$gene_x <- (grid$LD1)
# Same for gene_y
grid$gene_y <- (grid$LD2)
# If you want to include gene_z as a constant, modify as necessary.
grid$gene_z <- 1

# Predict the class for each point in the grid
grid$pred_geno <- predict(lda_model, newdata = grid)$class
# Plot with decision boundary
ggplot(lda_data,
aes(x = LD1,
y = LD2,
color = geno)) +
geom_point(size = 3,
alpha = 0.7) +
labs(title = "2D LDA Projection with Decision Boundary",
x = "Linear Discriminant 1 (LD1)",
y = "Linear Discriminant 2 (LD2)") +
geom_contour(data = grid,
aes(x = LD1,
y = LD2,
z = as.numeric(pred_geno)),
color = "black",
bins = 1) +
theme_bw() +
theme(aspect.ratio = 1,
legend.title = element_blank())
Here, the line is squiggly because it represents a planar version of the LDA boundary. It might be easier to see by restructuring the plot - one common method for this in LDA is through ellipses:
ell <- NULL
for(i in unique(lda_data$geno)){
  x <- lda_data |> dplyr::filter(geno == i)
  e <- gen_xvar_ellipse(x[, 1:3], n = 300, nstd = 1)
  e$geno <- i
  ell <- bind_rows(ell, e)
}
lda1 <- ggplot(lda_data,
               aes(x = gene_x,
                   y = gene_y,
                   color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(a)") +
  theme_bw() +
  theme(aspect.ratio = 1,
        legend.title = element_blank())

lda2 <- ggplot(ell,
               aes(x = gene_x,
                   y = gene_y,
                   color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(b)") +
  theme_bw() +
  theme(aspect.ratio = 1,
        legend.title = element_blank())
ggarrange(lda1, lda2, ncol = 2,
common.legend = TRUE, legend = "bottom")
# Create a 3D scatter plot of the class ellipses in gene_x, gene_y, and gene_z
fig <- plot_ly(data = ell,
               x = ~gene_x,
               y = ~gene_y,
               z = ~gene_z,
               color = ~geno,
               type = 'scatter3d',
               mode = 'markers',
               marker = list(size = 5)) %>%
  layout(title = "3D LDA Projection",
         scene = list(xaxis = list(title = 'Gene X'),
                      yaxis = list(title = 'Gene Y'),
                      zaxis = list(title = 'Gene Z')))

fig
Challenge 3 - Quadratic Discriminant Analysis
Quadratic discriminant analysis is similar to LDA, but assumes that each class has its own unique covariance matrix. QDA can be thought of as an extension of LDA, as it is still classifying observations from a set of predictor variables. Because each class has its own covariance matrix, QDA allows for complex, non-linear decision boundaries.
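For reference, QDA uses the standard quadratic discriminant function, which replaces the pooled \(W\) with a class-specific covariance matrix \(\Sigma_k\); the term that is quadratic in \(x\) is what bends the decision boundary:

\[ \delta_k(x) = -\frac{1}{2}\log\left|\Sigma_k\right| - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log{\pi_k} \]

As before, an observation is assigned to the class with the largest \(\delta_k(x)\).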
Loading Data
First things first, we’ll load in the data that we need and clean it up to get that out of the way.
# store data from Kamilar and Cooper in holder f
f <- curl("https://raw.githubusercontent.com/bygentry/AN588_Vignette/refs/heads/main/KamilarAndCooperData.csv")
# convert data in f to a data frame and store in data frame d
d <- read.csv(f, sep = ",", stringsAsFactors = TRUE)
Cleaning Data
# Using complete.cases() from the built-in stats package,
# we can remove rows with NA in specific columns
tidy.data <- d[complete.cases(d[ , c(5, 8:10)]), ]
Implementing QDA
First, we must set a seed for reproducibility.
set.seed(482)
Now, we need to split the data into training and test sets. We'll use 70% for training and 30% for testing.
train_index <- sample(1:nrow(tidy.data), 0.7 * nrow(tidy.data))
train_data <- tidy.data[train_index, ]
test_data <- tidy.data[-train_index, ]
Now we will fit the model using all of the features.
qda_model <- qda(Species ~ ., data = train_data)
If the above line of code is not working, troubleshoot it with the following code:
train_data <- na.omit(train_data)
This will remove all rows from the data frame or matrix that may contain missing values.
This can also be used to check for missing values:
colSums(is.na(train_data))
To view the model:
print(qda_model)
We first want to make predictions based on the test set.
qda_pred <- predict(qda_model, test_data)
Next, display predicted classes.
head(qda_pred$class)
Create and print a confusion matrix.

A confusion matrix evaluates the performance of a classification model by providing a detailed breakdown of the possible prediction outcomes. It can be made as a table that compares the predicted classes against the actual classes; printing it simply displays that table.
conf_matrix <- table(Predicted = qda_pred$class,
                     Actual = test_data$Species)
print(conf_matrix)
Finally, check the accuracy.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", round(accuracy * 100, 2), "%\n")