Linear and Quadratic Discriminant Analysis

Preliminaries

Packages

# data handling
library(curl)
library(tidyverse)
# discriminant analysis
library(MASS)
# data visualization
library(ggplot2)
library(plotly)
library(geometry)
library(mulgar)
library(ggpubr)

Introduction

What is Discriminant Analysis?
Discriminant analysis is a type of statistical analysis used to predict the probability of belonging to a given class or category based on one or more predictor variables. We will be covering two different types of discriminant analysis: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
Implementing Discriminant Analysis
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is used to find a linear combination of features that separates different classes in a data set. LDA works by calculating the means and covariance matrices for each data class, assuming a common covariance matrix between all classes. When effectively utilized, LDA finds the linear combination of variables that provide the best possible separation between the groups.
Properties and Assumptions of LDA
LDA operates by projecting a feature space (a data set with \(n\) dimensions) onto a smaller space of dimension \(g\), where \(g \le n-1\), without losing class information. Each class is summarized by a covariance matrix that measures how each variable or feature relates to the others within that class.
An LDA model comprises the statistical properties calculated for the data in each class; where there are multiple variables, these properties are calculated over the multivariate Gaussian (normal) distribution. The model assumes the following:

- the input data set has a normal distribution
- the data set is linearly separable, meaning LDA can draw a decision boundary that separates the data points
- each class has the same covariance matrix

- Note: These assumptions carry the limitation that LDA may not perform well in very high-dimensional feature spaces. A rough way to eyeball the assumptions in R is sketched below.
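This helper is purely illustrative (the function name and arguments are ours, not from any package): it runs a Shapiro-Wilk test on each variable within each class and returns each class's covariance matrix so they can be compared by eye.

# Informal assumption checks: `dat` is a data frame of numeric predictors,
# `cls` is the class factor (illustrative helper, not a formal test suite)
check_lda_assumptions <- function(dat, cls) {
  by(dat, cls, function(sub) {
    list(
      # Shapiro-Wilk p-values per variable; very small values suggest non-normality
      shapiro_p = apply(sub, 2, function(v) shapiro.test(v)$p.value),
      # class covariance matrix; compare these across classes for rough equality
      cov_mat = cov(sub)
    )
  })
}

Once the simulated lda_data below exists, this could be called as check_lda_assumptions(lda_data[, 1:3], lda_data$geno).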
Covariance Matrix
The pooled covariance structure is built from two pieces. The between-group sum of squares matrix, \(B\), is defined as:
\[ B = \sum_{k=1}^{g}n_k \, (\bar{X}_k - \bar{X})(\bar{X}_k - \bar{X})^\top \]
This measures the differences between the class means \(\bar{X}_k\) and the mean of the whole data set, \(\bar{X}\). The within-group sum of squares matrix,
\[ W = \sum_{k = 1}^g \sum_{i = 1}^{n_k} \, (X_{ki} - \bar{X}_k)(X_{ki} - \bar{X}_k)^\top \]
measures the variation of the observations \(X_{ki}\) around each class mean, \(\bar{X}_k\).
The linear discriminant space is generated by computing the eigenvectors of \(W^{-1}B\); this is the \(g\)-dimensional space in which the group means are most separated with respect to the pooled covariance. For each class, we compute:
\[ \delta_k(x) = \left(x - \frac{1}{2}\mu_k\right)^\top W^{-1} \mu_k + \log{\pi_k} \]
where \(\pi_k\) is a prior probability for class \(k\). The LDA classifier rule is to assign a new observation to the class with the largest value of \(\delta_k(x)\).
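Written as a small R sketch (illustrative only; lda() in {MASS} handles this, along with the projection itself, internally), the rule scores each class and picks the largest \(\delta_k(x)\):

# Minimal sketch of the LDA rule above.
# `mu` holds one class mean per row, `W` is the pooled within-class covariance,
# and `prior` holds the class priors; these names are ours, not MASS's.
lda_classify <- function(x, mu, W, prior) {
  W_inv <- solve(W)
  delta <- sapply(seq_len(nrow(mu)), function(k) {
    drop((x - 0.5 * mu[k, ]) %*% W_inv %*% mu[k, ]) + log(prior[k])
  })
  which.max(delta)  # index of the class with the largest discriminant score
}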
Let's start by simulating some data. Linear discriminant analysis works best with normal, continuous variables with low collinearity. Considering this, let's imagine a study where a geneticist wants to classify a monkey into one of three genotypes based on the transcription levels of certain genes:
# First, let's set a seed for reproducibility,
set.seed(647)

# then set n as the number of observations to simulate.
n <- 300

# Now we can simulate the transcription frequencies.
# Let's assume that the classification is set up as such:
# Genotype A: higher expression of gene X
# Genotype B: higher expression of gene Y
# Genotype C: higher expression of gene Z
geno <- sample(c(0, 1, 2), size = n, replace = TRUE)

# now the gene expressions
gene_x <- rnorm(n, mean = 5 + (2 * (geno == 0)), sd = 1)
# produces a distribution that is higher in genotype A individuals
gene_y <- rnorm(n, mean = 5 + (2 * (geno == 1)), sd = 1)
# produces a distribution that is higher in genotype B individuals
gene_z <- rnorm(n, mean = 5 + (2 * (geno == 2)), sd = 1)
# produces a distribution that is higher in genotype C individuals

# Now, let's store this simulated data
lda_data <- data.frame(
  gene_x = gene_x,
  gene_y = gene_y,
  gene_z = gene_z,
  geno = factor(geno, labels = c("A", "B", "C"))
)
# and lastly, check the first few rows to make sure everything is right
head(lda_data)
| gene_x   | gene_y   | gene_z   | geno |
|----------|----------|----------|------|
| 6.200588 | 4.518938 | 2.835381 | A    |
| 5.933105 | 7.303336 | 5.784455 | B    |
| 6.971770 | 5.305419 | 5.695243 | A    |
| 6.231134 | 6.795563 | 4.026929 | B    |
| 6.881066 | 5.391924 | 6.790874 | A    |
| 5.117153 | 4.388053 | 7.003496 | C    |
Challenge 1 - Linear Discriminant Analysis with 2 Classes
Linear discriminant analysis reduces the number of dimensions by creating a new axis while simultaneously considering two criteria:
- On the new axis, we want to maximize the distance, \(d = \mu_1 - \mu_2\), between the means of each class
- On the new axis, we want to minimize the spread, \(s^2\), in each class
When we have two classes, we can accomplish this by maximizing the ratio:
\[ \frac{(\mu_1 - \mu_2)^2}{s^2_1 + s^2_2} = \frac{d^2}{s^2_1 + s^2_2} \]
For an example, let's start by calculating the values we need. The means are easy, but we have to make sure to distinguish between genotype and gene.
mu1 <- colMeans(lda_data[lda_data$geno == "A", 1:2])
mu2 <- colMeans(lda_data[lda_data$geno == "B", 1:2])
Things get a little tricky when it comes to spread. Because LDA assumes that the data all share a covariance matrix, \(W\), we can't just use the spread of the raw data. The covariance matrix is computed internally by lda(), but it can be computed manually like this:
\[ W = \frac{S_1(n_1 - 1) + S_2(n_2 - 1)}{(n_1 + n_2) - 2} \]
where
\(W\) = the pooled covariance matrix
\(S_1\) = covariance of class 1
\(S_2\) = covariance of class 2
\(n_1\) = size of class 1
\(n_2\) = size of class 2
S1 <- cov(lda_data[lda_data$geno == "A", 1:2])
# covariance of class A in gene x and gene y
S2 <- cov(lda_data[lda_data$geno == "B", 1:2])
# covariance of class B in gene x and gene y
n1 <- sum(lda_data$geno == "A")
# size of class A
n2 <- sum(lda_data$geno == "B")
# size of class B
cov_pooled <- ((n1 - 1)*S1 + (n2 - 1)*S2) / (n1 + n2 - 2)

cov_pooled
gene_x gene_y
gene_x 0.90394338 -0.04142668
gene_y -0.04142668 0.99434430
Division isn't defined for matrices, so to calculate the weights that serve as coefficients for the linear boundary, we have to invert the covariance matrix \(W\):
cov_inv <- solve(cov_pooled)
and use matrix multiplication, %*%, to multiply the inverse matrix, \(W^{-1}\), by the difference between the two means, a.k.a. the distance between the classes. This is effectively an expression of

\[ \frac{d^2}{s^2_1 + s^2_2} \]

using \(W^{-1}\) in place of the sum of spreads, giving the vector of "weights" \(w\):

\[ w = W^{-1} d \]
d <- (mu1 - mu2)
w <- as.vector(cov_inv %*% d)
# as.numeric() drops the 1x1 matrix result to a plain number for later arithmetic
b <- as.numeric(-0.5 * t(mu1 + mu2) %*% cov_inv %*% d)
Let's plot this boundary and see where the best place to split the classes is.
# Filter out geno C data
lda_data_xy <- lda_data[lda_data$geno != "C", ]

# Create a grid to visualize the boundary
x_vals <- seq(min(lda_data_xy$gene_x), max(lda_data_xy$gene_x),
              length.out = 100)
y_vals <- (-w[1] * x_vals - b) / w[2] # Solving for y
boundary <- data.frame(gene_x = x_vals, gene_y = as.vector(y_vals))
ggplot(lda_data_xy, aes(x = gene_x, y = gene_y, color = geno)) +
geom_point() +
geom_line(data = boundary, aes(x = x_vals, y = y_vals),
color = "black", linetype = "dashed") +
labs(title = "LDA Boundary for gene_x vs gene_y") +
coord_fixed() +
theme_bw()
Now let's project these onto a lower dimension. First, we should make sure our LDA vector is normalized, a.k.a. made into a unit vector:
w_unit <- w/sqrt(sum(w^2))
Now, center the data around the global mean:
X <- as.matrix(lda_data_xy[, c("gene_x", "gene_y")])

# Center around the global mean
X_centered <- scale(X, center = TRUE, scale = FALSE)
and finally, project the data onto the discriminant axis.
# Projection of each point onto the LDA axis
lda_proj <- X_centered %*% w_unit

lda_data_xy$lda_proj <- as.vector(lda_proj)
Now let's see this in context:
proj_coords <- X_centered %*% w_unit %*% t(w_unit)

global_mean <- colMeans(X)
proj_coords <- sweep(proj_coords, 2, global_mean, FUN = "+")

# Build a dataframe for plotting
proj_df <- data.frame(
  gene_x_proj = proj_coords[, 1],
  gene_y_proj = proj_coords[, 2],
  geno = lda_data_xy$geno
)
# Add the projected points to the plot
ggplot() +
geom_point(data = lda_data_xy,
aes(x = gene_x, y = gene_y, color = geno),
alpha = 0.6, size = 2) +
geom_point(data = proj_df,
aes(x = gene_x_proj,
y = gene_y_proj,
color = geno),
shape = 4,
size = 2.5) +
geom_segment(aes(x = lda_data_xy$gene_x,
y = lda_data_xy$gene_y,
xend = proj_df$gene_x_proj,
yend = proj_df$gene_y_proj,
color = lda_data_xy$geno),
alpha = 0.4) +
labs(title = "Projection of Points onto LDA Axis",
x = "gene_x",
y = "gene_y") +
coord_fixed() +
theme_bw()
And finally, let's see both the boundary and the discriminant axis together:
ggplot() +
geom_point(data = lda_data_xy,
aes(x = gene_x,
y = gene_y,
color = geno),
alpha = 0.6, size = 2) +
geom_point(data = proj_df,
aes(x = gene_x_proj,
y = gene_y_proj,
color = geno),
shape = 4,
size = 2.5) +
geom_segment(aes(x = lda_data_xy$gene_x,
y = lda_data_xy$gene_y,
xend = proj_df$gene_x_proj,
yend = proj_df$gene_y_proj,
color = lda_data_xy$geno),
alpha = 0.4) +
geom_line(data = boundary,
aes(x = x_vals,
y = y_vals),
color = "black",
linetype = "dashed") +
labs(title = "LDA Boundary and Axis",
x = "gene_x",
y = "gene_y") +
coord_fixed() +
theme_bw()
And we can see that the boundary line dividing the groups is perpendicular to the LDA axis, which shows the direction of separation.
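We can sanity-check this perpendicularity numerically: the dashed line satisfies \(w^\top x + b = 0\), so a direction vector along it, \((1, -w_1/w_2)\), should have a dot product of zero with \(w\). This snippet is just a check, not part of the analysis itself:

# direction along the dashed boundary in (gene_x, gene_y) space
boundary_dir <- c(1, -w[1] / w[2])

# dot product with the LDA weight vector; zero up to rounding error,
# confirming the boundary is perpendicular to the discriminant axis
sum(boundary_dir * w)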
Challenge 2 - LDA with More than 2 Categories
So you might be wondering, if \(d\) is the difference between the means of each category, how do we use this formula with more than 2 categories? We can’t just use the distance between any two.
Well, we use a central point, \(C\), located at the centroid of all the data. So instead of maximizing the distance between two classes, we try to maximize the distance of each class mean from \(C\) while still minimizing the scatter within each category.
Instead of
\[ \frac{(d)^2}{s^2_1 + s^2_2} \]
we use
\[ \frac{d_1^2 + d_2^2 + ... + d_n^2}{s_1^2 + s_2^2 + ... + s_n^2} \]
where
\(n\) = the number of classes
\(d_n^2\) is the squared distance between a given class mean and the central point, C
\(s_n^2\) is the spread of a given class
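To make this concrete with the simulated data, here is a small stand-alone illustration of that ratio for gene_x and gene_y (the object names are ours, and lda() does not compute it this way internally, but the idea is the same):

# grand centroid C of the data in (gene_x, gene_y)
X2 <- as.matrix(lda_data[, c("gene_x", "gene_y")])
C <- colMeans(X2)

# squared distance of each class centroid from C
d2 <- sapply(levels(lda_data$geno), function(k) {
  sum((colMeans(X2[lda_data$geno == k, , drop = FALSE]) - C)^2)
})

# spread within each class
s2 <- sapply(levels(lda_data$geno), function(k) {
  sum(apply(X2[lda_data$geno == k, , drop = FALSE], 2, var))
})

# the multi-class criterion: larger values mean better-separated classes
sum(d2) / sum(s2)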
Let's use the lda() function from the {MASS} package to check out LDA with multiple classes.
lda_model <- lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)

# Check model output
print(lda_model)
Call:
lda(geno ~ gene_x + gene_y + gene_z, data = lda_data)
Prior probabilities of groups:
A B C
0.2933333 0.3500000 0.3566667
Group means:
gene_x gene_y gene_z
A 7.095949 5.085835 4.974837
B 5.027599 7.129011 5.118581
C 5.083457 5.167254 7.087386
Coefficients of linear discriminants:
LD1 LD2
gene_x 0.1357291 0.7295128
gene_y -0.7969376 -0.3420057
gene_z 0.6540238 -0.5538144
Proportion of trace:
LD1 LD2
0.5313 0.4687
In the output we see:

- the prior probabilities, which by default are simply the proportions of observations in each class
- the group means, which are the mean expression of each gene within each genotype
- the coefficients of the linear discriminants, which represent the directions of separation and are the eigenvectors of \(W^{-1}B\) (sketched below); the higher the absolute value of a coefficient, the more discriminative power that variable has
- the proportion of trace, which is the proportion of between-group variance explained by each linear discriminant
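As a rough by-hand check of that last point about eigenvectors (this is not how lda() is implemented, and the object names are ours), we can build \(B\) and \(W\) from the simulated data and look at the leading eigenvectors of \(W^{-1}B\); up to sign and scaling, they point in the same directions as LD1 and LD2 above:

X_all <- as.matrix(lda_data[, c("gene_x", "gene_y", "gene_z")])
xbar <- colMeans(X_all)

B <- matrix(0, 3, 3)
W <- matrix(0, 3, 3)
for (k in levels(lda_data$geno)) {
  Xk <- X_all[lda_data$geno == k, , drop = FALSE]
  xbar_k <- colMeans(Xk)
  B <- B + nrow(Xk) * tcrossprod(xbar_k - xbar) # between-group sum of squares
  W <- W + (nrow(Xk) - 1) * cov(Xk)             # within-group sum of squares
}

# the first two eigenvectors of W^{-1}B span the discriminant space
eigen(solve(W) %*% B)$vectors[, 1:2]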
These values inform the predictions made by projecting the data onto a lower-dimensional space. Luckily, {MASS} provides a predict() method for lda objects, so we don't have to do it by hand again.
lda_pred <- predict(lda_model, lda_data)

# The LDA projection scores for each observation
lda_data$LD1 <- lda_pred$x[,1]
lda_data$LD2 <- lda_pred$x[,2]
Now, let's visualize this:
# Create a grid of points to evaluate the decision boundary
grid <- expand.grid(LD1 = seq(min(lda_data$LD1) - 1,
                              max(lda_data$LD1) + 1,
                              length.out = 100),
                    LD2 = seq(min(lda_data$LD2) - 1,
                              max(lda_data$LD2) + 1,
                              length.out = 100))

# Placeholder for the gene_x, gene_y, gene_z transformation
grid$gene_x <- (grid$LD1)
# Same for gene_y
grid$gene_y <- (grid$LD2)
# If you want to include gene_z as a constant, modify as necessary.
grid$gene_z <- 1

# Predict the class for each point in the grid
grid$pred_geno <- predict(lda_model, newdata = grid)$class
# Plot with decision boundary
ggplot(lda_data,
aes(x = LD1,
y = LD2,
color = geno)) +
geom_point(size = 3,
alpha = 0.7) +
labs(title = "2D LDA Projection with Decision Boundary",
x = "Linear Discriminant 1 (LD1)",
y = "Linear Discriminant 2 (LD2)") +
geom_contour(data = grid,
aes(x = LD1,
y = LD2,
z = as.numeric(pred_geno)),
color = "black",
bins = 1) +
theme_bw() +
theme(aspect.ratio = 1,
legend.title = element_blank())
Here, the line is squiggly because it represents a planar version of the LDA boundary. It might be easier to see by restructuring the plot - one common method for this in LDA is through ellipses:
ell <- NULL
for(i in unique(lda_data$geno)){
  x <- lda_data |> dplyr::filter(geno == i)
  e <- gen_xvar_ellipse(x[, 1:3], n = 300, nstd = 1)
  e$geno <- i
  ell <- bind_rows(ell, e)
}
lda1 <- ggplot(lda_data,
               aes(x = gene_x,
                   y = gene_y,
                   color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(a)") +
  theme_bw() +
  theme(aspect.ratio = 1,
        legend.title = element_blank())

lda2 <- ggplot(ell,
               aes(x = gene_x,
                   y = gene_y,
                   color = geno)) +
  geom_point() +
  scale_color_discrete() +
  ggtitle("(b)") +
  theme_bw() +
  theme(aspect.ratio = 1,
        legend.title = element_blank())
ggarrange(lda1, lda2, ncol = 2,
common.legend = TRUE, legend = "bottom")
# Create a 3D scatter plot of the class ellipses in gene_x, gene_y, and gene_z
fig <- plot_ly(data = ell,
               x = ~gene_x,
               y = ~gene_y,
               z = ~gene_z,
               color = ~geno,
               type = 'scatter3d',
               mode = 'markers',
               marker = list(size = 5)) %>%
  layout(title = "3D LDA Projection",
         scene = list(xaxis = list(title = 'Gene X'),
                      yaxis = list(title = 'Gene Y'),
                      zaxis = list(title = 'Gene Z')))

fig
Challenge 3 - Quadratic Discriminant Analysis
Quadratic discriminant analysis is similar to LDA, but assumes that each class has its own unique covariance matrix. QDA can be thought of as an extension of LDA, as it is still classifying observations from a set of predictor variables. Because each class has its own covariance matrix, QDA allows for complex, non-linear decision boundaries.
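For reference, QDA uses the standard quadratic discriminant function, which replaces the pooled \(W\) with a class-specific covariance matrix \(\Sigma_k\); the term that is quadratic in \(x\) is what bends the decision boundary:

\[ \delta_k(x) = -\frac{1}{2}\log\left|\Sigma_k\right| - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log{\pi_k} \]

As before, an observation is assigned to the class with the largest \(\delta_k(x)\).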
Loading Data
First things first, we’ll load in the data that we need and clean it up to get that out of the way.
# store data from Kamilar and Cooper in holder f
f <- curl("https://raw.githubusercontent.com/bygentry/AN588_Vignette/refs/heads/main/KamilarAndCooperData.csv")
# convert data in f to a data frame and store in data frame d
d <- read.csv(f, sep = ",", stringsAsFactors = TRUE)
Cleaning Data
# Using complete.cases() from the built-in stats package,
# we can remove rows with NA in specific columns
tidy.data <- d[complete.cases(d[ , c(5, 8:10)]), ]
Implementing QDA
First, we must set a seed for reproducibility.
set.seed(482)
Now, we need to split the data into training and test sets. We'll use 70% for training and 30% for testing.
train_index <- sample(1:nrow(tidy.data), 0.7 * nrow(tidy.data))
train_data <- tidy.data[train_index, ]
test_data <- tidy.data[-train_index, ]
Now we will fit the model using all of the features.
qda_model <- qda(Species ~ ., data = train_data)
If the above line of code is not working, troubleshoot it with the following code:
train_data <- na.omit(train_data)
This will remove all rows from the data frame or matrix that may contain missing values.
This can also be used to check for missing values:
colSums(is.na(train_data))
To view the model:
print(qda_model)
We first want to make predictions based on the test set.
qda_pred <- predict(qda_model, test_data)
Next, display predicted classes.
head(qda_pred$class)
Create and print a confusion matrix.

A confusion matrix evaluates the performance of a classification model by providing a detailed breakdown of the possible prediction outcomes. It can be made as a table that compares the predicted classes against the actual classes; printing it simply displays that table.
conf_matrix <- table(Predicted = qda_pred$class,
                     Actual = test_data$Species)
print(conf_matrix)
Finally, check the accuracy.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", round(accuracy * 100, 2), "%\n")