Weighted Gene Correlation Network Analysis

Objective: To successfully run network and module construction with gene expression measurements and to use that network to identify potential candidate genes associated with measured phenotypes.

WGCNA is a variation of correlation network construction. Networks are visual representations of interactions between ‘nodes’ in a system. The nodes in Weighted Gene Correlation Network Analysis are individual genes. Therefore WGCNA is a great tool for visualizing patterns and relationships between gene expression profiles (transcripts). WGCNA can:

Identify clusters of similarly expressed genes
Identify highly connected ‘hub’ genes
Relate clusters (modules) of genes to one another
Relate gene expression to external sample traits.

All of these are important for identifying potential candidate genes associated with measured traits as well as identifying genes that are consistently co-expressed and could be contributing to similar molecular pathways. Using WGCNA is also extremely useful statistically as it accounts for inter-individual variation in gene expression and alleviates issues associated with multiple testing.

Preliminaries

Before running through the following code you will need to install the following packages: {BiocManager}, {WGCNA}, {flashClust}

BiocManager is a package that will facilitate the installation of the other two packages as they are not updated to the current version of R. While running this module it is best to update your R and Rstudio to the most current version.

#install.packages('BiocManager')
#library(BiocManager)
#BiocManager::install('WGCNA')
#BiocManager::install('flashClust')

NOTE: Installing these packages will result in prompts that require you to input additional commands at the command line.

library(WGCNA)
library(flashClust)
library(curl)

WGCNA utilizes gene expression data from micro array or RNA-seq experiments. For a basic understanding of these experiments you can view this educational video.

The data set used in this analysis is available from the authors of the WGCNA pathway (Steve Hovarth and Peter Langfelder) in their tutorial.

For this tutorial you can read in the data set from our github page.

d <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/LiverFemale3600.csv')
liver.data <- read.csv(file = d, stringsAsFactors = FALSE, header = TRUE)
head(liver.data)

##   substanceBXH   gene_symbol LocusLinkID ProteomeID cytogeneticLoc CHROMOSOME
## 1  MMT00000044 1700007N18Rik       69339     286025              0         16
## 2  MMT00000046         Mast2       17776     157466              0          4
## 3  MMT00000051       Ankrd32      105377     321939              0         13
## 4  MMT00000076             0      383154          0              0         16
## 5  MMT00000080          Ldb2       16826     157383              0          5
## 6  MMT00000102          Rdhs      216453          0     10_70.0_cM         10
##   StartPosition EndPosition     F2_2    F2_3     F2_14    F2_15    F2_19
## 1      50911260    50912491 -0.01810  0.0642  6.44e-05 -0.05800  0.04830
## 2     115215318   115372404 -0.07730 -0.0297  1.12e-01 -0.05890  0.04430
## 3      74940309    74982847 -0.02260  0.0617 -1.29e-01  0.08710 -0.11500
## 4      49345114    49477048 -0.00924 -0.1450  2.87e-02 -0.04390  0.00425
## 5      43546124    43613704 -0.04870  0.0582 -4.83e-02 -0.03710  0.02510
## 6       1337265     1347607  0.17600 -0.1890 -6.50e-02 -0.00846 -0.00574
##         F2_20    F2_23    F2_24   F2_26    F2_37        F2_42   F2_43    F2_45
## 1 -0.15197410 -0.00129 -0.23600 -0.0307 -0.02610  0.073705890 -0.0466 -0.00673
## 2 -0.09380000  0.09340  0.02690 -0.1330  0.07570 -0.009193803 -0.0075  0.01700
## 3 -0.06502607  0.00249 -0.10200  0.1420 -0.10200  0.064289290  0.0169 -0.01590
## 4 -0.23610000 -0.06900  0.01440  0.0363 -0.01820  0.477874600  0.1440  0.11100
## 5  0.08504274  0.04450  0.00167 -0.0680  0.00567 -0.075348680 -0.0673 -0.04720
## 6 -0.01807182 -0.12500 -0.06820  0.1250  0.00998 -0.037366600 -0.0402 -0.02190
##     F2_46    F2_47   F2_48   F2_51   F2_52   F2_54    F2_63     F2_65   F2_66
## 1 -0.0193  0.09040  0.0290  0.0356 -0.0388 -0.0360 -0.05600  0.009840 -0.0261
## 2  0.0722 -0.08390  0.0273 -0.0784 -0.0178  0.1120  0.12300  0.051700  0.0731
## 3 -0.1430 -0.00492 -0.0735  0.0657 -0.0197 -0.1290 -0.14300 -0.061600  0.0419
## 4  0.0113  0.11900  0.0225  0.0932  0.1430  0.2640 -0.09280 -0.000635 -0.0126
## 5  0.0701 -0.08790 -0.0180 -0.1290 -0.0469 -0.0352 -0.00166  0.058700 -0.0206
## 6  0.0269  0.13300  0.0732  0.1070 -0.0362 -0.0696 -0.19400 -0.117000 -0.0400
##      F2_68    F2_69    F2_70    F2_71   F2_72    F2_78        F2_79   F2_80
## 1  0.00856 -0.01180 -0.03350 -0.08310 -0.0471 -0.02820  0.047264410  0.0296
## 2  0.08670  0.05710  0.00693 -0.00606 -0.0390  0.01870  0.008471275 -0.0687
## 3 -0.29000 -0.10800 -0.09950 -0.00315  0.0975  0.01030 -0.134271000  0.1010
## 4  0.06910  0.02260 -0.08630 -0.22900  0.0178  0.00166  0.064096960  0.0103
## 5 -0.13000  0.00392  0.05450 -0.11200  0.1070  0.01190  0.008985630 -0.1030
## 6  0.06890  0.04320 -0.00338 -0.05270 -0.0416 -0.03040  0.025920240  0.0697
##     F2_81   F2_83   F2_86    F2_87    F2_88   F2_89  F2_107  F2_108   F2_109
## 1  0.0114  0.0498 -0.0249 -0.00264 -0.02050  0.0826 -0.0421  0.0663  0.03620
## 2 -0.0114 -0.0262 -0.0215 -0.09580 -0.01930 -0.1140  0.0815  0.0285  0.00299
## 3  0.0521 -0.0607 -0.0285  0.02560 -0.01350  0.0796  0.0553 -0.0380  0.12900
## 4 -0.0258 -0.0837  0.1880  0.03310 -0.00652  0.1550  0.0458  0.0752  0.12200
## 5 -0.1400 -0.0282 -0.1090  0.02070 -0.01370 -0.0288 -0.1220  0.1270 -0.09390
## 6  0.1150  0.0953  0.0127  0.05490  0.00311  0.0955 -0.1520 -0.0670 -0.00599
##    F2_110  F2_111  F2_112   F2_117  F2_119   F2_125   F2_126   F2_127   F2_141
## 1  0.0808 -0.0404  0.0877  0.07240 -0.0210  0.04540 -0.03220 -0.00654  0.03490
## 2 -0.0407 -0.0657  0.0643 -0.00022 -0.0877  0.00167  0.00321 -0.01260 -0.04530
## 3 -0.0361  0.0441 -0.1640 -0.01420 -0.0279  0.00677  0.07360  0.01750  0.10900
## 4 -0.0104  0.0914 -0.0355  0.06520  0.1280  0.05940  0.01630  0.00292  0.00714
## 5  0.1200 -0.0850  0.1400  0.00867  0.1440  0.08710 -0.03360  0.17300  0.08270
## 6 -0.0438  0.0634  0.1380 -0.04010  0.1310 -0.12600  0.00484 -0.00256 -0.06800
##    F2_142   F2_143   F2_144  F2_145      F2_154    F2_155  F2_156   F2_157
## 1 -0.0315 -0.02170  0.00370  0.0322 -0.02150730 -0.000958 -0.0850  0.00462
## 2 -0.0579  0.05920  0.00239 -0.0383  0.02457782 -0.030300 -0.1260 -0.06670
## 3 -0.0216 -0.01250  0.05460  0.0403 -0.01674888  0.059900  0.0311 -0.05190
## 4 -0.0565  0.10200  0.03480  0.0245  0.06776892  0.016500 -0.0382  0.02120
## 5  0.0594 -0.00317 -0.06750  0.0495  0.13520570  0.016500  0.0832  0.04350
## 6  0.0941 -0.04220  0.12000  0.1080 -0.05128296 -0.005590  0.0136  0.09910
##     F2_162  F2_163  F2_164   F2_165  F2_166  F2_167   F2_169   F2_180
## 1  0.03990  0.0716 -0.0923  0.10900  0.0102  0.0337  0.00911  0.03210
## 2 -0.00637 -0.0161 -0.2340 -0.09610 -0.1290 -0.0109 -0.11300 -0.00677
## 3  0.01890  0.0207  0.0929  0.00917  0.0874 -0.1260 -0.00949 -0.09900
## 4  0.06690  0.0512 -0.2450  1.23000 -0.0402 -0.0635  0.06880  0.03790
## 5  0.19300  0.0586 -0.0768  0.04600  0.0484  0.2810  0.07210 -0.00630
## 6  0.06770 -0.0520  0.1550  0.07890  0.0336  0.0648  0.14400  0.02770
##        F2_181  F2_182   F2_187   F2_188  F2_189   F2_190  F2_191  F2_192
## 1  0.03144772  0.0543  0.01120  0.01060  0.1130 -0.03960 -0.0504  0.0877
## 2 -0.16704700 -0.0239  0.00304 -0.03580 -0.1330 -0.01830 -0.0623 -0.0648
## 3  0.02700180 -0.0570 -0.05160 -0.04970  0.1660  0.05000  0.0498  0.0431
## 4 -0.02058180  0.0227  0.04180  0.01010  0.2170  0.00206 -0.0155  0.6550
## 5  0.37074790  0.0618  0.10800  0.12100  0.0237  0.02960  0.1130  0.0839
## 6  0.09297908  0.0601  0.02960  0.00198  0.0251  0.00059 -0.0282  0.0429
##    F2_194   F2_195  F2_200  F2_201      F2_212  F2_213  F2_214   F2_215  F2_221
## 1 -0.0563 -0.00557 -0.0484 -0.0273 -0.10816380 -0.0183 -0.0132 -0.00432 -0.6630
## 2 -0.0652  0.05020 -0.0912 -0.0180  0.05682362 -0.0238  0.0721  0.03910  0.1070
## 3 -0.0224 -0.10700  0.0715  0.0432 -0.13217820  0.0205 -0.0411  0.07670 -0.0783
## 4  0.2820 -0.01310 -0.0387 -0.0667 -0.32395020 -0.0245  0.0865  0.06470 -2.0000
## 5  0.1050  0.15500  0.0823  0.1140  0.03542023 -0.2020  0.0822  0.04260  0.1030
## 6  0.0697  0.04930  0.0414 -0.0708 -0.10881230  0.0359 -0.0678 -0.11000 -0.1420
##     F2_222  F2_223   F2_224   F2_225   F2_226  F2_227  F2_228  F2_241  F2_242
## 1  0.01440  0.0310  0.00818 -0.00892 -0.08710  0.0129  0.0937  0.0313  0.0821
## 2  0.00923 -0.0397 -0.06400  0.06300 -0.00152  0.0555  0.0947 -0.0387  0.0592
## 3 -0.06860 -0.0254 -0.05680 -0.13300 -0.07560 -0.0557 -0.0890 -0.1460 -0.0739
## 4  0.00874  0.0847 -0.09720  0.00746 -0.55200  0.0415  0.0733  0.0815  0.1100
## 5 -0.10100  0.1630  0.07410 -0.01640  0.08700 -0.0557 -0.1910  0.0219  0.0913
## 6  0.08430 -0.0610  0.08760 -0.03960  0.10200  0.0190 -0.1190  0.0687 -0.0525
##     F2_243  F2_244   F2_245    F2_247       F2_248   F2_261  F2_263   F2_264
## 1  0.00621  0.0307 -0.13700  0.075300 -0.096881950 -0.01670 -0.0928 -0.00957
## 2 -0.00636  0.0614  0.02850 -0.000633  0.001598228 -0.00267 -0.0198  0.16300
## 3 -0.01120 -0.0528  0.05050  0.027700 -0.067933370 -0.02220 -0.0684 -0.04930
## 4  0.21400  0.0135 -0.13500 -0.003100  0.072318780  0.01030 -0.3150  0.08420
## 5  0.01120  0.1190  0.00383  0.041700 -0.038618510  0.11800  0.0123  0.03700
## 6 -0.00716 -0.1460 -0.14500  0.029400  0.035281240 -0.05660  0.0917 -0.08080
##    F2_270   F2_271  F2_272   F2_278  F2_287      F2_288    F2_289  F2_290
## 1  0.0287 -0.01300 -0.0292 -0.03810 -0.0488  0.17361240 -0.097900  0.0383
## 2 -0.1310 -0.04260 -0.0514  0.07260 -0.0481 -0.16211430 -0.123000 -0.1370
## 3  0.0328  0.00537 -0.0259 -0.14400  0.0170  0.25924220 -0.041400 -0.0229
## 4  0.0351       NA  0.0730  0.00914  0.0556  0.18311140  0.051700  0.1780
## 5 -0.0142  0.00563 -0.0504 -0.05970 -0.0871  0.20897910 -0.000188 -0.0328
## 6  0.0362  0.00790 -0.0246 -0.07330  0.0125 -0.04778892  0.082500  0.1360
##     F2_291      F2_296  F2_298   F2_299    F2_300  F2_302  F2_303  F2_304
## 1  0.01850 -0.08937784  0.0230 -0.06250 -0.000142  0.0344  0.0358 -0.0139
## 2 -0.05720 -0.07416870 -0.0688 -0.06540 -0.102000 -0.0780 -0.0820 -0.1830
## 3 -0.00664 -0.05915232 -0.0134  0.09740  0.015500 -0.0934  0.1780  0.0842
## 4  0.05250 -0.21653720 -0.2210 -0.00266  0.545000  0.0127  0.0273 -0.0928
## 5 -0.16600 -0.07897525  0.1410 -0.12900  0.090600 -0.1330 -0.2120 -0.0797
## 6  0.04620  0.03811979 -0.0346  0.04690 -0.034800  0.0110  0.0323  0.1660
##    F2_305      F2_306   F2_307   F2_308  F2_309   F2_310  F2_311  F2_312
## 1  0.0134 -0.03145069  0.02780 -0.01190 -0.0744  0.00197 -0.0151 -0.0721
## 2 -0.0270 -0.09822316 -0.07890 -0.05480 -0.1320 -0.11000 -0.1130 -0.0805
## 3  0.0870  0.15520470  0.03410 -0.06830  0.0555 -0.04060  0.0835  0.0514
## 4  0.0469  0.10038160 -2.00000  0.05240  0.1260  0.07280  0.0600 -0.0455
## 5 -0.0191 -0.11958500  0.00294 -0.10600 -0.0518 -0.13200  0.0494  0.0221
## 6 -0.0866  0.05385017  0.09570 -0.00949  0.1120  0.20800  0.0872 -0.0555
##    F2_320  F2_321  F2_323    F2_324  F2_325  F2_326  F2_327   F2_328  F2_329
## 1 -0.0118  0.0200  0.0222  0.047700 -0.0488  0.0168 -0.0309  0.02740 -0.0310
## 2 -0.1200  0.0101 -0.1610 -0.049200 -0.0350 -0.0738 -0.1730 -0.07380 -0.2010
## 3  0.0713 -0.1130  0.0466  0.000612  0.1210  0.0996  0.1090  0.02730  0.1200
## 4 -0.0464  0.0667 -0.1850 -0.270000  0.0803  0.0424  0.1610  0.05120  0.2410
## 5  0.0272 -0.0938  0.1020  0.113000 -0.0859 -0.1340  0.0639  0.00731  0.1240
## 6  0.0748 -0.1420  0.0590 -0.080000 -0.1200  0.1230  0.1870  0.05410  0.0699
##    F2_330  F2_332  F2_355    F2_357
## 1  0.0660 -0.0199 -0.0146  0.065000
## 2 -0.0820 -0.0939  0.0192 -0.049900
## 3 -0.0629 -0.0395  0.1090  0.000253
## 4  0.3890  0.0251 -0.0348  0.114000
## 5 -0.0212  0.0870  0.0512  0.024300
## 6  0.0708  0.1450 -0.0399  0.037500

This .csv file contains quantified and normalized gene expression data from livers of several F2 female mice.

For the purposes of this vignette it is just important to understand the values in our data set are gene expression levels relative to a control.

NOTE: For information on how to generate expression data from raw sequence reads, there is a great online course available from the Sanger Institute detailing the process.

The data set contains a lot of additional trait data not relevant to your analysis and is in an incompatible format for the WGCNA pathway. Therefore you need to reformat the data so it is conducive to the WGCNA package functions.

Data Cleaning and Removing Outliers

First, the additional trait data needs to be removed.

expression.data <- liver.data[,-c(1:8)] #removing variables not holding expression data

Then, the data frame needs to be transformed so the rows represent samples instead of gene sequences and the columns represent gene sequences instead of samples.

expression.data <- as.data.frame(t(expression.data)) #transforming the data.frame so columns now represent genes and rows represent samples
names(expression.data) <- liver.data$substanceBXH
#renaming the columns so we don't lose the gene IDs

Identifying Outlier Genes

The WGCNA package has a built in function to identify outlier genes called goodSampleGenes(). The function checks the data and returns a list object of samples and genes that pass its filtering criteria. You can adjust how strict the filtering process is by changing the default of the arguments (which can be found under the goodSampleGene documentation, ?goodSampleGenes).

gsg <-goodSamplesGenes(expression.data)

##  Flagging genes and samples with too many missing values...
##   ..step 1

summary(gsg)

##             Length Class  Mode   
## goodGenes   3600   -none- logical
## goodSamples  135   -none- logical
## allOK          1   -none- logical

By viewing the gsg list object you can see it contains 3 logical vectors (good genes, good samples and allOK). If you want to see if the function identified any possible outlier all you have to do is evaluate the allOK vector.

gsg$allOK

## [1] TRUE

If the allOK object returns true, which it does in this case, there are no outliers present. If it doesn’t return true, we need to filter out the outliers manually using the following code.

if (!gsg$allOK)
{
if (sum(!gsg$goodGenes)>0) 
printFlush(paste("Removing genes:", paste(names(expression.data)[!gsg$goodGenes], collapse = ", "))); #Identifies and prints outlier genes
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(expression.data)[!gsg$goodSamples], collapse = ", "))); #Identifies and prints oulier samples
expression.data <- expression.data[gsg$goodSamples == TRUE, gsg$goodGenes == TRUE] # Removes the offending genes and samples from the data
}

Identifying outlier samples

You can identify outlier samples by using hierarchical clustering. It is probably beneficial to review the clustering module as clustering is used several times within this pathway.

sampleTree <- hclust(dist(expression.data), method = "average") #Clustering samples based on distance 

#Setting the graphical parameters
par(cex = 0.6);
par(mar = c(0,4,2,0))

#Plotting the cluster dendrogram
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)

Here you can see sample F2_221 seems to be distant from all of the other samples indicating it is likely an outlier sample.

You can remove the outlier using a cutree function.

#Setting the graphical parameters
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
#draw on line to show cutoff height
abline(h = 15, col = "red");

In this case it looks like the cut height of 15 would cut F2_221 and retain the rest of the samples.

cut.sampleTree <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10) #returns numeric vector
#Remove outlier
expression.data <- expression.data[cut.sampleTree==1, ]

Network Construction

Pairwise Gene Co-expression similarity

Once you have purged the outliers, you can begin your network construction. The first step in network construction is to identify some sort of similarity measurement between pairs of genes. A similarity measurement represents the concordance of gene expression profiles across samples. In WGCNA the similarity measurement for each pair of genes (gene i and gene j) is denoted by their Pearson correlation coefficient.

Unsigned Networks

For unsigned networks you take the absolute value of the correlation. So the similarity measurement is denoted by:

\(sab=|cor(a, b)|\)

Although, it is more common practice to create a signed network. As unsigned networks make biological interpretation of results extremely difficult. If the network is unsigned, you do not know if the gene is up or down regulated in the sample trait, just that its expression is significantly different.

Signed Networks

The similarity measurement for a signed network is calculated as follows:

\(sab = .5 + .5 * cor(a,b)\)

Adjacency: Pairwise connection

The next step after calculating the similarity measurement for each gene pair is to translate that similarity measurement into the gene pairs adjacency to generate an adjacency matrix. Adjacency is the assignment of a connection strength based on the co-expression similarity measurement (Pearson correlation coefficient). Nodes are considered connected if they have a significant pairwise correlation association.

The generation of the adjacency matrix uses an adjacency function for its conversion. The function and its parameters vary based on what type of network construction you are trying to perform.

Unweighted Networks

In unweighted networks, the adjacency matrix indicates whether or not a pair of nodes are connected in a binary fashion
Unweighted networks utilize the signum function with the input of a hard threshold parameter \(\tau\)

\(aij = signum(sij,\tau) = \{ \text{1 if sij} \geq \tau , \text{0 if sij} < \tau \}\)

Here you can see the similarity measurement is compared to a hard threshold parameter of \(\tau\). If the similarity measurement is greater than \(\tau\) the pair of genes are assigned a value of 1 (connected) in the adjacency matrix. If the similarity measurement is less than \(\tau\), the pair is assigned a 0 (not connected) in the adjacency matrix.

NOTE: Utilizing a hard threshold can mean you lose a lot of information. If you set the threshold to .9, even if a pair has a similarity of .89999999999 they will be considered to have no connection.

Weighted Networks

In weighted networks the adjacency/connection is not binary and therefore can also distinguish the strength of connection.

weighted networks utilize a power function based on a soft threshold parameter \(\beta\)

\(aij = power(sij, \beta) = |sij|^\beta\)

Here you can see the adjacency matrix value is calculated by raising the similarity measurement to the defined threshold parameter \(\beta\).

NOTE: The WGCNA package has a built in function called adjacency() that will generate the pairwise similarity measurements and the adjacency matrix but before you call the function you need to identify what value you want to use for the threshold parameter.

Determining the Soft Power Threshold

The choice of the threshold parameter determines the sensitivity and specificity of the pairwise connection strengths (adjacency).

To determine the \(\beta\) parameter in a weighted network analysis we try to maximize a model fit (\(R^2\) value under a linear regression analysis) under a scale free topology model, while minimizing the number of connections lost when fitting the model (maintaining a high mean number of connections). As \(R^2\) values approach 1, we usually see networks with very few connections. Usually this happy medium occurs at \(R^2\) of > .8.

The scale-free topology is used because it is based on the idea that the probability that a node (gene) is connected with k other nodes (genes) decays as a power law:

\(p(k) \sim k^{-\gamma}\)

NOTE: Here is a great publication for more information about scale free topology model assumptions.

Again the WGCNA package has a built in function that helps to determine what the appropriate \(\beta\) parameter will be to calculate our adjacency matrix. The pickSoftThreshold() function calculates multiple networks all based on different \(\beta\) values and returns a data frame with the \(R^2\) values for the networks scale-free topology model fit as well as the mean connectivity measures.

spt <- pickSoftThreshold(expression.data)

## Warning: executing %dopar% sequentially: no parallel backend registered

##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.0278  0.345          0.456  747.00  762.0000 1210.0
## 2      2   0.1260 -0.597          0.843  254.00  251.0000  574.0
## 3      3   0.3400 -1.030          0.972  111.00  102.0000  324.0
## 4      4   0.5060 -1.420          0.973   56.50   47.2000  202.0
## 5      5   0.6810 -1.720          0.940   32.20   25.1000  134.0
## 6      6   0.9020 -1.500          0.962   19.90   14.5000   94.8
## 7      7   0.9210 -1.670          0.917   13.20    8.6800   84.1
## 8      8   0.9040 -1.720          0.876    9.25    5.3900   76.3
## 9      9   0.8590 -1.700          0.836    6.80    3.5600   70.5
## 10    10   0.8330 -1.660          0.831    5.19    2.3800   65.8
## 11    12   0.8530 -1.480          0.911    3.33    1.1500   58.1
## 12    14   0.8760 -1.380          0.949    2.35    0.5740   51.9
## 13    16   0.9070 -1.300          0.970    1.77    0.3090   46.8
## 14    18   0.9120 -1.240          0.973    1.39    0.1670   42.5
## 15    20   0.9310 -1.210          0.977    1.14    0.0951   38.7

spt

## $powerEstimate
## [1] 6
## 
## $fitIndices
##    Power   SFT.R.sq      slope truncated.R.sq    mean.k.    median.k.
## 1      1 0.02776584  0.3445936      0.4558234 746.977852 761.73341294
## 2      2 0.12638769 -0.5966302      0.8432105 254.497776 250.84325945
## 3      3 0.34037012 -1.0299440      0.9724417 111.004761 101.73566693
## 4      4 0.50624635 -1.4215375      0.9726991  56.535969  47.24823353
## 5      5 0.68066112 -1.7161877      0.9395555  32.154311  25.08876080
## 6      6 0.90234867 -1.4974162      0.9621313  19.905790  14.49789064
## 7      7 0.92124780 -1.6673879      0.9171489  13.192338   8.67834321
## 8      8 0.90354653 -1.7240456      0.8759911   9.248524   5.39430786
## 9      9 0.85856532 -1.7025327      0.8364939   6.795315   3.55559547
## 10    10 0.83254728 -1.6632029      0.8314529   5.193521   2.38181026
## 11    12 0.85343233 -1.4816042      0.9114375   3.332680   1.14718679
## 12    14 0.87592425 -1.3827267      0.9493085   2.348905   0.57447184
## 13    16 0.90676370 -1.2955569      0.9697400   1.767929   0.30878389
## 14    18 0.91230435 -1.2435667      0.9725703   1.393515   0.16712574
## 15    20 0.93127802 -1.2079138      0.9773708   1.135087   0.09513045
##        max.k.
## 1  1206.42886
## 2   573.54490
## 3   324.15389
## 4   202.30723
## 5   134.17527
## 6    94.77388
## 7    84.10328
## 8    76.31468
## 9    70.53172
## 10   65.75199
## 11   58.06119
## 12   51.93683
## 13   46.83330
## 14   42.47241
## 15   38.68884

You can then plot this data frame to better visualize what \(\beta\) value you should choose.

REMINDER: We should be maximizing the \(R^2\) value and minimizing mean connectivity.

Plot the \(R^2\) values as a function of the soft thresholds

par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1],spt$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(spt$fitIndices[,1],spt$fitIndices[,2],col="red")
abline(h=0.80,col="red")

Plot mean connectivity as a function of soft thresholds

par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1], spt$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(spt$fitIndices[,1], spt$fitIndices[,5], labels= spt$fitIndices[,1],col="red")

You can determine the soft power threshold should be set to 6 as it is the spt that retains the highest mean connectivity while reaching an \(R^2\) value above 0.80.

NOTE: the higher the value, the stronger the connection strength will be of highly correlated gene expression profiles and the more devalued low correlations will be.

CHALLENGE 1

Using the following data set, find the appropriate soft threshold parameter.

f <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/LiverMale3600.csv')
male.liver.data <- read.csv(file = f, stringsAsFactors = FALSE, header = TRUE)
male.expression.data <- male.liver.data[,-c(1:8)]
male.expression.data <- as.data.frame(t(male.expression.data))
names(male.expression.data) <- male.liver.data$substanceBXH

ANSWER 1

spt <- pickSoftThreshold(male.expression.data)

##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.0166  0.291          0.560  711.00  736.0000 1130.0
## 2      2   0.2150 -0.744          0.750  235.00  238.0000  533.0
## 3      3   0.4490 -1.220          0.900  101.00   96.2000  306.0
## 4      4   0.5770 -1.580          0.938   51.30   45.1000  196.0
## 5      5   0.6300 -1.680          0.926   29.10   23.4000  132.0
## 6      6   0.8610 -1.360          0.968   17.90   12.9000   91.8
## 7      7   0.9030 -1.520          0.932   11.90    7.5200   79.6
## 8      8   0.9080 -1.610          0.889    8.28    4.5600   72.0
## 9      9   0.8950 -1.620          0.867    6.07    2.8800   66.2
## 10    10   0.8820 -1.590          0.868    4.63    1.9000   61.3
## 11    12   0.8900 -1.490          0.904    2.96    0.8760   53.5
## 12    14   0.9310 -1.380          0.958    2.09    0.4370   47.3
## 13    16   0.9540 -1.310          0.980    1.57    0.2270   42.1
## 14    18   0.9700 -1.260          0.984    1.24    0.1190   37.7
## 15    20   0.9840 -1.220          0.995    1.01    0.0642   33.9

spt

## $powerEstimate
## [1] 6
## 
## $fitIndices
##    Power   SFT.R.sq      slope truncated.R.sq    mean.k.    median.k.
## 1      1 0.01662141  0.2910424      0.5600083 711.394196 736.27770151
## 2      2 0.21469296 -0.7438635      0.7504513 235.481706 237.64208316
## 3      3 0.44898034 -1.2170444      0.8999491 101.360510  96.22414016
## 4      4 0.57675213 -1.5801735      0.9377900  51.322132  45.08006719
## 5      5 0.62969225 -1.6783150      0.9256004  29.085033  23.37605519
## 6      6 0.86066502 -1.3595362      0.9683174  17.946077  12.87067193
## 7      7 0.90330684 -1.5244404      0.9316391  11.852555   7.51956295
## 8      8 0.90841840 -1.6066163      0.8894454   8.281637   4.56367136
## 9      9 0.89475031 -1.6197059      0.8665099   6.067619   2.87757673
## 10    10 0.88175329 -1.5925927      0.8675172   4.627542   1.89844724
## 11    12 0.89032242 -1.4854185      0.9044572   2.963698   0.87586571
## 12    14 0.93119658 -1.3795330      0.9582245   2.089336   0.43660024
## 13    16 0.95396790 -1.3091814      0.9802766   1.574216   0.22720402
## 14    18 0.96961183 -1.2589717      0.9844259   1.241877   0.11909272
## 15    20 0.98363568 -1.2202300      0.9953312   1.011695   0.06419666
##        max.k.
## 1  1133.60311
## 2   532.50264
## 3   306.16208
## 4   195.85679
## 5   131.78839
## 6    91.76490
## 7    79.64545
## 8    72.02569
## 9    66.16730
## 10   61.32762
## 11   53.52075
## 12   47.29833
## 13   42.12377
## 14   37.72071
## 15   33.92194

par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1],spt$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(spt$fitIndices[,1],spt$fitIndices[,2],col="red")
abline(h=0.80,col="red")

par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1], spt$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(spt$fitIndices[,1], spt$fitIndices[,5], labels= spt$fitIndices[,1],col="red")

Calling the Adjacency Function

Now that you have the soft threshold power determined you can call on the adjacency() function of the WGCNA package.

REMINDER: This function calculates the similarity measurement and transforms the similarity by the adjacency function and generates a weighted network adjacency matrix.

softPower <- 6
adjacency <- adjacency(expression.data, power = softPower)

Module Construction

Defining Dissimilarity

Once the network is constructed, you can begin to extract some meaningful relationships. You can use hierarchical clustering yet again to cluster the network into modules.

NOTE: A module is a group of gene profiles that are highly correlated, or have a high topological overlap.

In order to utilize the clustering functions in R you must transform the adjacency matrix into measures of gene dissimilarity (distance of a gene from every other gene in the system).

NOTE: This is due to the fact that dissimilarity is used in traditional cluster analyses.

Topological Overlap Matrix

The TOM-based dissimilarity is preferentially used over dissimilarity based on correlation coefficients because TOM-based dissimilarity generates more distinct modules.

NOTE: Here is a great publication for information about why topological overlap is used.

To convert the adjacency matrix into a TOM similarity matrix we can call the WGCNA function TOMsimilarity().

TOM <- TOMsimilarity(adjacency)

## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

CHALLENGE 2

The current matrix gives us the similarity between genes. How do you convert this into measures of dissimilarity?

ANSWER 2

To convert this matrix into a dissimilarity matrix you can subtract the TOM object from 1.

TOM.dissimilarity <- 1-TOM

Hierarchical Clustering Analysis

The dissimilarity/distance measures are then clustered using linkage hierarchical clustering and a dendrogram (cluster tree) of genes is constructed.

#creating the dendrogram 
geneTree <- hclust(as.dist(TOM.dissimilarity), method = "average")

#plotting the dendrogram
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity", 
labels = FALSE, hang = 0.04)

To identify modules from this gene dendrogram, you can use the cutreeDynamic() function. This will allow you to set a minimum cluster size. For genomic data like this it is more beneficial to set minimum module sizes relatively high as you are working with high loads of data. The authors of WGCNA recommend to start at a minClusterSize = 30.

Modules <- cutreeDynamic(dendro = geneTree, distM = TOM.dissimilarity, deepSplit = 2, pamRespectsDendro = FALSE, minClusterSize = 30)

##  ..cutHeight not given, setting it to 0.996  ===>  99% of the (truncated) height range in dendro.
##  ..done.

table(Modules) #returns a table of the counts of factor levels in an object. In this case how many genes are assigned to each created module.

## Modules
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
##  88 614 316 311 257 235 225 212 158 153 121 106 102 100  94  91  78  76  65  58 
##  20  21  22 
##  58  48  34

Here you can see 22 modules were created. The Label 0 Module is reserved for unassigned genes (genes that do not fit in any module).

You can now plot the module assignment under the gene dendrogram for visualization.

ModuleColors <- labels2colors(Modules) #assigns each module number a color
table(ModuleColors) #returns the counts for each color (aka the number of genes within each module)

## ModuleColors
##        black         blue        brown         cyan    darkgreen      darkred 
##          212          316          311           94           34           48 
##        green  greenyellow         grey       grey60    lightcyan   lightgreen 
##          235          106           88           76           78           65 
##  lightyellow      magenta midnightblue         pink       purple          red 
##           58          153           91          158          121          225 
##    royalblue       salmon          tan    turquoise       yellow 
##           58          100          102          614          257

#plots the gene dendrogram with the module colors
plotDendroAndColors(geneTree, ModuleColors,"Module",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")

Module Eigengene Identification

A ME (Module Eigengene) is the standardized gene expression profile for a given module.

To identify the Module Eigengene you can call on the expression data into the moduleEigengenes() function.

MElist <- moduleEigengenes(expression.data, colors = ModuleColors) 
MEs <- MElist$eigengenes 
head(MEs)

##            MEblack        MEblue     MEbrown      MEcyan  MEdarkgreen
## F2_2   0.013902476  0.0410177922 0.007072125  0.12978459  0.006276361
## F2_3   0.066675342 -0.0009540238 0.072447744 -0.07777835  0.010326534
## F2_14  0.066711912 -0.0841292811 0.062700422 -0.19072152  0.003707524
## F2_15 -0.064480250  0.0909333146 0.050275810  0.04077621 -0.019067137
## F2_19  0.063634038 -0.0709378322 0.016600588 -0.04036901  0.017796637
## F2_20 -0.001201217  0.0653004166 0.049766750  0.10391289 -0.040252274
##          MEdarkred     MEgreen MEgreenyellow        MEgrey    MEgrey60
## F2_2   0.006971934 -0.13278003    0.04138109 -0.0055751098  0.02466696
## F2_3  -0.016017527 -0.03032564   -0.02369461  0.0134091891  0.01111424
## F2_14 -0.041321626  0.08744352   -0.20480126 -0.0138913444 -0.07769904
## F2_15 -0.014390509 -0.03874689   -0.03421073 -0.0199902439  0.04570456
## F2_19 -0.023401174  0.09939074   -0.04141718  0.0008595106 -0.01838634
## F2_20  0.113170728  0.02934683   -0.02197066 -0.0408911933  0.11407210
##        MElightcyan MElightgreen MElightyellow   MEmagenta MEmidnightblue
## F2_2  -0.040861194 -0.031003554  -0.011283565 -0.06241287    -0.08179216
## F2_3  -0.032374749  0.004675618  -0.012048700  0.01915760     0.09657692
## F2_14 -0.088374613  0.656010381   0.466954776  0.04810497    -0.11483834
## F2_15 -0.005731497 -0.032201993  -0.028306082  0.04155203     0.06290445
## F2_19 -0.021386305  0.004599456  -0.002610937  0.15373516    -0.19758250
## F2_20  0.080658254 -0.019607656   0.082282807  0.01633041     0.03713012
##             MEpink     MEpurple       MEred MEroyalblue     MEsalmon      MEtan
## F2_2  -0.085323887 -0.066835723 -0.11834913 -0.01614832  0.010275223 -0.1345626
## F2_3  -0.003466306  0.148067941 -0.05146571  0.09714258 -0.013670972  0.0925069
## F2_14 -0.116213720 -0.106130935 -0.04774316  0.12067477  0.028568172 -0.1013999
## F2_15 -0.028853363  0.128292790 -0.01118307  0.03893785 -0.022640042  0.1156341
## F2_19 -0.030412088  0.025959589  0.06600651  0.02163683  0.003104716 -0.0461558
## F2_20 -0.091590726  0.007945583  0.02456608 -0.02039788  0.033463526  0.0206224
##        MEturquoise     MEyellow
## F2_2   0.020351560 -0.064924328
## F2_3   0.037611494 -0.054134964
## F2_14  0.192752495 -0.017465028
## F2_15 -0.022177225  0.046759236
## F2_19  0.062446319 -0.019378632
## F2_20  0.008454122  0.003720271

You have now identified the eigengenes for the data.

Module Merging

To further condense the clusters (branches) into more meaningful modules you can cluster modules based on pairwise eigengene correlations and merge the modules that have similar expression profiles.

REMINDER: An eigengene is the gene whose expression is representative of the the majority of genes expressed within a module.

In order to do cluster analysis you must first find a measurement of dissimilarity (distance) between module eigengenes.

However, because there are missing values present in the input variable, you will need to add an argument use= "complete". This command removes rows of the matrix which have NA values. Removing these NAs allows ME.dissimilarity to run. This may or may not be necessary depending on your data set.

ME.dissimilarity = 1-cor(MElist$eigengenes, use="complete") #Calculate eigengene dissimilarity

Now, using the new found measurements of dissimilarity, you can construct a cluster tree. You will also be adding a line at the height of .25. This height corresponds to a correlation of over 75%. Any branches below this line are more than 75% related, and you will thus be merging them!

METree = hclust(as.dist(ME.dissimilarity), method = "average") #Clustering eigengenes 
par(mar = c(0,4,2,0)) #seting margin sizes
par(cex = 0.6);#scaling the graphic
plot(METree)
abline(h=.25, col = "red") #a height of .25 corresponds to correlation of .75

This figure shows all of the modules which are more than 75% similar. For example you can see that MEcyan and MEpurple are more than 75% similar. Now you can merge these two modules, and others like them using the mergeCloseModules() command

merge <- mergeCloseModules(expression.data, ModuleColors, cutHeight = .25)

##  mergeCloseModules: Merging modules whose distance is less than 0.25
##    Calculating new MEs...

# The merged module colors, assigning one color to each module
mergedColors = merge$colors
# Eigengenes of the new merged modules
mergedMEs = merge$newMEs

The similar modules are now merged! Let’s compare them with the original modules by creating another dendrogram

CHALLENGE 3

Using the knowledge you gained from the Hierarchical Clustering Analysis section, create a dendrogram which shows both the original AND merged module colors.

ANSWER 3

plotDendroAndColors(geneTree, cbind(ModuleColors, mergedColors), 
c("Original Module", "Merged Module"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors for original and merged modules")

Coooollioo! you can see that generally, there are less colors in the merged modules row, showing that our modules which were more than 75% similar have merged. For example, the lime green and red section merged into one large lime green section. This new merged output is cleaner and further identifies groups of highly correlated genes which co-occur across samples.

External Trait Matching

Once you have constructed the network (the adjacency matrix) and divided it into meaningful modules, you can begin to relate the network to external traits.

Start by reading in Trait Data Table

t <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/ClinicalTraits.csv')
traitData <- read.csv(t, header = TRUE, stringsAsFactors = FALSE)
head(traitData)

##   X   Mice Number Mouse_ID          Strain sex        DOB parents Western_Diet
## 1 1 F2_290    290    306-4 BxH ApoE-/-, F2   2 2002-03-22  229232   2002-05-14
## 2 2 F2_291    291    307-1 BxH ApoE-/-, F2   2 2002-03-22     232   2002-05-14
## 3 3 F2_292    292    307-2 BxH ApoE-/-, F2   1 2002-03-22     232   2002-05-14
## 4 4 F2_293    293    307-3 BxH ApoE-/-, F2   1 2002-03-22     232   2002-05-14
## 5 5 F2_294    294    307-4 BxH ApoE-/-, F2   1 2002-03-22     232   2002-05-14
## 6 6 F2_295    295    308-1 BxH ApoE-/-, F2   1 2002-03-22     232   2002-05-14
##     Sac_Date weight_g length_cm ab_fat other_fat total_fat comments
## 1 2002-09-11     36.9       9.9   2.53      2.26      4.79     <NA>
## 2 2002-09-11     48.5      10.7   2.90      2.97      5.87     <NA>
## 3 2002-09-11     45.7      10.4   1.04      2.31      3.35     <NA>
## 4 2002-09-11     50.3      10.9   0.91      1.89      2.80     <NA>
## 5 2002-09-11     44.8       9.8   1.22      2.47      3.69     <NA>
## 6 2002-09-11     39.2      10.2   3.06      2.49      5.55     <NA>
##   X100xfat_weight Trigly Total_Chol HDL_Chol  UC FFA Glucose LDL_plus_VLDL
## 1       12.981030     53       1167       50 484 121     437          1117
## 2       12.103093     61       1230       32 592 173     572          1198
## 3        7.330416     41       1285       81 460  96     497          1204
## 4        5.566600    271       1299       64 476 122     553          1235
## 5        8.236607    114       1410       50 516 118     535          1360
## 6       14.158163     72       1533       18 620 106     382          1515
##   MCP_1_phys Insulin_ug_l Glucose_Insulin Leptin_pg_ml Adiponectin
## 1    175.850          924      0.47294372    245462.00      11.274
## 2     92.430         5781      0.09894482     84420.88       7.099
## 3    196.398         2074      0.23963356    105889.76       5.795
## 4     97.466        11874      0.04657234    100398.68       5.495
## 5     95.452         9181      0.05827252    130846.30       6.868
## 6    144.270          485      0.78762887     75166.22      17.328
##   Aortic.lesions Note Aneurysm Aortic_cal_M Aortic_cal_L CoronaryArtery_Cal
## 1         496250   NA       16            0           17                  0
## 2             NA   NA       16            4            0                  2
## 3         218500   NA        0            0           11                  0
## 4          61250   NA        0            0            0                  0
## 5         243750   NA       12           10            0                  0
## 6         104250   NA       17            2            0                  0
##   Myocardial_cal BMD_all_limbs BMD_femurs_only
## 1              0            NA              NA
## 2              4        0.0548         0.07730
## 3              0        0.0554         0.08065
## 4            236        0.0597         0.08680
## 5              0            NA              NA
## 6              0        0.0557         0.07700

You can clean the data by removing unnecessary columns and pulling out the continuous traits.

allTraits <- traitData[, -c(31, 16)] #removing notes and comments sections 
allTraits <- allTraits[, c(2, 11:36) ] #pulling out only continuous traits

Finally, you must match the trait data to the expression data by the sample number

Samples <- rownames(expression.data)
traitRows <- match(Samples, allTraits$Mice)
datTraits <- allTraits[traitRows, -1]
rownames(datTraits) <- allTraits[traitRows, 1]

Module-Trait associations

First, you will quantify the association between the expression profile and a particular trait of interest by calculating the correlation of the trait with previously identified module eigengenes. This pairwise correlation is known as the eigengene gene significance

# Define numbers of genes and samples
nGenes = ncol(expression.data)
nSamples = nrow(expression.data)
module.trait.correlation = cor(mergedMEs, datTraits, use = "p") #p for pearson correlation coefficient 
module.trait.Pvalue = corPvalueStudent(module.trait.correlation, nSamples) #calculate the p-value associated with the correlation

Once you have the gene signficance and the corresponding p-value for all the modules and traits, you can create a graphical representation (module-trait heatmap) that will be helpful to neatly visualize the module-trait relationships.

# Will display correlations and their p-values
textMatrix = paste(signif(module.trait.correlation, 2), "\n(",
signif(module.trait.Pvalue, 1), ")", sep = "");
dim(textMatrix) = dim(module.trait.correlation)
par(mar = c(6, 8.5, 3, 1))
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = module.trait.correlation,
xLabels = names(datTraits),
yLabels = names(mergedMEs),
ySymbols = names(mergedMEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.4,
zlim = c(-1,1),
main = paste("Module-trait relationships"))

Each row corresponds to a module eigengene, and the columns correspond to a trait. Each cell contains a p-value and correlation. Those with strong positive correlations are shaded a darker red while those with stronger negative correlations become more blue.

Target Gene Identification

You can use the gene significance along with the genes intramodular connectivity to identify potential target genes associated with a particular trait of interest. For this analysis weight will be the clinical trait.

Connectivity - how connected a speficic node is in the network (how many nodes have high correlation with that node). High connectivity indicates a hub gene (central to many nodes). Whole Network connectivity - a measure for how well the node is connected throughout the entire system Intramodular connectivity - a measure for how well the node is connected within its assigned module. Also an indicator for how well that node belongs to its module. This is also known as module membership.

The module membership/intramodular connectivity is calculated as the correlation of the eigengene and the gene expression profile. This quantifies the similarity of all genes on the array to every module.

# Define variable weight containing the weight column of datTrait
weight = as.data.frame(datTraits$weight_g)
names(weight) = "weight"

modNames = substring(names(mergedMEs), 3) #extract module names

#Calculate the module membership and the associated p-values
geneModuleMembership = as.data.frame(cor(expression.data, mergedMEs, use = "p"))
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
names(geneModuleMembership) = paste("MM", modNames, sep="")
names(MMPvalue) = paste("p.MM", modNames, sep="")

#Calculate the gene significance and associated p-values
geneTraitSignificance = as.data.frame(cor(expression.data, weight, use = "p"))
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) = paste("GS.", names(weight), sep="")
names(GSPvalue) = paste("p.GS.", names(weight), sep="")
head(GSPvalue)

##             p.GS.weight
## MMT00000044 0.425749831
## MMT00000046 0.387501807
## MMT00000051 0.004593408
## MMT00000076 0.047920465
## MMT00000080 0.013359743
## MMT00000102 0.007297804

Using the gene significance you can identify genes that have a high significance for weight. Using the module membership measures you can identify genes with high module membership in interesting modules.

As an example, you can look at the magenta module as it has the highest significant association with weight (.59).

Plot a scatter plot of gene significance vs. module membership in the magenta module.

par(mar=c(1,1,1,1))
module = "magenta"
column = match(module, modNames)
moduleGenes = mergedColors==module
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
abs(geneTraitSignificance[moduleGenes,1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

The magenta gene significance and module membership have a positive correlation of .49 with a very significant p-value. This indicates that the genes that are highly significantly associated with the trait (high gene significance) are also the genes that are the most connected within their module (high module membership). Therefore genes in the magenta module could be potential target genes when looking at body weight.

CHALLENGE 4

Plot the gene significance and the module membership values of another module into a scatter plot. Does the module you selected contain potential candidate genes?

ANSWER 4

par(mar=c(1,1,1,1))
module = "purple"
column = match(module, modNames)
moduleGenes = mergedColors==module
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
abs(geneTraitSignificance[moduleGenes,1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

Network Visualization of Eigengenes

It is also possible to study the relationship among found modules. One way to do this is to quantify module similarity (adjacency) by calculating the pairwise correlation of representative eigengenes.

You are able to generate a summary plot of the eigengene correlations using the function plotEigengeneNetworks(). To visualize how the trait fits into the eigengene network you can bind the trait data to the module eigengenes and argue the data frame to plotEigengeneNetworks() which will display the relationship on a heatmap.

# Isolate weight from the clinical traits
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, weight))
# Plot the relationships among the eigengenes and the trait
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(5,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)

With this heatmap you can identify groups of correlated eigengenes called meta modules. Modules with mutual correlations stronger than their correlation with the specified clinical trait would be grouped into a meta module.

The plotEigengeneNetworks() function produces a dendrogram of eigengenes and the specified trait, as well as the corresponding eigengene heat map. You can separate the two using the following code:

# Plot the dendrogram
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
plotHeatmaps = FALSE)

# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0, mar = c(1,1,1,1))
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(5,5,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)

CHALLENGE 5

Create a dendrogram and heatmap using two clinical traits. For example, weight AND Glucose.

ANSWER 5

# Isolate weight and Glucose from the clinical traits
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
Glucose = as.data.frame(datTraits$Glucose)
names(Glucose) = "Glucose"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, weight,Glucose))
# Plot the relationships among the eigengenes and the trait
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(5,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)

Great! As you are able to focus more on how these two traits fit into the eigengene network, rather than be distracted by the other clinical trait data.

References:

Hovarth’s original Tutorial

https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/

R in Action Chapter 16
Langfelder and Horvath 2008

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2631488/

Horvath and Zhang (2005) doi: 10.2202/1544-6115.1128

https://www-degruyter-com.ezproxy.bu.edu/document/doi/10.2202/1544-6115.1128/html

WGCNA

Victoria French, CeCe Gerstenbacher, Warrenkevin Henderson, and Elizabeth Varghese

11/29/2021