Objective: To successfully run network and module construction with gene expression measurements and to use that network to identify potential candidate genes associated with measured phenotypes.
WGCNA is a variation of correlation network construction. Networks are visual representations of interactions between ‘nodes’ in a system. The nodes in Weighted Gene Correlation Network Analysis are individual genes. Therefore WGCNA is a great tool for visualizing patterns and relationships between gene expression profiles (transcripts). WGCNA can:
All of these are important for identifying potential candidate genes associated with measured traits as well as identifying genes that are consistently co-expressed and could be contributing to similar molecular pathways. Using WGCNA is also extremely useful statistically as it accounts for inter-individual variation in gene expression and alleviates issues associated with multiple testing.
Before running through the following code you will need to install the following packages: {BiocManager}, {WGCNA}, {flashClust}
BiocManager is a package that will facilitate the installation of the other two packages as they are not updated to the current version of R. While running this module it is best to update your R and Rstudio to the most current version.
#install.packages('BiocManager')
#library(BiocManager)
#BiocManager::install('WGCNA')
#BiocManager::install('flashClust')
NOTE: Installing these packages will result in prompts that require you to input additional commands at the command line.
library(WGCNA)
library(flashClust)
library(curl)
WGCNA utilizes gene expression data from micro array or RNA-seq experiments. For a basic understanding of these experiments you can view this educational video.
The data set used in this analysis is available from the authors of the WGCNA pathway (Steve Hovarth and Peter Langfelder) in their tutorial.
For this tutorial you can read in the data set from our github page.
d <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/LiverFemale3600.csv')
liver.data <- read.csv(file = d, stringsAsFactors = FALSE, header = TRUE)
head(liver.data)
## substanceBXH gene_symbol LocusLinkID ProteomeID cytogeneticLoc CHROMOSOME
## 1 MMT00000044 1700007N18Rik 69339 286025 0 16
## 2 MMT00000046 Mast2 17776 157466 0 4
## 3 MMT00000051 Ankrd32 105377 321939 0 13
## 4 MMT00000076 0 383154 0 0 16
## 5 MMT00000080 Ldb2 16826 157383 0 5
## 6 MMT00000102 Rdhs 216453 0 10_70.0_cM 10
## StartPosition EndPosition F2_2 F2_3 F2_14 F2_15 F2_19
## 1 50911260 50912491 -0.01810 0.0642 6.44e-05 -0.05800 0.04830
## 2 115215318 115372404 -0.07730 -0.0297 1.12e-01 -0.05890 0.04430
## 3 74940309 74982847 -0.02260 0.0617 -1.29e-01 0.08710 -0.11500
## 4 49345114 49477048 -0.00924 -0.1450 2.87e-02 -0.04390 0.00425
## 5 43546124 43613704 -0.04870 0.0582 -4.83e-02 -0.03710 0.02510
## 6 1337265 1347607 0.17600 -0.1890 -6.50e-02 -0.00846 -0.00574
## F2_20 F2_23 F2_24 F2_26 F2_37 F2_42 F2_43 F2_45
## 1 -0.15197410 -0.00129 -0.23600 -0.0307 -0.02610 0.073705890 -0.0466 -0.00673
## 2 -0.09380000 0.09340 0.02690 -0.1330 0.07570 -0.009193803 -0.0075 0.01700
## 3 -0.06502607 0.00249 -0.10200 0.1420 -0.10200 0.064289290 0.0169 -0.01590
## 4 -0.23610000 -0.06900 0.01440 0.0363 -0.01820 0.477874600 0.1440 0.11100
## 5 0.08504274 0.04450 0.00167 -0.0680 0.00567 -0.075348680 -0.0673 -0.04720
## 6 -0.01807182 -0.12500 -0.06820 0.1250 0.00998 -0.037366600 -0.0402 -0.02190
## F2_46 F2_47 F2_48 F2_51 F2_52 F2_54 F2_63 F2_65 F2_66
## 1 -0.0193 0.09040 0.0290 0.0356 -0.0388 -0.0360 -0.05600 0.009840 -0.0261
## 2 0.0722 -0.08390 0.0273 -0.0784 -0.0178 0.1120 0.12300 0.051700 0.0731
## 3 -0.1430 -0.00492 -0.0735 0.0657 -0.0197 -0.1290 -0.14300 -0.061600 0.0419
## 4 0.0113 0.11900 0.0225 0.0932 0.1430 0.2640 -0.09280 -0.000635 -0.0126
## 5 0.0701 -0.08790 -0.0180 -0.1290 -0.0469 -0.0352 -0.00166 0.058700 -0.0206
## 6 0.0269 0.13300 0.0732 0.1070 -0.0362 -0.0696 -0.19400 -0.117000 -0.0400
## F2_68 F2_69 F2_70 F2_71 F2_72 F2_78 F2_79 F2_80
## 1 0.00856 -0.01180 -0.03350 -0.08310 -0.0471 -0.02820 0.047264410 0.0296
## 2 0.08670 0.05710 0.00693 -0.00606 -0.0390 0.01870 0.008471275 -0.0687
## 3 -0.29000 -0.10800 -0.09950 -0.00315 0.0975 0.01030 -0.134271000 0.1010
## 4 0.06910 0.02260 -0.08630 -0.22900 0.0178 0.00166 0.064096960 0.0103
## 5 -0.13000 0.00392 0.05450 -0.11200 0.1070 0.01190 0.008985630 -0.1030
## 6 0.06890 0.04320 -0.00338 -0.05270 -0.0416 -0.03040 0.025920240 0.0697
## F2_81 F2_83 F2_86 F2_87 F2_88 F2_89 F2_107 F2_108 F2_109
## 1 0.0114 0.0498 -0.0249 -0.00264 -0.02050 0.0826 -0.0421 0.0663 0.03620
## 2 -0.0114 -0.0262 -0.0215 -0.09580 -0.01930 -0.1140 0.0815 0.0285 0.00299
## 3 0.0521 -0.0607 -0.0285 0.02560 -0.01350 0.0796 0.0553 -0.0380 0.12900
## 4 -0.0258 -0.0837 0.1880 0.03310 -0.00652 0.1550 0.0458 0.0752 0.12200
## 5 -0.1400 -0.0282 -0.1090 0.02070 -0.01370 -0.0288 -0.1220 0.1270 -0.09390
## 6 0.1150 0.0953 0.0127 0.05490 0.00311 0.0955 -0.1520 -0.0670 -0.00599
## F2_110 F2_111 F2_112 F2_117 F2_119 F2_125 F2_126 F2_127 F2_141
## 1 0.0808 -0.0404 0.0877 0.07240 -0.0210 0.04540 -0.03220 -0.00654 0.03490
## 2 -0.0407 -0.0657 0.0643 -0.00022 -0.0877 0.00167 0.00321 -0.01260 -0.04530
## 3 -0.0361 0.0441 -0.1640 -0.01420 -0.0279 0.00677 0.07360 0.01750 0.10900
## 4 -0.0104 0.0914 -0.0355 0.06520 0.1280 0.05940 0.01630 0.00292 0.00714
## 5 0.1200 -0.0850 0.1400 0.00867 0.1440 0.08710 -0.03360 0.17300 0.08270
## 6 -0.0438 0.0634 0.1380 -0.04010 0.1310 -0.12600 0.00484 -0.00256 -0.06800
## F2_142 F2_143 F2_144 F2_145 F2_154 F2_155 F2_156 F2_157
## 1 -0.0315 -0.02170 0.00370 0.0322 -0.02150730 -0.000958 -0.0850 0.00462
## 2 -0.0579 0.05920 0.00239 -0.0383 0.02457782 -0.030300 -0.1260 -0.06670
## 3 -0.0216 -0.01250 0.05460 0.0403 -0.01674888 0.059900 0.0311 -0.05190
## 4 -0.0565 0.10200 0.03480 0.0245 0.06776892 0.016500 -0.0382 0.02120
## 5 0.0594 -0.00317 -0.06750 0.0495 0.13520570 0.016500 0.0832 0.04350
## 6 0.0941 -0.04220 0.12000 0.1080 -0.05128296 -0.005590 0.0136 0.09910
## F2_162 F2_163 F2_164 F2_165 F2_166 F2_167 F2_169 F2_180
## 1 0.03990 0.0716 -0.0923 0.10900 0.0102 0.0337 0.00911 0.03210
## 2 -0.00637 -0.0161 -0.2340 -0.09610 -0.1290 -0.0109 -0.11300 -0.00677
## 3 0.01890 0.0207 0.0929 0.00917 0.0874 -0.1260 -0.00949 -0.09900
## 4 0.06690 0.0512 -0.2450 1.23000 -0.0402 -0.0635 0.06880 0.03790
## 5 0.19300 0.0586 -0.0768 0.04600 0.0484 0.2810 0.07210 -0.00630
## 6 0.06770 -0.0520 0.1550 0.07890 0.0336 0.0648 0.14400 0.02770
## F2_181 F2_182 F2_187 F2_188 F2_189 F2_190 F2_191 F2_192
## 1 0.03144772 0.0543 0.01120 0.01060 0.1130 -0.03960 -0.0504 0.0877
## 2 -0.16704700 -0.0239 0.00304 -0.03580 -0.1330 -0.01830 -0.0623 -0.0648
## 3 0.02700180 -0.0570 -0.05160 -0.04970 0.1660 0.05000 0.0498 0.0431
## 4 -0.02058180 0.0227 0.04180 0.01010 0.2170 0.00206 -0.0155 0.6550
## 5 0.37074790 0.0618 0.10800 0.12100 0.0237 0.02960 0.1130 0.0839
## 6 0.09297908 0.0601 0.02960 0.00198 0.0251 0.00059 -0.0282 0.0429
## F2_194 F2_195 F2_200 F2_201 F2_212 F2_213 F2_214 F2_215 F2_221
## 1 -0.0563 -0.00557 -0.0484 -0.0273 -0.10816380 -0.0183 -0.0132 -0.00432 -0.6630
## 2 -0.0652 0.05020 -0.0912 -0.0180 0.05682362 -0.0238 0.0721 0.03910 0.1070
## 3 -0.0224 -0.10700 0.0715 0.0432 -0.13217820 0.0205 -0.0411 0.07670 -0.0783
## 4 0.2820 -0.01310 -0.0387 -0.0667 -0.32395020 -0.0245 0.0865 0.06470 -2.0000
## 5 0.1050 0.15500 0.0823 0.1140 0.03542023 -0.2020 0.0822 0.04260 0.1030
## 6 0.0697 0.04930 0.0414 -0.0708 -0.10881230 0.0359 -0.0678 -0.11000 -0.1420
## F2_222 F2_223 F2_224 F2_225 F2_226 F2_227 F2_228 F2_241 F2_242
## 1 0.01440 0.0310 0.00818 -0.00892 -0.08710 0.0129 0.0937 0.0313 0.0821
## 2 0.00923 -0.0397 -0.06400 0.06300 -0.00152 0.0555 0.0947 -0.0387 0.0592
## 3 -0.06860 -0.0254 -0.05680 -0.13300 -0.07560 -0.0557 -0.0890 -0.1460 -0.0739
## 4 0.00874 0.0847 -0.09720 0.00746 -0.55200 0.0415 0.0733 0.0815 0.1100
## 5 -0.10100 0.1630 0.07410 -0.01640 0.08700 -0.0557 -0.1910 0.0219 0.0913
## 6 0.08430 -0.0610 0.08760 -0.03960 0.10200 0.0190 -0.1190 0.0687 -0.0525
## F2_243 F2_244 F2_245 F2_247 F2_248 F2_261 F2_263 F2_264
## 1 0.00621 0.0307 -0.13700 0.075300 -0.096881950 -0.01670 -0.0928 -0.00957
## 2 -0.00636 0.0614 0.02850 -0.000633 0.001598228 -0.00267 -0.0198 0.16300
## 3 -0.01120 -0.0528 0.05050 0.027700 -0.067933370 -0.02220 -0.0684 -0.04930
## 4 0.21400 0.0135 -0.13500 -0.003100 0.072318780 0.01030 -0.3150 0.08420
## 5 0.01120 0.1190 0.00383 0.041700 -0.038618510 0.11800 0.0123 0.03700
## 6 -0.00716 -0.1460 -0.14500 0.029400 0.035281240 -0.05660 0.0917 -0.08080
## F2_270 F2_271 F2_272 F2_278 F2_287 F2_288 F2_289 F2_290
## 1 0.0287 -0.01300 -0.0292 -0.03810 -0.0488 0.17361240 -0.097900 0.0383
## 2 -0.1310 -0.04260 -0.0514 0.07260 -0.0481 -0.16211430 -0.123000 -0.1370
## 3 0.0328 0.00537 -0.0259 -0.14400 0.0170 0.25924220 -0.041400 -0.0229
## 4 0.0351 NA 0.0730 0.00914 0.0556 0.18311140 0.051700 0.1780
## 5 -0.0142 0.00563 -0.0504 -0.05970 -0.0871 0.20897910 -0.000188 -0.0328
## 6 0.0362 0.00790 -0.0246 -0.07330 0.0125 -0.04778892 0.082500 0.1360
## F2_291 F2_296 F2_298 F2_299 F2_300 F2_302 F2_303 F2_304
## 1 0.01850 -0.08937784 0.0230 -0.06250 -0.000142 0.0344 0.0358 -0.0139
## 2 -0.05720 -0.07416870 -0.0688 -0.06540 -0.102000 -0.0780 -0.0820 -0.1830
## 3 -0.00664 -0.05915232 -0.0134 0.09740 0.015500 -0.0934 0.1780 0.0842
## 4 0.05250 -0.21653720 -0.2210 -0.00266 0.545000 0.0127 0.0273 -0.0928
## 5 -0.16600 -0.07897525 0.1410 -0.12900 0.090600 -0.1330 -0.2120 -0.0797
## 6 0.04620 0.03811979 -0.0346 0.04690 -0.034800 0.0110 0.0323 0.1660
## F2_305 F2_306 F2_307 F2_308 F2_309 F2_310 F2_311 F2_312
## 1 0.0134 -0.03145069 0.02780 -0.01190 -0.0744 0.00197 -0.0151 -0.0721
## 2 -0.0270 -0.09822316 -0.07890 -0.05480 -0.1320 -0.11000 -0.1130 -0.0805
## 3 0.0870 0.15520470 0.03410 -0.06830 0.0555 -0.04060 0.0835 0.0514
## 4 0.0469 0.10038160 -2.00000 0.05240 0.1260 0.07280 0.0600 -0.0455
## 5 -0.0191 -0.11958500 0.00294 -0.10600 -0.0518 -0.13200 0.0494 0.0221
## 6 -0.0866 0.05385017 0.09570 -0.00949 0.1120 0.20800 0.0872 -0.0555
## F2_320 F2_321 F2_323 F2_324 F2_325 F2_326 F2_327 F2_328 F2_329
## 1 -0.0118 0.0200 0.0222 0.047700 -0.0488 0.0168 -0.0309 0.02740 -0.0310
## 2 -0.1200 0.0101 -0.1610 -0.049200 -0.0350 -0.0738 -0.1730 -0.07380 -0.2010
## 3 0.0713 -0.1130 0.0466 0.000612 0.1210 0.0996 0.1090 0.02730 0.1200
## 4 -0.0464 0.0667 -0.1850 -0.270000 0.0803 0.0424 0.1610 0.05120 0.2410
## 5 0.0272 -0.0938 0.1020 0.113000 -0.0859 -0.1340 0.0639 0.00731 0.1240
## 6 0.0748 -0.1420 0.0590 -0.080000 -0.1200 0.1230 0.1870 0.05410 0.0699
## F2_330 F2_332 F2_355 F2_357
## 1 0.0660 -0.0199 -0.0146 0.065000
## 2 -0.0820 -0.0939 0.0192 -0.049900
## 3 -0.0629 -0.0395 0.1090 0.000253
## 4 0.3890 0.0251 -0.0348 0.114000
## 5 -0.0212 0.0870 0.0512 0.024300
## 6 0.0708 0.1450 -0.0399 0.037500
This .csv file contains quantified and normalized gene expression data from livers of several F2 female mice.
For the purposes of this vignette it is just important to understand the values in our data set are gene expression levels relative to a control.
NOTE: For information on how to generate expression data from raw sequence reads, there is a great online course available from the Sanger Institute detailing the process.
The data set contains a lot of additional trait data not relevant to your analysis and is in an incompatible format for the WGCNA pathway. Therefore you need to reformat the data so it is conducive to the WGCNA package functions.
First, the additional trait data needs to be removed.
expression.data <- liver.data[,-c(1:8)] #removing variables not holding expression data
Then, the data frame needs to be transformed so the rows represent samples instead of gene sequences and the columns represent gene sequences instead of samples.
expression.data <- as.data.frame(t(expression.data)) #transforming the data.frame so columns now represent genes and rows represent samples
names(expression.data) <- liver.data$substanceBXH
#renaming the columns so we don't lose the gene IDs
The WGCNA package has a built in function to identify outlier genes called goodSampleGenes()
. The function checks the data and returns a list object of samples and genes that pass its filtering criteria. You can adjust how strict the filtering process is by changing the default of the arguments (which can be found under the goodSampleGene documentation, ?goodSampleGenes).
gsg <-goodSamplesGenes(expression.data)
## Flagging genes and samples with too many missing values...
## ..step 1
summary(gsg)
## Length Class Mode
## goodGenes 3600 -none- logical
## goodSamples 135 -none- logical
## allOK 1 -none- logical
By viewing the gsg list object you can see it contains 3 logical vectors (good genes, good samples and allOK). If you want to see if the function identified any possible outlier all you have to do is evaluate the allOK vector.
gsg$allOK
## [1] TRUE
If the allOK object returns true, which it does in this case, there are no outliers present. If it doesn’t return true, we need to filter out the outliers manually using the following code.
if (!gsg$allOK)
{
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(expression.data)[!gsg$goodGenes], collapse = ", "))); #Identifies and prints outlier genes
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(expression.data)[!gsg$goodSamples], collapse = ", "))); #Identifies and prints oulier samples
expression.data <- expression.data[gsg$goodSamples == TRUE, gsg$goodGenes == TRUE] # Removes the offending genes and samples from the data
}
You can identify outlier samples by using hierarchical clustering. It is probably beneficial to review the clustering module as clustering is used several times within this pathway.
sampleTree <- hclust(dist(expression.data), method = "average") #Clustering samples based on distance
#Setting the graphical parameters
par(cex = 0.6);
par(mar = c(0,4,2,0))
#Plotting the cluster dendrogram
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
Here you can see sample F2_221 seems to be distant from all of the other samples indicating it is likely an outlier sample.
You can remove the outlier using a cutree function.
#Setting the graphical parameters
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)
#draw on line to show cutoff height
abline(h = 15, col = "red");
In this case it looks like the cut height of 15 would cut F2_221 and retain the rest of the samples.
cut.sampleTree <- cutreeStatic(sampleTree, cutHeight = 15, minSize = 10) #returns numeric vector
#Remove outlier
expression.data <- expression.data[cut.sampleTree==1, ]
Once you have purged the outliers, you can begin your network construction. The first step in network construction is to identify some sort of similarity measurement between pairs of genes. A similarity measurement represents the concordance of gene expression profiles across samples. In WGCNA the similarity measurement for each pair of genes (gene i and gene j) is denoted by their Pearson correlation coefficient.
For unsigned networks you take the absolute value of the correlation. So the similarity measurement is denoted by:
\(sab=|cor(a, b)|\)
Although, it is more common practice to create a signed network. As unsigned networks make biological interpretation of results extremely difficult. If the network is unsigned, you do not know if the gene is up or down regulated in the sample trait, just that its expression is significantly different.
The similarity measurement for a signed network is calculated as follows:
\(sab = .5 + .5 * cor(a,b)\)
The next step after calculating the similarity measurement for each gene pair is to translate that similarity measurement into the gene pairs adjacency to generate an adjacency matrix. Adjacency is the assignment of a connection strength based on the co-expression similarity measurement (Pearson correlation coefficient). Nodes are considered connected if they have a significant pairwise correlation association.
The generation of the adjacency matrix uses an adjacency function for its conversion. The function and its parameters vary based on what type of network construction you are trying to perform.
In unweighted networks, the adjacency matrix indicates whether or not a pair of nodes are connected in a binary fashion
Unweighted networks utilize the signum function with the input of a hard threshold parameter \(\tau\)
\(aij = signum(sij,\tau) = \{ \text{1 if sij} \geq \tau , \text{0 if sij} < \tau \}\)
Here you can see the similarity measurement is compared to a hard threshold parameter of \(\tau\). If the similarity measurement is greater than \(\tau\) the pair of genes are assigned a value of 1 (connected) in the adjacency matrix. If the similarity measurement is less than \(\tau\), the pair is assigned a 0 (not connected) in the adjacency matrix.
NOTE: Utilizing a hard threshold can mean you lose a lot of information. If you set the threshold to .9, even if a pair has a similarity of .89999999999 they will be considered to have no connection.
In weighted networks the adjacency/connection is not binary and therefore can also distinguish the strength of connection.
\(aij = power(sij, \beta) = |sij|^\beta\)
Here you can see the adjacency matrix value is calculated by raising the similarity measurement to the defined threshold parameter \(\beta\).
NOTE: The WGCNA package has a built in function called
adjacency()
that will generate the pairwise similarity measurements and the adjacency matrix but before you call the function you need to identify what value you want to use for the threshold parameter.
The choice of the threshold parameter determines the sensitivity and specificity of the pairwise connection strengths (adjacency).
To determine the \(\beta\) parameter in a weighted network analysis we try to maximize a model fit (\(R^2\) value under a linear regression analysis) under a scale free topology model, while minimizing the number of connections lost when fitting the model (maintaining a high mean number of connections). As \(R^2\) values approach 1, we usually see networks with very few connections. Usually this happy medium occurs at \(R^2\) of > .8.
The scale-free topology is used because it is based on the idea that the probability that a node (gene) is connected with k other nodes (genes) decays as a power law:
\(p(k) \sim k^{-\gamma}\)
NOTE: Here is a great publication for more information about scale free topology model assumptions.
Again the WGCNA package has a built in function that helps to determine what the appropriate \(\beta\) parameter will be to calculate our adjacency matrix. The pickSoftThreshold()
function calculates multiple networks all based on different \(\beta\) values and returns a data frame with the \(R^2\) values for the networks scale-free topology model fit as well as the mean connectivity measures.
spt <- pickSoftThreshold(expression.data)
## Warning: executing %dopar% sequentially: no parallel backend registered
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.0278 0.345 0.456 747.00 762.0000 1210.0
## 2 2 0.1260 -0.597 0.843 254.00 251.0000 574.0
## 3 3 0.3400 -1.030 0.972 111.00 102.0000 324.0
## 4 4 0.5060 -1.420 0.973 56.50 47.2000 202.0
## 5 5 0.6810 -1.720 0.940 32.20 25.1000 134.0
## 6 6 0.9020 -1.500 0.962 19.90 14.5000 94.8
## 7 7 0.9210 -1.670 0.917 13.20 8.6800 84.1
## 8 8 0.9040 -1.720 0.876 9.25 5.3900 76.3
## 9 9 0.8590 -1.700 0.836 6.80 3.5600 70.5
## 10 10 0.8330 -1.660 0.831 5.19 2.3800 65.8
## 11 12 0.8530 -1.480 0.911 3.33 1.1500 58.1
## 12 14 0.8760 -1.380 0.949 2.35 0.5740 51.9
## 13 16 0.9070 -1.300 0.970 1.77 0.3090 46.8
## 14 18 0.9120 -1.240 0.973 1.39 0.1670 42.5
## 15 20 0.9310 -1.210 0.977 1.14 0.0951 38.7
spt
## $powerEstimate
## [1] 6
##
## $fitIndices
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k.
## 1 1 0.02776584 0.3445936 0.4558234 746.977852 761.73341294
## 2 2 0.12638769 -0.5966302 0.8432105 254.497776 250.84325945
## 3 3 0.34037012 -1.0299440 0.9724417 111.004761 101.73566693
## 4 4 0.50624635 -1.4215375 0.9726991 56.535969 47.24823353
## 5 5 0.68066112 -1.7161877 0.9395555 32.154311 25.08876080
## 6 6 0.90234867 -1.4974162 0.9621313 19.905790 14.49789064
## 7 7 0.92124780 -1.6673879 0.9171489 13.192338 8.67834321
## 8 8 0.90354653 -1.7240456 0.8759911 9.248524 5.39430786
## 9 9 0.85856532 -1.7025327 0.8364939 6.795315 3.55559547
## 10 10 0.83254728 -1.6632029 0.8314529 5.193521 2.38181026
## 11 12 0.85343233 -1.4816042 0.9114375 3.332680 1.14718679
## 12 14 0.87592425 -1.3827267 0.9493085 2.348905 0.57447184
## 13 16 0.90676370 -1.2955569 0.9697400 1.767929 0.30878389
## 14 18 0.91230435 -1.2435667 0.9725703 1.393515 0.16712574
## 15 20 0.93127802 -1.2079138 0.9773708 1.135087 0.09513045
## max.k.
## 1 1206.42886
## 2 573.54490
## 3 324.15389
## 4 202.30723
## 5 134.17527
## 6 94.77388
## 7 84.10328
## 8 76.31468
## 9 70.53172
## 10 65.75199
## 11 58.06119
## 12 51.93683
## 13 46.83330
## 14 42.47241
## 15 38.68884
You can then plot this data frame to better visualize what \(\beta\) value you should choose.
REMINDER: We should be maximizing the \(R^2\) value and minimizing mean connectivity.
Plot the \(R^2\) values as a function of the soft thresholds
par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1],spt$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(spt$fitIndices[,1],spt$fitIndices[,2],col="red")
abline(h=0.80,col="red")
Plot mean connectivity as a function of soft thresholds
par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1], spt$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(spt$fitIndices[,1], spt$fitIndices[,5], labels= spt$fitIndices[,1],col="red")
You can determine the soft power threshold should be set to 6 as it is the spt that retains the highest mean connectivity while reaching an \(R^2\) value above 0.80.
NOTE: the higher the value, the stronger the connection strength will be of highly correlated gene expression profiles and the more devalued low correlations will be.
Using the following data set, find the appropriate soft threshold parameter.
f <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/LiverMale3600.csv')
male.liver.data <- read.csv(file = f, stringsAsFactors = FALSE, header = TRUE)
male.expression.data <- male.liver.data[,-c(1:8)]
male.expression.data <- as.data.frame(t(male.expression.data))
names(male.expression.data) <- male.liver.data$substanceBXH
spt <- pickSoftThreshold(male.expression.data)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.0166 0.291 0.560 711.00 736.0000 1130.0
## 2 2 0.2150 -0.744 0.750 235.00 238.0000 533.0
## 3 3 0.4490 -1.220 0.900 101.00 96.2000 306.0
## 4 4 0.5770 -1.580 0.938 51.30 45.1000 196.0
## 5 5 0.6300 -1.680 0.926 29.10 23.4000 132.0
## 6 6 0.8610 -1.360 0.968 17.90 12.9000 91.8
## 7 7 0.9030 -1.520 0.932 11.90 7.5200 79.6
## 8 8 0.9080 -1.610 0.889 8.28 4.5600 72.0
## 9 9 0.8950 -1.620 0.867 6.07 2.8800 66.2
## 10 10 0.8820 -1.590 0.868 4.63 1.9000 61.3
## 11 12 0.8900 -1.490 0.904 2.96 0.8760 53.5
## 12 14 0.9310 -1.380 0.958 2.09 0.4370 47.3
## 13 16 0.9540 -1.310 0.980 1.57 0.2270 42.1
## 14 18 0.9700 -1.260 0.984 1.24 0.1190 37.7
## 15 20 0.9840 -1.220 0.995 1.01 0.0642 33.9
spt
## $powerEstimate
## [1] 6
##
## $fitIndices
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k.
## 1 1 0.01662141 0.2910424 0.5600083 711.394196 736.27770151
## 2 2 0.21469296 -0.7438635 0.7504513 235.481706 237.64208316
## 3 3 0.44898034 -1.2170444 0.8999491 101.360510 96.22414016
## 4 4 0.57675213 -1.5801735 0.9377900 51.322132 45.08006719
## 5 5 0.62969225 -1.6783150 0.9256004 29.085033 23.37605519
## 6 6 0.86066502 -1.3595362 0.9683174 17.946077 12.87067193
## 7 7 0.90330684 -1.5244404 0.9316391 11.852555 7.51956295
## 8 8 0.90841840 -1.6066163 0.8894454 8.281637 4.56367136
## 9 9 0.89475031 -1.6197059 0.8665099 6.067619 2.87757673
## 10 10 0.88175329 -1.5925927 0.8675172 4.627542 1.89844724
## 11 12 0.89032242 -1.4854185 0.9044572 2.963698 0.87586571
## 12 14 0.93119658 -1.3795330 0.9582245 2.089336 0.43660024
## 13 16 0.95396790 -1.3091814 0.9802766 1.574216 0.22720402
## 14 18 0.96961183 -1.2589717 0.9844259 1.241877 0.11909272
## 15 20 0.98363568 -1.2202300 0.9953312 1.011695 0.06419666
## max.k.
## 1 1133.60311
## 2 532.50264
## 3 306.16208
## 4 195.85679
## 5 131.78839
## 6 91.76490
## 7 79.64545
## 8 72.02569
## 9 66.16730
## 10 61.32762
## 11 53.52075
## 12 47.29833
## 13 42.12377
## 14 37.72071
## 15 33.92194
par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1],spt$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(spt$fitIndices[,1],spt$fitIndices[,2],col="red")
abline(h=0.80,col="red")
par(mar=c(1,1,1,1))
plot(spt$fitIndices[,1], spt$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(spt$fitIndices[,1], spt$fitIndices[,5], labels= spt$fitIndices[,1],col="red")
Now that you have the soft threshold power determined you can call on the adjacency()
function of the WGCNA package.
REMINDER: This function calculates the similarity measurement and transforms the similarity by the adjacency function and generates a weighted network adjacency matrix.
softPower <- 6
adjacency <- adjacency(expression.data, power = softPower)
Once the network is constructed, you can begin to extract some meaningful relationships. You can use hierarchical clustering yet again to cluster the network into modules.
NOTE: A module is a group of gene profiles that are highly correlated, or have a high topological overlap.
In order to utilize the clustering functions in R you must transform the adjacency matrix into measures of gene dissimilarity (distance of a gene from every other gene in the system).
NOTE: This is due to the fact that dissimilarity is used in traditional cluster analyses.
The TOM-based dissimilarity is preferentially used over dissimilarity based on correlation coefficients because TOM-based dissimilarity generates more distinct modules.
NOTE: Here is a great publication for information about why topological overlap is used.
To convert the adjacency matrix into a TOM similarity matrix we can call the WGCNA function TOMsimilarity()
.
TOM <- TOMsimilarity(adjacency)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
The current matrix gives us the similarity between genes. How do you convert this into measures of dissimilarity?
To convert this matrix into a dissimilarity matrix you can subtract the TOM object from 1.
TOM.dissimilarity <- 1-TOM
The dissimilarity/distance measures are then clustered using linkage hierarchical clustering and a dendrogram (cluster tree) of genes is constructed.
#creating the dendrogram
geneTree <- hclust(as.dist(TOM.dissimilarity), method = "average")
#plotting the dendrogram
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
To identify modules from this gene dendrogram, you can use the cutreeDynamic()
function. This will allow you to set a minimum cluster size. For genomic data like this it is more beneficial to set minimum module sizes relatively high as you are working with high loads of data. The authors of WGCNA recommend to start at a minClusterSize = 30
.
Modules <- cutreeDynamic(dendro = geneTree, distM = TOM.dissimilarity, deepSplit = 2, pamRespectsDendro = FALSE, minClusterSize = 30)
## ..cutHeight not given, setting it to 0.996 ===> 99% of the (truncated) height range in dendro.
## ..done.
table(Modules) #returns a table of the counts of factor levels in an object. In this case how many genes are assigned to each created module.
## Modules
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 88 614 316 311 257 235 225 212 158 153 121 106 102 100 94 91 78 76 65 58
## 20 21 22
## 58 48 34
Here you can see 22 modules were created. The Label 0 Module is reserved for unassigned genes (genes that do not fit in any module).
You can now plot the module assignment under the gene dendrogram for visualization.
ModuleColors <- labels2colors(Modules) #assigns each module number a color
table(ModuleColors) #returns the counts for each color (aka the number of genes within each module)
## ModuleColors
## black blue brown cyan darkgreen darkred
## 212 316 311 94 34 48
## green greenyellow grey grey60 lightcyan lightgreen
## 235 106 88 76 78 65
## lightyellow magenta midnightblue pink purple red
## 58 153 91 158 121 225
## royalblue salmon tan turquoise yellow
## 58 100 102 614 257
#plots the gene dendrogram with the module colors
plotDendroAndColors(geneTree, ModuleColors,"Module",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
A ME (Module Eigengene) is the standardized gene expression profile for a given module.
To identify the Module Eigengene you can call on the expression data into the moduleEigengenes()
function.
MElist <- moduleEigengenes(expression.data, colors = ModuleColors)
MEs <- MElist$eigengenes
head(MEs)
## MEblack MEblue MEbrown MEcyan MEdarkgreen
## F2_2 0.013902476 0.0410177922 0.007072125 0.12978459 0.006276361
## F2_3 0.066675342 -0.0009540238 0.072447744 -0.07777835 0.010326534
## F2_14 0.066711912 -0.0841292811 0.062700422 -0.19072152 0.003707524
## F2_15 -0.064480250 0.0909333146 0.050275810 0.04077621 -0.019067137
## F2_19 0.063634038 -0.0709378322 0.016600588 -0.04036901 0.017796637
## F2_20 -0.001201217 0.0653004166 0.049766750 0.10391289 -0.040252274
## MEdarkred MEgreen MEgreenyellow MEgrey MEgrey60
## F2_2 0.006971934 -0.13278003 0.04138109 -0.0055751098 0.02466696
## F2_3 -0.016017527 -0.03032564 -0.02369461 0.0134091891 0.01111424
## F2_14 -0.041321626 0.08744352 -0.20480126 -0.0138913444 -0.07769904
## F2_15 -0.014390509 -0.03874689 -0.03421073 -0.0199902439 0.04570456
## F2_19 -0.023401174 0.09939074 -0.04141718 0.0008595106 -0.01838634
## F2_20 0.113170728 0.02934683 -0.02197066 -0.0408911933 0.11407210
## MElightcyan MElightgreen MElightyellow MEmagenta MEmidnightblue
## F2_2 -0.040861194 -0.031003554 -0.011283565 -0.06241287 -0.08179216
## F2_3 -0.032374749 0.004675618 -0.012048700 0.01915760 0.09657692
## F2_14 -0.088374613 0.656010381 0.466954776 0.04810497 -0.11483834
## F2_15 -0.005731497 -0.032201993 -0.028306082 0.04155203 0.06290445
## F2_19 -0.021386305 0.004599456 -0.002610937 0.15373516 -0.19758250
## F2_20 0.080658254 -0.019607656 0.082282807 0.01633041 0.03713012
## MEpink MEpurple MEred MEroyalblue MEsalmon MEtan
## F2_2 -0.085323887 -0.066835723 -0.11834913 -0.01614832 0.010275223 -0.1345626
## F2_3 -0.003466306 0.148067941 -0.05146571 0.09714258 -0.013670972 0.0925069
## F2_14 -0.116213720 -0.106130935 -0.04774316 0.12067477 0.028568172 -0.1013999
## F2_15 -0.028853363 0.128292790 -0.01118307 0.03893785 -0.022640042 0.1156341
## F2_19 -0.030412088 0.025959589 0.06600651 0.02163683 0.003104716 -0.0461558
## F2_20 -0.091590726 0.007945583 0.02456608 -0.02039788 0.033463526 0.0206224
## MEturquoise MEyellow
## F2_2 0.020351560 -0.064924328
## F2_3 0.037611494 -0.054134964
## F2_14 0.192752495 -0.017465028
## F2_15 -0.022177225 0.046759236
## F2_19 0.062446319 -0.019378632
## F2_20 0.008454122 0.003720271
You have now identified the eigengenes for the data.
To further condense the clusters (branches) into more meaningful modules you can cluster modules based on pairwise eigengene correlations and merge the modules that have similar expression profiles.
REMINDER: An eigengene is the gene whose expression is representative of the the majority of genes expressed within a module.
In order to do cluster analysis you must first find a measurement of dissimilarity (distance) between module eigengenes.
However, because there are missing values present in the input variable, you will need to add an argument use= "complete"
. This command removes rows of the matrix which have NA values. Removing these NAs allows ME.dissimilarity to run. This may or may not be necessary depending on your data set.
ME.dissimilarity = 1-cor(MElist$eigengenes, use="complete") #Calculate eigengene dissimilarity
Now, using the new found measurements of dissimilarity, you can construct a cluster tree. You will also be adding a line at the height of .25. This height corresponds to a correlation of over 75%. Any branches below this line are more than 75% related, and you will thus be merging them!
METree = hclust(as.dist(ME.dissimilarity), method = "average") #Clustering eigengenes
par(mar = c(0,4,2,0)) #seting margin sizes
par(cex = 0.6);#scaling the graphic
plot(METree)
abline(h=.25, col = "red") #a height of .25 corresponds to correlation of .75
This figure shows all of the modules which are more than 75% similar. For example you can see that MEcyan and MEpurple are more than 75% similar. Now you can merge these two modules, and others like them using the mergeCloseModules()
command
merge <- mergeCloseModules(expression.data, ModuleColors, cutHeight = .25)
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
# The merged module colors, assigning one color to each module
mergedColors = merge$colors
# Eigengenes of the new merged modules
mergedMEs = merge$newMEs
The similar modules are now merged! Let’s compare them with the original modules by creating another dendrogram
Using the knowledge you gained from the Hierarchical Clustering Analysis section, create a dendrogram which shows both the original AND merged module colors.
plotDendroAndColors(geneTree, cbind(ModuleColors, mergedColors),
c("Original Module", "Merged Module"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors for original and merged modules")
Coooollioo! you can see that generally, there are less colors in the merged modules row, showing that our modules which were more than 75% similar have merged. For example, the lime green and red section merged into one large lime green section. This new merged output is cleaner and further identifies groups of highly correlated genes which co-occur across samples.
Once you have constructed the network (the adjacency matrix) and divided it into meaningful modules, you can begin to relate the network to external traits.
Start by reading in Trait Data Table
t <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/bioanth-stats/module-F21-Group1/FemaleLiver-Data/ClinicalTraits.csv')
traitData <- read.csv(t, header = TRUE, stringsAsFactors = FALSE)
head(traitData)
## X Mice Number Mouse_ID Strain sex DOB parents Western_Diet
## 1 1 F2_290 290 306-4 BxH ApoE-/-, F2 2 2002-03-22 229232 2002-05-14
## 2 2 F2_291 291 307-1 BxH ApoE-/-, F2 2 2002-03-22 232 2002-05-14
## 3 3 F2_292 292 307-2 BxH ApoE-/-, F2 1 2002-03-22 232 2002-05-14
## 4 4 F2_293 293 307-3 BxH ApoE-/-, F2 1 2002-03-22 232 2002-05-14
## 5 5 F2_294 294 307-4 BxH ApoE-/-, F2 1 2002-03-22 232 2002-05-14
## 6 6 F2_295 295 308-1 BxH ApoE-/-, F2 1 2002-03-22 232 2002-05-14
## Sac_Date weight_g length_cm ab_fat other_fat total_fat comments
## 1 2002-09-11 36.9 9.9 2.53 2.26 4.79 <NA>
## 2 2002-09-11 48.5 10.7 2.90 2.97 5.87 <NA>
## 3 2002-09-11 45.7 10.4 1.04 2.31 3.35 <NA>
## 4 2002-09-11 50.3 10.9 0.91 1.89 2.80 <NA>
## 5 2002-09-11 44.8 9.8 1.22 2.47 3.69 <NA>
## 6 2002-09-11 39.2 10.2 3.06 2.49 5.55 <NA>
## X100xfat_weight Trigly Total_Chol HDL_Chol UC FFA Glucose LDL_plus_VLDL
## 1 12.981030 53 1167 50 484 121 437 1117
## 2 12.103093 61 1230 32 592 173 572 1198
## 3 7.330416 41 1285 81 460 96 497 1204
## 4 5.566600 271 1299 64 476 122 553 1235
## 5 8.236607 114 1410 50 516 118 535 1360
## 6 14.158163 72 1533 18 620 106 382 1515
## MCP_1_phys Insulin_ug_l Glucose_Insulin Leptin_pg_ml Adiponectin
## 1 175.850 924 0.47294372 245462.00 11.274
## 2 92.430 5781 0.09894482 84420.88 7.099
## 3 196.398 2074 0.23963356 105889.76 5.795
## 4 97.466 11874 0.04657234 100398.68 5.495
## 5 95.452 9181 0.05827252 130846.30 6.868
## 6 144.270 485 0.78762887 75166.22 17.328
## Aortic.lesions Note Aneurysm Aortic_cal_M Aortic_cal_L CoronaryArtery_Cal
## 1 496250 NA 16 0 17 0
## 2 NA NA 16 4 0 2
## 3 218500 NA 0 0 11 0
## 4 61250 NA 0 0 0 0
## 5 243750 NA 12 10 0 0
## 6 104250 NA 17 2 0 0
## Myocardial_cal BMD_all_limbs BMD_femurs_only
## 1 0 NA NA
## 2 4 0.0548 0.07730
## 3 0 0.0554 0.08065
## 4 236 0.0597 0.08680
## 5 0 NA NA
## 6 0 0.0557 0.07700
You can clean the data by removing unnecessary columns and pulling out the continuous traits.
allTraits <- traitData[, -c(31, 16)] #removing notes and comments sections
allTraits <- allTraits[, c(2, 11:36) ] #pulling out only continuous traits
Finally, you must match the trait data to the expression data by the sample number
Samples <- rownames(expression.data)
traitRows <- match(Samples, allTraits$Mice)
datTraits <- allTraits[traitRows, -1]
rownames(datTraits) <- allTraits[traitRows, 1]
First, you will quantify the association between the expression profile and a particular trait of interest by calculating the correlation of the trait with previously identified module eigengenes. This pairwise correlation is known as the eigengene gene significance
# Define numbers of genes and samples
nGenes = ncol(expression.data)
nSamples = nrow(expression.data)
module.trait.correlation = cor(mergedMEs, datTraits, use = "p") #p for pearson correlation coefficient
module.trait.Pvalue = corPvalueStudent(module.trait.correlation, nSamples) #calculate the p-value associated with the correlation
Once you have the gene signficance and the corresponding p-value for all the modules and traits, you can create a graphical representation (module-trait heatmap) that will be helpful to neatly visualize the module-trait relationships.
# Will display correlations and their p-values
textMatrix = paste(signif(module.trait.correlation, 2), "\n(",
signif(module.trait.Pvalue, 1), ")", sep = "");
dim(textMatrix) = dim(module.trait.correlation)
par(mar = c(6, 8.5, 3, 1))
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = module.trait.correlation,
xLabels = names(datTraits),
yLabels = names(mergedMEs),
ySymbols = names(mergedMEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.4,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
Each row corresponds to a module eigengene, and the columns correspond to a trait. Each cell contains a p-value and correlation. Those with strong positive correlations are shaded a darker red while those with stronger negative correlations become more blue.
You can use the gene significance along with the genes intramodular connectivity to identify potential target genes associated with a particular trait of interest. For this analysis weight will be the clinical trait.
Connectivity - how connected a speficic node is in the network (how many nodes have high correlation with that node). High connectivity indicates a hub gene (central to many nodes). Whole Network connectivity - a measure for how well the node is connected throughout the entire system Intramodular connectivity - a measure for how well the node is connected within its assigned module. Also an indicator for how well that node belongs to its module. This is also known as module membership.
The module membership/intramodular connectivity is calculated as the correlation of the eigengene and the gene expression profile. This quantifies the similarity of all genes on the array to every module.
# Define variable weight containing the weight column of datTrait
weight = as.data.frame(datTraits$weight_g)
names(weight) = "weight"
modNames = substring(names(mergedMEs), 3) #extract module names
#Calculate the module membership and the associated p-values
geneModuleMembership = as.data.frame(cor(expression.data, mergedMEs, use = "p"))
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
names(geneModuleMembership) = paste("MM", modNames, sep="")
names(MMPvalue) = paste("p.MM", modNames, sep="")
#Calculate the gene significance and associated p-values
geneTraitSignificance = as.data.frame(cor(expression.data, weight, use = "p"))
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
names(geneTraitSignificance) = paste("GS.", names(weight), sep="")
names(GSPvalue) = paste("p.GS.", names(weight), sep="")
head(GSPvalue)
## p.GS.weight
## MMT00000044 0.425749831
## MMT00000046 0.387501807
## MMT00000051 0.004593408
## MMT00000076 0.047920465
## MMT00000080 0.013359743
## MMT00000102 0.007297804
Using the gene significance you can identify genes that have a high significance for weight. Using the module membership measures you can identify genes with high module membership in interesting modules.
As an example, you can look at the magenta module as it has the highest significant association with weight (.59).
Plot a scatter plot of gene significance vs. module membership in the magenta module.
par(mar=c(1,1,1,1))
module = "magenta"
column = match(module, modNames)
moduleGenes = mergedColors==module
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
abs(geneTraitSignificance[moduleGenes,1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
The magenta gene significance and module membership have a positive correlation of .49 with a very significant p-value. This indicates that the genes that are highly significantly associated with the trait (high gene significance) are also the genes that are the most connected within their module (high module membership). Therefore genes in the magenta module could be potential target genes when looking at body weight.
Plot the gene significance and the module membership values of another module into a scatter plot. Does the module you selected contain potential candidate genes?
par(mar=c(1,1,1,1))
module = "purple"
column = match(module, modNames)
moduleGenes = mergedColors==module
verboseScatterplot(abs(geneModuleMembership[moduleGenes,column]),
abs(geneTraitSignificance[moduleGenes,1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for body weight",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
It is also possible to study the relationship among found modules. One way to do this is to quantify module similarity (adjacency) by calculating the pairwise correlation of representative eigengenes.
You are able to generate a summary plot of the eigengene correlations using the function plotEigengeneNetworks()
. To visualize how the trait fits into the eigengene network you can bind the trait data to the module eigengenes and argue the data frame to plotEigengeneNetworks()
which will display the relationship on a heatmap.
# Isolate weight from the clinical traits
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, weight))
# Plot the relationships among the eigengenes and the trait
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(5,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)
With this heatmap you can identify groups of correlated eigengenes called meta modules. Modules with mutual correlations stronger than their correlation with the specified clinical trait would be grouped into a meta module.
The plotEigengeneNetworks()
function produces a dendrogram of eigengenes and the specified trait, as well as the corresponding eigengene heat map. You can separate the two using the following code:
# Plot the dendrogram
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
plotHeatmaps = FALSE)
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0, mar = c(1,1,1,1))
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(5,5,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
Create a dendrogram and heatmap using two clinical traits. For example, weight AND Glucose.
# Isolate weight and Glucose from the clinical traits
weight = as.data.frame(datTraits$weight_g);
names(weight) = "weight"
Glucose = as.data.frame(datTraits$Glucose)
names(Glucose) = "Glucose"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, weight,Glucose))
# Plot the relationships among the eigengenes and the trait
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(5,4,1,2), cex.lab = 0.8, xLabelsAngle
= 90)
Great! As you are able to focus more on how these two traits fit into the eigengene network, rather than be distracted by the other clinical trait data.
R in Action Chapter 16
Langfelder and Horvath 2008