As we learned in class, LD is the non-random assortment of alleles at different loci across the genome. Linked alleles will frequently travel with each other during crossover events during meiosis, and if we know LD is high between two alleles present in a population we can even use the frequency of one allele to predict the frequency of the other allele. Today, we will look at LD within TMPRSS2 in our own populations to get both a sense of the amount of linkage present within the gene, and what that linkage can tell us about our assigned population’s history.
In the Cheng et al. (2015) paper that you read, they looked at several SNPs within TMPRSS2 that were thought to be associated with greater susceptibility to H7N9 influenza infections: including rs2070788, rs383510 and rs4816720. The results of their analyses suggested that both rs2070788 and rs383510 were directly associated with flu severity in the sample under investigation. One of the other two alleles, however - rs4816720 - was found to be inlinkage with rs2070788. We will take a closer look at these three SNPs in our own populations, as well as the SNP in TMPRSS2 called rs12329760, associated with prostate cancer according the Maekawa et al. paper from Module 2.
We will be using two statistics to gauge LD in our populations: D’ (pronounced ‘D prime’) and R2 (pronounced ‘R-squared’). Both of these are useful for determining the amount of linkage between two SNPs, but each statistic tells us something slightly different.
D’ is slightly easier to understand, as it simply is a measure of the predictability of one SNP’s genotype based on the other. When D’ = 1, the two SNPs are in perfect LD (meaning that they will always co-segregate, or be inherited together as a unit), and when D’ = 0 the two SNPs are in Linkage Equilibrium (meaning that co-segregation is random, or around 50%).
R^2 is a little bit different. This statistic will also take into account the frequency of the allele in question. If one SNP genotype is linked to another, but the linked genotype of one SNP is the minor allele (less common than the other genotype), the R2 value will be lower than the D’ value. This does not make the SNPs any less linked, but is rather taking into account the (lower) allele frequency.
Learn about the SNPs rs2070788, rs383510, rs4816720, and rs12329760 and their roles in disease susceptibility.
Learn how to use the R package SNPStats to perform LD analysis in a population, including constructing LD matrices and LD heatmaps.
Learn about the two statistics D’ and R2, which are the most commonly used statistics to evaluate LD between SNPs. Learn what each can tell us about a population, and apply the two statistics to our own populations.
Learn how to use Ensembl to look at long-distance LD between SNPs in TMPRSS2, as well as for SNPs in other genes.
Log in to the SCC On Demand and bring up the R Studio window like we did in the last module:
Now that we’re in R, we can prepare our data for analysis. The functions we will use today require our VCF data to be in a special format called a SNPMatrix, so we’ll convert our data in to a SNPMatrix now. As usual, I will use the population YRI again as an example; change all YRI code to your population code!
First, we’ll nned to install a few packages:
install.packages("devtools")
library(devtools)
install.packages("BiocManager")
library(BiocManager)
devtools::install_github("SFUStatgen/LDheatmap")
BiocManager::install("snpStats")
BiocManager::install("VariantAnnotation")
BiocManager::install("GenomicFeatures")
BiocManager::install("gpart")
And then we’ll load the packages we will be using:
library(tidyverse)
library(snpStats)
library(VariantAnnotation)
library(LDheatmap)
Next, we will load in our data, in this case the VCF file of all the SNPs in TMPRSS2 (for me called TMPRSS2_YRI.vcf) in to our R space first, and then use another function to transform them into SNPMatrices.
We’ll start with the full TMPRSS2 file. Copy this code with the name of your VCF file. The hg19 option is telling the function to align our data with the ‘hg19’ human reference genome, which comes with the package VariantAnnotation:
TMPRSS2vcf <- readVcf("TMPRSS2_YRI.vcf.gz", "hg19")
Next, we’ll convert it to a SNPMatrix. Remember to replace “YRI” with the acronym for your population:
TMPRSS2matrix <- genotypeToSnpMatrix(TMPRSS2vcf)
## Warning in .local(x, ...): variants with >1 ALT allele are set to NA
## non-single nucleotide variations are set to NA
TMPRSS2matrix #repeating the file name will show you what it looks like
## $genotypes
## A SnpMatrix with 108 rows and 2246 columns
## Row names: NA18486 ... NA19257
## Col names: rs575760704 ... rs185299923
##
## $map
## DataFrame with 2246 rows and 4 columns
## snp.names allele.1 allele.2 ignore
## <character> <DNAStringSet> <DNAStringSetList> <logical>
## 1 rs575760704 T C FALSE
## 2 rs544440280 G A FALSE
## 3 rs564273360 A T FALSE
## 4 rs559811756 G A FALSE
## 5 rs553400003 G A FALSE
## ... ... ... ... ...
## 2242 rs193020956 G A FALSE
## 2243 rs149635596 C T FALSE
## 2244 rs371526198 C A FALSE
## 2245 rs713400 C T FALSE
## 2246 rs185299923 T G FALSE
One error message you might notice is “variants with >1 ALT allele are set to NA non-single nucleotide variations are set to NA”. This means that loci with 3 or more alleles are not included in any of these analyses. Because of this, you may notice a lot of what appears to be linkage equilibrium when making your charts, but this is merely due to the functions’ inability to process LD between SNPs with more than 2 alleles.
You may also notice that the output of the YRImatrix gives us some information about our data. Under “$genotypes” you’ll see that our SNPMatrix has a column and row count, and that our columns (“Col”) are comprised of dbSNP or Variant ID names, while our rows (“Row”) are comprised of the individual sample codes from the 1000 Genomes Project. You can use this information to count both how many SNPs and how many individuals are in your database. In my case, I’ve got 108 individuals in the YRI population, typed at 2246 variants.
Repeat this process with your ACE2 dataset:
ACE2vcf <- readVcf("ACE2_YRI.vcf.gz", "hg19")
ACE2matrix <- genotypeToSnpMatrix(ACE2vcf)
## Warning in .local(x, ...): variants with >1 ALT allele are set to NA
## non-single nucleotide variations are set to NA
## Warning in .local(x, ...): non-diploid variants are set to NA
head(ACE2matrix)
## $genotypes
## A SnpMatrix with 108 rows and 333 columns
## Row names: NA18486 ... NA19257
## Col names: rs182366225 ... rs186143966
##
## $map
## DataFrame with 333 rows and 4 columns
## snp.names allele.1 allele.2 ignore
## <character> <DNAStringSet> <DNAStringSetList> <logical>
## 1 rs182366225 C T FALSE
## 2 rs142017934 T C FALSE
## 3 rs188290800 T C FALSE
## 4 rs143079256 C T FALSE
## 5 rs192781231 G C FALSE
## ... ... ... ... ...
## 329 rs185721534 A G FALSE
## 330 rs189691652 C T FALSE
## 331 rs113208650 C T FALSE
## 332 rs182809041 A C FALSE
## 333 rs186143966 C T FALSE
Notice, here, that the ACE2 file has many fewer SNPs: 333 columns rather than 2246. This could mean two things: either the ACE2 gene region is smaller than TMPRSS2, or the ACE2 region simply has less variation. Which do you think is the case, and what might that mean about the evolution of each gene?
As you may have already noticed from the output, as well, the SNPMatrix files we made are actually two separate files under one file name, and only one of them is the actual SNPMatrix. In order to call only the SNPMatrix in our functions, we will modify the name with “$genotypes” after our SNPMatrix name. You will see this in our following analyses.
Next, we will be using a function from the snpStats package to look at LD across the TMPRSS2 gene. To do this, we will use the “ld” function to create an LD matrix, which we will then graph to comprehend better. First, we’ll use the “ld” function, like so:
#For "Stats," use "R.squared" for now. We will explore D' later.
LD <- ld(TMPRSS2matrix$genotypes, depth = 2245, stats = "R.squared")
If you type LD
into your console, you’ll see the large matrix of R^2 LD values you’ve just created, which is probably fairly confusing on first look. An LD analysis output like this can be hard to interpret without a visual aid, so let’s build one! The output of this function (our “LD” object) is what we’ll use as an input to visualize these LD values.
#sets color scale for the graph
cols = colorRampPalette(c("yellow", "red"))(10)
#building the image
image(LD, lwd = 0, cuts= 9, col.regions=cols, colorkey=TRUE)
You can see from the figure here in the module that much of this gene in the YRI population shows very low LD (consulting the scale on the right, that would be the yellow regions), meaning that these regions are segregating randomly. White regions represent those SNPs that were excluded due to having more than 2 alleles. The regions in red are those regions that have very high LD, suggesting non-random segregation. Large, contiguous regions of red (like the one we can see in the YRI population between 1500 and 1700) are what we might call linkage blocks, or larger contiguous areas along the genome with consistently high linkage. What could this mean about these regions of the gene?
Now, we will focus on four SNPs in particular: rs2070788, rs383510, rs4816720, and rs12329760. Now, three of these you read about in the homework as being either SNPs related to H7N9 flu severity or prostate cancer.
What we will be doing now is creating two LD heatmaps, one that displays the R2 LD statistics and one that displays the D’ (pronounced “D prime”) LD statistics. As we discussed in the introduction, these two statistics don’t tell us quite the same things about our data, so comparing the two statistics can be useful in making conclusions about LD in your populations
First, we will need a list of all positions in our VCF files:
grep -v "^##" TMPRSS2_YRI.vcf | cut -f1-3 > TMPRSS2_loci.txt #Get list of variant sites for TMPRSS2
grep -v "^##" ACE2_YRI.vcf | cut -f1-3 > ACE2_loci.txt #Get list of variant sites for ACE2
And import them into R:
positions_TMPRSS2<-read.table("TMPRSS2_loci.txt")
positions_TMPRSS2<-positions_TMPRSS2$V2
positions_ACE2<-read.table("ACE2_loci.txt")
positions_ACE2<-positions_ACE2$V2
Now, here is the code to create the LD heatmap showing the R2 statistic (notice where I’ve entered the names of our SNPs of interest, to note their locations on the heatmap):
R2heatmapTMPRSS2 <- LDheatmap(TMPRSS2matrix$genotypes,
genetic.distances=positions_TMPRSS2,
distances="physical",
LDmeasure="r",
title="Pairwise LD with R^2",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("rs2070788", "rs383510", "rs4816720",
"rs12329760"),
color=NULL, newpage=TRUE,
name="TMPRSS2 LD Heatmap (R^2)")
This looks nice, but it’s still pretty difficult to see LD blocks… there’s an even better visualization tool in the {gpart} package for clearly seeing independent linkage blocks.
library(gpart)
tmprss2_res1 = BigLD(genofile = "TMPRSS2_YRI.vcf",LD="r2")
## CLQ done!
constructLDblock done!
tmprss2_res1
## chr start.index end.index start.rsID end.rsID start.bp
## 1 21 8 32 rs462471 rs11910678 41464549
## 2 21 63 80 rs149695119 rs79971314 41466024
## 3 21 82 256 rs73372161 rs465576 41466352
## 4 21 263 317 rs9305744 rs7364088 41471061
## 5 21 323 494 rs28548447 rs8134216 41472311
## 6 21 512 641 rs9636988 rs2276205 41478118
## 7 21 670 1043 rs9976780 rs2838041 41483308
## 8 21 1054 1131 rs3787950 rs55760462 41494369
## 9 21 1155 1163 rs455045 rs415731 41496961
## 10 21 1165 1199 rs1003030 rs3761373 41497679
## 11 21 1208 1213 rs422761 rs389001 41499066
## 12 21 1265 1285 rs435877 rs73903405 41500970
## 13 21 1333 1513 rs420737 rs8127674 41503477
## 14 21 1517 1747 rs8132655 rs7276907 41510388
## 15 21 1754 1831 rs66953820 rs12482673 41517170
## 16 21 1837 1844 rs7281569 rs7282236 41519634
## 17 21 1850 1861 rs555164860;rs34666868 rs6517673 41519883
## 18 21 1869 2152 rs73231906 rs144815871 41520421
## 19 21 2167 2168 rs884166 rs59484614 41528672
## 20 21 2182 2237 rs4327307 rs713399 41529106
## end.bp
## 1 41465059
## 2 41466342
## 3 41470927
## 4 41472203
## 5 41477331
## 6 41482276
## 7 41493979
## 8 41496279
## 9 41497136
## 10 41498595
## 11 41499204
## 12 41501773
## 13 41510076
## 14 41516865
## 15 41519531
## 16 41519797
## 17 41520129
## 18 41528416
## 19 41528707
## 20 41530956
Here, we can see that the BigLD algorithm has divided the YRI genomic region into 20 distinct linkage blocks, giving us both starting and ending positions for each. Using the row number (1-20) to name them, which linkage block(s) contain our SNPs of interest?
Now, let’s visualize these linkage blocks produced by {gpart}. This code will create a helpful LD heatmap figure in your working directory called heatmap_tmprss2_r2.png
. You’ll need to import it to your desktop using CyberDuck or MobaXTerm to open it. Let’s take a look:
nowcex=0.5
LDblockHeatmap(genofile = "TMPRSS2_YRI.vcf", chrN = 21, geneDB = "ensembl", assembly = "GRCh38", geneid = "hgnc_symbol", blocktype="bigld", LD="r2", type="png", filename = "heatmap_tmprss2_r2", CLQmode = "density", res = 600)
## CLQ done!
constructLDblock done!
This image is a LOT more helpful! It allows us to visualize distinct blocks of LD that are statistically partitioned by the package, and also gives us a range of SNP IDs and positions into which we might map SNPs of interest not explicitly listed.
Next, we’ll make the graph displaying the D’ statistic:
DheatmapTMPRSS2 <- LDheatmap(TMPRSS2matrix$genotypes,
genetic.distances=positions_TMPRSS2,
distances="physical",
LDmeasure="r",
title="Pairwise LD with D'",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("rs2070788", "rs383510", "rs4816720",
"rs12329760"),
color=NULL, newpage=TRUE,
name="TMPRSS2 LD Heatmap (D')")
Again, this mode of visualization isn’t particularly helpful… let’s take a look using {gpart}:
library(gpart)
tmprss2_res1 = BigLD(genofile = "TMPRSS2_YRI.vcf", LD="Dprime")
## CLQ done!
constructLDblock done!
tmprss2_res1
## chr start.index end.index start.rsID end.rsID start.bp
## 1 21 8 32 rs462471 rs11910678 41464549
## 2 21 79 104 rs57474639 rs73905371 41466341
## 3 21 105 108 rs73372163 rs73372166 41467065
## 4 21 109 263 rs113562865 rs9305744 41467117
## 5 21 282 284 rs34624090 rs467375 41471514
## 6 21 307 385 rs458213 rs55964536 41472035
## 7 21 386 494 rs2298661 rs8134216 41473715
## 8 21 512 641 rs9636988 rs2276205 41478118
## 9 21 680 689 rs9977234 rs2410429 41483637
## 10 21 706 710 rs56066678 rs2838039 41484552
## 11 21 743 754 rs4818240 rs11702475 41485375
## 12 21 755 759 rs35871560 rs2257202 41485652
## 13 21 769 923 rs2298857 rs11911394 41486054
## 14 21 930 941 rs928871 rs6517669 41490604
## 15 21 952 1003 rs35899679 rs430915 41491393
## 16 21 1112 1121 rs58146697 rs57901138 41495740
## 17 21 1155 1163 rs455045 rs415731 41496961
## 18 21 1165 1199 rs1003030 rs3761373 41497679
## 19 21 1208 1472 rs422761 rs4303795 41499066
## 20 21 1485 1513 rs12481984 rs8127674 41508967
## 21 21 1517 1733 rs8132655 rs11414078 41510388
## 22 21 1740 1742 rs6517672 rs62219272 41516325
## 23 21 1757 1844 rs76043846 rs7282236 41517211
## 24 21 1850 1861 rs555164860;rs34666868 rs6517673 41519883
## 25 21 1894 1895 rs144274178 rs561517039 41521134
## 26 21 1902 2133 rs73357702 rs73359717 41521240
## 27 21 2167 2168 rs884166 rs59484614 41528672
## 28 21 2182 2233 rs4327307 rs4334565 41529106
## 29 21 2237 2238 rs713399 rs76728412 41530956
## end.bp
## 1 41465059
## 2 41467060
## 3 41467111
## 4 41471061
## 5 41471638
## 6 41473711
## 7 41477331
## 8 41482276
## 9 41484121
## 10 41484617
## 11 41485641
## 12 41485788
## 13 41490348
## 14 41491009
## 15 41492816
## 16 41495915
## 17 41497136
## 18 41498595
## 19 41508558
## 20 41510076
## 21 41516040
## 22 41516404
## 23 41519797
## 24 41520129
## 25 41521135
## 26 41527844
## 27 41528707
## 28 41530852
## 29 41530957
Here, we can see that, using the D’ measure, the BigLD algorithm has divided the YRI genomic region into 29 distinct linkage blocks, giving us both starting and ending positions for each. Using the row number (1-29) to name them, which linkage block(s) contain our SNPs of interest? Are any of our SNPs now in different linkage blocks than before?
Now, let’s visualize these linkage blocks using {gpart}, which will divide them visually into strong and weak LD (this means that blocks showing only weak LD will not be explicitly visualized on the graph). This will create a file called heatmap_tmprss2_Dp.png
. Let’s take a look now:
nowcex=0.5
LDblockHeatmap(genofile = "TMPRSS2_YRI.vcf", chrN = 21, geneDB = "ensembl", assembly = "GRCh38", geneid = "hgnc_symbol", blocktype="bigld", LD="Dp-str", type="png", filename = "heatmap_tmprss2_Dp", CLQmode = "density", res = 600)
## CLQ done!
constructLDblock done!
Once again, this is MUCH more helpful!
Now you’ve got several graphs that you can use to investigate LD in your gene region! You can see specifically whether or not the SNPs of interest (the labelled SNPs) are in LD by finding where on the graph their paths intersect, or by looking for their linkage block by position in the LD block analysis output. For the purpose of reporting your pairwise LD results, we can look at something called an LD matrix that was generated by the LDheatmap function. This is a matrix where the row and column names are SNP ID numbers, and the cell at the intersection of a row and a column will tell you the LD statistic for those two particular SNPs. For our purposes, you do not need to look at the whole chart, you just need to find the intersections of our four SNPs of interest to see if any of them are in LD, and what that LD value is for each respective statistical test.
To view your reduced LD matrices, run these two chunks of code separately:
#for R squared LD heatmap:
R2LD<-as.data.frame(R2heatmapTMPRSS2$LDmatrix)
snp<-rownames(R2LD)
R2LD<-
R2LD %>%
mutate(snp = snp) %>%
dplyr::select(snp,rs2070788,rs383510,rs4816720,rs12329760) %>%
filter(snp == "rs2070788" |
snp == "rs383510" |
snp == "rs4816720" |
snp == "rs12329760")
R2LD
## snp rs2070788 rs383510 rs4816720 rs12329760
## 1 rs2070788 NA 0.40666382 0.25668813 0.1196071
## 2 rs12329760 NA 0.08339459 0.03268553 NA
## 3 rs4816720 NA 0.48865306 NA NA
## 4 rs383510 NA NA NA NA
#for D prime LD heatmap:
DpLD<-as.data.frame(DheatmapTMPRSS2$LDmatrix)
snp<-rownames(DpLD)
DpLD<-
DpLD %>%
mutate(snp = snp) %>%
dplyr::select(snp,rs2070788,rs383510,rs4816720,rs12329760) %>%
filter(snp == "rs2070788" |
snp == "rs383510" |
snp == "rs4816720" |
snp == "rs12329760")
DpLD
## snp rs2070788 rs383510 rs4816720 rs12329760
## 1 rs2070788 NA 0.40666382 0.25668813 0.1196071
## 2 rs12329760 NA 0.08339459 0.03268553 NA
## 3 rs4816720 NA 0.48865306 NA NA
## 4 rs383510 NA NA NA NA
Does LD appear to correlate with physical distance among these SNPs, according to your LDheatmap images?
The last thing we’ll do in this lab is look at Ensembl’s information on LD for each of the SNPs we studied today in your populations.
To do this:
Go to Ensembl and search for TMPRSS2 in the Human search bar as we have done before
Use the left sidebar to navigate to the Variant Table:
Some populations will have more SNPs in high LD than others. Look specifically at the SNPs in high LD that are not in TMPRSS2. In the example above, there are SNPs listed in the gene BMX in high linkage with several variants in ACE2.
Repeat this process for all four SNPs we looked at in today’s lab. Where are the high LD SNPs that aren’t in TMPRSS2 located? Are any of them located in other gene regions?
Think about the results you produced today in the context of your population. Here are some guiding questions to help you:
What does it mean for two SNPs in your population to be in LD? How would two obesity SNPs being in LD reflect selection in that population?
Would you expect a high level of LD to develop in TMPRSS2 in your population based on its pandemic context? Why or why not?
What would LD between two COVID-19-related SNPs tell you about the evolutionary history of your population?
If you found any SNPs in high LD with our TMPRSS2 SNPs during the Ensembl exercise that are in a different gene, do a quick OMIM search on the gene(s). What does that gene code for? How might it be connected to TMPRSS2?