Lab 3 - Linkage Disequilibrium (LD) in TMPRSS2

Introduction to Linkage Disequilibrium (LD)

Homework for Lab 3: DUE Friday, October 23rd

Readings:

Wooster L, Nicholson CJ, Sigurslid HH, Lino Cardenas CL, Malhotra R. Preprint accessed 31AUG20. Polymorphisms in the ACE2 locus associate with severity of COVID-19 infection. medRxiv doi:10.1101/2020.06.18.20135152

Cheng Z., et al. 2015. Identification of TMPRSS2 as a susceptibility gene for severe 2009 pandemic A(H1N1) influenza and A(H7N9) influenza. The Journal of Infectious Diseases, 212(8):1214–1221.

Slatkin M. 2008. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nat Rev Genet 9: 477-485.

Zeberg H, Pääbo S. 2020. The major genetic risk factor for severe COVID-19 is inherited from Neanderthals. Nature 2998.

Claiborne Stephens J, Schneider JA, Tanguay DA, Choi J, Acharya Y, Stanley SE, Jiang R, et al. 2001. Haplotype variation and linkage disequilibrium in 313 human genes. Science 293: 489-93.

As we learned in class, LD is the non-random assortment of alleles at different loci across the genome. Linked alleles will frequently travel with each other during crossover events during meiosis, and if we know LD is high between two alleles present in a population we can even use the frequency of one allele to predict the frequency of the other allele. Today, we will look at LD within TMPRSS2 in our own populations to get both a sense of the amount of linkage present within the gene, and what that linkage can tell us about our assigned population’s history.

Our Focal SNPs

In the Cheng et al. (2015) paper that you read, they looked at several SNPs within TMPRSS2 that were thought to be associated with greater susceptibility to H7N9 influenza infections: including rs2070788, rs383510 and rs4816720. The results of their analyses suggested that both rs2070788 and rs383510 were directly associated with flu severity in the sample under investigation. One of the other two alleles, however - rs4816720 - was found to be inlinkage with rs2070788. We will take a closer look at these three SNPs in our own populations, as well as the SNP in TMPRSS2 called rs12329760, associated with prostate cancer according the Maekawa et al. paper from Module 2.

A Note On D’ and R²

We will be using two statistics to gauge LD in our populations: D’ (pronounced ‘D prime’) and R² (pronounced ‘R-squared’). Both of these are useful for determining the amount of linkage between two SNPs, but each statistic tells us something slightly different.

D’ is slightly easier to understand, as it simply is a measure of the predictability of one SNP’s genotype based on the other. When D’ = 1, the two SNPs are in perfect LD (meaning that they will always co-segregate, or be inherited together as a unit), and when D’ = 0 the two SNPs are in Linkage Equilibrium (meaning that co-segregation is random, or around 50%).
R^2 is a little bit different. This statistic will also take into account the frequency of the allele in question. If one SNP genotype is linked to another, but the linked genotype of one SNP is the minor allele (less common than the other genotype), the R² value will be lower than the D’ value. This does not make the SNPs any less linked, but is rather taking into account the (lower) allele frequency.

Learning Outcomes

Learn about the SNPs rs2070788, rs383510, rs4816720, and rs12329760 and their roles in disease susceptibility.
Learn how to use the R package SNPStats to perform LD analysis in a population, including constructing LD matrices and LD heatmaps.
Learn about the two statistics D’ and R², which are the most commonly used statistics to evaluate LD between SNPs. Learn what each can tell us about a population, and apply the two statistics to our own populations.
Learn how to use Ensembl to look at long-distance LD between SNPs in TMPRSS2, as well as for SNPs in other genes.

Step 1: Getting to R Studio and Preparing your Data

Now that we’re in R, we can prepare our data for analysis. The functions we will use today require our VCF data to be in a special format called a SNPMatrix, so we’ll convert our data in to a SNPMatrix now. As usual, I will use the population YRI again as an example; change all YRI code to your population code!

First, we’ll nned to install a few packages:

install.packages("devtools")
library(devtools)
install.packages("BiocManager")
library(BiocManager)
devtools::install_github("SFUStatgen/LDheatmap")
BiocManager::install("snpStats")
BiocManager::install("VariantAnnotation")
BiocManager::install("GenomicFeatures")
BiocManager::install("gpart")

And then we’ll load the packages we will be using:

library(tidyverse)
library(snpStats)
library(VariantAnnotation)
library(LDheatmap)

Next, we will load in our data, in this case the VCF file of all the SNPs in TMPRSS2 (for me called TMPRSS2_YRI.vcf) in to our R space first, and then use another function to transform them into SNPMatrices.

We’ll start with the full TMPRSS2 file. Copy this code with the name of your VCF file. The hg19 option is telling the function to align our data with the ‘hg19’ human reference genome, which comes with the package VariantAnnotation:

TMPRSS2vcf <- readVcf("TMPRSS2_YRI.vcf.gz", "hg19")

Next, we’ll convert it to a SNPMatrix. Remember to replace “YRI” with the acronym for your population:

TMPRSS2matrix <- genotypeToSnpMatrix(TMPRSS2vcf)

## Warning in .local(x, ...): variants with >1 ALT allele are set to NA

## non-single nucleotide variations are set to NA

TMPRSS2matrix #repeating the file name will show you what it looks like

## $genotypes
## A SnpMatrix with  108 rows and  2246 columns
## Row names:  NA18486 ... NA19257 
## Col names:  rs575760704 ... rs185299923 
## 
## $map
## DataFrame with 2246 rows and 4 columns
##        snp.names       allele.1           allele.2    ignore
##      <character> <DNAStringSet> <DNAStringSetList> <logical>
## 1    rs575760704              T                  C     FALSE
## 2    rs544440280              G                  A     FALSE
## 3    rs564273360              A                  T     FALSE
## 4    rs559811756              G                  A     FALSE
## 5    rs553400003              G                  A     FALSE
## ...          ...            ...                ...       ...
## 2242 rs193020956              G                  A     FALSE
## 2243 rs149635596              C                  T     FALSE
## 2244 rs371526198              C                  A     FALSE
## 2245    rs713400              C                  T     FALSE
## 2246 rs185299923              T                  G     FALSE

One error message you might notice is “variants with >1 ALT allele are set to NA non-single nucleotide variations are set to NA”. This means that loci with 3 or more alleles are not included in any of these analyses. Because of this, you may notice a lot of what appears to be linkage equilibrium when making your charts, but this is merely due to the functions’ inability to process LD between SNPs with more than 2 alleles.

You may also notice that the output of the YRImatrix gives us some information about our data. Under “$genotypes” you’ll see that our SNPMatrix has a column and row count, and that our columns (“Col”) are comprised of dbSNP or Variant ID names, while our rows (“Row”) are comprised of the individual sample codes from the 1000 Genomes Project. You can use this information to count both how many SNPs and how many individuals are in your database. In my case, I’ve got 108 individuals in the YRI population, typed at 2246 variants.

Repeat this process with your ACE2 dataset:

ACE2vcf <- readVcf("ACE2_YRI.vcf.gz", "hg19")

ACE2matrix <- genotypeToSnpMatrix(ACE2vcf)

## Warning in .local(x, ...): variants with >1 ALT allele are set to NA

## non-single nucleotide variations are set to NA

## Warning in .local(x, ...): non-diploid variants are set to NA

head(ACE2matrix)

## $genotypes
## A SnpMatrix with  108 rows and  333 columns
## Row names:  NA18486 ... NA19257 
## Col names:  rs182366225 ... rs186143966 
## 
## $map
## DataFrame with 333 rows and 4 columns
##       snp.names       allele.1           allele.2    ignore
##     <character> <DNAStringSet> <DNAStringSetList> <logical>
## 1   rs182366225              C                  T     FALSE
## 2   rs142017934              T                  C     FALSE
## 3   rs188290800              T                  C     FALSE
## 4   rs143079256              C                  T     FALSE
## 5   rs192781231              G                  C     FALSE
## ...         ...            ...                ...       ...
## 329 rs185721534              A                  G     FALSE
## 330 rs189691652              C                  T     FALSE
## 331 rs113208650              C                  T     FALSE
## 332 rs182809041              A                  C     FALSE
## 333 rs186143966              C                  T     FALSE

Notice, here, that the ACE2 file has many fewer SNPs: 333 columns rather than 2246. This could mean two things: either the ACE2 gene region is smaller than TMPRSS2, or the ACE2 region simply has less variation. Which do you think is the case, and what might that mean about the evolution of each gene?

As you may have already noticed from the output, as well, the SNPMatrix files we made are actually two separate files under one file name, and only one of them is the actual SNPMatrix. In order to call only the SNPMatrix in our functions, we will modify the name with “$genotypes” after our SNPMatrix name. You will see this in our following analyses.

Step 2: Using the “ld” Function

Next, we will be using a function from the snpStats package to look at LD across the TMPRSS2 gene. To do this, we will use the “ld” function to create an LD matrix, which we will then graph to comprehend better. First, we’ll use the “ld” function, like so:

#For "Stats," use "R.squared" for now. We will explore D' later.
LD <- ld(TMPRSS2matrix$genotypes, depth = 2245, stats = "R.squared")

If you type LD into your console, you’ll see the large matrix of R^2 LD values you’ve just created, which is probably fairly confusing on first look. An LD analysis output like this can be hard to interpret without a visual aid, so let’s build one! The output of this function (our “LD” object) is what we’ll use as an input to visualize these LD values.

#sets color scale for the graph
cols = colorRampPalette(c("yellow", "red"))(10)
#building the image 
image(LD, lwd = 0, cuts= 9, col.regions=cols, colorkey=TRUE)

You can see from the figure here in the module that much of this gene in the YRI population shows very low LD (consulting the scale on the right, that would be the yellow regions), meaning that these regions are segregating randomly. White regions represent those SNPs that were excluded due to having more than 2 alleles. The regions in red are those regions that have very high LD, suggesting non-random segregation. Large, contiguous regions of red (like the one we can see in the YRI population between 1500 and 1700) are what we might call linkage blocks, or larger contiguous areas along the genome with consistently high linkage. What could this mean about these regions of the gene?

Step 3: Measuring LD in our SNPs of Interest

Now, we will focus on four SNPs in particular: rs2070788, rs383510, rs4816720, and rs12329760. Now, three of these you read about in the homework as being either SNPs related to H7N9 flu severity or prostate cancer.

What we will be doing now is creating two LD heatmaps, one that displays the R² LD statistics and one that displays the D’ (pronounced “D prime”) LD statistics. As we discussed in the introduction, these two statistics don’t tell us quite the same things about our data, so comparing the two statistics can be useful in making conclusions about LD in your populations

First, we will need a list of all positions in our VCF files:

grep -v "^##" TMPRSS2_YRI.vcf | cut -f1-3 > TMPRSS2_loci.txt #Get list of variant sites for TMPRSS2
grep -v "^##" ACE2_YRI.vcf | cut -f1-3 > ACE2_loci.txt #Get list of variant sites for ACE2

And import them into R:

positions_TMPRSS2<-read.table("TMPRSS2_loci.txt")
positions_TMPRSS2<-positions_TMPRSS2$V2
positions_ACE2<-read.table("ACE2_loci.txt")
positions_ACE2<-positions_ACE2$V2

Now, here is the code to create the LD heatmap showing the R² statistic (notice where I’ve entered the names of our SNPs of interest, to note their locations on the heatmap):

R2heatmapTMPRSS2 <- LDheatmap(TMPRSS2matrix$genotypes,
                           genetic.distances=positions_TMPRSS2,
                           distances="physical",
                           LDmeasure="r",
                           title="Pairwise LD with R^2",
                           add.map=TRUE, add.key=TRUE,
                           geneMapLocation=0.15,
                           SNP.name=c("rs2070788", "rs383510", "rs4816720",
                                      "rs12329760"),
                           color=NULL, newpage=TRUE,
                           name="TMPRSS2 LD Heatmap (R^2)")

This looks nice, but it’s still pretty difficult to see LD blocks… there’s an even better visualization tool in the {gpart} package for clearly seeing independent linkage blocks.

library(gpart)

tmprss2_res1 = BigLD(genofile = "TMPRSS2_YRI.vcf",LD="r2")

## CLQ done!
constructLDblock done!

tmprss2_res1

##    chr start.index end.index             start.rsID    end.rsID start.bp
## 1   21           8        32               rs462471  rs11910678 41464549
## 2   21          63        80            rs149695119  rs79971314 41466024
## 3   21          82       256             rs73372161    rs465576 41466352
## 4   21         263       317              rs9305744   rs7364088 41471061
## 5   21         323       494             rs28548447   rs8134216 41472311
## 6   21         512       641              rs9636988   rs2276205 41478118
## 7   21         670      1043              rs9976780   rs2838041 41483308
## 8   21        1054      1131              rs3787950  rs55760462 41494369
## 9   21        1155      1163               rs455045    rs415731 41496961
## 10  21        1165      1199              rs1003030   rs3761373 41497679
## 11  21        1208      1213               rs422761    rs389001 41499066
## 12  21        1265      1285               rs435877  rs73903405 41500970
## 13  21        1333      1513               rs420737   rs8127674 41503477
## 14  21        1517      1747              rs8132655   rs7276907 41510388
## 15  21        1754      1831             rs66953820  rs12482673 41517170
## 16  21        1837      1844              rs7281569   rs7282236 41519634
## 17  21        1850      1861 rs555164860;rs34666868   rs6517673 41519883
## 18  21        1869      2152             rs73231906 rs144815871 41520421
## 19  21        2167      2168               rs884166  rs59484614 41528672
## 20  21        2182      2237              rs4327307    rs713399 41529106
##      end.bp
## 1  41465059
## 2  41466342
## 3  41470927
## 4  41472203
## 5  41477331
## 6  41482276
## 7  41493979
## 8  41496279
## 9  41497136
## 10 41498595
## 11 41499204
## 12 41501773
## 13 41510076
## 14 41516865
## 15 41519531
## 16 41519797
## 17 41520129
## 18 41528416
## 19 41528707
## 20 41530956

Here, we can see that the BigLD algorithm has divided the YRI genomic region into 20 distinct linkage blocks, giving us both starting and ending positions for each. Using the row number (1-20) to name them, which linkage block(s) contain our SNPs of interest?

Now, let’s visualize these linkage blocks produced by {gpart}. This code will create a helpful LD heatmap figure in your working directory called heatmap_tmprss2_r2.png. You’ll need to import it to your desktop using CyberDuck or MobaXTerm to open it. Let’s take a look:

nowcex=0.5
LDblockHeatmap(genofile = "TMPRSS2_YRI.vcf", chrN = 21, geneDB = "ensembl", assembly = "GRCh38", geneid = "hgnc_symbol", blocktype="bigld", LD="r2", type="png", filename = "heatmap_tmprss2_r2", CLQmode = "density", res = 600)

## CLQ done!
constructLDblock done!

This image is a LOT more helpful! It allows us to visualize distinct blocks of LD that are statistically partitioned by the package, and also gives us a range of SNP IDs and positions into which we might map SNPs of interest not explicitly listed.

Next, we’ll make the graph displaying the D’ statistic:

DheatmapTMPRSS2 <- LDheatmap(TMPRSS2matrix$genotypes,
                           genetic.distances=positions_TMPRSS2,
                           distances="physical",
                           LDmeasure="r",
                           title="Pairwise LD with D'",
                           add.map=TRUE, add.key=TRUE,
                           geneMapLocation=0.15,
                           SNP.name=c("rs2070788", "rs383510", "rs4816720",
                                      "rs12329760"),
                           color=NULL, newpage=TRUE,
                           name="TMPRSS2 LD Heatmap (D')")

Again, this mode of visualization isn’t particularly helpful… let’s take a look using {gpart}:

library(gpart)

tmprss2_res1 = BigLD(genofile = "TMPRSS2_YRI.vcf", LD="Dprime")

## CLQ done!
constructLDblock done!

tmprss2_res1

##    chr start.index end.index             start.rsID    end.rsID start.bp
## 1   21           8        32               rs462471  rs11910678 41464549
## 2   21          79       104             rs57474639  rs73905371 41466341
## 3   21         105       108             rs73372163  rs73372166 41467065
## 4   21         109       263            rs113562865   rs9305744 41467117
## 5   21         282       284             rs34624090    rs467375 41471514
## 6   21         307       385               rs458213  rs55964536 41472035
## 7   21         386       494              rs2298661   rs8134216 41473715
## 8   21         512       641              rs9636988   rs2276205 41478118
## 9   21         680       689              rs9977234   rs2410429 41483637
## 10  21         706       710             rs56066678   rs2838039 41484552
## 11  21         743       754              rs4818240  rs11702475 41485375
## 12  21         755       759             rs35871560   rs2257202 41485652
## 13  21         769       923              rs2298857  rs11911394 41486054
## 14  21         930       941               rs928871   rs6517669 41490604
## 15  21         952      1003             rs35899679    rs430915 41491393
## 16  21        1112      1121             rs58146697  rs57901138 41495740
## 17  21        1155      1163               rs455045    rs415731 41496961
## 18  21        1165      1199              rs1003030   rs3761373 41497679
## 19  21        1208      1472               rs422761   rs4303795 41499066
## 20  21        1485      1513             rs12481984   rs8127674 41508967
## 21  21        1517      1733              rs8132655  rs11414078 41510388
## 22  21        1740      1742              rs6517672  rs62219272 41516325
## 23  21        1757      1844             rs76043846   rs7282236 41517211
## 24  21        1850      1861 rs555164860;rs34666868   rs6517673 41519883
## 25  21        1894      1895            rs144274178 rs561517039 41521134
## 26  21        1902      2133             rs73357702  rs73359717 41521240
## 27  21        2167      2168               rs884166  rs59484614 41528672
## 28  21        2182      2233              rs4327307   rs4334565 41529106
## 29  21        2237      2238               rs713399  rs76728412 41530956
##      end.bp
## 1  41465059
## 2  41467060
## 3  41467111
## 4  41471061
## 5  41471638
## 6  41473711
## 7  41477331
## 8  41482276
## 9  41484121
## 10 41484617
## 11 41485641
## 12 41485788
## 13 41490348
## 14 41491009
## 15 41492816
## 16 41495915
## 17 41497136
## 18 41498595
## 19 41508558
## 20 41510076
## 21 41516040
## 22 41516404
## 23 41519797
## 24 41520129
## 25 41521135
## 26 41527844
## 27 41528707
## 28 41530852
## 29 41530957

Here, we can see that, using the D’ measure, the BigLD algorithm has divided the YRI genomic region into 29 distinct linkage blocks, giving us both starting and ending positions for each. Using the row number (1-29) to name them, which linkage block(s) contain our SNPs of interest? Are any of our SNPs now in different linkage blocks than before?

Now, let’s visualize these linkage blocks using {gpart}, which will divide them visually into strong and weak LD (this means that blocks showing only weak LD will not be explicitly visualized on the graph). This will create a file called heatmap_tmprss2_Dp.png. Let’s take a look now:

nowcex=0.5
LDblockHeatmap(genofile = "TMPRSS2_YRI.vcf", chrN = 21, geneDB = "ensembl", assembly = "GRCh38", geneid = "hgnc_symbol", blocktype="bigld", LD="Dp-str", type="png", filename = "heatmap_tmprss2_Dp", CLQmode = "density", res = 600)

## CLQ done!
constructLDblock done!

Once again, this is MUCH more helpful!

Now you’ve got several graphs that you can use to investigate LD in your gene region! You can see specifically whether or not the SNPs of interest (the labelled SNPs) are in LD by finding where on the graph their paths intersect, or by looking for their linkage block by position in the LD block analysis output. For the purpose of reporting your pairwise LD results, we can look at something called an LD matrix that was generated by the LDheatmap function. This is a matrix where the row and column names are SNP ID numbers, and the cell at the intersection of a row and a column will tell you the LD statistic for those two particular SNPs. For our purposes, you do not need to look at the whole chart, you just need to find the intersections of our four SNPs of interest to see if any of them are in LD, and what that LD value is for each respective statistical test.

To view your reduced LD matrices, run these two chunks of code separately:

#for R squared LD heatmap: 
R2LD<-as.data.frame(R2heatmapTMPRSS2$LDmatrix)
snp<-rownames(R2LD)

R2LD<-
  R2LD %>%
  mutate(snp = snp) %>%
  dplyr::select(snp,rs2070788,rs383510,rs4816720,rs12329760) %>%
  filter(snp == "rs2070788" |
         snp == "rs383510" |
         snp == "rs4816720" |
         snp == "rs12329760")
R2LD

##          snp rs2070788   rs383510  rs4816720 rs12329760
## 1  rs2070788        NA 0.40666382 0.25668813  0.1196071
## 2 rs12329760        NA 0.08339459 0.03268553         NA
## 3  rs4816720        NA 0.48865306         NA         NA
## 4   rs383510        NA         NA         NA         NA

#for D prime LD heatmap: 
DpLD<-as.data.frame(DheatmapTMPRSS2$LDmatrix)
snp<-rownames(DpLD)

DpLD<-
  DpLD %>%
  mutate(snp = snp) %>%
  dplyr::select(snp,rs2070788,rs383510,rs4816720,rs12329760) %>%
  filter(snp == "rs2070788" |
         snp == "rs383510" |
         snp == "rs4816720" |
         snp == "rs12329760")
DpLD

##          snp rs2070788   rs383510  rs4816720 rs12329760
## 1  rs2070788        NA 0.40666382 0.25668813  0.1196071
## 2 rs12329760        NA 0.08339459 0.03268553         NA
## 3  rs4816720        NA 0.48865306         NA         NA
## 4   rs383510        NA         NA         NA         NA

Does LD appear to correlate with physical distance among these SNPs, according to your LDheatmap images?

Step 4: Using Ensembl to Look for More SNPs in LD

The last thing we’ll do in this lab is look at Ensembl’s information on LD for each of the SNPs we studied today in your populations.
To do this:

Go to Ensembl and search for TMPRSS2 in the Human search bar as we have done before
Use the left sidebar to navigate to the Variant Table:

Pick one of the variants we looked at today (rs2070788, rs383510, rs461194, and rs12329760) and enter it in the Variant Table search bar. Make note of what kind of consequence each SNP has. What does the consequence mean, and how might the consequence of a SNP affect its LD?

Click on the Variant ID of the SNP you searched to get the SNP page. Click on the “Linkage Disequilibrium” tab under “Explore This Variant”

You will get this page. This should look familiar, because we visited it briefly in our introductory lab. You will get a page like this that has LD plots of the SNP for all of the 1000 Genomes populations.

You can play around with the LD calculator and the LD plots, but the first thing you should do is find your population in the table and click on the “show” button for “Variants in High LD”. You will get this:

Some populations will have more SNPs in high LD than others. Look specifically at the SNPs in high LD that are not in TMPRSS2. In the example above, there are SNPs listed in the gene BMX in high linkage with several variants in ACE2.
Repeat this process for all four SNPs we looked at in today’s lab. Where are the high LD SNPs that aren’t in TMPRSS2 located? Are any of them located in other gene regions?

Step 5: What Do Your Results Mean? Discuss with a partner from class:

Think about the results you produced today in the context of your population. Here are some guiding questions to help you:

What does it mean for two SNPs in your population to be in LD? How would two obesity SNPs being in LD reflect selection in that population?
Would you expect a high level of LD to develop in TMPRSS2 in your population based on its pandemic context? Why or why not?
What would LD between two COVID-19-related SNPs tell you about the evolutionary history of your population?
If you found any SNPs in high LD with our TMPRSS2 SNPs during the Ensembl exercise that are in a different gene, do a quick OMIM search on the gene(s). What does that gene code for? How might it be connected to TMPRSS2?

Lab 3 - Linkage Disequilibrium (LD) in TMPRSS2

Becca DeCamp & Christopher Schmitt

October 5, 2020

Introduction to Linkage Disequilibrium (LD)

Homework for Lab 3: DUE Friday, October 23rd

Our Focal SNPs

A Note On D’ and R²

Learning Outcomes

Step 1: Getting to R Studio and Preparing your Data

Step 2: Using the “ld” Function

Step 3: Measuring LD in our SNPs of Interest

Step 4: Using Ensembl to Look for More SNPs in LD

Step 5: What Do Your Results Mean? Discuss with a partner from class:

Lab 3 - Linkage Disequilibrium (LD) in TMPRSS2

Becca DeCamp & Christopher Schmitt

October 5, 2020

Introduction to Linkage Disequilibrium (LD)

Homework for Lab 3: DUE Friday, October 23rd

Our Focal SNPs

A Note On D’ and R2

Learning Outcomes

Step 1: Getting to R Studio and Preparing your Data

Step 2: Using the “ld” Function

Step 3: Measuring LD in our SNPs of Interest

Step 4: Using Ensembl to Look for More SNPs in LD

Step 5: What Do Your Results Mean? Discuss with a partner from class:

A Note On D’ and R²