Introduction to Neutrality Statistics and Signs of Selection


Tajima’s D

This is a fairly commonly used neutrality statistic. You can read the original paper by Tajima (1989), which details the development of his D statistic as a way of testing the neutral mutation hypothesis (Kimura’s theory of neutral molecular evolution). Tajima’s D is a test statistic that compares the average number of pairwise differences among sequences in a population sample with the total number of variant sites in that sample. Most simply put, Tajima’s D indicates whether the pattern of mutations found in a particular region of the genome in a population follows the assumptions of neutrality; in other words, if there has been selection for or against a particular allele in the population, the pattern seen should violate the assumptions of the neutral model.

Tajima’s D is calculated from the relationship between two estimators of the population mutation parameter \(\theta\): one based on the total number of variable loci in a given sequence in a population (\(\theta_W\), which is derived from S, the number of segregating sites), and one based on the pairwise differences in that same sequence between each pair of individuals in the population (\(\theta_T\), or \(\pi\)).
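
Formally (Tajima 1989), D is the difference between these two estimators scaled by its standard deviation, where \(n\) is the number of sequences sampled; under neutrality both estimators have the same expectation, so D should sit near 0:

\[
D = \frac{\hat{\theta}_T - \hat{\theta}_W}{\sqrt{\widehat{\mathrm{Var}}\left(\hat{\theta}_T - \hat{\theta}_W\right)}}, \qquad \hat{\theta}_T = \pi, \qquad \hat{\theta}_W = \frac{S}{\sum_{i=1}^{n-1} 1/i}
\]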

When the segregating sites are mostly rare variants, there are fewer pairwise differences between individuals than the number of segregating sites would predict, and Tajima’s D will be negative; such a skew toward rare variation is consistent with purifying selection acting on this region of the genome. When there are more pairwise differences than the segregating sites would predict (an excess of intermediate-frequency variants), Tajima’s D is positive; in this case it indicates balancing selection.

An excess of low frequency polymorphisms will likewise lead to a negative Tajima’s D. This is a violation of neutrality suggesting that positive selection may have occurred, or perhaps a recent bottleneck or founder event. In this case, we have multiple rare alleles against a background of very few segregating sites. This is often seen when one haplotype takes over a population and begins to accrue random mutations (as happens in a selective sweep or a founder/bottleneck event).

With Tajima’s D, the closer to 0 the D statistic is, the closer the locus being tested is to meeting the assumptions of neutrality.

Today, we’ll calculate Tajima’s D values for our populations in order to determine if any directional selection is happening in the severe COVID-19 association region.

Fu and Li’s D and F

Fu and Li’s D and F are also fairly straightforward tests derived from the same \(\theta\) statistics as Tajima’s D. Like Tajima’s D, these statistics are grounded in Kimura’s neutral theory, again meaning that anything deviating substantially from 0 is a violation of neutrality. What these statistics do, specifically, is make use of the number of singleton or private mutations (i.e., mutations carried by only a single individual within the sample). While the D statistic is based on the difference between the number of singletons in the population and the total number of mutations in the population, the F statistic is based on the difference between the number of singletons and the average number of nucleotide differences between pairs of sequences. This difference in how they’re derived makes it possible for one statistic to be positive while the other is negative in the same population.
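
In outline (following Fu and Li 1993; the variance scaling in each denominator is omitted here for brevity), with \(\eta\) the total number of mutations, \(\eta_s\) the number of singletons, and \(\pi\) the average number of pairwise differences:

\[
D \propto \eta - a_n\,\eta_s, \qquad F \propto \pi - \eta_s, \qquad a_n = \sum_{i=1}^{n-1}\frac{1}{i}
\]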

We can interpret these statistics like so:

  • If the statistics are strongly negative, this shows an excess of singletons in the population. This means there are quite a few people whose genomic variants don’t match each other. Such a situation could be from rapid population growth (meaning everyone is closely related and so there’s been no time for mutation to catch up to the increased numbers) or a selective sweep (see below for a formal definition) further back in the population’s history, meaning that strong directional selection for a certain mutation led to an overrepresentation of that mutation (at the expense of other variants at that SNP).

  • If the statistics are positive, this indicates an excess of old/ancestral variants that have been selected for in the past. In other words, there are very few unique variants, and the variants that are present are carried by many individuals.

We will calculate Fu and Li’s D and F statistics for our data today.

Integrated Haplotype Score (iHS)

Recent positive selection usually happens by selective sweeps (i.e., the rapid spread of a particular haplotype, driven by selection for particular alleles within that haplotype), which act not only upon the SNP that corresponds to the phenotype under selection, but upon the entire haplotype or linkage region in which that SNP is found. When this happens, the alleles in these selected haplotypes become almost entirely homozygous, because selection on this region of the genome is so strong. An iHS score, or Integrated Haplotype Score, is a test that measures the amount of recent positive selection experienced by an allele indirectly, by looking for evidence of selective sweeps; we can also directly measure extended haplotype homozygosity (EHH) in the population to assess hard selective sweeps. These tests identify extended haplotypes (large sets of alleles on a single region of the chromosome that are identical across individuals) and look at how many of those haplotypes are homozygous within the population. Mass homozygosity across individuals for a particular haplotype is recorded as a high iHS score, while low levels of haplotype homozygosity are recorded as a low iHS score. We can (and will) also visualize EHH directly.
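
Concretely (following Voight et al. 2006), the unstandardized score is the log-ratio of the EHH curves integrated around the ancestral allele (\(iHH_A\)) and the derived allele (\(iHH_D\)), standardized within bins of derived allele frequency \(p\) so that scores are comparable across SNPs:

\[
\mathrm{iHS} = \frac{\ln\left(iHH_A/iHH_D\right) - E_p\left[\ln\left(iHH_A/iHH_D\right)\right]}{\mathrm{SD}_p\left[\ln\left(iHH_A/iHH_D\right)\right]}
\]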

Learning Outcomes

Part 1: Getting Started

Log in to the SCC and bring up the R Studio window as usual.

In Terminal, using the Tabix method from Module 1, download only the severe COVID-19 associated region at 3p21.31, identified by The Severe COVID-19 GWAS Group in your readings for this week, for your 1000 Genomes population. This genomic region spans positions 3:45800446-46135604, and includes the indicator SNP rs11385942 and the genes SLC6A20, LZTFL1, FYCO1, CXCR6, XCR1, and CCR9. I recommend naming the file COVIDRISK_YRI.vcf, with YRI replaced by your own population’s ID. The other severe COVID-19 associated region includes ABO and spans positions 9:133257521-133279871; we won’t work with that region today.
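
In case you need a refresher, the commands might look something like this (a sketch, not the only way to do it: it mirrors the tabix call we’ll use for the expanded region in Part 4, assumes you still have your YRI.samples.list file from earlier modules, and uses COVIDRISK_tmp.vcf as a throwaway intermediate name):

module load htslib
module load vcftools
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr3_GRCh38.genotypes.20170504.vcf.gz 3:45800446-46135604 > COVIDRISK_tmp.vcf
#Subset to your population, then delete the all-individuals intermediate
vcftools --vcf COVIDRISK_tmp.vcf --keep YRI.samples.list --recode --stdout > COVIDRISK_YRI.vcf
rm COVIDRISK_tmp.vcf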

Next, we need to install the package PopGenome in our R Studio space, because the SCC doesn’t have it pre-installed.

#We need to install this package
install.packages("PopGenome")

Next, load in the packages we need to use:

library(vcfR)
library(PopGenome)
library(pegas)

Finally, we need to format our data for analysis. PopGenome is a bit funky about how it looks for files in your directory, so we need to move some files around in our folders to make the GENOME object that the PopGenome package will work with. Specifically, if you haven’t already done so, we need to make a new directory within our working directory and put a copy of our data into that folder. Do this in your SCC space using the mkdir and cp commands (remember to use your population name rather than YRI!):

mkdir COVIDRISK_YRI
cp COVIDRISK_YRI.vcf COVIDRISK_YRI/COVIDRISK_YRI.vcf

Now, we can make our GENOME object by directing the function to the folder we just created. Additionally, we’ll make a DNAbin object (as we have done before) to save for later…
Here we’ll make our GENOME object:

YRIgenome <- readData("COVIDRISK_YRI", format = "VCF")
## |            :            |            :            | 100 %
## |====================================================
YRIgenome
## -----
## Modules:
## -----
##                 Calculation             Description         Get.the.Result
## 1                  readData            Reading data           get.sum.data
## 2          neutrality.stats        Neutrality tests         get.neutrality
## 3             linkage.stats  Linkage disequilibrium            get.linkage
## 4              recomb.stats           Recombination             get.recomb
## 5                F_ST.stats          Fixation index get.F_ST,get.diversity
## 6           diversity.stats             Diversities          get.diversity
## 7              sweeps.stats        Selective sweeps             get.sweeps
## 8                       MKT  McDonald-Kreitman test                get.MKT
## 9              detail.stats        Mixed statistics             get.detail
## 10                       MS   Coalescent simulation                      @
## 11           --------------             -----------          -------------
## 12          set.populations Defines the populations                       
## 13 sliding.window.transform          Sliding window                       
## 14           splitting.data         Splits the data                       
## 15               show.slots        ?provided slots?                       
## 16               get.status  Status of calculations

And here we’ll make our DNAbin object:

YRI <- read.vcfR("COVIDRISK_YRI/COVIDRISK_YRI.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 8956
##   column count: 117
## 
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 8956
##   Character matrix gt cols: 117
##   skip: 0
##   nrows: 8956
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant: 8956
## All variants processed
YRIdna <- vcfR2DNAbin(YRI)
## After extracting indels, 8658 variants remain.
## 
Variant 0 processed
Variant 1000 processed
Variant 2000 processed
Variant 3000 processed
Variant 4000 processed
Variant 5000 processed
Variant 6000 processed
Variant 7000 processed
Variant 8000 processed
Variant 8658 processed
YRIdna
## 216 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 8658 
## 
## Labels:
## NA18486_0
## NA18486_1
## NA18488_0
## NA18488_1
## NA18489_0
## NA18489_1
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.186 0.323 0.290 0.200 
## (Total: 1.87 Mb)

Part 2: Fu and Li’s D and F

The function neutrality.stats from PopGenome conveniently calculates several different neutrality statistics for a given population. We’ll use it now to calculate Fu and Li’s D and F:

#Remember to replace the variable name with your own GENOME object!
neut <- neutrality.stats(YRIgenome)
## |            :            |            :            | 100 %
## |====================================================
get.neutrality(neut)
##       neutrality stats
## pop 1 Numeric,9
#To see results: 

neut@Fu.Li.F
##          pop 1
## [1,] -1.068129
neut@Fu.Li.D
##          pop 1
## [1,] -1.228903

As we can see here, the YRI population has D and F statistics that are negative. This suggests an excess of singleton mutations in this risk region in this population, consistent with either rapid population growth or a past selective sweep. Put simply, the severe COVID-19 risk region in the Yoruba population from Ibadan may have experienced an historical positive selective sweep.

Before we move on, think about what the results for your population mean, based on how we interpret Fu and Li’s D and F. Is there an excess of singletons? Is there an excess of old/ancestral mutations? What does that tell you about your population’s history? We will come back to this at the end of class.

Part 3: Tajima’s D

Now we’ll look at Tajima’s D! When we ran the neutrality.stats function in PopGenome, it also calculated our Tajima’s D statistic. We can preview our Tajima’s D results from the PopGenome output…

neut@Tajima.D
##           pop 1
## [1,] -0.6111695

But, we can get more useful information by using the tajima.test function in the pegas package. This function will use the DNAbin object that we created earlier as an input. It will not only give us the same D statistic as was calculated by the neutrality.stats function, which is all well and good, but will also give us a p-value associated with our test. This is important because it gives us a sense of how significant the results of our neutrality test are.

tajima <- tajima.test(YRIdna)
tajima
## $D
## [1] -0.6102318
## 
## $Pval.normal
## [1] 0.5417083
## 
## $Pval.beta
## [1] 0.58122

What we see from YRI for Tajima’s D is that this population does not appear to be subject to selection in the severe COVID-19 risk region: the D statistic is only slightly negative, and both p-values are well above 0.05. This suggests that the only evolution (remember, that’s a change in allele frequencies over generations, and does not have to be due to selection) happening in this region of the genome in this population is selectively neutral.

Interpret your own population’s Tajima’s D value. First, compare the D value given by this function and the one given by the neutrality.stats function. Are they the same? What is the D statistic for your population, and what does it mean in terms of whether and what kind of selection might be occurring on your population? We will revisit this question at the end of the lab.

Part 4: Positive selective sweeps using iHS and EHH

Now, we will run our iHS and EHH analyses to assess recent positive selective sweeps in an expanded area surrounding the severe COVID-19 risk region on chromosome 3 (NOTE: you MUST download a new VCF file to run this analysis). To run this example, I’m using the rehh package, which is specifically designed to look for Extended Haplotype Homozygosity, or evidence of hard positive selective sweeps. To do this, we’ll have to do some data transformations. This is relatively simple to do, but it involves using some unfamiliar python and unix code that may be confusing. It also can take a lot of memory, so I’ll ask you to delete files along the way as you try this out!

To do this, we’ll first need to load our modules. This will include tabix (part of htslib) to download the data, and vcftools and bcftools to modify it. We’ll also need to load the programming language python for a few short commands later:

module load htslib
module load vcftools
module load bcftools
module load python3

Now, to save on space, we’ll download our extended positions (roughly 1 Mb above and below our region of interest) from chromosome 3 into a zipped file using a familiar tabix command:

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr3_GRCh38.genotypes.20170504.vcf.gz 3:44800446-47135604 | bgzip -c > COVIDRISK_all.vcf.gz

Then, we’ll subset the zipped file for just the YRI data with a familiar vcftools command. While we’re at it, we’ll also filter out all indels and SNPs with a MAF lower than 5% while keeping the ancestral allele information:

vcftools --gzvcf COVIDRISK_all.vcf.gz --keep YRI.samples.list --maf 0.05 --remove-indels --recode-INFO AA --recode --stdout | bgzip -c > COVIDRISK_YRI.vcf.gz
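
If you’d like a quick sanity check that the subsetting worked, you can count the samples left in the new file (bcftools is already loaded; the count should match the number of individuals in your population’s sample list):

bcftools query -l COVIDRISK_YRI.vcf.gz | wc -l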

Now we’re ready to convert this into a haplotype file and a genome map file for {rehh} using a few relatively unfamiliar steps… these are modified from the Informatics Portal website.

First, we’ll extract our INFO on ancestral alleles that are embedded in the VCF already:

vcftools --gzvcf COVIDRISK_YRI.vcf.gz --get-INFO AA

Next, we’ll change lower- to upper-case letters in our INFO file:

cat out.INFO | tr '[:lower:]' '[:upper:]' > out2.INFO

Assuming your data look good, let’s now erase the large file with all the 1000 Genomes Project individuals, since we’re done with it:

rm COVIDRISK_all.vcf.gz

I have to note that this step is - conceptually - a little complicated because the way the 1000 Genomes Project estimates the ancestral allele at a given position is a bit different from how folks usually think about an ancestral allele (i.e., the allele state of the LCA of the groups you’re testing). You don’t have to worry about it, for now, but you can learn more about it here if you want.

Export the VCF file in IMPUTE format, which is what groups alleles into haplotypes:

vcftools --gzvcf COVIDRISK_YRI.vcf.gz --IMPUTE

Next, clean up the INFO file: strip the “|” characters from the ancestral allele column, flag each site with “yes” or “no” according to whether the reference allele matches the ancestral allele, drop the header row, and replace tabs with spaces:

cat out2.INFO | awk '{gsub(/\|/,"",$5)}1' | awk 'NR>1{if($3==$5) print$0,"yes";if($3!=$5) print$0,"no";}' | sed -e 's/\t/ /g' > out3.INFO

Now, there’s a problem here in that IMPUTE is deleting a few loci from the sample. We need to use the ‘join’ command to merge the properly matched columns based on position (column 2 in out3.INFO and column 2 in out.impute.legend), which will automatically then delete non-matches in the output file (both files need to be sorted by the position name for this to work):

join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8 out3.INFO <(sort -k2 out.impute.legend) > out4.INFO

Merge the most recent INFO file with our IMPUTE hap file, then eliminate from the merged file those markers with no ancestral allele info (“.”, “N”, or “-”):

paste out4.INFO out.impute.hap | awk '($5!="." && $5!="N" && $5!="-")' > merged.out

Change the allele codes (0 to 3 and 1 to 4) starting in column 7, and remove potential duplicates:

awk '{for(i=7;i<=NF;i++)if($i==0)$i=3; else $i=4; print}' merged.out | awk '{if(ip[$2]=="") print; ip [$2]=1}' > merged2.out

Prepare {rehh} map file using information from the first 2 columns:

awk '{print $1"-"$2,$1,$2,"1","2"}' merged2.out > mapCOVIDRISK_YRI.out

#change label 3 to 1 when reference allele = ancestral allele (yes)
#change label 4 to 2 when reference allele = ancestral allele (yes)
#change label 3 to 2 when reference allele = derived allele (no)
#change label 4 to 1 when reference allele = derived allele (no)
#eliminate 6 initial fields in hap file to prepare rehh input hap file

sed -e '/yes/ s/3/1/g' -e '/yes/ s/4/2/g' -e '/no/ s/3/2/g' -e '/no/ s/4/1/g' merged2.out | cut -d " " -f7-  > merged3.out

Transpose to create final haplotype file for {rehh} input:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < merged3.out > merged4.out

Add first column with numbers in increasing order:

awk '{print NR " "$0}' merged4.out > hapCOVIDRISK_YRI.out

And finally, remove the intermediate files you won’t use anymore:

rm merged.out
rm merged2.out
rm merged3.out
rm merged4.out
rm out.INFO
rm out.impute.hap
rm out.impute.hap.indv
rm out.log
rm out2.INFO
rm out3.INFO
rm out4.INFO

After this whole process, the files mapCOVIDRISK_YRI.out and hapCOVIDRISK_YRI.out are the input files for our {rehh} analysis!
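
Before leaving the terminal, a quick sanity check can’t hurt: the map file should have one line per marker, and the hap file one line per haplotype (i.e., twice the number of individuals in your sample):

wc -l mapCOVIDRISK_YRI.out hapCOVIDRISK_YRI.out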

Ok… now let’s open Rstudio and try it!

Once you’re in R, load the {rehh} library (if it throws an error, you may need to install {rehh}):

library(rehh)

Then, we need to import our newly made files (shown here for YRI; use your own population’s file names):

hap<-data2haplohh(hap_file="hapCOVIDRISK_YRI.out",map_file="mapCOVIDRISK_YRI.out",chr.name="3")
## * Reading input file(s) *
## Map info: 6484 markers declared for chromosome 3 .
## Haplotype input file in standard format assumed.
## Assume that alleles in the haplotype file are coded as:
## 0: missing value
## 1: ancestral allele
## 2, 3, ...: derived allele(s).
## * Filtering data *
## Discard markers genotyped on less than 100 % of haplotypes.
## No marker discarded.
## Data consists of 216 haplotypes and 6484 markers.
## Number of mono-, bi-, multi-allelic markers:
## 1 2 
## 0 6484

Once those are read in, you should see that our dataset consists of 216 haplotypes and 6484 markers.

Now we can run our extended haplotype homozygosity assessment (this step might take a while, be patient):

ehh<-scan_hh(hap,limhaplo=2,limehh=0.05,limehhs=0.05,maxgap=NA,threads=1)

Retrieve our iHS scores:

ihs <- ihh2ihs(ehh,freqbin=0.025)
## Discard focal markers with Minor Allele Frequency equal to or below 0.05 .
## No marker discarded.
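
If you’d like to peek at the per-marker results before scanning for candidate regions, the table lives in ihs$ihs (the same object we index below), with one row per focal SNP:

#Optional: the per-marker table holds CHR, POSITION, IHS, and LOGPVALUE
head(ihs$ihs)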

We can then look for candidate regions potentially under selection using the iHS scores. Here we’re considering a piHS (the -log10 p-value of the iHS score, plotted on the y-axis below) greater than 2 as a significant score:

cr.se<-calc_candidate_regions(ihs,threshold=2,pval=TRUE,window_size=3000,overlap=300,min_n_extr_mrk=1)
cr.se
##    CHR    START      END N_MRK  MEAN_MRK  MAX_MRK N_EXTR_MRK PERC_EXTR_MRK
## 1    3 45085800 45091500    25 0.5684447 2.429099          1          4.00
## 2    3 45143700 45149400    33 0.7090367 2.095386          1          3.03
## 3    3 45176700 45195600   106 0.7311766 2.305481          6          5.66
## 4    3 45264300 45270000    15 0.4778430 2.156674          1          6.67
## 5    3 45545400 45551100    12 0.4433625 2.055738          1          8.33
## 6    3 45554400 45560100    28 0.6237262 3.200979          1          3.57
## 7    3 45565500 45575400    54 0.6312894 2.396599          4          7.41
## 8    3 46050000 46057500    49 0.7490817 2.129540          3          6.12
## 9    3 46067700 46076700    30 0.9661897 2.147707          2          6.67
## 10   3 46087800 46093500    26 0.5974498 2.275285          1          3.85
## 11   3 46137900 46143600    18 0.8554069 2.332035          1          5.56
## 12   3 46147200 46155600    32 1.2302326 3.242357          5         15.62
## 13   3 46158900 46164600    20 1.1639525 2.474646          1          5.00
## 14   3 46459800 46465500    21 0.6856944 2.182452          1          4.76
## 15   3 46664400 46670100    28 1.1548234 2.036570          1          3.57
##    MEAN_EXTR_MRK
## 1       2.429099
## 2       2.095386
## 3       2.122546
## 4       2.156674
## 5       2.055738
## 6       3.200979
## 7       2.281551
## 8       2.071186
## 9       2.145174
## 10      2.275285
## 11      2.332035
## 12      2.519140
## 13      2.474646
## 14      2.182452
## 15      2.036570

Looks like, in the YRI population, we have perhaps 15 candidate regions that are potentially under selection in this expanded genomic window.

Let’s plot them:

manhattanplot(ihs,pval=TRUE,threshold=2,pch=16, chr.name="3",cr=cr.se,main="iHS")

Now, remember… in this figure we’ve added roughly 1 Mb above and below, so the only spots of selection that fall within 3:45800446-46135604 itself are candidate regions 8-10; candidates 5-7 and 11-13 sit immediately up- and downstream of the region of interest.
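
To make that concrete, we can subset the candidate-region table to just the windows that overlap the risk region (a quick base-R line using the cr.se object from above):

#Keep only candidate regions overlapping 3:45800446-46135604
cr.se[cr.se$END >= 45800446 & cr.se$START <= 46135604, ]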

Let’s actually check the SNPs making up these significant regions:

library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
pihs<-as_tibble(ihs$ihs[,c(2,4)])

sig.pihs<-
  pihs %>%
  filter(LOGPVALUE >= 2)
sig.pihs
## # A tibble: 30 x 2
##    POSITION LOGPVALUE
##       <dbl>     <dbl>
##  1 45088669      2.43
##  2 45146425      2.10
##  3 45179630      2.31
##  4 45184378      2.17
##  5 45185304      2.08
##  6 45186590      2.06
##  7 45190085      2.09
##  8 45192716      2.03
##  9 45267163      2.16
## 10 45548210      2.06
## # … with 20 more rows
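
To focus on the significant markers that fall inside the risk region itself, we can filter on position (reusing the sig.pihs tibble we just made):

sig.pihs %>%
  filter(POSITION >= 45800446 & POSITION <= 46135604)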

Now we can take a closer look at the EHH of an actual SNP of interest.

We can’t actually investigate whether the derived version of the index SNP identified for this region, rs11385942, is under selection, because it’s an indel. However, we can look at some of these significant SNPs within the region of interest. Let’s take a look at 3:46053334, which has one of the higher piHS values in the severe COVID-19 risk region (piHS = 2.129540).

ehh_46053334<-calc_ehh(hap,mrk="3-46053334")
plot(ehh_46053334,
     main="EHH at 3:46053334",
     col=c("blue2","gold2"))

We can see here that EHH (on the y-axis) for the haplotypes surrounding the ancestral allele is consistently higher than the EHH surrounding the derived allele, with that EHH getting higher as we approach the SNP along the genome. This suggests that recombination has not yet broken up the linkage block/haplotype surrounding the ancestral allele. In other words, it would appear that the SNP is in a large ancestral haplotype with extensive haplotype homozygosity, suggesting that the ancestral SNP may have experienced a positive selective sweep.

Let’s actually visualize the haplotype structure around our SNP of interest to give us a better idea of the observed effect of selection. The original use of this kind of graph - called a bifurcation graph - is in a paper by Sabeti et al. (2002), in our readings for this module. This will look like a bifurcating tree, with the thicker branches representing the relative number of individuals maintaining that particular haplotype (thicker = more people in the sample have that haplotype) surrounding the core allele (which is at the vertical dashed line). The nodes of the tree are where the haplotype/linkage is broken by a variant outside the original haplotype. Let’s check it out:

furcation_46053334<-calc_furcation(hap,mrk="3-46053334")
plot(furcation_46053334,
     lwd=0.1,
     col=c("blue2","gold2"),
     cex.lab=0.3,
     main="Bifurcation 3:46053334",
     legend.xy.coords="none")

Here, we can see that there are a lot of individuals in the YRI population that have the derived allele, but that there’s some extensive haplotype homozygosity around the ancestral allele. We can see this because the thicker the branch, the more individuals share haplotypes along that branch. So when we see a very thick branch extend many Mb along the genome, this suggests a selective sweep. In this case, it perhaps looks like a relatively limited selective sweep, given that not everyone has this haplotype. For a very clear selective sweep, we would see almost the entire population showing the extended haplotype (e.g., a very large, thick line extending across a large portion of the genomic region).

Now, our most significant iHS hit in this region is 3:46150577, which is just outside the severe COVID-19 GWAS linkage area. Let’s take a look:

ehh_46150577<-calc_ehh(hap,mrk="3-46150577")
plot(ehh_46150577,
     main="EHH at 3:46150577",
     col=c("blue2","gold2"))

Ok, this looks like another case where the ancestral allele appears to have more extensive haplotype homozygosity than the derived allele (e.g., the blue line is higher than the gold line across most of the region). Let’s look at the bifurcation diagram:

furcation_46150577<-calc_furcation(hap,mrk="3-46150577")
plot(furcation_46150577,
     lwd=0.1,
     col=c("blue2","gold2"),
     cex.lab=0.3,
     main="Bifurcation 3:46150577",
     legend.xy.coords="none")

That’s more like it! Although linkage isn’t preserved well in all individuals, particularly on the left (as we follow the genome towards the centromere), the majority of individuals have preserved the haplotype associated with the ancestral allele. For a subset of the population, this haplotype is very well preserved for about 0.4 Mb downstream of the SNP. The previous SNP looked more like a clear case of positive selection around the SNP itself based on iHS, but this one shows clearer preservation of the extended haplotype via selection.

Try this with your population. Are there any SNPs that look like clear positive selective sweeps in the severe COVID-19 GWAS region?

Part 5: What Do Your Results Mean? Discuss with a partner from class:

As we wrap up this lab, think about the results you’ve generated. You have already been prompted to think about what your results might mean as we’ve gone along; now you have the opportunity to talk to a partner about what you think your results might mean.

Extra Credit: Sliding window Tajima’s D!

A perhaps more helpful method for assessing Tajima’s D can be done using {vcftools} in the SCC/Terminal workspace. This version calculates Tajima’s D in small sliding windows (here, I’ve chosen 500 bp windows) along the gene region of interest, allowing us to better pinpoint exactly where in the region selection may be occurring:

module load vcftools
vcftools --vcf COVIDRISK_YRI.vcf --TajimaD 500 # We chose 500 as the window size
mv out.Tajima.D COVIDRISK_YRI.Tajima.D #rename the output file 
mv out.log COVIDRISK_YRITD.log #rename log file 
less COVIDRISK_YRI.Tajima.D 

With this sliding window, we can see that certain sections of our severe COVID-19 risk region have very different values for Tajima’s D. A good rule of thumb is that a Tajima’s D value above 2 or below -2 is significant. For example, the 500 bp window starting at position 45823000 has 4 SNPs in it and is negative but not quite significant (D = -1.56823).
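
You can pull that window straight out of the output file to check it yourself (matching on the BIN_START column):

awk '$2==45823000' COVIDRISK_YRI.Tajima.D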

Let’s take a closer look by bringing the output from this analysis into R so that we can filter it according to this significance threshold:

slide.covid<-read.table("COVIDRISK_YRI.Tajima.D",header=TRUE)
head(slide.covid)
##   CHROM BIN_START N_SNPS   TajimaD
## 1     3  45800000      1 -0.323452
## 2     3  45800500      1 -0.944036
## 3     3  45801000      0       NaN
## 4     3  45801500      1 -0.944036
## 5     3  45802000      2 -1.128430
## 6     3  45802500      3 -1.163710

Now we can filter it by our significance threshold using the {tidyverse} package:

library(tidyverse)
slide.covid.sig<-
  slide.covid %>%
  filter(TajimaD >= 2 | TajimaD <= -2)
slide.covid.sig
##    CHROM BIN_START N_SNPS TajimaD
## 1      3  45861000      2 2.62004
## 2      3  45973500      3 2.02822
## 3      3  46028500      4 2.17669
## 4      3  46031000      7 2.65188
## 5      3  46041500      3 2.43019
## 6      3  46042500      3 2.43019
## 7      3  46044500      4 2.55641
## 8      3  46057500     10 2.70205
## 9      3  46101500      5 3.15199
## 10     3  46105500      6 2.03589
## 11     3  46113500      3 2.44955
## 12     3  46131500      5 2.80298

Here, we can see that we have 12 500-bp windows in the severe COVID-19 risk region that have significant Tajima’s D values for the Yoruba population. All of these values are positive, suggesting balancing selection on at least one of the SNPs in each of these windows. The most significant window is 3:46101500-46102000 (row 9 in our table), and there are 5 SNPs in that window that could be the potential focus of selection. Going back to Ensembl, it looks like this 500 bp window is in an intergenic region (between genes), so if it is experiencing balancing selection, the target is most likely an as-yet-unannotated regulatory region for a gene (although, of course, it may also simply be in high LD with a functional region elsewhere). I found this functional information by searching for the 500 bp window, 3:46101500-46102000, in the human reference genome on Ensembl and looking at the Variant Table. Remember, if your population had no variants at certain SNPs, then those SNPs were dropped from your VCF file; this means there may be more SNPs listed on Ensembl than are actually in your sliding window for your population.

The remaining windows listed in our table include intronic variants in the gene LZTFL1 (row 1), intronic variants in the gene FYCO1 (row 2), known regulatory region variants (part of a CTCF binding domain) and transcription factor binding sites (rs535236997, rs553123352, and rs374572137) for the gene NFATC1 (this plays a role in cytokine expression in T-cells) (row 10), an intergenic region containing a variant (rs13063635) known to significantly decrease the eosinophil percentage among granulocytes (these are white blood cells important to modulating inflammatory processes) (row 12), and more functionally unknown intergenic variants (rows 3, 4, 5, 6, 7, 8, 9, 11).

These are some very interesting phenotypes, with potential importance to COVID-19 response (severe COVID-19 has, after all, been characterized as a ‘cytokine storm’, and in part a disease of inflammation)! Because of this, it might be interesting to see if there could be any evidence of haplotype homozygosity around the loci found within these windows to bolster support for selection in these regions.

Let’s check out rs6772051 (3:46101567), which is an intergenic variant in our window (row 9; 3:46101500-46102000) that shows the highest value for Tajima’s D:

ehh_46101567<-calc_ehh(hap,mrk="3-46101567")
plot(ehh_46101567,
     main="EHH at 3:46101567",
     col=c("blue2","gold2"))

Definitely looks like the ancestral allele has some more extensive haplotype homozygosity than the derived allele (e.g., the blue line is higher than the gold line across most of the region). Let’s look at the bifurcation diagram:

furcation_46101567<-calc_furcation(hap,mrk="3-46101567")
plot(furcation_46101567,
     lwd=0.1,
     col=c("blue2","gold2"),
     cex.lab=0.3,
     main="Bifurcation Diagram\nrs6772051 in YRI",
     legend.xy.coords="none")

And yes, we can see some very clear extended haplotype homozygosity for the haplotype containing the ancestral allele, particularly downstream of the variant!

What regions show significantly positive or negative Tajima’s D in your population? Are they the same as in the YRI population? Are they implicated in similar traits or genomic functions?