AN/BI 333/733: Modules

Pre-Module: BU Shared Computing Cluster Tutorial

September 11

Material covered: Introduction to and tutorial on using the BU Shared Computing Cluster, via Linux-based SCC.

Readings: None.

Activities: We will create personal profiles using the SCC interface with the help of a representative from Research Computing. We’ll learn how to connect to SCC and some basic commands that will help us navigate the interface and access analytical software that will be used in the course.

Assignment: In-class worksheet based on activities will be graded.

Learning Outcomes:

Learn how to access your SCC profile and data storage, and become familiar with analytical software and data download and storage commands.
Learn basic command-line tools, vocabulary, and syntax.
Learn how to access analytical software such as R and RStudio, which we will be using to do real-life population genetic analyses.
If we have time, we will begin using these tools to get a head start on Module 1!

Please download the associated materials from Blackboard.

RCS Tutorials: Research Computing Services (RCS) offers many helpful, free tutorials during September that may make a big difference in how well and how quickly you learn this material. I strongly recommend Introduction to BU’s Shared Computing Cluster and Introduction to R, although neither is required for the course. The remaining tutorials are worth attending if you would like to learn more about the systems we’ll be working with:

Thu, Sep 3 10:00am ‐ 12:00pm Introduction to BU’s Shared Computing Cluster (Hands‐on)
Wed, Sep 3 12:15pm ‐ 12:45pm Using (Python, MATLAB, R, SAS, Stata, or ML) on the SCC
Fri, Sep 4 1:00pm ‐ 3:00pm Introduction to Linux (Hands‐on)
Tue, Sep 1 3:30pm ‐ 5:30pm Introduction to R (Hands‐on)
Tue, Sep 8 3:30pm ‐ 5:30pm Introduction to R (Hands‐on)
Thu, Sep 3 3:30pm ‐ 5:30pm Data Wrangling in R (Hands‐on)
Wed, Sep 9 3:30pm ‐ 5:30pm Data Wrangling in R (Hands‐on)
Thu, Sep 10 3:30pm ‐ 5:30pm Graphics Using Base R Packages (Hands‐on)
Fri, Sep 11 3:30pm ‐ 5:30pm Graphics in R: ggplot2 (Hands‐on)
Mon, Sep 14 3:30pm ‐ 5:30pm Programming in R (Hands‐on)

Module 1: Accessing Human Candidate Gene Region Data – ACE2 and TMPRSS2

September 18

Material covered: Introduction to the 1000 Genomes Project dataset, and a tutorial on using Ensembl to access it. For illustrative purposes, we’ll focus on the angiotensin-converting enzyme 2 (ACE2) gene and the transmembrane serine protease 2 (TMPRSS2) gene, each of which codes for a host protein that the coronavirus SARS-CoV-2 relies on to enter cells, leading to the disease known as COVID-19.

Readings:

The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68-74.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics App Note 27: 2156-2158.

Gheblawi M, Wang K, Viveiros A, Nguyen Q, Zhong J-C, Turner AJ, Raizada MK, Grant MB, Oudit GY. 2020. Angiotensin-Converting Enzyme 2: SARS-CoV-2 Receptor and Regulator of the Renin-Angiotensin System. Circulation Research, 126(10):1456-1474

Hoffmann M, et al. 2020. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell, 181(2):271-280.e8

David A, Khanna T, Beykou M, Hanna G, Sternberg MJE. [Preprint]. Structure, function and variants analysis of the androgen-regulated TMPRSS2, a drug target candidate for COVID-19 infection. bioRxiv, accessed 10SEP20, DOI: 10.1101/2020.05.26.116608

Sriram K, Insel P, Loomba R. May 14, 2020. What is the ACE2 receptor, how is it connected to coronavirus and why might it be key to treating COVID-19? The experts explain. The Conversation.

Activities: We’ll learn how to use the Ensembl database to navigate our candidate genes, ACE2 and TMPRSS2, and find more information about them. Each student will be assigned a single 1000 Genomes sub-population that they will look at over the course of the modules, and we will use the Data Slicer within Ensembl to download data for each gene from those populations into our SCC accounts.

Assignment: Students must turn in a homework assignment – with questions related to ACE2 and TMPRSS2 variation in humans and related to the downloaded dataset – the following Friday. Pre-Module Homework Assignment is due today.

Learning Outcomes:

Learn about the basics of bioinformatics and how genetic data is transformed from raw sequencing reads into a VCF file, which is the file type we will be working with.
Learn about the public data on human genomes available via the 1000 Genomes Project, and how to access it via Ensembl.
Learn about the role of ACE2 and TMPRSS2 variation in humans, and how they are implicated in SARS-CoV-2 infection, and their roles in other bodily systems.
Learn how to download specific regions of genomic data – or candidate gene regions – from 1000 Genomes populations in VCF format using the Data Slicer in Ensembl and move them onto the SCC.
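
To make the VCF format mentioned above more concrete, here is a minimal Python sketch of how genotype calls in a VCF data line can be tallied. It is only an illustration and not the course's R workflow: the chromosome, positions, variant IDs, and sample names are invented, and the function name genotype_counts is hypothetical. It assumes biallelic sites with diploid calls.

```python
from collections import Counter

# Invented three-sample excerpt; real Data Slicer output has many more samples and INFO fields.
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2", "S3"]
ROWS = [
    ["21", "100", "rs_example1", "A", "G", "100", "PASS", ".", "GT", "0|0", "0|1", "1|1"],
    ["21", "250", "rs_example2", "C", "T", "100", "PASS", ".", "GT", "0|1", "0|1", "0|0"],
]
VCF_TEXT = "\n".join(["##fileformat=VCFv4.1"] + ["\t".join(row) for row in [HEADER] + ROWS])


def genotype_counts(vcf_text):
    """Return {variant ID: Counter of alternate-allele dosages (0, 1, or 2)}.

    Assumes biallelic sites and diploid genotypes; other calls are skipped.
    """
    counts = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.split("\t")
        variant_id = fields[2]
        tally = Counter()
        for sample in fields[9:]:  # sample columns start after FORMAT
            genotype = sample.split(":")[0]  # GT is the first FORMAT key here
            alleles = genotype.replace("|", "/").split("/")
            if len(alleles) != 2 or "." in alleles:
                continue  # missing or non-diploid call
            tally[sum(int(a) for a in alleles)] += 1
        counts[variant_id] = tally
    return counts


counts = genotype_counts(VCF_TEXT)
for variant_id, tally in counts.items():
    print(variant_id, dict(tally))
```

In this toy input, rs_example1 has one sample at each dosage (0, 1, and 2 copies of the alternate allele), which is the kind of genotype count table later modules start from.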

Homework for Module 1: DUE Friday, September 25th at 5:00 pm

Module 2: ACE2/TMPRSS2 Variants and Hardy-Weinberg Equilibrium

September 25

Material covered: Using R and RStudio via the SCC to run pre-written code that will perform our analyses. Assessing allelic variation in SNPs within and across populations. Testing Hardy-Weinberg equilibrium (HWE) and understanding what it means if violated, which involves knowing the assumptions of the model. Using downloaded candidate region data from 1000 Genomes Project to assess HWE in living human populations using a Chi-Squared test. Using Ensembl to obtain genotype count information in order to use the Wigginton and Cutler method of HWE calculation on selected SNPs.

Readings:

Chen J. The Hardy-Weinberg Principle and Its Applications in Modern Population Genetics. Frontiers in Biology 5(4): 348-53.

Benetti E, et al. 2020. ACE2 gene variants may underlie interindividual variability and susceptibility to COVID-19 in the Italian population. European Journal of Human Genetics

Wigginton JE, Cutler DJ, Abecasis GR. 2005. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet 76: 887-893.

Activities: We will use the R coding language to test HWE in the dataset on ACE2 we downloaded from Ensembl. We will assess whether or not SNPs in this genomic region are in Hardy-Weinberg equilibrium based on a Chi-Squared test in assigned human populations. We will then re-test selected SNPs using the “True HWE” method described in Wigginton and Cutler. We will then discuss what our results mean, in accordance with what we know about those populations, HWE, and the effects of these ACE2 variants on disease expression.

Assignment: Students must turn in a worksheet – with questions related to ACE2 variation in humans and related to the downloaded dataset – in class the following Monday. Module 1 Homework Assignment is due today.

Learning Outcomes:

Learn how to use the SCC and R coding language to observe and understand population differences in ACE2 variation.
Calculate Hardy-Weinberg Equilibrium for all ACE2 SNPs in individual populations using a traditional Chi-Squared test.
Perform a check on all SNPs not in Hardy-Weinberg Equilibrium by calculating “True” Hardy-Weinberg with the built-in Shiny App.
Research the consequence types of these SNPs in order to understand how these SNPs might affect the genome itself, and how they might affect genotype.
Calculate the “true” Hardy-Weinberg Equilibrium using the Shiny App for Lys26Arg, a SNP that may be implicated in less severe SARS-CoV-2 infections, and determine what the Hardy-Weinberg Equilibrium says about this SNP in your population.
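
As a rough illustration of the Chi-Squared approach to HWE described in this module, the following Python sketch compares observed genotype counts with the counts expected under Hardy-Weinberg proportions. It is not the course's pre-written R code; the function name hwe_chi_square and the genotype counts are hypothetical, and it assumes a biallelic SNP with diploid genotypes and one degree of freedom.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions for one biallelic SNP.

    n_aa, n_ab, n_bb are observed counts of the three genotypes.
    Returns (chi-squared statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: a population with far fewer heterozygotes than expected.
chi2, p_value = hwe_chi_square(40, 20, 40)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

With these invented counts the expected genotype counts are 25, 50, and 25, so the statistic is 36 and HWE would be rejected. The Chi-Squared approximation is unreliable when expected counts are small, which is one reason to re-check SNPs with an exact test such as the one in the Wigginton et al. reading.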

Homework for Module 2: DUE Friday, October 2nd at 5:00 pm

Module 3: Linkage Disequilibrium (LD) in ACE2 and TMPRSS2

October 16

Material covered: In this module, we’ll be assessing linkage disequilibrium (LD) in the ACE2 genomic regions of the 1000 Genomes populations using the R coding language. We’ll also work on calculating LD by hand between two known loci in ACE2. All of this will help us work towards understanding factors that increase LD in the human genome.

Readings:

Wooster L, Nicholson CJ, Sigurslid HH, Lino Cardenas CL, Malhotra R. Preprint accessed 31AUG20. Polymorphisms in the ACE2 locus associate with severity of COVID-19 infection. medRxiv doi:10.1101/2020.06.18.20135152

Cheng Z., et al. 2015. Identification of TMPRSS2 as a susceptibility gene for severe 2009 pandemic A(H1N1) influenza and A(H7N9) influenza. The Journal of Infectious Diseases, 212(8):1214–1221.

Slatkin M. 2008. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nat Rev Genet 9: 477-485.

Zeberg H, Pääbo S. 2020. The major genetic risk factor for severe COVID-19 is inherited from Neanderthals. Nature 2998

Claiborne Stephens J, Schneider JA, Tanguay DA, Choi J, Acharya Y, Stanley SE, Jiang R, et al. 2001. Haplotype variation and linkage disequilibrium in 313 human genes. Science 293: 489-93.

Activities: We’ll assess linkage disequilibrium across the ACE2 and TMPRSS2 loci using our datasets downloaded from Ensembl, with a focus on the SNPs that are defined in Wooster et al. (2020) and Cheng et al. (2015). We will then discuss what high linkage disequilibrium in our populations could mean regarding potential for selection having occurred within our populations.

Assignment: Students must turn in homework – with questions related to ACE2 and TMPRSS2 variation in humans and related to the downloaded dataset – online by the following Friday. Module 2 Homework Assignment is due today.

Learning Outcomes:

Learn about the SNPs rs1548474 and rs4240157 for ACE2, and rs383510 and rs2070788 for TMPRSS2, and their roles in viral susceptibility.
Learn how to use the R package snpStats to perform LD analysis in a population, including constructing LD matrices and LD heatmaps.
Learn about the two statistics D’ and R², which are the most commonly used statistics to evaluate LD between SNPs. Learn what each can tell us about a population, and apply the two statistics to our own populations.
Learn how to use Ensembl to look at long-distance LD between SNPs in ACE2 and TMPRSS2, as well as for SNPs in other genes.
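
The by-hand LD calculation between two loci can be sketched as follows. This is a minimal Python illustration using the standard definitions of D, D’, and r², not the snpStats workflow used in class; the function name ld_stats and the frequencies are hypothetical, and it assumes phased haplotype frequencies for two biallelic loci.

```python
def ld_stats(p_ab, p_a, p_b):
    """Pairwise LD between two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    Returns (D, |D'|, r^2).
    """
    d = p_ab - p_a * p_b
    # The largest |D| possible given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Made-up frequencies: both alleles at 50%, found together on 40% of haplotypes.
d, d_prime, r2 = ld_stats(p_ab=0.40, p_a=0.50, p_b=0.50)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

For these invented values D is 0.15, D’ is 0.6, and r² is 0.36, which shows how the two scaled statistics can differ for the same pair of loci.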

Homework for Module 3: DUE Monday, October 26th

Module 4: Introduction to Nearest-Neighbor Joining and Phylogenetics

October 30th

Material covered: In this module, we’ll be using Nearest-Neighbor Joining to see which individuals within our assigned 1000 Genomes populations are most related to each other (at least insofar as ACE2 variation indicates). We’ll also plot these phylogenetic trees to better understand the patterns of molecular variation and amount of diversity in ACE2 within our populations.

Readings:

Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111–120.

Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406–425.

Liu Z, et al. 2020. Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS‐CoV‐2. J Med Virol 92(6): 595-601.

Activities: We’ll learn how to create a phylogenetic tree with simple neighbor-joining methods using the ape package in R. We’ll then learn how to make a tree from multiple populations, which will allow us to compare those different populations’ structures in a qualitative way. We will also use these phylogenetic trees to assess the diversity of ACE2 in each population, and discuss what that means.

Assignment: Students must turn in a worksheet – with questions related to ACE2 variation in humans and related to the downloaded dataset – in class the following Friday.

Learning Outcomes:

Learn how to apply Kimura’s Neutral Theory to our populations to create a matrix of genetic distances between individuals in a population.
Learn how to use the ape package’s Nearest Neighbor Joining algorithm to create a Nearest Neighbor Joining tree, and learn how to use the package phangorn to manipulate phylogenetic trees.
Learn how to interpret a phylogenetic tree, and learn what it can tell us about molecular diversity within our populations.
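
A matrix of genetic distances is built from pairwise distances between sequences. The Python sketch below computes one such pairwise distance, assuming the intended measure is the two-parameter distance from the Kimura (1980) reading, which treats transitions and transversions separately. It is not the ape implementation used in class; the function name k80_distance and the sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a non-ACGT character are ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent for the model.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Made-up 20-site alignment with two transitions (A/G) and one transversion (C/A).
seq_a = "AAAAAAAAAACCCCCCCCCC"
seq_b = "GGAAAAAAAACCCCCCCCCA"
distance = k80_distance(seq_a, seq_b)
print(f"K80 distance = {distance:.4f}")
```

Computing this distance for every pair of individuals gives the distance matrix that a neighbor-joining algorithm takes as input.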

Homework for Module 4: DUE Friday, November 6th

Module 5: Introduction to Neutrality Statistics and Signs of Selection

November 6th

Material covered: This module is an introduction to statistical tests of neutrality that can be used in genomic studies. Tajima’s D, Fu and Li’s D and F, and iHS scores will be covered and discussed. We’ll work towards understanding what each of these tests does to measure selection, and what these statistics can tell us about population structure and history.

Readings:

Garrigan DR, Lewontin R, Wakeley J. 2010. Measuring the sensitivity of single-locus “neutrality tests” using a direct perturbation approach. Mol Biol Evol 27(1): 73-89.

The Severe COVID-19 GWAS Group. 2020. Genomewide Association Study of Severe Covid-19 with Respiratory Failure. New England Journal of Medicine, 383, 1522-1534.

Genetics and COVID-19 Pandemic. 2020. The American Society of Human Genetics.

Sabeti PC, et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419(6909): 832-837.

Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123(3): 585–595.

Activities: We’ll use the packages PopGenome and pegas to calculate Fu and Li’s D and F, as well as Tajima’s D, for the severe COVID-19 susceptibility region (3p21.3) identified by The Severe COVID-19 GWAS Group in our respective 1000 Genomes populations. We’ll also look for positive selective sweeps in each of our 1000 Genomes populations using iHS scores and EHH with the rehh package. Finally, we will take ample time to understand what the Fu and Li’s D and F and Tajima’s D test results tell us about how our populations are evolving, and use the example of iHS to predict whether or not our populations underwent a selective sweep in this susceptibility region.

Assignment: Students must turn in a homework assignment – with questions related to this module, the associated readings, and the downloaded dataset – in class the following Friday. Module 4 Homework Assignment is due today.

Learning Outcomes:

Learn what Fu and Li’s D and F, Tajima’s D, iHS scores, and EHH mean, and how to interpret them.
Learn to use the PopGenome package to calculate neutrality statistics like Fu and Li’s D and F, and Tajima’s D.
Learn to use the pegas package to explore our Tajima’s D statistic further.
Learn about iHS and how to calculate iHS in R, and reflect on what the iHS score means for our populations.
Learn about whether or not selection is happening in our populations based on these statistics, and relate it to what may be happening in the local environment to cause these alleles to be selected for/against.
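
To show what goes into one of these neutrality statistics, here is a minimal Python sketch of Tajima’s D following the formulas in the Tajima (1989) reading. It is not the PopGenome or pegas implementation; the function name tajimas_d and the input values are illustrative assumptions. It takes the number of sampled sequences, the number of segregating sites, and the average number of pairwise differences.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from summary statistics.

    n: number of sampled sequences
    seg_sites: number of segregating sites (S)
    pi: average number of pairwise nucleotide differences
    """
    if n < 2 or seg_sites < 1:
        raise ValueError("need at least two sequences and one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Illustrative values only: 10 sequences, 16 segregating sites, pi of about 3.89.
d_stat = tajimas_d(n=10, seg_sites=16, pi=3.888889)
print(f"Tajima's D = {d_stat:.3f}")
```

With these illustrative inputs the statistic is negative (roughly -1.45), the direction that indicates an excess of rare variants relative to neutral expectations.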

Homework for Module 5: DUE Friday, November 13th

Module 6: A Brief Digression from COVID-19 for Quantitative Genetics

November 20

Material covered: Quantitative genetics and partitioning variance in phenotypes between genetic and environmental signals. To do this, we’ll be working with some captive vervet monkey (Chlorocebus sabaeus) data I collected at Wake Forest College of Medicine’s Vervet Research Colony (VRC). We’ll also learn about the SOLAR work environment, which is a (relatively) easy interface for doing quantitative genetics analysis in the SCC space. Through this, we’ll learn a bit about the quantitative genetics of BMI and body mass.

Readings:

Almasy L, Blangero J. 2010. Variance component methods for analysis of complex phenotypes. Cold Spring Harbor Protocols 2010(5): pdb.top77.

Schmitt CA, Service S, Cantor RM, Jasinska AJ, Jorgensen MJ, Kaplan JR, and Freimer NB. 2018. Obesity and obesogenic growth are both highly heritable and modified by diet in a nonhuman primate model, the African green monkey (Chlorocebus aethiops sabaeus). Int J Obesity 42: 765-774.

Hill WG. 2012. Quantitative genetics in the genomics era. Curr Genomics 13(3): 196-206.

Activities: There will be a brief discussion of quantitative genetics and the vervet monkey (Chlorocebus sabaeus) model as implemented in SOLAR, using the Almasy & Blangero terminology, followed by an orientation to using SOLAR in the SCC environment. We will conduct in-class exercises that will be used to answer questions in the Module 6 homework.

Assignment: Students must turn in a worksheet – with questions related to quantitative genetic variation in vervets related to BMI, body mass, and obesity – in class the following Friday. Module 5 Homework Assignment is due today.

Learning Outcomes:

Learn about basic quantitative genetics terminology, and the process of variance decomposition.
Learn about pedigree-based analyses and how to process a pedigree file.
Learn how to use the program SOLAR (in the SCC framework).
Learn how to transform data to meet quantitative genetics model assumptions.
Learn how to appropriately interpret quantitative genetics outcomes, including narrow-sense heritability, what environmental component means, and how to interpret a household/maternal effect in this context.
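
The idea behind variance decomposition and narrow-sense heritability can be sketched in a few lines of Python. This is a simplified illustration under a purely additive model with no dominance, household, or maternal components, and it is not SOLAR's interface; the function names and variance values are hypothetical.

```python
def narrow_sense_heritability(var_additive, var_environment):
    """h^2: the share of phenotypic variance attributable to additive genetic variance."""
    var_phenotypic = var_additive + var_environment
    return var_additive / var_phenotypic


def expected_covariance(kinship, var_additive):
    """Expected phenotypic covariance between two relatives under the additive model.

    kinship is the kinship coefficient from the pedigree
    (0.25 for parent and offspring, 0.125 for half siblings).
    """
    return 2 * kinship * var_additive


# Made-up variance components for a trait such as body mass.
var_additive, var_environment = 6.0, 4.0
h2 = narrow_sense_heritability(var_additive, var_environment)
cov_parent_offspring = expected_covariance(0.25, var_additive)
cov_half_sibs = expected_covariance(0.125, var_additive)
print(h2, cov_parent_offspring, cov_half_sibs)
```

Pedigree-based methods work in the opposite direction: they use the observed trait covariance among relatives of known kinship to estimate the variance components, and from those the heritability.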

Homework for Module 6: DUE Friday, November 27th

Module 7: Finding a New Locus…

November 02

Material covered: We’ll discuss the process of finding a new locus on which to conduct a population genetics study for your final project! Make sure the locus is related somehow to a trait you’re really interested in, and also preferably a trait that varies among contemporary human populations. A candidate gene for a particular trait would be great (i.e., a gene that’s been noted to perhaps be associated with a trait in a GWAS, but hasn’t really been tested across populations that vary in that trait). A trait that’s been noted to have been under selection previously would be great, but is by no means necessary.

Readings:

Slate J. 03 Nov 2015. Why I’m wary of candidate gene studies.

Patnala R, Clements J, Batra J. 2013. Candidate gene association studies: a comprehensive guide to useful in silico tools. BMC Genet 14:39.

Activities: We will go over a brief tutorial on how to think about finding a new locus of interest to study on your own, and we will have a class discussion on final project topic ideas.

Assignment: Students must choose a gene of interest for their final project by the following Friday, November 13.

Learning Outcomes:

Learn how to use the resources provided to you in this class to find a new genetic locus for study.
Assess the merits/flaws of using a candidate gene approach to understand trait histories in human populations.

Christopher A Schmitt

October 1, 2020
