Midterm 2 (150 points)

This midterm is open notes, open textbook, open Lab tutorial and will cover genetics review concepts (insofar as they are relevant to our newer work), linkage disequilibrium, genetic drift, heterozygosity, and detecting and measuring selection in the genome.

You may begin the exam when the submission link (below) becomes active at class time on Friday, November 13, and work on the exam until the link closes at midnight of the exam due date, Friday, November 20.

No late responses will be accepted!


Parts of this exam will require downloading, processing, and analyzing data from Ensembl in ways that should, by now, be familiar to you; these may involve interfacing with R/RStudio, tabix, vcftools, and other modules in the SCC. If you’ve forgotten how to use these modules, you should use your online module tutorials as guides (this is an open notes exam, after all).

All SCC-based files associated with exam analyses should be uploaded to your named folder in the general anth333 project space. Doing this will be part of the exam, and involves commands you’ve already learned and used in class.

NOTE: I will not help you directly with exam questions in office hours. However, I will help you with exercises already present on Lab tutorials or previous homework assignments. Please do not ask for help on exam questions. Also, although you are encouraged to do your homework in cooperation with other students, you should be doing your take-home exam alone.


### Part 1: Preparing your Workspace for Midterm 2 (5 points)

    1. Log in to your SCC working directory in the anth333 project space. (1 point)

      1. In the directory anth333, you should already have a shared project space named after your BU login. In that directory, create a new directory named ‘Midterm2’. (2 points)

      1. Now, navigate to your newly named directory so that it’s your current directory (i.e., where all of the files you process in R will be deposited and saved). Conduct all work for the midterm from this directory. At the end of the exam, all your newly created or saved files associated with Midterm 2 must be in this folder. (2 points)

      HINT: If you answered these first questions correctly, your SCC prompt will look like this for all midterm analyses:

      [username@scc1 Midterm2]$

      And your SCC On Demand file pane and console pathways should look like this:

      ~/project/anth333/BUlogin/Midterm2/
      Now, all of the analyses and processing you do will be done within this single named directory, allowing me to grade your individual progress.

### Part 2: Linkage in a New Population (40 points)

    So far in our labs, we’ve been focusing on the human TMPRSS2 gene in our own assigned 1000 Genomes Project populations. To better demonstrate what we’ve learned, let’s take a look at this gene in a novel population.

    Now, I’m already convinced that we all know how to properly download a dataset from Ensembl, so for this exam, please copy the data for the Bengali in Bangladesh population from the SampleVCF folder into your Midterm2 folder. (5 points)

    HINT: We last did something like this in Module 4

    The next few questions will have to do with the concept of linkage.

      1. Which process in the genome is responsible for driving linkage between any two loci towards equilibrium, and when does it happen? (5 points)
        1. Mutation, during fertilization
        1. Drift, during mitosis
        1. Recombination, during meiosis
        1. Selection, during evolution


      1. Imagine you conduct a dihybrid cross between pea plants, which can have either purple or red flowers (with the red flowers being recessive). Which of the following results in your F2 generation of this dihybrid cross indicate linkage (for the given sample size)? (2 points)
        1. 281 : 94 : 94 : 31 (n = 500)
        1. 197 : 66 : 65 : 22 (n = 350)
        1. 215 : 71 : 71 : 24 (n = 381)
        1. 327 : 27 : 27 : 55 (n = 436)


      1. Although it is generally recognized that mitochondrial DNA does not recombine, what evidence suggests that this might not be the case? (2 points)
        1. Recombination in a large sample of human mtDNA seems to vary as a function of the number of bases between two loci.
        1. There appears to be recombination between mtDNA haplotypes in humans with heteroplasmy.
        1. Heteroplasmic mtDNA lineages can be passed on from parent to offspring.
        1. All of the above.


      1. Which of the following loci on the Y chromosome are linked? (2 points)
        1. CDY2A and DAZ3
        1. USP9Y and BPY2C
        1. TSPY4 and TSPY8
        1. All of the above; all loci on the Y chromosome (outside the PAR) are linked because it doesn’t recombine.


      1. What is the measure of genetic distance on a linkage map? (2 points)
        1. Megabases (Mb)
        1. CentiMorgans (cM)
        1. Nei’s Distance (N)
        1. None of the above.


      1. When comparing recombinant to non-recombinant offspring in a population of interest, you calculate a LOD score of 5.2 for two particular loci. In your population, are these two loci in linkage disequilibrium? (2 points)
        1. Yes.
        1. No.
        1. It’s impossible to say from this information.


      1. Imagine you measure the gametic genotypes for two monogenic traits (seen in the parental haplotype as AB/ab) in proportions 0.48, 0.48, 0.02, 0.02 (AB, ab, Ab, aB). Are these two loci experiencing linkage disequilibrium? (2 points)
        1. Yes.
        1. No.
        1. It’s impossible to say from this information.


      1. What is D, the measure of LD, for the problem above? (2 points)
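
As a refresher on the arithmetic behind D, here is a minimal Python sketch. The haplotype frequencies in it are hypothetical and are not the values from the question above, and the function name `linkage_disequilibrium` is made up for illustration. It assumes the standard definitions D = P(AB) - P(A)P(B) and D’ = D / Dmax.

```python
def linkage_disequilibrium(p_AB, p_ab, p_Ab, p_aB):
    """Return (D, D_prime) from the four gametic haplotype frequencies."""
    total = p_AB + p_ab + p_Ab + p_aB
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB

    # D is the excess of the AB haplotype over its expectation under independence.
    D = p_AB - p_A * p_B

    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = D / d_max if d_max > 0 else 0.0
    return D, d_prime


# Hypothetical frequencies (AB, ab, Ab, aB), chosen only for illustration.
D, d_prime = linkage_disequilibrium(0.40, 0.30, 0.10, 0.20)
print(f"D = {D:.2f}, D' = {d_prime:.2f}")
```

When all four haplotypes occur at the product of their allele frequencies, D is zero and the loci are in linkage equilibrium.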

    For the next few questions, we’ll interrogate the dataset we copied for the Bengali in Bangladesh population from the 1000 Genomes Project. Using the methods in Module 3, create an LD heatmap for this population using the D’ statistic.

      1. Upload your D’ heatmap for the Bengali in Bangladesh population (10 points)

      1. Which of the SNPs of interest from the LD module (rs2070788, rs383510, rs4816720, rs12329760) are in linkage disequilibrium (LD > 0.8) in the Bengali in Bangladesh population? (6 points)

### Part 3: Genetic Drift (10 points)

      1. Imagine you’re studying a small population of critically endangered yellow-tailed woolly monkeys (Lagothrix flavicauda). You manage to genotype the 8 remaining individuals in an isolated population in the Andes at SNP rs16534 (C/A), and find it’s in Hardy-Weinberg equilibrium with the frequency of the C allele at 0.35. A few years later, you return and genotype the next generation of 8 individuals. Given this information, what is the probability that this next generation of yellow-tailed woolly monkeys will be fixed for the C allele at rs16534? (4 points)

      1. Which of the following is a potential consequence of this reduced population size in yellow-tailed woolly monkeys? (2 points)
        1. Reduced total number of alleles.
        1. Changes in allele frequency.
        1. Increased linkage disequilibrium.
        1. None of the above.


      1. If the remnant yellow-tailed woolly monkey population you studied had only 2 males and 6 females, what is the effective population size (Ne)? (2 points)

      1. You genotype the second generation of the remnant yellow-tailed woolly monkey population to better understand how many offspring in that generation came from each of the six females. You find a rather large variance in that number. Would that increase or decrease the effective population size? (2 points)
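
The drift calculations in this part can be sketched in a few lines of Python. This is a minimal illustration that assumes a Wright-Fisher model (random mating, non-overlapping generations, binomial sampling of 2N gene copies) and the standard unequal-sex-ratio formula Ne = 4·Nm·Nf / (Nm + Nf). The function names and the numbers are hypothetical and differ from those in the questions above.

```python
from math import comb


def prob_allele_count(two_n, p, k):
    """Binomial probability of drawing k copies of an allele among two_n gene copies."""
    return comb(two_n, k) * p**k * (1 - p) ** (two_n - k)


def prob_fixation_next_generation(n_individuals, p):
    """Probability that all 2N gene copies in the next generation carry the allele."""
    two_n = 2 * n_individuals
    return prob_allele_count(two_n, p, two_n)


def effective_size_unequal_sex_ratio(n_males, n_females):
    """Effective population size when breeding males and females differ in number."""
    return 4 * n_males * n_females / (n_males + n_females)


# Hypothetical example: 5 diploid individuals, allele frequency 0.5.
p_fix = prob_fixation_next_generation(5, 0.5)
# Hypothetical example: 3 breeding males and 9 breeding females.
ne = effective_size_unequal_sex_ratio(3, 9)
print(p_fix, ne)
```

Note that the effective size in the example (9) is smaller than the census size (12), because the rarer sex contributes half of the gene copies in the next generation.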

### Part 4: Mutations, Heterozygosity, and Selection (80 points)

      1. In your yellow-tailed woolly monkey population, given that the SNP rs16534 is fixed for the C allele, and a mutation rate of 0.00025, what would the frequency of the C allele be after 10,000 generations? (2 points)

      1. Under which mutation model, in which each mutation is considered to be novel, does one simply count the number of differences and assume that this represents the total number of mutations that have occurred? (2 points)
        1. Infinite alleles model
        1. Stepwise mutation model
        1. Nei’s genetic distance
        1. None of the above.


      1. Which mutation model is the preferred model for microsatellite loci? (2 points)
        1. Infinite alleles model
        1. Stepwise mutation model
        1. Nei’s genetic distance
        1. None of the above.


      1. In Module 4, we used the Neighbor-Joining method to construct a tree based on an infinite alleles mutation model. Does the neighbor-joining method always give you the evolutionarily ‘correct’ tree? Why or why not? (5 points)

      1. Please take a screenshot or save a PNG or PDF file of the neighbor-joining tree for the Bengali in Bangladesh population (as done in Lab 4) and upload it here. Your file size must not exceed 10 MB. (10 points)

      1. If we align several individual sequences, and see a site that has multiple bases at that locus, what do we call it? (2 points)
        1. Segregating site
        1. Non-conserved site
        1. Polymorphic site
        1. All of the above
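
For the recurrent-mutation question at the start of this part, the underlying recursion can be sketched as follows. This assumes one-way mutation away from the allele with no back mutation, no drift, and no selection, so that the frequency after t generations is p_t = p_0 (1 - μ)^t. The function name and the parameter values are hypothetical and are not those used in the question.

```python
def allele_frequency_after_mutation(p0, mutation_rate, generations):
    """Frequency of an allele after repeated one-way mutation away from it."""
    if not 0.0 <= mutation_rate <= 1.0:
        raise ValueError("mutation rate must be between 0 and 1")
    # Each generation, a fraction (1 - mutation_rate) of the copies is left unmutated.
    return p0 * (1 - mutation_rate) ** generations


# Hypothetical example: a fixed allele (p0 = 1), mu = 1e-5, 1,000 generations.
p_t = allele_frequency_after_mutation(1.0, 1e-5, 1000)
print(round(p_t, 4))
```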


    Imagine that we all, as a class, submitted samples to the Sensory Morphology and Anthropological Genomics Lab for sequencing of our individual TMPRSS2 genes. For a 20-bp sequence, we get the following information regarding segregating sites in that region:


      1. What is the observed heterozygosity (π) of this sample? (2 points)

      1. The estimated heterozygosity (k) of this sample is 0.86. What is Tajima’s D for this region of the genome in our class population? (2 points)
        1. Positive
        1. Zero
        1. Negative


      1. Given the value of Tajima’s D for our class population, what kind of selection might be occurring in our classroom!? (2 points)
        1. No selection
        1. Diversifying selection
        1. Purifying selection
        1. Balancing selection


      1. We also calculated Tajima’s D, in a different and much faster way, in Module 5. Calculate Tajima’s D for the Bengali in Bangladesh population. (5 points)

      1. Is this value for Tajima’s D statistically significant? (2 points)
        1. Yes
        1. No


      1. Given the value of Tajima’s D for the Bengali in Bangladesh, what kind of selection might be occurring around TMPRSS2 in this population? (2 points)
        1. No selection
        1. Diversifying selection
        1. Purifying selection
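
The sign of Tajima’s D comes from comparing two diversity estimates for the same sample. The sketch below is a minimal illustration on a small, made-up alignment. It assumes that the observed value is the mean number of pairwise differences (π) and that the estimated value is Watterson’s estimator based on the number of segregating sites, which may not match the exact quantities used in the modules. It omits the variance term in the denominator of D, which is positive and so does not change the sign.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that contain more than one base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences (pi)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def watterson_theta(seqs):
    """Diversity estimate from the number of segregating sites: S / a1."""
    a1 = sum(1 / i for i in range(1, len(seqs)))
    return segregating_sites(seqs) / a1


# Hypothetical 8-bp alignment of four sequences, for illustration only.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
pi = mean_pairwise_differences(seqs)
theta_w = watterson_theta(seqs)
numerator = pi - theta_w  # same sign as Tajima's D
print(segregating_sites(seqs), pi, round(theta_w, 3), round(numerator, 3))
```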


    When working in South Africa, you collect genomic data on three populations of vervet monkey (Pop1, Pop2, and Pop3). Each individual is genotyped at a particular SNP, the genotypes for which are summarized in the following table:


    Answer the following questions based on these genotype results.

      1. What is the observed heterozygosity (Hobs) and expected heterozygosity (Hexp) of each population? (6 points)

      1. Which population has a positive F-value? (2 points)

      1. What does a positive F-value imply for that population? (2 points)

      1. According to the FST for these populations, what is the proportion of variation at this locus that is due to differences between subpopulations? (2 points)

      1. What kind of sequencing is best for identifying and thus reducing errors in variant/polymorphism discovery with whole-genome sequencing? (2 points)
        1. Sanger sequencing
        1. Bisulfite sequencing
        1. Duplex sequencing
        1. Single-strand sequencing


      1. The EDAR gene is a gene implicated in numerous phenotypes including hair thickness, sweat gland output, tooth morphology, and mammary duct proliferation. What is this phenomenon called, where one gene is implicated in multiple seemingly unrelated phenotypes? (2 points)
        1. Epistasis
        1. Pleiotropy
        1. Epigenetics
        1. Polygenics


      1. The McDonald-Kreitman Test is a test designed to detect taxon-informative markers; we use it to contrast within-taxon and between-taxon synonymous and non-synonymous differences. In the case of Drosophila variation at the alcohol dehydrogenase gene (see the figure on slide 23 of Lecture 12), which of the loci presented, aside from 781, which we discussed in class, are the best species-informative loci (i.e., loci that really differentiate the species from each other)? (6 points)

      1. When using dN/dS, a significantly positive value implies what kind of selection? (2 points)
        1. Positive
        1. Negative
        1. Neutral
        1. Purifying


      1. Which of the following physiological systems appear to have been under significant positive selection since our last common ancestor with chimpanzees, according to dN/dS? (2 points)
        1. Immunity and defense
        1. Digestion
        1. Olfaction
        1. Brain growth


      1. Are hard selective sweeps common in human evolutionary history? (2 points)
        1. Yes
        1. No


      1. Match the convergent mutations for a selected trait in human evolutionary history with the trait they were selected for… (8 points)

      Traits: lactase persistence, lighter skin pigmentation, hypoxia

        G/C-14010 upstream of LCT in Tanzanians
        Extended haplotype homozygosity around EGLN1 in Andean populations
        Missense mutation L-374-F (rs16891982) in the MATP gene in Europeans
        C/G-13907 upstream of LCT in Sudanese
        Missense variant H-615-R (rs1800414) in OCA2 in East Asians
        Extended haplotype homozygosity around EGLN1 in Tibetans


      1. Explain, to the best of your ability, why the positive selective sweep around the PPARA gene, implicated in high altitude adaptations in Tibetans, is probably an adaptation related to cold tolerance rather than hypoxia. (6 points)
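
For the heterozygosity and F-statistic questions earlier in this part, the calculations can be sketched as below. The genotype counts are hypothetical and are not the vervet data. The sketch assumes a biallelic SNP, F = (Hexp - Hobs) / Hexp within a population, and FST = (HT - HS) / HT for equally weighted subpopulations. The function names are made up for illustration.

```python
def heterozygosity_summary(n_AA, n_Aa, n_aa):
    """Return (p, Hobs, Hexp, F) for one population from its genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    h_obs = n_Aa / n                 # observed proportion of heterozygotes
    h_exp = 2 * p * (1 - p)          # Hardy-Weinberg expectation
    f = (h_exp - h_obs) / h_exp if h_exp > 0 else 0.0
    return p, h_obs, h_exp, f


def fst_equal_weights(allele_freqs):
    """FST from subpopulation allele frequencies, weighting each subpopulation equally."""
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical genotype counts (AA, Aa, aa) for a single population.
p, h_obs, h_exp, f = heterozygosity_summary(30, 20, 50)
# Hypothetical allele frequencies in two subpopulations.
fst = fst_equal_weights([0.2, 0.8])
print(round(h_obs, 2), round(h_exp, 2), round(f, 3), round(fst, 2))
```

In the example, the population has fewer heterozygotes than Hardy-Weinberg proportions predict, so F is positive.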

### Part 5: Quantitative Genetics (25 points)

      1. Define heritability. (5 points)

      1. What is the difference between broad sense and narrow sense heritability? (5 points)

      1. The basic equation for modeling a phenotype is much like a regression line, and can be described using the equation p = μ + Σ βi xi + a + e. If we were to use this equation to estimate the variance components of body mass based on a QTL in UCP1 and an experimental dietary covariate, what would each component of that equation represent? Use one term each to match the parameter to what it represents: (10 points)

            Body Mass | Diet Covariate | Change in Mass due to Alleles at QTL | Baseline Body Mass | Change in Mass due to Environmental Effects | Change in Mass due to Diet


        p:
        μ:
        βi:
        xi:
        a:
        e:


      1. If Trait A has a heritability of 1, and Trait B has a heritability of 0.4, which will respond more readily to selection? Using the concept of realized heritability, explain why. (5 points)
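
The relationship behind realized heritability can be sketched with the breeder’s equation, R = h²S, where S is the selection differential and R is the response to selection. The values below are hypothetical and differ from those in the question, and the function names are made up for illustration.

```python
def response_to_selection(h2, selection_differential):
    """Breeder's equation: expected change in the trait mean after one generation."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("heritability must be between 0 and 1")
    return h2 * selection_differential


def realized_heritability(response, selection_differential):
    """Heritability estimated from an observed response to a known selection differential."""
    return response / selection_differential


# Hypothetical example: selected parents exceed the population mean by 5 units.
r_high = response_to_selection(0.8, 5.0)
r_low = response_to_selection(0.2, 5.0)
print(r_high, r_low, realized_heritability(r_low, 5.0))
```
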
## Congratulations! You’ve finished your last midterm!