Homework: Introduction to the International Genome Sample Resource (formerly known as the 1000 Genomes Project)



Readings:



At this point, technical difficulties aside, we’ve got a bit more knowledge regarding how to navigate the SCC and how to gather data of various kinds from Ensembl, so let’s put that knowledge to use with some practice!

This homework assignment is meant both to stretch the skills you built in the past two labs and to prepare you for what's coming in the next lab. If you can't remember how to do something, check your Pre-Lab slides and the Lab 1 module.

To make things easier, I’ve also created an online interface where you can answer the questions.

Question 1 (25 points):

Go to the Ensembl web page for the gene UCP1, and look at the variant table.

  • How many variants code for an inframe deletion within the coding region of UCP1?
  • Sort the variants by base pair location from first to last (i.e., from the lowest bp position to the highest on chromosome 4), and list the first five variants (if there are fewer than five, list them all).
  • What is the average number of base pairs deleted in your list of variants?
  • Are any of these deletion variants (i.e., the minor allele that carries the deletion) present in your 1000 Genomes study population?
  • At each variant site, how many individuals in your 1000 Genomes population carry the deleted (minor) allele?
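The arithmetic behind the last three bullets can be sketched in a few lines of Python. This is only an illustration: the allele strings and genotypes below are made up and are not real UCP1 variants, and the function names are hypothetical. It assumes each deletion is written as a reference allele and a shorter alternate allele, and that genotypes are VCF-style strings in which `1` marks the alternate (deleted) allele.

```python
# Hypothetical (reference, alternate) allele pairs; NOT real UCP1 variants.
example_deletions = [
    ("ATCT", "A"),     # 3 bp deleted
    ("GCTGCTA", "G"),  # 6 bp deleted
    ("TAAG", "T"),     # 3 bp deleted
]


def deleted_length(ref, alt):
    """Number of bases removed when the alternate allele replaces the reference."""
    return len(ref) - len(alt)


def mean_deleted_length(variants):
    """Average number of deleted bases across a list of (ref, alt) pairs."""
    lengths = [deleted_length(ref, alt) for ref, alt in variants]
    return sum(lengths) / len(lengths)


def count_carriers(genotypes):
    """Count individuals with at least one alternate (1) allele.

    Genotypes are VCF-style strings such as "0|0", "0|1", or "1/1".
    """
    return sum(1 for gt in genotypes if "1" in gt.replace("/", "|").split("|"))


print(mean_deleted_length(example_deletions))        # 4.0
print(count_carriers(["0|0", "0|1", "1|1", "0|0"]))  # 2
```

Note that every made-up deletion removes a multiple of three bases, which is what keeps a deletion in frame.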

Question 2 (50 points):

Go back to the variant table, and filter and sort it to find the Stop Gained variant with the highest minor allele frequency (MAF).

  • How many SNPs that cause a stop gain are present in the 1000 Genomes populations?
  • Name them, in order of MAF from highest to lowest.
  • What is the MAF for the SNP with the highest MAF? And what is the minor allele for that SNP?
  • What does a stop gain mutation do, and why might it be selected for or against in a population?
  • Does YOUR 1000 Genomes population have the minor allele?
  • Given the readings for Lab 1 regarding the function of UCP1, what might having the minor allele mean for the phenotype of individuals who have it? In your study population, what would this mean for an individual with the minor allele, given their local environment?
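The filter-and-sort step in this question works the same way outside the browser. Here is a minimal sketch, assuming the variant table has been reduced to rows with a consequence type and a MAF. The variant IDs, frequencies, and alleles are invented placeholders, and the function name is hypothetical. Rows with no reported MAF are dropped, on the assumption that a blank MAF means the variant has no frequency recorded for the 1000 Genomes populations.

```python
# Hypothetical rows mimicking a variant table; IDs and frequencies are made up.
variants = [
    {"id": "rsA", "consequence": "stop gained", "maf": 0.002, "minor_allele": "T"},
    {"id": "rsB", "consequence": "missense variant", "maf": 0.150, "minor_allele": "G"},
    {"id": "rsC", "consequence": "stop gained", "maf": 0.010, "minor_allele": "A"},
    {"id": "rsD", "consequence": "stop gained", "maf": None, "minor_allele": None},
]


def stop_gained_by_maf(rows):
    """Keep stop-gained rows that report a MAF, sorted from highest to lowest MAF."""
    kept = [
        row for row in rows
        if row["consequence"] == "stop gained" and row["maf"] is not None
    ]
    return sorted(kept, key=lambda row: row["maf"], reverse=True)


ranked = stop_gained_by_maf(variants)
for row in ranked:
    print(row["id"], row["maf"], row["minor_allele"])
```

The first row of `ranked` is the stop-gained variant with the highest MAF, and the length of `ranked` is the count asked for in the first bullet.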

Question 3 (25 points):

We should probably get a little more practice using tabix and vcftools to download data into our SCC space.

Please download the UCP1 data (for the whole gene) for a second sub-population you find interesting in the 1000 Genomes dataset. Of all the files generated, you should keep ONLY the final VCF file.
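As a reminder of the overall shape of that workflow, the sketch below only builds the command strings; it does not run anything. The URL, region coordinates, sample-list file, and output prefix are all placeholders, and the function name is hypothetical, so substitute the real values from the Lab 1 module. It assumes the usual two steps: `tabix -h` to pull one region from a remote VCF, then `vcftools --keep ... --recode` to subset it to one population's samples.

```python
import os
import shlex


def build_commands(vcf_url, region, samples_file, prefix):
    """Return the two shell commands plus the files they are expected to create.

    All four arguments are placeholders supplied by the caller.
    """
    region_vcf = f"{prefix}_region.vcf"

    # Step 1: tabix extracts the region (with header, -h) from the remote file.
    tabix_cmd = (
        f"tabix -h {shlex.quote(vcf_url)} {shlex.quote(region)} "
        f"> {shlex.quote(region_vcf)}"
    )

    # Step 2: vcftools keeps only the listed samples and writes a new VCF.
    vcftools_cmd = shlex.join([
        "vcftools", "--vcf", region_vcf,
        "--keep", samples_file,
        "--recode", "--out", prefix,
    ])

    return {
        "commands": [tabix_cmd, vcftools_cmd],
        # vcftools --recode names its output <prefix>.recode.vcf
        "final_vcf": f"{prefix}.recode.vcf",
        # Intermediate files: the region VCF, the vcftools log, and the
        # index (.tbi) that tabix downloads alongside a remote file.
        "leftovers": [
            region_vcf,
            f"{prefix}.log",
            os.path.basename(vcf_url) + ".tbi",
        ],
    }


plan = build_commands(
    "https://example.org/placeholder/chr4.vcf.gz",  # placeholder URL
    "4:100-200",                                    # placeholder coordinates
    "population_samples.txt",                       # placeholder sample list
    "UCP1_POP",                                     # placeholder prefix
)
for command in plan["commands"]:
    print(command)
print("keep:", plan["final_vcf"])
print("delete:", plan["leftovers"])
```

Everything listed under `leftovers` can be deleted once the `final_vcf` file exists and looks correct.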

The file MUST be in your SCC working directory by the time we meet in class on Friday, October 5th!