Course Outline

Course Overview





Insurmountable Coding Problems

Time to (Try to) Replicate Some Science!

Final Homework DUE at 5:00 pm of December 14, 2023.

Replication is meant to be one of the keystones of scientific inquiry. Every experiment, observation, or analysis should be replicable by any other scientist, if we’ve all done our jobs right. The objective of this assignment is to use your skills in R to replicate as closely as you can a set of statistical analyses and results reported in a published paper of your choosing.

You should select and confirm with me a paper and dataset you will replicate by October 05, 2023!

You do not need to replicate ALL of the analyses presented in the paper, but at minimum you must replicate at least 3 analyses, including at least one descriptive statistical analysis and one inferential statistical analysis. As part of this assignment, you must also replicate to the best of your abilities at least one figure.

For this assignment, you should prepare several files to share with me via GitHub in a new repo, shared with me as a collaborator, called “BUlogin-data-replication-assignment”:
  1. A PDF copy of the paper from which you are replicating analyses.
  2. A .csv file (you can make this in Excel) with the original data for that paper.
  3. An .Rmd file where you thoroughly describe and run the code for all of the steps in your replication. I should be able to take the .Rmd file along with the .csv file and knit it to produce a nicely formatted .html report describing what you did and in which I can easily see your results.

You should also embed in your .Rmd file, near your own results, any images or figures from the original paper that you replicate so that I can see them together. These should be included as .png files in a folder called “img” within your repo. You can include code like the following to reference files in your “img” folder for inclusion in your document.

<img src="img/imagename.filetype" width="###px"/>

Where you replace imagename.filetype with the name of your file, e.g., “figure-1.png” and ### with a integer number of pixels, e.g., width=“200px”.

A simpler option with fewer formatting possibilities is to include the picture using this markdown code:


If you include the following chunk at the beginning of your Rmd document with the headline {r setup, include=FALSE} (which lays out for R a particular set of instructions for knitting), it will ensure that all of the figures created by your chunks of code when it knits will be put directly into the “img” folder:

> knitr::opts_chunk$set(echo = TRUE, warning = FALSE, comment = "##", prompt = TRUE,
+     tidy = TRUE, tidy.opts = list(width.cutoff = 75), fig.path = "img/")

Elements of Your Report

You should start your replication report with a short description of the study and of the specific data and replication analyses you will be performing, to orient your reader. Outline (briefly) the goal of the original paper, the data set used, and the analyses conducted, then describe which you will replicate. You should also demonstrate how you read your datafile into R, and show a few lines of raw data in your output (e.g., using head()).

I will be looking for you to clearly take your reader through all of the elements of data manipulation, analysis, and, where appropriate, visualization. You should provide as much coding detail, explanation, and output tables as necessary to compare your results to those published.

Annotate your code well, and good luck!

Example Replication Assignments

For some excellent examples of past assignments, see here and here.

Where to find data (and their associated paper)

Students in the past have had a lot of trouble actually finding a relevant article that has associated open-access data. This makes replication much more difficult, and is increasing frowned upon, but is standard operating procedure in the sciences (especially when data are proprietary). However, there is much debate about and increasing calls for open-access data, both for replication and more general equity purposes. The National Science Foundation, for example, requires that there be a data management plan including the depositing of data generated by NSF-funded research in public repositories.

Rather than searching for a paper you like and finding that the authors have - for various reasons - not made their data public, you can instead search a public data repository relevant to your field for an appropriate dataset along with the associated published paper at this handy website.