Fall 2023, Seminar will be held on Thursdays, 3:30 - 6:15pm in
CAS 335
(that’s on the third floor of the CAS Building, 725
Commonwealth Ave)
Faculty Instructor:
Christopher A. Schmitt
Assistant Professor of Anthropology, Biology, and Women’s, Gender, &
Sexuality Studies
Office: Stone Science Building (STO), 675
Commonwealth Ave, Rm 247E
Office Hours: Mondays 1:00 - 3:30pm
Web: http://www.evopropinquitous.net
Email: caschmit[at]bu[dot]edu
Twitter: http://www.twitter.com/fuzzyatelin
Course Description
Statistical methods are the backbone of scientific research, but are
often given short shrift when designing research in biological
anthropology. The purpose of this seminar is two-fold: 1) to familiarize
students with the use of relevant statistical programming packages
(primarily R), and 2) to discuss select advances in statistical
techniques from related disciplines that may help students while
designing and implementing their own research projects.
Potential foci of discussion may include statistical methods for
accounting for small sample sizes or non-normal data, using power
analyses and preliminary statistics to justify data collection design,
and the use of mixed models and information theoretic approaches to
analyze a number of different data types. Although there will be a
discussion element to the seminar, students should see this course as a
guided workshop or practicum in which we learn by working with both our
own and previously published datasets to better understand hypothesis
testing using statistical inference in biological anthropology.
This course is open to students outside of Anthropology willing to learn
the methods involved. Past students include undergraduate and graduate
students from Anthropology, Archaeology, Biology, and Economics.
Prerequisites
CAS AN 102 or CAS BI 107/108 (for undergraduates) or graduate student
standing, and/or consent of instructor. At least one semester of
introductory statistics is recommended, but not required. Prior
experience programming is helpful, but also not required.
Assessment
Performance in the class will be assessed on a
gradeless basis for the semester, with only a single
final grade being assigned in consultation with each student. Assessment
will entail the following assignments and considerations:
- Regular attendance and class participation (10%).
- On-time completion of assigned Modules
prior to seminar meetings (10%).
- Programming homework sets associated with each Module,
due Tuesday at 8:00pm to your assigned peer
commentary partner(s) (20%).
- Respectful Peer
Commentary on homework coding that puts into practice the
teamwork and pair programming methods discussed in class and in
readings, to be submitted to me by 5:00pm Thursday
(10%).
- One individual Analysis
Replication Assignment based on a published paper with a
publicly available dataset, chosen in consultation with the instructor
(20%).
- One group presentation and written R vignette
demonstrating the use of a particular statistical method chosen in
consultation with the instructor (past examples available in the Modules).
Group participation will be a large part of evaluation, and must also
put into practice teamwork methods discussed in readings and class
(20%).
- A final Self
Evaluation written by you arguing for the grade
you have earned through your progress and the quality of your work in
the class (5%).
Required Texts
Kabacoff R. 2022. R in Action, 3rd Edition. New York:
Manning Publications.
Davies TM. 2016. The Book of R: A First Course in Programming
and Statistics. San Francisco: No Starch Press.
Davies is available in print or electronic format from No Starch Press and O’Reilly Media; Kabacoff
is available in print or electronic format from Manning
Publications. I recommend using the third edition if you can (the
second edition is also available as a PDF);
both texts are also available at Amazon.com.
Optional Texts Students Find Helpful
Learning Objectives
By the end of this course, you should:
- be familiar with key concepts and methods in applied data science for
acquiring and managing data, conducting exploratory data analyses,
testing statistical hypotheses, building models to classify and make
predictions about data, and evaluating model performance;
- have a facility with modern tools for data analysis (e.g., the
Unix command line, version control systems, the R programming
environment, web APIs) and be able to apply “best practices” in data
science;
- know how to interact with both local and remote data sources to
store, query, process, and analyze data presented in a variety of common
formats (e.g., delimited text files, structured text files, various
database systems);
- be comfortable writing simple computer programs for data
management, statistical analysis, visualization, and more specialized
applications;
- know how to design and implement reproducible data science
workflows that take a project from data acquisition to analysis to
presentation and be able to organize your work using a version control
system;
- be able to accurately assess, critique, and reproduce existing
published works utilizing public and open source data repositories
and analytical techniques;
- be able to work as part of an effective team to solve problems and
implement effective coding practices towards a group analytical goal;
- and be able to apply all of these tools to questions of interest in
the natural and social sciences.
BU HUB Learning Outcomes (for enrolled undergraduates)
This course has been accepted to the BU HUB
Undergraduate General Education Curriculum as CAS AN/BI
588. It has several proposed Learning Outcomes related to its
assigned Hub Capacities, including:
Scientific Inquiry II (SI2)
- Learning Outcome 1: Students in this course will
learn to identify and apply appropriate methods of statistical inference
to test hypotheses related to biological anthropology. This will be done
in the R programming framework using published or simulated datasets
from a number of scientific literatures, primarily from primate
morphology and behavior. Throughout the semester, students will use
these tests to appropriately frame and address established hypotheses in
biological anthropology using increasingly complex statistical methods,
including using t-tests to assess differences in body size between wild
and anthropogenically impacted vervet monkey groups, testing
hypothesized correlations between female body size and various life
history traits across primates using multiple regression, and assessing
shifts in male grooming attention towards periovulatory female
chimpanzees using generalized linear mixed modeling, among others. This
will culminate in a more advanced application of these methods by
engaging in the R-based replication of a published paper with open
access data in the student’s field of interest, and the development of
their own teaching module demonstrating a novel statistical method and
how it may be used to test a novel hypothesis in a new dataset. The
replication assignment will facilitate a hands-on, critical assessment
of how these authors used data processing, manipulation, analyses, and
figures to reach their published conclusions. The teaching module will
allow students to reflect critically on a novel method and how it might
be used (or may be misused) to inform a hypothetical framework of their
choosing in a novel dataset.
Quantitative Reasoning II (QR2)
- Learning Outcome 1: In this course, students will
learn how to frame questions germane to biological anthropology through
the explicit testing of hypotheses using statistical inference. Using
the R statistical programming framework, students will explore the
underlying logic and mathematics of probability, hypothesis testing,
linear modeling and regression, ANOVA, multiple regression, generalized
linear modeling and mixed effects modeling, and then use these methods
to solve complex problems in biological anthropology with both real and
simulated data.
- Learning Outcome 2: Online, R-based statistical
modules guide students through the development and use of the above
methods in numerous datasets drawn from studies of wild primates and
museum specimens to test hypotheses central to biological anthropology
and evolutionary biology. These questions range from using the Poisson
distribution to predict the frequency of titi monkey morning duets, to
using generalized linear modeling to assess whether male rank has an
impact on fitness outcomes in a population of woolly monkeys.
- Learning Outcome 3: With every statistical module
students will be challenged to test established hypotheses using
statistical inference. They are then challenged in homework assignments
to investigate data structures and formulate and test their own
hypotheses using the statistical methods learned in class. This
culminates in a replication assignment from the primary literature, in
which students critically evaluate the methods and conclusions of a
published paper in their field.
- Learning Outcome 4: Through the use of GitHub-based
shared repositories and the R Markdown language, students will learn to
communicate via annotated coding chunks in data reports to explain their
logic and choice of statistical coding options to address bi-weekly
homework assignments and statistical module challenges. The Peer
Commentary of homework assignments requires that students engage with
each other to both improve their code and ensure that the symbolic,
visual, numerical, and written accounts of their data processing and
analysis are communicated effectively. The final assignments of the
class, a group-based homework module teaching a novel statistical test
not presented in the class and a replication of an analysis from the
primary literature, both encourage students to design legible and
engaging representations of their data analysis process, and to
adequately communicate these methods both verbally and visually.
- Learning Outcome 5: Through an emphasis on testing
the assumptions underlying the statistical tests learned in class,
students will learn to recognize and articulate both the capacity and
limitations of these methods. This includes modules and discussions
testing data distributions and model residuals for normality, using
power tests to assess the effects of sample size and variance on
parameter estimation, assessing the appropriateness of link functions
for different data distributions in GLMMs, and more, complete with
demonstrations of what poorly analyzed data looks like, and how that may
negatively influence scientific inference.
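To give a flavor of the power analyses mentioned in Learning Outcome 5, here is a minimal simulation sketch of how sample size and variance affect the chance of detecting a real difference between two groups. It is an illustration only, not course material: the course itself works in R, while this sketch uses Python's standard library, and the function name `simulated_power`, the effect size, the standard deviations, and the normal-approximation cutoff of 1.96 (in place of a proper t critical value) are all assumptions chosen for the example.

```python
import random
import statistics


def simulated_power(n, effect, sd, trials=2000, z_crit=1.96, seed=1):
    """Estimate the share of simulated two-group studies that detect `effect`.

    Hypothetical helper: each trial draws two normal samples of size n whose
    true means differ by `effect`, then checks whether the standardized
    difference in sample means exceeds z_crit in absolute value.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(effect, sd) for _ in range(n)]
        # Standard error of the difference in means (unpooled variances).
        se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
        z = (statistics.mean(b) - statistics.mean(a)) / se
        if abs(z) > z_crit:
            hits += 1
    return hits / trials


# Estimated power for three sample sizes at a fixed effect size and spread.
for n in (10, 40, 160):
    print(n, simulated_power(n, effect=0.5, sd=1.0))
```

Running the loop should show estimated power rising with sample size, and raising `sd` while holding `n` fixed should lower it. Because the cutoff is a normal approximation, the estimates for very small samples are somewhat optimistic.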