This module introduces how survey data can be a powerful tool in anthropological research by using real-world responses from a survey done in this class. Surveys allow anthropologists to gather insights into cultural beliefs, behaviors, and perceptions across diverse populations. In this module, you’ll explore how structured survey responses can be analyzed to reveal patterns and meanings relevant to social science questions. To get started, you’ll need the following:
library(tidyverse)
library(curl)
library(ggplot2)
library(dplyr)
library(tm)
library(wordcloud2)
library(RColorBrewer)
library(lme4)
BUT FIRST:
Why Use Survey Data in Anthropology?
Survey data helps anthropologists:
Understand cultural trends and social behavior.
Collect both quantitative and qualitative data, which makes it possible to analyze both broad trends and detailed insights within R.
Compare societies/populations across different regions and time periods.
Quantify ethnographic findings with statistical evidence.
Example: Imagine a cultural anthropologist studying youth perceptions of education across rural and urban communities in Kenya. Through open- and closed-ended survey questions, they can:
Quantitatively measure how many respondents believe education leads to job security.
Qualitatively analyze why they hold those beliefs based on open-text responses.
Use R to visualize group differences (e.g., urban vs. rural) and run regression models to predict educational outlook based on socioeconomic variables.
We will use a survey dataset for this module. First, we need to load the dataset:
# Load the raw dataset
raw_data <- curl("https://github.com/ZeddyCraft/AN588-Group-Presentation/raw/refs/heads/main/AN588%20Survey%20(Responses)%20-%20Form%20Responses%201.csv")
d <- read.csv(raw_data, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## Timestamp How.old.are.you. What.is.your.gender.
## 1 3/25/2025 16:44:00 21 Female
## 2 3/25/2025 16:55:22 20 Male
## 3 3/25/2025 17:14:43 19 Female
## 4 3/25/2025 17:15:59 22 Female
## 5 3/25/2025 17:20:32 21 Non-Binary
## 6 3/25/2025 17:20:33 20 Non-Binary
## How.tall.are.you..in.inches.please....if.you.don.t.know.guess.a.number...
## 1 6'9
## 2 69
## 3 64
## 4 66
## 5 60"
## 6 65
## How.much.do.you.weigh..in.lbs.please....if.you.don.t.know.guess.a.number...
## 1 180lbs
## 2 200
## 3 155
## 4 130
## 5 125lbs
## 6 185
## What.is.your.major...Answer.in.full..Ex..Computer.Science.
## 1 cs
## 2 PO/IR
## 3 Computer Science
## 4 Marine Sciences, Earth & Environmental Sciences
## 5 Film & TV, Advertising
## 6 Painting
## Do.you.have.are.planning.to.add.a.minor.
## 1 No
## 2 Yes
## 3 No
## 4 No
## 5 No
## 6 No
## If.the.answer.to.your.previous.question.was.yes..what.minor...Answer.in.full..Ex..Computer.Science.
## 1
## 2 DS
## 3
## 4
## 5
## 6
## On.a.scale.of.1.10..how.much.do.you.like.BU.
## 1 8
## 2 7
## 3 8
## 4 8
## 5 6
## 6 8
## On.a.scale.of.1.10..how.confident.are.you.in.your.academics.
## 1 7
## 2 7
## 3 3
## 4 9
## 5 5
## 6 3
## On.a.scale.of.1.10..how.confident.are.you.that.you.will.get.a.job.after.graduation.
## 1 8
## 2 4
## 3 2
## 4 4
## 5 2
## 6 2
## On.a.scale.of.1.10..how.often.are.you.anxious.about.the.future.
## 1 5
## 2 9
## 3 10
## 4 7
## 5 10
## 6 10
## How.many.chickens.can.fit.in.the.basement.of.CAS.
## 1 Infinite
## 2 At least 5
## 3 643,022.76
## 4 at least two
## 5 infinite
## 6 200,000
## How.many.zombies.do.you.think.you.could.kill.in.the.zombie.apocalypse.
## 1 10
## 2 Depends on the equipment but my delusional ass thinks at least 5
## 3 3
## 4 none i'd be patient zero
## 5 depends, at min like maybe 2?
## 6 i’d get to like 4 them just give up tbh
## If.you.could.have.an.infinite.number.of.goats..how.many.goats.would.you.want.
## 1 2
## 2 I do not want goats, so as many as I could realistically sell for profit without over-saturating the market or endangering their wellbeing
## 3 4
## 4 ...one?????
## 5 infinite
## 6 2
## What.is.your.favorite.color.
## 1 purple
## 2 Yellow
## 3 805999
## 4 #dabbed
## 5 teal or forest green
## 6 Purple
## Use.one.word.to.describe.how.do.you.feel.about.BU. Column.17
## 1 multicultural NA
## 2 Nuanced NA
## 3 Wonderous NA
## 4 lol NA
## 5 meh NA
## 6 Labyrinthian NA
Take a moment to explore the raw survey data above. Before we begin analysis, consider the following questions:
What is wrong with this dataset?
What will likely happen if we run an analysis on it right now?
What challenges or limitations would this impose on our research?
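Before conducting any kind of analysis, it’s essential to clean your data. Dirty or inconsistent data can lead to misleading results or errors in your analysis. Comparing the raw data above with the cleaned version below shows some common cleaning steps: converting heights and weights to plain numbers (6'9 becomes 81 inches, 180lbs becomes 180), writing out abbreviated majors (cs becomes Computer Science), and replacing long question-style column headers with short names.
You can clean your data using R (e.g., the mutate(), rename(), and filter() functions in dplyr), or manually in a spreadsheet if the dataset is small. The cleaned version used in this module is already processed for demonstration.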
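As a rough illustration of what cleaning in R can look like, here is a minimal sketch that uses only the six height and weight responses printed in the raw output above. The helper parse_height() is hypothetical (it is not part of this module's code), and it assumes that a value written like 6'9 means feet and inches while everything else is already in inches.

```r
# Raw responses copied from the head(d) output above
raw_height <- c("6'9", "69", "64", "66", "60\"", "65")
raw_weight <- c("180lbs", "200", "155", "130", "125lbs", "185")

# parse_height() is a hypothetical helper, not part of the module's code.
# Assumption: a value like 6'9 is feet and inches; anything else is inches.
parse_height <- function(x) {
  x <- trimws(x)
  # Strip everything except digits and the decimal point, then convert
  digits_only <- function(s) suppressWarnings(as.numeric(gsub("[^0-9.]", "", s)))
  out <- digits_only(x)                       # "60\"" -> 60, "69" -> 69
  has_feet <- grepl("^[0-9]+'", x)            # TRUE for "6'9"
  feet <- as.numeric(sub("'.*$", "", x[has_feet]))
  inches <- digits_only(sub("^[0-9]+'", "", x[has_feet]))
  inches[is.na(inches)] <- 0                  # e.g. "6'" with no inches given
  out[has_feet] <- feet * 12 + inches
  out
}

clean_height <- parse_height(raw_height)
clean_weight <- as.numeric(gsub("[^0-9.]", "", raw_weight))  # drops "lbs"

print(clean_height)
print(clean_weight)
```

For these six respondents the results agree with the Height.In. and Weight.lb. columns of the cleaned dataset. Real responses can be messier than this sketch anticipates, so it is worth checking any converted column by eye.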
Now load this cleaned dataset:
# Load the dataset
cleaned_data <- curl("https://github.com/ZeddyCraft/AN588-Group-Presentation/raw/refs/heads/main/Cleaned_Data_4:23.csv")
d1 <- read.csv(cleaned_data, header = TRUE, sep = ",", stringsAsFactors = FALSE)
options(scipen = 999) # Turn off scientific notation for printing
head(d1)
## Age Gender Height.In. Weight.lb. Major
## 1 21 Female 81 180 Computer Science
## 2 20 Male 69 200 Political Science
## 3 19 Female 64 155 Computer Science
## 4 22 Female 66 130 Marine Science
## 5 21 Non-Binary 60 125 Film & TV
## 6 20 Non-Binary 65 185 Painting
## Second_Major Minor Like_BU Academic_Confidence
## 1 8 7
## 2 International Relations Data Science 7 7
## 3 8 3
## 4 Earth & Environmental Science 8 9
## 5 Advertising 6 5
## 6 8 3
## Job_Confidence Future_Anxious Chickens_Basement Zombies_Could_Kill
## 1 8 5 1E+43 10
## 2 4 9 5 5
## 3 2 10 643,022.76 3
## 4 4 7 2 0
## 5 2 10 1E+43 2
## 6 2 10 200,000 4
## Goat_Number Favorite_Color BU_Description
## 1 2 Purple Multicultural
## 2 500 Yellow Nuanced
## 3 4 Purple Wonderous
## 4 1 Purple lol
## 5 1000000000000 Teal Meh
## 6 2 Purple Labyrinthian
What differences do you see with this dataset? How does it influence our research going forward?
Learn to use survey data to test theoretical relationships using linear models and visualization.
In anthropology, forming and testing hypotheses helps us connect individual experiences to larger social patterns. For example, we might ask: “Does a student’s academic confidence influence how secure they feel about their job prospects?”
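We can test a question like this with a linear model (lm()) in R. Hypothesis: Academic confidence will have a stronger predictive relationship with job security than academic anxiety (measured here by the Future_Anxious item, which asks how often respondents are anxious about the future).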
This is an anthropological question because it links subjective experience (confidence, anxiety) to structural expectations (job market, future security), all within a cultural context (e.g., university life in a late-capitalist society).
Let’s test this using regression models.
survey_data <- d1
Confidence <- d1$Academic_Confidence
Anxiety <- d1$Future_Anxious
JobSecurity <- d1$Job_Confidence
# Two regression models
model_conf <- lm(JobSecurity ~ Confidence, data = survey_data)
model_anx <- lm(JobSecurity ~ Anxiety, data = survey_data)
summary(model_conf)
##
## Call:
## lm(formula = JobSecurity ~ Confidence, data = survey_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2935 -2.1825 -0.0828 2.5619 5.6616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3833 1.2694 1.877 0.0678 .
## Confidence 0.4888 0.1826 2.677 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.685 on 40 degrees of freedom
## Multiple R-squared: 0.1519, Adjusted R-squared: 0.1307
## F-statistic: 7.165 on 1 and 40 DF, p-value: 0.01072
summary(model_anx)
##
## Call:
## lm(formula = JobSecurity ~ Anxiety, data = survey_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.066 -1.802 -0.037 1.922 4.992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1252 1.2494 8.104 0.000000000572 ***
## Anxiety -0.6118 0.1605 -3.811 0.000467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.498 on 40 degrees of freedom
## Multiple R-squared: 0.2664, Adjusted R-squared: 0.2481
## F-statistic: 14.53 on 1 and 40 DF, p-value: 0.0004671
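To read the two coefficient tables above, it can help to compute a few predictions by hand. The sketch below copies the printed estimates and plugs in three scale scores; the scores 1, 5, and 10 are illustrative choices, not values taken from the survey.

```r
# Estimates copied from the two summary() tables above
conf_coef <- c(intercept = 2.3833, slope = 0.4888)
anx_coef  <- c(intercept = 10.1252, slope = -0.6118)

# Illustrative scale scores to plug in (1 = low, 10 = high)
scores <- c(1, 5, 10)

# Predicted job confidence = intercept + slope * predictor score
pred_from_conf <- conf_coef[["intercept"]] + conf_coef[["slope"]] * scores
pred_from_anx  <- anx_coef[["intercept"]] + anx_coef[["slope"]] * scores

print(data.frame(score = scores, pred_from_conf, pred_from_anx))
```

In other words, each additional point of academic confidence is associated with about 0.49 more points of job confidence, while each additional point of anxiety is associated with about 0.61 fewer points.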
Let’s visualize it
# Compare R-squared values
conf_r2 <- summary(model_conf)$r.squared
anx_r2 <- summary(model_anx)$r.squared
bar_data <- tibble(
Predictor = c("Academic Confidence", "Academic Anxiety"),
R_squared = c(conf_r2, anx_r2)
)
# Barplot
ggplot(bar_data, aes(x = Predictor, y = R_squared, fill = Predictor)) +
geom_col(show.legend = FALSE) +
ylim(0, 1) +
labs(
title = "Prediction of Job Security by Confidence vs. Anxiety",
y = "R-squared Value",
x = "Predictor"
) +
theme_minimal()
Which predictor has a stronger relationship with perceived job security? What might this tell us about how confidence and anxiety influence perceptions of the future?
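When thinking about this question, it may help to recall that for a regression with a single predictor, R-squared is the squared Pearson correlation between predictor and outcome. The sketch below checks this on simulated data; the seed, sample size, and variable names are assumptions for illustration and are not the class survey responses.

```r
set.seed(588)  # arbitrary seed so the simulated numbers are reproducible

# Simulated 1-10 style responses; these are NOT the class survey data
n <- 42
anxiety <- sample(1:10, n, replace = TRUE)
job_confidence <- 10 - 0.6 * anxiety + rnorm(n, sd = 2.5)

fit <- lm(job_confidence ~ anxiety)

# R-squared reported by the model vs. the squared correlation
r2_from_model <- summary(fit)$r.squared
r2_from_cor   <- cor(anxiety, job_confidence)^2

print(all.equal(r2_from_model, r2_from_cor))
```

So comparing the two bars is equivalent to comparing how strongly each predictor correlates, in absolute terms, with perceived job security.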
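📘 Learning Note: Linear regression is just one way to model relationships between survey responses. Depending on your research question and data structure, other methods may be more appropriate. For instance, the examples that follow use log transformations for heavily skewed counts and a binomial glm() for a categorical outcome.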
To show how flexible survey data is, we will try to model some sillier data (chickens, zombies, and goats). Starting with chickens, we can compare how students in different majors estimate the number of chickens that can fit in the basement of CAS.
# Remove commas or weird characters if necessary
d1$Chickens_Basement <- gsub(",", "", d1$Chickens_Basement)
# Convert to numeric
d1$Chickens_Basement <- as.numeric(d1$Chickens_Basement)
# Use log-transformation to make the visuals more unified
d1$log_chickens <- log10(d1$Chickens_Basement + 1)
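To see what each of these steps does, here is a small sketch that applies them to the six Chickens_Basement values shown in the head(d1) output above. The intermediate variable names are made up for illustration.

```r
# Values copied from the Chickens_Basement column in head(d1) above
raw_chickens <- c("1E+43", "5", "643,022.76", "2", "1E+43", "200,000")

# Step 1: strip thousands separators, which as.numeric() cannot parse
no_commas <- gsub(",", "", raw_chickens)

# Step 2: convert text to numbers ("1E+43" is read as scientific notation)
chickens <- as.numeric(no_commas)

# Step 3: log10(x + 1) compresses huge values and keeps zero at zero
log_chickens <- log10(chickens + 1)

print(data.frame(raw_chickens, chickens, log_chickens))
print(log10(0 + 1))  # an answer of 0 chickens maps to 0 on the log scale
```

The + 1 matters because log10(0) is -Inf; adding one keeps a response of zero on the plot. After the transformation, an answer of 5 and an answer of 1E+43 sit at roughly 0.8 and 43 rather than 43 orders of magnitude apart.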
If we plot responses by individual major, we end up with a separate bar for nearly every respondent, which is cluttered and hard to read. Instead, we can group the majors into their BU schools.
# Create a new column for the BU schools
d1 <- d1 %>%
mutate(school = case_when(
Major %in% c("Psychology", "Anthropology", "Biology", "History", "Philosophy", "Political Science", "Marine Science", "Computer Science", "Sociology", "Archaeology", "Physics", "Neuroscience", "Astronomy", "Sociocultural Anthropology", "Biological Anthropology", "Mathematics") ~ "CAS",
Major %in% c("Electrical Engineering", "Mechanical Engineering", "Robotics & Autonomous Systems", "Mechanical engineering", "Mechanical Engineering", "Computer Engineering") ~ "ENG",
Major %in% c("Journalism", "Film & TV", "Advertising", "Public Relations") ~ "COM",
Major %in% c("Business Administration", "Accounting", "Finance", "Business Administration & Management", "Business") ~ "Questrom",
Major %in% c("Music", "Theater", "Performance", "Music Education") ~ "CFA",
TRUE ~ "Other"
))
# Model Chickens by School
mean_chickens <- d1 %>%
group_by(school) %>%
summarise(mean_log_chickens = mean(log_chickens, na.rm = TRUE))
ggplot(mean_chickens, aes(x = school, y = mean_log_chickens, fill = school)) +
geom_col() +
labs(
title = "Mean Chicken Estimate by School",
x = "School",
y = "Mean log10(Chickens + 1)"
) +
theme_minimal()
This plot shows the average (log-transformed) number of chickens each
school’s students estimated. While the estimates are whimsical, the plot
still reflects possible group-level differences in estimation
behavior—maybe engineering students make more practical guesses, while
humanities students are more imaginative?
Now let’s try it with zombies. Here, answers vary depending on whether respondents take a more imaginative or a more realistic perspective. One way to handle the numbers is to sort them into categories, much as we grouped majors into schools for the chicken question.
# Zombie Visuals
zombie_level <- cut(d1$Zombies_Could_Kill,
breaks = c(-Inf, 5, 15, 20, Inf),
labels = c("Low", "Moderate", "High", "Ultra"))
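# Then model as a factor outcome. Note: with family = "binomial", glm() treats the
# first factor level ("Low") as failure and every other level as success.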
glm(zombie_level ~ Age + Like_BU, data = d1, family = "binomial")
##
## Call: glm(formula = zombie_level ~ Age + Like_BU, family = "binomial",
## data = d1)
##
## Coefficients:
## (Intercept) Age Like_BU
## 1.26033 0.02835 -0.22156
##
## Degrees of Freedom: 41 Total (i.e. Null); 39 Residual
## Null Deviance: 57.84
## Residual Deviance: 57.21 AIC: 63.21
ggplot(d1, aes(x = zombie_level)) +
geom_bar(fill = "purple") +
labs(title = "Zombie Killer Tiers", x = "Zombie Tier", y = "Count")
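One detail worth making explicit: when a factor with more than two levels is passed to glm() with a binomial family, R models the first level against all the others combined. The sketch below illustrates this on simulated data; the data frame toy, its columns, and the seed are assumptions for illustration, not the class survey.

```r
set.seed(588)  # arbitrary seed for reproducibility

# Simulated respondents; these are NOT the class survey data
n <- 42
toy <- data.frame(
  Age = sample(18:23, n, replace = TRUE),
  Zombies = sample(c(0:10, 12, 18, 25, 50), n, replace = TRUE)
)

# Same tiers as in the module
toy$level <- cut(toy$Zombies,
                 breaks = c(-Inf, 5, 15, 20, Inf),
                 labels = c("Low", "Moderate", "High", "Ultra"))

# Explicit binary version: 0 = "Low", 1 = any higher tier
toy$not_low <- as.integer(toy$level != "Low")

fit_factor <- glm(level ~ Age, data = toy, family = binomial)
fit_binary <- glm(not_low ~ Age, data = toy, family = binomial)

# The two fits give the same coefficients
print(all.equal(coef(fit_factor), coef(fit_binary)))
```

So the coefficients in the output above describe the log-odds of being in any tier above "Low", not differences among all four tiers. Modeling all four ordered tiers at once would need a different method, such as ordinal regression.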
Now let’s try goats. Since answers to this question can be fanciful, we first have to convert them to numbers before we can visualize them.
# Convert goats to numeric, in case of any issue in the data
d1$Goat_Number <- as.numeric(d1$Goat_Number)
## Warning: NAs introduced by coercion
# Log-transform for visualization
d1$log_goats <- log10(d1$Goat_Number + 1)
# Goats Visual
ggplot(d1, aes(x = log10(Goat_Number + 1))) +
geom_density(fill = "orange", alpha = 0.6) +
geom_rug(color = "darkred", alpha = 0.5) +
labs(
title = "How Many Goats They Want",
x = "log10(Goats + 1)",
y = "Density"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12)
)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
This density plot reveals how skewed the data are. Most people appear to
want only a few goats, while a few participants want millions or more.
Log-transforming brings that skew into view and makes it easier to
compare distributions.
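The coercion warning above means that at least one goat response could not be read as a number and became NA, which is likely also why one row is dropped from the plot. A minimal sketch for tracking down such responses follows; the first five values come from head(d1), while the text answer "a lot" is a hypothetical stand-in, since the actual offending response is not shown in this module.

```r
# First five values are from head(d1); "a lot" is a hypothetical text answer
goat_raw <- c("2", "500", "4", "1", "1000000000000", "a lot")

# Convert, silencing the coercion warning we already know about
goat_num <- suppressWarnings(as.numeric(goat_raw))

# A response "failed" if it was present as text but became NA as a number
failed <- is.na(goat_num) & !is.na(goat_raw)

print(goat_raw[failed])   # the responses that need manual attention
print(sum(failed))        # how many rows will be dropped from plots
```

Running the same check on the real column before calling as.numeric() shows exactly which answers are being discarded, so you can decide whether to recode them or leave them out.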
library(ggridges) # Will need this package for the visualization
ggplot(d1, aes(x = log_goats, y = Gender, fill = Gender)) +
geom_density_ridges(alpha = 0.6, scale = 1.2) +
labs(
title = "Distribution of Goat Desire by Gender (Log-Transformed)",
x = "Self-Reported Goat Preference",
y = "Gender"
) +
theme_minimal()
## Picking joint bandwidth of 0.549
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density_ridges()`).
In the ridgeline plot, all groups tend to peak around the lower end of
the scale, but some of the outlier behavior appears to come from the
male and non-binary groups, suggesting more playful exaggeration in
certain subgroups.
Another avenue to explore is grouping the responses by school. Go ahead and try it.
# You can use 'school' to group the majors by schools, like in the Chicken Question.
ggplot(d1, aes(x = log_goats, y = school, fill = school)) +
geom_density_ridges(alpha = 0.5, scale = 1.2) +
labs(
title = "Distribution of Goat Desire by School (Log-Transformed)",
x = "Self-Reported Goat Preference",
y = "School"
) +
theme_minimal()
## Picking joint bandwidth of 0.735
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density_ridges()`).
Choosing the right model and visuals depends on the nature of the variables in your survey and the assumptions behind the statistical method. Even silly survey data can be modeled with basic techniques like linear regression, logistic regression, and transformation to fit into nice visualizations.
One of the most useful parts of a survey is its qualitative data: responses expressed in words or phrases rather than numbers.
Take, for example, the “Favorite_Color” column, based on the prompt “What is your favorite color?”.
head(d1$Favorite_Color)
## [1] "Purple" "Yellow" "Purple" "Purple" "Teal" "Purple"
How can we organize this data? How can we analyze it? Methods built for numeric data, such as scatterplots or regression, do not apply directly to text responses. One way to visualize them is with a word cloud, which scales the size of each word by how often it appears, giving us a visual summary of the responses. The wordcloud2 package provides a function, wordcloud2(), that lets us create these in R.
library(wordcloud2)
#first let's find the frequency (or counts) of each of the color responses in r
t<-table(d1$Favorite_Color)
wordcloud2(t, size=.8, color=(c("Black","Blue","Brown","Cyan","Green","Pink","Purple","Red","Teal","Yellow")))
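To see what the frequency table feeding the word cloud looks like, here is a small base R sketch using the six favorite colors printed by head() above. The names color_counts and freq_df are made up for illustration.

```r
# The six responses shown by head(d1$Favorite_Color) above
colors <- c("Purple", "Yellow", "Purple", "Purple", "Teal", "Purple")

# Count how often each color appears
color_counts <- table(colors)

# Reshape into a two-column data frame: the word and its frequency
freq_df <- as.data.frame(color_counts, stringsAsFactors = FALSE)
names(freq_df) <- c("word", "freq")

# Sort so the most common response comes first
freq_df <- freq_df[order(-freq_df$freq), ]
print(freq_df)
```

As far as I know, wordcloud2() also accepts a data frame laid out like freq_df (word in the first column, frequency in the second), which is convenient if you want to filter or recode responses before plotting.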
That’s one form of visualization! Word clouds can be an amazing tool for your data visuals. Another common form of visualization is graphs. Two of the most common types are pie graphs and bar graphs. Let’s make a few graphs using the favorite colors of the respondents!
ggplot(d1, aes(x=Favorite_Color, fill=Favorite_Color))+
geom_bar(show.legend = FALSE)+
scale_fill_manual(values=c("black","blue","brown","cyan","green","pink","purple","red","turquoise","yellow"))
pie(x=t,col=c("black","blue","brown","cyan","green","pink","purple","red","turquoise","yellow"))
Notice that we used base R’s pie() here because ggplot2 has no dedicated pie-chart geom. You can still make pie charts with ggplot2 by drawing a single stacked bar with geom_col() and wrapping it in polar coordinates with coord_polar(theta = "y").
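A minimal sketch of a ggplot2 pie chart follows, assuming the usual approach of one stacked bar drawn in polar coordinates. The counts come from the six favorite colors printed by head() earlier, and the object names color_df and pie_plot are made up for illustration.

```r
library(ggplot2)

# Counts for the six responses shown by head(d1$Favorite_Color)
color_df <- data.frame(
  Favorite_Color = c("Purple", "Yellow", "Teal"),
  n = c(4, 1, 1)
)

# One stacked bar (x = "") bent into a circle by coord_polar(theta = "y")
pie_plot <- ggplot(color_df, aes(x = "", y = n, fill = Favorite_Color)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c(Purple = "purple",
                               Teal = "turquoise",
                               Yellow = "yellow")) +
  labs(title = "Favorite Color (first six respondents)", fill = "Color") +
  theme_void()

pie_plot
```

For the full dataset you could build color_df from the frequency table instead of typing the counts by hand, for example with as.data.frame(table(d1$Favorite_Color)).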
Now it’s your turn. Let’s try making a word cloud and a bar graph with the Majors column.
t2<-table(d1$Major)
wordcloud2(t2, size=.4, color="random-dark")
ggplot(d1, aes(x=Major,fill=Major))+
geom_bar()+
theme(axis.text.x=element_blank(), legend.key.size=unit(.1, "cm"))
Which major appears most frequently in this dataset? Do any of the majors stand out to you?
Text data adds depth — a word cloud offers a quick thematic scan, helping anthropologists connect numbers to narratives.
Survey data is an incredibly flexible tool for anthropologists. In this module, we explored descriptive stats, modeling, and qualitative visualization — all from one simple dataset. The skills here are the foundation for both research and applied work.