1 Using Survey Data in Anthropology: Properly Prepping, Analyzing, and Modeling

1.1 Preliminaries

This module introduces how survey data can be a powerful tool in anthropological research by using real-world responses from a survey done in this class. Surveys allow anthropologists to gather insights into cultural beliefs, behaviors, and perceptions across diverse populations. In this module, you’ll explore how structured survey responses can be analyzed to reveal patterns and meanings relevant to social science questions. To get started, you’ll need the following:

library(tidyverse)
library(curl)
library(ggplot2)
library(dplyr)
library(tm)
library(wordcloud2)
library(RColorBrewer)
library(lme4)

1.2 Objectives

In this module, we will learn:
1. How to clean and manipulate survey data.
2. How to calculate descriptive statistics using survey data.
3. How to determine what type of analyses we can do with survey data.

BUT FIRST:

Why Use Survey Data in Anthropology?

Survey data helps anthropologists:

Understand cultural trends and social behavior.
Collect both quantitative and qualitative data, which can be analyzed in R for both broad trends and detailed insights.
Compare societies and populations across different regions and time periods.
Quantify ethnographic findings with statistical evidence.

Example: Imagine a cultural anthropologist studying youth perceptions of education across rural and urban communities in Kenya. Through open- and closed-ended survey questions, they can:

Quantitatively measure how many respondents believe education leads to job security.
Qualitatively analyze why they hold those beliefs based on open-text responses.
Use R to visualize group differences (e.g., urban vs. rural) and run regression models to predict educational outlook based on socioeconomic variables.

2 Loading and Exploring the Data

We will use a survey dataset for this module. First, we need to load it:

# Load the raw dataset
raw_data <- curl("https://github.com/ZeddyCraft/AN588-Group-Presentation/raw/refs/heads/main/AN588%20Survey%20(Responses)%20-%20Form%20Responses%201.csv")
d <- read.csv(raw_data, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)

##            Timestamp How.old.are.you. What.is.your.gender.
## 1 3/25/2025 16:44:00               21               Female
## 2 3/25/2025 16:55:22               20                 Male
## 3 3/25/2025 17:14:43               19               Female
## 4 3/25/2025 17:15:59               22               Female
## 5 3/25/2025 17:20:32               21           Non-Binary
## 6 3/25/2025 17:20:33               20           Non-Binary
##   How.tall.are.you..in.inches.please....if.you.don.t.know.guess.a.number...
## 1                                                                       6'9
## 2                                                                        69
## 3                                                                        64
## 4                                                                        66
## 5                                                                       60"
## 6                                                                        65
##   How.much.do.you.weigh..in.lbs.please....if.you.don.t.know.guess.a.number...
## 1                                                                      180lbs
## 2                                                                         200
## 3                                                                         155
## 4                                                                         130
## 5                                                                      125lbs
## 6                                                                         185
##   What.is.your.major...Answer.in.full..Ex..Computer.Science.
## 1                                                         cs
## 2                                                      PO/IR
## 3                                           Computer Science
## 4            Marine Sciences, Earth & Environmental Sciences
## 5                                     Film & TV, Advertising
## 6                                                   Painting
##   Do.you.have.are.planning.to.add.a.minor.
## 1                                       No
## 2                                      Yes
## 3                                       No
## 4                                       No
## 5                                       No
## 6                                       No
##   If.the.answer.to.your.previous.question.was.yes..what.minor...Answer.in.full..Ex..Computer.Science.
## 1                                                                                                    
## 2                                                                                                  DS
## 3                                                                                                    
## 4                                                                                                    
## 5                                                                                                    
## 6                                                                                                    
##   On.a.scale.of.1.10..how.much.do.you.like.BU.
## 1                                            8
## 2                                            7
## 3                                            8
## 4                                            8
## 5                                            6
## 6                                            8
##   On.a.scale.of.1.10..how.confident.are.you.in.your.academics.
## 1                                                            7
## 2                                                            7
## 3                                                            3
## 4                                                            9
## 5                                                            5
## 6                                                            3
##   On.a.scale.of.1.10..how.confident.are.you.that.you.will.get.a.job.after.graduation.
## 1                                                                                   8
## 2                                                                                   4
## 3                                                                                   2
## 4                                                                                   4
## 5                                                                                   2
## 6                                                                                   2
##   On.a.scale.of.1.10..how.often.are.you.anxious.about.the.future.
## 1                                                               5
## 2                                                               9
## 3                                                              10
## 4                                                               7
## 5                                                              10
## 6                                                              10
##   How.many.chickens.can.fit.in.the.basement.of.CAS.
## 1                                          Infinite
## 2                                        At least 5
## 3                                        643,022.76
## 4                                      at least two
## 5                                          infinite
## 6                                           200,000
##   How.many.zombies.do.you.think.you.could.kill.in.the.zombie.apocalypse.
## 1                                                                     10
## 2       Depends on the equipment but my delusional ass thinks at least 5
## 3                                                                      3
## 4                                               none i'd be patient zero
## 5                                          depends, at min like maybe 2?
## 6                                i’d get to like 4 them just give up tbh
##                                                                If.you.could.have.an.infinite.number.of.goats..how.many.goats.would.you.want.
## 1                                                                                                                                          2
## 2 I do not want goats, so as many as I could realistically sell for profit without over-saturating the market or endangering their wellbeing
## 3                                                                                                                                          4
## 4                                                                                                                                ...one?????
## 5                                                                                                                                   infinite
## 6                                                                                                                                          2
##   What.is.your.favorite.color.
## 1                       purple
## 2                       Yellow
## 3                       805999
## 4                      #dabbed
## 5         teal or forest green
## 6                       Purple
##   Use.one.word.to.describe.how.do.you.feel.about.BU. Column.17
## 1                                      multicultural        NA
## 2                                            Nuanced        NA
## 3                                          Wonderous        NA
## 4                                                lol        NA
## 5                                                meh        NA
## 6                                       Labyrinthian        NA

2.1 Challenge 1:

Take a moment to explore the raw survey data above. Before we begin analysis, consider the following questions:

What is wrong with this dataset?
What will likely happen if we run an analysis on it right now?
What challenges or limitations would this impose on our research?

2.2 Why Cleaning Your Data Matters

Before conducting any kind of analysis, it’s essential to clean your data. Dirty or inconsistent data can lead to misleading results or errors in your analysis. Common cleaning steps include:

Renaming or relabeling columns for clarity
Converting text responses to numeric or categorical variables (e.g., with as.numeric() or factor())
Handling missing data appropriately
Standardizing scales

You can clean your data using R (e.g., mutate(), rename(), filter() functions in dplyr), or manually in a spreadsheet if the dataset is small. The cleaned version used in this module is already processed for demonstration.
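As a rough illustration of the text-to-numeric step, here is a minimal sketch that turns free-text height and weight answers like those in the raw data into numbers. The helper names parse_height_in() and parse_weight_lb() are hypothetical, and this is not necessarily how the cleaned file was produced. The helpers only cover the formats visible in the first few rows (feet-and-inches such as 6'9, plain numbers, and numbers with trailing units).

```r
# Hypothetical helper: convert free-text heights to inches
parse_height_in <- function(x) {
  x <- trimws(x)
  has_feet <- grepl("^[0-9]+'", x)  # e.g. "6'9" means 6 feet 9 inches
  # Default case: strip everything that is not a digit or decimal point
  out <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))
  # Feet-and-inches case: feet * 12 + inches
  parts <- strsplit(x[has_feet], "'")
  out[has_feet] <- vapply(parts, function(p) {
    feet <- as.numeric(p[1])
    inches <- if (length(p) > 1) as.numeric(gsub("[^0-9.]", "", p[2])) else 0
    if (is.na(inches)) inches <- 0
    feet * 12 + inches
  }, numeric(1))
  out
}

# Hypothetical helper: drop units such as "lbs" and convert to numeric
parse_weight_lb <- function(x) {
  suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))
}

heights <- parse_height_in(c("6'9", "69", "60\""))
weights <- parse_weight_lb(c("180lbs", "200", "125lbs"))
heights
weights
```

Answers that contain no digits at all (for example, "infinite") come back as NA with this approach, so they still need a manual decision.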

Now load this cleaned dataset:

# Load the dataset
cleaned_data <- curl("https://github.com/ZeddyCraft/AN588-Group-Presentation/raw/refs/heads/main/Cleaned_Data_4:23.csv")
d1 <- read.csv(cleaned_data, header = TRUE, sep = ",", stringsAsFactors = FALSE)
options(scipen = 999) # Turn off scientific notation for printing
head(d1)

##   Age     Gender Height.In. Weight.lb.             Major
## 1  21     Female         81        180  Computer Science
## 2  20       Male         69        200 Political Science
## 3  19     Female         64        155  Computer Science
## 4  22     Female         66        130    Marine Science
## 5  21 Non-Binary         60        125         Film & TV
## 6  20 Non-Binary         65        185          Painting
##                    Second_Major        Minor Like_BU Academic_Confidence
## 1                                                  8                   7
## 2       International Relations Data Science       7                   7
## 3                                                  8                   3
## 4 Earth & Environmental Science                    8                   9
## 5                   Advertising                    6                   5
## 6                                                  8                   3
##   Job_Confidence Future_Anxious Chickens_Basement Zombies_Could_Kill
## 1              8              5             1E+43                 10
## 2              4              9                 5                  5
## 3              2             10        643,022.76                  3
## 4              4              7                 2                  0
## 5              2             10             1E+43                  2
## 6              2             10           200,000                  4
##     Goat_Number Favorite_Color BU_Description
## 1             2         Purple  Multicultural
## 2           500         Yellow        Nuanced
## 3             4         Purple      Wonderous
## 4             1         Purple            lol
## 5 1000000000000           Teal            Meh
## 6             2         Purple   Labyrinthian
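Once the columns are numeric, descriptive statistics become straightforward. The sketch below retypes only the first six rows printed above (the Gender and Like_BU columns), so the numbers describe those six rows, not the full dataset. The object name first_six is just for illustration.

```r
# First six rows of two cleaned columns, copied from the head() output
first_six <- data.frame(
  Gender  = c("Female", "Male", "Female", "Female", "Non-Binary", "Non-Binary"),
  Like_BU = c(8, 7, 8, 8, 6, 8)
)

# Overall summaries
mean_like   <- mean(first_six$Like_BU)
median_like <- median(first_six$Like_BU)
sd_like     <- sd(first_six$Like_BU)

# Group means: average Like_BU rating within each gender
by_gender <- tapply(first_six$Like_BU, first_six$Gender, mean)

mean_like
median_like
by_gender
```

On the full data you would run the same calls on d1 instead of first_six.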

2.3 Discussion Prompt:

What differences do you see between this dataset and the raw version? How do they influence our research going forward?

2.3.1 TAKEAWAY: Clean your data!!

3 Modeling and Visualizing Survey Data

3.1 Objective:

Learn to use survey data to test theoretical relationships using linear models and visualization.

3.1.1 🧠 Conceptual Focus:

In anthropology, forming and testing hypotheses helps us connect individual experiences to larger social patterns. For example, we might ask: “Does a student’s academic confidence influence how secure they feel about their job prospects?”

3.1.2 📊 Skill Focus:

Learn how to structure a testable hypothesis using Likert scale data.
Practice fitting linear models (lm()) in R.
Compare model fit using R-squared.
Interpret coefficients and draw conclusions.

3.1.3 🔬 Hypothesis:

Academic confidence will have a stronger predictive relationship with perceived job security than anxiety about the future.

This is an anthropological question because it links subjective experience (confidence, anxiety) to structural expectations (job market, future security), all within a cultural context (e.g., university life in a late-capitalist society).

Let’s test this using regression models.

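# Collect the three Likert-scale variables under shorter names so that
# the model formulas actually refer to columns of survey_data
survey_data <- d1 %>%
  select(
    Confidence  = Academic_Confidence,
    Anxiety     = Future_Anxious,
    JobSecurity = Job_Confidence
  )

# Two regression models, each with a single predictor
model_conf <- lm(JobSecurity ~ Confidence, data = survey_data)
model_anx <- lm(JobSecurity ~ Anxiety, data = survey_data)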

summary(model_conf)

## 
## Call:
## lm(formula = JobSecurity ~ Confidence, data = survey_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2935 -2.1825 -0.0828  2.5619  5.6616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   2.3833     1.2694   1.877   0.0678 .
## Confidence    0.4888     0.1826   2.677   0.0107 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.685 on 40 degrees of freedom
## Multiple R-squared:  0.1519, Adjusted R-squared:  0.1307 
## F-statistic: 7.165 on 1 and 40 DF,  p-value: 0.01072

summary(model_anx)

## 
## Call:
## lm(formula = JobSecurity ~ Anxiety, data = survey_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.066 -1.802 -0.037  1.922  4.992 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  10.1252     1.2494   8.104 0.000000000572 ***
## Anxiety      -0.6118     0.1605  -3.811       0.000467 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.498 on 40 degrees of freedom
## Multiple R-squared:  0.2664, Adjusted R-squared:  0.2481 
## F-statistic: 14.53 on 1 and 40 DF,  p-value: 0.0004671
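To read the coefficients, it can help to plug a value into each fitted line. The sketch below uses the estimates printed in the two summaries above. The score of 7 is an arbitrary example value, and the helper names are only for illustration.

```r
# Coefficients copied from the summary() output above
conf_intercept <- 2.3833
conf_slope     <- 0.4888
anx_intercept  <- 10.1252
anx_slope      <- -0.6118

# Predicted job confidence = intercept + slope * predictor
predict_from_conf <- function(confidence) conf_intercept + conf_slope * confidence
predict_from_anx  <- function(anxiety) anx_intercept + anx_slope * anxiety

pred_conf <- predict_from_conf(7)  # academic confidence of 7
pred_anx  <- predict_from_anx(7)   # future anxiety of 7

pred_conf
pred_anx
```

Each one-point increase in academic confidence raises the predicted job confidence by about 0.49 points, while each one-point increase in anxiety lowers it by about 0.61 points.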

Let’s visualize the comparison:

# Compare R-squared values
conf_r2 <- summary(model_conf)$r.squared
anx_r2 <- summary(model_anx)$r.squared

bar_data <- tibble(
  Predictor = c("Academic Confidence", "Future Anxiety"),
  R_squared = c(conf_r2, anx_r2)
)

# Barplot
ggplot(bar_data, aes(x = Predictor, y = R_squared, fill = Predictor)) +
  geom_col(show.legend = FALSE) +
  ylim(0, 1) +
  labs(
    title = "Prediction of Job Security by Confidence vs. Anxiety",
    y = "R-squared Value",
    x = "Predictor"
  ) +
  theme_minimal()
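The R-squared values being compared are the share of variance in the outcome that each model explains. As a minimal sketch, here is the same quantity computed by hand on the first six rows of Academic_Confidence and Job_Confidence from the cleaned data. The result will differ from the full-data value because only six rows are used.

```r
# First six rows, copied from the head(d1) output
conf <- c(7, 7, 3, 9, 5, 3)
job  <- c(8, 4, 2, 4, 2, 2)

fit <- lm(job ~ conf)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((job - mean(job))^2)
r2_by_hand <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
r2_by_hand
summary(fit)$r.squared
```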

3.2 Discussion Prompt:

Which predictor has a stronger relationship with perceived job security? What might this tell us about how confidence and anxiety influence perceptions of the future?

📘 Learning Note: Linear regression is just one way to model relationships between survey responses. Depending on your research question and data structure, other methods may be more appropriate. For instance:

Generalized Linear Models (GLMs) can be used when your dependent variable is categorical or count-based (e.g., predicting whether a student feels included in campus culture based on confidence and anxiety levels).
Mixed Effects Models are useful when your data includes repeated measures or nested structures (e.g., analyzing student responses grouped by their Major with random intercepts for each group).
Ordinal Regression is better suited for Likert-type responses when the order matters but the spacing between options is not equal.
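As a small sketch of the first option, here is a logistic regression (a GLM with a binomial family) fitted with base R. The data are made up purely for illustration: the survey did not ask about inclusion, and the variable names confidence and included are assumptions.

```r
# Made-up toy data: confidence score (1-10) and a yes/no inclusion answer
toy <- data.frame(
  confidence = 1:10,
  included   = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)
)

# Logistic regression: models the log-odds of included == 1
fit_glm <- glm(included ~ confidence, data = toy, family = binomial)

# Predicted probabilities of feeling included at low and high confidence
new_students <- data.frame(confidence = c(2, 9))
pred_prob <- predict(fit_glm, newdata = new_students, type = "response")

coef(fit_glm)
pred_prob
```

The mixed-effects and ordinal options follow the same formula style but need extra packages (for example, lme4 for random intercepts).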

To show how flexible survey data is, we will try to model the sillier questions (chickens, zombies, and goats). Starting with chickens, we can compare how students from different majors answered the question of how many chickens can fit in the basement of CAS.

3.2.1 Chickens in the Basement

# Remove commas or weird characters if necessary
d1$Chickens_Basement <- gsub(",", "", d1$Chickens_Basement)

# Convert to numeric
d1$Chickens_Basement <- as.numeric(d1$Chickens_Basement)

# Use log-transformation to make the visuals more unified
d1$log_chickens <- log10(d1$Chickens_Basement + 1)
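To see why the log transform helps, consider a few answers of very different sizes. The values below are illustrative, loosely based on the responses shown earlier; the + 1 keeps an answer of 0 from becoming -Inf.

```r
# Illustrative guesses spanning many orders of magnitude
chicken_guesses <- c(2, 5, 200000, 1e43)

# log10(x + 1) compresses the scale: each unit is a factor of ten
log_guesses <- log10(chicken_guesses + 1)

# Spread before and after the transform
raw_spread <- diff(range(chicken_guesses))
log_spread <- diff(range(log_guesses))

round(log_guesses, 2)
raw_spread
log_spread
```

On the raw scale the largest answer would flatten every other bar to zero. On the log scale all four values fit on one readable axis.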

If we compared perspectives by individual major, we would end up with a separate bar for each major, which would be cluttered and hard to read. Instead, we can group the majors into their BU schools.

# Create a new column for the BU schools
d1 <- d1 %>%
  mutate(school = case_when(
    Major %in% c("Psychology", "Anthropology", "Biology", "History",
                 "Philosophy", "Political Science", "Marine Science",
                 "Computer Science", "Sociology", "Archaeology", "Physics",
                 "Neuroscience", "Astronomy", "Sociocultural Anthropology",
                 "Biological Anthropology", "Mathematics") ~ "CAS",
    Major %in% c("Electrical Engineering", "Mechanical Engineering",
                 "Robotics & Autonomous Systems", "Mechanical engineering",
                 "Computer Engineering") ~ "ENG",
    Major %in% c("Journalism", "Film & TV", "Advertising",
                 "Public Relations") ~ "COM",
    Major %in% c("Business Administration", "Accounting", "Finance",
                 "Business Administration & Management", "Business") ~ "Questrom",
    Major %in% c("Music", "Theater", "Performance", "Music Education") ~ "CFA",
    TRUE ~ "Other"
  ))

# Summarize chicken estimates by school (mean of the log-transformed values)
mean_chickens <- d1 %>%
  group_by(school) %>%
  summarise(mean_log_chickens = mean(log_chickens, na.rm = TRUE))

ggplot(mean_chickens, aes(x = school, y = mean_log_chickens, fill = school)) +
  geom_col() +
  labs(
    title = "Mean Chicken Estimate by School",
    x = "School",
    y = "Mean log10(Chickens + 1)"
  ) +
  theme_minimal()

This plot shows the average (log-transformed) number of chickens each school’s students estimated. While the estimates are whimsical, the plot still reflects possible group-level differences in estimation behavior—maybe engineering students make more practical guesses, while humanities students are more imaginative?

3.2.2 Zombies You Could Kill

Now let’s try it with zombies. Here, answers vary depending on whether respondents took a more imaginative or a more realistic perspective. One way to handle the numbers is to sort them into categories, much as we grouped majors into schools for the chicken question.

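# Zombie visuals: bin the numeric answers into ordered tiers
# and store them as a column so both glm() and ggplot() can find them in d1
d1$zombie_level <- cut(d1$Zombies_Could_Kill,
                       breaks = c(-Inf, 5, 15, 20, Inf),
                       labels = c("Low", "Moderate", "High", "Ultra"))
# Then model the tiers as a factor outcome. With family = "binomial", R treats
# the first level ("Low") as failure and every other level as success.
glm(zombie_level ~ Age + Like_BU, data = d1, family = "binomial")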

## 
## Call:  glm(formula = zombie_level ~ Age + Like_BU, family = "binomial", 
##     data = d1)
## 
## Coefficients:
## (Intercept)          Age      Like_BU  
##     1.26033      0.02835     -0.22156  
## 
## Degrees of Freedom: 41 Total (i.e. Null);  39 Residual
## Null Deviance:       57.84 
## Residual Deviance: 57.21     AIC: 63.21

ggplot(d1, aes(x = zombie_level)) +
  geom_bar(fill = "purple") +
  labs(title = "Zombie Killer Tiers", x = "Zombie Tier", y = "Count")
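To make the binning and the binomial outcome more concrete, here is a small sketch with made-up kill counts. With cut()’s default right-closed intervals, an answer of exactly 5 lands in "Low". The indicator above_low is a hypothetical name for the binary outcome that a binomial glm() effectively models when given a factor with more than two levels.

```r
# Made-up answers for illustration
kills <- c(0, 2, 3, 4, 5, 10, 18, 50)

# Intervals are (-Inf, 5], (5, 15], (15, 20], (20, Inf)
tier <- cut(kills,
            breaks = c(-Inf, 5, 15, 20, Inf),
            labels = c("Low", "Moderate", "High", "Ultra"))

# Count of respondents per tier
tier_counts <- table(tier)

# Explicit binary outcome: 0 = "Low", 1 = any higher tier
above_low <- as.integer(tier != "Low")

tier_counts
above_low
```

If the distinctions among all four tiers matter, an ordinal model is a better fit than a binomial one.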

3.2.3 Infinite Goats

Now let’s try the goats. Since some answers to this question are more imaginative than numeric, we first have to convert them to numbers before we can visualize them.

# Convert goats to numeric, in case of any issue in the data
d1$Goat_Number <- as.numeric(d1$Goat_Number)

## Warning: NAs introduced by coercion

# Log-transform for visualization 
d1$log_goats <- log10(d1$Goat_Number + 1)

# Goats visual: density of the log-transformed answers
ggplot(d1, aes(x = log_goats)) +
  geom_density(fill = "orange", alpha = 0.6) +
  geom_rug(color = "darkred", alpha = 0.5) +
  labs(
    title = "How Many Goats They Want",
    x = "log10(Goats + 1)",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12)
  )

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).

This density plot reveals how skewed the data are. Most people appear to want only a few goats, while a few participants desire millions or more. Log-transforming brings that skew into view and makes it easier to compare distributions.
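The "NAs introduced by coercion" warning and the one removed row both come from an answer that as.numeric() could not parse. A minimal sketch for finding such entries is shown below. The toy vector is made up, since the module does not show which response failed to convert.

```r
# Made-up raw answers; "infinite" cannot be read as a number
goats_raw <- c("2", "500", "infinite", "1000000000000")

# Convert, silencing the coercion warning we already expect
goats_num <- suppressWarnings(as.numeric(goats_raw))

# Which original entries turned into NA?
unparsed <- goats_raw[is.na(goats_num)]

unparsed
sum(is.na(goats_num))
```

Checking the unparsed entries before plotting lets you decide whether to recode them or leave them out.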

3.2.4 Let’s investigate further by gender

library(ggridges) # Will need this package for the visualization

ggplot(d1, aes(x = log_goats, y = Gender, fill = Gender)) +
  geom_density_ridges(alpha = 0.6, scale = 1.2) +
  labs(
    title = "Distribution of Goat Desire by Gender (Log-Transformed)",
    x = "Self-Reported Goat Preference",
    y = "Gender"
  ) +
  theme_minimal()

## Picking joint bandwidth of 0.549

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density_ridges()`).

From the ridgeline plot, all groups tend to peak around the lower end of the scale, but some of the outlier behavior appears to come from the male and non-binary groups, suggesting more playful exaggeration in certain subgroups.

3.3 Challenge 2:

Another avenue to explore is breaking the responses down by school. Go ahead and try it.

# You can use 'school' to group the majors by schools, like in the Chicken Question. 
ggplot(d1, aes(x = log_goats, y = school, fill = school)) +
  geom_density_ridges(alpha = 0.5, scale = 1.2) +
  labs(
    title = "Distribution of Goat Desire by School (Log-Transformed)",
    x = "Self-Reported Goat Preference",
    y = "School"
  ) +
  theme_minimal()

## Picking joint bandwidth of 0.735

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density_ridges()`).

3.3.1 Takeaway:

Choosing the right model and visuals depends on the nature of the variables in your survey and the assumptions behind the statistical method. Even silly survey data can be modeled with basic techniques such as linear regression, logistic regression, and log transformation for clearer visualizations.

4 Qualitative Statistics

One of the most useful parts of a survey is its qualitative data: responses represented by words or phrases rather than numbers.

Take for example, the “Favorite_Color” column, based on the prompt “What is your favorite color?”.

head(d1$Favorite_Color)

## [1] "Purple" "Yellow" "Purple" "Purple" "Teal"   "Purple"

How can we organize this data? How can we analyze it? Numeric summaries such as means and regressions don’t apply to text responses, so we need methods built around counts. One option is a word cloud, which scales the size of each word by how often it appears, giving us a quick visual summary of the responses. The wordcloud2 package provides a function, wordcloud2(), that creates these in R.

library(wordcloud2)

#first let's find the frequency (or counts) of each of the color responses in r
t<-table(d1$Favorite_Color)
wordcloud2(t, size=.8, color=(c("Black","Blue","Brown","Cyan","Green","Pink","Purple","Red","Teal","Yellow")))
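The counting step can be sketched on its own with base R. The six values below are the head(d1$Favorite_Color) output shown earlier, so the counts cover only those six rows. The data frame layout (a word column and a freq column) is the shape wordcloud2() is documented to accept; the object names are just for illustration.

```r
# First six favorite colors, copied from the output above
colors_six <- c("Purple", "Yellow", "Purple", "Purple", "Teal", "Purple")

# Count each response and sort from most to least frequent
freq <- sort(table(colors_six), decreasing = TRUE)

# Two-column data frame: the word and how often it appears
freq_df <- data.frame(
  word = names(freq),
  freq = as.integer(freq)
)

freq_df
```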

That’s one visualization! Word clouds can be a great tool for presenting text data. Another common approach is a graph of the counts. The two most common types are pie charts and bar graphs. Let’s make both using the respondents’ favorite colors!

ggplot(d1, aes(x=Favorite_Color, fill=Favorite_Color))+
  geom_bar(show.legend = F)+
  scale_fill_manual(values=c("black","blue","brown","cyan","green","pink","purple","red","turquoise","yellow"))

pie(x=t,col=c("black","blue","brown","cyan","green","pink","purple","red","turquoise","yellow"))

Notice that ggplot2 doesn’t have a dedicated geom for pie charts, which is why we used base R’s pie() above. You can still make pie charts with ggplot2 by drawing a stacked bar chart and converting it to polar coordinates.
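One common way to do this, sketched here on a small made-up count table, is to draw a single stacked bar and bend it into a circle with coord_polar(). The object names color_counts and pie_plot are only for illustration.

```r
library(ggplot2)

# Made-up counts for three colors
color_counts <- data.frame(
  color = c("Purple", "Yellow", "Teal"),
  n     = c(4, 1, 1)
)

# One stacked bar (x = ""), then polar coordinates turn heights into angles
pie_plot <- ggplot(color_counts, aes(x = "", y = n, fill = color)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c(Purple = "purple", Teal = "turquoise", Yellow = "yellow")) +
  theme_void()

pie_plot
```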

4.1 Challenge 3

Now it’s your turn. Let’s try making a word cloud and a bar graph with the Major column.

t2<-table(d1$Major)
wordcloud2(t2, size=.4, color="random-dark")

ggplot(d1, aes(x=Major,fill=Major))+
  geom_bar()+
  theme(axis.text.x=element_blank(), legend.key.size=unit(.1, "cm"))

4.1.1 Discussion Prompt:

What is the most frequent major in this dataset? Do any of the majors stand out as interesting?

4.1.2 Takeaway:

Text data adds depth — a word cloud offers a quick thematic scan, helping anthropologists connect numbers to narratives.

5 Conclusion

Survey data is an incredibly flexible tool for anthropologists. In this module, we explored descriptive stats, modeling, and qualitative visualization — all from one simple dataset. The skills here are the foundation for both research and applied work.

Survey Data: Properly Prepping, Analyzing, and Modeling For Research

Lindsay Warrell & Jonathan Zhang

04/29/2025