Getting Data into R

Objectives

The objective of this module to learn how to download data sets from various local and online sources.

Preliminaries

GO TO: https://github.com/fuzzyatelin/fuzzyatelin.github.io/tree/master/AN588_Fall23/, select the .txt version of Country-Data-2016, then press the RAW button, highlight, and copy the text to a text editor and save it locally. Do the same for the .csv version.

This data set consists of basic statistics (area, current population size, birth rate, death rate, life expectancy, and form of government) for 249 countries taken from WorldData.info that I have combined with data from the International Union for the Conservation of Nature (IUCN)’s Red List Summary Statistics about the number of threatened animal species by country. Note: the IUCN dataset has generally not been updated since 2008!

From the same page, select the .xlsx version of CPDS-1960-2014-reduced, then press DOWNLOAD and save it locally. Also download these data in .txt and .csv formats using the procedure described above.

This is a version of the Comparative Political Data Set (CPDS), which is “a collection of political and institutional country-level data provided by Prof. Dr. Klaus Armingeon and collaborators at the University of Berne. It consists of annual data for 36 democratic countries for the period of 1960 to 2014 or since their transition to democracy” (Armingeon et al. 2016). The full dataset consists of 300 variables, which I have pared down to a smaller set of economical and population size variables.

CITATION: Armingeon K, Isler C, Knöpfel L, Weisstanner D, and Engler S. 2016. Comparative Political Data Set 1960-2014. Bern: Institute of Political Science, University of Berne.

Install these packages in R: {readr}, {curl}, {readxl},{writexl},{drop2}, {repmis}, {googledrive}, {googlesheets4}

The Backstory

So far, we have seen how to create a variety of data structures by hand (e.g., using the c() function), but for larger data sets we will need mechanisms to import data into R. There are many methods for importing tabular data, stored in various formats (like text files, spreadsheets, and databases).

The Tao of Text

Plain text files are, arguably, the best way to store data (and scripts and other documents) as they are a standard format that has been around longer than most operating systems and are unlikely to change anytime soon.

Plain text does not have a version and does not age
Plain text files are platform and software agnostic
Plain text files can be opened by a wide variety of programs
Plain text can easily be copied and pasted into a wide range of software
Plain text files tend to be smaller and quicker to open then proprietary formats
Plain text files are easy to transmit over the web
Many mature and efficient software tools exist for indexing, parsing, searching, and modifying text
The content of plain text files looks the same on any system
Various flavors of Markdown can be used for styling plain text files, if needed
Plain text remains itself outside of the digital context

Loading Different Types of Plain Text Files

In R, we can load a data set from a plain text file using the read.table() function from the {base} package, with the path to the file as the first (file=) argument for the function. An additional argument (header=) can be used to specify whether the first row of the data file consists of column/field names.

The generic read.table() function can be used to read data files where columns are separated by tabs, commas, white space, or some other delimeter. The sep= argument tells R what character is used as a “separator” or delimiter. The skip= argument can be used to start reading a file after a set number of rows.

There are format-specific variants of read.table() (e.g., read.csv()) that have different defaults and may be quicker for certain file types. Note that when using this function from the {base} package, the argument stringsAsFactors is set to be TRUE by default, and we need to set it as FALSE if we want character strings to be loaded as actual strings.

Let’s read in one of the data sets that you have copied and stored locally: CPDS-1960-2014-reduced.txt.

Reading from a local file

The file.choose() command is useful and gives you a familiar dialog box to select a file. You can use this to specify the path to a locally-stored file, which you will save as the R object called f.

f <- file.choose()

The file paths below refer to where I have saved the downloaded data, on my Desktop. You will need to change this if you have saved your downloaded data to a different location.

Loading tab-separated (`.tsv`, `.txt`) text with {base} R functions

NOTE: In the following snippet, you can change the sep= argument as needed to use other delimiters

f <- "~/Desktop/CPDS-1960-2014-reduced.txt"
d <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE, fill = T)
head(d)  # lists the first 6 lines of data

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

NOTE: With bracket notation, you can modify how many lines the head() function will return: e.g., head(d)[1:10])

tail(d)  # shows the last 6 lines of data

##      year country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1573 2009     USA      16.36     83.64         0         1             53.2
## 1574 2010     USA      11.76     88.24         0         1 02/11/2010  39.8
## 1575 2011     USA       8.80     91.20         0         1             39.8
## 1576 2012     USA       5.88     94.12         0         1 06/11/2012  50.9
## 1577 2013     USA       5.88     94.12         0         1             50.9
## 1578 2014     USA       8.40     91.60         0         1 04/11/2014  35.6
##      womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1573     16.8     -2.78     -0.36     93.47  -12.83 155454.0    75.49   9.3
## 1574     16.8      2.53      1.64    102.66  -12.18 155220.3    74.80   9.6
## 1575     16.8      1.60      3.16    108.25  -10.75 154949.3    74.12   8.9
## 1576     18.0      2.22      2.07    111.48   -9.00 156368.6    74.53   8.1
## 1577     17.8      1.49      1.46    111.45   -5.49 156761.2    74.41   7.4
## 1578     19.3      2.43      1.62    111.70   -5.13 157268.8       NA   6.2
##           pop pop15_64    pop65 elderly
## 1573 306771.5 206060.8 39623.18   12.92
## 1574 309347.1 207665.3 40479.35   13.09
## 1575 311721.6 209179.2 41366.63   13.27
## 1576 314112.1 209823.0 43164.91   13.74
## 1577 316498.0 210673.5 44723.04   14.13
## 1578 318857.0 211545.9 46243.21   14.50

class(d)  # shows that tables are typically loaded as data frames

## [1] "data.frame"

Or, alternatively…

d <- read.delim(f, header = TRUE, stringsAsFactors = FALSE)
head(d)

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

Loading comma-separated (`.csv`) text with {base} R functions

f <- "~/Desktop/CPDS-1960-2014-reduced.csv"
d <- read.table(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

Or, alternatively…

d <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
head(d)

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

Using the {readr} package

The {readr} package provides alternative functions to read in delimited text files. It runs faster than the {base} package functions. It reads in an initial set of 1000 rows of the from the table to try to impute the data class for of each column. You can also specify the data class of each column with the col_types() function.

require(readr)

## Loading required package: readr

f <- "~/Desktop/CPDS-1960-2014-reduced.txt"
d <- read_tsv(f, col_names = TRUE)  # for tab-separated files

## Rows: 1578 Columns: 20

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

class(d)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

# returns d as a data frame, but also as other table-based data structures

Or, alternatively…

d <- read_delim(f, delim = "\t", col_names = TRUE)

## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

require(readr)
f <- "~/Desktop/CPDS-1960-2014-reduced.csv"
d <- read_csv(f, col_names = TRUE)  # for comma-separated files

## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

Or, alternatively…

d <- read_delim(f, delim = ",", col_names = TRUE)

## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

Loading Excel Files

While you should never need to use Excel, sometimes you will no doubt be given a spreadsheet file with some data in it that you want to analyze in R. There are several packages available that provide functions for loading data into R from Excel spreadsheet files: {readxl}/{writexl}, {XLConnect}, {gdata}, and {xlsx}. The first of these ({readxl}/{writexl}) are fast, easy to use, and work well; on top of that, {readxl} has been integrated into the {tidyverse}, which means we’ll be using it a lot anyway. {XLConnect} only works on some systems, {gdata} is a bit slower and requires that you have PERL installed someone on your computer (which it is likely to be by default), and {xlsx} is much slower, so you can try them if you like, but we won’t be using them in this course.

NOTE: always use str() to check if your variables come in as the correct data class.

Using the {readxl} and {writexl} packages

With {readxl}, importing Excel files into R and converting them to dataframes is very simple. You simply need to specify the PATH to your desired file, make sure to specify which sheet in the file you want to import, and if you want the column names to carry over you need to specify that, as well:

require(readxl)

## Loading required package: readxl

f <- "~/Desktop/CPDS-1960-2014-reduced.xlsx"
d <- read_excel(f, sheet = 1, col_names = TRUE)
head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect              
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <dttm>             
## 1  1960 Australia        100         0         0         1 NA                 
## 2  1961 Australia        100         0         0         1 1961-12-09 00:00:00
## 3  1962 Australia        100         0         0         1 NA                 
## 4  1963 Australia        100         0         0         1 1963-11-30 00:00:00
## 5  1964 Australia        100         0         0         1 NA                 
## 6  1965 Australia        100         0         0         1 NA                 
## # ℹ 13 more variables: vturn <dbl>, womenpar <dbl>, realgdpgr <dbl>,
## #   inflation <dbl>, debt_hist <dbl>, deficit <dbl>, ttl_labf <dbl>,
## #   labfopar <dbl>, unemp <dbl>, pop <dbl>, pop15_64 <dbl>, pop65 <dbl>,
## #   elderly <dbl>

str(d)

## tibble [1,578 × 20] (S3: tbl_df/tbl/data.frame)
##  $ year      : num [1:1578] 1960 1961 1962 1963 1964 ...
##  $ country   : chr [1:1578] "Australia" "Australia" "Australia" "Australia" ...
##  $ gov_right1: num [1:1578] 100 100 100 100 100 100 100 100 100 100 ...
##  $ gov_cent1 : num [1:1578] 0 0 0 0 0 0 0 0 0 0 ...
##  $ gov_left1 : num [1:1578] 0 0 0 0 0 0 0 0 0 0 ...
##  $ gov_party : num [1:1578] 1 1 1 1 1 1 1 1 1 1 ...
##  $ elect     : POSIXct[1:1578], format: NA "1961-12-09" ...
##  $ vturn     : num [1:1578] 95.5 95.3 95.3 95.7 95.7 95.7 95.1 95.1 95.1 95 ...
##  $ womenpar  : num [1:1578] 0 0 0 0 0 0 0.8 0.8 0.8 0 ...
##  $ realgdpgr : num [1:1578] NA -0.643 5.767 6.009 6.258 ...
##  $ inflation : num [1:1578] 3.729 2.288 -0.319 0.641 2.866 ...
##  $ debt_hist : num [1:1578] 40.1 38.6 38.7 37.3 35.3 ...
##  $ deficit   : num [1:1578] 0.4582 -0.3576 -0.7938 -0.5062 -0.0804 ...
##  $ ttl_labf  : num [1:1578] 4215 4286 4382 4484 4611 ...
##  $ labfopar  : num [1:1578] NA NA NA NA 67.2 ...
##  $ unemp     : num [1:1578] 1.25 2.46 2.32 1.87 1.45 ...
##  $ pop       : num [1:1578] 10275 10508 10700 10907 11122 ...
##  $ pop15_64  : num [1:1578] 6296 6429 6572 6711 6857 ...
##  $ pop65     : num [1:1578] 875 895 914 933 948 ...
##  $ elderly   : num [1:1578] 8.51 8.51 8.54 8.55 8.52 ...

The {writexl} package can also write data frames back out to Excel worksheets. If the file does not exist, it is created. If it does exist, data is cleared and overwritten (so be careful!). The second process is MUCH slower. I have included a conditional statement (if(){}) which will implement the file.remove() command here, if needed (to speed things up).

library(writexl)
f <- "~/Desktop/output.xlsx"
if (file.exists(f)) {
    file.remove(f)
}
write_xlsx(d, f, col_names = TRUE)

Loading Files Stored on a Remote Server

We can also load files stored on a server elsewhere on the web, e.g., Dropbox or GitHub.

To read .csv or .txt files directly from GitHub, use the {curl} or {readr} packages. We’ll use this extensively in this class, moving forward.

GO TO: https://github.com/fuzzyatelin/fuzzyatelin.github.io/tree/master/AN588_Fall23/, select the .csv version of the CPDS-1960-2014-reduced file, then press RAW and copy the URL from the address box of your browser window… this is what you need to use as an argument for the functions below (you will repeat this for the .txt version later on)

Importing data from a file on a remote server using {curl}

For a comma-separated value (.csv) text file…

library(curl)

## Using libcurl 8.11.1 with OpenSSL/3.3.2

## 
## Attaching package: 'curl'

## The following object is masked from 'package:readr':
## 
##     parse_date

f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Spring25/CPDS-1960-2014-reduced.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

# returns a data frame

For a tab-delimited (.tsv or txt) text file…

f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.txt")
d <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
head(d)

##   year   country gov_right1 gov_cent1 gov_left1 gov_party      elect vturn
## 1 1960 Australia        100         0         0         1             95.5
## 2 1961 Australia        100         0         0         1 09/12/1961  95.3
## 3 1962 Australia        100         0         0         1             95.3
## 4 1963 Australia        100         0         0         1 30/11/1963  95.7
## 5 1964 Australia        100         0         0         1             95.7
## 6 1965 Australia        100         0         0         1             95.7
##   womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1        0        NA      3.73     40.15    0.46  4215.00       NA  1.25
## 2        0     -0.64      2.29     38.62   -0.36  4286.00       NA  2.46
## 3        0      5.77     -0.32     38.75   -0.79  4382.00       NA  2.32
## 4        0      6.01      0.64     37.34   -0.51  4484.00       NA  1.87
## 5        0      6.26      2.87     35.31   -0.08  4610.80    67.24  1.45
## 6        0      4.99      3.41     53.99   -0.73  4745.95    67.66  1.36
##       pop pop15_64 pop65 elderly
## 1 10275.0   6296.5 874.9    8.51
## 2 10508.2   6428.6 894.6    8.51
## 3 10700.5   6571.5 913.6    8.54
## 4 10906.9   6710.9 933.0    8.55
## 5 11121.6   6857.3 948.1    8.52
## 6 11340.9   7014.6 966.3    8.52

# returns a data frame

Importing data from a file on a remote server using {readr}

Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
library(readr)
f <- "https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.csv"
d <- read_csv(f, col_names = TRUE)

## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

# returns a 'tibble', a new version of a data frame

f <- "https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.txt"
d <- read_tsv(f, col_names = TRUE)

## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(d)

## # A tibble: 6 × 20
##    year country   gov_right1 gov_cent1 gov_left1 gov_party elect  vturn womenpar
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>  <dbl>    <dbl>
## 1  1960 Australia        100         0         0         1 <NA>    95.5        0
## 2  1961 Australia        100         0         0         1 09/12…  95.3        0
## 3  1962 Australia        100         0         0         1 <NA>    95.3        0
## 4  1963 Australia        100         0         0         1 30/11…  95.7        0
## 5  1964 Australia        100         0         0         1 <NA>    95.7        0
## 6  1965 Australia        100         0         0         1 <NA>    95.7        0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## #   deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## #   pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>

# returns a 'tibble', a new version of a data frame for use in the
# {tidyverse}

Importing data from a file hosted on Dropbox

Unfortunately, now that Dropbox is restricting API access to their servers, it’s become difficult to work with files archived on a Dropbox account in the RStudio framework. The package we would usually use for this, {rdrop2}, has already been archived on CRAN, in part because of this and because the maintainers have given it up (as of late 2024). However, we can still install the development version, and it will work (for now!) on the most recent version of R (4.4.2) as of January 2025:

devtools::install_github("karthik/rdrop2")

NOTE: The following code block cannot be “knit” to show you the output because it requires an interactive R environment for drop_auth(), drop_search(), etc. Also, with two-factor authentication protocols (like Duo for BU accounts), there are likely a number of cumbersome logins/authentications you need to enter to accomplish this!

require(rdrop2)
drop_auth()  # opens a browser dialog box to ask for authorization...
drop_dir()  # lists the contents of your dropbox folder
f <- "CPDS-1960-2014-reduced.csv"  # name of the file to read from
f <- drop_search(f)  # searches your dropbox directory for file or directory names; this can be slow
f <- f$path  # f$path is the location you choose from the results returned above
d <- drop_read_csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
str(d)

This same process can be done to load data from other types of delimited files in Dropbox by setting the appropriate sep= argument.

You can also read text files from someone else’s Dropbox account using a link that they have shared with you. Here’s a link sharing the same file from my BU dropbox:

link <- "https://www.dropbox.com/scl/fi/560bh55pwv3sowbmyygsl/CPDS-1960-2014-reduced.csv?rlkey=sp4xkouegbno3fqrt65o7gb4d&dl=0"

NOTE: Shared Dropbox links like this one will take you to a webpage that has the data embedded… to get the raw data you need to change the end of the link from dl=0 to dl=1 or raw=1. That’s what the next line of code does:

link <- gsub(pattern = "dl=0", replacement = "dl=1", x = link)
d <- read.csv(link, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
str(d)

You can also bypass {rdrop2} issues and use the source_data() function from the {repmis} package (“Miscellaneous Tools for Reproducible Research”) to load data from a file on Dropbox. This function detects column types and gives a few more warnings than others if it encounters something odd.

require(repmis)

## Loading required package: repmis

d <- source_data(link, header = TRUE, sep = ",")  # use the same updated link to the raw data as above

## Downloading data from: https://www.dropbox.com/scl/fi/560bh55pwv3sowbmyygsl/CPDS-1960-2014-reduced.csv?rlkey=sp4xkouegbno3fqrt65o7gb4d&dl=0

## SHA-1 hash of the downloaded data file is:
## ded3e6bd2a06f16c05861b4082bdabcb5c8fade7

head(d)

##                                      devServerHostname: "" V2
## 1 requireFallbackBaseUrl: "https://www.dropboxstatic.com/" NA
## 2                                 langpack: {locale: "en"} NA
## 3                                      useReactNext: false NA

str(d)

## 'data.frame':    3 obs. of  2 variables:
##  $ devServerHostname: "": chr  "requireFallbackBaseUrl: \"https://www.dropboxstatic.com/\"" "langpack: {locale: \"en\"}" "useReactNext: false"
##  $ V2                   : logi  NA NA NA

Finally, you can also load data directly from google drive into R using either the {googledrive} or the {googlesheets4} package. The most seamless way of importing a file from google drive is to convert the file to a google sheet (i.e., a spreadsheet editable within the drive). Try saving the CPDS-1960-2014-reduced file as a google sheet in your own drive, and extracting it into R using the following code.

NOTE: The following code block cannot be “knit” to show you the output because it requires an interactive R environment for gs_ls(), get_title(), etc. If the package does not take you to a webpage login for google sheets upon the gs_ls() command, you may need to install the package {httpuv} separately.

require(googledrive)
require(googlesheets4)

f <- drive_find(n_max = 30, type = "spreadsheet")  #This interactive interface will guide you through steps to link your Google Drive. The 'n_max' option will restrict the number of files retrieved to the 30 most recently-added to the linked drive. The 'type' option allows you to only search for a specific kind of file; in this case, spreadsheets (this 'type' will *not* find Excel files, only Google Sheets). It will store the results in an object called a 'dribble'.

# In your dribble, find the Google Sheet you want. The first column is the
# name, and the second is the ID for the sheet. That ID is what you need
# to enter to read your google sheet into R (you may need to log in again
# to access it).

d <- read_sheet("1bzyH5hgif0__q6oW4zfJhxGxLKA073sv8ZYcZpAHTf4")
head(d)
str(d)

For importation of a spreadsheet file not converted to a google sheet in your Drive, it’s a bit more complicated but still possible. You can try uploading your document and extracting it into R using the code suggested by this helpful post from the Dyer Lab.

Downloading Files from a Remote Server

The {rdrop2} package can be used to download a file from a personal Dropbox account to your local computer, rather than just connecting to a Dropbox file to read the data stored there. The {googledrive} package can do this, as well, with any file in a personal Google Drive account (given you have permission). This should work with any file type.

NOTE: The following code block cannot be “knit” to show you the output because it requires an interactive R environment for drop_search(), etc.

filename <- "CPDS-1960-2014-reduced.csv"  # name of file to download
f <- drop_search(filename)  # searches your dropbox directory for that file or directory name
f <- f$path  # f$path is the location of the results returned above
drop_get(f, local_file = paste0("~/Desktop/", filename), overwrite = TRUE, progress = TRUE)
# this will save the file to the desktop

The progress=TRUE argument gives you a reassuring progress bar. By default, this argument is set to FALSE.

NOTE: The process also works for other file types, e.g., Excel files:

filename <- "CPDS-1960-2014-reduced.xlsx"
f <- drop_search(filename)  # searches your dropbox directory for file or directory names
f <- f$path  # $path is the location of the results returned above
drop_get(f, local_file = paste0("~/Desktop/", filename), overwrite = TRUE, progress = TRUE)
# again, saves to the desktop

You can also download files directly to your computer from google drive using the {googledrive} package.

drive_find(pattern = "CPDS-1960-2014-reduced")  #identifies files that match that name 
drive_download("CPDS-1960-2014-reduced.csv")
# in this case, should save to your working directory

More on the basics of using {googledrive} for more refined data downloading, interfacing and formatting can be found here.

Downloading Files from Other Statistical Programs

R is pretty great, but you may have analyses and datasets already formatted for other statistical programs that you want to bring to R for further analysis or replication. For more advice and tips on how to download other forms of data into R, including datasets and outputs from other statistical packages such as SAS, Stata, and SPSS, check out this webpage.

For folks more intimately familiar with the statistical program MATLAB, there are a number of resources like this handy guide, and an entire package, {R.matlab}, that allows MATLAB users to interface with R.

Module 05

by Anthony Di Fiore and Christopher A. Schmitt

Getting Data into R

Objectives

Preliminaries

The Backstory

The Tao of Text

Loading Different Types of Plain Text Files

Reading from a local file

Loading tab-separated (`.tsv`, `.txt`) text with {base} R functions

Loading comma-separated (`.csv`) text with {base} R functions

Using the {readr} package

Loading Excel Files

Using the {readxl} and {writexl} packages

Loading Files Stored on a Remote Server

Importing data from a file on a remote server using {curl}

Importing data from a file on a remote server using {readr}

Importing data from a file hosted on Dropbox

Downloading Files from a Remote Server

Downloading Files from Other Statistical Programs

Module 05

by Anthony Di Fiore and Christopher A. Schmitt

Getting Data into R

Objectives

Preliminaries

The Backstory

The Tao of Text

Loading Different Types of Plain Text Files

Reading from a local file

Loading tab-separated (.tsv, .txt) text with {base} R functions

Loading comma-separated (.csv) text with {base} R functions

Using the {readr} package

Loading Excel Files

Using the {readxl} and {writexl} packages

Loading Files Stored on a Remote Server

Importing data from a file on a remote server using {curl}

Importing data from a file on a remote server using {readr}

Importing data from a file hosted on Dropbox

Downloading Files from a Remote Server

Downloading Files from Other Statistical Programs

Loading tab-separated (`.tsv`, `.txt`) text with {base} R functions

Loading comma-separated (`.csv`) text with {base} R functions