The objective of this module to learn how to download data sets from various local and online sources.
.txt
version of Country-Data-2016, then
press the RAW
button, highlight, and copy the text to a
text editor and save it locally. Do the same for the .csv
version.This data set consists of basic statistics (area, current population size, birth rate, death rate, life expectancy, and form of government) for 249 countries taken from WorldData.info that I have combined with data from the International Union for the Conservation of Nature (IUCN)’s Red List Summary Statistics about the number of threatened animal species by country. Note: the IUCN dataset has generally not been updated since 2008!
.xlsx
version of
CPDS-1960-2014-reduced, then press
DOWNLOAD
and save it locally. Also download these data in
.txt
and .csv
formats using the procedure
described above.This is a version of the Comparative Political Data Set (CPDS), which is “a collection of political and institutional country-level data provided by Prof. Dr. Klaus Armingeon and collaborators at the University of Berne. It consists of annual data for 36 democratic countries for the period of 1960 to 2014 or since their transition to democracy” (Armingeon et al. 2016). The full dataset consists of 300 variables, which I have pared down to a smaller set of economical and population size variables.
CITATION: Armingeon K, Isler C, Knöpfel L, Weisstanner D, and Engler S. 2016. Comparative Political Data Set 1960-2014. Bern: Institute of Political Science, University of Berne.
So far, we have seen how to create a variety of data structures by
hand (e.g., using the c()
function), but for larger data
sets we will need mechanisms to import data into
R. There are many methods for importing
tabular data, stored in various formats (like text files, spreadsheets,
and databases).
Plain text files are, arguably, the best way to store data (and scripts and other documents) as they are a standard format that has been around longer than most operating systems and are unlikely to change anytime soon.
In R, we can load a data set from a plain
text file using the read.table()
function from the {base}
package, with the path to the file as the first (file=
)
argument for the function. An additional argument (header=
)
can be used to specify whether the first row of the data file consists
of column/field names.
The generic read.table()
function can be used to read
data files where columns are separated by tabs, commas, white space, or
some other delimeter. The sep=
argument tells
R what character is used as a “separator” or
delimiter. The skip=
argument can be used to start reading
a file after a set number of rows.
There are format-specific variants of read.table()
(e.g., read.csv()
) that have different defaults and may be
quicker for certain file types. Note that when using this function from
the {base} package, the argument stringsAsFactors
is set to
be TRUE by default, and we need to set it as FALSE if we want character
strings to be loaded as actual strings.
Let’s read in one of the data sets that you have copied and stored locally: CPDS-1960-2014-reduced.txt.
The file.choose()
command is useful and gives you a
familiar dialog box to select a file. You can use this to specify the
path to a locally-stored file, which you will save as the
R object called f
.
f <- file.choose()
The file paths below refer to where I have saved the downloaded data, on my Desktop. You will need to change this if you have saved your downloaded data to a different location.
.tsv
, .txt
) text
with {base} R functionsNOTE: In the following snippet, you can change the
sep=
argument as needed to use other delimiters
f <- "~/Desktop/CPDS-1960-2014-reduced.txt"
d <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE, fill = T)
head(d) # lists the first 6 lines of data
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
NOTE: With bracket notation, you can modify how many
lines the head()
function will return: e.g.,
head(d)[1:10])
tail(d) # shows the last 6 lines of data
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1573 2009 USA 16.36 83.64 0 1 53.2
## 1574 2010 USA 11.76 88.24 0 1 02/11/2010 39.8
## 1575 2011 USA 8.80 91.20 0 1 39.8
## 1576 2012 USA 5.88 94.12 0 1 06/11/2012 50.9
## 1577 2013 USA 5.88 94.12 0 1 50.9
## 1578 2014 USA 8.40 91.60 0 1 04/11/2014 35.6
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1573 16.8 -2.78 -0.36 93.47 -12.83 155454.0 75.49 9.3
## 1574 16.8 2.53 1.64 102.66 -12.18 155220.3 74.80 9.6
## 1575 16.8 1.60 3.16 108.25 -10.75 154949.3 74.12 8.9
## 1576 18.0 2.22 2.07 111.48 -9.00 156368.6 74.53 8.1
## 1577 17.8 1.49 1.46 111.45 -5.49 156761.2 74.41 7.4
## 1578 19.3 2.43 1.62 111.70 -5.13 157268.8 NA 6.2
## pop pop15_64 pop65 elderly
## 1573 306771.5 206060.8 39623.18 12.92
## 1574 309347.1 207665.3 40479.35 13.09
## 1575 311721.6 209179.2 41366.63 13.27
## 1576 314112.1 209823.0 43164.91 13.74
## 1577 316498.0 210673.5 44723.04 14.13
## 1578 318857.0 211545.9 46243.21 14.50
class(d) # shows that tables are typically loaded as data frames
## [1] "data.frame"
Or, alternatively…
d <- read.delim(f, header = TRUE, stringsAsFactors = FALSE)
head(d)
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
.csv
) text with {base}
R functionsf <- "~/Desktop/CPDS-1960-2014-reduced.csv"
d <- read.table(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
Or, alternatively…
d <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
head(d)
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
The {readr} package provides alternative functions to read in
delimited text files. It runs faster than the {base} package functions.
It reads in an initial set of 1000 rows of the from the table to try to
impute the data class for of each column. You can also specify the data
class of each column with the col_types()
function.
require(readr)
## Loading required package: readr
f <- "~/Desktop/CPDS-1960-2014-reduced.txt"
d <- read_tsv(f, col_names = TRUE) # for tab-separated files
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
class(d)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# returns d as a data frame, but also as other table-based data structures
Or, alternatively…
d <- read_delim(f, delim = "\t", col_names = TRUE)
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
require(readr)
f <- "~/Desktop/CPDS-1960-2014-reduced.csv"
d <- read_csv(f, col_names = TRUE) # for comma-separated files
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
Or, alternatively…
d <- read_delim(f, delim = ",", col_names = TRUE)
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
While you should never need to use Excel, sometimes you will no doubt be given a spreadsheet file with some data in it that you want to analyze in R. There are several packages available that provide functions for loading data into R from Excel spreadsheet files: {readxl}, {XLConnect}, {gdata}, and {xlsx}. The first two of these are fast, easy to use, and work well. {gdata} is a bit slower and requires that you have PERL installed someone on your computer (which it is likely to be by default). {xlsx} is much slower.
NOTE: always use str()
to check if your
variables come in as the correct data class.
require(readxl)
## Loading required package: readxl
f <- "~/Desktop/CPDS-1960-2014-reduced.xlsx"
d <- read_excel(f, sheet = 1, col_names = TRUE)
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 1960 Australia 100 0 0 1 NA
## 2 1961 Australia 100 0 0 1 1961-12-09 00:00:00
## 3 1962 Australia 100 0 0 1 NA
## 4 1963 Australia 100 0 0 1 1963-11-30 00:00:00
## 5 1964 Australia 100 0 0 1 NA
## 6 1965 Australia 100 0 0 1 NA
## # ℹ 13 more variables: vturn <dbl>, womenpar <dbl>, realgdpgr <dbl>,
## # inflation <dbl>, debt_hist <dbl>, deficit <dbl>, ttl_labf <dbl>,
## # labfopar <dbl>, unemp <dbl>, pop <dbl>, pop15_64 <dbl>, pop65 <dbl>,
## # elderly <dbl>
str(d)
## tibble [1,578 × 20] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:1578] 1960 1961 1962 1963 1964 ...
## $ country : chr [1:1578] "Australia" "Australia" "Australia" "Australia" ...
## $ gov_right1: num [1:1578] 100 100 100 100 100 100 100 100 100 100 ...
## $ gov_cent1 : num [1:1578] 0 0 0 0 0 0 0 0 0 0 ...
## $ gov_left1 : num [1:1578] 0 0 0 0 0 0 0 0 0 0 ...
## $ gov_party : num [1:1578] 1 1 1 1 1 1 1 1 1 1 ...
## $ elect : POSIXct[1:1578], format: NA "1961-12-09" ...
## $ vturn : num [1:1578] 95.5 95.3 95.3 95.7 95.7 95.7 95.1 95.1 95.1 95 ...
## $ womenpar : num [1:1578] 0 0 0 0 0 0 0.8 0.8 0.8 0 ...
## $ realgdpgr : num [1:1578] NA -0.643 5.767 6.009 6.258 ...
## $ inflation : num [1:1578] 3.729 2.288 -0.319 0.641 2.866 ...
## $ debt_hist : num [1:1578] 40.1 38.6 38.7 37.3 35.3 ...
## $ deficit : num [1:1578] 0.4582 -0.3576 -0.7938 -0.5062 -0.0804 ...
## $ ttl_labf : num [1:1578] 4215 4286 4382 4484 4611 ...
## $ labfopar : num [1:1578] NA NA NA NA 67.2 ...
## $ unemp : num [1:1578] 1.25 2.46 2.32 1.87 1.45 ...
## $ pop : num [1:1578] 10275 10508 10700 10907 11122 ...
## $ pop15_64 : num [1:1578] 6296 6429 6572 6711 6857 ...
## $ pop65 : num [1:1578] 875 895 914 933 948 ...
## $ elderly : num [1:1578] 8.51 8.51 8.54 8.55 8.52 ...
require(XLConnect)
f <- "~/Desktop/CPDS-1960-2014-reduced.xlsx"
d <- readWorksheetFromFile(f, sheet = 1, header = TRUE)
head(d)
str(d)
The {XLConnect} package can also write data frames back out to
Excel worksheets. If the file does not exist,
it is created. If it does exist, data is cleared and overwritten. The
second process is MUCH slower. I have included a conditional statement
(if(){}
) which will implement the
file.remove()
command here, if needed.
f <- "~/Desktop/output.xlsx"
if (file.exists(f)) {
file.remove(f)
}
writeWorksheetToFile(f, d, sheet = "myData", clearSheets = TRUE)
For futher information on using {XLConnect} check out this helpful documentation.
We can also load files stored on a server elsewhere on the web, e.g., Dropbox or GitHub.
To read .csv
or .txt
files directly from
GitHub, use the {curl} or {readr} packages.
We’ll use this extensively in this class, moving forward.
GO TO: https://github.com/fuzzyatelin/fuzzyatelin.github.io/tree/master/AN588_Fall23/,
select the .csv
version of the
CPDS-1960-2014-reduced file, then press
RAW
and copy the URL from the address box of your browser
window… this is what you need to use as an argument for the functions
below (you will repeat this for the .txt
version later
on)
For a comma-separated value (.csv
) text file…
library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
##
## Attaching package: 'curl'
## The following object is masked from 'package:readr':
##
## parse_date
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
# returns a data frame
For a tab-delimited (.tsv
or txt
) text
file…
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.txt")
d <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
head(d)
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn
## 1 1960 Australia 100 0 0 1 95.5
## 2 1961 Australia 100 0 0 1 09/12/1961 95.3
## 3 1962 Australia 100 0 0 1 95.3
## 4 1963 Australia 100 0 0 1 30/11/1963 95.7
## 5 1964 Australia 100 0 0 1 95.7
## 6 1965 Australia 100 0 0 1 95.7
## womenpar realgdpgr inflation debt_hist deficit ttl_labf labfopar unemp
## 1 0 NA 3.73 40.15 0.46 4215.00 NA 1.25
## 2 0 -0.64 2.29 38.62 -0.36 4286.00 NA 2.46
## 3 0 5.77 -0.32 38.75 -0.79 4382.00 NA 2.32
## 4 0 6.01 0.64 37.34 -0.51 4484.00 NA 1.87
## 5 0 6.26 2.87 35.31 -0.08 4610.80 67.24 1.45
## 6 0 4.99 3.41 53.99 -0.73 4745.95 67.66 1.36
## pop pop15_64 pop65 elderly
## 1 10275.0 6296.5 874.9 8.51
## 2 10508.2 6428.6 894.6 8.51
## 3 10700.5 6571.5 913.6 8.54
## 4 10906.9 6710.9 933.0 8.55
## 5 11121.6 6857.3 948.1 8.52
## 6 11340.9 7014.6 966.3 8.52
# returns a data frame
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
library(readr)
f <- "https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.csv"
d <- read_csv(f, col_names = TRUE)
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
# returns a 'tibble', a new version of a data frame
f <- "https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/CPDS-1960-2014-reduced.txt"
d <- read_tsv(f, col_names = TRUE)
## Rows: 1578 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): country, elect
## dbl (18): year, gov_right1, gov_cent1, gov_left1, gov_party, vturn, womenpar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 20
## year country gov_right1 gov_cent1 gov_left1 gov_party elect vturn womenpar
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1960 Australia 100 0 0 1 <NA> 95.5 0
## 2 1961 Australia 100 0 0 1 09/12… 95.3 0
## 3 1962 Australia 100 0 0 1 <NA> 95.3 0
## 4 1963 Australia 100 0 0 1 30/11… 95.7 0
## 5 1964 Australia 100 0 0 1 <NA> 95.7 0
## 6 1965 Australia 100 0 0 1 <NA> 95.7 0
## # ℹ 11 more variables: realgdpgr <dbl>, inflation <dbl>, debt_hist <dbl>,
## # deficit <dbl>, ttl_labf <dbl>, labfopar <dbl>, unemp <dbl>, pop <dbl>,
## # pop15_64 <dbl>, pop65 <dbl>, elderly <dbl>
# returns a 'tibble', a new version of a data frame for use in the
# {tidyverse}
To load data from a .csv
located file in a
personal Dropbox account you can use
the {rdrop2} package.
NOTE: The following code block cannot be “knit” to
show you the output because it requires an interactive
R environment for drop_auth()
,
drop_search()
, etc. Also, with two-factor authentication
protocols (like Duo for BU accounts), there are likely a number of
cumbersome logins/authentications you need to enter to accomplish
this!
require(rdrop2)
drop_auth() # opens a browser dialog box to ask for authorization...
drop_dir() # lists the contents of your dropbox folder
f <- "CPDS-1960-2014-reduced.csv" # name of the file to read from
f <- drop_search(f) # searches your dropbox directory for file or directory names; this can be slow
f <- f$path # f$path is the location you choose from the results returned above
d <- drop_read_csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
str(d)
This same process can be done to load data from other types of
delimited files in Dropbox by setting the
appropriate sep=
argument.
You can also read text files from someone else’s Dropbox account using a link that they have shared with you. Here’s a link sharing the same file from my BU dropbox:
link <- "https://www.dropbox.com/scl/fi/560bh55pwv3sowbmyygsl/CPDS-1960-2014-reduced.csv?rlkey=sp4xkouegbno3fqrt65o7gb4d&dl=0"
NOTE: Shared Dropbox links like this one will take you to a webpage that has the data embedded… to get the raw data you need to change the end of the link from dl=0 to dl=1 or raw=1. That’s what the next line of code does:
link <- gsub(pattern = "dl=0", replacement = "dl=1", x = link)
d <- read.csv(link, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)
str(d)
You can also use the source_data()
function from the
{repmis} package (“Miscellaneous Tools for Reproducible Research”) to
load data from a file on Dropbox. This
function detects column types and gives a few more warnings than others
if it encounters something odd.
require(repmis)
## Loading required package: repmis
d <- source_data(link, header = TRUE, sep = ",") # use the same updated link to the raw data as above
## Downloading data from: https://www.dropbox.com/scl/fi/560bh55pwv3sowbmyygsl/CPDS-1960-2014-reduced.csv?rlkey=sp4xkouegbno3fqrt65o7gb4d&dl=0
## SHA-1 hash of the downloaded data file is:
## 7ea955a619fb5d67fca2498dff4f2b224117a7ef
head(d)
## window.configureRequire(requireConfig {
## 1 requireFallbackBaseUrl: "https://www.dropboxstatic.com/" NA
str(d)
## 'data.frame': 1 obs. of 2 variables:
## $ window.configureRequire(requireConfig: chr "requireFallbackBaseUrl: \"https://www.dropboxstatic.com/\""
## $ { : logi NA
Finally, you can also load data directly from google drive into R using either the {googledrive} or the {googlesheets4} package. The most seamless way of importing a file from google drive is to convert the file to a google sheet (i.e., a spreadsheet editable within the drive). Try saving the CPDS-1960-2014-reduced file as a google sheet in your own drive, and extracting it into R using the following code.
NOTE: The following code block cannot be “knit” to
show you the output because it requires an interactive
R environment for gs_ls()
,
get_title()
, etc. If the package does not take you to a
webpage login for google sheets upon the gs_ls()
command,
you may need to install the package {httpuv}
separately.
require(googledrive)
require(googlesheets4)
f <- drive_find(n_max = 30, type = "spreadsheet") #This interactive interface will guide you through steps to link your Google Drive. The 'n_max' option will restrict the number of files retrieved to the 30 most recently-added to the linked drive. The 'type' option allows you to only search for a specific kind of file; in this case, spreadsheets (this 'type' will *not* find Excel files, only Google Sheets). It will store the results in an object called a 'dribble'.
# In your dribble, find the Google Sheet you want. The first column is the
# name, and the second is the ID for the sheet. That ID is what you need
# to enter to read your google sheet into R (you may need to log in again
# to access it).
d <- read_sheet("1bzyH5hgif0__q6oW4zfJhxGxLKA073sv8ZYcZpAHTf4")
head(d)
str(d)
For importation of a spreadsheet file not converted to a google sheet in your Drive, it’s a bit more complicated but still possible. You can try uploading your document and extracting it into R using the code suggested by this helpful post from the Dyer Lab.
The {rdrop2} package can be used to download a file from a personal Dropbox account to your local computer, rather than just connecting to a Dropbox file to read the data stored there. The {googledrive} package can do this, as well, with any file in a personal Google Drive account (given you have permission). This should work with any file type.
NOTE: The following code block cannot be “knit” to
show you the output because it requires an interactive
R environment for drop_search()
,
etc.
filename <- "CPDS-1960-2014-reduced.csv" # name of file to download
f <- drop_search(filename) # searches your dropbox directory for that file or directory name
f <- f$path # f$path is the location of the results returned above
drop_get(f, local_file = paste0("~/Desktop/", filename), overwrite = TRUE, progress = TRUE)
# this will save the file to the desktop
The progress=TRUE
argument gives you a reassuring
progress bar. By default, this argument is set to FALSE.
NOTE: The process also works for other file types, e.g., Excel files:
filename <- "CPDS-1960-2014-reduced.xlsx"
f <- drop_search(filename) # searches your dropbox directory for file or directory names
f <- f$path # $path is the location of the results returned above
drop_get(f, local_file = paste0("~/Desktop/", filename), overwrite = TRUE, progress = TRUE)
# again, saves to the desktop
You can also download files directly to your computer from google drive using the {googledrive} package.
drive_find(pattern = "CPDS-1960-2014-reduced") #identifies files that match that name
drive_download("CPDS-1960-2014-reduced.csv")
# in this case, should save to your working directory
More on the basics of using {googledrive} for more refined data downloading, interfacing and formatting can be found here.
R is pretty great, but you may have analyses and datasets already formatted for other statistical programs that you want to bring to R for further analysis or replication. For more advice and tips on how to download other forms of data into R, including datasets and outputs from other statistical packages such as SAS, Stata, and SPSS, check out this webpage.
For folks more intimately familiar with the statistical program MATLAB, there are a number of resources like this handy guide, and an entire package, {R.matlab}, that allows MATLAB users to interface with R.