6 Loading data from files

6.1 About this chapter

Questions:

How do I get my data into R?

Objectives:

Loading a ‘.csv’ file
Checking column contents
Dealing with headers and column names
Loading a specific sheet from a ‘.xlsx’ file.

Keypoints:

The readr and readxl packages contain functions for loading data from .csv and .xslx files. These functions help you to ensure that your data is loaded as you expect.

6.2 readr

readr is a tool for loading data into R. As part of the tidyverse it is loaded when you use library(tidyverse) but can be loaded on its own with library(readr). We will use readr to load in data from a ‘flat’ .csv file. Most

6.2.1 read_csv()

The main function is read_csv() which can read a standard comma separated values file from disk into an R dataframe. There are a few variants of read_csv() which may be appropriate for different sorts of .csv file, but they all work the same.

read_csv2() - reads semi-colon delimited files, which are commonly used where a comma is used as a decimal separator
read_tsv() - reads tab delimited files
read_delim() - reads files delimited by an arbitrary character

The first argument to read_csv() is the path to the file to read. Here I’ll read a file on my Desktop that contains the diamonds data we’ve been using.

read_csv("~/Desktop/diamonds.csv")

Rows: 53940 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cut, color, clarity
dbl (7): carat, depth, table, price, x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

On loading we see a column specification, read_csv() has guessed at what the columns should be and made those types. Its fine for the most part, but some of those columns we’d prefer to be factors. We can set our own column specification to force the column types on loading. We only have to do the ones that read_csv() gets wrong. Specifically, lets fix cut and color to a factor. We can do that with the col_types argument.

read_csv("~/Desktop/diamonds.csv",
    col_types = cols(
      cut = col_factor(NULL),
      color = col_factor(NULL)
    )
)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <fct>     <fct> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

6.2.2 Parser functions

This works by assigning a parser function that returns a specific type to each column, here it’s col_factor(). There are parser functions for all types of data, and all of them can be used if read_csv() doesn’t guess your data properly. We won’t go into detail of all of them, just remember that if your numbers or dates or stuff won’t load properly, there’s a parser function that can help.

The parser functions all have their own arguments, so we can manipulate those. We can see the NULL argument being passed to col_factor() above, which means ‘all values found should be used as levels of the factor’. This is a great default setting, but if we have a large file, it won’t help us find unexpected values.

Consider a situation where we are certain we should only have the values Fair, Good and Very Good for cut in our diamonds data. We can make the parser function check this for us and give a warning if it finds anything else.

read_csv("~/Desktop/diamonds.csv",
    col_types = cols(
      cut = col_factor(levels = c( "Fair" ,"Good", "Very Good")),
      color = col_factor(NULL)
    )
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <fct>     <fct> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 <NA>      E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 <NA>      E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 <NA>      I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

This time, we get a large number of warnings. Though the output is quite cryptic at first glance, read_csv() is complaining that it found values for cut that were not in the list we passed to the parser function.

Hence we can use parsers to ensure we are loading in the data we expect and generate errors if not.

6.2.3 Headers and column names

By default read_csv() uses the first line of the file for column names. Consider this toy example.

toy_csv <- 
"a,b,c
1,2,3
4,5,6"
read_csv(toy_csv)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

The first line of the toy file becomes the column headings. This may not be appropriate, since there could be some metadata in the file

toy_csv <- 
"some info about stuff
a,b,c
1,2,3
4,5,6"
read_csv(toy_csv)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 3 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): some info about stuff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 3 × 1
  `some info about stuff`
  <chr>                  
1 a,b,c                  
2 1,2,3                  
3 4,5,6

The loaded data gets really messed up, so we can skip a set number of lines if needed

toy_csv <- 
"some info about stuff
a,b,c
1,2,3
4,5,6"
read_csv(toy_csv, skip = 1)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Alternatively, you might have comments that begin with a particular character. You can use the comment argument to skip those lines

toy_csv <- 
"#some info about stuff
#some more info
#goodness, lots of INFO
a,b,c
1,2,3
4,5,6"
read_csv(toy_csv, comment = "#")

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

The data might not have any column names at all, the first row may data. This situation is handled with col_names argument

toy_csv <- 
"1,2,3
4,5,6"
read_csv(toy_csv, col_names = FALSE)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

So, read_csv() sets up arbitrary column names. We can specify column names if we wish

toy_csv <- 
"a,b,c
1,2,3
4,5,6"
read_csv(toy_csv, col_names = c("x", "y", "z"))

Rows: 3 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 3 × 3
  x     y     z    
  <chr> <chr> <chr>
1 a     b     c    
2 1     2     3    
3 4     5     6

6.2.4 Missing values

There are many different ways of encoding missing values, you can tell read_csv() which character represents a missing value explicitly with the na argument. These values will all be loaded as proper NA.

toy_csv <- 
"a,b,c
1,_,3
4,5,_"
read_csv(toy_csv, na = "_")

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1    NA     3
2     4     5    NA

6.3 Writing Files

A complementary function to read_csv() write_csv() allows you to write a dataframe out to a ‘.csv’ file. The convention is straightforward, you need the name of the dataframe and the name of the file and path to write to.

write_csv(diamonds, "~/Desktop/my_data.csv")

6.4 readxl

The readxl package is installed as part of the tidyverse install.packages() command, but it is not part of the core, so library(tidyverse) does not load it. You must do it explicitly with library(readxl).

6.4.1 read_xlsx()

The main function is read_xlsx(), it’s similar to read_csv().

library(readxl)
read_xlsx("~/Desktop/datasets.xlsx")

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

By default it loads the first worksheet, you can examine the sheets available with excel_sheets()

excel_sheets("~/Desktop/datasets.xlsx")

[1] "iris"     "chickwts" "mtcars"   "quakes"

Then load in the one you want.

read_xlsx("~/Desktop/datasets.xlsx", sheet = "chickwts")

# A tibble: 71 × 2
   weight feed     
    <dbl> <chr>    
 1    179 horsebean
 2    160 horsebean
 3    136 horsebean
 4    227 horsebean
 5    217 horsebean
 6    168 horsebean
 7    108 horsebean
 8    124 horsebean
 9    143 horsebean
10    140 horsebean
# ℹ 61 more rows

Loading then follows the same pattern as for read_csv(), with a difference in the column specifications - in this package its much simpler. You can only specify type columnwise and the specification can only be one of “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list” - meaning you can’t do the advanced parsing as for read_csv().

A sample spec might look like

read_xlsx(
  "~/Desktop/datasets.xlsx", 
  sheet = "chickwts",
  col_types = c("numeric", "text")
  )

# A tibble: 71 × 2
   weight feed     
    <dbl> <chr>    
 1    179 horsebean
 2    160 horsebean
 3    136 horsebean
 4    227 horsebean
 5    217 horsebean
 6    168 horsebean
 7    108 horsebean
 8    124 horsebean
 9    143 horsebean
10    140 horsebean
# ℹ 61 more rows