How do I go from non-tidy to tidy data structures?
Objectives:
Manipulating dataframes with the tidyr verbs
Keypoints:
The tidyr package contains functions that affect the layout and structure of dataframes.
5.2 tidyr
tidyr is a tool for manipulating layout of datasets. As part of the tidyverse it is loaded when you use library(tidyverse) but can be loaded on its own with library(tidyr). tidyr has two main functions - spread() and gather().
5.2.1 Sample tidy datasets
Let’s look at five sample tables that show the same data in different ways, only one of which counts as tidy.
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
The tidy data is in table1.
table2 is not tidy because not every variable has its own column. The count column has values for cases and population and these are two different things. The type column could be a variable on its own, but as used here its a way to mix up the count variable unneccesarily.
table3 is not tidy because rate is currently a function of two variables - literally its printed as cases/population. The column rate should contain the result of cases/population and if we wanted to retain the case and population information it should be in its own column, like in table1
table4a and table4b aren’t tidy, because the data are split over two tables and in each table the values of the year variable are split over multiple columns.
Let’s work with each of these non-tidy datasets in turn to get them tidy.
5.3 pivot_longer()
The most common problem is that in table4a, where the values of a variable are split over multiple columns.
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
To tidy this, we can use pivot_longer(), which increases the number of rows and decreases the number of columns. It needs three bits of information:
The column(s) to keep un-pivoted - these are columns that are likely already tidy. All other columns will be pivoted
The name of a new column in which to put the old column names
The name of a new column in which to put the old values
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Note how we use the -country syntax to mean ‘pivot everything but country’. The names_to argument tells pivot_longer() where to put the names, and the values_to argument specifies where the numbers should go.
Note too how the columns we pivoted (1999 and 2000) have been dropped from the table completely. This little table is now tidy.
We can do the same with table4b but this one has the value population
# A tibble: 6 × 3
country year population
<chr> <chr> <dbl>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Note we don’t need to specify the by = argument, since the two tables have column names in common - left_join() works that out and does the join automatically.
5.4 pivot_wider()
The inverse function to pivot_longer() is pivot_wider() which increases column number and decreases row count. This function needs two pieces of information
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
5.5 separate()
The seperate() function turns one column into many by splitting the value whenever a particular character appears. Remember table3
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
We can separate that rate column into two - cases and population
table3 %>%separate(rate, into =c("cases", "population"))
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
By default separate() splits things on any non-numeric character. But we can be explicit with the sep argument.
table3 %>%separate(rate, into =c("cases", "population"), sep ="/")
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
This works just as well, but is useful if the computer makes a bad guess.
Note the column types of cases and population, they’re down as chr. By default the type of the parent column is retained, but you can make separate() guess what type the new column is with the convert argument.
table3 %>%separate(rate, into =c("cases", "population"), sep ="/", convert =TRUE)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
And now, we’re back to tidy data.
5.6 unite()
The unite() function is the inverse of separate(), and combines multiple columns into a single one.
To demonstrate unite() we can make a new table, table5 by using the separate() function on the year column in table3. Passing sep a number, tells it just to split that many characters into the string
# A tibble: 6 × 4
country century year rate
<chr> <int> <int> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 0 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 0 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 0 213766/1280428583
We can now re- unite()table5. The arguments for this function are just the name of the new column, and the columns to join
table5 %>%unite(new, century, year)
# A tibble: 6 × 3
country new rate
<chr> <chr> <chr>
1 Afghanistan 19_99 745/19987071
2 Afghanistan 20_0 2666/20595360
3 Brazil 19_99 37737/172006362
4 Brazil 20_0 80488/174504898
5 China 19_99 212258/1272915272
6 China 20_0 213766/1280428583
Here the default is to use an underscore _ to join the values, but we can be explicit and use nothing with the sep argument
table5 %>%unite(new, century, year, sep="")
# A tibble: 6 × 3
country new rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 200 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 200 80488/174504898
5 China 1999 212258/1272915272
6 China 200 213766/1280428583
5.7 Quiz
Examine the table1 and table4a datasets. Combine table4a to table1 to create two new columns. Ensure the columns make sense and retain data integrity.
Tidying data is hard! And it needs you to know your data quite well, which naturally takes time. Rather than quiz questions here, a worked example will give good benefit, so let’s try one of those. Do the Case Study on page 163 of R for Data Science. If you don’t have the print edition it is available online here Case Study for Tidy Data.