3  Combining Datasets

3.1 About this chapter

  1. Questions:
  • How do I combine dataframes?
  1. Objectives:
  • Understanding keys
  • Explore join functions
  1. Keypoints:
  • Dataframes get joined on key columns. The rows that are retained depends on the type of join performed

3.2 Joining

Often you will want to combine data contained in more than one dataset. In this section we will look at the functions that help you do that.

3.2.1 Key columns

The joining operation depends on the two datasets having some values in some column in common. The column in each dataset that allows you to combine columns is the key column. Consider these dataframes

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Note that the two dataframes have a column in common name.

3.3 Join functions

Join functions work to combine two dataframes side-by-side in some way. Usually they use one column as a base and add columns to that one from the other.

3.3.1 left_join()

The most common sort of join is the left join. This takes one dataframe, considers it to be on the left of the join and combines the second dataframe on to it, skipping rows in the right dataframe that have nowhere to join

left_join( band_members, band_instruments, )
Joining, by = "name"
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

Note how the column in common name is used as the key through which to join and that the band_member Keith goes missing because it isn’t in the left dataframe, which is the reference.

3.3.2 right_join()

right_join() is the complementary function.

right_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

See how this time Keith is retained as we’re joining to the right table as the base, but as he has no entry in the left table, an NA is used to fill the missing value.

3.3.3 inner_join()

inner_join() keeps only rows that are completely shared

inner_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

3.3.4 full_join()

full_join() joins all rows as well as possible, generating NA as appropriate.

full_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

3.3.5 Joins with no common column names

What can we do when there is no common column names? Consider this variant of band instruments

band_instruments2
# A tibble: 3 × 2
  artist plays 
  <chr>  <chr> 
1 John   guitar
2 Paul   bass  
3 Keith  guitar

The name column is called artist - we can join by explicitly stating the column to join by

left_join( band_members, band_instruments2, by = c("name" = "artist"))
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

3.4 Binding operations

These allow you to paste dataframes together.

bind_rows() sticks them together top-to-bottom.

bind_rows(band_members, band_members)
# A tibble: 6 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
4 Mick  Stones 
5 John  Beatles
6 Paul  Beatles

Note the column names need not be identical for this to work. NAs are propogated as required.

bind_rows(band_members, band_instruments)
# A tibble: 6 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles <NA>  
3 Paul  Beatles <NA>  
4 John  <NA>    guitar
5 Paul  <NA>    bass  
6 Keith <NA>    guitar

bind_cols() sticks dataframes together side-by-side/

bind_cols(band_members, band_instruments)
New names:
• `name` -> `name...1`
• `name` -> `name...3`
# A tibble: 3 × 4
  name...1 band    name...3 plays 
  <chr>    <chr>   <chr>    <chr> 
1 Mick     Stones  John     guitar
2 John     Beatles Paul     bass  
3 Paul     Beatles Keith    guitar

Note how it doesn’t do any sensible matching - it’s just pasting them together. Repeated column names get modified. What happens if the dataframes aren’t of equal length?

data_4_rows <- tibble( names = letters[1:4], values = 1:4)
bind_cols(band_members, data_4_rows)
Error in `bind_cols()`:
! Can't recycle `..1` (size 3) to match `..2` (size 4).

3.5 Quiz

The quiz for this section is mixed in with the quiz for section 5. When you get that far, do the quiz there.