3 Combining Datasets
3.1 About this chapter
- Questions:
- How do I combine dataframes?
- Objectives:
- Understanding keys
- Explore
join
functions
- Keypoints:
- Dataframes get joined on key columns. The rows that are retained depends on the type of
join
performed
3.2 Joining
Often you will want to combine data contained in more than one dataset. In this section we will look at the functions that help you do that.
3.2.1 Key columns
The joining operation depends on the two datasets having some values in some column in common. The column in each dataset that allows you to combine columns is the key column. Consider these dataframes
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
Note that the two dataframes have a column in common name
.
3.3 Join functions
Join functions work to combine two dataframes side-by-side in some way. Usually they use one column as a base and add columns to that one from the other.
3.3.1 left_join()
The most common sort of join is the left join. This takes one dataframe, considers it to be on the left of the join and combines the second dataframe on to it, skipping rows in the right dataframe that have nowhere to join
left_join( band_members, band_instruments, )
Joining, by = "name"
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
Note how the column in common name
is used as the key through which to join and that the band_member
Keith
goes missing because it isn’t in the left
dataframe, which is the reference.
3.3.2 right_join()
right_join()
is the complementary function.
right_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
See how this time Keith
is retained as we’re joining to the right table as the base, but as he has no entry in the left table, an NA
is used to fill the missing value.
3.3.3 inner_join()
inner_join()
keeps only rows that are completely shared
inner_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3.3.4 full_join()
full_join()
joins all rows as well as possible, generating NA
as appropriate.
full_join( band_members, band_instruments)
Joining, by = "name"
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
3.3.5 Joins with no common column names
What can we do when there is no common column names? Consider this variant of band instruments
band_instruments2
# A tibble: 3 × 2
artist plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
The name
column is called artist - we can join by explicitly stating the column to join by
left_join( band_members, band_instruments2, by = c("name" = "artist"))
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
3.4 Binding operations
These allow you to paste dataframes together.
bind_rows()
sticks them together top-to-bottom.
bind_rows(band_members, band_members)
# A tibble: 6 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
4 Mick Stones
5 John Beatles
6 Paul Beatles
Note the column names need not be identical for this to work. NAs
are propogated as required.
bind_rows(band_members, band_instruments)
# A tibble: 6 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles <NA>
3 Paul Beatles <NA>
4 John <NA> guitar
5 Paul <NA> bass
6 Keith <NA> guitar
bind_cols()
sticks dataframes together side-by-side/
bind_cols(band_members, band_instruments)
New names:
• `name` -> `name...1`
• `name` -> `name...3`
# A tibble: 3 × 4
name...1 band name...3 plays
<chr> <chr> <chr> <chr>
1 Mick Stones John guitar
2 John Beatles Paul bass
3 Paul Beatles Keith guitar
Note how it doesn’t do any sensible matching - it’s just pasting them together. Repeated column names get modified. What happens if the dataframes aren’t of equal length?
<- tibble( names = letters[1:4], values = 1:4)
data_4_rows bind_cols(band_members, data_4_rows)
Error in `bind_cols()`:
! Can't recycle `..1` (size 3) to match `..2` (size 4).
3.5 Quiz
The quiz for this section is mixed in with the quiz for section 5. When you get that far, do the quiz there.