1 Tidy Data
1.1 About this chapter
- Questions:
  - What is tidy data?
- Objectives:
  - Understanding data types
  - Understanding tidy data structures
  - Explicitly describing and checking the data structure
- Keypoints:
  - Data needs to be in a particular format for tidy data principles to work
1.2 Tidy data
There are many ways to structure data. Here are two quite common ones.
| | treatment A | treatment B |
|---|---|---|
| John Smith | 11 | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |

| | John Smith | Jane Doe | Mary Johnson |
|---|---|---|---|
| treatment A | 11 | 16 | 3 |
| treatment B | 2 | 11 | 1 |

source: Hadley Wickham
Tables contain two things: variables and values for those variables. In these two tables there are only three variables. `treatment` is one, with the values `a` and `b`. The second is ‘name’, with three values hidden in plain sight, and the third is `result`, which is the value of the thing actually measured for each person and treatment.
For human reading purposes, we don’t need to state the variables explicitly. We can see them by interpolating between the columns and rows and adding a bit of common sense. This mixing up of variables and values within a table has led some to call these tables ‘messy’. A computer finds it hard to make sense of a messy table.
Working with R is much easier if we get the data into a ‘tidy’ format. This format is distinct because each variable explicitly has its own column, like this:
name | treatment | result |
---|---|---|
John Smith | a | 11 |
Jane Doe | a | 16 |
Mary Johnson | a | 3 |
John Smith | b | 2 |
Jane Doe | b | 11 |
Mary Johnson | b | 1 |
Now each variable has a column, and each separate observation of the data has its own row. It is much more verbose for a human, but R can use this easily because we are now explicit about what is called what and how it relates to everything else.
More generally, a tidy data set follows three rules:
- Each variable is in its own column
- Each observation is in its own row
- The value of a variable in an observation is in a single cell.
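As a rough sketch of how the first ‘messy’ table could be reshaped into the tidy one, the example below uses `pivot_longer()` from tidyr. The data frame `messy` and its column names `treatment_a` and `treatment_b` are made up here for illustration; the values are typed in by hand from the table above.

```r
library(tidyr)

# The first "messy" table, typed in by hand.
# Column names are an assumption made for this sketch.
messy <- data.frame(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatment_a = c(11, 16, 3),
  treatment_b = c(2, 11, 1)
)

# Move the two treatment columns into a treatment/result pair of columns.
tidy <- pivot_longer(
  messy,
  cols = c(treatment_a, treatment_b),
  names_to = "treatment",
  names_prefix = "treatment_",  # strip the prefix so the values are "a" and "b"
  values_to = "result"
)

tidy
```

Each row of `tidy` now holds one name, one treatment and one result, so the six measurements occupy six rows.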
1.3 A sample tidy data set
Let’s use a tidy data set that comes with the tidyverse packages. The object `diamonds` is built in to `ggplot2`, which is loaded as part of the tidyverse, and can be viewed by typing its name. We’ll use the `head()` function to look at the top six rows only:
library(tidyverse)
head(diamonds)
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The output tells us that this is a thing called a `tibble`, which is just a table-like object; more about these later. We can see the size of the tibble: 6 rows and 10 columns (the row count is truncated by `head()`; in reality it’s 53940 rows long). We can see the column headings and we can see the column type or, as this is called in R-speak, its class.
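One way to confirm the full size of the data, rather than the truncated view, is to ask for the dimensions directly. This is a small sketch using base R's `dim()` and `nrow()`; the variable names `full_size` and `shown_rows` are made up for this example.

```r
library(ggplot2)  # provides the diamonds data set

# Rows first, then columns, for the whole data set.
full_size <- dim(diamonds)
full_size

# head() keeps only the first six rows, which is why the printout said 6 x 10.
shown_rows <- nrow(head(diamonds))
shown_rows
```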
1.3.1 Class
Each of the columns has a particular type or class. Here the class is either `<dbl>`, `<ord>` or `<int>`. This tells us what kind of data R thinks is in that column. It’s very important that you and R agree about what sort of data is in each column, otherwise the operations you run can go awry.
Thankfully there are only a few main classes to worry about:
- `num` or `int` or `dbl` - number types
- `chr` - regular text
- `fctr` (shown as `fct` by recent versions of tibble) - A factor. A category or names for groups. Discrete values.
- `lgl` - `TRUE` or `FALSE` data. Can only have these two values.
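To see how these abbreviations relate to what R itself reports, the sketch below asks for the class of one small example value of each kind. The list `examples` and its element names are invented for illustration.

```r
# One hand-made example value per kind of data (names are illustrative only).
examples <- list(
  dbl = 1.5,                    # a decimal number
  int = 2L,                     # a whole number (the L marks an integer)
  chr = "text",                 # regular text
  lgl = TRUE,                   # TRUE or FALSE
  fct = factor(c("a", "b"))     # a factor with two levels
)

# Ask R for the class of each example.
classes <- sapply(examples, class)
classes
```

Note that R reports decimal numbers as `numeric`, which is what the `dbl` and `num` abbreviations refer to.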
Numeric, logical and character are pretty self-explanatory. Factors need a bit more thinking about.
1.3.1.1 Factors
A factor is a variable that can only take pre-known values called levels. Often these will be experimental categories or groups. Usually you will know the values of the levels before you even start an experiment. A treatment of a plant with different chemicals could be a factor. Its levels would be names for each treatment studied, e.g. `GibberellicAcid`, `Jasmonate` or `Auxin`. Note a factor isn’t restricted to describing inputs. In the same way, the sort of response of a plant to a treatment could be a factor, so `high`, `low` and `hypersensitive` could all be levels of an output factor variable in an infection assay.
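A minimal sketch of declaring the levels up front, using base R's `factor()`. The observed values, including the stray `"Water"` entry, are invented for this example.

```r
# The levels we know about before the experiment starts.
treatment_levels <- c("GibberellicAcid", "Jasmonate", "Auxin")

# Hypothetical recorded treatments; "Water" is not one of the known levels.
treatment <- factor(
  c("Auxin", "Jasmonate", "Auxin", "Water"),
  levels = treatment_levels
)

treatment          # the value that is not a known level becomes NA
levels(treatment)  # all three levels are kept, even the unobserved one
```

Declaring the levels explicitly means a mistyped or unexpected value shows up as `NA` instead of silently becoming a new group.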
A factor can have numeric-looking levels. Treatment or response can often be labelled `1`, `2`, `3` etc., but in factors they are used as categories, not actual measurements or numbers. If the values can be replaced by e.g. `A`, `B`, `C` without loss of sense, then the variable is a category and should be encoded as a factor.
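The replacement test can be tried out directly. In this sketch the treatment labels are hypothetical, and `relabelled` is a name chosen for illustration.

```r
# Hypothetical treatment labels recorded as the numbers 1, 2 and 3.
treatment <- factor(c(1, 2, 3, 1, 2, 3))

levels(treatment)      # the levels are stored as text labels
is.numeric(treatment)  # FALSE: R treats the values as categories

# Swap the numeric-looking labels for letters.
relabelled <- factor(
  treatment,
  levels = c("1", "2", "3"),
  labels = c("A", "B", "C")
)

# Every "1" lines up with an "A", every "2" with a "B", and so on.
table(treatment, relabelled)
```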
In our `diamonds` data set, the `cut`, `color` and `clarity` variables are factors; they just happen to be a particular sort of factor, an `ordered` factor, in which the levels have a defined order.
Factors are what we will group and split our data sets by. We will do statistics, plots and comparisons based on numbers within factor levels.
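As a hedged preview of what grouping by a factor can look like, the sketch below counts the diamonds and averages their price within each level of `cut`, using `group_by()` and `summarise()` from dplyr. The names `price_by_cut`, `n_diamonds` and `mean_price` are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# One summary row per level of the cut factor.
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n_diamonds = n(),          # how many diamonds in this level
    mean_price = mean(price)   # average price within this level
  )

price_by_cut
```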
1.3.1.2 Checking Class Explicitly
The `tibble` table-like object of our `diamonds` data does a good job of summarising type. R has some commands for this too.
`class()` will give you the class(es) of a specific variable (we can use the `$` notation to get a single column out of a table-like object such as a `tibble`):
class(diamonds$cut)
[1] "ordered" "factor"
`levels()` will tell you all the levels of a factor:
levels(diamonds$cut)
[1] "Fair" "Good" "Very Good" "Premium" "Ideal"
`str()` will give you a summary of whole table-like objects:
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
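If you want R to check your expectations about the columns rather than reading them off by eye, one possible approach is sketched below. It collects the first class of every column and then uses `stopifnot()`, which raises an error if any condition is false. The name `col_classes` and the particular checks are chosen for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# First class of each column, as a named character vector.
col_classes <- sapply(diamonds, function(column) class(column)[1])
col_classes

# State what we expect; R stops with an error if we and R disagree.
stopifnot(
  is.ordered(diamonds$cut),    # cut should be an ordered factor
  is.numeric(diamonds$carat),  # carat should be a number
  is.integer(diamonds$price)   # price should be a whole number
)
```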
1.4 Quiz
- How many levels are in the factor `color` in the `diamonds` data?
- Is the table below ‘tidy’?
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745 |
Afghanistan | 1999 | population | 19987071 |
Afghanistan | 2000 | cases | 2666 |
Afghanistan | 2000 | population | 20595360 |
Brazil | 1999 | cases | 37737 |
Brazil | 1999 | population | 172006362 |
Brazil | 2000 | cases | 80488 |
Brazil | 2000 | population | 174504898 |
China | 1999 | cases | 212258 |
China | 1999 | population | 1272915272 |
China | 2000 | cases | 213766 |
China | 2000 | population | 1280428583 |
- How many variables are contained in the table - how many columns should there be for it to be tidy?