1 Tidy Data

1.1 About this chapter

Questions:

What is tidy data?

Objectives:

Understanding data types
Understanding tidy data structures
Explicitly describing and checking the data structure

Keypoints:

Data needs to be in a particular format for tidy data principles to work

1.2 Tidy data

There are many ways to structure data. Here are two quite common ones.

	treatment A	treatment B
John Smith	11	2
Jane Doe	16	11
Mary Johnson	3	1

	John Smith	Jane Doe	Mary Johnson
treatment A	11	16	3
treatment B	2	11	1

source: Hadley Wickham

Tables contain two things: variables and the values of those variables. These two tables hold only three variables. The first is treatment, with the values a and b. The second is name, with three values hidden in plain sight in the row or column labels. The third is result, which is the value actually measured for each person and treatment.

For human reading purposes, we don’t need to state the variables explicitly; we can see them by interpolating between the columns and rows and adding a bit of common sense. Because variables and values are mixed up across the rows and columns like this, some call these tables ‘messy’. A computer finds it hard to make sense of a messy table.

Working with R is much easier if we get the data into a ‘tidy’ format. This format is distinct because each variable explicitly has its own column, like this:

name	treatment	result
John Smith	a	11
Jane Doe	a	16
Mary Johnson	a	3
John Smith	b	2
Jane Doe	b	11
Mary Johnson	b	1

Now each variable has a column, and each separate observation of the data has its own row. It is much more verbose for a human, but R can use this easily because we are now explicit about what is called what and how it relates to everything else.
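As a minimal sketch of how the first messy table could be reshaped into the tidy one, the example below uses pivot_longer() from tidyr. The object names messy and tidy, and the short column names a and b, are assumptions made for illustration only.

```r
library(tidyr)

# The first "messy" table: one row per person, one column per treatment
messy <- tibble::tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  a    = c(11, 16, 3),
  b    = c(2, 11, 1)
)

# Move the two treatment columns into a treatment column and a result column
tidy <- pivot_longer(
  messy,
  cols      = c(a, b),
  names_to  = "treatment",
  values_to = "result"
)

tidy
```

The rows come out grouped by person rather than by treatment, so the order differs from the table above, but the six observations are the same.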

More generally put, a tidy data set should follow three rules.

from Garrett Grolemund - http://garrettgman.github.io/tidying/

Each variable is in its own column
Each observation is in its own row
The value of a variable in an observation is in a single cell.

1.3 A sample tidy data set

Let’s use a tidy data set that comes with the tidyverse packages. The object diamonds is built in to ggplot2, which is loaded as part of the tidyverse, and can be viewed by typing its name. We’ll use the head() function to look at the top six rows only.

library(tidyverse)
head(diamonds)

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The output tells us that this is a thing called a tibble - this is just a table-like object; more about these later. We can see the size of the tibble - 6 rows, 10 columns (this is truncated because of head(); in reality it’s 53940 rows long). We can see the column headings and we can see each column’s type or, as this is called in R-speak, its class.
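To see the full size rather than the truncated view, the base R functions dim(), nrow() and ncol() can be applied directly. This is a small sketch; it loads ggplot2 on its own because that package supplies diamonds, but library(tidyverse) would work just as well.

```r
library(ggplot2)  # supplies the diamonds data set

dim(diamonds)        # rows then columns: 53940 10
nrow(diamonds)       # number of rows only
ncol(diamonds)       # number of columns only

dim(head(diamonds))  # head() keeps just the first six rows: 6 10
```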

1.3.1 Class

Each of the columns has a particular type or class. Here class is either <dbl>, <ord> or <int>. This tells us what kind of data R thinks is in that column. It’s very important that you and R agree about what sort of data is in each column, otherwise the operations you run can go awry.

Thankfully there are only a few main classes to worry about

num or int or dbl - number types
chr - regular text
fct or ord - A factor (ord marks an ordered factor). A category or names for groups. Discrete values.
lgl - TRUE or FALSE data. Can only have these two values.

Numeric, logical and character are pretty self-explanatory. Factors need a bit more thinking about.
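A quick way to get a feel for these classes is to build one small vector of each kind and ask R what it is. The vector names below are made up for this sketch.

```r
measurement <- c(0.23, 0.21, 0.29)          # numbers with a decimal part
count       <- c(326L, 327L, 334L)          # the L suffix asks for whole numbers
label       <- c("John Smith", "Jane Doe")  # regular text
group       <- factor(c("a", "b", "a"))     # categories
is_treated  <- c(TRUE, FALSE, TRUE)         # TRUE or FALSE only

class(measurement)  # "numeric" (shown as <dbl> in a tibble)
class(count)        # "integer"
class(label)        # "character"
class(group)        # "factor"
class(is_treated)   # "logical"
```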

1.3.1.1 Factors

A factor is a variable that can only take pre-known values, called levels. Often these will be experimental categories or groups. Usually you will know the values of the levels before you even start an experiment. A treatment of a plant with different chemicals could be a factor. Its levels would be names for each treatment studied, e.g. GibberellicAcid, Jasmonate or Auxin. Note that a factor isn’t restricted to describing inputs. In the same way, the sort of response of a plant to a treatment could be a factor, so high, low and hypersensitive could all be levels of an output factor variable in an infection assay.

A factor can have numeric-looking levels. Treatment or response can often be labelled 1, 2, 3 etc, but in a factor these are used as categories, not as actual measurements or numbers. If the values can be replaced by e.g. A, B, C without loss of sense, then the variable is a category and should be encoded as a factor.
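The sketch below illustrates the point with a hypothetical treatment variable coded 1, 2, 3. Once it is a factor, R stores the codes as text labels, and swapping them for A, B, C loses nothing.

```r
# Treatment codes that look like numbers but are really categories
treatment <- factor(c(1, 2, 3, 1, 2, 3))

levels(treatment)      # "1" "2" "3" - the levels are labels, not numbers
is.numeric(treatment)  # FALSE

# Relabel the same categories as A, B, C
relabelled <- factor(treatment,
                     levels = c("1", "2", "3"),
                     labels = c("A", "B", "C"))

table(relabelled)      # two observations in each category
```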

In our diamonds data set, the cut, color and clarity variables are factors - they just happen to be a particular sort of ordered factor.

Factors are what we will group and split our data sets by. We will do statistics, plots and comparisons based on numbers within factor levels.
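As one possible illustration of working within factor levels, the sketch below uses the base R functions table() and tapply() on diamonds. These are just one way to do it and not necessarily the approach used elsewhere in this material; the object names are assumptions for the example.

```r
library(ggplot2)  # supplies the diamonds data set

# How many diamonds fall into each level of cut
count_by_cut <- table(diamonds$cut)
count_by_cut

# Mean price of the diamonds within each level of cut
mean_price_by_cut <- tapply(diamonds$price, diamonds$cut, mean)
mean_price_by_cut
```

Both results have one entry per level of cut, in the order of the factor's levels.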

1.3.1.2 Checking Class Explicitly

The tibble table-like object of our diamonds data does a good job of summarising type. R has some commands for this too.

class() will give you the class(es) of a specific variable (we can use the $ notation to get a single column out of a table-like object such as a tibble)

class(diamonds$cut)

[1] "ordered" "factor"

levels() will tell you all the levels of a factor

levels(diamonds$cut)

[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

str() will give you a summary of whole table-like objects

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
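If you want R to confirm that the columns are what you expect, rather than reading the summary by eye, one option is a set of explicit checks. The sketch below uses base R only; col_classes is a name made up for the example.

```r
library(ggplot2)  # supplies the diamonds data set

# First class of every column, as a named character vector
col_classes <- sapply(diamonds, function(column) class(column)[1])
col_classes

# stopifnot() is silent when every check passes and stops with an error otherwise
stopifnot(
  is.ordered(diamonds$cut),
  is.numeric(diamonds$carat),
  is.integer(diamonds$price)
)
```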

1.4 Quiz

How many levels in the factor color in the diamonds data?
Is the table below ‘tidy’?

country	year	type	count
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

How many variables are contained in the table - how many columns should there be for it to be tidy?