6 Loading your own data

Questions:

How do I use my data?

Objectives:

Preparing a csv file in ‘tidy’ format
Understanding file system paths
Loading a file to a dataframe
Explicitly describing the file contents

Keypoints:

Data needs to be in a particular format for ggplot to work
Specifying the data type is sometimes necessary when creating a data frame.

6.1 Tidy data

There are many ways to structure data. Here are two quite common ones.

	treatmenta	treatmentb
John Smith	11	2
Jane Doe	16	11
Mary Johnson	3	1

	John Smith	Jane Doe	Mary Johnson
treatementa	11	16	3
treatementb	2	11	1

source: Hadley Wickham

Tables contain two things, variables and values for those variables. In these two tables there are only three variables. treatment is one, with the values a and b . The second is ‘name’, with three values hidden in plain sight, and the third is result which is the value of the thing actually measured for each person and treatment.

For human reading purposes, we don’t need to state the variables explicitly, we can see them by interpolating between the columns and rows and adding a bit of common sense. This mixing up of variables and values across tables like this has led some to call these tables ‘messy’. A computer finds it hard to make sense of a messy table.

Working with R is made much less difficult if we get the data into a ‘tidy’ format. This format is distinct because each variable has its own column explicitly, like this

name	treatment	result
John Smith	a	11
Jane Doe	a	16
Mary Johnson	a	3
John Smith	b	2
Jane Doe	b	11
Mary Johnson	b	1

Now each variable has a column, and each seperate observation of the data has its own row. It is much more verbose for a human, but R can use this easily because we are now explicit about what is called what and how it relates to everything else.

6.2 Getting your data into tidy format

The bad news here is that there is no magic function to make your data tidy. If you have an existing table then you can do this manually in Excel or some other spreadsheet package. If you have lots of data, it is possible to do it programmatically in R, see the dplyr and tidyr packages, which are complex but designed for this purpose. Also have a look at the cheat-sheet here https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.

6.3 Loading in a CSV file

R can deal with a lot of file formats, but the most common and easily used one is ‘csv’, a comma-separated value file. These can be exported from virtually any spreadsheet program so it’s a good interchange format to get data into R from wherever you already have it. Loading a file is done easily with the readr package`.

readr is a tool for loading data into R. It can be loaded on its own with library(readr). We will use readr to load in data from a ‘flat’ .csv file.

6.3.1 read_csv()

The main function is read_csv() which can read a standard comma seperated values file from disk into an R dataframe. There are a few variants of read_csv() which may be appropriate for different sorts of .csv file, but they all work the same.

read_csv2() - reads semi-colon delimited files, which are commonly used where a comma is used as a decimal separator
read_tsv() - reads tab delimited files
read_delim() - reads files delimited by an arbitrary character

The first argument to read_csv() is the path to the file to read. Here I’ll read a file on my Desktop that contains the diamonds data we’ve been using.

library(readr)
read_csv("~/Desktop/diamonds.csv")

This will create an object called a dataframe that can be used just like the built-in data.

6.4 Finding the file

R needs to be given the correct and full path to a file. This means the full address of the file on the hard disk of your computer. R doesn’t have a file chooser so you need to know how to write this down.

Computer file systems are laid out in folders and sub-folders with files inside them. Conceptually, this results in a tree of folders and a path down the branches from the root of the tree to everything else. The root gets called ‘/’ on Mac/Linux computers and ’C:' on Windows computers

FileSystem source: Software Carpentry

This picture of an example file system shows how that is formed. When we write this down, everytime we go inside a new folder we use a slash to show we’ve changed folder. Most computer systems have a ‘Users’ or similar folder in which each users stuff is stored. Supposing we’re in Larry’s folder then the path would be /Users/larry. And a file called my_file.txt in that folder would be /Users/larry/my_file.txt.

So to write the full file path for R we can use this pattern, the first bit would be /Users/username/ (or C:\Documents and Settings\username\ or C:\Users\username\) and then the set of folders within that user area follows on. If your file my_file.txt is on the Desktop the full path would be /Users/username/Desktop/my_file.txt (or C:/Documents and Settings/username/Desktop/my_file.txt)

6.4.1 Make it easy on yourself

The easiest way not to have to think too hard about this stuff is to set up a consistent folder and file structure for every analysis and use RMarkdown documents to run your analysis. Here’s an example scheme:

Create a new folder and call it something relevant to your experiment, e.g disease_incidence_2019-11-01
Within the folder create a sub-folder called raw and a sub-folder called output_images.
Put your tidy csv file in the raw folder.
Create a new R Markdown document and save it in the disease_incidence_2019-11-01 folder.

Now whenever you open and run that R Markdown document, the path of your input file is "raw/my_input_filename.csv". You can save your plots with the ggsave() function to "output_images/filename.png" (don’t forget the quotes).

If you never mess around with the relative positions of the files and folders described, then the paths will always be the same. You can move the whole folder without worrying, just don’t jumble it’s contents.

6.5 Making sure the data types are correct

When we load new data we need to make sure that any header has been properly parsed as column names, and that the columns have been identified as the right sort of data

On loading with readr we see a column specification, read_csv() has guessed at what the columns should be and made those types. Its fine for the most part, but some of those columns we’d prefer to be factors. We can set our own column specification to force the column types on loading. We only have to do the ones that read_csv() gets wrong. Specifically, lets fix cut and color to a factor. We can do that with the col_types argument.

read_csv("~/Desktop/diamonds.csv",
    col_types = cols(
      cut = col_factor(NULL),
      color = col_factor(NULL)
    )
)

6.5.1 Parser functions

This works by assigning a parser function that returns a specific type to each column, here it’s col_factor(). There are parser functions for all types of data, and all of them can be used if read_csv() doesn’t guess your data properly. We won’t go into detail of all of them, just remember that if your numbers or dates or stuff won’t load properly, there’s a parser function that can help.

Once the specification shows what you expect, then you are good to start analysing.

6.6 Quiz

1. Make a new folder called `analysis` on the Desktop
2. Inside `analysis` make a new folder called `raw` and put `example_ros_data_flg22.xlsx` into it.
3. Start a new R Markdown document and save it in `analysis`

Convert raw/example_ros_data_flg22.xlsx into a ‘tidy’ format .csv file and save to raw
Load in the data from the tidy file using read.csv() (Hint: You may need to save a csv version from Excel - R won’t read .xlsx files.)
Check the datatypes and headers using str(), change them if necessary.
Create a plot that shows each data point in each treatment (Col, pp2c38, pp2c48 pp2c38/pp2c48) in each day the experiment was done.
Make sure the plot you generate gets saved to a folder inside analysis called output_images