Literate Computing

Author

Dan MacLean

Published

May 1, 2022

Motivation

If we didn’t have to do it over and over, it wouldn’t be called re-search.

Developing a data analysis is hard. It can involve many missteps and changes of mind, from redoing little bits here and there that weren’t quite right the first time, to introducing new ideas or removing whole sections that didn’t work out. This iterative process is completely in line with all other aspects of research, and it means that we have a personal need to record exactly what we’ve done with high accuracy and high reproducibility. It is also our scientific responsibility, and an aspect of scientific integrity, that we are clear and open about the methods we use, as they are key to interpreting and understanding the results that we get. In the jargon of the field we think of this as ‘keeping a proper lab book’, but when it comes to using a computer, what we need to record, and when, is often unclear, and recording it is not always easy. As a result the methods sections of many reports, theses and papers describe scientific computing in a vague and uninterpretable way, making statements like ‘tests were done in Excel’, ‘GenStat was used to perform \(t\)-tests’, or ‘a custom R script was used’. These nebulous statements are useless to anyone trying to understand exactly what was done, and the work they describe cannot be reproduced. That they pass reviewers so often is a clear indication that the reviewing of methods is failing. In practice these sorts of write-ups are no better and no more informative than announcing that statistical analyses were done with a magic spell.

A major failing of graphical user interfaces is that they do not make it easy for us to repeat actions, which is ironic as computers are excellent at repeating instructions very quickly. Scripts and programs are required to get the best reproducibility out of our computers, but scripts in R and Python (and any other computer language) are not easily read by people, even those with a great deal of experience in programming. Reproducible scripts very quickly become unusable lumps of code because users can’t tell what is in them or what they are supposed to do, a phenomenon that has its own acronym: WORN, write once, read never.

Literate programming is the skill of writing code that is readable and understandable, often without the need to read the actual code in any depth. This is a very useful day-to-day skill to have when working in science, as multidisciplinary teams abound. It is also useful when switching from project to project, as we can quickly understand what we were doing in the project we are returning to and get going again. Using literate programming also eases our duty to report clearly and openly what we did in our analysis, as the tasks of writing the code and explaining what we did are accomplished simultaneously.
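To make the idea concrete, here is a minimal Python sketch of what explanation and code living side by side might look like. It is an illustration only: the function name, the variable names and the plant-height measurements are all hypothetical and invented for this example, and the choice of Welch's \(t\) statistic is an assumption made simply to have something to compute.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t_statistic(control, treated):
    """Return Welch's t statistic for two independent samples.

    The statistic is the difference in sample means divided by the
    standard error of that difference, without assuming the two
    groups share a common variance.
    """
    # Each group contributes its own variance, scaled by its sample size.
    standard_error = sqrt(
        stdev(control) ** 2 / len(control)
        + stdev(treated) ** 2 / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical plant heights in centimetres, made up for illustration.
control_heights_cm = [10.0, 12.0, 14.0]
treated_heights_cm = [13.0, 15.0, 17.0]

# Question being asked: are treated plants taller than controls?
t_statistic = welch_t_statistic(control_heights_cm, treated_heights_cm)
print(f"Welch's t statistic: {t_statistic:.3f}")
```

A reader who never looks closely at the arithmetic can still tell from the names, the docstring and the comments which test was run, on which data, and why, which is exactly the information that a statement like ‘a custom script was used’ leaves out.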

In this course we’ll learn how to create a literate programming document in two popular data science languages, R and Python. The two systems share a common core and much of what you learn for one will be applicable to the other.