
Reproducible Research: What to Do When Your Results Can’t Be Reproduced


Even the most sophisticated machine learning methods, the most beautiful visualizations, and a perfect dataset can be useless if you ignore the context in which your code executes. In this article, I will describe a few situations that can destroy your reproducibility. Then I will explain how to resolve these problems and avoid them in the future.

3 danger zones

There are three main areas where things can go wrong if you don’t pay attention:

  1. R session context
  2. Operating System (OS) context
  3. Data versioning

In this blog post, I will focus on the first two areas. They are quite similar because both are related to the software context of research execution. While more and more data scientists pay attention to what happens in their R session, not many are thinking about OS context, which can also significantly influence their research.

Data versioning is a huge topic itself and I will cover it in a separate article in the future. For now, let’s examine some examples of where things can go wrong when you have different R session contexts or various Operating System configurations.

R session context

The basic elements of an R session context that data scientists should be aware of are:

  • R version
  • Package versions
  • Using set.seed() for reproducible randomization
  • Floating-point accuracy

R version

Not many of us monitor the R language changelog on a regular basis, but sometimes crucial changes are made that can break your code. One such situation is described here: a change in R 3.2.1 addressing the "nested arima model has higher log-likelihood" issue led to inconsistent results for arima calculations with d >= 1.
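A simple safeguard is to record the R version alongside every published result. For example (output shown for R 3.3.2, the release used later in this post’s Dockerfile):

> R.version.string
[1] "R version 3.3.2 (2016-10-31)"

For a fuller report covering the R version, the OS, and all loaded packages, use sessionInfo().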

Package versions

Now, let’s see an example with a changing package version. I am going to use dplyr in versions 0.4.0 and 0.5.0:

dplyr 0.4.0

library(dplyr)  # version 0.4.0 attached, so %>% is available

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 8: distinct() kept all columns, leaving rows (1, 1) and (2, 4)

dplyr 0.5.0

library(dplyr)  # version 0.5.0 attached

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)
result <- df %>% dplyr::distinct(x) %>% sum
# result == 3: distinct() now returns only the x column, values 1 and 2

Wow! Version 0.5.0 introduced a functionality-breaking change: in 0.4.0, distinct(x) kept all columns of the deduplicated rows, while in 0.5.0 it returns only the x column, so our result is now completely different.
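To catch such surprises early, record the exact package versions behind each result:

> packageVersion("dplyr")
[1] '0.5.0'

Storing this output alongside your results makes discrepancies like the one above much easier to track down.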

set.seed()

When your code uses randomization and you want to have reproducible results, it is important to remember to set a seed.

> sample(1:10, 5)
[1] 6 7 2 5 9
> sample(1:10, 5) # Different result
[1] 5 7 8 2 3

…and with setting seed:

> set.seed(42)
> sample(1:10, 5)
[1] 10  9  3  6  4
> set.seed(42)
> sample(1:10, 5) # Result is the same
[1] 10  9  3  6  4

The random numbers returned by the generator are in fact a deterministic sequence that can be reproduced; setting a seed tells the generator where in the sequence to start.
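Note that the seed alone does not fully determine the sequence; it also depends on which generator algorithm is active. If you want to be strict, report or pin the generator explicitly (output shown for an R 3.x session, where Mersenne-Twister is the default):

> RNGkind()
[1] "Mersenne-Twister" "Inversion"
> set.seed(42, kind = "Mersenne-Twister")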

Floating-point accuracy

Pay attention to how expressions involving floating-point numbers are evaluated. Let’s assume you want to implement an algorithm described in a recently published scientific paper. You get unexpected results in some cases and start debugging, eventually narrowing the problem down to this:

> x <- 3
> y <- 2.9
> x - y == 0.1
[1] FALSE

What? It can be solved by testing for “near equality”:

> x <- 3
> y <- 2.9
> all.equal(x - y, 0.1)
[1] TRUE
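The reason is that neither 2.9 nor 0.1 has an exact binary representation, so x - y is only approximately 0.1. Printing more digits (a quick diagnostic, not part of the original example) reveals the discrepancy:

> print(x - y, digits = 17)
[1] 0.10000000000000009
> print(0.1, digits = 17)
[1] 0.10000000000000001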

More about floating-point arithmetic can be found here.

Operating System Context

When your research is shared with others and you can’t predict which operating system someone is going to use, many things can go wrong. Any difference in one of the following elements:

  • System package versions
  • System locale
  • Environment variables

can result in different execution results.

Many R packages depend on system packages. You can see these dependencies in a package’s DESCRIPTION file, under the SystemRequirements field. Let’s look at an example with the png package, running it on two different operating systems.

Package png DESCRIPTION file:

Package: png
Version: 0.1-7
...
SystemRequirements: libpng
...
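You can also query this field directly from R (assuming the png package is installed):

> packageDescription("png")$SystemRequirements
[1] "libpng"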

Let’s assume we wrote a custom function savePlotToPNG that saves a plot to a file. We want to write a unit test for it; the idea is to compare the produced result with a previously generated reference image.
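Such a helper might look roughly like this (a minimal hypothetical sketch; the exact plotting code doesn’t matter for this example):

savePlotToPNG <- function(path) {
  png(filename = path)  # open a PNG graphics device (backed by the libpng system library)
  plot(1:10, 1:10)      # stand-in for the real research plot
  dev.off()             # close the device, writing the file to disk
}

We generate reference plots on Ubuntu 16.04: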

reference.png.path <- file.path("referencePlots", "ref0001.png")
savePlotToPNG(reference.png.path)

…and then we run the tests on Debian 8 Jessie, using exactly the same version of the png package:

reference.png.path <- file.path("referencePlots", "ref0001.png")
test.plot.path <- tempfile(fileext = ".png")

savePlotToPNG(test.plot.path)

identical(x = png::readPNG(test.plot.path),
          y = png::readPNG(reference.png.path))
# FALSE! The libpng system package differs between Ubuntu 16.04 and Debian 8

…plots don’t match! This is caused by different libpng system package versions.

Locale and system variables

Ubuntu:

> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "0"    "1"    "-1"   "10"   "-10"  "110"  "-110"

Windows:

> x <- c("-110", "-1", "-10", "10", "0", "110", "1")
> sort(x)
[1] "-1"    "-10"    "-110"   "0"   "1"  "10"  "110"

…okayyy 🙂 String sorting in R depends on the locale, which differs between Windows and Linux systems. Read more about setting locales here.
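To diagnose such differences, check the collation locale with Sys.getlocale(); forcing the C locale gives the same byte-wise ordering on every system:

> Sys.getlocale("LC_COLLATE")   # e.g. "en_US.UTF-8" on Linux
> Sys.setlocale("LC_COLLATE", "C")
> sort(x)
[1] "-1"   "-10"  "-110" "0"    "1"    "10"   "110"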

Tools for R session-level reproducibility

Data scientists should be aware of the traps described above; these are the foundations of reproducible research. Below is a list of tools that help manage package versions:

Packrat – a popular package that allows you to create and manage a private library attached to your project.

Switchr – a handy abstraction over .libPaths(), lib.loc, and other lower-level library settings. Helpful in creating and managing libraries.

Checkpoint – installs packages as they existed on CRAN on a given date; helpful when you need to find older versions of a specific package.
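For example, a typical Packrat workflow looks roughly like this (a sketch; see the Packrat documentation for the full picture):

install.packages("packrat")
packrat::init()       # create a private, project-local package library
# ...install whatever packages the project needs...
packrat::snapshot()   # record exact package versions in packrat/packrat.lock
packrat::restore()    # recreate the same library on another machine

The packrat.lock file is what you commit and share; restore() rebuilds the library from it.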

Tools for OS-level reproducibility

The best approach is to do your research in an isolated environment that can be shared as an image. Inside this environment, you can also use any tool for R session-level reproducibility, like Packrat, Switchr, or Checkpoint.

To create an isolated environment you can use virtualization or containerization. There are many tools that can be used for that: Vagrant, Docker, rkt, and many more. I recommend using Docker because it’s easy to use and has good community support.

Example structure of your ~/project:

Dockerfile
install_libraries.R
R/ <-- contains your research R scripts

Let’s look at a sample Dockerfile that describes the steps necessary for the recreation of an isolated environment:

UPDATE: I simplified the Dockerfile according to Carl Boettiger’s comment.

FROM rocker/rstudio:3.3.2

MAINTAINER Paweł Przytuła "join@wordpress.appsilon.com"

COPY install_libraries.R /code/install_libraries.R

RUN R -f /code/install_libraries.R

install_libraries.R contains code that reproduces an R environment. It can be either some Packrat code or something like this:

install.packages("devtools")
devtools::install_version("dplyr", version = "0.5.0")
devtools::install_version("stringr", version = "1.1.0")
devtools::install_version("forecast", version = "7.3")

After that it’s enough to run the following command to build an image:

docker build -t myimagename:1.0.0 .

When the image is ready, you can either save it as a file using docker save -o myimagename.img myimagename:1.0.0 (and later restore it with docker load -i myimagename.img) or upload it to Docker Hub (steps described here).

Using a previously saved Docker image, you can start a container with the following command:

docker run -d -p 8787:8787 -v ~/project/R:/R myimagename:1.0.0

Because the image is based on rocker/rstudio, you can access an RStudio session via http://localhost:8787 (login rstudio, password rstudio). Your R/ directory will be available inside the running container under the /R path. Awesome, isn’t it?!

With a Docker image like this, you have control over the whole software context needed for reproducible research. You can easily extend it with more packages and share it without fear of losing reproducibility. I recommend this approach for any data science project in which results must be reproduced on more than one machine, and also when working in a larger data science team.