Werden wir Helden für einen Tag

Week 1: Using Rocker for reproducible data analysis

Posted on Dec 1, 2018 by Chung-hong Chan

Preface: I would like to start a new series and write something lightly technical about R or related tools every week for a year. I don’t want to wait until January to start my New Year’s resolution, so let me start now. This is the first week; let’s see how long I can sustain it. My bet is three weeks.

Before I talk about Rocker, I should give an express summary of what Docker is.

There are actually millions of Docker tutorials out there, so I should not repeat them here. Also, I don’t want to burden these notes with a lot of technical details. I will just give a two-point summary of what Docker is:

  1. Docker is a system to produce isolated and reproducible environments (a.k.a. containers)
  2. A container is a running instance of an image. An image can be built from a Dockerfile. A Dockerfile is a recipe for how your image should be built.

Try to think about the above two points in reverse order. In order to produce a container, one should

  1. Write a Dockerfile to define how your image should be configured
  2. Build your image from your Dockerfile
  3. Run your image

Actually, there are tons and tons of pre-built Docker images available. If you don’t want to customize your image, you can skip steps 1 and 2. When you run a pre-built Docker image for the first time, Docker will fetch it from the internet and then run it. Suppose you have Docker installed, try this example:

docker run -p 8787:8787 -e PASSWORD=abc123 rocker/rstudio

You don’t need to know every single detail of the above command [1]. But you should see Docker downloading something from the internet (the image named rocker/rstudio from Docker Hub) and then running it.

Now you have a running Docker container.

In order to see the result, you can use your browser to access http://localhost:8787

Log in with the username rstudio and the password abc123. And boom, you have a running RStudio Server that you can play with.

You can stop the Docker container by pressing Ctrl + c.

To prove my point, you can rerun the above command:

docker run -p 8787:8787 -e PASSWORD=abc123 rocker/rstudio

You should not see the downloading screen this time, and the container starts immediately. It is because you already have the image rocker/rstudio on your computer. You may start another terminal session and use this command to list all the images available on your computer.

docker images

You can also have a look at all the running containers

docker container ls

There should be a column called IMAGE. And you should see rocker/rstudio in there. Once again, it proves my point: A container is a running instance of an image.
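
If you want to check up front whether an image is already on your computer (and therefore whether the first docker run will trigger a download), something like the following could work. It is only a sketch: have_image and report_image are names made up for this note, and it assumes that docker image inspect exits with a non-zero status when the image is not stored locally.

```shell
# Succeeds (exit status 0) if the named image is already stored locally.
# Assumption: `docker image inspect` fails when the image is unknown locally.
have_image() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Report whether `docker run` would need to download the image first.
# (report_image is a hypothetical helper, not part of Docker or Rocker.)
report_image() {
    if have_image "$1"; then
        echo "$1: already on this computer"
    else
        echo "$1: will be downloaded on first run"
    fi
}
```

After loading these two functions into your shell, call report_image rocker/rstudio to see which case applies.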

So, now you know how to run a container. It’s time to visit two important features of Docker containers: isolation and reproducibility.

Suppose you have created something with your RStudio session in your container instance. Let’s say you have created an R script and saved it as “abc.R”. Or maybe you have installed a package called readODS with:

install.packages("readODS")

Now you stop your Docker container by pressing Ctrl + c. The first thing you will notice is that the file abc.R is nowhere to be seen on your host. Your container is a container because it is isolated from your host.

And now relaunch your container with the same command.

docker run -p 8787:8787 -e PASSWORD=abc123 rocker/rstudio

Now, even under your new RStudio session, you cannot see your abc.R. Also, the package readODS is not installed.

It is because the process of creating a container is reproducible. Every docker run creates a fresh container from the image, so you always start from the same initial state. Therefore, by default, a Docker container provides no permanent storage: whatever you did in a previous container session does not carry over to the next one.
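
A side note on cleanup, as a minimal sketch: a plain docker run leaves the stopped container behind on disk (hidden from docker container ls, but listed when you add the -a flag), while the --rm flag deletes the container as soon as it exits. The function names and the default password below are just placeholders for this note.

```shell
# Start a throwaway RStudio container. --rm deletes the container
# (and anything written inside it) as soon as it stops.
# $1 = password (defaults to the example password used in this post).
run_rstudio_throwaway() {
    password="${1:-abc123}"
    docker run --rm -p 8787:8787 -e PASSWORD="$password" rocker/rstudio
}

# List every container, including stopped ones left behind
# by earlier runs that did not use --rm.
list_all_containers() {
    docker container ls -a
}
```

Either way, the next docker run starts again from the unmodified image.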

Why is it important? Before I answer that, I would like to talk about Docker volumes. Volumes provide us with permanent storage.

Suppose you have developed an R script on your local computer to do an analysis. That script is located in your current directory. If you launch your container with this:

docker run -v "$(pwd)":/home/rstudio -p 8787:8787 -e PASSWORD=abc123 rocker/rstudio

You should be able to see the files from your current directory in RStudio. You can even modify those files with RStudio, and the changes will be saved on your host machine. RStudio will also produce some logs in your current directory [2].
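
If you launch RStudio this way often, the command can be wrapped in a small shell function. This is only a sketch under my own naming: rstudio_here is a hypothetical helper, and the defaults (current directory, the example password abc123) simply mirror the command above.

```shell
# Launch RStudio Server with a host directory mounted at /home/rstudio.
# $1 = host directory (default: the current directory)
# $2 = password (default: the example password used in this post)
rstudio_here() {
    host_dir="${1:-$PWD}"
    password="${2:-abc123}"
    # Quoting "$host_dir" keeps paths that contain spaces intact.
    docker run \
        -v "$host_dir":/home/rstudio \
        -p 8787:8787 \
        -e PASSWORD="$password" \
        rocker/rstudio
}
```

Calling rstudio_here without arguments mounts the directory you are in; rstudio_here /path/to/project mounts another one.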

Once again, why is it important?

When we talk about “reproducible data analysis”, we mostly focus on two things: 1) sharing of research data and 2) sharing of code. However, one thing is missing from this discussion: a reproducible environment. Docker solves the ‘the code works on my computer’ problem by clearly specifying what the running environment should look like, documenting it and then reproducing it. Be it on a Windows laptop, a Mac Pro or an AWS instance, one can reproduce exactly the same environment [3] so that the (shared) code can run. Traditionally, this can be achieved with virtual machines (e.g. Vagrant boxes). But Docker is a scalable alternative [4].

For example, suppose you have an R script that must be run under R version 3.4.0 and with version 1.6.4 of readODS. It is difficult to reproduce such an environment on a real machine now, because both the R version and the readODS version are outdated. With Docker, it is very easy to reproduce such an environment. We can create a Dockerfile to describe how our image should be configured.

First, we can find a specific version of Rocker (a dockerized version of R) on Docker Hub. From there, I found a version with R 3.4.0 and tidyverse [5].

So, in a plain text file called Dockerfile, write and save:

FROM rocker/tidyverse:3.4.0
CMD ["R"]

A detailed explanation of the Dockerfile format is available here. From there, build an image called readods and run it with

docker build -t readods .
docker run -ti --rm readods

In the R prompt, you can see that it is actually R 3.4.0. The next question is how to make sure that we have readODS version 1.6.4. Because the Rocker image has devtools installed, we can use devtools::install_version() to install a specific version of a package, provided that version is on CRAN. We edit the Dockerfile:

FROM rocker/tidyverse:3.4.0
RUN R -e "devtools::install_version('readODS', version = '1.6.4', repos = 'http://cran.us.r-project.org')"
CMD ["R"]

Rebuild our image and run it.

docker build -t readods .
docker run -ti --rm readods

In the R prompt, you may require(readODS) and have a look at sessionInfo(). We have reproduced the required environment!
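
The same check can be done without opening an R prompt. Here is a rough sketch: check_pkg_version is a name I made up, and it assumes that Rscript is on the PATH inside the image (it ships with R) and that the image sets no ENTRYPOINT, so the arguments after the image name replace the default CMD.

```shell
# Compare the version of an R package inside an image with an expected version.
# $1 = image name, $2 = package name, $3 = expected version.
# (check_pkg_version is a hypothetical helper, not part of Docker or Rocker.)
check_pkg_version() {
    image="$1"
    pkg="$2"
    expected="$3"
    # Replace the image's default command with a one-off Rscript call
    # that prints only the installed version string.
    actual=$(docker run --rm "$image" Rscript -e "cat(as.character(packageVersion('$pkg')))")
    if [ "$actual" = "$expected" ]; then
        echo "OK: $pkg $actual"
    else
        echo "MISMATCH: $pkg is $actual, expected $expected"
        return 1
    fi
}
```

For the image built above, the call would be check_pkg_version readods readODS 1.6.4. The non-zero return status on a mismatch makes it usable in scripts.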

Because our Dockerfile is just a plain text file, you can send it to your friends so that they can reproduce the same environment. You may also push it to GitHub, or bundle it together with your shared code and data.
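
And because it is plain text, the same recipe can be generated for other version pairs. The following is a sketch only: make_dockerfile is a hypothetical helper, it assumes a rocker/tidyverse tag exists for the R version you ask for, and it assumes the package version is still available from CRAN.

```shell
# Print a Dockerfile that pins an R version (via the rocker/tidyverse tag)
# and one CRAN package version (via devtools::install_version).
# $1 = R version / image tag, $2 = package name, $3 = package version.
make_dockerfile() {
    r_version="$1"
    pkg="$2"
    pkg_version="$3"
    # Unquoted heredoc: the shell variables are expanded, quotes are kept literally.
    cat <<EOF
FROM rocker/tidyverse:${r_version}
RUN R -e "devtools::install_version('${pkg}', version = '${pkg_version}', repos = 'http://cran.us.r-project.org')"
CMD ["R"]
EOF
}
```

Running make_dockerfile 3.4.0 readODS 1.6.4 > Dockerfile writes the same three lines as the Dockerfile above, which you can then build with docker build as before.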

I hope this post is motivational enough for you to consider using Docker. There are more interesting things to explore about R and Docker. For example, there are containerit (automatic generation of a Dockerfile based on the current R environment) and stevedore (a Docker client in R).

BTW, if you want to delete your image, run:

docker image rm readods

One down, 51 to go.


  1. -p publishes a container’s port to the host. -e sets an environment variable. In this case, we set the environment variable PASSWORD to abc123. That environment variable is used to set up our container.

  2. ./rstudio and ./.rstudio; please consider deleting them.

  3. The environment is reproduced down to the OS-level. If you are using Rocker, it is always Debian Linux. 

  4. For example, you are working in a container on your MacBook with only 4GB of RAM. Now you find that you don’t have enough RAM for your data analysis and need a meaty machine. You can rent an AWS instance with 64GB of RAM and launch the same container there. Your container can take advantage of all 64GB of RAM.

  5. I don’t mind having tidyverse also. The reason for using tidyverse instead of base is the tidyverse version has devtools installed. 

