Werden wir Helden für einen Tag


rang 0.3.0 or why computational reproducibility is not about recipes

Posted on Oct 10, 2023 by Chung-hong Chan

rang 0.3.0 is on CRAN. Similar to rio from 1.0.0 onward, rang is now under the umbrella of GESIS TSA. The biggest change is support for Apptainer (thanks to Egor Kotov for the contribution). A relatively minor improvement is the introduction of use_rang() to enhance existing research compendia.

Previously on this blog: rang 0.1.0, 0.2.0

I could write another blog post introducing all the features, but I would rather use an example instead. Before I dive into that example, I would also like to do a little housekeeping and talk about computational reproducibility in general.

Computational Reproducibility

My team recently wrote a preprint about computational reproducibility. We define computational reproducibility as “A result is reproducible when the same analysis steps performed on the same dataset consistently produce the same answer. This reproducibility of the result can be checked by the original investigators and other researchers with the nonrestrictive reproducible materials.” Note: we are talking about the reproducibility of a result, not the reproducibility of building a computational environment or any other tech issue.

To be honest, I am not an expert in this field. My team has several veterans and I am the greenest of them all: I only have around one year of experience tackling the computational reproducibility of science (or the “computational social” kind, whatever this term means). The more I dive into this issue, the more I realize it is almost entirely not a technical issue. As we argue in the linked preprint, it is more of an incentive issue. No tool can fix the reluctance to share data and code. I feel more and more that discussions about using renv, Docker, NixOS, or Pipapo to foster computational reproducibility are increasingly irrelevant to the actual issue on the ground. Those who are willing to share their code and data are avant-garde researchers already. Unfortunately, this group of avant-garde researchers is a super-informed super-minority. Unfortunately also, software tools such as rang, Docker, NixOS, or whatnot do not have any wider appeal beyond these avant-garde scientists. And in my opinion, there is also an oversupply of such tools. In my previous post about executable research compendia, I wrote that there are rrtools, rcompendium, template, manuscriptPackage, and ProjectTemplate in the R world alone 1.

As written previously, I struggled for quite a long time over whether we should introduce yet another research compendium format to the R world. I am so glad that the final answer is no. Instead, we introduced use_rang() to enhance existing research compendia. But is it useful? Let’s talk about the world’s most famous research compendium, and I will explain to you why reproducibility is not about recipes.

Boettiger (2018)

As of writing, Boettiger (2018) (github link) is a five-year-old paper. The author, Professor Carl Boettiger, is probably one of the most avant-garde scientists in the realm of Open Science: he is a co-founder of the Rocker project. He wrote this 2018 paper with computational reproducibility in mind. The research compendium was adapted into another format by Anna Krystalli. In my opinion, this paper should be considered the Stanford Bunny of research compendia.

But 5 years is a long time. The original research compendium uses Rocker and Binder. I tried to launch Binder by clicking the Binder button on the research compendium’s GitHub repo. It does not work. The Rocker-3.5.0-based container can’t fetch any CRAN package. The research compendium actually requires one GitHub package. Ironically, that GitHub package can be installed, which is totally contrary to our usual assumptions about CRAN versus GitHub packages.

This is an example of software rot. The research compendium and its technology don’t change, but the world surrounding them changes rapidly. Unlike most scientific works such as grants and papers, software is never done. You can consider a paper written, but the development of the technology surrounding the paper never stops. The Rocker image used remains frozen as it was 5 years ago.

The Rocker 3.x line still uses MRAN (Microsoft R Application Network), which was shut down in July this year. Also, Rocker 3.x is based on Debian Stretch, whose support ended on 2020-07-06. It is not surprising at all that these Rocker images no longer work out of the box.

Here, I want to highlight a point from The Turing Way: merely sharing the container recipe (be it a Dockerfile or, by extension, non-container recipes such as a Nix configuration file) is not reproducible in the long run, because there is no way to ensure that rebuilding the container image, which involves a lot of moving parts on the internet (e.g. MRAN, Debian repositories), will run flawlessly in the future. If you want to ensure that people in the future have a compatible environment to run your code, you should share the container image. If you use NixOS without any form of containerization, share your hard drive image.

This is exactly the reverse of what you might have heard about preserving software: preserve the source code (that is, the recipe) rather than the binary. But there are many differences between compiling the source code of a piece of software and rebuilding a container image. Compiling source code usually doesn’t pull anything from the internet (well, unless it’s stringi, which pulls the ICU library from GitHub; or something like Rust crates). In fact, there is some evidence that one should preserve both the source code and the binary. I have watched enough Modern Vintage Gamer to know that one might not be able to compile the source code for a variety of reasons. But if one has the binary, it is extremely likely that an emulator or a compatibility layer will be available in the future. Look at all those illegal ROM files for the Atari 2600 on the internet.

Going back to the container discussion, I bet my money that binaries such as Docker or Apptainer image files created now will still run in 2033, probably via an emulator. But it is almost a sure thing that Dockerfiles or Apptainer definition files created now won’t work in 2033, if they require files (dependencies) to be in the same pristine internet locations as in 2023.

That being said, can we fix the (minor) issues of the container building task so that the container can be built in 2023? Of course we can. And here begins the journey. But does it mean we have reproducibility? Read on!

Building the container image

My first step is to clone the repository.

git clone https://github.com/cboettig/noise-phenomena.git
cd noise-phenomena

Now, I use rang::use_rang() to “enhance” this compendium. Suppose I want to use Apptainer for this.

Rscript -e "rang::use_rang(apptainer = TRUE)"

It adds several things to the compendium, notably inst/rang/update.R (I will just say update.R from now on). Running this script does a lot of things; most importantly, it scans the current here() directory for all R packages used. Out of the box, this file works in many cases, but not here: we will need to edit it multiple times today.

A Makefile is also added. I edited the handle of the project to, let’s say, “boettiger”.
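For orientation, here is a hypothetical sketch of the kind of Makefile involved, with the handle edited to “boettiger”. The target names (update, build, bash) and the image name boettigerimg.sif are taken from later in this post; the exact recipe lines are illustrative, not the verbatim generated file:

```makefile
# handle used to name the Apptainer image (edited to "boettiger")
handle = boettiger

update:
	Rscript inst/rang/update.R

build:
	apptainer build $(handle)img.sif container.def

bash:
	apptainer shell $(handle)img.sif
```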

Now, I need to edit update.R. By default, it should look like this:

library(rang)
library(here)

cran_mirror <- "https://cloud.r-project.org/"

## Please note that the project scanning result should be checked manually.
## 1. Github packages must be added manually
## as_pkgrefs(here::here())
## 2. You might also want to change the `snapshot_date` to a fix date, when
## the project is finalized.

rang <- resolve(here::here(),
                snapshot_date = NA,
                verbose = TRUE)

## You might want to edit `post_installation_steps` or `cache`
apptainerize(rang, output_dir = here::here(), verbose = TRUE, cache = TRUE,
             post_installation_steps = c(recipes[["make"]], recipes[["texlive"]], recipes[["clean"]]),
             insert_readme = FALSE,
             copy_all = TRUE,
             cran_mirror = cran_mirror)

I set snapshot_date to a fixed date, let’s say “2018-05-22” (the date the original paper was published).

rang <- resolve(here::here(),
                snapshot_date = "2018-05-22",
                verbose = TRUE)

Now, I run make update (to run update.R). It looks okay, but it can’t query the package regimeshifts, which is a GitHub package.

Querying:  cran::Matrix 
Querying:  cran::magrittr 
Querying:  cran::Rcpp 
Query failed:  cran::regimeshifts

Now, I need to modify update.R so that it knows regimeshifts is a GitHub package. There are two ways to do this.

The first way is to use the new feature of rang 0.3.0 that reads the DESCRIPTION file in the compendium (thanks to David Schoch for contributing this).

all_pkgs <- as_pkgrefs(here::here("DESCRIPTION"))
all_pkgs ## github::cboettig/regimeshifts is there
rang <- resolve(all_pkgs,
                snapshot_date = "2018-05-22",
                verbose = TRUE)

The second way is to manually modify the scanned result. I will take the second method, as most of the compendia out there do not have a DESCRIPTION file.

all_pkgs <- as_pkgrefs(here::here())
all_pkgs[all_pkgs == "cran::regimeshifts"] <- "github::cboettig/regimeshifts"

rang <- resolve(all_pkgs,
                snapshot_date = "2018-05-22",
                verbose = TRUE)

After the modification, I run make update again. It looks okay, and the container definition file container.def is generated.

Let’s move on to building the container image: that’s make build. The MRAN issue doesn’t bother me, because rang actually caches all R packages locally; we don’t need to query the defunct MRAN to install R packages during the image build. It runs, until it can’t install the system dependency libcurl4-openssl-dev for the R package curl.

Actually, if you run this locally:

all_pkgs <- as_pkgrefs(here::here())
all_pkgs[all_pkgs == "cran::regimeshifts"] <- "github::cboettig/regimeshifts"

rang <- resolve(all_pkgs,
                snapshot_date = "2018-05-22",
                verbose = TRUE)
rang$sysreqs

rang knows that it should install libcurl4-openssl-dev for curl. But as I said, the Rocker image (which rang also uses) is based on Debian Stretch, and it can’t install those apt packages from the outdated repositories. I can also see that in the trace.

+ apt-get update -qq
W: The repository 'http://security.debian.org/debian-security stretch/updates Release' does not have a Release file.
W: The repository 'http://deb.debian.org/debian stretch Release' does not have a Release file.
W: The repository 'http://deb.debian.org/debian stretch-updates Release' does not have a Release file.
E: Failed to fetch http://security.debian.org/debian-security/dists/stretch/updates/main/binary-amd64/Packages  404  Not Found [IP: 2a04:4e42:400::644 80]
E: Failed to fetch http://deb.debian.org/debian/dists/stretch/main/binary-amd64/Packages  404  Not Found [IP: 2a04:4e42:8d::644 80]
E: Failed to fetch http://deb.debian.org/debian/dists/stretch-updates/main/binary-amd64/Packages  404  Not Found [IP: 2a04:4e42:8d::644 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.

After fiddling with it for a while, I think the easiest method is to edit the container definition file itself. To illustrate what I have edited, I think it’s best to display the diff:

@@ -16,14 +16,18 @@
 
 export CACHE_PATH=inst/rang/cache
 export RANG_PATH=inst/rang/rang.R
+sed -i -e 's/deb.debian.org/archive.debian.org/g' \
+           -e 's|security.debian.org|archive.debian.org/|g' \
+           -e '/stretch-updates/d' /etc/apt/sources.list
 apt-get update -qq \
        && apt-get install -y libpcre3-dev zlib1g-dev pkg-config libcurl4-openssl-dev \
-       && apt-get install -y libcairo2-dev libcurl4-openssl-dev libfontconfig1-dev libfreetype6-dev libfribidi-dev libglpk-dev libgmp3-dev libharfbuzz-dev libicu-dev libjpeg-dev libpng-dev libssl-dev libtiff-dev libxml2-dev make pandoc zlib1g-dev
+       && apt-get install -y libcairo2-dev libcurl4-openssl-dev libfontconfig1-dev libfreetype6-dev libfribidi-dev libglpk-dev libgmp3-dev libharfbuzz-dev libicu-dev libjpeg-dev libpng-dev libssl-dev libtiff-dev libxml2-dev make pandoc zlib1g-dev libxt-dev
 Rscript $RANG_PATH
+Rscript -e "hrbrthemes::import_roboto_condensed()"
 ## install GNU make
 apt-get -y install make
 ## install texlive
-apt-get install -y pandoc pandoc-citeproc texlive
+apt-get install -y pandoc pandoc-citeproc texlive-full
 ## Clean up caches
 rm -rf /var/lib/apt/lists/* \
        && if [ -d "$CACHE_PATH" ]; then rm -rf $CACHE_PATH; fi

What I have done is edit /etc/apt/sources.list so that the archive repositories are used. Another edit is the additional system dependency libxt-dev, which the R package Cairo needs but which is not listed in Cairo’s DESCRIPTION file. And finally, I need to run the missing step hrbrthemes::import_roboto_condensed() and switch to texlive-full. Read on; I will come back to fonts and texlive later.
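The sources.list rewrite can be tried in isolation, outside any container. A simplified sketch (using a sample Stretch-era sources.list rather than a real /etc/apt/sources.list, and assuming GNU sed for the -i flag):

```shell
#!/bin/sh
# A sample Debian Stretch sources.list, as it would look in the old Rocker image
cat > sources.list <<'EOF'
deb http://deb.debian.org/debian stretch main
deb http://security.debian.org/debian-security stretch/updates main
deb http://deb.debian.org/debian stretch-updates main
EOF

# Point apt at the frozen archive repositories and drop stretch-updates,
# for which no archived counterpart exists
sed -i -e 's|deb.debian.org|archive.debian.org|g' \
       -e 's|security.debian.org|archive.debian.org|g' \
       -e '/stretch-updates/d' sources.list

cat sources.list
```

After this, apt-get update can fetch package indices again, because archive.debian.org still serves the frozen Stretch repositories.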

The problem, however, is that this edit will be overwritten when we run make update again. What I did was create a diff file:

diff -u container.def container_edited.def > container.def.patch

And add this to the Makefile:

update:
    Rscript inst/rang/update.R
    patch < container.def.patch
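The diff-then-patch workflow is generic and worth sketching on its own: record the manual edits as a patch once, then reapply them after every regeneration. A minimal round-trip with stand-in file contents (the file names mirror the ones above; the contents are illustrative):

```shell
#!/bin/sh
# Pretend container.def is the freshly regenerated (unedited) definition file
printf 'line one\nline two\n' > container.def
# container_edited.def holds our manual fixes
printf 'line one\nline two patched\n' > container_edited.def

# Record the manual edits as a patch (diff exits with 1 when files differ)
diff -u container.def container_edited.def > container.def.patch || true

# After `make update` regenerates container.def, reapply the edits
patch container.def < container.def.patch

# The regenerated file now matches the edited one again
cmp container.def container_edited.def && echo "edits reapplied"
```

This is why the Makefile trick works: make update overwrites container.def, and the patch step immediately restores the manual fixes.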

Now, I can make build without any problem. After running this, I get the container image boettigerimg.sif. Now, my friends, this is the file you should also share alongside your research compendium (if it works).

Reproducible?

LaTeX

Now, we can launch a bash shell in the container by running make bash. I can make sure the R part of the entire pipeline works. For example, try:

cd paper
Rscript -e "rmarkdown::render('paper.Rmd', output_format = 'html_document')"

It runs fine. The problem, however, is \(\LaTeX\).

## don't work
Rscript -e "rmarkdown::render('paper.Rmd')"

I actually don’t know which \(\LaTeX\) flavor was used by Professor Boettiger to render the R Markdown (it is not documented), but it was probably not the pdflatex of a vanilla texlive-full under Rocker. The good (or bad) thing about Apptainer is that I can get the artefact out of the container immediately. I got paper.tex out and tried to render it with a more modern latexmk, which decided on LuaHBTeX. I can get a PDF file of up to page 3. No more.

However, I can render appendixA/appendixA.Rmd to appendixA/appendixA.pdf. appendixB/appendixB.Rmd is another story. It gets stuck at one point and says “You are recommended to install the tinytex package to build PDF.FALSE”.

It actually shows that the \(\LaTeX\) environment is surprisingly not reproducible; we usually don’t document how \(\LaTeX\) documents were rendered and assume \(\LaTeX\) will “just work”. Also, it shows that I am not a \(\LaTeX\) wizard.

External dependencies

The file appendixA/Dai-Figure.R appears to be an additional exploratory analysis, as no figures or tables are generated from it.

I can’t get it to run, because it downloads a file from Dryad. The URL of the Dai et al. (2012) dataset does not exist anymore. I think the dataset is also available from a later paper by Dai et al. (2015), but I am not sure.

Similar to the many moving parts when building container images, we cannot assume things don’t change on the internet. In our preprint, we define external dependencies as:

the parts in the research pipeline that depend on external entities (e.g. APIs or external libraries). External dependencies are a barrier to computational reproducibility when researchers have no control over the external entities and/or when external entities are not transparent. … Prominent examples of external dependencies that fall into both classes (no control and not transparent) are APIs for data access and data analysis (such as the Twitter API, ChatGPT or Perspective API). Materials with dependencies on external entities that are not under the control of the researcher and are not transparent cannot be considered computational reproducible since the behavior and functionality of the external entity can change at any point in time, as well as its accessibility.

Downloading files from the internet is a sure sign of external dependencies and should be avoided at all costs. If possible, archive the files. This also informs how we should deposit our data. But to be honest, I am not a subject matter expert. I have read that Zenodo will keep your data as long as CERN exists (which should be at least 20 years).
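One mitigation, assuming the license allows redistribution, is to vendor a copy of any downloaded dataset inside the compendium and record its checksum, so future runs verify the local copy instead of hitting the internet. A minimal sketch (the file names and the example URL are illustrative, not the actual Dryad location):

```shell
#!/bin/sh
mkdir -p data

# Stand-in for the one-time download at archiving time, e.g.:
#   curl -L -o data/dai2012.csv 'https://example.org/dataset.csv'
printf 'density,time\n0.1,1\n' > data/dai2012.csv

# Record the checksum once, when the file is archived
sha256sum data/dai2012.csv > data/dai2012.csv.sha256

# At analysis time, verify the vendored copy instead of re-downloading
sha256sum -c data/dai2012.csv.sha256
```

The analysis script then reads data/dai2012.csv directly; the only remaining external dependency is the one-time download, which happened under the researcher’s control.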

Summary

To be sure, I am not criticizing Professor Boettiger’s approach to reproducibility. I think his effort to make his research as transparent as possible should surely be valued.

We are notoriously bad at predicting the future. We probably don’t know how we will do computational research in the future. Making something reproducible now is probably easy. But keeping it reproducible for, let’s say in this case, 5 years is already quite difficult.

The point I would like to make in this post is that tools are great, but for long-term reproducibility, how a tool works is probably more important: e.g. does it have a lot of moving parts?

The most important point is that the way we share our reproducible materials has the highest impact on long-term computational reproducibility. We should share both the binary (container images) and the recipes, as stated in The Turing Way. We should reduce external dependencies. There are blind spots in our thinking about reproducibility: we focus too much on R, Python, or even system dependencies, but not on “minor” components such as \(\LaTeX\) and fonts. Luckily for us, hrbrthemes actually ships the binary font files. That saved the day.

That being said, it is probably my lack of knowledge that made this attempt to use Professor Boettiger’s research compendium so clumsy. The failure has nothing to do with Professor Carl Boettiger or the other contributors to rang, such as Egor Kotov and David Schoch. The failure is all mine.

I have pushed my edits to Professor Boettiger’s research compendium to GitHub. If you are going to learn one thing from this blog post: my edits are also recipes, and should likewise be considered not reproducible.


  1. I am sorry that I also contributed to this oversupply of tools. 

