chainsawriot

Tom's Diner, rang, Bioconductor for nonbioinformaticians, or my preconceptions about dependencies are wrong

Posted on Feb 26, 2023 by Chung-hong Chan

Act I: Tom’s Diner

You may know that Tom’s Diner is a 1987 song by Suzanne Vega. If you don’t know the song, you should find it and listen to it. The song has no musical instruments, just Vega singing. If you listen to it with a good pair of headphones, you should hear no noise in the background; again, just Vega singing. You can focus on her singing and you should be able to hear her microphone’s slight echo, some popping sounds from her mouth, and her breathing. These details are interesting, to say the least.

You might wonder why I am bringing up Tom’s Diner. You would probably listen to the song on one of the streaming services: Spotify, Apple Music, or YouTube. It is extremely unlikely these days that you would listen to Tom’s Diner on CD, cassette, or vinyl. Listening to Tom’s Diner via streaming now, you can still hear all the details.

I have recently finished reading the book “How Music Got Free: A Story of Obsession and Invention” by Stephen Richard Witt. One major story in the book is how MP3 was invented here in Germany. MP3, as you might know, is a lossy music compression format. It works great for music because when some details in the music are dropped (e.g. a guitar riff, a drum beat, or whatever), our brain will somehow “correct” for the missing details. But our brain can’t do the same for the human voice, and dropping details in a human voice generates sonic artefacts. I think now, in the age of Zoom, it is not that difficult to imagine what those sonic artefacts sound like.

The engineers at the Fraunhofer Institut said they used Tom’s Diner to check for problems in how the MP3 compression algorithm encoded the human voice. They said that they had listened to the song perhaps a million times with some special high-fidelity headphones from Japan to check how Suzanne Vega’s voice was changed by the MP3 compression algorithm. Some early iterations of the algorithm could not cleanly separate the human voice from the background and generated a lot of background noise (sonic artefacts). Basically, Vega’s song was used as the test case to debug the algorithm. And therefore, some say Suzanne Vega is the “Mother of MP3”. In summary, the fidelity we can hear in Tom’s Diner today via compressed music from any streaming service is not something we can take for granted. It is based on a massive amount of engineering.

Act II: rang

Yes, the package was previously named gran

My team lead David and I are developing the R package rang. It is now clear that my previous post on Docker is actually my research note on Docker and R.

I should blog about my R packages more often. But I am quite lazy about it. Instead, David has written a nice blog post about rang. The idea behind rang is elegantly simple: you have a list of R packages and a date (we call this date the “snapshot date”); rang will find the latest version of each of these R packages as of that snapshot date and produce a Dockerfile to reconstruct the likely computational environment on that date. Nice and simple. If you think that this is a simple task, it is similar to thinking that the voice of Suzanne Vega in Tom’s Diner is a simple thing to reconstruct.
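To make the “snapshot date” idea concrete, here is a toy sketch in base R. It is not how rang does it internally; the release history below (package names, versions, and dates) is invented purely for illustration, and `latest_as_of()` is a hypothetical helper.

```r
# Invented release history: none of these packages or dates are real.
releases <- data.frame(
  package  = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  version  = c("1.0.0", "1.1.0", "2.0.0", "0.9.0", "1.0.0"),
  released = as.Date(c("2019-03-01", "2020-06-15", "2022-01-10",
                       "2018-11-20", "2021-05-05")),
  stringsAsFactors = FALSE
)

# For each package, return the newest version released on or before the
# snapshot date (hypothetical helper, not part of rang).
latest_as_of <- function(releases, snapshot_date) {
  snapshot_date <- as.Date(snapshot_date)
  eligible <- releases[releases$released <= snapshot_date, ]
  eligible <- eligible[order(eligible$package, eligible$released), ]
  # After sorting, the last row of each package is its newest eligible release.
  newest <- eligible[!duplicated(eligible$package, fromLast = TRUE), ]
  setNames(newest$version, newest$package)
}

latest_as_of(releases, "2020-12-31")
```

With a snapshot date at the end of 2020, the 2.0.0 release of pkgA and the 1.0.0 release of pkgB are ignored because they did not exist yet.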

There are deep dependencies: package A depends on package B and package B depends on package C and on and on… To reconstruct the correct computational environment, all deep dependencies should also be in the correct version.
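A minimal sketch of what “deep” means here, assuming a hand-written dependency list (the package names are invented, and `deep_deps()` is a hypothetical helper rather than anything from rang). Against a real repository index, `tools::package_dependencies()` with `recursive = TRUE` does a similar job.

```r
# Invented direct dependencies: pkgA needs pkgB and pkgD, both of which need pkgC.
direct_deps <- list(
  pkgA = c("pkgB", "pkgD"),
  pkgB = "pkgC",
  pkgC = character(0),
  pkgD = "pkgC"
)

# Breadth-first walk that collects every package reachable from `pkg`.
deep_deps <- function(pkg, direct_deps) {
  seen <- character(0)
  queue <- pkg
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    for (dep in direct_deps[[current]]) {
      if (!dep %in% seen) {
        seen <- c(seen, dep)     # record each dependency only once
        queue <- c(queue, dep)   # and visit its own dependencies later
      }
    }
  }
  seen
}

deep_deps("pkgA", direct_deps)
```

Every package in the returned vector then has to be pinned to its own correct version, which is where the real work starts.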

There are also system requirements. Package A might need a few system libraries. In the DESCRIPTION file of an R package, there is a field called SystemRequirements. For example, installing the R package xml2 needs the deb package libxml2-dev on Ubuntu / Debian. If rvest needs xml2, then actually rvest also needs libxml2-dev.
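The rvest example can be written down as a tiny recursive lookup. This is only a sketch: the dependency list is reduced to the single rvest to xml2 edge mentioned above, and `sysreqs_for()` is a hypothetical helper that assumes the dependency graph has no cycles.

```r
# Direct system requirements per package (the xml2 entry is from the text).
direct_sysreqs <- list(xml2 = "libxml2-dev")

# Simplified R-level dependencies: only the rvest -> xml2 edge is modelled.
r_deps <- list(rvest = "xml2", xml2 = character(0))

# A package needs its own system requirements plus those of everything it
# depends on, directly or indirectly.
sysreqs_for <- function(pkg, r_deps, direct_sysreqs) {
  inherited <- unlist(lapply(r_deps[[pkg]], sysreqs_for, r_deps, direct_sysreqs))
  unique(c(direct_sysreqs[[pkg]], inherited))
}

sysreqs_for("rvest", r_deps, direct_sysreqs)
```

rvest declares nothing itself in this toy table, yet the result still contains libxml2-dev because it is inherited from xml2.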

There are also different repositories. You might know CRAN. But there are other sources of R packages. GitHub is another source. The topic of the day, “Bioconductor”, is another “CRAN-like” repository. And for the sake of completeness, I just want to remind my readers that there are actually more of these “CRAN-like” repositories; Omegahat and R Universe are two of them. I don’t want to talk about them, and in the short term rang will probably not support Omegahat and R Universe.

And let’s talk about Bioconductor.

Act III: Bioconductor for nonbioinformaticians

Bioconductor is a collection of repositories of R packages for bioinformatics, computational biology, and related fields. Although my undergraduate degree is in biology, I am not a biologist anymore. And of course, I don’t have any training in bioinformatics or computational biology.

We think about interoperability a lot when developing rang. rang supports a standard called package references, proposed by the folks at r-lib. The standard has four sources: CRAN, GitHub, local, and Bioconductor. The v0.1 of rang supports CRAN and GitHub. And it is natural for us to extend the support to local and Bioconductor in the upcoming v0.2. My concern about extending this support was that neither David nor I knew much about Bioconductor. We needed to learn how Bioconductor works in order to make rang understand Bioconductor. And man, to me, it’s extremely painful to learn Bioconductor the hard way.

We made several wrong design choices in the implementation. I, as the obsessive software tester, identified many problems through testing. Here, I want to mention a Bioconductor package called Organism.dplyr by Martin Morgan, Daniel van Twisk, and Yubo Cheng. It is actually my Tom’s Diner. Through testing the reconstruction of an environment with Organism.dplyr, I proved that many of my previous assumptions about Bioconductor and the software dependencies of R packages were wrong.

Act IV: My preconceptions about dependencies are wrong

What is so special about Organism.dplyr, you might ask. Well, the package depends on a variety of CRAN and Bioconductor packages. In the end, installing just Organism.dplyr will install an additional 91 deep dependencies. And these 91 deep dependencies need 8 system requirements. And in order to get these two numbers correct, we squashed a lot of bugs.

It is also extremely common for Bioconductor packages to depend on CRAN packages. The number one thing I got wrong about R software dependencies was to assume that CRAN packages only depend on CRAN packages. It violates CRAN policies to use the Remotes field in DESCRIPTION to make a CRAN package depend on GitHub packages. However, it is okay for a CRAN package to have Bioconductor dependencies. For example, the CRAN package restfulr depends on the Bioconductor package S4Vectors. Because of my mistaken assumption, an earlier implementation of rang looked for S4Vectors on CRAN. Of course, it could never be found there, which created a requirement that could never be satisfied. Because of this finding, we needed to consider Bioconductor even for CRAN packages.
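The bug can be reproduced in miniature. The sketch below uses toy package indexes instead of the real repositories and a hypothetical `find_repo()` helper; it is an illustration of the mistake, not rang’s code.

```r
# Toy indexes standing in for the real repositories.
repo_index <- list(
  cran = c("restfulr", "xml2"),
  bioc = c("S4Vectors")
)

# Return the name of the first repository that lists `pkg`, or NA if none does.
find_repo <- function(pkg, repo_index, search_order = names(repo_index)) {
  for (repo in search_order) {
    if (pkg %in% repo_index[[repo]]) {
      return(repo)
    }
  }
  NA_character_
}

# The CRAN-only assumption: the dependency of restfulr can never be satisfied.
find_repo("S4Vectors", repo_index, search_order = "cran")

# Searching Bioconductor as well resolves it.
find_repo("S4Vectors", repo_index)
```

The first call returns NA, which is the unsatisfiable requirement described above; the second returns "bioc".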

And if you want to explore all of these Bioconductor-dependent CRAN packages, try this:

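# Software packages in the current Bioconductor release (default filters disabled)
bioc_pkgs <- rownames(available.packages(repos = "https://bioconductor.org/packages/release/bioc", filters = list()))
# Packages in the default repositories (typically CRAN) that depend on any of them
unique(unlist(tools::package_dependencies(bioc_pkgs, reverse = TRUE)))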

And the above code actually shows my second incorrect preconception about software dependencies.

Bioconductor, unlike CRAN, is not just one repository. There are “Software Packages”, but also other kinds of packages, such as “Annotation Data”, “Experiment Data”, “Workflow”, and even “Books”! For example, the repository URL of “Annotation Data” of the current release is “https://bioconductor.org/packages/release/data/annotation/”. Therefore, to get a complete list of all Bioconductor packages, one should consider all Bioconductor repositories!
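A sketch of what “consider all Bioconductor repositories” could look like. Only the “bioc” and “data/annotation” paths appear in this post; the paths for experiment data, workflows, and books are my assumption about the layout, and both function names are hypothetical. The second function needs network access, so it is defined here but not called.

```r
# Build one URL per Bioconductor repository for a given version.
# Assumption: "data/experiment", "workflows" and "books" follow the same
# layout as the two paths quoted in the text.
bioc_repo_urls <- function(version = "release") {
  paths <- c(
    software   = "bioc",
    annotation = "data/annotation",
    experiment = "data/experiment",
    workflows  = "workflows",
    books      = "books"
  )
  base <- paste0("https://bioconductor.org/packages/", version)
  setNames(paste(base, paths, sep = "/"), names(paths))
}

# Collect package names across all repositories (requires network access).
all_bioc_pkgs <- function(version = "release") {
  urls <- bioc_repo_urls(version)
  unique(unlist(lapply(urls, function(url) {
    rownames(available.packages(repos = url, filters = list()))
  })))
}

bioc_repo_urls()
```

Passing the result of `all_bioc_pkgs()` to `tools::package_dependencies(..., reverse = TRUE)` instead of only the software packages would then also catch CRAN packages that depend on, say, an annotation package.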

Finally, about system requirements. Many system requirements of Bioconductor packages are not queryable with the current System Requirements database. One of the most commonly used Bioconductor packages is called Rhtslib. It is one of the top 20 most downloaded Bioconductor packages. In its SystemRequirements field, it says “libbz2 & liblzma & libcurl (with header files), GNU make”. Neither libbz2 nor liblzma is available from the database.
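To show why such a free-text field is awkward to query, here is a rough sketch that tokenizes the Rhtslib string and looks each token up in a stand-in table. The `known` table is hypothetical (it is not the real System Requirements database, and the Debian package names in it are my assumption), and `tokenize_sysreqs()` is a hypothetical helper tuned to this one string.

```r
sysreqs <- "libbz2 & liblzma & libcurl (with header files), GNU make"

# Drop parenthetical notes, split on "&" and ",", and trim whitespace.
tokenize_sysreqs <- function(x) {
  x <- gsub("\\([^)]*\\)", "", x)
  tokens <- trimws(strsplit(x, "[&,]")[[1]])
  tokens[nzchar(tokens)]
}

# Hypothetical lookup table standing in for a system requirements database.
known <- c("libcurl" = "libcurl4-openssl-dev", "GNU make" = "make")

tokens <- tokenize_sysreqs(sysreqs)
unresolved <- setdiff(tokens, names(known))
unresolved
```

Anything left in `unresolved` has to be mapped to a system package some other way, which is the situation described above for libbz2 and liblzma.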

After working on Bioconductor for long enough, I basically trustno1.

The finale: The High Fidelity

I told my team lead David that Organism.dplyr is my miniboss to beat. And after hours and hours of high-intensity hacking, I could finally reconstruct a high-fidelity environment with Organism.dplyr. My brain at that moment might have been damaged, and I heard Suzanne Vega’s doo, doo, doo, doo, doo, doodoo, doo.

Dijkstra said “program testing can be a very effective way to show the presence of bugs, but is hopelessly inadequate for showing their absence.” No one can prove an implementation is free of bugs. But thanks to Tom’s Din.., I mean, Organism.dplyr, I found and fixed many.