chainsawriot

Current thinking about writing R (2/2): Free as in mummies

Posted on Sep 29, 2023 by Chung-hong Chan

Previously on this blog: clarity

I am not super familiar with the concept of code smells. However, my favorite one is called Speculative Generality. It shows how human nature can affect the quality of a program. Simply put, “Speculative Generality” means trying to guess the future and implementing currently unused (or unusable) features that might help with other features catering to that guessed future. In many cases, this may or may not be a problem. However, in Open Source Software development, I mostly see it as a big problem. Let me give an example: a pretty harmless function called rio::is_file_text().

Disclaimer: I am not “subtweeting” the contributor of rio::is_file_text(). I just wanted to use this as an example to talk about the reality of maintaining an R package, or any open source project in general.

The R package rio introduced a function called rio::is_file_text() in 0.5.22 (~2019 on GitHub, i.e. ~4 years ago). It then reached CRAN with version 0.5.26 in 2021-03.

It was introduced into the rio codebase in anticipation of another planned feature called try_import(). The idea was probably to try different import methods for a data file that has no file extension. But as you can imagine from my bringing it up in this context, the planned feature try_import() never came. Therefore, no one actually knows what try_import() should look like, what problem it was trying to solve, or in what situations it would be useful. All of this came from one contributor. To this day, I still don’t know how a checker for plain text files can help in the context of rio. It might be more useful in packages such as fs. And there is already the Unix utility called file.

By the time rio::is_file_text() reached CRAN, the developer who proposed try_import() and developed rio::is_file_text() no longer had time to develop rio any further. And then the function rio::is_file_text() stayed in the codebase for over 3 years. I don’t want to comment on its implementation, which is not the focus of this blog post.

However, on November 1, 2021, package maintainers who used rio as a dependency received an e-mail from the CRAN Team (actually that person), saying rio was failing the CRAN routine checks and producing random segmentation faults. If rio didn’t solve those issues, all packages using rio (its reverse dependencies) would be removed from CRAN. Sure, that led to concerns.

This is an example of software decay: it worked yesterday, it does not work today. Usually, software decay is due to a changing environment. The root cause of those segfaults, even now, is still unclear. What we know is that the root cause is probably neither rio::is_file_text() itself nor its implementation, but an underlying package used for testing rio::is_file_text(), which generated the segfaults.

The point I would like to make is that any feature introduced into the codebase carries a cost, or several different costs. Martin Fowler lists 4 different types of cost: cost of building, cost of repair, cost of carry, and cost of decay. It sounds easy (or fun) to implement a feature, and the cost of building might be low. However, in the long run, any feature in a codebase incurs further costs of repair, carry, and decay. When the feature breaks, we need to fix it (cost of repair). While the feature is in the codebase, it increases the complexity of maintenance (cost of carry). When the feature cannot deal with the changing environment, it decays and may contribute to the (bad) reputation of the software (cost of decay).

In the open source world, unfortunately, those subsequent costs are usually not paid by the original implementer, because developers usually come and go. The future costs are usually paid by the maintainers and some other contributors. In this case, it was exactly that: Thomas Leeper, at the time, fixed the tests and resubmitted rio to CRAN. And rio was saved.

All features come with those costs because, according to Murphy’s Law, all features will eventually fail and therefore need maintenance. The questions are: 1) are we willing to pay the four costs of a particular feature? 2) is the feature important enough to justify the costs?

Take rio as an example again. If rio::import() breaks, the cost of repair is always worth paying because it is the most important function of the package, one that probably 99% of the users would use. We must fix it. But rio::is_file_text()? First of all, it was designed for a speculative future, and that speculative future never came. We don’t know of any real-life usage of the function (more on that later). I think it is safe to say that we are, or more accurately, I am, not willing to pay the costs of keeping rio::is_file_text() any further. Even a fully implemented solution to a speculative future has questionable value, because humans are extremely bad at predicting the future. Sorry to say this, but a half-baked solution to a speculative future has no value. And therefore, in rio v1.0.0, the function is aptly removed. We can only do this in a major release like this one, in which breaking changes are allowed.

I can also safely say that some other features of rio were implemented not out of a certain demand, but out of some speculative futures. The super sluggish HTML/XML import and export features are two examples. Do we know of any real-life usage? No. But can we make sure that absolutely no one uses them and that we won’t break any code if we remove them? We cannot, and therefore we can only consider how to deal with them in the next major release, v2.0.0, in the distant future. This is the cost of carry; and if the cost of carry is so high that it prevents us from making any change, the software is already broken.

One of the Extreme Programming mantras is YAGNI (You aren’t gonna need it). I am sorry to say this again, but most feature requests are in that category. And software with a lot of features is usually not a good sign. Many open source projects have a lot of features. To name a few: Wordpress, drupal, GNU emacs; in the R world: tidyverse, arrow, and data.table. These projects have a greater risk of decaying than the humble rio. But there is one drastic difference: these projects have a lot of contributors (it makes business sense to contribute to them) and therefore can deal with the cost of having a lot of features.

But for software like rio, mainly used by scientists and maintained by scientists, do we have that luxury? I know how frustrating it is to work with scientists. I say this as if I were not one myself. They have other goals (papers, grants, teaching, h-index, number of followers on ~~Twitter~~ X or Bluesky or whatever social media platform that can garner eyeballs). And maintaining open source projects is usually not among their goals. Contributing to open source, if they ever do it, is their one-night stand. Software, unlike papers or grants, is never done.

I like to say open source is “free as in puppies”. But puppies would one day die (weep). I think it is more appropriate to say “free as in mummies”. It’s like a public museum, let’s say the British Museum, and you have a mummy in your collection. It might have taken you a lot of energy to ~~stole~~ curate the artefact from Egypt. It has a certain value, usually cultural value. Some random school kids might want to look at that mummy during their summer vacation. You can’t measure the economic value of keeping it. You don’t know whether keeping that mummy in your collection is worthwhile or not. Once it’s in your collection, it won’t die. You have to maintain it, like, forever.

You might want to keep a puppy at home. How about a mummy at home? YAGNI.

Debrief

If you think I am arrogant, read Rich Hickey’s more arrogant “Open Source is Not About You”. If you want a book treatment, read: Working in Public: The Making and Maintenance of Open Source Software by Nadia Eghbal.