chainsawriot

10 years of rio

Posted on Aug 28, 2023 by Chung-hong Chan

rio was born in the office shown at the lower right corner of this photo
source: CC BY-SA 3.0

Update 2023-08-29: see the update at the end of this post.

Back in August 2013, I shared an office in Eliot Hall of the University of Hong Kong (one of a few remaining early KGV-era buildings) with Geoff Chan. He is now assistant professor of data science at Chinese University of Hong Kong, so I should call him Professor Chan. Back then, I talked to the future Professor Chan about an idea of writing an R package that is similar to stringr. I was (and still am) amazed by how simple yet crazily useful stringr is for cleaning up the organic messiness created during the natural growth of R. In the case of stringr, the major goal back then was to unify the API of various regular expression functions. In 2013, stringr had no dependencies and it was purely a set of wrappers around various Base functions. Hadley Wickham gave all the functions the prefix str_ for consistency, and the character input is always the first argument (contrast this with Base, where the regular expression comes first for gsub but the character input comes first for strsplit). The whole idea of stringr back then was simple enough to fit in a 3-page R Journal paper.

I felt at that time that the file input and output functions of Base R had the same problem. I came across this problem when I needed to write an R object as RDS. I could not remember the name of the write function (saveRDS), because it is not named consistently with other functions serving the same purpose (write.csv, write.table, foreign::write.dta etc). There should be a way to do the same stringr magic with these I/O functions.

The design was (and still is) simple: just two functions - import and export. At the time, no package used these two function names and therefore there was no potential conflict. (reticulate::import came after rio::import, in 2017.) There’s one more function: convert, but it is just import and then export. Similar to stringr, the API is consistent: import(file) and export(x, file). Both functions make the assumption that the file extension of a file indicates its format. At the time, the last release of Mac OS 9 (a major OS without the concept of file extension) was 10 years ago, so file extensions were universal across major OSes. This assumption is absolutely reasonable.

Therefore, you can use the same function to read different files: import("mtcars.csv"), import("mtcars.sav"), import("mtcars.rds"). The same case for writing: export(mtcars, "mtcars.csv"), export(mtcars, "mtcars.dta"). You don’t need to care about the underlying function, whether it’s read.csv or saveRDS. Of course, you might argue this monolithic approach is totally against the Unix Philosophy (“Do one thing and do it well”). But the convenience this approach brought to the table outweighs the need for dogmatically conforming to the Unix design principle.
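As a rough illustration of the calls described above, here is a small round trip. It is only a sketch: it assumes rio is installed from CRAN, and the temporary file names and the from_csv / from_rds variables are made up for this example.

```r
library(rio)  # assumption: rio is installed from CRAN

# Temporary paths; the extension is what tells rio which format to use
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Same function for writing, whatever the format
export(mtcars, csv_file)

# Same function for reading, whatever the format
from_csv <- import(csv_file)

# convert() is import() followed by export()
convert(csv_file, rds_file)
from_rds <- import(rds_file)

dim(from_csv)
dim(from_rds)
```

Note that a CSV round trip does not necessarily preserve everything (the row names of mtcars, for example), so the result should not be expected to be identical to the original object.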

The first CRAN version was very cringe-worthy. import was just a large switch call. The crazy thing is that this still works on R 4.3!

import <- function(file="", format=NULL, header=TRUE, ... ) {
  format <- .guess(file, format)
  x <- switch(format,
              txt=read.table(file=file, sep="\t", header=header, ...), ##tab-seperate txt file
              rds=readRDS(file=file, ...),
              csv=read.csv(file=file, header=header, ...),
              dta=read.dta(file=file, ...),
              sav=read.spss(file=file,to.data.frame=TRUE, ...),
              mtp=read.mtp(file=file, ...),
              rec=read.epiinfo(file=file, ...),
              stop("Unknown file format")
              )
  return(x)
}

.guess <- function(filename, format=NULL) {
  # guess the file format of filename based on file extension
  # TODO: use the unix utility "file" to read the file info.
  # or MIME info.
  if (!is.character(filename)) {
    stop("Filename is not a string")
  }
  guess_format <- ifelse(!is.null(format), tolower(format), str_extract(tolower(filename), "\\.(txt|csv|dta|sav|sas|rec|rds|mtp)$"))
  if (is.na(guess_format)) {
    stop("Unknown file format")
  } else {
    return(str_replace(guess_format, "\\.", ""))
  }
}

The feature mentioned in the TODO of .guess() has never been implemented. Also, I didn’t know back then that tools::file_ext() existed (although under the hood this built-in function is also just a regular expression).
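For comparison, this is roughly what .guess() could have looked like with tools::file_ext() and no stringr. It is a hypothetical rewrite for illustration, not code that ever shipped in rio; the name guess_format and the known vector are my own.

```r
# Hypothetical rewrite of .guess() using only base R / tools
guess_format <- function(filename, format = NULL) {
  if (!is.character(filename) || length(filename) != 1L) {
    stop("Filename is not a string")
  }
  # An explicitly supplied format always wins over the file extension
  if (!is.null(format)) {
    return(tolower(format))
  }
  # file_ext() returns "" when there is no extension
  ext <- tolower(tools::file_ext(filename))
  # Same list of extensions as the regular expression in the 2013 code
  known <- c("txt", "csv", "dta", "sav", "sas", "rec", "rds", "mtp")
  if (!ext %in% known) {
    stop("Unknown file format")
  }
  ext
}

guess_format("mtcars.CSV")
guess_format("mtcars.dat", format = "RDS")
```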

rio was the first R package I uploaded to CRAN. And actually, I had my first experience with the then not-so-friendly CRAN team. I was accused by a CRAN team member of wasting his time ¹. But after many back-and-forth e-mails and uploads, the first version of rio, v0.1.1, was released on CRAN on 2013-08-28 at 14:02 CEST. That’s right: that was exactly ten years ago today.

I used rio in my own PhD research to quickly save and load data. But I did not find rio to be widely used in 2013-2014. There was no development for almost a year (as there was no need, rio worked well enough for my research), until I received an e-mail from Dr Thomas J. Leeper (now research scientist at ~~Facebook~~ Meta) in 2015 saying he had updated the package to support more formats (excel, json, etc.) and asking how he should proceed with contributing to the package. At the time, I was busy with my own PhD research (plus a million other research projects and services). He even offered to take over the maintainership of rio. I agreed and then the rest is history.

I still think that this is one of the best decisions I have ever made. The package flourished under Thomas Leeper’s wonderful leadership. I think rio is sort-of popular. Of course, it’s not the “Billboard 100” popular. But I believe it’s not bad. At the time I am writing this, it is downloaded over 42,000 times per month from the RStudio CRAN mirror alone. Over 10 years, the total number of downloads from the RStudio CRAN mirror has reached 10 million! Despite having no software paper, rio has been cited 85 times so far. It is my most cited R package and contributes one point to my H-index. However, I don’t claim any credit for the success of rio. The honor should go to Dr Leeper as well as all other contributors to the package. The package has improved a lot from the above cringe-worthy switch call. The code base is more maintainable because of the S3 method approach suggested by Jason Becker. Also, rio supports more file formats now, contributed by various volunteers.
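To give a rough idea of how an S3 method approach differs from one large switch call, here is a minimal sketch in base R. It is not rio's actual code: read_by_format, import_sketch and the fmt_ class prefix are hypothetical names, and only two formats are covered.

```r
# Generic: the reader is picked from the class of its first argument
read_by_format <- function(x, ...) UseMethod("read_by_format")

# One small method per format instead of one branch per format in a switch
read_by_format.fmt_csv <- function(x, ...) utils::read.csv(x$path, ...)
read_by_format.fmt_rds <- function(x, ...) readRDS(x$path, ...)

# Fallback when no method exists for the extension
read_by_format.default <- function(x, ...) stop("Unknown file format")

import_sketch <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  # Wrap the path in an object whose class encodes the format
  read_by_format(structure(list(path = file), class = paste0("fmt_", ext)), ...)
}

# Usage: write a small CSV file, then read it back through the dispatcher
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp, row.names = FALSE)
nrow(import_sketch(tmp))
```

With this layout, supporting a new format means adding one new method, without touching import_sketch() itself.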

I still contribute code to the package occasionally and am still dogfooding by using rio in almost all of my projects. rio is mentioned in the R4DS book by Hadley Wickham et al. rio is recommended by Bruno Rodrigues in his book “Modern R with the tidyverse”. Sharon Machlis in her book “Practical R for Mass Communication and Journalism” says “The magic of rio”. Perhaps rio is magical now. It’s trusted by some big organizations too, such as World Health Organization, European Centre for Disease Prevention and Control, Doctors Without Borders. The design of rio is particularly helpful in disease monitoring because different agencies around the world produce their data in different formats. Having just one R function to read them all can save a lot of trouble, particularly for epidemiologists and doctors. The package is silently serving the world.

The first 10 years of rio were incredible. I can’t wait to see what this little program will bring in the years to come.

Postscript

Being that upbeat is actually not my style. I would like to add two melancholy historical footnotes about rio. Missed opportunities, some might say.

Sima

rio actually has a little sister called Sima, which is written in Ruby. The name “Sima” (司馬, “the one who controls the horses”) is the ancient Chinese name of the military rank Marshal. The Ruby Gem is kind of complete, but never published on rubygems.org. The idea of Sima is actually the same as rio:

Sima.export(obj, "~/output.mar")
obj = Sima.import("~/output.mar")

### Sima can guess the file format you want to serialize/deserialize
Sima.export({:testing => [2,3,4,5,6]}, "~/output.yml")
Sima.import("~/output.yml")

It was created for a top secret research project at HKU, whose first prototype was written in Ruby (my choice). Because of Sima, I had the opportunity to present it in a lightning talk at RubyConf Taiwan 2016 in Academia Sinica. This was the first time (and the only time so far) I have seen Yukihiro Matsumoto a.k.a. Matz in person. I still wear the official T-shirt of that conference often. The reference on that T-shirt, ペンパイナッポーアッポーペン, did not age very well. At that conference, I talked quite extensively with two Taiwanese Ruby programmers (I shared an Airbnb with one of them and he even bought me breakfast). Both of them are now working at some extremely important tech companies.

Sima did not get as much traction as rio. Sadly, the development coincided with my PhD graduation. My boss back then wanted to offer me a postdoc position to work on this further but I decided to move somewhere else. And after graduation, I was not part of that top secret research project anymore. The end product got rewritten in Python or whatnot. I no longer need to use Sima. Probably no one other than me has ever used Sima.

The only thing I did for the Gem after that was to handle those GitHub security alerts. Eventually, I archived the gem earlier this year. RIP, Sima (2016 - 2023). I still like Ruby. Unfortunately, I barely program any Ruby now. Probably I’ve forgotten most of Ruby.

RStudio

At one point in 2015, Hadley Wickham contacted the rio development team via a GitHub issue about relicensing rio under MIT; rio was, and still is, under GPL-2 (again, my choice). It was because they (RStudio the company, now Posit) wanted to revamp the dataset import interface of RStudio (the editor). rio could have been their choice. But the GitHub issue was closed because they wanted to rethink their options. As you may know now, the dataset import interface ultimately ended up being based on readr, readxl, and haven.

rio wasn’t chosen, unfortunately. What if rio were chosen? I don’t know.

Update

My message to the community

As you might notice, leeper/rio got redirected to chainsawriot/rio. That’s because Thomas transferred the repo as well as the maintainership of rio to me. I am now the maintainer of rio. But this is not my first time. Back when I created rio in 2013, the package was not as popular as it is now. As some have said, maintaining open source software is “free as in puppies”. I would like to take this opportunity to thank Thomas for taking care of this puppy for so many years.

The short-term goal for me is to prepare a maintenance release to CRAN in the coming days (#306). I will also attend to your many bug reports and feature requests. However, please note that I will focus on bug fixes and I need to take a relatively conservative approach to all your feature requests. Due to the many reverse dependencies, I probably will not introduce any breaking changes. Introducing new formats would also increase the (human) cost of maintenance. I might come up with a new approach to manage this complexity. But in the meantime, I will be very selective about introducing new formats. Our goal should be fostering the core functionalities of rio: importing and exporting rectangular data.

Coincidentally, today marks the 10th anniversary of the first CRAN release of rio. I would like to thank all of you for choosing rio, as well as all the contributors for making rio easy and fun to use. Last but not least, and once again, Thomas Leeper for his work on this package.

And the maintenance release: 0.5.30 is on CRAN.

¹ In fact, I still need to deal with his “Please correct before [two weeks] to safely retain your package on CRAN” emails occasionally.