chainsawriot

rio 1.0.0

Posted on Sep 15, 2023 by Chung-hong Chan

Update (2023-09-19): see the note under “Support Apache Arrow out of the box”.

Previously on this blog: 10 years of rio

If you want to know more about the roller coaster ride since my last post on rio on Aug 28, jump to the backstory at the end of this post. But you are probably here to learn more about the package: rio 1.0.0 is on CRAN.

This is the first release of rio under the GESIS TSA Moniker! This is also the debut of GESIS TSA on CRAN.

install.packages("rio")

Here is what you should know about rio 1.0.0.

Massive Code Clean Up

It is tempting to look only at the new features. But I would like to talk about the user-invisible changes first.

Like all projects, the rio codebase did get quite messy after the first decade of organic growth. Developers come and go. Some features were developed only halfway. Different developers developed their features their own way. There was also a lot of redundancy and, in my own opinion only, some sub-optimal choices. It’s a problem, for sure. But it is very common for a decade-old codebase to look like that. And that’s why code needs constant upkeep. It’s exciting to develop new things. But working with legacy code is also a very important skill to have. Maybe you should read this post by Joel Spolsky.

The rio development team’s first assignment was to clean up (we said “declutter”) the codebase so that the code is less bloated. Along the way, we discovered several bugs hidden in there and fixed them. We clarified many confusing (and competing) concepts in the codebase. We made several inconsistent behaviors consistent (e.g. ... and setclass). We trimmed the code to remove competing “sources of truth” so that only one source of truth is left to maintain. We stopped reinventing the wheel and now use existing robust and lightweight solutions instead. We deprecated features that were almost never used but took a lot of our energy to support.

This clean up introduced a few breaking changes (see NEWS.md). Therefore, this is v1.0.0. Having said that, we also know the core functionalities (see below) of rio must be incredibly stable. We are happy to announce that version 1.0.0 breaks none of the 49 reverse dependencies on CRAN and Bioconductor. That’s why, even with such a big change, rio’s CRAN submission process took only a few hours (even with 49 revdeps) and looked effortless. We also tested some code available on GitHub that uses the changed features, and that code is still functional. Probably no one will shed a tear for the deprecated features.

Foster core functionalities

We think that the core functionalities of rio are just two: importing and exporting rectangular data. We should focus on improving these two functionalities rather than anything else. With this core principle, we made the following changes:

Make `setclass` have the last say

The parameter setclass controls the class of the imported data. Previously, this parameter could be overridden by the underlying function. Now, it has the last say.

require(rio)
## By default as data frame
import("starwars.ods")
## To tibble
import("starwars.ods", setclass = "tibble")
## Still tibble!
import("starwars.csv", data.table = TRUE, setclass = "tibble")

We also introduced the option rio.import.class so that you can select your preferred output class.

options(rio.import.class = "tibble")
import("starwars.ods")
import("starwars.csv")

Well, you can also put the options in your R initialization script such as .Rprofile, although we don’t recommend it for reproducibility reasons. It would be much better to set the option in each of your analysis scripts.
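
If you would rather keep the option local to a few calls, here is a minimal sketch using base R only. The helper with_import_class() is a hypothetical name, not part of rio; it sets rio.import.class, evaluates an expression, and restores the previous value afterwards. It assumes that import() looks up the option at call time.

```r
## Hypothetical helper, not part of rio: set rio.import.class temporarily
with_import_class <- function(class, expr) {
  old <- options(rio.import.class = class)  # options() returns the previous value
  on.exit(options(old), add = TRUE)         # restore it when the function exits
  force(expr)                               # expr is lazy, so it is evaluated with the option set
}

## The option is only visible inside the call
with_import_class("tibble", getOption("rio.import.class"))

## Intended use (not run here; needs rio and a data file):
## with_import_class("tibble", rio::import("starwars.csv"))
```

The on.exit() call restores the old value even when the expression throws an error, so a failed import does not leave the option changed for the rest of the session.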

Currently, the output class can be “data.frame”, “tibble”, “data.table”, and …

Support Apache Arrow out of the box

rio, similar to R and Linux, is licensed under GPL-2. We believe using proprietary software harms computational reproducibility because not all of us have access to proprietary software such as SPSS, Stata, and Microsoft Excel.

rio has two tiers of supported formats, based on whether the underlying packages are in Imports or Suggests. Out of the box, only those in Imports are supported. Those in Suggests can be easily installed using rio::install_formats() after the installation.

Out of practicality, all formats in the “Imports” tier, except plain-text formats such as CSV, were previously proprietary binary formats: haven (SPSS, SAS, Stata) and readxl (Microsoft Excel) ¹. Without running rio::install_formats(), and considering only binary formats, it was only possible to export your data to these proprietary formats. Although supporting proprietary formats is practical, we should also encourage open formats.

In 1.0.0, we upgraded arrow to the “Imports” tier ². Apache Arrow (Apache License 2.0) supports two open binary formats: Apache Parquet and Feather. These formats are now widely used and rio should support them out of the box. Therefore, if you have received a data file in xlsx, you can convert it immediately to an open format.

Update: 2023-09-19 Due to compiling time concerns, I am sorry to announce that we’ll need to roll back the decision to move arrow to the “Imports” tier. The rollback will be in version 1.0.1, soon on CRAN. In order to use the arrow features, please install arrow manually for now. I am deeply sorry for any confusion caused. Please participate in this discussion about supporting open binary formats, which I have reopened. Now back to the original broadcast.

require(arrow)
rio::convert("starwars.xlsx", "starwars.parquet")

Now, this parquet file can be read by many open source tools, notably DuckDB, Apache Spark and pandas.
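
Since arrow has to be installed manually for now (see the update above), a script might want to check for it before converting. The following is only a sketch: can_use() is a hypothetical helper, not a rio function, and the temporary CSV file stands in for a real input file.

```r
## Hypothetical helper, not part of rio:
## TRUE only if every package in pkgs is installed and loadable
can_use <- function(pkgs) {
  all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))
}

if (can_use(c("rio", "arrow"))) {
  src <- tempfile(fileext = ".csv")      # stand-in for a file you received
  out <- tempfile(fileext = ".parquet")
  rio::export(mtcars, src)
  rio::convert(src, out)                 # the formats are taken from the file extensions
  file.exists(out)
} else {
  message("Please install arrow first, e.g. install.packages(\"arrow\")")
}
```

requireNamespace() only checks and loads the package namespace; it does not attach it, so the check has no side effect on the search path.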

Related to this, you can now import a data file and immediately turn the imported object into an Arrow Table for data manipulation using arrow. Therefore, arrow is a new output class that rio supports.

require(arrow)
require(dplyr)
terminology_arrow <- rio::import("https://evs.nci.nih.gov/ftp1/CDISC/SDTM/SDTM%20Terminology.xls", sheet = 2, setclass = "arrow")
terminology_arrow %>% filter(`Codelist Name` == "Unit") %>% collect()

New formats: `qs` and `fods`

In the Suggests tier, we support two new formats: qs (Quick Serialization) and fods (OpenDocument Spreadsheet “Flat”).

rio also supports all new features of readODS 2.1.0.

## export a list of data frames
rio::export(list("mtcars" = mtcars, "iris" = iris), "many_table.fods")
rio::import("many_table.fods", sheet = "mtcars")

`export_list()`’s new `archive` argument

It is now possible to use export_list() (this function exports multiple files, whereas export() exports one file) to export a bunch of files to a single archive, such as a zip file.

rio::export_list(list("mtcars" = mtcars, "iris" = iris), file = "%s.parquet", archive = "many_files.zip")
rio::import("many_files.zip", which = "mtcars.parquet")

Use `writexl`

writexl is currently the simplest and fastest XLSX writer. I got to know it through my development of readODS. We decided to use writexl going forward. Surprisingly, this choice does not break anything.
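
From the user’s side nothing changes: exporting to xlsx still goes through export(). A minimal round trip might look like the sketch below; the temporary file name is only for illustration and the block is skipped when rio is not installed.

```r
if (requireNamespace("rio", quietly = TRUE)) {
  xlsx_file <- tempfile(fileext = ".xlsx")
  rio::export(iris, xlsx_file)            # export() picks the xlsx writer from the extension
  iris_back <- rio::import(xlsx_file)     # read the file back as a data frame
  str(iris_back)
}
```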

Improved Documentation

The entire documentation has been rewritten. Rather than using the examples to test the package, most examples now show practical usage scenarios. Several vignettes have been added to explain how the package works.

New logo

Last but not least, new logo ³!

Acknowledgment

I would like to thank my teammate David Schoch tremendously for his help; Jason Becker and Bill Denney for the discussion; and GitHub user “zahlenzauber” for confirming that the labelling mechanism is working.

Backstory

I previously contacted Dr Thomas Leeper about helping to fix the burning issues of rio. Thomas then offered to transfer the maintainership back to me because he doesn’t have time for open source software development for now. So rio is back with me. I immediately put a maintenance release, 0.5.30, on CRAN to fix the most pressing issues. Along the way, I also notified the CRAN team about the transfer of maintainership.

I am willing to maintain rio, but I believe this widely used package would benefit a lot from group maintainership. My initial thought was to submit rio to rOpenSci and I did that. But the submission was rejected.

I talked with my team lead David Schoch about the rOpenSci desk rejection. He had the idea of having a small repository of Open Science software. The GESIS TSA GitHub organization was born. If you want to know more about our team, please go to our department website.

I migrated two other pieces of Open Science software to the organization. The next R package to be released under the GESIS TSA Moniker will be rang 0.3.

¹ Well, xlsx (or more specifically OOXML) is technically an ISO / ECMA open standard. Microsoft owns the patent of OOXML.
² There’s a discussion on GitHub on which open binary format to support. There were only two choices: Apache Parquet (arrow) and OASIS ODS (readODS). arrow was chosen.
³ Well, if you know the reference of this logo, please remember to stretch your body often and do regular exercise. BTW, her name is Rio and she dances on the sand.