Update: 2023-09-19 jump
Previous on this blog: 10 years of rio
If you want to know more about the roller coaster ride since my last post on rio on Aug 28, jump here. But you are probably here to learn more about the package: rio 1.0.0 is on CRAN.
This is the first release of
rio under the GESIS TSA Moniker! This is also the debut of GESIS TSA on CRAN.
The following are what you should know about
It is so easy to look at the new features. But I would like to talk about the user-invisible changes first.
Like all projects, the
rio codebase did get quite messy after the first decade of organic growth. Developers come and go. Some features were developed halfway. Different developers developed their features their way. Also, there was a lot of redundancy and, in my own opinion only, many sub-optimal choices. It’s a problem, for sure. But it is very common for a decade-old codebase to look like that. And that’s why code needs constant upkeep. It’s exciting to develop new things. But working with legacy code is also a very important skill to have. Maybe you should read this post by Joel Spolsky.
rio development team’s first assignment was to clean up (we said “declutter”) the codebase so that the code is less bloated. Along the way, we also discovered several hidden bugs and fixed them. We clarified many confusing (and competing) concepts in the codebase. We made several inconsistent behaviors consistent (e.g.
setclass). We trimmed the code to remove competing “sources of truth” so that only one source of truth is left to maintain. We stopped reinventing the wheel and now use existing robust and lightweight solutions instead. We deprecated features that are almost never used but took a lot of energy to support.
This clean up introduced a few breaking changes (see NEWS.md). Therefore, this is v1.0.0. Having said that, we also know the core functionalities (see below) of
rio must be incredibly stable. We are happy to announce that version 1.0.0 breaks none of the 49 reverse dependencies on CRAN and Bioconductor. That’s why, even with such a big change, rio’s CRAN submission took only a few hours (even with 49 revdeps) and looked effortless. Code available on GitHub that exercises the changed features, where we could test it, is still functional. Probably no one will shed a tear for the deprecated features.
We think that the core functionalities of
rio are just two: importing and exporting rectangular data. We should focus on improving these two functionalities rather than on anything else. With this core principle, we made the following changes:
setclass has the last say
setclass controls what the class of the imported data is. Previously, this parameter could be overridden by the underlying function. Now, it has the last say.
```r
require(rio)

## By default as data frame
import("starwars.ods")

## To tibble
import("starwars.ods", setclass = "tibble")

## Still tibble!
import("starwars.csv", data.table = TRUE, setclass = "tibble")
```
We also introduced the option
rio.import.class so that you can select your preferred output class.
```r
options(rio.import.class = "tibble")
import("starwars.ods")
import("starwars.csv")
```
Well, you can also put the
options in your R initialization script such as
.Rprofile, although we don’t recommend it for reproducibility reasons. It would be much better to set the option in each of your analysis scripts.
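As a minimal sketch of the per-script approach (assuming rio 1.0.0 or later; the temporary CSV file and the variable names are only for illustration), the option can be set at the top of a script, overridden for a single call with setclass, and restored at the end:

```r
library(rio)

# Write a small CSV to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
export(head(mtcars), csv_path)

# Set the preferred class for this script and remember the previous value.
old_opts <- options(rio.import.class = "data.table")

# Follows the option: a data.table.
dt <- import(csv_path)

# An explicit setclass still wins for this one call: a plain data frame.
df <- import(csv_path, setclass = "data.frame")

# Restore whatever was set before the script ran.
options(old_opts)
```

Keeping the options() call inside the script means that anyone rerunning it gets the same classes, whatever their own .Rprofile says.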
Currently, the output class can be “data.frame”, “tibble”, “data.table”, and …
rio, similar to R and Linux, is licensed under GPL-2. We believe using proprietary software harms computational reproducibility because not all of us can have access to proprietary software such as SPSS, Stata, and Microsoft Excel.
rio has two tiers of supported formats, based on whether the underlying packages are in
Suggests. Out of the box, only those in
Imports are supported. Those in
Suggests can be easily installed using
rio::install_formats() after the installation.
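In a script that relies on a format from the Suggests tier, it can help to check for the backing package before exporting. The following is only a sketch: export_if_available() is a hypothetical helper, not part of rio, and pairing the JSON format with the jsonlite package is my assumption about how that format is backed.

```r
library(rio)

# Hypothetical helper (not part of rio): export only when the package that
# backs the target format is installed; otherwise say how to get it.
export_if_available <- function(x, file, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message(
      "Package '", pkg, "' is not installed; ",
      "install it directly or run rio::install_formats()."
    )
    return(invisible(FALSE))
  }
  export(x, file)
  invisible(TRUE)
}

# Assumption: JSON export is backed by the jsonlite package.
json_path <- tempfile(fileext = ".json")
ok <- export_if_available(head(mtcars), json_path, pkg = "jsonlite")

# Only read the file back if the export actually happened.
if (ok) head(import(json_path))
```

The helper returns TRUE or FALSE invisibly, so the calling script can decide whether to continue or to fall back to a format from the Imports tier.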
Out of practicality, all formats in the “Imports” tier, except plain-text formats such as CSV, were proprietary binary formats previously:
haven (SPSS, SAS, Stata) and
readxl (Microsoft Excel) 1. Without running
rio::install_formats() and considering only binary formats, it was only possible to export your data to these formats. Although supporting proprietary formats is practical, we should also encourage open formats.
In 1.0.0, we upgraded
arrow to the “Imports” tier 2. Apache arrow (Apache License 2.0) supports two open binary formats: Apache Parquet and Feather. These formats are now widely used and
rio should support them out of the box. Therefore, if you have received a data file in xlsx, you can convert it immediately to an open format out of the box.
Update: 2023-09-19 Due to compile-time concerns, I am sorry to announce that we’ll need to roll back the decision to move
arrow to the “Imports” tier. The rollback will be in version 1.0.1, which will be on CRAN soon. In order to use the arrow features, please install
arrow manually for now. I am deeply sorry for any confusion caused. Please participate in this discussion about supporting open binary formats, which I have reopened. Now back to the original broadcast.
```r
require(arrow)
rio::convert("starwars.xlsx", "starwars.parquet")
```
Related to this, you can now import a data file and immediately set the imported object to Arrow Table for data manipulation using arrow. Therefore,
arrow is a new output class that
```r
require(arrow)
require(dplyr)

terminology_arrow <- rio::import(
  "https://evs.nci.nih.gov/ftp1/CDISC/SDTM/SDTM%20Terminology.xls",
  sheet = 2,
  setclass = "arrow"
)

terminology_arrow %>%
  filter(`Codelist Name` == "Unit") %>%
  collect()
```
In the Suggests tier, we support two new formats:
qs (Quick Serialization) and
fods (OpenDocument Spreadsheet “Flat”).
rio also supports all new features of readODS 2.1.0.
```r
## export a list of data frames
rio::export(list("mtcars" = mtcars, "iris" = iris), "many_table.fods")
rio::import("many_table.fods", sheet = "mtcars")
```
It is now possible to use
export_list() (this function exports multiple files, whereas
export() exports one file) to export a bunch of files to a single archive, such as a zip file.
```r
rio::export_list(
  list("mtcars" = mtcars, "iris" = iris),
  file = "%s.parquet",
  archive = "many_files.zip"
)
rio::import("many_files.zip", which = "mtcars.parquet")
```
writexl is currently the simplest and fastest XLSX writer. I know it from my development of
readODS. We decided to use
writexl going forward. Surprisingly, this choice does not break anything.
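From the user’s side, the XLSX round trip looks the same as before. A minimal sketch, assuming rio 1.0.0 or later (the temporary path is only for illustration):

```r
library(rio)

xlsx_path <- tempfile(fileext = ".xlsx")

# A named list of data frames is written as one sheet per element.
export(list(mtcars = mtcars, iris = iris), xlsx_path)

# Read a single sheet back by name.
iris_back <- import(xlsx_path, which = "iris")
dim(iris_back)
```

Note that XLSX has no notion of factors, so a column such as Species should not be expected to come back with its original type.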
The entire documentation has been rewritten. Rather than doubling as tests for the package, most examples now show practical usage scenarios. Several vignettes have been added to explain how the package works.
Last but not least, new logo 3!
I would like to thank my teammate David Schoch tremendously for his help; Jason Becker and Bill Denney for the discussion; and GitHub user “zahlenzauber” for confirming that the labelling mechanism is working.
I contacted Dr Thomas Leeper previously about helping to fix the burning issues of
rio. Thomas then offered to transfer back the maintainership to me because he doesn’t have time for open source software development for now.
rio was then back with me. I immediately published a maintenance release, 0.5.30, on CRAN to fix the most pressing issues. Along the way, I also notified the CRAN team about the transfer of maintainership.
I am willing to maintain
rio, but I believe this widely used package would benefit a lot from group maintainership. My initial thought was to submit
rio to rOpenSci, and I did that. But the submission was rejected.
I talked with my team lead David Schoch about the rOpenSci desk rejection. He had an idea of having a small repository of Open Science software. The GESIS TSA GitHub organization was born. If you want to know more about our team, please go to our department website.
I migrated two other pieces of Open Science software to the organization. The next R package to be released under the GESIS TSA Moniker will be
Well, if you know the reference of this logo, please remember to stretch your body often and do regular exercise. BTW, her name is Rio and she dances on the sand.