chainsawriot

readODS 2.1.0

Posted on Sep 11, 2023 by Chung-hong Chan

Previously on this blog: 1.7.0 / 1.8.0 / 2.0.0

readODS 2.1.0 is now on CRAN. This is the third release under the rOpenSci moniker.

To install it from CRAN:

install.packages("readODS")

The first thing you might notice is that this comes just one month after the last release. That’s right: within one month, we made a lot of progress. In my opinion, this is probably the real “stable” release of the v2 series. The nice thing about 2.1 is that you probably will not see notable differences (except maybe one point, see below). Just like with 2.0.0, what you will mostly see are performance increases, some of them dramatic. Let’s dive into it.

Free of `xml2`; complete C++ rewrite of `write_(f)ods`

readODS was completely rewritten in 2016 to use xml2 (the previous versions used XML). Following the trend, started in 2.0.0, of rewriting xml2 code in C++ with RapidXML, 2.1.0 is the first version in which xml2 is no longer a dependency. write_ods (namely the part that updates and appends sheets) was the last part of 2.0.0 that still needed xml2. This part, along with the entire ODS writing algorithm, has been rewritten in C++ with RapidXML.

This is good in three ways. First, readODS no longer has any required system dependencies ¹. You don’t need to install libxml2-dev on Ubuntu, for example. On a blank-slate Rocker (R container), you can install readODS 2.1.0 right away.

Second, massive speed improvement. As said previously, the ODS writing algorithm by Dr Detlef Steuer implemented in 1.8.0 was 100x faster than 1.7.0. Now, the C++ rewrite of the same algorithm in 2.1.0 is 10x faster than 1.8.0. Therefore, 2.1.0 is almost 1000x faster than 1.7.0! Don’t believe it? I have a benchmark. It took 46.4s in January this year with 1.7.3. Now the same benchmark takes 0.047s. nycflights13 can be written to ODS in 15.4s (was 162s). In the previous post I said 2.1.0 should complete the round trip of writing and reading nycflights13 in under 60s (was 201s). It now takes 46s.
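The speedup factors implied by these timings can be checked with a few lines of arithmetic. This is only a sketch: the numbers are the ones quoted above, and the variable names are made up for illustration.

```r
# Timings in seconds, as quoted in this post (variable names are illustrative).
small_before <- 46.4    # benchmark with 1.7.3
small_now    <- 0.047   # same benchmark with 2.1.0
write_before <- 162     # writing nycflights13, earlier figure
write_now    <- 15.4    # writing nycflights13 with 2.1.0
trip_before  <- 201     # write + read round trip, earlier figure
trip_now     <- 46      # round trip with 2.1.0

speedup <- c(
  benchmark  = small_before / small_now,
  write      = write_before / write_now,
  round_trip = trip_before / trip_now
)
round(speedup, 1)
```

The first ratio comes out just under 1000, which is where the “almost 1000x” figure comes from; the other two are roughly 10x and 4x.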

Third, UTF-8 safe. Because of cpp11’s “UTF-8 everywhere” policy, the C++ version of ODS writing mechanism is truly UTF-8 safe. Now, we can make sure the writing mechanism works in R < 4.2 on Windows (where UTF-8 is not the default). The CI infrastructure regularly checks to make sure R 3.6 on Windows is supported. This might look insignificant as most of us are probably using R >= 4.2 and/or non-Windows OSes. But there are still many workers in heavily regulated computing environments (such as various governments around the world) where R < 4.2 on Windows is still the standard. As this package is used by various governments’ Open Data initiatives, this is a boon.

While rewriting write_ods, we also added the functionality to write Flat ODS. So now you can read and write FODS on 2.1!
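A minimal round-trip sketch for Flat ODS, assuming readODS >= 2.1.0 is installed and using write_fods() and read_fods() with their default arguments:

```r
# A minimal sketch, assuming readODS >= 2.1.0 is installed.
temp_fods <- tempfile(fileext = ".fods")

# Write a data frame as Flat ODS (a single XML file rather than a zip archive).
readODS::write_fods(iris, temp_fods)

# Read it back with the dedicated FODS reader.
x <- readODS::read_fods(temp_fods)
dim(x)
```

With the defaults, row names are not written, so the result should have the same number of rows and columns as the input.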

Writing a list of data frames into the same (F)ODS file

This is a requested feature, and both writexl and openxlsx already have it. Now you can write a list of data frames into the same file:

temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(list("flower_data" = iris, "car_data" = mtcars), temp_ods)
readODS::list_ods_sheets(temp_ods) # two sheets

Correct `col_types` support

This one was fixed because of the not-so-positive review of the 1.x series of readODS by the fellow social scientist Dr Didier Ruedin at the University of Neuchâtel, Switzerland.

Of course, the speed issue mentioned in his review has now been fixed. But the issue related to col_types remained in 2.0.x. So now, you can use col_types the same way as in readr. It can now accept either a compact character string or a named list. ²

readODS::read_ods("starwars.ods", col_types = "??f")
readODS::read_ods("starwars.ods", col_types = list(species = "f"))

Correct reading of “column names only” ODS

This one is potentially breaking, but we must make readODS compatible with all other data reading functions.

In the previous version, this returned a single-row data.frame:

temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(mtcars[0,], temp_ods)

x <- readODS::read_ods(temp_ods, col_names = TRUE)

Whereas this (readr) returned a zero-row data.frame:

temp_csv <- tempfile(fileext = ".csv")
write.csv(mtcars[0,], temp_csv)

x <- readr::read_csv(temp_csv, col_names = TRUE)

In 2.1.0, read_ods behaves the same way as readr::read_csv, data.table::fread, utils::read.csv, readxl::read_xlsx, and openxlsx::read.xlsx regarding data files with just column names when col_names / header / colNames is TRUE: it will return a zero-row data frame.
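The convention that read_ods now follows can be seen with base R alone. The sketch below uses utils::read.csv, one of the functions listed above, on a file that contains only column names; row.names = FALSE is my own choice here to keep the header identical to the column names of mtcars.

```r
# Write a header-only CSV: the column names of mtcars, no data rows.
temp_csv <- tempfile(fileext = ".csv")
write.csv(mtcars[0, ], temp_csv, row.names = FALSE)

# Reading it back with header = TRUE gives a zero-row data frame
# that still carries all the column names.
x <- utils::read.csv(temp_csv, header = TRUE)
nrow(x)
names(x)
```

According to the description above, the equivalent readODS::read_ods() call on a “column names only” ODS file has the same shape of result in 2.1.0.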

If you need the old behavior of 2.0.0, you can set this option:

options("readODS.v200" = TRUE)
readODS::read_ods(readODS::write_ods(mtcars[0,]), col_names = TRUE)

We will remove this option (readODS.v200) in version 3.0.0.

Various minor fixes

ods_sheets() will not be removed, as long as readxl::excel_sheets() exists.
Several bug fixes and cleanups.

Roadmap to 2.2

With the entire package now completely rewritten in C++, I believe there is not much room left for massive speed improvements (although some are still possible). I think I can consider Projekt 71 officially finished!

From here, I need to separate my own roadmap from the roadmap of the project (readODS). To be honest, developing readODS is not on my roadmap, at least for the next few months. Speaking of open source development, I will work on decluttering rio with David Schoch ³ and ship rio v1.0.0. I also hope to ship rang v0.3.0 in Q4. The priority will be rio > rang > readODS in the next few months. Speaking of rio, I need your opinion on this: rio should support at least one open standard out of the box, so what should it be? Apache Parquet (arrow) or ODS (readODS)?

For the project, version 2.2 would probably focus on quality of life improvements. The most important feature would be dealing with ODS and FODS with the same function, rather than two different functions. It would make read_ods() more like read_excel(). Hopefully, it will be released in 2024 Q1.

Acknowledgment

I would like to thank Peter Brohan and Jenny Bryan for the valuable discussions during the development cycle of 2.1.0.

¹ stringi, a dependency, actually has one optional system dependency (libicu-dev). If the system dependency is not found, it will get downloaded from the internet.
² But unfortunately, “-” is still not supported due to this issue.
³ I am extremely grateful for his help. But at times I am terribly sorry to drag him into this crazy chaotic codebase.