readODS 2.1.0 is now on CRAN. This is the third release under the rOpenSci moniker.
To install it from CRAN, run install.packages("readODS").
The first thing you might notice is that it’s just one month after the last release. That’s right: within just one month, we made a lot of progress. In my opinion, this is probably the real “stable” release of the v2 series. The nice thing about 2.1 is that you probably will not see notable differences (except maybe one point, see below). Just like with 2.0.0, what you will mostly see are performance increases, sometimes dramatic ones. Let’s dive into it.
xml2; complete C++ rewrite of
readODS was completely rewritten in 2016 to use xml2 (the previous versions used XML). Following the trend, started in 2.0.0, of rewriting xml2 code in C++ with RapidXML, 2.1.0 is the first version in which xml2 is no longer a dependency.
write_ods (namely the part that updates and appends sheets) was the last part of 2.0.0 that needed xml2. This part, along with the entire ODS writing algorithm, has been rewritten in C++ with RapidXML.
This is good in three ways. First, readODS has no required system dependencies anymore 1. You don’t need to install libxml2-dev on Ubuntu, for example. On a blank-slate Rocker (R container), you can install readODS 2.1.0 right away.
Second, massive speed improvement. As said previously, the ODS writing algorithm by Dr Detlef Steuer implemented in 1.8.0 was 100x faster than 1.7.0. Now, the C++ rewrite of the same algorithm in 2.1.0 is 10x faster than 1.8.0. Therefore, 2.1.0 is almost 1000x faster than 1.7.0! Don’t believe it? I have a benchmark. It took 46.4s in January this year with 1.7.3. Now the same benchmark takes 0.047s.
nycflights13 can be written to ODS in 15.4s (was 162s). In the previous post I said 2.1.0 should complete the round trip of writing and reading nycflights13 in under 60s (was 201s). It now takes 46s.
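To put the timings quoted above side by side, here is a small arithmetic sketch. It only divides the numbers already given in this post. The helper `speedup()` and the variable names are made up for illustration and are not part of readODS.

```r
# Hypothetical helper: how many times faster is `after` compared to `before`?
speedup <- function(before, after) before / after

# Timings in seconds, copied from the text above
small_benchmark    <- speedup(46.4, 0.047) # 1.7.3 vs 2.1.0
flights_write      <- speedup(162, 15.4)   # writing nycflights13
flights_round_trip <- speedup(201, 46)     # writing and reading nycflights13

round(c(small_benchmark, flights_write, flights_round_trip), 1)
# 987.2 10.5 4.4
```

So the small benchmark is indeed close to 1000x faster, while writing the much larger nycflights13 data is roughly 10x faster.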
Third, UTF-8 safety. Because of cpp11’s “UTF-8 everywhere” policy, the C++ version of the ODS writing mechanism is truly UTF-8 safe. Now, we can make sure the writing mechanism works in R < 4.2 on Windows (where UTF-8 is not the default). The CI infrastructure regularly checks that R 3.6 on Windows is supported. This might look insignificant, as most of us are probably using R >= 4.2 and/or non-Windows OSes. But there are still many workers in heavily regulated computing environments (such as various governments around the world) where R < 4.2 on Windows is still the standard. As this package is used by various governments’ Open Data initiatives, this is a boon.
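A quick way to check the UTF-8 claim on your own machine is a round trip with non-ASCII text. This is only a sketch, assuming readODS >= 2.1.0 is installed. The data frame is made up, and the strings use `\u` escapes so the script itself stays plain ASCII on older Windows setups.

```r
# Made-up data with non-ASCII text, written with \u escapes
utf8_df <- data.frame(
  city = c("Z\u00fcrich", "Neuch\u00e2tel", "\u9999\u6e2f"),
  greeting = c("Gr\u00fcezi", "Bonjour", "\u4f60\u597d"),
  stringsAsFactors = FALSE
)

temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(utf8_df, temp_ods)
round_trip <- readODS::read_ods(temp_ods)

# Compare what was written with what was read back
identical(round_trip$city, utf8_df$city)
identical(round_trip$greeting, utf8_df$greeting)
```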
Alongside write_ods, we also added the ability to write Flat ODS (FODS). So now you can both read and write FODS with 2.1!
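As a minimal sketch of the FODS round trip, assuming readODS >= 2.1.0 is installed and that the Flat ODS functions are `write_fods()` and `read_fods()`:

```r
temp_fods <- tempfile(fileext = ".fods")

# Write the built-in iris data to a Flat ODS file, then read it back
readODS::write_fods(iris, temp_fods)
iris_back <- readODS::read_fods(temp_fods)

dim(iris_back)
```

A Flat ODS file is a single XML document rather than a zipped archive, so you can open it in a text editor or diff it.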
This was a requested feature, and other packages such as openxlsx already offer it. Now you can write a list of data frames into the same file:
    temp_ods <- tempfile(fileext = ".ods")
    readODS::write_ods(list("flower_data" = iris, "car_data" = mtcars), temp_ods)
    readODS::list_ods_sheets(temp_ods) # two sheets
This one was fixed because of the not-so-positive review of the 1.x series of readODS by the fellow social scientist Dr Didier Ruedin at the University of Neuchâtel, Switzerland.
Of course, the speed issue mentioned in his review has now been fixed. But the issue related to col_types remained in 2.0.x. So now, you can use col_types the same way as readxl. It can now accept either a character string or a list. 2
    readODS::read_ods("starwars.ods", col_types = "??f")
    readODS::read_ods("starwars.ods", col_types = list(species = "f"))
This one is potentially breaking, but we must make readODS compatible with all other data reading functions.
In the previous version, this returned a single-row data.frame:
    temp_ods <- tempfile(fileext = ".ods")
    readODS::write_ods(mtcars[0,], temp_ods)
    x <- readODS::read_ods(temp_ods, col_names = TRUE)
Whereas this (readr) returned a zero-row data.frame:
    temp_csv <- tempfile(fileext = ".csv")
    write.csv(mtcars[0,], temp_csv)
    x <- readr::read_csv(temp_csv, col_names = TRUE)
Now read_ods behaves the same way as openxlsx::read.xlsx regarding data files with just column names when col_names is TRUE: it returns a zero-row data frame.
If you need the old behavior of 2.0.0, you can set this option:
    options("readODS.v200" = TRUE)
    read_ods(write_ods(mtcars[0,]), col_names = TRUE)
We will remove this option (readODS.v200) in version 3.0.0.
ods_sheets() will not be removed, as long as
With the entire package now completely rewritten in C++, I believe there is not much room left for massive speed improvements (though some are still possible). I think I can consider Projekt 71 officially finished!
From here, I need to separate my own roadmap from the roadmap of the project (readODS). To be honest, developing readODS is not on my roadmap, at least for the next few months. Speaking of open source development, I will work on decluttering rio with David Schoch 3 and ship rio v1.0.0. I also hope to ship rang v0.3.0 in Q4. The priority will be readODS in the next few months. Speaking of rio, I need your opinion on this:
rio should support at least one open standard out of the box. What should it be: Apache Parquet (arrow) or ODS (readODS)?
For the project, version 2.2 will probably focus on quality-of-life improvements. The most important feature would be handling ODS and FODS with the same function, rather than with two different functions. It would make read_ods() more like read_excel(). Hopefully, it will be released in 2024 Q1.
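Until then, a thin wrapper can give a similar experience. The sketch below is purely hypothetical and says nothing about how 2.2 will do it: `pick_reader()` and `read_any_ods()` are made-up names, and the wrapper simply dispatches on the file extension to the existing `read_ods()` and `read_fods()` functions.

```r
# Hypothetical helper (not part of readODS): map a file extension to the
# name of the existing reader function.
pick_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    ods = "read_ods",
    fods = "read_fods",
    stop("Unsupported file extension: ", ext)
  )
}

# Hypothetical wrapper: look up the chosen reader in readODS and call it,
# passing along any further arguments such as `sheet` or `col_types`.
read_any_ods <- function(path, ...) {
  reader <- getExportedValue("readODS", pick_reader(path))
  reader(path, ...)
}

pick_reader("budget.ods")
pick_reader("budget.FODS")
```

Dispatching on the extension is the simplest possible design. It would fail for files with a misleading or missing extension, which a real implementation would need to handle more carefully.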
I would like to thank Peter Brohan and Jenny Bryan for the valuable discussions during the development cycle of 2.1.0.
stringi, a dependency, actually has one optional system dependency (libicu-dev). If the system dependency is not found, it is downloaded from the internet.
I am extremely grateful for his help. But at times I am terribly sorry to drag him into this crazy chaotic codebase.