Previously on this blog: 1.7.0 / 1.8.0 / 2.0.0
readODS 2.1.0 is now on CRAN. This is the third release under the rOpenSci moniker.
To install it from CRAN:
install.packages("readODS")
The first thing you might notice is that it comes just one month after the last release. That’s right. Within just one month, we made a lot of progress. In my opinion, this is probably the real “stable” release of the v2 series. The nice thing about 2.1 is that you probably will not see notable differences (except maybe one point, see below). Just like with 2.0.0, what you will see is mostly performance increases, sometimes dramatic ones. Let’s dive into it.
Removal of xml2; complete C++ rewrite of write_(f)ods
readODS was completely rewritten in 2016 to use xml2 (the previous versions used XML). Following the trend, started in 2.0.0, of rewriting xml2 code in C++ with RapidXML, 2.1.0 is the first version in which xml2 is no longer a dependency. write_ods (namely the part that updates and appends sheets) was the last part in 2.0.0 that needed xml2. This part, along with the entire ODS writing algorithm, has been rewritten in C++ with RapidXML.
This is good in three ways. First, readODS no longer has any required system dependencies [1]. You don’t need to install libxml2-dev on Ubuntu, for example. On a blank-slate Rocker (R container), you can install readODS 2.1.0 right away.
Second, massive speed improvement. As said previously, the ODS writing algorithm by Dr Detlef Steuer implemented in 1.8.0 was 100x faster than 1.7.0. Now, the C++ rewrite of the same algorithm in 2.1.0 is 10x faster than 1.8.0. Therefore, 2.1.0 is almost 1000x faster than 1.7.0! Don’t believe it? I have a benchmark. It took 46.4s in January this year with 1.7.3. Now the same benchmark takes 0.047s. nycflights13 can be written to ODS in 15.4s (was 162s). In the previous post I said 2.1.0 should complete the round trip of writing and reading nycflights13 in under 60s (was 201s). Now it takes 46s.
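As a quick sanity check, the speedups implied by the timings quoted above can be computed directly. This is only arithmetic on the numbers in this post; the variable names are made up for illustration.

```r
# Timings quoted in this post, in seconds
benchmark_old <- 46.4    # benchmark with 1.7.3
benchmark_new <- 0.047   # same benchmark with 2.1.0
write_old     <- 162     # writing nycflights13, before
write_new     <- 15.4    # writing nycflights13, with 2.1.0
roundtrip_old <- 201     # write + read round trip, before
roundtrip_new <- 46      # write + read round trip, with 2.1.0

speedup_benchmark <- benchmark_old / benchmark_new
speedup_write     <- write_old / write_new
speedup_roundtrip <- roundtrip_old / roundtrip_new

round(speedup_benchmark)     # 987, i.e. "almost 1000x"
round(speedup_write, 1)      # 10.5
round(speedup_roundtrip, 1)  # 4.4
```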
Third, UTF-8 safety. Because of cpp11’s “UTF-8 everywhere” policy, the C++ version of the ODS writing mechanism is truly UTF-8 safe. Now, we can make sure the writing mechanism works in R < 4.2 on Windows (where UTF-8 is not the default). The CI infrastructure regularly checks to make sure R 3.6 on Windows is supported. This might look insignificant, as most of us are probably using R >= 4.2 and/or non-Windows OSes. But there are still many workers in heavily regulated computing environments (such as various governments around the world) where R < 4.2 on Windows is still the standard. As this package is used by various governments’ Open Data initiatives, this is a boon.
While rewriting write_ods, we also added the functionality to write Flat ODS. So now you can both read and write FODS with 2.1!
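A minimal sketch of a Flat ODS round trip, assuming readODS >= 2.1.0 is installed and using write_fods() and read_fods() as the FODS counterparts of write_ods() and read_ods(). The choice of iris as the data set is only for illustration.

```r
# Write a data frame to a Flat ODS file and read it back.
# Assumes readODS >= 2.1.0 is installed.
temp_fods <- tempfile(fileext = ".fods")
readODS::write_fods(iris, temp_fods)

iris_back <- readODS::read_fods(temp_fods)
dim(iris_back)  # should match dim(iris) if the round trip worked
```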
Writing a list of data frames was a requested feature, and both writexl and openxlsx already offer it. Now you can write a list of data frames into the same file:
temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(list("flower_data" = iris, "car_data" = mtcars), temp_ods)
readODS::list_ods_sheets(temp_ods) # two sheets
col_types support
This one was fixed because of the not-so-positive review of the 1.x series of readODS by fellow social scientist Dr Didier Ruedin at the University of Neuchâtel, Switzerland.
Of course, the speed issue mentioned in his review has now been fixed. But the issue related to col_types remained in 2.0.x. So now, you can use col_types the same way as in readxl. It accepts either a character string or a list [2].
readODS::read_ods("starwars.ods", col_types = "??f")
readODS::read_ods("starwars.ods", col_types = list(species = "f"))
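The two calls above assume a file named starwars.ods already exists. A self-contained sketch of the same idea, assuming readODS >= 2.1.0 and using iris purely as a stand-in data set, could look like this:

```r
# Write a small file first so the example does not depend on starwars.ods.
temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(iris, temp_ods)

# Shorthand string: one character per column; "?" means guess, "f" means factor.
x1 <- readODS::read_ods(temp_ods, col_types = "????f")

# Named list: only the named column is overridden, the rest are guessed.
x2 <- readODS::read_ods(temp_ods, col_types = list(Species = "f"))

is.factor(x1$Species)
is.factor(x2$Species)
```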
This change concerns files that contain only column names. It is potentially breaking, but we must make readODS behave consistently with all other data reading functions.
In the previous version, this returned a single-row data.frame:
temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(mtcars[0,], temp_ods)
x <- readODS::read_ods(temp_ods, col_names = TRUE)
Whereas this (readr) returned a zero-row data.frame:
temp_csv <- tempfile(fileext = ".csv")
write.csv(mtcars[0,], temp_csv)
x <- readr::read_csv(temp_csv, col_names = TRUE)
In 2.1.0, read_ods behaves the same way as readr::read_csv, data.table::fread, utils::read.csv, readxl::read_xlsx, and openxlsx::read.xlsx regarding data files with just column names when col_names / header / colNames is TRUE: it will return a zero-row data frame.
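The new behavior can be checked with a short sketch like the following, assuming readODS >= 2.1.0 and that the readODS.v200 option has not been set in the session:

```r
# A file that contains only column names.
temp_ods <- tempfile(fileext = ".ods")
readODS::write_ods(mtcars[0, ], temp_ods)

x <- readODS::read_ods(temp_ods, col_names = TRUE)
nrow(x)   # 0 under the 2.1.0 behavior described above
names(x)  # the column names are still read
```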
If you need the old behavior of 2.0.0, you can set this option:
options("readODS.v200" = TRUE)
read_ods(write_ods(mtcars[0,]), col_names = TRUE)
We will remove this option (readODS.v200) in version 3.0.0.
ods_sheets() will not be removed, as long as readxl::excel_sheets() exists.
With the entire package now completely rewritten in C++, I believe there is not much room left for massive speed improvements (although they are still possible). I think I can consider Projekt 71 officially finished!
From here, I need to separate my own roadmap from the project’s (readODS) roadmap. To be honest, developing readODS is not on my roadmap, at least for the next few months. Speaking of open source development, I will work on decluttering rio with David Schoch [3] and ship rio v1.0.0. I also hope to ship rang v0.3.0 in Q4. The priority will be rio > rang > readODS in the next few months. Speaking of rio, I need your opinion on this: rio should support at least one open standard out of the box, so which should it be? Apache Parquet (arrow) or ODS (readODS)?
For the project, version 2.2 will probably focus on quality-of-life improvements. The most important feature would be dealing with ODS and FODS with the same function, rather than two different functions. It would make read_ods() more like read_excel(). Hopefully, it will be released in 2024 Q1.
I would like to thank Peter Brohan and Jenny Bryan for the valuable discussions during the development cycle of 2.1.0.
[1] stringi, a dependency, actually has one optional system dependency (libicu-dev). If the system dependency is not found, it will be downloaded from the internet.
[2] Unfortunately, “-” is still not supported due to this issue.
[3] I am extremely grateful for his help. But at times I am terribly sorry to drag him into this crazy chaotic codebase.