Werden wir Helden für einen Tag

Home | About | Archive

readODS 2.0.0

Posted on Aug 6, 2023 by Chung-hong Chan

Previously on this blog: 1.7.0 / 1.8.0

Updates


readODS 2.0 will be released on CRAN pretty soon and we are finalizing it. It will be the second release under the rOpenSci moniker. If you want to try readODS 2.0 now:

remotes::install_github("ropensci/readODS")

or

install.packages("readODS", repos = "https://ropensci.r-universe.dev")

Almost all of the improvements were contributed by Peter Brohan. Thank you, Peter Brohan!

Here are what you will see in 2.0.0.

New features

A complete C++ rewrite of read_ods

read_ods is now completely rewritten in C++ / rapidxml. The rewritten read_ods is many times faster than the pure R implementation in 1.8.0 or below. I can’t even accurately benchmark the speed improvement.

The standard benchmark test is to read the file issue81.ods (334K, 5,539 × 11). It is by no means a large file, but certainly a “regular” data size. readODS 1.7.3 took around 9s to read this file. readODS 2.0.0 takes 0.5s.

Speed is important, but perhaps not the most important. The most important improvement is that large ODS files are now readable in a reasonable time frame.

nycflights13::flights (336,776 × 19) and Lahman::Batting (112,184 × 22) are often used in demos of big data analysis, e.g. sparklyr. These two datasets are not exactly “big”, because those two data frame can probably live happily in RAM. For example, the object size of Lahman::Batting is just 11MB, which is like nothing.

However, the two datasets are big enough to choke many other systems. Even LibreOffice Calc takes a few seconds to read the data. readODS <= 1.8 took like over an hour to read Ladman::Batting and gave you nothing. I don’t even want to try the much bigger nycflights13::flights with readODS <= 1.8.

Now, Ladman::Batting and nycflights13::flights (as ODS files) can be read by readODS::read_ods in 5s and 14s respectively. For your comparison, the equivalent Python library (pandas_ods_reader, which Julia’s OdsIO.jl is also based on) took 16s and 41s respectively. I observe almost always this 3x performance of readODS::read_ods versus pandas_ods_reader.

The infamous “issue 71 file” jts0501.ods could not read by readODS <= 1.8.0. It exhausted 50G of Ram and gave you nothing. Now, the file can be read in two minutes using my little laptop with 16G ram! And it also makes readODS the only game out of the big three interpreted languages (R, Python, Javascript) which can read that ODS file. So, basically Peter Brohan tackled the Projekt 71!

The following are how I did the benchmarks

## Rscript -e "readODS::write_ods(Lahman::Batting, 'Batting.ods')"
time Rscript -e "readODS::read_ods('Batting.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('Batting.ods')"
## Rscript -e "readODS::write_ods(nycflights13::flights, 'flights.ods')"
time Rscript -e "readODS::read_ods('flights.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('flights.ods')"
## wget https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/853155/jts0501.ods
time Rscript -e "readODS::read_ods('jts0501.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('jts0501.ods')" ## die

Support for reading Flat ODS

Also thanks to the C++ rewrite contributed by Peter Brohan, read_ods can read Flat ODS (fods). fods is similar to (but not entirely the same as) ODS, but without the Zip compression. The new function read_fods can be used.

tibble is now the default

read_ods is now similar to readxl::read_excel and by default, the output data.frame is a tibble. tibble is nicer to display and therefore should be the default. It also doesn’t introduce any additional dependency as readODS depends on readr since version 1.7.3 and installing readr will install tibble anyway.

Of course, the most obvious issue that might break existing code is the difference in the square bracket operation. By default, using square bracket with a tibble to get one column still gives a tibble, i.e. drop = FALSE; while the vanilla data frame gives a vector, i.e. drop = TRUE. Concretely:

mtcars[, "mpg"] ## is a vector
mtcars[, "mpg", drop = FALSE] ## is a df

tibble::tibble(mtcars)[, "mpg"] ## is still a tibble / df
tibble::tibble(mtcars)[, "mpg", drop = TRUE]

## consistent behavior
mtcars$mpg
tibble::tibble(mtcars)$mpg

Don’t like that? read_ods(as_tibble = FALSE).

For developers who use read_ods, they can have two choices to make the code supporting read_ods >=1.7.x and >=2.0.0:

  1. If you use read_ods and then use square brackets to subset a single column as a vector, please don’t use as_tibble = FALSE (as read_ods v1 does not have this parameter). Instead, you should specify drop = TRUE in the square bracket operation, or
  2. A lazier method is to change x <- read_ods(file_name) to x <- as.data.frame(read_ods(file_name)); if you don’t want to edit every square bracket operation.

Improved stability of write_ods

The new write_ods introduced in 1.8.0 (contributed by Dr Detlef Steuer) is significantly improved.

Compared with 1.8.0 on CRAN, write_ods is now using zip (the R package) instead of utils::zip to make it truly cross-platform (previous versions of readODS::write_ods generated no file on Windows systems without RTools, for example). Also, a temporary fix was introduced to fix an issue related to non-UTF8 locale.

The output from write_ods was tested against the “big three” applications - LibreOffice Calc, Microsoft Excel, and Google Sheets, to make sure the generated files can be opened by these applications.

NA behavior is now consistent with writexl

The NA behavior of v1 write_ods was erratic. For instance, the NA values of a numeric column in a generated ODS file were displayed as 0 in LibreOffice Calc. There is no concept of NA in the original OASIS definition of ODS.

Following the way writexl handling NA, write_ods is now writing NA values as blank cells. If you don’t like this, you can change na_as_string = TRUE so that NA values are displayed as the string value “NA”. But this would have a side effect of reading the file by read_ods that column would be coerced as characters by default.

Small improvements

Deprecated functions

In version 1.7.0, we announced that several functions will be removed. Now, these functions have been officially removed:

The Vignette was updated to give suggestions on what should be used instead in >= 2.0.0. For all existing CRAN packages using any of these removed functions, pull requests to the respective projects have been provided.

Roadmap to 2.1

Now, the API of readODS should be considered stable and we will not introduce any breaking change until version 3.0.0. All changes to the v2 series will be incremental. Before talking about the roadmap, we need to talk about the “roundtrip” example of writexl.

This is the roundtrip example in the readme of writexl, another rOpenSci project:

library(nycflights13)
out <- readxl::read_xlsx(writexl::write_xlsx(flights))
all.equal(out, flights)

The roundtrip of writing an Excel file writexml::write_xlsx and then reading it back using readxl::read_xlsx takes 8s on my machine. For the third line: Whether the column “time_hour” is all equal is actually time zone dependent. But all other columns are equal. In v2.0.0, we modified the API of write_ods to allow the same roundtrip.

library(nycflights13)
out <- readODS::read_ods(readODS::write_ods(flights))
all.equal(out, flights)

And it took 201s. Reading the file with read_ods only took 14s, as noted previously. So, it took around three minutes to write the ODS file. write_ods, despite 100x faster than it was in version <= 1.7.3, is relatively slower than writexl::write_xlsx. The major task for 2.1.0 will be rewriting the simple ODS writing algorithm provided by Dr Steuer to C++. I myself don’t need the combo of readODS::read_ods and readODS::write_ods to have the same superb performance of readxl::read_xlsx and writexl::write_xlsx. But I hope that we can bring this roundtrip to less than 60s. Rewriting write_ods in C++ would also be helpful to solve the lingering issues on non UTF-8 systems, e.g. R < 4.2 on Windows.

The major work will also be done to write_ods. From this roundtrip example, we know that date columns are handled differently between readODS and writexl / readxl. In 2.1.0, we will try to make them compatible. Writing fods is a feature requested by the community. We will try to implement it as well.

Acknowledgment

I would like to thank Peter Brohan again for his help to make readODS 2.0.0 possible.


Powered by Jekyll and profdr theme