Previously on this blog: 1.7.0 / 1.8.0
ods_sheets
because of reverse dependencies. Please tell us what you think about ods_sheets
on GitHub.install.packages("readODS")
readODS
2.0 will be released on CRAN pretty soon and we are finalizing it. It will be the second release under the rOpenSci moniker. If you want to try readODS
2.0 now:
remotes::install_github("ropensci/readODS")
or
install.packages("readODS", repos = "https://ropensci.r-universe.dev")
Almost all of the improvements were contributed by Peter Brohan. Thank you, Peter Brohan!
Here are what you will see in 2.0.0.
read_ods
read_ods
is now completely rewritten in C++ / rapidxml. The rewritten read_ods
is many times faster than the pure R implementation in 1.8.0 or below. I can’t even accurately benchmark the speed improvement.
The standard benchmark test is to read the file issue81.ods
(334K, 5,539 × 11). It is by no means a large file, but certainly a “regular” data size. readODS
1.7.3 took around 9s to read this file. readODS
2.0.0 takes 0.5s.
Speed is important, but perhaps not the most important. The most important improvement is that large ODS files are now readable in a reasonable time frame.
nycflights13::flights
(336,776 × 19) and Lahman::Batting
(112,184 × 22) are often used in demos of big data analysis, e.g. sparklyr. These two datasets are not exactly “big”, because those two data frame can probably live happily in RAM. For example, the object size of Lahman::Batting
is just 11MB, which is like nothing.
However, the two datasets are big enough to choke many other systems. Even LibreOffice Calc takes a few seconds to read the data. readODS
<= 1.8 took like over an hour to read Ladman::Batting
and gave you nothing. I don’t even want to try the much bigger nycflights13::flights
with readODS
<= 1.8.
Now, Ladman::Batting
and nycflights13::flights
(as ODS files) can be read by readODS::read_ods
in 5s and 14s respectively. For your comparison, the equivalent Python library (pandas_ods_reader
, which Julia’s OdsIO.jl
is also based on) took 16s and 41s respectively. I observe almost always this 3x performance of readODS::read_ods
versus pandas_ods_reader
.
The infamous “issue 71 file” jts0501.ods
could not read by readODS
<= 1.8.0. It exhausted 50G of Ram and gave you nothing. Now, the file can be read in two minutes using my little laptop with 16G ram! And it also makes readODS
the only game out of the big three interpreted languages (R, Python, Javascript) which can read that ODS file. So, basically Peter Brohan tackled the Projekt 71!
The following are how I did the benchmarks
## Rscript -e "readODS::write_ods(Lahman::Batting, 'Batting.ods')"
time Rscript -e "readODS::read_ods('Batting.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('Batting.ods')"
## Rscript -e "readODS::write_ods(nycflights13::flights, 'flights.ods')"
time Rscript -e "readODS::read_ods('flights.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('flights.ods')"
## wget https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/853155/jts0501.ods
time Rscript -e "readODS::read_ods('jts0501.ods')"
time python -c "from pandas_ods_reader import read_ods; read_ods('jts0501.ods')" ## die
Also thanks to the C++ rewrite contributed by Peter Brohan, read_ods
can read Flat ODS (fods
). fods
is similar to (but not entirely the same as) ODS, but without the Zip compression. The new function read_fods
can be used.
read_ods
is now similar to readxl::read_excel
and by default, the output data.frame is a tibble. tibble is nicer to display and therefore should be the default. It also doesn’t introduce any additional dependency as readODS
depends on readr
since version 1.7.3 and installing readr
will install tibble
anyway.
Of course, the most obvious issue that might break existing code is the difference in the square bracket operation. By default, using square bracket with a tibble to get one column still gives a tibble, i.e. drop = FALSE
; while the vanilla data frame gives a vector, i.e. drop = TRUE
. Concretely:
mtcars[, "mpg"] ## is a vector
mtcars[, "mpg", drop = FALSE] ## is a df
tibble::tibble(mtcars)[, "mpg"] ## is still a tibble / df
tibble::tibble(mtcars)[, "mpg", drop = TRUE]
## consistent behavior
mtcars$mpg
tibble::tibble(mtcars)$mpg
Don’t like that? read_ods(as_tibble = FALSE)
.
For developers who use read_ods
, they can have two choices to make the code supporting read_ods
>=1.7.x and >=2.0.0:
read_ods
and then use square brackets to subset a single column as a vector, please don’t use as_tibble = FALSE
(as read_ods
v1 does not have this parameter). Instead, you should specify drop = TRUE
in the square bracket operation, orx <- read_ods(file_name)
to x <- as.data.frame(read_ods(file_name))
; if you don’t want to edit every square bracket operation.write_ods
The new write_ods
introduced in 1.8.0 (contributed by Dr Detlef Steuer) is significantly improved.
Compared with 1.8.0 on CRAN, write_ods
is now using zip
(the R package) instead of utils::zip
to make it truly cross-platform (previous versions of readODS::write_ods
generated no file on Windows systems without RTools, for example). Also, a temporary fix was introduced to fix an issue related to non-UTF8 locale.
The output from write_ods
was tested against the “big three” applications - LibreOffice Calc, Microsoft Excel, and Google Sheets, to make sure the generated files can be opened by these applications.
writexl
The NA behavior of v1 write_ods
was erratic. For instance, the NA values of a numeric column in a generated ODS file were displayed as 0 in LibreOffice Calc. There is no concept of NA in the original OASIS definition of ODS.
Following the way writexl
handling NA, write_ods
is now writing NA values as blank cells. If you don’t like this, you can change na_as_string = TRUE
so that NA values are displayed as the string value “NA”. But this would have a side effect of reading the file by read_ods
that column would be coerced as characters
by default.
range
parameter of read_ods
accepts sheet name, e.g. “Sheet1!A2:B7”.include_external_data
to read_ods
(by default FALSE
) to control data access from external sources.check_names
argument of read_ods
and use vctrs::vec_as_names()
to do name repairs.In version 1.7.0, we announced that several functions will be removed. Now, these functions have been officially removed:
read.ods
get_num_sheets_in_ods
getNrOfSheetsInODS
ods_sheets
The Vignette was updated to give suggestions on what should be used instead in >= 2.0.0. For all existing CRAN packages using any of these removed functions, pull requests to the respective projects have been provided.
Now, the API of readODS
should be considered stable and we will not introduce any breaking change until version 3.0.0. All changes to the v2 series will be incremental. Before talking about the roadmap, we need to talk about the “roundtrip” example of writexl
.
This is the roundtrip example in the readme of writexl
, another rOpenSci project:
library(nycflights13)
out <- readxl::read_xlsx(writexl::write_xlsx(flights))
all.equal(out, flights)
The roundtrip of writing an Excel file writexml::write_xlsx
and then reading it back using readxl::read_xlsx
takes 8s on my machine. For the third line: Whether the column “time_hour” is all equal is actually time zone dependent. But all other columns are equal. In v2.0.0, we modified the API of write_ods
to allow the same roundtrip.
library(nycflights13)
out <- readODS::read_ods(readODS::write_ods(flights))
all.equal(out, flights)
And it took 201s. Reading the file with read_ods
only took 14s, as noted previously. So, it took around three minutes to write the ODS file. write_ods
, despite 100x faster than it was in version <= 1.7.3, is relatively slower than writexl::write_xlsx
. The major task for 2.1.0 will be rewriting the simple ODS writing algorithm provided by Dr Steuer to C++. I myself don’t need the combo of readODS::read_ods
and readODS::write_ods
to have the same superb performance of readxl::read_xlsx
and writexl::write_xlsx
. But I hope that we can bring this roundtrip to less than 60s. Rewriting write_ods
in C++ would also be helpful to solve the lingering issues on non UTF-8 systems, e.g. R < 4.2 on Windows.
The major work will also be done to write_ods
. From this roundtrip example, we know that date columns are handled differently between readODS
and writexl
/ readxl
. In 2.1.0, we will try to make them compatible. Writing fods
is a feature requested by the community. We will try to implement it as well.
I would like to thank Peter Brohan again for his help to make readODS
2.0.0 possible.