ods_sheetsbecause of reverse dependencies. Please tell us what you think about
readODS 2.0 will be released on CRAN pretty soon and we are finalizing it. It will be the second release under the rOpenSci moniker. If you want to try
readODS 2.0 now:
install.packages("readODS", repos = "https://ropensci.r-universe.dev")
Almost all of the improvements were contributed by Peter Brohan. Thank you, Peter Brohan!
Here are what you will see in 2.0.0.
read_ods is now completely rewritten in C++ / rapidxml. The rewritten
read_ods is many times faster than the pure R implementation in 1.8.0 or below. I can’t even accurately benchmark the speed improvement.
The standard benchmark test is to read the file
issue81.ods (334K, 5,539 × 11). It is by no means a large file, but certainly a “regular” data size.
readODS 1.7.3 took around 9s to read this file.
readODS 2.0.0 takes 0.5s.
Speed is important, but perhaps not the most important. The most important improvement is that large ODS files are now readable in a reasonable time frame.
nycflights13::flights (336,776 × 19) and
Lahman::Batting (112,184 × 22) are often used in demos of big data analysis, e.g. sparklyr. These two datasets are not exactly “big”, because those two data frame can probably live happily in RAM. For example, the object size of
Lahman::Batting is just 11MB, which is like nothing.
However, the two datasets are big enough to choke many other systems. Even LibreOffice Calc takes a few seconds to read the data.
readODS <= 1.8 took like over an hour to read
Ladman::Batting and gave you nothing. I don’t even want to try the much bigger
readODS <= 1.8.
nycflights13::flights (as ODS files) can be read by
readODS::read_ods in 5s and 14s respectively. For your comparison, the equivalent Python library (
pandas_ods_reader, which Julia’s
OdsIO.jl is also based on) took 16s and 41s respectively. I observe almost always this 3x performance of
The infamous “issue 71 file”
jts0501.ods could not read by
readODS <= 1.8.0. It exhausted 50G of Ram and gave you nothing. Now, the file can be read in two minutes using my little laptop with 16G ram! And it also makes
The following are how I did the benchmarks
## Rscript -e "readODS::write_ods(Lahman::Batting, 'Batting.ods')" time Rscript -e "readODS::read_ods('Batting.ods')" time python -c "from pandas_ods_reader import read_ods; read_ods('Batting.ods')"
## Rscript -e "readODS::write_ods(nycflights13::flights, 'flights.ods')" time Rscript -e "readODS::read_ods('flights.ods')" time python -c "from pandas_ods_reader import read_ods; read_ods('flights.ods')"
## wget https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/853155/jts0501.ods time Rscript -e "readODS::read_ods('jts0501.ods')" time python -c "from pandas_ods_reader import read_ods; read_ods('jts0501.ods')" ## die
Also thanks to the C++ rewrite contributed by Peter Brohan,
read_ods can read Flat ODS (
fods is similar to (but not entirely the same as) ODS, but without the Zip compression. The new function
read_fods can be used.
read_ods is now similar to
readxl::read_excel and by default, the output data.frame is a tibble. tibble is nicer to display and therefore should be the default. It also doesn’t introduce any additional dependency as
readODS depends on
readr since version 1.7.3 and installing
readr will install
Of course, the most obvious issue that might break existing code is the difference in the square bracket operation. By default, using square bracket with a tibble to get one column still gives a tibble, i.e.
drop = FALSE; while the vanilla data frame gives a vector, i.e.
drop = TRUE. Concretely:
mtcars[, "mpg"] ## is a vector mtcars[, "mpg", drop = FALSE] ## is a df tibble::tibble(mtcars)[, "mpg"] ## is still a tibble / df tibble::tibble(mtcars)[, "mpg", drop = TRUE] ## consistent behavior mtcars$mpg tibble::tibble(mtcars)$mpg
Don’t like that?
read_ods(as_tibble = FALSE).
For developers who use
read_ods, they can have two choices to make the code supporting
read_ods >=1.7.x and >=2.0.0:
read_odsand then use square brackets to subset a single column as a vector, please don’t use
as_tibble = FALSE(as
read_odsv1 does not have this parameter). Instead, you should specify
drop = TRUEin the square bracket operation, or
x <- read_ods(file_name)to
x <- as.data.frame(read_ods(file_name)); if you don’t want to edit every square bracket operation.
write_ods introduced in 1.8.0 (contributed by Dr Detlef Steuer) is significantly improved.
Compared with 1.8.0 on CRAN,
write_ods is now using
zip (the R package) instead of
utils::zip to make it truly cross-platform (previous versions of
readODS::write_ods generated no file on Windows systems without RTools, for example). Also, a temporary fix was introduced to fix an issue related to non-UTF8 locale.
The output from
write_ods was tested against the “big three” applications - LibreOffice Calc, Microsoft Excel, and Google Sheets, to make sure the generated files can be opened by these applications.
The NA behavior of v1
write_ods was erratic. For instance, the NA values of a numeric column in a generated ODS file were displayed as 0 in LibreOffice Calc. There is no concept of NA in the original OASIS definition of ODS.
Following the way
writexl handling NA,
write_ods is now writing NA values as blank cells. If you don’t like this, you can change
na_as_string = TRUE so that NA values are displayed as the string value “NA”. But this would have a side effect of reading the file by
read_ods that column would be coerced as
characters by default.
read_odsaccepts sheet name, e.g. “Sheet1!A2:B7”.
FALSE) to control data access from external sources.
vctrs::vec_as_names()to do name repairs.
In version 1.7.0, we announced that several functions will be removed. Now, these functions have been officially removed:
The Vignette was updated to give suggestions on what should be used instead in >= 2.0.0. For all existing CRAN packages using any of these removed functions, pull requests to the respective projects have been provided.
Now, the API of
readODS should be considered stable and we will not introduce any breaking change until version 3.0.0. All changes to the v2 series will be incremental. Before talking about the roadmap, we need to talk about the “roundtrip” example of
This is the roundtrip example in the readme of
writexl, another rOpenSci project:
library(nycflights13) out <- readxl::read_xlsx(writexl::write_xlsx(flights)) all.equal(out, flights)
The roundtrip of writing an Excel file
writexml::write_xlsx and then reading it back using
readxl::read_xlsx takes 8s on my machine. For the third line: Whether the column “time_hour” is all equal is actually time zone dependent. But all other columns are equal. In v2.0.0, we modified the API of
write_ods to allow the same roundtrip.
library(nycflights13) out <- readODS::read_ods(readODS::write_ods(flights)) all.equal(out, flights)
And it took 201s. Reading the file with
read_ods only took 14s, as noted previously. So, it took around three minutes to write the ODS file.
write_ods, despite 100x faster than it was in version <= 1.7.3, is relatively slower than
writexl::write_xlsx. The major task for 2.1.0 will be rewriting the simple ODS writing algorithm provided by Dr Steuer to C++. I myself don’t need the combo of
readODS::write_ods to have the same superb performance of
writexl::write_xlsx. But I hope that we can bring this roundtrip to less than 60s. Rewriting
write_ods in C++ would also be helpful to solve the lingering issues on non UTF-8 systems, e.g. R < 4.2 on Windows.
The major work will also be done to
write_ods. From this roundtrip example, we know that date columns are handled differently between
readxl. In 2.1.0, we will try to make them compatible. Writing
fods is a feature requested by the community. We will try to implement it as well.
I would like to thank Peter Brohan again for his help to make
readODS 2.0.0 possible.