Werden wir Helden für einen Tag


Benchmark of R file formats using rio and friends

Posted on Sep 21, 2023 by Chung-hong Chan

Previous on this blog: 10 years of rio, rio 1.0.0

Jump to the backstory

rio 1.0.1 supports many formats. However, not all formats are created equal; there are “better” formats. This is not because rio supports some formats better. Instead, it is because the underlying R packages are more optimized, or because the file format itself is well designed. I will keep “better” / “good” in quotation marks throughout.

In the realm of rio, we can talk about how “good” a file format is by considering the following qualities:

  1. Speed of writing - Time needed for export()
  2. Speed of reading - Time needed for import()
  3. File size - Well, self-explanatory
  4. Accuracy - How accurate is the imported object? How far is it from the original object?

There are other qualities: How popular is the format? How many other software applications support it (interoperability)? How accepted is it (particularly important for governments)? How secure is it? These qualities, however, are difficult to benchmark. Therefore, I will focus on the four above.


How to measure the four qualities? I measured how much time it takes to export and import Lahman::Batting (112,164 × 22); the size of the written file; and how close the imported object is to the original Lahman::Batting. In code, it looks like this:

filename <- tempfile(fileext = paste0(".", format))
# Writing
bench::mark(rio::export(data_sample, filename))$median
# Reading
bench::mark(x <- rio::import(filename))$median
# File size
file.size(filename)
# Accuracy
length(waldo::compare(data_sample, x))

Please take the accuracy measurement with a large grain of salt, but it is probably the simplest way to “quantify” accuracy. Note that it gives an advantage to serialization formats (such as qs and rds).
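To see the kind of difference this accuracy measure picks up, here is a minimal base-R sketch (using write.csv() / read.csv() rather than rio, just to keep it self-contained): a Date column does not survive a csv round trip, and waldo::compare() would count a change like this as a difference.

```r
# A Date column is written to csv as text and read back as character:
# exactly the kind of round-trip loss the accuracy column counts
d <- data.frame(x = 1:3,
                when = as.Date(c("2023-09-21", "2023-09-22", "2023-09-23")))
f <- tempfile(fileext = ".csv")
write.csv(d, f, row.names = FALSE)
d2 <- read.csv(f)
class(d$when)   # "Date"
class(d2$when)  # "character"
```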

I will benchmark the following formats: csv, csvy, xlsx, r, xpt, sav, zsav, rds, rda, dta, dump, dbf, arff, parquet, fst, feather, json, matlab, qs, ods, fods, yaml.

As you might know, the csv functionality of rio is actually provided by data.table, and the xlsx functionality by readxl and writexl. I also threw in several non-rio packages: vroom (for reading and writing csv), readr (same; modern readr is actually built on top of vroom), and openxlsx (for xlsx; rio previously used it to export xlsx).

If you want to know why some formats were not tested, jump here.


I normalized the export time, import time, and file size by dividing each by the respective value for csv, so the values for csv are always 1. I then sorted the data by import time.
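The normalization step can be sketched like this (the numbers below are made up for illustration; they are not the benchmark results):

```r
# Hypothetical raw results: median times in seconds, size in bytes
res <- data.frame(
  format = c("csv", "feather", "parquet"),
  export = c(0.50, 0.45, 1.80),
  import = c(0.40, 0.12, 0.16),
  size   = c(8.0e6, 4.0e6, 2.4e6)
)
cols <- c("export", "import", "size")
baseline <- as.numeric(res[res$format == "csv", cols])
# Divide every column by the csv value, so csv becomes 1 everywhere
res[cols] <- sweep(as.matrix(res[cols]), 2, baseline, "/")
# Sort by import time
res <- res[order(res$import), ]
res
```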

| format | Export Time (Relative to csv) | Import Time (Relative to csv) | File Size (Relative to csv) | Accuracy |
|---|---|---|---|---|
| feather | 0.9 | 0.3 | 0.5 | 0 |
| parquet | 3.6 | 0.4 | 0.3 | 0 |
| qs | 1.7 | 0.7 | 0.2 | 0 |
| csv | 1.0 | 1.0 | 1.0 | 2 |
| csvy | 1.0 | 1.2 | 1.0 | 1 |
| vroom (not rio) | 5.7 | 2.0 | 1.0 | 21 |
| rda | 27.4 | 2.1 | 0.3 | 0 |
| rds | 27.5 | 2.1 | 0.3 | 0 |
| readr (not rio) | 5.7 | 5.3 | 1.0 | 21 |
| fst | 7.4 | 6.0 | 0.5 | 0 |
| dbf | 37.6 | 6.9 | 2.0 | 4 |
| dta | 10.3 | 14.6 | 1.6 | 22 |
| zsav | 20.9 | 16.5 | 0.3 | 22 |
| sav | 10.5 | 18.4 | 0.8 | 22 |
| xpt | 15.6 | 31.0 | 2.9 | 21 |
| xlsx | 141.4 | 36.3 | 1.3 | 21 |
| yaml | 27.1 | 55.6 | 1.9 | 2 |
| matlab | 43.3 | 75.7 | 0.3 | 4 |
| json | 18.7 | 85.0 | 3.2 | 3 |
| r | 50.1 | 89.9 | 1.8 | 0 |
| dump | 49.9 | 94.5 | 1.8 | 0 |
| arff | 132.2 | 107.2 | 3.2 | 19 |
| fods | 77.3 | 119.7 | 42.2 | 21 |
| openxlsx (not rio) | 266.6 | 136.6 | 1.3 | 21 |
| ods | 258.2 | 253.7 | 0.8 | 21 |

All the values point in the same direction (lower is “better”), so I can also extract the first principal component to rank their “goodness”.

| format | score |
|---|---|
| feather | -1.4 |
| qs | -1.4 |
| parquet | -1.4 |
| csvy | -1.3 |
| fst | -1.3 |
| csv | -1.3 |
| rda | -1.2 |
| rds | -1.2 |
| dbf | -0.8 |
| yaml | -0.5 |
| vroom (not rio) | -0.4 |
| readr (not rio) | -0.4 |
| json | -0.2 |
| dta | -0.2 |
| sav | -0.2 |
| matlab | -0.1 |
| zsav | -0.1 |
| r | 0.0 |
| dump | 0.0 |
| xpt | 0.0 |
| xlsx | 1.1 |
| arff | 1.7 |
| fods | 2.8 |
| openxlsx (not rio) | 3.1 |
| ods | 4.2 |
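The idea behind the score column can be sketched with stats::prcomp(). The three rows below are taken from the benchmark table purely for illustration; the real ranking uses all formats, and the only subtleties are column scaling and the arbitrary sign of a principal component.

```r
# First-principal-component "goodness" score, sketched on three rows
# of the benchmark table (the real ranking uses all formats)
m <- data.frame(
  row.names = c("feather", "csv", "ods"),
  export   = c(0.9, 1.0, 258.2),
  import   = c(0.3, 1.0, 253.7),
  size     = c(0.5, 1.0, 0.8),
  accuracy = c(0, 2, 21)
)
pc <- prcomp(m, scale. = TRUE)
score <- pc$x[, 1]
# PC signs are arbitrary: flip so that higher (worse) raw values
# map to higher scores
if (cor(score, m$import) < 0) score <- -score
sort(score)
```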


This benchmark is slightly biased towards serialization formats (qs, feather, fst, rda, rds, dump) because I didn’t consider interoperability. Maybe feather has some interoperability with Python, but all the other serialization formats are R-only.

Other than these serialization formats, the “best” non-serialization format is parquet. Its high fidelity, high compression ratio, and serialization-like performance are impressive. If I were also to consider interoperability (and licensing), parquet would probably be the “best” format overall.

Unfortunately, despite all the optimizations, readODS performs the worst among all the formats studied. fods is a severely bloated and not very accurate format; ods is not bloated, but it is slower and similarly inaccurate. The results also point to a possible direction for optimization. In general, the compressed XML formats (xlsx, ods) are not performant, but openxlsx and readODS are worse than readxl and writexl in terms of I/O speed. Both openxlsx and readODS use zip (the R package) for de/compression, whereas readxl and writexl use faster solutions (writexl: minizip; readxl: the zip file connection provided by R).

In the text-format arena, csv is still a “good” format. However, csvy has slightly better accuracy due to the metadata recorded in its YAML header. vroom and its friend readr are slower than data.table::fread() / data.table::fwrite(), probably due to the type-conversion and tibble-creation steps.

However, given these results, I will not give any advice on which formats to use (except one; see below). In real-life applications, these qualities probably mean nothing. You probably select a format because your accountant (or your professor, your client, your friends, your coworkers, or your…) uses that format. Your accountant probably doesn’t use parquet.

The slightly chaotic code of this benchmark is available here.


Several formats were disqualified (DQ-ed).

| format | Export Time (Relative to csv) | Import Time (Relative to csv) | File Size (Relative to csv) | Accuracy |
|---|---|---|---|---|
| csv | 1.0 | 1.0 | 1 | 5 |
| xml | 12486.7 | 2147.0 | 6 | 6 |
| html | 8703.5 | 3244.1 | 5 | 5 |

If you must use an XML format, please use fods. If you must use an HTML format, consider using knitr::kable(format = "html"). I think we (the rio development team) also need to talk about the xml and html functionalities.
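For the HTML case, here is a minimal sketch of the knitr::kable() alternative (assuming the knitr package is installed; iris is just an example data frame):

```r
# Render a data frame as an HTML table string instead of using
# rio's html export
library(knitr)
html <- kable(head(iris, 3), format = "html")
cat(html)
```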


You might have read in the update to the previous post on rio that arrow was removed from Imports in v1.0.1. We did that because of feedback about compilation-time concerns from Will Landau (of targets fame) regarding v1.0.0. rio is used in several Shiny applications (e.g. datamods and esquisse), which are frequently dockerized. According to Will Landau, these applications “do not need arrow” [1], and having arrow in Imports increases the compilation time during Docker image building. We understand this problem, and therefore we accepted his suggestion to move arrow back to Suggests in v1.0.1, four days after the release of v1.0.0.

Still, if you are not in this Docker use case, you can get arrow support immediately after installation with rio::install_formats(). We still need to regroup and think about our file-format strategy going forward; this benchmark will guide several decisions. I also think another project of mine, readODS, can benefit from this benchmark.

  1. I really hope that one day someone comes to me and says “I (a genomic researcher, for example) do not need Excel support in rio.” One can dream about it, right? 
