Werden wir Helden für einen Tag


Benchmark of R file formats using rio and friends

Posted on Sep 21, 2023 by Chung-hong Chan

Previous on this blog: 10 years of rio, rio 1.0.0

Jump to the backstory

rio 1.0.1 supports many formats. However, not all formats are created equal; there are “better” formats. This is not because rio supports some formats better. Instead, it is because the underlying R packages are more optimized, or because the file format itself is well designed. I will keep “better” / “good” in quotation marks throughout.

In the realm of rio, we can talk about how “good” a file format is by considering the following qualities:

  1. Speed of writing - Time needed for export()
  2. Speed of reading - Time needed for import()
  3. File size - Well, self-explanatory
  4. Accuracy - How accurate is the imported object? How far is it from the original object?

There are other qualities: How popular is the format? How many other software applications support it (interoperability)? How accepted is it (particularly important for governments)? How secure is it? These qualities, however, are difficult to benchmark. Therefore, I will focus on the four above.


How to measure the four qualities? I measured how much time it takes to export and import Lahman::Batting (112,164 × 22); the size of the written file; and how close the imported object is to the original Lahman::Batting. In code, it looks like this:

filename <- tempfile(fileext = paste0(".", format))
# Writing
bench::mark(rio::export(data_sample, filename))$median
# Reading
bench::mark(x <- rio::import(filename))$median
# File size
file.size(filename)
# Accuracy
length(waldo::compare(data_sample, x))

Please take the accuracy measurement with a large grain of salt, but it is probably the simplest way to “quantify” accuracy. Note that it gives an advantage to serialization formats (such as qs and rds).
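To see the kind of difference this accuracy measure picks up, here is a minimal base-R sketch (using write.csv() / read.csv() rather than rio, just to keep it self-contained): a Date column does not survive a csv round trip, and waldo::compare() would count a change like this as a difference.

```r
# A Date column is written to csv as text and read back as character:
# exactly the kind of round-trip loss the accuracy column counts
d <- data.frame(x = 1:3,
                when = as.Date(c("2023-09-21", "2023-09-22", "2023-09-23")))
f <- tempfile(fileext = ".csv")
write.csv(d, f, row.names = FALSE)
d2 <- read.csv(f)
class(d$when)   # "Date"
class(d2$when)  # "character"
```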

I will benchmark the following formats: csv, csvy, xlsx, r, xpt, sav, zsav, rds, rda, dta, dump, dbf, arff, parquet, fst, feather, json, matlab, qs, ods, fods, yaml.

As you might know, the csv functionality of rio is actually provided by data.table, and the xlsx functionality by readxl and writexl. I also threw in several non-rio packages: vroom (for reading and writing csv), readr (same; modern readr is actually built on top of vroom), and openxlsx (for xlsx; rio previously used it to export xlsx).

If you want to know why some formats were not tested, jump here.


I normalized the export time, import time, and file size by dividing each by the respective value for csv, so the values for csv are always 1. I then sorted the data by import time.
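The normalization step can be sketched like this (the numbers below are made up for illustration; they are not the benchmark results):

```r
# Hypothetical raw results: median times in seconds, size in bytes
res <- data.frame(
  format = c("csv", "feather", "parquet"),
  export = c(0.50, 0.45, 1.80),
  import = c(0.40, 0.12, 0.16),
  size   = c(8.0e6, 4.0e6, 2.4e6)
)
cols <- c("export", "import", "size")
baseline <- as.numeric(res[res$format == "csv", cols])
# Divide every column by the csv value, so csv becomes 1 everywhere
res[cols] <- sweep(as.matrix(res[cols]), 2, baseline, "/")
# Sort by import time
res <- res[order(res$import), ]
res
```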

| format | Export Time (Relative to csv) | Import Time (Relative to csv) | File Size (Relative to csv) | Accuracy |
|---|---|---|---|---|
| feather | 0.9 | 0.3 | 0.5 | 0 |
| parquet | 3.6 | 0.4 | 0.3 | 0 |
| qs | 1.7 | 0.7 | 0.2 | 0 |
| csv | 1.0 | 1.0 | 1.0 | 2 |
| csvy | 1.0 | 1.2 | 1.0 | 1 |
| vroom (not rio) | 5.7 | 2.0 | 1.0 | 21 |
| rda | 27.4 | 2.1 | 0.3 | 0 |
| rds | 27.5 | 2.1 | 0.3 | 0 |
| readr (not rio) | 5.7 | 5.3 | 1.0 | 21 |
| fst | 7.4 | 6.0 | 0.5 | 0 |
| dbf | 37.6 | 6.9 | 2.0 | 4 |
| dta | 10.3 | 14.6 | 1.6 | 22 |
| zsav | 20.9 | 16.5 | 0.3 | 22 |
| sav | 10.5 | 18.4 | 0.8 | 22 |
| xpt | 15.6 | 31.0 | 2.9 | 21 |
| xlsx | 141.4 | 36.3 | 1.3 | 21 |
| yaml | 27.1 | 55.6 | 1.9 | 2 |
| matlab | 43.3 | 75.7 | 0.3 | 4 |
| json | 18.7 | 85.0 | 3.2 | 3 |
| r | 50.1 | 89.9 | 1.8 | 0 |
| dump | 49.9 | 94.5 | 1.8 | 0 |
| arff | 132.2 | 107.2 | 3.2 | 19 |
| fods | 77.3 | 119.7 | 42.2 | 21 |
| openxlsx (not rio) | 266.6 | 136.6 | 1.3 | 21 |
| ods | 258.2 | 253.7 | 0.8 | 21 |

All the values point in the same direction (lower is “better”), so I can also extract the first principal component to rank their “goodness”.

| format | score |
|---|---|
| feather | -1.4 |
| qs | -1.4 |
| parquet | -1.4 |
| csvy | -1.3 |
| fst | -1.3 |
| csv | -1.3 |
| rda | -1.2 |
| rds | -1.2 |
| dbf | -0.8 |
| yaml | -0.5 |
| vroom (not rio) | -0.4 |
| readr (not rio) | -0.4 |
| json | -0.2 |
| dta | -0.2 |
| sav | -0.2 |
| matlab | -0.1 |
| zsav | -0.1 |
| r | 0.0 |
| dump | 0.0 |
| xpt | 0.0 |
| xlsx | 1.1 |
| arff | 1.7 |
| fods | 2.8 |
| openxlsx (not rio) | 3.1 |
| ods | 4.2 |
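The idea behind the score column can be sketched with stats::prcomp(). The three rows below are taken from the benchmark table purely for illustration; the real ranking uses all formats, and the only subtleties are column scaling and the arbitrary sign of a principal component.

```r
# First-principal-component "goodness" score, sketched on three rows
# of the benchmark table (the real ranking uses all formats)
m <- data.frame(
  row.names = c("feather", "csv", "ods"),
  export   = c(0.9, 1.0, 258.2),
  import   = c(0.3, 1.0, 253.7),
  size     = c(0.5, 1.0, 0.8),
  accuracy = c(0, 2, 21)
)
pc <- prcomp(m, scale. = TRUE)
score <- pc$x[, 1]
# PC signs are arbitrary: flip so that higher (worse) raw values
# map to higher scores
if (cor(score, m$import) < 0) score <- -score
sort(score)
```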


This benchmark is slightly biased towards serialization formats (qs, feather, fst, rda, rds, dump) because I didn’t consider interoperability. Maybe feather has some interoperability with Python, but all the other serialization formats are R-only.

Other than these serialization formats, the “best” non-serialization format is parquet. Its high fidelity, high compression ratio, and serialization-like performance are impressive. If I were also to consider interoperability (and licensing), parquet would probably be the “best” format overall.

Unfortunately, despite all the optimizations, readODS performs the worst among all the formats studied. fods is a severely bloated and not very accurate format; ods is not bloated, but it is slower and similarly inaccurate. The results also point to a possible direction for optimization. In general, the compressed XML formats (xlsx, ods) are not performant, but openxlsx and readODS are worse than readxl and writexl in terms of I/O speed. Both openxlsx and readODS use zip (the R package) for de/compression, whereas readxl and writexl use faster solutions (writexl: minizip; readxl: the zip file connection provided by R).

In the text-format arena, csv is still a “good” format. However, csvy has slightly better accuracy due to the metadata recorded in its YAML header. vroom and its friend readr are slower than data.table::fread() / data.table::fwrite(), probably due to the type-conversion and tibble-creation steps.

However, given these results, I will not give any advice on which formats to use (except one; see below). In real-life applications, these qualities probably mean nothing. You probably select a format because your accountant (or your professor, your client, your friends, your coworkers, or your…) uses that format. Your accountant probably doesn’t use parquet.

The slightly chaotic code of this benchmark is available here.


Several formats were disqualified (DQ-ed).

| format | Export Time (Relative to csv) | Import Time (Relative to csv) | File Size (Relative to csv) | Accuracy |
|---|---|---|---|---|
| csv | 1.0 | 1.0 | 1 | 5 |
| xml | 12486.7 | 2147.0 | 6 | 6 |
| html | 8703.5 | 3244.1 | 5 | 5 |

If you must use an XML format, please use fods. If you must use an HTML format, consider using knitr::kable(format = "html"). I think we (the rio development team) also need to talk about the xml and html functionalities.
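For the HTML case, here is a minimal sketch of the knitr::kable() alternative (assuming the knitr package is installed; iris is just an example data frame):

```r
# Render a data frame as an HTML table string instead of using
# rio's html export
library(knitr)
html <- kable(head(iris, 3), format = "html")
cat(html)
```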


You might have read in the update to the previous post on rio that arrow was removed from Imports in v1.0.1. We did that because of feedback about compilation-time concerns from Will Landau (of targets fame) regarding v1.0.0. rio is used in several Shiny applications (e.g. datamods and esquisse), which are frequently dockerized. According to Will Landau, these applications “do not need arrow” [1], and having arrow in Imports increases the compilation time during Docker image building. We understand this problem, and therefore we accepted his suggestion to move arrow back to Suggests in v1.0.1, four days after the release of v1.0.0.

Still, if you are not in this Docker use case, you can get arrow support immediately after installation with rio::install_formats(). We still need to regroup and think about our file-format strategy going forward; this benchmark will guide several decisions. I also think another project of mine, readODS, can benefit from this benchmark.

  1. I really hope that one day someone comes to me and says “I (a genomic researcher, for example) do not need Excel support in rio.” One can dream about it, right? 
