There have been several of tuning steps after my previous post on readODS
2.0.0. Now, pure R write_ods
is twice as fast, just by tuning R code. But how to do the tuning? Let’s take an example: escaping XML. I actually talked about this previously: readODS
1.8 ditched the R package xml2
for writing XML. And basically, we rolled out our own XML writing engine. And part of this engine is to escape the text so that it won’t break the XML structure 1. And they are amp
, lt
, gt
, apos
, and quot
.
This was how the XML escaper looks like in Version 1.8.0.
.escape_xml <- function(x) {
x_no_amp <- stringi::stri_replace_all_fixed(str = x, pattern = c("&"), replacement = c("&"), vectorize_all = FALSE)
stringi::stri_replace_all_fixed(str = x_no_amp, pattern = c("\"", "<", ">", "'"), replacement = c(""", "<", ">", "'"), vectorize_all = FALSE)
}
.escape_xml("You & Me, I love \"you\" > than you do.")
## "You & Me, I love "you" > than you do."
At 1.9.0, it was modified to like this to better handling of UTF-8:
.escape_xml <- function(x) {
x_utf8 <- stringi::stri_enc_toutf8(x)
x_no_amp <- stringi::stri_replace_all_fixed(str = x_utf8, pattern = c("&"), replacement = c("&"), vectorize_all = FALSE)
stringi::stri_replace_all_fixed(str = x_no_amp, pattern = c("\"", "<", ">", "'"), replacement = c(""", "<", ">", "'"), vectorize_all = FALSE)
}
Both implementations are similar: There is a need to first escape the ampersand because all replacements also contain an ampersand. The escaper was indeed working.
But now, the requirement is not just about working. For the future version, the escaper needs to be extremely fast too. Why it needs to be fast? Well, every piece of data in a data frame must be passed through this escaper. One microsecond delay in each operation would generate a huge performance impact.
What I’ve learned so far about performance tuning is: it must be data driven. Let’s create some data:
.create <- function(candidate = c(LETTERS, letters, c("\"", "<",">", "&", "'", " ")), min = 1, max = 400) {
paste(sample(x = candidate, size = sample(seq(min, max), 1), replace = TRUE), collapse = "")
}
set.seed(123)
data <- replicate(300000, .create())
bench::mark(output <- .escape_xml(data), min_time = 5)
So, I let it run for several times in 5s. And each run takes 785ms for one iteration. It doesn’t sound too bad and stringi::stri_replace_all_fixed()
is fast. In comparison, this is the slightly modified implementation of htmltools::htmlEscape()
(version 0.5.6). I have only modified .htmlSpecialsAttrib
to make it escaping in the standard XML way.
htmlEscape <- local({
.htmlSpecials <- list(
`&` = '&',
`<` = '<',
`>` = '>'
)
.htmlSpecialsPattern <- paste(names(.htmlSpecials), collapse='|')
.htmlSpecialsAttrib <- c(
.htmlSpecials,
##`'` = ''',
`"` = '"',
`'` = '''
##`\r` = ' ',
##`\n` = ' '
)
.htmlSpecialsPatternAttrib <- paste(names(.htmlSpecialsAttrib), collapse='|')
function(text, attribute=FALSE) {
pattern <- if(attribute)
.htmlSpecialsPatternAttrib
else
.htmlSpecialsPattern
text <- enc2utf8(as.character(text))
# Short circuit in the common case that there's nothing to escape
if (!any(grepl(pattern, text, useBytes = TRUE)))
return(text)
specials <- if(attribute)
.htmlSpecialsAttrib
else
.htmlSpecials
for (chr in names(specials)) {
text <- gsub(chr, specials[[chr]], text, fixed = TRUE, useBytes = TRUE)
}
Encoding(text) <- "UTF-8"
return(text)
}
})
.html_escape <- function(x) {
htmlEscape(x, attribute = TRUE)
}
The benchmark
bench::mark(output <- .escape_xml(data),
output2 <- .html_escape(data),
min_time = 5)
all.equal(output, output2) ## TRUE
htmlEscape
took 2.5s in one iteration. So, .escape_xml
is at least 2x the performance of htmltools::htmlEscape
(also 0.25x htmltools::htmlEscape
’s memory footprint).
Tim Taylor told me on GitHub that the escaper can be simplified to:
.escape_xml2 <- function(x) {
stringi::stri_replace_all_fixed(str = x, pattern = c("&", "\"", "<", ">", "'"), replacement = c("&", """, "<", ">", "'"), vectorize_all = FALSE)
}
bench::mark(output <- .escape_xml(data),
output2 <- .html_escape(data),
output3 <- .escape_xml2(data),
min_time = 5)
all.equal(output, output3) ## TRUE
By removing the first step, .escape_xml2
takes 686ms; or 40% improvement. The memory footprint is also halved.
But how about going further? Well, next time. The main message of this blog post, however, is: code optimization must be data driven.
imagine the data is I love you > than you do
and you put this data into the XML template <text = "%data">
; the outcome will be <text = "I love you > than you do">
. <text = "i love you>
is already an XML tag. And the data is incorrectly encoded. ↩