chainsawriot

R performance tuning #1

Posted on Aug 12, 2023 by Chung-hong Chan

There have been several of tuning steps after my previous post on readODS 2.0.0. Now, pure R write_ods is twice as fast, just by tuning R code. But how to do the tuning? Let’s take an example: escaping XML. I actually talked about this previously: readODS 1.8 ditched the R package xml2 for writing XML. And basically, we rolled out our own XML writing engine. And part of this engine is to escape the text so that it won’t break the XML structure ¹. And they are amp, lt, gt, apos, and quot.

This was how the XML escaper looks like in Version 1.8.0.

.escape_xml <- function(x) {
    x_no_amp <- stringi::stri_replace_all_fixed(str = x, pattern = c("&"), replacement = c("&amp;"), vectorize_all = FALSE)
    stringi::stri_replace_all_fixed(str = x_no_amp, pattern = c("\"", "<", ">", "'"), replacement = c("&quot;", "&lt;", "&gt;", "&apos;"), vectorize_all = FALSE)
}
.escape_xml("You & Me, I love \"you\" > than you do.")
## "You &amp; Me, I love &quot;you&quot; &gt; than you do."

At 1.9.0, it was modified to like this to better handling of UTF-8:

.escape_xml <- function(x) {
    x_utf8 <- stringi::stri_enc_toutf8(x)
    x_no_amp <- stringi::stri_replace_all_fixed(str = x_utf8, pattern = c("&"), replacement = c("&amp;"), vectorize_all = FALSE)
    stringi::stri_replace_all_fixed(str = x_no_amp, pattern = c("\"", "<", ">", "'"), replacement = c("&quot;", "&lt;", "&gt;", "&apos;"), vectorize_all = FALSE)
}

Both implementations are similar: There is a need to first escape the ampersand because all replacements also contain an ampersand. The escaper was indeed working.

But now, the requirement is not just about working. For the future version, the escaper needs to be extremely fast too. Why it needs to be fast? Well, every piece of data in a data frame must be passed through this escaper. One microsecond delay in each operation would generate a huge performance impact.

Data driven

What I’ve learned so far about performance tuning is: it must be data driven. Let’s create some data:

.create <- function(candidate = c(LETTERS, letters, c("\"", "<",">", "&", "'", " ")), min = 1, max = 400) {
    paste(sample(x = candidate, size = sample(seq(min, max), 1), replace = TRUE), collapse = "")
}
set.seed(123)
data <- replicate(300000, .create())
bench::mark(output <- .escape_xml(data), min_time = 5)

So, I let it run for several times in 5s. And each run takes 785ms for one iteration. It doesn’t sound too bad and stringi::stri_replace_all_fixed() is fast. In comparison, this is the slightly modified implementation of htmltools::htmlEscape() (version 0.5.6). I have only modified .htmlSpecialsAttrib to make it escaping in the standard XML way.

htmlEscape <- local({

  .htmlSpecials <- list(
    `&` = '&amp;',
    `<` = '&lt;',
    `>` = '&gt;'
  )
  .htmlSpecialsPattern <- paste(names(.htmlSpecials), collapse='|')
  .htmlSpecialsAttrib <- c(
    .htmlSpecials,
    ##`'` = '&#39;',
    `"` = '&quot;',
    `'` = '&apos;'
    ##`\r` = '&#13;',
    ##`\n` = '&#10;'
  )
  .htmlSpecialsPatternAttrib <- paste(names(.htmlSpecialsAttrib), collapse='|')

  function(text, attribute=FALSE) {
    pattern <- if(attribute)
      .htmlSpecialsPatternAttrib
    else
      .htmlSpecialsPattern

    text <- enc2utf8(as.character(text))
    # Short circuit in the common case that there's nothing to escape
    if (!any(grepl(pattern, text, useBytes = TRUE)))
      return(text)

    specials <- if(attribute)
      .htmlSpecialsAttrib
    else
      .htmlSpecials

    for (chr in names(specials)) {
      text <- gsub(chr, specials[[chr]], text, fixed = TRUE, useBytes = TRUE)
    }
    Encoding(text) <- "UTF-8"

    return(text)
  }
})

.html_escape <- function(x) {
    htmlEscape(x, attribute = TRUE)
}

The benchmark

bench::mark(output <- .escape_xml(data),
            output2 <- .html_escape(data),
            min_time = 5)
all.equal(output, output2) ## TRUE

htmlEscape took 2.5s in one iteration. So, .escape_xml is at least 2x the performance of htmltools::htmlEscape (also 0.25x htmltools::htmlEscape’s memory footprint).

Tuning

Tim Taylor told me on GitHub that the escaper can be simplified to:

.escape_xml2 <- function(x) {
    stringi::stri_replace_all_fixed(str = x, pattern = c("&", "\"", "<", ">", "'"), replacement = c("&amp;", "&quot;", "&lt;", "&gt;", "&apos;"), vectorize_all = FALSE)
}

bench::mark(output <- .escape_xml(data),
            output2 <- .html_escape(data),
            output3 <- .escape_xml2(data),
            min_time = 5)
all.equal(output, output3) ## TRUE

By removing the first step, .escape_xml2 takes 686ms; or 40% improvement. The memory footprint is also halved.

But how about going further? Well, next time. The main message of this blog post, however, is: code optimization must be data driven.

imagine the data is I love you > than you do and you put this data into the XML template <text = "%data">; the outcome will be <text = "I love you > than you do">. <text = "i love you> is already an XML tag. And the data is incorrectly encoded. ↩