chainsawriot

Non-API calls conundrum, CRAN, cpp11, minty, true length, vendoring, readODS

Posted on May 26, 2024 by Chung-hong Chan

Update: 2024-06-06 (see Postscript)

Non-API calls conundrum

The tale started with my experience submitting the R package minty to CRAN. But I think it’s better to first say something about the so-called “Non-API calls conundrum”.

R has a set of C-level APIs so that we can use them to manipulate R objects with C (or any language that can call C, such as C++ or Rust). Those APIs are mostly C functions or macros (a.k.a. “entry points”) provided as header files in the R source code. Some of these C functions are documented in the Writing R Extensions manual. But most of them are just in the header files, especially Rinternals.h.

Now, we know that not all entry points provided in these header files are APIs. Using these “non-API” entry points is classified as making “non-API calls”. The issue led to a low-grade backlash last month for two reasons.

First, many entry points have been flagged as “non-API” entry points in the R source code retroactively, and that left a bad taste in some developers’ mouths because some widely used C functions are now non-API entry points.

Second, most “non-API calls” have not been checked by R CMD check for a long time, not even by the current release version of R, 4.4.0. But this will change in the next version, i.e. the current R-devel version (R 4.5.x), due to the April update to R CMD check. This is an immediate problem for many because CRAN requires R packages to be checked with the R-devel version. Now, R packages with these so-called “non-API calls” will emit NOTEs upon checking. Unlike ERRORs or WARNINGs, NOTEs usually do not trigger CRAN to remove R packages. I must emphasize, however, that “usually do not” in the last sentence is not the same as “never”.

How this affected me: when a new or updated R package is submitted to CRAN, the same check is run. The problem is that a newly submitted R package emitting NOTEs upon checking will be classified as failing the incoming checks and will be “pretest archived”, i.e. not processed further automatically. To move the submission forward, one has to e-mail the CRAN team to explain why the failed incoming checks are “false positives”. And it takes time for the CRAN team to process this. If the CRAN team accepts your explanation, the submitted package then undergoes the so-called manual inspection. If the submitted package passes the manual inspection, all is good. If it does not pass and changes are requested, one has to go through the whole process once again, including the manual step of explaining the failed incoming checks.

I recently experienced this high-friction process when submitting the R package minty. The usually very simple CRAN submission process took 3 weeks. Here, I should also say something about why minty had to go through this. The fact is, the C/C++ code in minty itself does not use any of the so-called “non-API” entry points. The problem, however, is that minty links to the R package cpp11 in order to use C++. And the R package cpp11 uses the so-called “non-API” entry points SETLENGTH and SET_TRUELENGTH. Therefore, minty emits NOTEs upon checking. And minty exists solely for the sake of readODS, which also links to cpp11. So, I can imagine that if I want to submit readODS to CRAN, I will have to go through the same high-friction process.

So, to sum up: as long as cpp11 is not updated and still contains those “non-API” calls, any R package that links to cpp11 (i.e. with “LinkingTo: cpp11”) will emit NOTEs upon R CMD check, even if the C/C++ code in that package does not make any “non-API” calls itself. As of writing, there are more than 50 R packages in this category. You might argue, well, 50 R packages are not that many out of the >20000 CRAN packages. Sure. But among these affected packages are arrow, readr, readxl, roxygen2, igraph, haven, RPostgres, RSQLite, tidyr, and vroom. Chances are, most users have used a package linked to cpp11 at some point or are otherwise affected by this “non-API calls conundrum” ¹.

Before I move on, I would like to stress that I am not complaining about the CRAN team, nor about the R Core Team’s decision to block some widely used entry points with the updated check. Of course I understand that they are all volunteers with limited time and they have the freedom to do whatever they want. From my perspective, also as a volunteer, I just wanted to avoid this high-friction process as much as I could.

cpp11, length and true length

Searching through the source code of cpp11, I found exactly one place with these so-called “non-API calls”: the truncate method of vector. The source code looks like this:

inline SEXP truncate(SEXP x, R_xlen_t length, R_xlen_t capacity) {
#if R_VERSION >= R_Version(3, 4, 0)
  SETLENGTH(x, length);
  SET_TRUELENGTH(x, capacity);
  SET_GROWABLE_BIT(x);
#else
  x = safe[Rf_lengthgets](x, length);
#endif
  return x;
}

To investigate what this code does, I dug into the R Internals manual as well as r-devel listserv discussion and this blog post by Brodie Gaslam.

All R objects are SEXPs (see my explanation previously). And each SEXP has two integer elements called length and truelength. length is the information given back to you when you run length(x). truelength is something of a mystery. In R Internals, it is explained in a footnote like this:

The only current use is for hash tables of environments (VECSXPs), where length is the size of the table and truelength is the number of primary slots in use, for the reference hash tables in serialization (VECSXPs), and for ‘growable’ vectors (atomic vectors, VECSXPs and EXPRSXPs) which are created by slightly over-committing when enlarging a vector during subassignment, so that some number of the following enlargements during subassignment can be performed in place), where truelength is the number of slots in use.

It’s kind of Greek to me. From my understanding, truelength is for making the addition of new elements to the end of a vector (a.k.a. the push back operation) happen “in place”, i.e. without creating a copy, which is an expensive operation. Since R 3.4.0 (2017-04-21), the truelength of a vector enlarged during subassignment is set to 1.05 times the length to make this operation more efficient. This “over-allocation” technique is now also used in several R packages, notably data.table and the case in point, cpp11. cpp11, for example, sets the truelength (or capacity) to double the length. Another C++ interoperability option, Rcpp, does not do this over-allocation. I think this explains the performance difference between cpp11 and Rcpp in the push back operation ².
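To make the cost argument concrete, here is a toy model in plain C++. It is only a sketch under my own assumptions, not the implementation in R, cpp11 or Rcpp: the function name count_reallocations and the growth_factor parameter are hypothetical, and the model simply counts how often a buffer has to be reallocated (and therefore copied) while elements are appended one by one.

```cpp
#include <cstddef>

// Toy model, not R's or cpp11's implementation: counts how many times a
// buffer must be reallocated (and its contents copied) while appending
// n elements one at a time.
//
// growth_factor == 1.0 models a vector with no spare capacity (every append
// reallocates); 2.0 models doubling the capacity whenever it runs out.
std::size_t count_reallocations(std::size_t n, double growth_factor) {
  std::size_t length = 0;    // elements in use (what length(x) would report)
  std::size_t capacity = 0;  // allocated slots (the role truelength plays here)
  std::size_t reallocations = 0;

  for (std::size_t i = 0; i < n; ++i) {
    if (length == capacity) {
      // Out of room: a bigger buffer is needed and everything is copied over.
      std::size_t grown = static_cast<std::size_t>(capacity * growth_factor);
      capacity = (grown > length) ? grown : length + 1;
      ++reallocations;
    }
    ++length;  // the append itself happens in place
  }
  return reallocations;
}
```

In this model, appending 1000 elements costs 1000 reallocations without spare capacity (count_reallocations(1000, 1.0)) but only 11 with doubling (count_reallocations(1000, 2.0)), which is the kind of gap the over-allocation is meant to buy.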

Enter vendoring and the fix

Not all developers would react to the “non-API calls conundrum”. The developers of data.table, for example, would take no action unless CRAN forces them to. Some would take a wait-and-see approach to see whether the R Core Team would reverse their decision. The case in point, cpp11, has taken no action so far, which is fine.

There are also developers who have taken some actions against this “non-API calls conundrum”. The developer of brio (not using cpp11), Gábor Csárdi, plugged the call. Another case is duckdb by Kirill Müller. This case is very relevant to this discussion because duckdb also uses cpp11. But in the DESCRIPTION file, it does not say that it is linking to cpp11. Why?

It is because duckdb takes another approach: vendoring. I think this term is a bit fishy. But the idea is simply to copy the source code of cpp11 into duckdb. Because cpp11 is header-only, it supports vendoring easily.

Kirill Müller plugged the non-API calls in the vendored cpp11 code inside duckdb. I think this approach is super interesting and should be the way for readODS to take in order to prevent the high-friction CRAN submission process.

There are of course disadvantages of vendoring cpp11. But to me, those disadvantages are not important. As a matter of fact, this “vendoring” business has been the default in the C / C++ world. And in retrospect, readODS already contains vendored code, e.g. rapidXML.

Another issue is that vendoring cpp11 is not the mainstream method. As far as I can tell from searching, probably only 4 CRAN packages take this approach (duckdb, tidyfast, QuickJSR; cpp11armadillo sort of; arrow used to). I wanted to make readODS the fifth.

The steps for vendoring cpp11 are quite straightforward:

1. Run cpp11::cpp_vendor(). All cpp11 header files will be copied to inst/include.
2. Modify or create a Makevars file in src to tell the compiler that you want to use the vendored cpp11 code: PKG_CPPFLAGS = -I../inst/include.
3. Remove LinkingTo: cpp11 from DESCRIPTION.

That’s it.

In the vendored header file of cpp11 in inst/include, I modify the above truncate method to (referencing Kirill Müller’s approach for duckdb):

inline SEXP truncate(SEXP x, R_xlen_t length, R_xlen_t capacity) {
// Avoid SETLENGTH() and SET_TRUELENGTH() which trigger a NOTE on R-devel 4.5
#if R_VERSION >= R_Version(3, 4, 0) && R_VERSION < R_Version(4, 5, 0)
  SETLENGTH(x, length);
  SET_TRUELENGTH(x, capacity);
  SET_GROWABLE_BIT(x);
#else
  x = safe[Rf_lengthgets](x, length);
#endif
  return x;
}

Basically, it modifies the preprocessor directive. And I think it is quite straightforward to understand.
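To see what the changed condition does, here is a small sketch that compiles without the R headers. The version encoding is the same as that of the R_Version macro in R’s Rversion.h, but I redefine it under a made-up name (SKETCH_R_VERSION), and the helper uses_growable_branch is likewise hypothetical; it only mirrors the #if condition above as an ordinary function.

```cpp
// Same encoding as the R_Version macro in R's Rversion.h, redefined here
// under a hypothetical name so that the sketch compiles without R headers.
#define SKETCH_R_VERSION(v, p, s) (((v) * 65536) + ((p) * 256) + (s))

// Hypothetical helper mirroring the #if condition above: true when the
// SETLENGTH()/SET_TRUELENGTH() branch would be compiled in for that version.
constexpr bool uses_growable_branch(int r_version) {
  return r_version >= SKETCH_R_VERSION(3, 4, 0) &&
         r_version < SKETCH_R_VERSION(4, 5, 0);
}

// R 4.4.0 (current release) keeps the fast in-place path.
static_assert(uses_growable_branch(SKETCH_R_VERSION(4, 4, 0)),
              "R 4.4.0 should use the growable branch");
// R-devel 4.5 falls back to the Rf_lengthgets() branch.
static_assert(!uses_growable_branch(SKETCH_R_VERSION(4, 5, 0)),
              "R 4.5.0 should use the fallback branch");
```

So on released versions of R nothing changes, and only from 4.5.0 onwards does the code take the branch that was previously reserved for R older than 3.4.0.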

readODS with vendored cpp11 and CRAN submission

As SET_TRUELENGTH is mostly an efficiency mechanism, I should study how the above fix would impact the performance of readODS. Actually, by reading the C++ code in readODS, I know for a fact that this fix will not impact the performance. It is because readODS does not push back to a cpp11 vector ³. As readODS has a suite of benchmarks, I checked the performance of readODS before and after vendoring cpp11. And as expected, no performance impact.
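The pattern that makes readODS insensitive to the fix can be sketched in plain C++ as follows. This is not the readODS code: FixedOutput is a hypothetical stand-in for an output vector whose length is fixed at creation, and splitting a comma-separated string stands in for the XML parsing.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a fixed-length output vector (not cpp11's type):
// its size is chosen once at construction and never grows afterwards.
struct FixedOutput {
  explicit FixedOutput(std::size_t n) : cells(n) {}
  std::vector<std::string> cells;
};

// Two-phase pattern: (1) push back into a std::vector while parsing, when the
// final count is still unknown; (2) allocate the output once, with the now
// known length, and fill it.
FixedOutput collect_then_allocate(const std::string& input) {
  std::vector<std::string> parsed;
  std::istringstream stream(input);
  std::string token;
  while (std::getline(stream, token, ',')) {
    parsed.push_back(token);  // growth is handled by std::vector here
  }

  FixedOutput out(parsed.size());  // single allocation with the exact length
  for (std::size_t i = 0; i < parsed.size(); ++i) {
    out.cells[i] = parsed[i];
  }
  return out;
}
```

Because the output is never grown after it is created, it does not matter how cheap or expensive growing it would have been.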

I then checked readODS with the vendored and modified cpp11 on GitHub Actions as well as with devtools::check_win_devel(). It does not emit any NOTE with the R-devel version. So, I think it is a win.

The true smoke test, however, was to submit readODS to CRAN. And the process was “alles in Butter”: 18 minutes on a Sunday. Compare this to three weeks for minty.

readODS 2.3.0

Previously on this blog: 1.7.0 / 1.8.0 / 2.0.0 / 2.1.0 ⁴

Finally, I can say something about the new version of readODS. It does not have user-visible changes. As for the invisible changes, apart from the vendored cpp11 described above, readODS no longer requires readr and uses minty instead. This significantly reduces the number of dependencies of readODS. The installation time has been cut to 1/3 of that of readODS 2.2.0.

Postscript

I just wanted to give an update to this blog about what happened after the CRAN submission.

After the CRAN submission of 2.3.0, I happened to look at the CRAN check results. Now, there is a new thing: rchk issues.

In case you don’t know, on top of the ordinary CRAN checks, there are also additional checks. Previously, I have mentioned the Valgrind check for memory leaks. rchk is for garbage collection bugs. Namely, R objects created in C code but not put on the protection stack can be reclaimed by the R garbage collector while they are still in use.

The rchk check indicated that the edited vendored cpp11 code has an unprotected variable. After some investigation, I confirmed it and fixed it…

Checking rchk issues is not easy. The way suggested by the original developer of rchk does not work because the Docker image has not been updated for a long time. I used the Docker image provided by rhub.

¹ If you are talking about packages being affected by this “non-API calls conundrum” but not via cpp11, there are also dplyr, tibble, rlang, vctrs, and data.table.
² I will not comment publicly on whether SET_TRUELENGTH should be banned or not.
³ But readODS does push back to an std::vector during the XML parsing. And then the information from the std::vector will be used to create a cpp11 vector. Therefore, the length of the output cpp11 vector is known.
⁴ I also wanted to add that there was a release readODS 2.2.0 in February this year. But I did not blog that release because I had no time due to this.