Update: 2024-06-06
The tale started with my experience submitting the R package minty to CRAN. But I think it’s better to first say something about the so-called “Non-API calls conundrum”.
R has a set of C-level APIs so that we can use them to manipulate R objects with C (or any language that can call C, such as C++ or Rust). Those APIs are mostly C functions or macros (a.k.a. “entry points”) provided as header files in the R source code. Some of these C functions are documented in the Writing R Extensions manual. But most of them are just in the header files, especially Rinternals.h.
Now, we know that not all entry points provided in these header files are APIs. Using these “non-API” entry points is classified as making “non-API calls”. The issue led to a low-grade backlash last month for two reasons.
First, many entry points have been flagged retrospectively as “non-API” entry points in the R source code, and that left a bad taste in some developers’ mouths because some widely used C functions are now non-API entry points.
Second, most “non-API calls” have not been checked by R CMD check for a long time, not even in the current release version of R, 4.4.0. But this will change in the next version, i.e. the current R-devel version (R 4.5.x), due to the April update to R CMD check. This becomes an immediate problem for many because CRAN requires R packages to be checked with the R-devel version. Now, R packages with these so-called “non-API calls” will emit NOTEs upon checking. Unlike ERRORs or WARNINGs, NOTEs usually do not trigger CRAN to remove R packages. I must emphasize, however, that “usually do not” in the last sentence is not the same as “never”.
This affected me because the same check is also run when a new or updated R package is submitted to CRAN. The problem here is that a newly submitted R package emitting NOTEs upon checking will be classified as failing the incoming checks and will be “pretest archived”, i.e. not processed further automatically. In order to move the submission forward, one has to e-mail the CRAN team manually to explain why the failed incoming check is a “false positive”. And it takes time for the CRAN team to process this. If the CRAN team accepts your explanation, the submitted package undergoes the so-called manual inspection. If the submitted package passes the manual inspection, it is all good. If the package does not pass the manual inspection and editing is requested, one has to go through the whole process once again, including the manual process of explaining the failing incoming checks.
I recently experienced this high-friction process when submitting the R package minty. The usually very simple CRAN submission process took 3 weeks. Here, I should also say something about why minty had to go through this. The fact is, the C/C++ code in minty itself does not use any of the so-called “non-API” entry points. The problem, however, is that minty statically links to the R package cpp11 in order to use C++. And the R package cpp11 uses the so-called “non-API” entry points SETLENGTH and SET_TRUELENGTH. Therefore, minty emits NOTEs upon checking. And the very existence of minty is for readODS. And readODS also links to cpp11. So, I can imagine that if I want to submit readODS to CRAN, I will have to experience the same high-friction process.
So, to sum up: as long as cpp11 is not updated and still contains those “non-API” calls, any R package that links to cpp11 (i.e. with “LinkingTo: cpp11”) will emit NOTEs upon R CMD check, even if the C/C++ code in that package does not make any “non-API” calls. As of writing, there are more than 50 R packages in this category. You might argue, well, 50 R packages are not that many out of the >20000 CRAN packages. Sure. But among these affected packages there are arrow, readr, readxl, roxygen2, igraph, haven, RPostgres, RSQLite, tidyr, and vroom. Chances are, most users have used a package linked to cpp11 or are otherwise affected by this “non-API calls conundrum” 1.
Before I move on, I would like to stress that I am not complaining about the CRAN team, or about the decision of the R Core Team to block some widely used entry points with the updated check. Of course I understand that they are all volunteers with limited time and they have the freedom to do whatever they want. From my perspective, also as a volunteer, I just wanted to avoid this high-friction process as much as I could.
Searching through the source code of cpp11, I found exactly one instance of this so-called “non-API call”: the truncate method of vector. The source code looks like this:
inline SEXP truncate(SEXP x, R_xlen_t length, R_xlen_t capacity) {
#if R_VERSION >= R_Version(3, 4, 0)
  SETLENGTH(x, length);
  SET_TRUELENGTH(x, capacity);
  SET_GROWABLE_BIT(x);
#else
  x = safe[Rf_lengthgets](x, length);
#endif
  return x;
}
To investigate what this code does, I dug into the R Internals manual as well as an r-devel listserv discussion and this blog post by Brodie Gaslam.
All R objects are SEXPs (see my earlier explanation). And each SEXP has two integer elements called length and truelength. length is the information given back to you when you run length(x). truelength is something of a mystery. In R Internals, it is explained in a footnote like this:
The only current use is for hash tables of environments (VECSXPs), where length is the size of the table and truelength is the number of primary slots in use, for the reference hash tables in serialization (VECSXPs), and for ‘growable’ vectors (atomic vectors, VECSXPs and EXPRSXPs) which are created by slightly over-committing when enlarging a vector during subassignment, so that some number of the following enlargements during subassignment can be performed in place), where truelength is the number of slots in use.
It’s all Greek to me. From my understanding, truelength is for making the addition of new elements to the end of a vector (a.k.a. the push back operation) happen “in place”, i.e. without creating a copy, which is an expensive operation. Since R 3.4.0 (2017-04-21), the truelength of a vector enlarged this way is set to 1.05 times the length to make this operation more efficient. This “over-allocation” technique is now also used in several R packages, notably data.table and the case in point, cpp11. cpp11, for example, sets the truelength (or capacity) to double the length. Another C++ interoperability option, Rcpp, does not do this over-allocation. I think this explains the performance difference between cpp11 and Rcpp in the push back operation 2.
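To make the effect of over-allocation concrete, here is a toy model in plain C++. It is only a sketch: count_reallocations is a hypothetical helper of mine, it does not touch any R or cpp11 internals, and the real growth rules in R and cpp11 may differ in detail. It simply counts how often a buffer would have to be reallocated (and copied) while elements are pushed back one at a time, for a given growth factor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of R or cpp11: counts how many times a buffer
// must be reallocated while n elements are appended one by one.
// `growth` is the over-allocation factor applied whenever capacity runs out:
//   1.0  -> no over-allocation (capacity always equals length)
//   1.05 -> the factor mentioned in the text for R >= 3.4.0
//   2.0  -> doubling, as mentioned in the text for cpp11
inline std::size_t count_reallocations(std::size_t n, double growth) {
  assert(growth >= 1.0);
  std::size_t length = 0;         // slots in use
  std::size_t capacity = 0;       // slots allocated
  std::size_t reallocations = 0;  // number of "allocate and copy" events
  for (std::size_t i = 0; i < n; ++i) {
    if (length == capacity) {
      // Out of room: a new, larger buffer is needed.
      std::size_t grown =
          static_cast<std::size_t>(static_cast<double>(length + 1) * growth);
      // Guarantee room for at least one more element.
      capacity = grown > length ? grown : length + 1;
      ++reallocations;
    }
    ++length;  // the append itself happens in place
  }
  return reallocations;
}
```

In this toy model, pushing back 1000 elements costs 1000 reallocations without over-allocation (growth 1.0) but only 9 with doubling (growth 2.0). A factor of 1.05 lands somewhere in between.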
Not all developers react to the “non-API calls conundrum”. The developers of data.table, for example, will take no action unless CRAN forces them to. Some take a wait-and-see approach to see whether the R Core Team will reverse its decision. The case in point, cpp11, has taken no action so far, which is fine.
There are also developers who have taken some action on this “non-API calls conundrum”. The developer of brio (which does not use cpp11), Gábor Csárdi, plugged the call. Another case is duckdb by Kirill Müller. This case is very relevant to this discussion because duckdb also uses cpp11. But its DESCRIPTION file does not say that it is linking to cpp11. Why?
It is because duckdb takes another approach: vendoring. I think this term is a bit fishy. But the idea is simply to copy the source code of cpp11 into duckdb. Because cpp11 is header-only, it can be vendored easily.
Kirill Müller plugged the non-API calls in the vendored cpp11 code inside duckdb. I think this approach is super interesting and is the way for readODS to go in order to avoid the high-friction CRAN submission process.
There are of course disadvantages to vendoring cpp11. But to me, those disadvantages are not important. As a matter of fact, this “vendoring” business has been the default in the C/C++ world. And in retrospect, readODS already contains vendored code, e.g. rapidXML.
Another issue is that vendoring cpp11 is not the mainstream method. As far as I can tell from searching, probably only 4 CRAN packages take this approach (duckdb, tidyfast, quickJSR; cpp11armadillo sort of; arrow used to). I wanted to make readODS the fifth.
The steps for vendoring cpp11 are quite straightforward:
1. Run cpp11::cpp_vendor(). All cpp11 header files will be copied to inst/include.
2. Add a Makevars file in src to tell the compiler that you want to use the vendored cpp11 code: PKG_CPPFLAGS = -I../inst/include
3. Remove LinkingTo: cpp11 from DESCRIPTION.
That’s it.
In the vendored header file of cpp11 in inst/include, I modified the above truncate method to the following (referencing Kirill Müller’s approach for duckdb):
inline SEXP truncate(SEXP x, R_xlen_t length, R_xlen_t capacity) {
  // Avoid SETLENGTH() and SET_TRUELENGTH() which trigger a warning on R-devel 4.5
#if R_VERSION >= R_Version(3, 4, 0) && R_VERSION < R_Version(4, 5, 0)
  SETLENGTH(x, length);
  SET_TRUELENGTH(x, capacity);
  SET_GROWABLE_BIT(x);
#else
  x = safe[Rf_lengthgets](x, length);
#endif
  return x;
}
Basically, it modifies the preprocessor directive. And I think it is quite straightforward to understand.
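To see which R versions end up on which branch, the condition can be written out as ordinary C++. This is only a sketch: r_version is my own stand-in for the R_Version(v, p, s) macro from R's Rversion.h (which packs the version triple into one integer), and uses_truelength_branch is a hypothetical helper that mirrors the edited #if line. Neither exists in R or cpp11.

```cpp
// Stand-in for the R_Version(v, p, s) macro from R's Rversion.h. It is
// redefined here only so that the sketch compiles without the R headers.
constexpr int r_version(int v, int p, int s) {
  return v * 65536 + p * 256 + s;
}

// Hypothetical helper mirroring the edited #if condition: true when the
// SETLENGTH()/SET_TRUELENGTH() branch would be compiled, false when the
// Rf_lengthgets() fallback would be compiled instead.
constexpr bool uses_truelength_branch(int version) {
  return version >= r_version(3, 4, 0) && version < r_version(4, 5, 0);
}

// The current release keeps the over-allocating branch ...
static_assert(uses_truelength_branch(r_version(4, 4, 0)),
              "R 4.4.0 still compiles the SETLENGTH branch");
// ... while R-devel (4.5.0) takes the fallback, like R older than 3.4.0.
static_assert(!uses_truelength_branch(r_version(4, 5, 0)),
              "R 4.5.0 compiles the Rf_lengthgets fallback");
static_assert(!uses_truelength_branch(r_version(3, 3, 3)),
              "R before 3.4.0 compiles the Rf_lengthgets fallback");
```

So the only behavioural change is for R 4.5.0 and later, which now get the same code path that very old versions of R always had.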
cpp11 and CRAN submission
As SET_TRUELENGTH is mostly an efficiency mechanism, I should study how the above fix would impact the performance of readODS. Actually, from reading the C++ code in readODS, I know for a fact that this fix will not impact the performance. It is because readODS does not push back to a cpp11 vector 3. As readODS has a suite of benchmarks, I checked the performance of readODS before and after vendoring cpp11. And as expected, there was no performance impact.
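The reason can be illustrated with a small "collect first, then fill" sketch in plain C++. Everything here is hypothetical and not taken from the readODS source: on_cell stands in for whatever the XML parser calls for each cell, and a second std::vector stands in for the cpp11 output vector so that the example compiles without R.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical parser callback: appends one cell to a plain std::vector,
// which manages its own capacity and never touches length/truelength in R.
inline void on_cell(std::vector<std::string>& cells, const std::string& value) {
  cells.push_back(value);
}

// Once parsing is done, the number of cells is known, so the output can be
// allocated once at its final size and filled by index. A std::vector stands
// in for the cpp11 output vector here.
inline std::vector<std::string> build_output(
    const std::vector<std::string>& cells) {
  std::vector<std::string> out(cells.size());  // single allocation
  for (std::size_t i = 0; i < cells.size(); ++i) {
    out[i] = cells[i];  // assignment by index, never push_back
  }
  assert(out.size() == cells.size());
  return out;
}
```

Because the output is never grown after it is created, it has no use for spare capacity, and switching the truncate method to the fallback branch costs nothing in this pattern.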
I then checked readODS with the vendored and modified cpp11 on GitHub Actions as well as with devtools::check_win_devel(). It does not emit any NOTE with the R-devel version. So, I think it is a win.
The true smoke test, however, was to submit readODS to CRAN. And the process was “alles in Butter”: 18 minutes on a Sunday. Compare this to three weeks for minty.
Previously on this blog: 1.7.0 / 1.8.0 / 2.0.0 / 2.1.0 4
Finally, I can say something about the new version of readODS. It does not have user-visible changes. As for the invisible changes, apart from the above vendored cpp11, readODS no longer requires readr and uses minty instead. This significantly reduces the number of dependencies of readODS. The installation time has been cut to 1/3 of that of readODS 2.2.0.
I just wanted to give an update to this blog about what happened after the CRAN submission.
After the CRAN submission of 2.3.0, I happened to look at the CRAN check results. Now, there is a new thing: rchk issues.
In case you don’t know, on top of the ordinary CRAN checks, there are also additional checks. Previously, I have mentioned the Valgrind check for memory leaks. rchk is for garbage collection bugs. Namely, R objects created in C code without being put on the protection stack can be removed by the R garbage collector.
The rchk check indicated that the edited vendored cpp11 code has an unprotected variable. After some investigation, I confirmed it and fixed it…
Checking rchk issues is not easy. The way suggested by the original developer of rchk does not work because the Docker image has not been updated for a long time. I used the Docker image provided by rhub.
1. If you are talking about packages affected by this “non-API calls conundrum” but not via cpp11, there are also dplyr, tibble, rlang, vctrs, and data.table.
2. I will not comment publicly on whether SET_TRUELENGTH should be banned or not.
3. But readODS does push back to a std::vector during the XML parsing. And then the information from the std::vector will be used to create a cpp11 vector. Therefore, the length of the output cpp11 vector is known.
4. I also wanted to add that there was a release, readODS 2.2.0, in February this year. But I did not blog about that release because I had no time due to this.