(I decided to change the title; I will keep the old ones as “crappyverse”.)
Some people have said that I should not teach .C. I totally agree with that, and the plan was always to abandon .C in the third post of the series. To summarize what .C offers:
The return value (e.g. answer in this case) must be passed as a pointer argument; the C function must be a void function.
    #include <stdio.h>
    #include <string.h>

    void cnchar(char **x, int *answer) {
        *answer = strlen(*x);
    }
Usually, the function (e.g. cnchar in this case) can also be used outside of R with little or no modification. This is because .C tries to match the data types from R to C (e.g. an R character vector to an array of C-style strings).
    #include <stdio.h>
    #include <string.h>

    void cnchar(char **x, int *answer) {
        *answer = strlen(*x);
    }

    int main(void) {
        char *some_string = "hello";
        int answer = 0;
        /* Pass the address of the string pointer, matching char **x. */
        cnchar(&some_string, &answer);
        printf("%d\n", answer);
        return 0;
    }
This concludes our discussion of .C. Let’s move on. However, I’ll need to talk about the memory model of R.
.Call is usually considered to be THE FFI one should use (actually there is also .External, which is extremely similar to .Call on the R side of business). .Call does not try to match data types the way .C does, and it can actually return a value (not through a pointer argument). The return value is an SEXP; a function called only for its side effects typically returns R_NilValue.
Anything you bring from R to C is an SEXP (S expression). I think it is easiest to think of it as a pointer to an R object. The abstraction is so nice that you can even think of it as the R object itself (except for GC, see below). You can also create an SEXP in C code and bring it back to R. Just as there are different types of vectors (e.g. character vector, logical vector), there are different types of SEXP, depending on what R object the pointer points to; e.g. STRSXP is a character vector. It is your responsibility to make sure your C code handles these types correctly.
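To make the idea of a typed pointer concrete, here is a toy model in plain C. It is only a sketch: toy_type, toy_object, toy_sexp and toy_first_nchar are made-up names for illustration and are not part of R's API, and R's real object layout is more involved. What it shows is the habit of checking the type tag before touching the payload. In real R code, TYPEOF() and predicates such as isString() play this role.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: these names are hypothetical and are NOT R's real API. */
typedef enum { TOY_INTSXP, TOY_STRSXP } toy_type;

typedef struct {
    toy_type type;   /* tag saying what the payload is */
    size_t length;   /* number of elements */
    union {
        const int *ints;       /* valid when type == TOY_INTSXP */
        const char **strings;  /* valid when type == TOY_STRSXP */
    } data;
} toy_object;

/* The toy "SEXP": a pointer to a tagged object. */
typedef const toy_object *toy_sexp;

/* Returns the length of the first string, or -1 if x is not a
   non-empty character vector. The tag is checked before the
   payload is touched. */
int toy_first_nchar(toy_sexp x) {
    if (x == NULL || x->type != TOY_STRSXP || x->length == 0) {
        return -1;
    }
    return (int) strlen(x->data.strings[0]);
}
```

Handing this function an integer-tagged object yields -1 instead of reading integers as if they were strings, which is the kind of mistake the type check is there to prevent.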
Fortunately, I have written about this topic before in the context of C++. In that post, I also hinted at R being a garbage-collected language.
R objects are stored in (heap) memory. When objects are no longer needed, they still occupy that memory. The memory is only released when a procedure called Garbage Collection (GC) runs. GC checks the objects in memory to see whether each one is still being referenced. If an object is no longer referenced, it is removed to free up the memory it occupied. One can trigger GC explicitly by running the gc() function. Even if one doesn’t do so, GC is still triggered automatically in the background.
In R code, you will probably never run into a situation where GC accidentally removes an object you still need (otherwise R would be quite an unsafe language). However, when you create an R object in C, the GC might decide that the memory occupied by your newly created object is not in use (because nothing in R references it) and is therefore garbage to be collected. For safety, the C API has the notions of PROTECT and UNPROTECT. Section 5.9.1 of R-exts is entirely about this. I will show you how to protect an R object during its life cycle in C later.
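The interplay between GC and protection can be sketched with a toy model in plain C. Everything here (toy_cell, toy_alloc, toy_protect, toy_unprotect, toy_gc) is hypothetical and much simpler than what R actually does; in particular, I assume the protection stack is the only thing that keeps an object alive. It shows why an unprotected, freshly allocated object is fair game for the collector, and why unprotecting takes only a count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not R's real implementation. */
#define TOY_HEAP_SIZE 8
#define TOY_STACK_SIZE 8

typedef struct {
    bool in_use;  /* slot currently holds a live object */
    bool marked;  /* set during collection if the object is protected */
    int value;
} toy_cell;

static toy_cell toy_heap[TOY_HEAP_SIZE];
static toy_cell *toy_protect_stack[TOY_STACK_SIZE];
static int toy_stack_top = 0;

/* Allocate an object in the first free slot; NULL if the heap is full. */
toy_cell *toy_alloc(int value) {
    for (int i = 0; i < TOY_HEAP_SIZE; i++) {
        if (!toy_heap[i].in_use) {
            toy_heap[i].in_use = true;
            toy_heap[i].marked = false;
            toy_heap[i].value = value;
            return &toy_heap[i];
        }
    }
    return NULL;
}

/* Push an object onto the protection stack and hand it back. */
toy_cell *toy_protect(toy_cell *obj) {
    if (toy_stack_top < TOY_STACK_SIZE) {
        toy_protect_stack[toy_stack_top++] = obj;
    }
    return obj;
}

/* Pop the last n protected objects; the argument is only a count. */
void toy_unprotect(int n) {
    toy_stack_top = (n > toy_stack_top) ? 0 : toy_stack_top - n;
}

/* Mark everything on the protection stack, then free the rest.
   Returns the number of objects that were collected. */
int toy_gc(void) {
    int collected = 0;
    for (int i = 0; i < toy_stack_top; i++) {
        if (toy_protect_stack[i] != NULL) {
            toy_protect_stack[i]->marked = true;
        }
    }
    for (int i = 0; i < TOY_HEAP_SIZE; i++) {
        if (toy_heap[i].in_use && !toy_heap[i].marked) {
            toy_heap[i].in_use = false;
            collected++;
        }
        toy_heap[i].marked = false;  /* reset for the next collection */
    }
    return collected;
}
```

In this model, an object that is allocated and protected survives a call to toy_gc(), while an unprotected object allocated right after it is collected. After toy_unprotect(1), the next toy_gc() collects the first object as well.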
cnchar, but for .Call
Once again, this is cnchar written for .C.
    #include <stdio.h>
    #include <string.h>

    void cnchar(char **x, int *answer) {
        *answer = strlen(*x);
    }
This is the same thing but for .Call.
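    #include <R.h>
    #include <Rdefines.h>
    #include <string.h>

    SEXP cnchar(SEXP x) {
        SEXP result;
        /* Allocate the length-1 integer result and protect it from GC. */
        PROTECT(result = NEW_INTEGER(1));
        /* Coerce the input to a character vector; the coerced value needs protection too. */
        PROTECT(x = AS_CHARACTER(x));
        /* Only the first element is measured. */
        INTEGER(result)[0] = (int) strlen(CHAR(STRING_ELT(x, 0)));
        UNPROTECT(2);
        return result;
    }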
Several things:
SEXP is both the input and the output type of the C function.
PROTECT() puts an SEXP (actually, the memory the SEXP occupies) onto the protection stack so that R’s GC does not collect it as garbage.
UNPROTECT() releases objects (ditto the memory statement above) from the stack. It really is a stack, and the argument is only an integer. In this case, 2 means releasing the last two protected things (result and x). It is important to UNPROTECT() every protected SEXP: failing to do so leaves the protection stack unbalanced (R warns about a stack imbalance) and keeps memory alive that should have been released.
NEW_INTEGER(), or more generally the NEW_*() functions, create a new vector of a specific length.
AS_CHARACTER(), or more generally the AS_*() functions, coerce an SEXP to a specific type.
STRING_ELT(), or more generally the *_ELT() functions, get the element of an SEXP at a specific position.
CHAR() converts an R data structure to C data (char*, aka a C-style string).
Compiling is the same as before: R CMD SHLIB cnchar.c. But after dyn.load("cnchar.so"), you call the C function like this:
.Call("cnchar", x = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch")
First, it is less crappy than the previous version because you don’t need to run it like: .C("cnchar", x = "hello", answer = as.integer(0)). Second, it really returns a single number, not a list. It is also less likely to segfault. The following calls at least give you something, not a segfault. Whether those are good behaviours is up for debate.
.Call("cnchar", x = 123)
.Call("cnchar", x = TRUE)
.Call("cnchar", x = NA)
.Call("cnchar", x = c("a", "b")) # only one number
The last call returns only one number because the C function only looks at the first element of its input. A crappy way to do vectorization is to use vapply().
vapply(c("a", "abc"), function(z) .Call("cnchar", z), 1L)
It is, however, super trivial to vectorize the C function. Unlike in C++, there is probably only one straightforward way to do it: use a for loop to iterate over the indices.
    #include <R.h>
    #include <Rdefines.h>
    #include <string.h>

    SEXP cnchar2(SEXP x) {
        R_xlen_t xlength = Rf_xlength(x);
        SEXP result = NEW_INTEGER(xlength);
        PROTECT(result);
        PROTECT(x = AS_CHARACTER(x));
        /* Use R_xlen_t for the index so it matches the type of xlength. */
        for (R_xlen_t i = 0; i < xlength; i++) {
            INTEGER(result)[i] = (int) strlen(CHAR(STRING_ELT(x, i)));
        }
        UNPROTECT(2);
        return result;
    }
Sure enough, it vectorizes.
.Call("cnchar2", x = c("a", "bbc"))
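Since the loop itself is ordinary C, the same pattern can be tried without R at all. The following is a minimal sketch, assuming the input is already an array of C strings; plain_nchar is a hypothetical name and nothing in it uses the R API.

```c
#include <stddef.h>
#include <string.h>

/* Plain-C sketch of the same index loop. plain_nchar is a hypothetical
   name; the caller supplies an output array with room for n ints. */
void plain_nchar(const char *const *x, size_t n, int *answer) {
    for (size_t i = 0; i < n; i++) {
        answer[i] = (int) strlen(x[i]);
    }
}
```

The index has the same type as the length (size_t here, R_xlen_t in the R version), so the comparison in the loop condition never mixes integer types.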
The definitive guide for this is Chapter 6 of R-exts. But I also like unofficial documentation sources such as r-internals.
Let’s talk about it next time.