Previously: ccat()
Suppose this time I want to build a crappy version of nchar() in C. Unlike last time, when I ran ccat() for the side effect of printing something, I now need a return value: the number of characters.
string.h in the C standard library has the function strlen(), so let’s use that.
And I wrote something like this:
#include <stdio.h>
#include <string.h>
int cnchar(char **x) {
int nc;
nc = strlen(*x);
return nc;
}
Compile it as a shared object (.so), dyn.load() it, and run this in R:
.C("cnchar", x = "hello")
On my machine, it gives me:
$x
[1] "a"
Totally not what I wanted. Why is that?
First of all, the function cnchar() is perfectly valid C code. You can give it a try by modifying the C file to:
#include <stdio.h>
#include <string.h>
int cnchar(char **x) {
int nc;
nc = strlen(*x);
return nc;
}
int main(void) {
/* char* is a C-style string */
char* some_string = "hello";
/* char** is a pointer to a string */
char** y = &some_string;
printf("%d\n", cnchar(y));
return 0;
}
Despite the craziness of one star, two stars, and an ampersand, the function is correct: compile it with GCC, run it, and it prints 5. Why doesn’t it work in R? Because writing C functions with return values for the FFI, at least with .C(), is not like writing ordinary C. Digging into Writing R Extensions, the section on the .C() interface says:
Note that the compiled code should not return anything except through its arguments: C functions should be of type void…
Therefore, I should write a void C function instead, because .C() can only return something through its arguments. To return a value from C via .C(), the C code should be:
#include <stdio.h>
#include <string.h>
void cnchar(char **x, int *answer) {
*answer = strlen(*x);
}
In other words, the return value (here: *answer) should be one of the arguments, and the C code writes the result through that pointer, which modifies the integer that answer points to. With the above modification, it works:
.C("cnchar", x = "hello", answer = as.integer(0))
And the output is:
$x
[1] "hello"
$answer
[1] 5
And if I need to write an R wrapper, the crappy version of nchar()
should be like this:
cnchar <- function(x) {
.C("cnchar", x = x, answer = as.integer(0))$answer
}
cnchar("hello world!")
nchar("hello world!")
With this, I got to the point where I realized:
The FFI is all about pointers, because values from R are all passed to the C function as pointers. In my opinion, C itself is all about pointers too. The difficult part about pointers is the overloading of the operator * in C: it can have several meanings (and outside the context of pointers, it also means multiplication). It’s like “Oida” in Viennese. Below, I take the “pure C” code above and annotate all the meanings of *.
#include <stdio.h>
#include <string.h>
int cnchar(char **x) { /* here, char **x declares x as a pointer to a C-style string (a pointer to char*) */
int nc;
nc = strlen(*x); /* here, *x means dereference x, which gives the C-style string */
return nc;
}
int main(void) {
char* some_string = "hello"; /* here, we create a C-style string, and its type is char* */
char** y = &some_string; /* here, we create a pointer to the memory location of some_string */
printf("%d\n", cnchar(y));
return 0;
}
Similarly, for the rewritten C code
#include <stdio.h>
#include <string.h>
void cnchar(char **x, int *answer) { /* This function asks for a pointer to a C-style string and a pointer to an integer */
*answer = strlen(*x); /* here, both *answer and *x mean dereference */
}
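Neither listing shows the multiplication meaning of the asterisk. As a small illustration, here is a hypothetical function (double_it() is a made-up name, not something from R or the standard library) that packs the declaration, dereference, and multiplication meanings into a couple of lines:

```c
/* In a declaration, '*' makes value a pointer to int. */
int double_it(int *value) {
    /* The first '*' dereferences value; the second '*' multiplies. */
    return *value * 2;
}
```

Same character, three jobs, and the compiler tells them apart purely from the position.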
Computer science educators have come up with different metaphors to explain pointers. For me, the concept is actually not that difficult to understand. I think the instructor of Harvard CS50 explains it better than I ever could, so I link to it here. Again, the confusion usually comes from the multiple meanings of the asterisk. I really wish C could be written like this:
#include <stdio.h>
#include <string.h>
void cnchar(char as_pointer(as_string(x)), int as_pointer(answer)) {
content_of(answer) = strlen(content_of(x));
}
Just imagine that all the as_pointer(), as_string(), and content_of() are asterisks. With this, I can also explain the ampersand: that’s address_of().
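Funnily enough, the C preprocessor can get fairly close to this wish. The following is only a toy sketch: the four macros and the count_hello() helper are made up for this illustration, they are not part of any library, and nobody should write real C this way.

```c
#include <string.h>

/* Toy macros, made up for this illustration only. */
#define as_pointer(x) *x      /* '*' in a declaration: x is a pointer */
#define as_string(x) *x       /* '*' in a declaration: x is a char*, a C-style string */
#define content_of(p) (*(p))  /* '*' in an expression: dereference p */
#define address_of(v) (&(v))  /* '&' in an expression: the address of v */

/* After preprocessing, this reads: void cnchar(char **x, int *answer) */
void cnchar(char as_pointer(as_string(x)), int as_pointer(answer)) {
    content_of(answer) = (int) strlen(content_of(x));
}

/* Hypothetical helper that calls cnchar() the way main() did earlier. */
int count_hello(void) {
    char as_string(some_string) = "hello";  /* char *some_string = "hello"; */
    int n = 0;
    cnchar(address_of(some_string), address_of(n));
    return n;
}
```

The preprocessor turns every macro back into an asterisk or an ampersand, so the compiler sees exactly the same code as before.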
As I said, it is super unnatural to put the return value in the arguments. And if the answer argument is not provided, the C function writes through an answer pointer that R never supplied, which kills R with a segmentation fault.
.C("cnchar", x = "hello world") ## seg fault
It is also not as efficient as nchar():
input <- paste(rep("a", 65535000), collapse = "")
system.time(.C("cnchar", x = input, answer = as.integer(0))$answer)
system.time(nchar(input))
Similar to ccat(), cnchar() is not vectorized and has no type checking. I will talk about a better way to write similar C functions next time.
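Still within .C(), one possible way to handle a whole character vector is to pass its length as an extra argument and fill an integer vector of the same length. This is only a sketch under assumptions: cnchar_vec() and its argument names are made up here, the R side is assumed to pass x as a character vector, n as as.integer(length(x)), and answer as integer(length(x)), and it still counts bytes rather than characters.

```c
#include <string.h>

/* Hypothetical vectorized variant: x holds *n C-style strings,
   and answer must have room for *n integers. */
void cnchar_vec(char **x, int *n, int *answer) {
    for (int i = 0; i < *n; i++) {
        /* x[i] is the i-th string; strlen() counts its bytes. */
        answer[i] = (int) strlen(x[i]);
    }
}
```

From R, the call would look something like .C("cnchar_vec", x = c("a", "hello"), n = 2L, answer = integer(2))$answer, with the type coercion and the length bookkeeping done in an R wrapper. If n is larger than the vectors actually passed, the loop runs off the end, so the wrapper is the only safety net.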
To make it more confusing, there are also typed pointers and void pointers. A typed pointer is something like int *x;. A void pointer needs to be cast to a typed pointer before it can be dereferenced, and the cast is written with, you guessed it, an asterisk.
#include <stdio.h>
int main(void) {
int x = 123;
void *pointer_to_x = &x;
/* (int *) casts the void pointer to an int pointer; the leading * dereferences it */
printf("%d\n", *((int *) pointer_to_x));
return 0;
}
To my understanding, extending R does not usually use void pointers.
Another way that cnchar() is crappy is that it doesn’t work correctly with non-ASCII characters.
nchar("アムロ") #3
cnchar("アムロ") #9
nchar("Graichen lässt Doktorarbeit überprüfen") #38
cnchar("Graichen lässt Doktorarbeit überprüfen") #41
What cnchar() returns is the number of bytes in the input string, because that’s what the underlying C function strlen() returns. You can get the same thing by changing the type argument of nchar().
nchar("アムロ", type = "bytes") #9
nchar("Graichen lässt Doktorarbeit überprüfen", type = "bytes") #41
One character is equal to one byte in ASCII, so strlen() can also be used to count characters in that special case. But UTF-8 characters are variable length: each one takes 1 to 4 bytes.
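For UTF-8 input, one rough way to count characters instead of bytes is to skip the continuation bytes, which always have the bit pattern 10xxxxxx. The sketch below assumes the string handed over by R is valid UTF-8, uses a made-up name (cnchar_utf8()), counts code points only, and is not how R implements nchar().

```c
/* Hypothetical variant of cnchar() that counts UTF-8 code points. */
void cnchar_utf8(char **x, int *answer) {
    int count = 0;
    for (const unsigned char *p = (const unsigned char *) *x; *p != '\0'; p++) {
        /* Bytes of the form 10xxxxxx continue a character; all others start one. */
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    *answer = count;
}
```

With this, the 9 bytes of "アムロ" count as 3 characters, provided the string really arrives in UTF-8; with any other encoding the result is meaningless.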
If you wanna know how R deals with that, UTSL.