chainsawriot

Building the Crappyverse #2: cnchar()

Posted on May 22, 2023 by Chung-hong Chan

Previously: ccat()

Suppose this time I want to build a crappy version of nchar() in C. Unlike last time, when I ran ccat() for its side effect of printing something, I now need a return value: the number of characters.

string.h of the standard library has the function strlen(), so let’s use that.

And I wrote something like this:

#include <stdio.h>
#include <string.h>

int cnchar(char **x) {
    int nc;
    nc = strlen(*x);
    return nc;
}

Compile it as .so, dyn.load() it, and run this in R:

.C("cnchar", x = "hello")

On my machine, it gives me:

$x
[1] "a"

Totally not what I wanted. Why is that?

First of all, the function cnchar() is some perfectly valid C code. You can give it a try by modifying the C file to:

#include <stdio.h>
#include <string.h>

int cnchar(char **x) {
    int nc;
    nc = strlen(*x);
    return nc;
}

int main(void) {
    /* char* is a C-style string */
    char* some_string = "hello";
    /* char** is the pointer to a string */
    char** y = &some_string;
    printf("%d\n", cnchar(y));
    return 0;
}

Despite the craziness of the one star, two stars, and ampersand, the function is correct: compile it with GCC, run it, and see for yourself. Why doesn’t it work in R? Because developing C with return values for the FFI, at least with .C(), is not like developing in ordinary C. Digging into Writing R Extensions, this is what it says about the .C() interface:

Note that the compiled code should not return anything except through its arguments: C functions should be of type void…

So I should write a void C function instead. And since .C() can only hand values back through the arguments, the C code should be:

#include <stdio.h>
#include <string.h>

void cnchar(char **x, int *answer) {
    *answer = strlen(*x);
}

In other words, the return value (here: *answer) should be one of the arguments, and the C code should write the result through that pointer, i.e. modify what answer points to. With the above modification, it works:

.C("cnchar", x = "hello", answer = as.integer(0))

And the output is:

$x
[1] "hello"

$answer
[1] 5
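The same "return through an argument" dance can be tried in plain C, without R. The following is only a sketch: cnchar_from_c() is a hypothetical helper, not something .C() actually runs, but it shows roughly what any caller has to do: own the storage for the result, pass its address, and read it back afterwards.

```c
#include <string.h>

/* Same signature as the .C()-compatible function above. */
void cnchar(char **x, int *answer) {
    *answer = (int) strlen(*x);
}

/* Hypothetical helper, for illustration only. */
int cnchar_from_c(char *s) {
    int answer = 0;       /* caller-owned storage, like as.integer(0) in R */
    cnchar(&s, &answer);  /* &s is a char**, &answer is an int* */
    return answer;        /* the result was written through the pointer */
}
```

Calling cnchar_from_c() with a string is the pure C counterpart of picking $answer out of the list that .C() returns.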

And if I need to write an R wrapper, the crappy version of nchar() should be like this:

cnchar <- function(x) {
    .C("cnchar", x = x, answer = as.integer(0))$answer
}

cnchar("hello world!")
nchar("hello world!")

With this, I got to a point where I realized two things:

One can’t go any further without a good understanding of pointers.
It is super unnatural to put the return value in the arguments!

Pointers

The FFI is all about pointers because all values from R are passed to the C function as pointers. And in my opinion, C itself is all about pointers. The difficult part about pointers is the overloading of the operator * in C: it has several meanings (outside the context of pointers, it also means multiplication). It’s like “Oida” in Viennese. Below I take the above “pure C” code and annotate all the meanings of *.

#include <stdio.h>
#include <string.h>

int cnchar(char **x) { /* This function asks for a pointer to the memory location of a C-style string */
    int nc;
    nc = strlen(*x); /* here, *x means dereference x */
    return nc;
}

int main(void) {
    char* some_string = "hello"; /* here, we create a C-style string, and its type is char* */
    char** y = &some_string; /* here, we create a pointer to the memory location of some_string */
    printf("%d\n", cnchar(y));
    return 0;
}

Similarly, for the rewritten C code:

#include <stdio.h>
#include <string.h>

void cnchar(char **x, int *answer) { /* This function asks for a pointer to a C-style string and a pointer to an integer */
    *answer = strlen(*x);
}
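To put the different jobs of the asterisk side by side, here is a tiny sketch. The function double_through_pointer() is made up for this illustration and has nothing to do with R or cnchar(); it just squeezes a declaration, two dereferences, and one multiplication into two lines.

```c
/* Hypothetical function: doubles a value by going through a pointer. */
int double_through_pointer(int value) {
    int *p = &value;  /* in a declaration, * says "p is a pointer to an int" */
    *p = *p * 2;      /* first and second *: dereference p; third *: multiplication */
    return value;     /* value was modified through p */
}
```

The middle line is the confusing one: three asterisks, two meanings, and only the position tells them apart.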

Computer science educators have come up with different metaphors to explain pointers. For me, the concept is actually not that difficult to understand. I think the instructor of Harvard CS50 explains it better than I ever could, so I link to it here. Again, the confusion usually comes from the multiple meanings of the asterisk. I really wish C could be written like this:

#include <stdio.h>
#include <string.h>

void cnchar(char as_pointer(as_string(x)), int as_pointer(answer)) {
    content_of(answer) = strlen(content_of(x));
}

Just imagine that all the as_pointer(), as_string(), and content_of() are asterisks. With this, I can also explain the ampersand: that’s address_of().
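Part of this wish can be faked with the preprocessor. The sketch below is an assumption-laden toy, not a recommendation: the macros content_of() and address_of() and the typedef names are all invented here, and typedefs stand in for as_string() and as_pointer() because a macro cannot wrap a declaration that neatly.

```c
#include <string.h>

/* Hypothetical macros, named after the wishful syntax above. */
#define content_of(p) (*(p))  /* dereference */
#define address_of(v) (&(v))  /* take the address */

/* Hypothetical typedefs standing in for as_string() and as_pointer(). */
typedef char *string;            /* a C-style string, char*   */
typedef string *string_pointer;  /* pointer to a string, char** */
typedef int *int_pointer;        /* pointer to an int, int*   */

void cnchar(string_pointer x, int_pointer answer) {
    content_of(answer) = (int) strlen(content_of(x));
}
```

After preprocessing, this is exactly the two-argument cnchar() with asterisks. Real C code rarely hides pointers like this, because the next reader then has to look up what every macro and typedef means.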

Why is it crappy?

As I said, it is super unnatural to put the return value in the arguments. And if the answer argument is not provided, R gets killed by a segmentation fault.

.C("cnchar", x = "hello world") ## seg fault

It is also not as efficient as nchar():

input <- paste(rep("a", 65535000), collapse = "")
system.time(.C("cnchar", x = input, answer = as.integer(0))$answer)
system.time(nchar(input))

Similar to ccat(), cnchar() is not vectorized and does no type checking. I will talk about a better way to write similar C functions next time.
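Staying with .C() for a moment, a vectorized variant could look like the sketch below. The name cnchar_vec() and its extra n argument are assumptions made for this example: since a character vector arrives as a char** and the C side has no way of knowing its length, the caller has to pass the length and an answer vector of the same size.

```c
#include <string.h>

/* Hypothetical vectorized variant.
   x:      the strings
   n:      pointer to the number of strings in x
   answer: must have room for *n integers */
void cnchar_vec(char **x, int *n, int *answer) {
    for (int i = 0; i < *n; i++) {
        answer[i] = (int) strlen(x[i]);
    }
}
```

From R, this would presumably be called as .C("cnchar_vec", x = x, n = length(x), answer = integer(length(x)))$answer. It is still crappy: there is no type checking, and a wrong n or a too-short answer reads or writes memory it should not touch.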

Postscript

Void pointer

To make it more confusing, there are also typed pointers and void pointers. A typed pointer is something like int *x;. A void pointer needs to be cast to a typed pointer before it can be dereferenced, with, you guessed it, an asterisk.

#include <stdio.h>
int main(void) {
    int x = 123;
    void *pointer_to_x = &x;
    /* cast to int* first, then dereference */
    printf("%d\n", *((int *) pointer_to_x));
    return 0;
}

To my understanding, extending R does not usually use void pointers.

Unicode

Another way that cnchar() is crappy is that it doesn’t work correctly with non-ASCII characters.

nchar("アムロ") #3
cnchar("アムロ") #9

nchar("Graichen lässt Doktorarbeit überprüfen") #38
cnchar("Graichen lässt Doktorarbeit überprüfen") #41

What cnchar() returns is the number of bytes of the input string. That’s what the underlying C function strlen() returns. You can get the same thing by changing the type argument of nchar().

nchar("アムロ", type = "bytes") #9
nchar("Graichen lässt Doktorarbeit überprüfen", type = "bytes") #41

In ASCII, one character is one byte, so strlen() can also be used to count characters in that special case. But UTF-8 characters are variable-length: each takes 1 to 4 bytes.
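The variable-length encoding also hints at a simple way to count characters instead of bytes. In UTF-8, every byte after the first one of a character is a continuation byte with the bit pattern 10xxxxxx, so skipping those counts code points. The function below is a minimal sketch under the assumption that the input is valid, NUL-terminated UTF-8; the name utf8_nchar() is made up, and this is not a claim about how R implements nchar().

```c
#include <stddef.h>

/* Hypothetical helper: count UTF-8 code points by skipping
   continuation bytes (bit pattern 10xxxxxx).
   Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_nchar(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            count++;  /* this byte starts a new code point */
        }
    }
    return count;
}
```

For plain ASCII input this agrees with strlen(). It counts code points, which is not always the same as what a human reads as one character (think of combining accents), and it does nothing sensible with invalid input.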

If you wanna know how R deals with that, UTSL.