Werden wir Helden für einen Tag

Building the Crappyverse #1: ccat()

Posted on May 19, 2023 by Chung-hong Chan

Instead of my usual style of explaining the purpose of this blog post, I want to go directly into code.

C

On Page 6 of the K&R book (The C Programming Language), there is the now famous Hello World C program.

#include <stdio.h>

main()
{
	printf("hello world\n");
}

I think even people without much C experience can follow this program. The first line, starting with a hash, might look strange, but it is like library() or require() in R: it makes the function printf available. The program also has a main function, which one can think of as the main body of the program.

In order to run this, according to the book, one should save it as a file (e.g. hello.c), compile it to a binary and run it.

cc hello.c
./a.out

The reason why we need to compile it is probably well known, but I repeat it here nonetheless: C, unlike R, is a compiled language. The compiler converts the above C program into machine code (a binary), which can then be executed directly. In principle, the output program a.out can be run directly on another machine (with a similar OS and architecture, at least), and that machine does not need a C compiler to run it.

Of course, the more “modern” way of writing the Hello World program is this:

#include <stdio.h>

int main() {
	printf("hello world\n");
	return 0;
}

Apart from the style, it states the return type and the return value (0, the Unix exit status code for success). Again, C, unlike R, is a statically typed language. In such a language, one needs to specify the type of every variable as well as of every return value. In the above case, I specify the return type as an int (integer).

And a better way to compile this is:

gcc hello.c -o hello
./hello
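To make the typing and the exit status a bit more concrete, here is a minimal sketch. The helper exit_status() is a hypothetical name for illustration and not part of the original program. Both its parameter and its return value have a declared type, and the returned value follows the Unix convention that 0 means success.

```c
/* Hypothetical helper for illustration: the parameter and the return
   value both carry an explicit type (int). */
int exit_status(int ok)
{
    /* By Unix convention, 0 signals success and non-zero signals failure. */
    return ok ? 0 : 1;
}
```

A main() could end with return exit_status(1); and, after running ./hello in a shell, echo $? would show the status the program returned.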

R

In R, the Hello World program is much simpler.

cat("hello world\n")

But can one run the C version of Hello World inside R? Of course, one can compile the Hello World program to a binary and then run the binary with system().

system("./hello")

But this is kind of boring. I have always wanted to explore the Foreign Function Interface (FFI) of R. The idea is to compile the C code into a shared library, which can be called directly inside R via the functions .C or .Call. Let’s focus on .C this time.

With the same hello.c, I compile it to a shared library using the R CMD interface.

R CMD SHLIB hello.c

Doing that generates a shared library file (on my Linux machine, that’s hello.so; on Windows, it should be hello.dll). Now, I can dynamically load this shared library from within R.

dyn.load("hello.so")

Nothing else happens. The trick is to use the FFI to call the main function. Like this:

.C("main")

The output on my machine is like this:

hello world
list()

At least the first line indicates that the interfacing is working. The second line is the empty list returned by .C. According to the documentation, it’s “[a] list similar to the … list of arguments passed in (including any names given to the arguments), but reflecting any changes made by the C or Fortran code.” Let’s ignore it for now. A simple way to suppress the printed list is:

invisible(.C("main"))

And now, it is almost equivalent ¹ to

cat("hello world\n")
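The documentation’s phrase “reflecting any changes made by the C or Fortran code” is easier to see with a function that actually takes arguments. The following is only a minimal sketch: double_it() is a hypothetical function that does not appear elsewhere in this post. Everything arrives as a pointer, and the function writes its result back into the memory it was given.

```c
/* Hypothetical .C-style function: doubles each element of x in place.
   .C passes every argument as a pointer, so even the length n
   arrives as int * rather than int. */
void double_it(double *x, int *n)
{
    for (int i = 0; i < *n; i++) {
        x[i] = 2.0 * x[i];
    }
}
```

Assuming it is compiled with R CMD SHLIB and loaded with dyn.load(), calling .C("double_it", x = as.double(1:3), n = 3L) from R should return a list whose x element holds 2, 4 and 6.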

Building the Crappyverse

And it makes me wonder: Can I rewrite the R function cat() in C? The answer is yes…ish.

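#include <stdio.h>

/* Functions called through .C should return void;
   results travel back via the arguments. */
void ccat(char **x) {
    /* .C passes a character vector as char **; print its first string. */
    printf("%s", *x);
}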

The C function is called ccat() because it is written in C, or because it is a crappy version of the R function cat(). Please ignore the * business (pointers) for now. In order to run this, I need to compile it as a shared library:

R CMD SHLIB ccat.c

And then run it from R:

dyn.load("ccat.so")
invisible(.C("ccat", x = "Hello world\n"))

Extremely convoluted, but it works. I can also make it less convoluted by writing a wrapper R function:

ccat <- function(x) {
    invisible(.C("ccat", x = x))
}
ccat("Hello world\n")

And now, I have a crappy version of a built-in R function written in C. I recently found these “building the Crappyverse” exercises quite refreshing because they allow me to poke into the FFI and at the same time improve my systems programming knowledge in C. For these Crappyverse exercises, the expected behavior of the function is known (in this case, the behavior of the R function cat()), so it is just reinventing the wheel. Doing that is slightly more challenging than the usual programming exercises in many C books. I was inspired by the book “Command-Line Rust” by Ken Youens-Clark, where the exercises are just reinventing common Unix commands. In my opinion, extending R with C is in general a very niche kind of black magic and probably won’t land you a job. However, as R is mainly written in C and I really want to dig deeper, this Crappyverse experience is quite important.

Another fun part of the exercise is to poke around and explain why the reinvented function is crappy. For example, only the first element of a character vector is printed.

cat(c("Hello world\n", "Bonjour le monde\n"))
ccat(c("Hello world\n", "Bonjour le monde\n"))
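The first-element limitation comes from ccat() only ever looking at *x. A less crappy version needs the length of the vector, which a .C function cannot discover on its own. The sketch below assumes the length is passed in as a second argument. The names cat_to() and ccat_all() are hypothetical, and the single-space separator is meant to mimic the default sep of cat().

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the n strings in x to out, separated by
   single spaces, and returns the number of characters written. */
int cat_to(FILE *out, char **x, int n)
{
    int written = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0) {
            fputc(' ', out);      /* separator between elements */
            written++;
        }
        fputs(x[i], out);
        written += (int) strlen(x[i]);
    }
    return written;
}

/* Hypothetical .C entry point: .C passes the length as int *. */
void ccat_all(char **x, int *n)
{
    cat_to(stdout, x, *n);
}
```

From R, this could be called as .C("ccat_all", x = c("Hello world\n", "Bonjour le monde\n"), n = 2L), which should print both elements much as cat() does.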

And giving ccat() anything other than a character vector can kill R with a segmentation fault!

cat(123) ## okay
ccat(123) ## seg fault!
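A likely explanation for the crash: .C hands a numeric vector to C as double *, but ccat() declares char **, so the bytes of 123 get treated as a memory address. The sketch below uses a hypothetical helper, double_bits(), and assumes 64-bit IEEE 754 doubles, which is what common platforms use. It shows the bit pattern that ccat() would end up trying to dereference.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw 64 bits of a double.
   Assumes sizeof(double) == 8 and an IEEE 754 representation. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* copy the bytes without conversion */
    return bits;
}
```

Under those assumptions, double_bits(123.0) gives 0x405EC00000000000, which is very unlikely to be the address of a valid string, hence the segmentation fault.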

¹ I say “almost”, because it is not.