chainsawriot

C++ as a second language from an R programmer's perspective

Posted on Dec 26, 2020 by Chung-hong Chan

seg

Skippable introduction

Advent of Code (AoC) 2020 is over. I though I could solve all of these problems with Rcpp. I might have the ability to solve all of those insanely difficult puzzles, but surely I don’t have the time.

As the creator has said, the whole thing is not only about earning stars. It is about what you have learned. For me, this AoC was a deliberate practice to me. I wanted to sharpen my C++, so that I can write better and more efficient Rcpp code. I think I’ve archived that goal. It is better than those stars.

So, here are what you need to know to use C++ as your second language.

Empowerment promise

In this article, I am going to share what I’ve learned from solving typical problems with C++/Rcpp. After reading this article, you, as an R programmer, should know how to make developing Rcpp easier, with less errors, and hopefully, more joyful.

Preamble: The raison d’être

It can be easily illustrated by this sample code (extracted from Wickham et al. Advanced R). It is a reimplementation of the sum() function.

(sum() is incredibly fast. So reimplementing sum() like this is reinventing the wheel. But it is a good and simple enough exercise.)

Rcpp::cppFunction('double sumC(NumericVector x) {
    int n = x.size();
    double total = 0;
    for(int i = 0; i < n; ++i) {
        total += x[i];
    }
        return total;
    }')

Running this in R will only give you a function. Another way to do this is to type those C++ program in a separate file and then use Rcpp::sourceCpp() to read that file. Notable features of C++ are:

You need to define the type of a new variable (e.g. double for double precision number, int for integer), as well as what type the function is going to return (the very first double).
C++ is zero-indexed (vs. 1-indexed in R).
You need to terminate a line with a semicolon.
For loop is not evil.

You can run the function sumC like a regular R function.

sumC(c(1,2,3,4))

But under the hood, Rcpp compiled the C++ code with a compiler (e.g. gcc) and gave you a link, or more accurately, a pointer, to the compiled version. The compiled version contains machine language instructions.

R (the language), instead, is an interpreted language. R (the application program, e.g. GNU R) is an interpreter of your program. There is no compilation step and the interpreter executes your R code when you run your program. The good things are:

It is easier to develop
It gives you instant feedback, so you can have nice things such as repl (read-eval-print loop) a.k.a. your R prompt.

The only bad thing:

Usually, it runs comparatively slower than compiled code.

You can try to implement the same function in R:

sumR <- function(x) {
    total <- 0
    for (y in x) {
    	total <- total + y
    }
    return(total)
}

And compare the performance of the original sum, sumC and sumR.

x <- rnorm(100000000)
microbenchmark::microbenchmark(sum(x), sumC(x), sumR(x), times = 20)

On my computer, the R version took in average 3 seconds, but the C++ counterpart took 194 milliseconds (i.e. 0.194 second). In other words, the C++ version runs around 20 times faster than the R version. As in this example showed, the major reason for R programmers to (re)write something in C++ is to achieve better performance.

Having said so, however, there is a trade-off between your time and computer’s running time. So, if you run some R code only occasionally or even YOLO (you only live once), it makes no sense in spending 2 hours to rewrite those code in C++ (a language that you are probably not familiar with) to save maybe 3s of your computer’s running time.

If you need to run some code very frequently (e.g. in an R package; or it is a part of a simulation and that part runs for 100000 times), it might make more sense to do the C++ rewrite.

Another major reason to write C++ is to take advantage of the efficient C++/C libraries out there. Boost, Armadillo, Eigen, Shark, GNU Scientific Library, just to name a few. And almost all deep learning frameworks.

C++ has the bad reputation of being a difficult language. However, my opinion is that C++ is not a particular “difficult” language, especially true for newer standards. But it is notoriously vast. If you only goal is to link C++ with R, you don’t need to know everything to kick start. Well, actually, it is not possible for one to know everything to kick start anything. The traditional Pareto distribution applies: you only need to know 20% of the features to achieve 80% of what you need to achieve. Some good links ¹ are:

Advanced R by Wickham et al.
Rcpp for everyone by Tsuda. ²
Basic C++ through Rcpp (slides) by Olmstad.
Unofficial Rcpp API documentation by Balamuta.

Good resources from the official Rcpp team:

Extending R with C++ by Eddelbuettel & Balamuta
Introduction to Rcpp: From simple examples to machine learning (YouTube Video) by Eddelbuettel.

For general C++, I use cppreference a lot. If you need a book, I recommend A Tour of C++ (2nd ed) by Bjarne Stroustrup.

Use (at least) C++11

Probably, you will have a better experience when you program in C++11 than C++98. Probably you will have even better experience with C++17 (or in the future, C++20) too.

Enabling C++11 is easy: you don’t need to do anything if you are using R 4.0 or later. If you want to be sure, simply add this line to your C++ code.

// [[Rcpp::plugins("cpp11")]]

It allows you to use some neat things such as the shorter for loop and auto (see below). Initializing a vector like this is also neat.

std::vector<int> myvector = {1, 2, 3, 4};

BTW, that angle brackets indicate using a template. It usually (but not always) describes what is the type of the thing. std::vector<int> means it is a vector that contains a bunch of integer values. std::vector<std::string> means it is a vector that contains a bunch of string values.

Use iterator because iteration pattern repeats a lot

This pattern repeats a lot.

for (NumericVector::iterator i = input.begin(); i != input.end(); ++i) {
    std::cout << *i << std::endl;
}

Importance is you need to deference each item (i.e. *i), because it only generates a pointer (or a pointer-like stuff) to the actual elements in the container (in this case: NumericVector). It also applies to functions/methods that return an iterator (e.g. std::max_element()). Remember to deref.

If you use C++11, you can even write something like this without using the iterator and dereferencing. In my later solutions, I switched to this style. I will talk to you about that little ampersand in the next section.

for (const int& i : input) {
    std::cout << i << std::endl;
}

Of course, in functional programming languages, this pattern would probably be handled by higher-order functions such as map() or apply() or whatnot. ³.

You probably don’t need to meddle in the pointer business, but you need to know how to pass by reference

Pointer is powerful but also dangerous. I don’t think one would frequently need to handle pointers in Rcpp, (or pass by pointer). But you’ll need to know how to pass by reference.

By default and very similar to R, C++ functions are using pass by value. That is, when you call a function, the data you pass to that function will copy from the call stack (say memory address A) to another stack (say memory address B) where the function is executed. When the execution of function is over, the function might return a value back to the call stack (address A). For small data e.g. one number, this copying is probably fine. But when you are copying 1G of data around in the memory for 10000 times, then it might become a performance issue. For example, it is particular problematic for functions that only subsetting a matrix.

A simple fix to this is to make your function “call by reference” by putting a little ampersand after the type.

unsigned int parse_all(std::vector<std::string>& input);

By doing so, you are telling this function not to copy the data. Instead, you are asking your function to read the data directly from the current memory address (and to modify the data, or mutate, in that memory address if possible).

Of course, it would usually be a better idea to do this when you are simply reading the data. Modifying the data when you are calling by reference might lead to surprise, at least it is not a behavior expected by R programmers. A simple way to ensure safety is to make the reference immutable using the const keyword and don’t do any mutation in the function.

unsigned int parse_all(const std::vector<std::string>& input);

Now, your compiler will complain when input is mutated.

Beware of the conversion between std::string and Rcpp::String

Usually, when a C++ function returns std::string to R, the output is automatically coerced into Rcpp::String. The same goes with std::vector<std::string>. You don’t need to explicitly convert it back to Rcpp::CharacterVector. There might be a small performance penalty, but I am not writing the firmware of a fighter jet or a low latency algorithmic trading system. If it were the case, I probably ditch the R layer.

However, when you bring Rcpp::String into C++, you can’t always assume C++ functions taking std::string would accept that. A safer way is to first convert it.

std::string line = as<std::string>(rstring);

Later I realized the cheap way to handle this is to eschew CharacterVector altogether: just use std::vector<std::string>…

Beware of integer overflow

C++ is statically typed, whereas R is dynamically typed. The discussion so far is mostly about type. For people program mostly in dynamically typed languages such as R and Python (maybe the father of all: Lisp), type checking is weird. But type checking provides safety and efficiency.

On the flip side, however, it can be problematic if you are careless. And for this problem, Type checking can’t save you. You can only discover this during the run time.

overflow

The commonest C++ type is perhaps integer. But what can be stored as an integer has a limit. For example, a regular integer (int) is stored in an allotted 4 bytes of memory and thus it can only hold an integer within the range of -2147483648 to +2147483647. If you add one to an integer +2147483647, it will overflow and goes to -2147483648. The opposite is underflow. The most famous example is the nuclear Gandhi bug in the computer game Civilization. The value of aggressiveness of most playable characters has a range of 2 to 10. Because the value is stored in an 1-byte unsigned integer and theoretically the value can only be 0-256. Gandhi’s aggressiveness is the lowest, with only 1. But adopting democracy in the game decreases the aggressiveness of the leader by two points. A democratized Gandhi should have -1 in aggressiveness, but the integer underflows and that becomes 256. From that point, Gandhi will nuke all his opponents.

Back to the main argument: I have been stumped multiple times by integer overflow. In one occasion I needed to use “long unsigned int” because the answer is in the range of 10¹³.

Actually R has the problem too. I believe the default scaler number in R is 32-bit signed. You can see the maximum by typing .Machine$integer.max. On my machine, it is 2147483647 (This value is less than the world population as of 2020, i.e. 7831341892; or 36.8 Billion tonnes of CO2 we dumped into the atmosphere in a year.). You can also try: .Machine$integer.max + 1. But it will not overflow, because R is dynamically typed as well and the number is automatically coerced into `long`. But it has limit too. Try 2^53 and 2^53 + 1. The latter case would start to lose precision. In most of the systems, this value is called MAX_SAFE_INTEGER.

Numbers larger than 2⁵³ are not uncommon. The most typical ones are Twitter IDs. They are lengthy integers such as 932586791953158144. That’s the reason why the R package rtweet requires bit64, i.e. 64-bit interger.

Use auto sometimes

Again, it is about the topic of typing. A new feature of C++11 is auto. When you write something like:

auto x = 1.9;

C++11 is smart enough to infer this x has a type of double. This is pretty neat for people using dynamically typed languages (e.g. R). ES-11 also recommends to use auto “to avoid redundant repetition of type names” (also see typedef below).

There are some cases where I would opt to use auto, e.g. unpacking of map (see data structures below).

std::map<int, int> hello;
for (auto& data: hello) {
    data.first += 1;
}

Other than that, I don’t use auto (unless I am too tired).

Use more data structures

Smart data structures and dumb code works a lot better than the other way around

~ Eric S. Raymond.

Compared to R’s vector-and-list-centric culture, many data structures are provided natively thanks to C++’s standard library (std).

My C++ code still has the remnant of my heavy reliance on vector. Sometimes, my C++ code reads like an R programmer writing C++. Well, that is a fact.

I needed to use some other data structures to solve some problems.

I have used the rudimentary struct, array, std::map and std::pair. In some cases you can’t even use vector, especially when the vector grows. One big example is day 15. There is a reason for the invention of hash tables such as std::map. Switching from std::map to std::unordered_map can make it even faster.

I need to use object-oriented programming in exactly one case. For that one, it makes more sense to do it that way and it is more of a deliberate practice for me to write some OOP code.

But please make sure you know that many of these C++ data structures can’t be directed brought back to R. In many of my solutions, I usually write something functions called pp_*() to “pretty print” them.

Use typedef

You should use more data structures, but you’ll get angry pretty fast for typing something like std::unordered_map<std::string, std::tuple<unsigned int, unsigned int, unsigned int» repeatedly. To save some typing, use typedef.

typedef std::unordered_map<std::string, std::tuple<unsigned int, unsigned int, unsigned int>> puzzlebook;    
puzzlebook parse_all(std::vector<std::string> input); // Just the declaration

Along the same lines, C++ is usually more verbose than R. Chances are you will miss a character or two when you type and that leads to syntax errors. A helpful trick to prevent mistyping is to create some snippets for very common tasks. For example, I created a snippet (using yasnippet of emacs) to expand “svs” to std::vector<std::string>. “sco” to std::cout « $1 « std::endl;. You got the idea.

Take advantage of Rcpp, but beware of pass-by-reference

Remember that you are using Rcpp, therefore you can use the data structure provided by Rcpp in C++ (as well as all the sugars). Things that I like to use in C++ are those R matrices. I know there are many linear algebra C++ libraries (e.g. GSL) which provide several matrix data structures. I am too lazy to install those libraries so I just use NumericMatrix instead. Zero installation.

One surprise might be the fact that these Rcpp data structures are pass-by-reference even without that little ampersand when you call them from R. For instance, you have such a function:

Rcpp::cppFunction('NumericVector test(NumericVector x) {
      x[0] = 3.1416;
      NumericVector y(1);
      y[0] = 5201314;
  return y;
}')
v1 <- c(1, 2, 3, 4)
v2 <- test(v1)
v2
v1

You might be able to guess v2 is 5201314. Try to guess what the first element of v1 will now be? Surprise! Surprise! (see pass by reference above). v1 is mutated.

If you want to modify a copy of the input, use clone to keep the input intact.

Rcpp::cppFunction('NumericVector test2(NumericVector x) {
      NumericVector y = clone(x);
      y[0] = 3.14;
  return y;
}')
v1 <- c(1, 2, 3, 4)
v2 <- test2(v1)
v2
v1

Contributions

I have explained why one would want to rewrite one’s R code in C++. I have also highlighted the differences between C++ and R. Now you know how to make Rcpp more joyful and less dangerous to use.

Footnotes

¹ You may wonder why the official Rcpp sources are not listed here. I don't think the official Rcpp sources are good to get your feet wet. You probably need to, however, read the official sources later in your journey. I think the main developers of Rcpp assume you know a lot about R and C++ before reading their stuff or even using their stuff. Not to mention asking them questions. Update (2020-12-27) - The original footnote shows my ignorance. Also, the tone of the footnote is unnecessarily rude.

² If you prefer the Japanese version. I enjoy this one, it really assumes you are a beginner. I applause the author for writing this.

³ The creator of the language, Bjarne Stroustrup, defined in the very early documentation that C++ is an object-oriented programming language. The later development of the language (e.g. template) challenges this view. In the newer edition of his book, he also introduces generic programming using template. I think the newer definition is to say C++ is a multi-paradigm language. It is also now common to see people talking about functional C++.