chainsawriot

C++ Q&A for R package developers

Posted on Oct 1, 2023 by Chung-hong Chan

I am now coauthor of the new R package adaR (my team lead David Schoch is the main author and maintainer). If you want to know more about the package, it’s better to go to this blog post.

I googled too much about C++ during the development and triggered the Google "foobar" hiring program

Like I did for rtoot, what I wanted to say about adaR is not the package itself but how it was developed. Mostly about C++ and R. There are two ways for interfacing C++: Rcpp and cpp11. There are some resources on C++ for R development. I wrote one myself too.

There are however only a handful of resources on how to develop / debug C++ code living inside an R package. I think the best one so far is by Davis Vaughan, the current maintainer of cpp11. The following are some notes on C++ development for R programmers. I put this in a Q&A form.

Q: Why is the C++ code performance bad sometimes?

A: Almost always, it is related to optimization. It is important to look at how the C++ code is compiled. By default, devtools::load_all() compile C++ code with -g (with debugging information for GDB) and -O0 (optimization level 0, i.e. no optimization) ¹. You can look at it by reading the trace produced (see -UNDEBUG -Wall -pedantic -g -O0).

   using C++17
   g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'     -fpic  -g -O2 -ffile-prefix-map=/build/r-base-MHXHhT/r-base-4.3.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -UNDEBUG -Wall -pedantic -g -O0 -fdiagnostics-color=always -c RcppExports.cpp -o RcppExports.o

The performance of the software loaded this way is not the best, but it offers better support for debugging (see below). The performance of code compiled with -g and -O0 for this benchmark: bench::mark(adaR::ada_url_parse(rep("http://www.google.com", 10000))) is 51.7ms. One more point about this: if you have the compiled (shared) objects in the src directory, i.e. .o or .so, generated by devtools::load_all() and then you install the package by devtools::install(quick = TRUE), you will install the non-optimized version to your default library.

remotes::install_github() is different. It compiles code with R’s default level of optimization (-O2, out the highest -O3).

using C++17
g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'     -fpic  -g -O2 -ffile-prefix-map=/build/r-base-MHXHhT/r-base-4.3.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -c RcppExports.cpp -o RcppExports.o

Similarly, if you build the source package using R CMD build and then install the source package using install.packages(), it compiles code using -O2. Similarly, if you install any package with compiled code from CRAN with install.packages(), it uses -O2.

using C++17
g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG  -I'/usr/local/lib/R/site-library/Rcpp/include'     -fpic  -g -O2 -ffile-prefix-map=/build/r-base-MHXHhT/r-base-4.3.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -c RcppExports.cpp -o RcppExports.o

The same benchmark bench::mark(adaR::ada_url_parse(rep("http://www.google.com", 10000))) takes only 18.1ms. So, if you want to know the true performance of the code you are developing, vis-à-vis another package on CRAN, you should either use remotes::install_github or to build the source package and install it. Don’t benchmkark your code with devtools::load_all().

Q: How to build higher-order functions in C++?

A: For rtoot, I talked about building higher-order functions to reduce the complexity of the software.

In the book R for Data Science, Wickham and Grolemund say “You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).” Here, I want to say: “You should consider writing a higher-order function whenever you’ve copied and pasted a function more than twice.”

In the R part, you can write something like this:

ada_get_href <- function(url, decode = TRUE) {
    if (is.null(url)) {
        return(character(0))
    }
    out <- Rcpp_ada_get_href(url)
    if (isTRUE(decode)) {
        return(url_decode2(out))
    }
    return(out)
}

ada_get_username <- function(url, decode = TRUE) {
    if (is.null(url)) {
        return(character(0))
    }
    out <- Rcpp_ada_get_username(url)
    if (isTRUE(decode)) {
        return(url_decode2(out))
    }
    return(out)
}

ada_get_password <- function(url, decode = TRUE) {
    if (is.null(url)) {
        return(character(0))
    }
    out <- Rcpp_ada_get_password(url)
    if (isTRUE(decode)) {
        return(url_decode2(out))
    }
    return(out)
}

or this:

.get <- function(url, decode, func) {
    if (is.null(url)) {
        return(character(0))
    }
    out <- func(url)
    if (isTRUE(decode)) {
        return(url_decode2(out))
    }
    return(out)
}

ada_get_href <- function(url, decode = TRUE) {
    .get(url, decode, Rcpp_ada_get_href)
}

ada_get_href <- function(url, decode = TRUE) {
    .get(url, decode, Rcpp_ada_get_href)
}

ada_get_password <- function(url, decode = TRUE) {
    .get(url, decode, Rcpp_ada_get_password)
}

Can you do something similar in C++? Yes. ²

//higher-order function for all Rcpp_ada_get_*
CharacterVector Rcpp_ada_get(const CharacterVector& url_vec, std::function<ada_string(ada_url)> func) {
  unsigned int n = url_vec.length();
  CharacterVector out(n);
  for (int i = 0; i < url_vec.length(); i++) {
    String s = url_vec[i];
    std::string_view input(s.get_cstring());
    ada_url url = ada_parse(input.data(), input.length());
    if (!ada_is_valid(url)) {
      out[i] = NA_STRING;
    } else {
      out[i] = charsub(func(url));
    }
    ada_free(url);
  }
  return (out);
}

// [[Rcpp::export]]
CharacterVector Rcpp_ada_get_href(const CharacterVector& url_vec) {
  return Rcpp_ada_get(url_vec, &ada_get_href);
}

// [[Rcpp::export]]
CharacterVector Rcpp_ada_get_username(const CharacterVector& url_vec) {
  return Rcpp_ada_get(url_vec, &ada_get_username);
}

// [[Rcpp::export]]
CharacterVector Rcpp_ada_get_password(const CharacterVector& url_vec) {
  return Rcpp_ada_get(url_vec, &ada_get_password);
}

Q: How to debug C++ functions?

A: Usually, if one did something crazy, the compiler would catch most of the errors. However, the program might still be able to compiled but with runtime errors. The most common one is segfault.

Running it inside an ordinary R session, you know nothing about what causes the runtime errors inside your C++ code. It will just segfault and kill the R session. One way to debug this is the so-called Caveman Debugging. You may add many print statements (or in Rcpp: Rcout << or std::cout <<) to tell you when the program crashes. But it is not the most effective way to debug a program. It’s called Caveman Debugging for a reason.

Just like you would do with debug() in R, you should use tools to debug your program. There are two main tools that I use. Let’s talk about the first one: gdb. It’s also covered in the above linked post by Davis Vaughan (although it is about lldb, the llvm equivalent, with some subtle differences. BTW, if you are on macos, your compiler is likely clang, the llvm C compiler.)

The idea is to launch R with gdb, i.e. R -d gdb.

It will directly go to gdb (not R). I recommend the GDB cheat sheet. But the first step is usually to set breakpoints. For example, I set the breakpoint at the C++ function causing runtime errors, e.g. b Rcpp_ada_parse. It will complain about it (GDB doesn’t know what Rcpp_ada_parse is). But just say Yes. And then enter run.

It should go to R. At this point, you can use devtools::load_all() to load the -g compiled version of the C++ code. You can then directly run the C++ function, e.g. Rcpp_ada_parse. It will then go back to gdb, because the break point is triggered.

Now you can use gdb similar to R’s debug(). Enter next to run the function line by line. Or use print i to print a certain variable. By steping through the execution, you know when the runtime error is triggered ³.

Another method is not to set the break point at the beginning (you might not know which C++ function is causing the issue, if one C++ function calls another), directly go with run. And then devtools::load_all() and run the C++-laden R code. When a segfault is raised, gdb will also pop up. At this point, you can use where full to print out the current call stack. You might see something like:

#2  0x00007fffee78c14a in Rcpp_ada_parse (input_vec=...) at adaR.cpp:28

It shows you where exactly the segfault is triggered. It is much more efficient than the Caveman approach.

Q: Can you explain the memory model of C++?

R programmers probably don’t need to care about memory issues. R is a GC language and uses garbage collection to manage memory automatically. Objects stored in the memory will be checked regularly for whether or not they are still needed. The unused objects (“garbage”) will be automatically removed and the occupied memory is freed.

C++ is not a GC language. It uses static and dynamic memory allocations ⁴. The adjective “dynamic” sounds so nice but it actually means “semi-automatic” at best, i.e. manual. And in order to be a better C++ programmer you need to understand the memory model. At least, you should know the difference between stack and heap memory.

When you do something like:

void test() {
    int i;
    i = 123;
    std::cout << i << std::endl;
}

This is what most beginner C++ tutorials would start teaching by doing variable assignment like that. And by doing this, i is allocated in the stack memory. You can also call this operation static memory allocation. The good things about stack memory are that 1) it is in general faster; 2) when the thing in the stack memory is out of scope (e.g. when the execution of the function ends), the memory is automatically freed. So, you don’t need to care about memory management and it is in general quite safe. However, it can also be a drawback because another function might want to access the value in that memory address after the thing is out of scope. Another drawback is that there is only a very limited amount of stack memory. On a typical Linux machine, it’s about 8MB (try: ulimit -s). You probably cannot use stack memory to handle data larger than that. That leads to stack overflow (now you know where the term comes from).

However, when you do something like:

void test2() {
    int* ptr;
    ptr = new int;
    *ptr = 123;
    std::cout << ptr << std::endl;
    std::cout << *ptr << std::endl;
}

You are allocating something to the heap memory. Or dynamic memory allocation. The pointer variable ptr itself is actually in the stack memory. However, that’s just a memory address. The memory address, let’s say 0x1000, is an address in the heap memory. When you put 123 in that address you put it in the heap memory.

The nice thing about the heap memory is that it is much larger than the stack. It’s very much the available physical memory of your machine. Also, it is persistent. As long as you don’t manually overwrite or free up the memory 0x1000, you will always get back the value 123 from it. However, it can also be an issue. The allocated heap memory that is no longer needed but never gets freed up is called memory leak. The function test2() leads to memory leak. To do it correctly, you should free up the memory using delete afterwards manually.

void test3() {
    int* ptr;
    ptr = new int;
    *ptr = 123;
    std::cout << ptr << std::endl;
    std::cout << *ptr << std::endl;
    delete ptr;
}

You might also know malloc (memory allocation) and free. That’s how C does dynamic memory allocation and you can certainly use malloc and free in C++. But please use new and delete in C++ unless you know what you are doing (or see R.10).

void test4() {
    int *ptr = (int *) malloc(sizeof(int));
    *ptr = 123;
    std::cout << ptr << std::endl;
    std::cout << *ptr << std::endl;
    free(ptr);
}

Q: Do I need to care about dynamic memory allocation if I only use Rcpp / cpp11?

A: Jein, yes or no. It depends on what do you mean by “only use”. It would be better to explain what SEXP is.

All R data structures being brought over to C++ (or created in C++) are SEXPs (S Expressions). For example, Rcpp::CharacterVector or cpp11::strings is SEXP. A SEXP looks like something from the stack memory, because you initialize it by:

CharacterVector x;

In fact, SEXP is an abstraction containing a pointer to the data stored in the heap. As I said previously, R is a GC language and the (heap) memory allocated for storing data (in other words, pointed to by SEXPs) but no longer required will be automatically freed up. So, you probably don’t need to care about dynamic memory allocation if all you do is to manipulate SEXPs in C++.

However, you probably don’t just use Rcpp or cpp11 to manipulate SEXPs. You might want to use other C++ libraries to do something and then return the result to R as those SEXPs, for example, using RapidXML to parse XML files or using ada-url to parse URLs. In those cases, you will need to care about dynamic memory allocation.

For example, you must fix any memory leak. Actually for modern computers, you can barely see the consequences of memory leak, especially if the program runs just for a short time and the leak is not big. But, if you run the same leaky program over and over again for a long time, the leaked memory usually would accumulate and reduce the free (heap) memory available. If it is too severe, your computer might either swap (using your storage media as temporary memory, which is super slow); or your OS would do something (e.g. Linux’s OOM killer) to stop further leakage.

If you submit an R package to CRAN which leaks memory, your package can potentially be removed. The regular R checks such as R CMD check and devtools::check() cannot detect memory issues. However, you must know that there are something called CRAN Additional Checks. There is a member of the CRAN team (you know who) has a valgrind fleet to check memory access issues. Therefore, you must check your R package with C++ code using valgrind before submitting your package to CRAN. Then it naturally comes the next question, which is:

Q: How to check and debug memory issues with valgrind?

A: Valgrind does a lot of things and it is actually a suite of many tools. But I will only talk about memory leak detection here, which is memcheck. It is also the default tool of valgrind. To get an intuition of how memcheck aka valgrind works, we need a task. Suppose we want to prove that test2() above is actually leaking memory.

Similar to gdb above, you can launch R with valgrind, i.e. i.e. R -d valgrind. As an exercise, directly quit R by q(). valgrind will provide a report like this:

==3930734== 
==3930734== HEAP SUMMARY:
==3930734==     in use at exit: 42,928,567 bytes in 9,961 blocks
==3930734==   total heap usage: 21,759 allocs, 11,798 frees, 66,131,089 bytes allocated
==3930734== 
==3930734== LEAK SUMMARY:
==3930734==    definitely lost: 0 bytes in 0 blocks
==3930734==    indirectly lost: 0 bytes in 0 blocks
==3930734==      possibly lost: 0 bytes in 0 blocks
==3930734==    still reachable: 42,928,567 bytes in 9,961 blocks
==3930734==                       of which reachable via heuristic:
==3930734==                         newarray           : 4,264 bytes in 1 blocks
==3930734==         suppressed: 0 bytes in 0 blocks
==3930734== Rerun with --leak-check=full to see details of leaked memory
==3930734== 
==3930734== For lists of detected and suppressed errors, rerun with: -s
==3930734== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

As we do nothing here, we should have no memory leak. The “LEAK SUMMARY” actually shows just that.

However, if you launch R with valgrind again but run this:

Rcpp::cppFunction("void test2() {
    int* ptr;
    ptr = new int;
    *ptr = 123;
    std::cout << ptr << std::endl;
    std::cout << *ptr << std::endl;
}")
test2()
q()

The report will then be:

==3930888== 
==3930888== HEAP SUMMARY:
==3930888==     in use at exit: 57,217,224 bytes in 11,382 blocks
==3930888==   total heap usage: 32,951 allocs, 21,569 frees, 98,413,897 bytes allocated
==3930888== 
==3930888== LEAK SUMMARY:
==3930888==    definitely lost: 4 bytes in 1 blocks
==3930888==    indirectly lost: 0 bytes in 0 blocks
==3930888==      possibly lost: 0 bytes in 0 blocks
==3930888==    still reachable: 57,217,220 bytes in 11,381 blocks
==3930888==                       of which reachable via heuristic:
==3930888==                         newarray           : 4,264 bytes in 1 blocks
==3930888==         suppressed: 0 bytes in 0 blocks
==3930888== Rerun with --leak-check=full to see details of leaked memory
==3930888== 
==3930888== For lists of detected and suppressed errors, rerun with: -s
==3930888== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

4 bytes of memory are leaked because an int takes 4 bytes. So, test2() definitely leaks memory.

My strategy for R packages is to launch R with valgrind and then run devtools::test() to conduct all the unit tests and then look at the report ⁵. In order for this to work, your R package must have high test coverage in the first place. You should make sure at least “definitely lost” be 0 bytes. The best is triple zero.

When valgrind says your R package does leak memory, what should you do next? You should then rerun the same check, but this time with R -d "valgrind --leak-check=yes". It will point to you which lines are actually leaking memory. You now at least know what to fix. How to fix memory leak is beyond the scope of this Q&A, especially when the leaking is caused by the underlying C++ library. It is usually not as easy as adding delete. My advice is to read either the API or the source of the C++ library ⁶.

Q: My R package does not leak memory. But valgrind reports other errors. What should I do?

A: I like this summary. But it won’t help much. I can’t help much either, because there is no universal fix. During the development of adaR, it usually triggers by cstring.

Q: Can you explain the ampersand?

A: I wrote about it previously, actually. The ampersand can mean different things. If it’s in the RHS of an assignment operator, it means “get the address of a variable”. I just give an example below and that’s usually not where the ampersand is used.

void test5() {
    double pi = 3.1416;
    double* pi_ptr;
    pi_ptr = &pi;
    std::cout << pi_ptr << std::endl;
    // dereferencing
    std::cout << *pi_ptr << std::endl;
}

Usually, it is used to denote passing an object by reference to a function. But let’s talk about pass by copy first. This is an example of pass by value.

#include <vector>
#include <iostream>

unsigned int len(std::vector<int> input) {
    return input.size();
}

// [[Rcpp::export]]
void test6() {
    std::vector<int> v;
    v.resize(1000000);
    std::fill(v.begin(), v.end(), 1);
    std::cout << len(v) << "\n";
}

The C++ function len() here is pass by value (I think it’s easy to think about it as “pass by copy”). When test6() calls len(), it makes a copy of v from the current memory location and passes the new copy of v in another memory location for len() to do its calculation. On my machine: bench::mark(test6()) takes 2.4ms. Let’s make a small change to len():

#include <vector>
#include <iostream>

unsigned int len(std::vector<int>& input) {
    return input.size();
}

// [[Rcpp::export]]
void test7() {
    std::vector<int> v;
    v.resize(1000000);
    std::fill(v.begin(), v.end(), 1);
    std::cout << len(v) << "\n";
}

Now, by adding an ampersand to the input type of len(), the parameter is now pass by reference. When calling the function by len(v), it does not make a copy of v and passes it to len(). Instead, it asks len() to take the value from the current memory address of v (a reference of v). It prevents the data copy. If the data is large, it can make the program faster. On my machine: bench::mark(test7()) takes 425µs only, or 5.6x. You might also add const to make sure input won’t get modified. It won’t improve the performance in this case but usually a good practice.

That being said, however, passing SEXPs by reference gives you very little performance gain because there is no deep copy (as mentioned earlier, SEXPs are pointers). The only gain is less protection and thus a slight speed gain. See this answer by Romain Francois.

Q: Why do you make those *.h (or *.hpp) files?

A: Usually, I put the so-called function declarations there; as well as maybe something I need to repeat multiple times in many C++ files.

Imagine a situation you have two different .cpp files. eat.cpp has the function (and its implementation) eat().

void eat(std::string food) {
	std::cout << "I ate" << food << "\n"
}

sleep.cpp has the function (and its implementation) sleep().

void sleep(int n_hours) {
    std::cout << "I slept for " << n_hours << "hours." << "\n"
}

Now, I want to implement a function called live() in the file live.cpp. And I need to use eat() and sleep(). You implement live() in the file live.cpp like this:

void live(int n_hours, std::string food) {
    sleep(n_hours);
    eat(food);
}

When you try to compile all three files, it won’t work. Specifically, the compilation of live.cpp won’t work. Why? It’s because each .cpp file (or Single Compilation Unit, SCU) is compiled independently. And the compiler has no idea what sleep() and eat() are when it is compiling live.cpp.

However, you can put the declarations of sleep() and eat() in live.cpp and keep the implementations in other files, like this:

void eat(std::string food);
void sleep(int n_hours);
void live(int n_hours, std::string food) {
    sleep(n_hours);
    eat(food);
}

And now all three files can compile. Another way to do that is to create a (user-defined) header file with the declarations. Let say: myheader.h.

void eat(std::string food);
void sleep(int n_hours);

In live.cpp:

#include "myheader.h"

void live(int n_hours, std::string food) {
    sleep(n_hours);
    eat(food);
}

#include is a preprocessor directive. And in this case, it just copy and paste the content of myheader.h here. In short, I usually create these declaration-only header files and include them so I can use the functions defined in another file. This is the grossly simplified answer, but it should cover 80% of the situations. One can use header files for other things, of course.

Q: Hey, Dr C++. Why are there still that many raw loops in your C++ code?

A: Yes, I surely know that I should use algorithms. I will fix those later (if ever).

In general, one should avoid using raw loops. It’s like the lapply() (or purr or whatnot) vs for loop kind-of discussion in the C++ community.

double test8(NumericVector& x) {
    unsigned int n = x.size();
    double sum = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / n;
}

double test9(NumericVector& x) {
    double sum = std::accumulate(x.begin(), x.end(), 0.0);
    return sum / x.size();
}

Q: Hey, Dr C++. Why are there still that many if-statements in your C++ code?

A: Not everyone likes ? the ternary conditional operator. I’m kind of okay with it. I use it sometimes in readODS. ? has been widely used in many C++ codebases, including ada-url. But it can make code unreadble sometimes.

// Not the best way to write fizzbuzz
std::string test10(unsigned int& x) {
    std::string output = std::to_string(x);
    output = x % 3 == 0 ? "fizz" : output;
    output = x % 5 == 0 ? "buzz" : output;
    output = x % 3 == 0 && x % 5 == 0 ? "fizzbuzz" : output;
    return output;
}

// also not the best way to write fizzbuzz
// but the same logic as the above
std::string test11(unsigned int& x) {
    std::string output = std::to_string(x);
    if (x % 3 == 0) {
        output = "fizz";
    } else {
        output = output;
    }
    if (x % 5 == 0) {
        output = "buzz";
    } else {
        output = output;
    }
    if (x % 3 == 0 && x % 5) {
        output = "fizzbuzz";
    } else {
        output = output;
    }
    return output;
}

Unless you have your own Makevars; or you override the default debug = TRUE argument of pkgbuild::compile_dll(). ↩
My first idea was to build template. But that’s a bit too complex to use it in Rcpp. This one is simpler. ↩
In gdb, you can enter “tui enable” to enable the text user interface. But I don’t usually use that. ↩
By default at least. Actually you can make C++ a GC language, but that’s not the topic of today. Also, it is always a good idea to study the memory model of any low level language. For example, Java and Go use GC. Rust uses an ownership model. ↩
Actually, if your C++ program produces segfaults, valgrind will also print out which part actually produces the segfaults. You can even use valgrind as a first line of check for any runtime error, rather than gdb. ↩
For example, during the development of adaR, I can only fix a severe memory leak by reading the source code of ada-url. I only know what the output of the C++ function ada_parse has a new, i.e. allocating something to heap. And the output must be explicitly free. Use the source, Luke. ↩