Werden wir Helden für einen Tag

Home | About | Archive

rtoot and functional programming

Posted on Nov 12, 2022 by Chung-hong Chan

I am now part of the development team of rtoot, yet another R package. The “cre” (maintainer) of the package is actually my team lead David Schoch at GESIS. He have written a blog post about rtoot. I don’t think I’ll need to repeat what he has said about the package.

I want to talk about rtoot from another angle. I want to use the code of rtoot to talk about some functional programming (FP) concepts. 1 But a big disclaimer first: Just as R isn’t as pure in functional programming as Haskell, the code of rtoot is neither purely functional. There are some side effects (e.g. saving of the access token) in between the code. Also, the code isn’t stateless: it depends on the state of the API and the state of what the token is. Of course, the token situation can be fixed to make it purely functional. But that’s for users’ convienence we don’t make it purely functional.

With this disclaimer which not everyone can understand (but probably you will after reading this post in entirety), let’s talk about FP.

Empowerment statement

In this post, I am going to explain some basic concepts of FP.

Functional programming

I don’t think I can give a complete introduction of FP. A Wikipedia-inspired elevator pitch is: “programs are constructed by applying and composing (pure) functions.”

Let’s start from the back: What is a pure function? A pure function is a function that always gives the same output when the input is the same. For example, the R function mean() is a pure function because it always gives the same output when the input is the same. For example, if you run: mean(1:100) for 100 times, the output will always be 50.5. However, sample() is not a pure function. Running sample(1:100, 1) for several times will give you different results. Similarly, functions such as read.csv is also not a pure function. The output of read.csv("mycsvfile.csv") depends on what you have in the file “mycsvfile.csv”; or even whether “mycsvfile.csv” exists or not.

Also, a pure function should always give a real output, not a side effect. For example, mean(1:100) gives you a real output of 50.5. However, saveRDS(iris, "iris.RDS") won’t give you any real output. Instead, it gives you a “side effect” of saving iris as a file. Another example is a function modifying an entry in a database. 2

I put pure inside a pair of parentheses because pure functions without any side effect are often not realistic. So, I would consider FP as the business of applying and composing functions. Punkt. There would be not much confusion about “applying functions”, the problem would be “composing functions”. To me, it is not the same as merely writing functions, as in the sense of “I compose a Shakespearean sonnet”. But in the “a whole is composed of parts” sense. “Composing functions” means composing a functions by combining functions to work together. In order to compose functions in this sense, a programming language supporting FP needs to treat functions differently. You might have heard about “functions as first-class citizens” or “first-class functions”. I again opened up a can of worms by introducing the concept of “First-class citizens”. And I need to explain it: A thing is a “first-class citizen” in a programming language needs to satisfy four criteria. In almost all programming languages, a singular number (or “numeric” in R) is a first-class citizen. Let’s use it as an example.

It can be the subject of assignment statements

one <- 1
one

It can be the actual parameter of functions

myfun <- function(x) {
    if (x == 1) {
        return("It's one.")
    } else {
        return("It isn't one.")
    }
}
myfun(1)

It can be returned as results of functions

myfun <- function(x) {
    if (x == "one") {
        return(1)
    } else {
        return(0)
    }
}
myfun("one")

It can be tested for equality.

one <- 1
one == 1
1 == 2
identical(one, 1)

So far so good. In order for functions to be first-class citizens, functions should also be satisfying the above four criteria. Okay, when I first learned about it from books, it blowed my mind (Maybe I was naive). It is because we usually have a mental model of “data being the things and functions operate on those things”. But the first-classness of functions abandons this mental model. Functions in functional programming can be the “things” to be operated on. In fact, some FP authors (such as Brian Harvey) use the notion of “Functions as data”. I wish that if I could understand this thing easier, I might fare better in my calculus by thinking about calculus is just a way of operating on functions. But unfortunately, my history has spoken.

In order for one to really think functionally, one really shouldn’t limit oneself in the mentioned mental model. The case in point, rtoot, actually uses the 3 of the above criteria to reduce the complexity in the code. So that we can ship it from the first commit to CRAN in 6 days.

A function as the subject of assignment statements

In rtoot, we utilize the ability for a function being the subject of assignment statements. We have lines like this in this file:

passFun <- readline
if (requireNamespace("rstudioapi", quietly = TRUE)) {
    if (rstudioapi::isAvailable()) {
        passFun <- rstudioapi::askForPassword
    }
}
auth_code <- passFun(prompt = "enter authorization code: ")

For this kind of operations, probably the usual way of doing this is to get the auth_code differently depending on whether the package rstudioapi is available and whether this code is run under RStudio. But instead, we take advantage of the fact that we can assign functions (e.g. readline and rstudioapi::askForPassword) to the variable passFun in assignment statements. Also, we also take advantage of the fact that both functions use the same prompt parameter (or we can say: the same “signature”). And of course, the output of both functions are the same, a character.

We can launch the assigned passFun like a regular function to get the auth_code.

A function as the actual parameter of functions

This is the part that allegedly scares many R learners away when they learn about higher-order functions, e.g. lapply. A higher-order function is a function accepting other functions as its parameters. lapply, for example, has the FUN parameter, which a function can be passed as an argument (I think this sentence also explains the difference between a parameter and an argument).

There are actually two types of higher-order functions. Let’s talk about the first type, which lapply also belongs to. This kind of higher-order function is similar to the first-order function for outputing some data, but the higher-order function works differently depending on the additional functions provided. Again, using lapply as the example, the behavior of lapply changes according to the FUN argument provided.

my_list <- list(c(1, 2, 3), c(5, 6, 7))
lapply(my_list, mean) ## 2 and 6
lapply(my_list, max) ## 3 and 7

In many languages, this function would be called map. Some higher-order functions are like fixture: map, reduce a.k.a. fold (thus there is a thing called MapReduce), filter, etc. They feature in many different programming languages.

We can also create this kind of higher-order function too. In rtoot, the helper function process_request is one such higher-order function.

process_request <- function(token = NULL,
                path,
                instance = NULL,
                params,
                anonymous = FALSE,
                parse = TRUE,
                FUN = identity
){
  output <- make_get_request(token = token,path = path,
                             instance = instance, params = params,
                             anonymous = anonymous)
  if (isTRUE(parse)) {
    header <- attr(output,"headers")


    output <- FUN(output)
    attr(output,"headers") <- header
  }
  return(output)
}

Similar to lapply, it has a FUN parameter too. We can supply different functions to process_request so that different types of data are parsed differently.

For example, in get_status, we launch process_request this way

process_request(token = token,path = path,instance = instance,params = params,
                anonymous = anonymous,parse = parse,FUN = parse_status)

where parse_status is a parser of the status from the Mastodon API.

In get_account, this:

process_request(token = token,path = path,instance = instance,
                params = params, anonymous = anonymous,
                parse = parse, FUN = parse_account)

This pattern reduces a lot of complexity and cuts a lot of repeated code.

A function be returned as results of functions

Here, another type of higher-order function comes in. This kind of higher-order function still accepts other function(s) as parameter(s), but it returns also a function. But the higher-order function somehow modifies the behaviour of the function(s).

A typical example is the Negate function. For example, running Negate like this returns a function.

Negate(is.null)

And this function does opposite of what is.null does.

is.null(NULL)
Negate(is.null)(NULL)

And actually, you can write yourself a Negate function like so:

mynegate <- function(f) {
    negated_f <- function(...) {
        !f(...)
    }
    return(negated_f)
}
is.null(NULL)
mynegate(is.null)(NULL)

We use the same technique to create a vectorized version of our parser.

v <- function(FUN) {
  v_FUN <- function(x) {
    dplyr::bind_rows(lapply(x, FUN))
  }
  return(v_FUN)
}

And then we can pass the vectorized version of the parser to process_request.

process_request(token = token,path = path,params = params,
                parse = parse,FUN = v(parse_status))

A function be tested for equality

In rtoot, we didnt’ use the equality comparison yet. But you can quick verify yourself:

## you can't use == here.
identical(mean, mean)
identical(mean, median)

Some associated FP concepts

λ

A λ (lambda) is an anonymous function. For our purposes, we didn’t use anonymous function in rtoot. But you can see a λ in constructs like this:

my_list <- c(1, 2, 3)
lapply(my_list, function(x) x^2) ## 1, 4, 9

In this case, function(x) x^2 is a λ. It doesn’t have a name. Of course, you can also use lapply with a named function too.

my_list <- c(1, 2, 3)
square <- function(x) {
    x^2
}
lapply(my_list, square) ## 1, 4, 9

But naming things is one of the two hard things in computer science. You can even do crazy thing with λ such as

## naming a function is difficult
(function(x) { x ^ 2 })(3)

The package purrr provides a syntactic sugar for writing λ even quicker.

require(purrr)
my_list <- c(1, 2, 3)
purrr::map_dbl(my_list, ~.^2) ## 1, 4, 9

Prefix and infix operators

In R, there are actually two kinds of operators. But all operators are just functions.

The most common ones are prefix operators. The operator comes first. mean is a prefix operator. Some languages, notably Lisp, have only prefix operators.

mean(c(1, 2, 3))

But there are also infix operators (also called “binary operators”). + and %in% are infix operators. When calling them, they doesn’t come first.

1 + 3
1 %in% c(1, 3, 5)

But actually, almost all infix operators are just some special prefix operators. 3 You can use these infix operators this way if you prefer

`+`(1, 3)
`%in%`(1, c(1, 3, 5))

It is also possible to write your own infix / binary operators. But that’s beyond the scope of this blog post.

Mutability, “copy-on-modify” behavior, and side effect

Many data structures in R feel mutable. The state of these data structures is mutable.

x <- c(1, 2, 3)
x[2] <- "two"
x

In this example, we mutate the 2nd element of the vector x. First, the state of the vector is changed because the 2nd element is changed. Also, the entire vector is forced to change to a character vector. Mutable state is usually a big source of headache and creates a lot of complexity. OK. In reality, most objects in R are actually immutable in the memory. However, it creates a new modified object in the memory with the same name everytime you use “<-“ to modify. The term is called “copy-on-modify” behavior.

You can test this with

adam <- c("a", "bad", "dog")
tracemem(adam)
billy <- adam
tracemem(billy) # same as adam
adam[2] <- "good"
tracemem(adam) # It doesn't share the same memory address with the original adam and billy
billy # is adam still here?
tracemem(billy) # billy is actually the original adam!

You can see that the memory addresses of the first adam and the second adam are different. It’s like, Adam is your dog and you can’t change him. But how about you have a new dog and name it also Adam. Then you hide the previous Adam in the cellar. This kind of behavior is inhumane but you would “feel” like you have modified your dog. Using Rcpp, however, you can really mutate a created data structure in the memory without the “copy-on-modify” behavior (Therefore, Rcpp can be quite a risky business). This kind of operations is also called a “side effect”. It not only gives you a returned value (the “main effect”), but also something else (modify the data in the memory).

Rcpp::cppFunction('CharacterVector cpp_mutate(CharacterVector x) {
      x[1] = "good";
  return x;
}')
adam <- c("a", "bad", "dog")
tracemem(adam)
billy <- adam
tracemem(billy)
adam <- cpp_mutate(adam)
adam ## Adam is good now!
tracemem(adam) # The same adam
tracemem(billy) # The same billy
billy ## wat? Billy is also good now?

In FP, we want to have data structures with an immutable state and avoid mutation and “side effect” as much as possible. In Python, for example, tuple is really immutable.

adam = ("a", "bad", "dog")
adam[1] = "good" ## error

I am ignorant, but there is no default way of preventing the assignment operatory “<-“ from doing the “copy-on-modify” behavior. So, it is up to your choice to mutate as rarely as possible.

Partial application

Many functions have more than one parameter. You can also say that those functions have an arity of at least 2. For example, if you run the following, you will have errors

`*`(10) ## but not: `*`(10, 2)
`^`(10) ## but not: `^`(10, 2)
`/`(10) ## but not: `/`(10, 2)

A multi-arity function can be thought of as a higher-order function of multiple single-arity function. i.e. f(x, y, z) = ((g(x))(y))(z).

Partial application refers to the operation of running a multi-arity function would produce another function with a smaller arity. As you can see from above, if you run the above, it gives you errors instead of functions with a smaller arity. So, R doesn’t do partial application by default.

One way of doing partial application is currying. The R package curry can do that:

curry::curry(`*`, 10) ## it actually returns a function with a smaller arity
curry::curry(`*`, 10)(2) ## will give you 20
times10 <- curry::curry(`*`, 10)
times10(3)

Piping

Well, I would say piping has now been a quite established concept in the realm of R already. But you might know that the original pipe, the magrittr pipe operator (%>%), is borrowed from the |> operator from the functional language F#. The “new” R pipe, |>, is actually not very like the F# |> operator. Piping can be thought also about as a way to compose (pure) functions.

Conclusion

In this blog post, I show you some FP concepts using the code of rtoot.


  1. I can see that I have a tendency to say after this with things like I have some prior experience about this by quoting this blog or pointing to a previous talk I gave. Well, I don’t want to do it this time in the main text, but with an significant historical footnote. I gave a talk on FP at the Hong Kong Opence Source Conference in 2015 (OMG! 8 years ago) using Racket. And really, this blog has several entries on FP, but mostly in Chinese. 

  2. The usual example in the realm for other programming languages would be the manuipulation of data in the memory by pointer arithmetic. I don’t want to go into this. This blog post is intended for R programmers. 

  3. You see, I say “almost all”, not “all”. It means there are exceptions. The notably one is the new pipe operator |>


Powered by Jekyll and profdr theme