chainsawriot

Current thinking about writing R (1/2): Clarity

Posted on Sep 17, 2023 by Chung-hong Chan

Increasingly often, more people are now working on my R programs. Partly, it is because of my current position. I am surrounded by some of the coolest coworkers ever and we can REALLY write code together as a team. Rather than I am being the only person working on my programming shit.

I also need to work on others’ codebase. Sure. And I usually try to imitate the Modus Operandi as reflected in the codebase. But for the projects I initiated, I program in my ever evolving Modus Operandi. The oldest (reusable) of my R code is on my Github. To be honest, it hurts my eyes even to look at it now. And I am glad that I can see the differences between my old shit and my new shit. Some good things from the past, however, are still there. Back then, I didn’t know how to use a unit test framework. But still, I tested my code by writing some tests. I am still obsessed with software testing.

I read a book called Rust for Rustaceans. The author said a good Rust programmer should have a good taste. Well, I don’t claim myself to be a good R programmer, so I can’t say I have a good taste. But writing R almost daily for more than a decade certainly makes me have a certain taste. Whether or not it’s good or bad taste is in the eyes of the beholders.

Sommeliers can tell a bottle of good wine from the bad ones. They don’t just have the palate, i.e. taste, but also have the communication skill to explain why a good wine is good with their vocabulary: aromas, notes, body, tannin, acidity, dryness, finish, fruitiness, smokiness… If you think your programming style is good, you should also be able to communicate why your style is good.

I highlighted the word “good” above, because nothing is good for all purposes. A really good programmer in general knows about trade offs. One can only optimize for at most two things from these three: clarity, maintainability, and speed of development. From the three, I discard speed of development (e.g. typing less) and hopefully optimize for both clarity and maintainability. If I can’t even optimize for two, I discard maintainability and opt for clarity. But I think that code that is clear should also be sufficiently maintainable.

I don’t want to frame this blog post as if I am an expert like “As a communication researcher, …” or something like that. I think programming is a form of communication. But I am hopeless as a communicator. Those who really know me know that I am socially awkward and I don’t like to talk (to other people). Programming to me is like talking to myself, a thing I am constantly doing. Even right now when I am in front of my emacs session typing this.

Let’s take an example to talk about how I talk to myself. This is Exhibit A.

.read_ods <- function(path, range, sheet = 1) {
    ## many code before this
    ## ...
    sheets <- list_ods_sheets(path)
    sheet_name <- cellranger::as.cell_limits(range)[["sheet"]]
    if(!is.null(range) && !is.na(sheet_name)) {
        if(sheet != 1) {
            warning("Sheet suggested in range and using sheet argument. Defaulting to range",
                call. = FALSE)
        }
        is_in_sheet_names <- stringi::stri_cmp(sheet_name, sheets) == 0
        if(any(is_in_sheet_names)) {
            sheet <- which(is_in_sheet_names)
        } else {
            stop(paste0("No sheet found with name '", sheet_name, "'", sep = ""),
                call. = FALSE)
        }
    } else {
        is_in_sheet_names <- stringi::stri_cmp(sheet, sheets) == 0
        if (!is.numeric(sheet) && any(is_in_sheet_names)) {
            sheet <- which(is_in_sheet_names)
        } else if (!is.numeric(sheet)) {
            stop(paste0("No sheet found with name '", sheet, "'", sep = ""), 
                call. = FALSE)
        }
        if (sheet > length(sheets)) {
            stop(paste0("File contains only ", length(sheets), " sheets. Sheet index out of range.",
                call. = FALSE))
        }
    }
    ## many code after this
    ## ...
}

And Exhibit B.

.read_ods <- function(path, range, sheet = 1) {
    ## many code before this
    ## ...
    sheet_names <- list_ods_sheets(path)
    sheet_index <- .standardise_sheet(sheet, sheet_names, range)
    ## many code after this
    ## ...
}

## standardise `sheet` parameter as a number, i.e. sheet_index
.standardise_sheet <- function(sheet, sheet_names, range) {
    sheet_from_range <- cellranger::as.cell_limits(range)[["sheet"]]
    if (!is.null(range) && !is.na(sheet_from_range)) {
        if (sheet != 1) {
            warning("Sheet suggested in range and using sheet argument. Defaulting to range",
                    call. = FALSE)
        }
        sheet <- sheet_from_range ## override; should be a sheet_name
    }
    is_in_sheet_names <- stringi::stri_cmp(e1 = sheet, e2 = sheet_names) == 0
    if (!is.numeric(sheet) && !any(is_in_sheet_names)) {
        stop(paste0("No sheet found with name '", sheet, "'"),
             call. = FALSE)
    }
    if (is.numeric(sheet) && sheet > length(sheet_names)) {
        stop("File contains only ", length(sheet_names), " sheets. Sheet index out of range.",
                    call. = FALSE)
    }
    if (!is.numeric(sheet)) {
        sheet_index <- which(is_in_sheet_names)
    } else {
        sheet_index <- sheet
    }
    return(sheet_index)
}

You might not know exactly what this function does without much context. It is (a slightly modified) part of the code that readODS uses to check sheet and range. sheet is a user provided argument that can be an integer such as 1 indicating the first sheet, or a string such as “Sheet1”. range is like the range you would use for Excel, e.g. “A1:B2”. But you can also specify it with a sheet name such as “Sheet2!A2:B7”. Under the hood, the reading function only needs the integer. Therefore, we need a mechanism to check sheet and range and then come up with an integer.

You should notice several things:

1. Rumpelstilzchen

The helper function .standardise_sheet in Exhibit B is created to give a name to a bunch of code with no name. And this name explains what this bunch of code does: It standardizes (well, we use British English in readODS) the sheet argument (which can be integer or string) as an numeric index. I like to call this Rumpelstilzchen; or as Professor Patrick Henry Winston (1943 - 2019): “Naming something gives you control over it.” As CS Brogrammers would tell you “As a former Google, former Facebook programming billionaire, there are only two hard things in Computer Science: cache invalidation and naming things.” But if you follow some principles (e.g. always naming a function as a verb and don’t be afraid to use long function name), the name communicates the intend of your code.

2. “2 deep” max and fail early

There are big if and else blocks in Exhibit A. If the first if gives FALSE, you will need to scroll a bit to read what will be executed in that else block. Inside both the if and else blocks, there are still more if-else blocks. There is some code with three levels of indentation. Someone says this is “3 deep”. “3 deep” usually causes me mental overload (don’t show me “4 deep” code, thank you). Also, exceptions are thrown usually at the bottom. So when reading this sequentially (admit that, we read code sequentially. Unlike computer, which reads code really conditionally), one will need to switch contexts between exceptions handling and real computational. A good way to identify patterns like this is to look for duplicated code in both if and else. is_in_sheet_names <- stringi::stri_cmp(sheet, sheets) == 0 is repeated.

In Exhibit B, there is almost no else (except in one place; I will tell you why I keep that). Exceptions are thrown as early as possible so that they are like checking the correctness of the input sheet and range. You can do this usually by flipping the logic to fail early.

A caveat is that there is still one if after another if and the warning line is in “3 deep”, which I don’t like. In other R packages, e.g. oolong, I have a helper function called .cwarn() or .conditional_warnings. It’s because I know conditional warning is a very common reason for introducing “3 deep” code.

.conditional_warnings(condition, ...) {
    if (isFALSE(condition)) {
        return(invisible(NULL))
    }
    warnings(..., call. = FALSE)
}

.standardise_sheet <- function(sheet, sheet_names, range) {
    sheet_from_range <- cellranger::as.cell_limits(range)[["sheet"]]
    if (!is.null(range) && !is.na(sheet_from_range)) {
        .conditional_warnings(sheet != 1, "Sheet suggested in range and using sheet argument. Defaulting to range")
        sheet <- sheet_from_range ## override; should be a sheet_name
    }
    ## ...
}

But not everyone likes this. Another option is like this, which is slight convoluted:

.standardise_sheet <- function(sheet, sheet_names, range) {
    sheet_from_range <- cellranger::as.cell_limits(range)[["sheet"]]
    is_sheet_name_in_range <- !is.null(range) && !is.na(sheet_from_range)
    if (is_sheet_name_in_range && sheet != 1) {
        warning("Sheet suggested in range and using sheet argument. Defaulting to range",
                call. = FALSE)
    }
    if (is_sheet_name_in_range) {
        sheet <- sheet_from_range ## override; should be a sheet_name
    }
    ##
}

When the code is getting longer and bigger without improving clarity, it is a signal of overoptimization and should stop. And therefore, I keep that “3 deep” warning.

A general tip:

If you write something like if + stop + else.

if (it_doesnt_work) {
    stop("oh, it doesn't work")
} else {
    ## do something
}

You can remove the else block surely without sacrificing anything.

if (it_doesnt_work) {
    stop("oh, it doesn't work")
}
## do something

Again, it is possible to eliminate this with either stopifnot() (if the error message is not important), a homegrown function like .conditional_stop() or existing solutions such as checkmate.

3. Rumpelstilzchen, again

In Exhibit A, the variable sheet is overwritten during the entire course of the execution. In Exhibit B, it is clear what sheet you are talking about: the ambiguous sheet, sheet_names (vs sheets), or sheet_index. And it’s clear that the final output is sheet_index.

Actually, the last part of the code can be written as such:

.standardise_sheet <- function(sheet, sheet_names, range) {
    ## ...
    if (!is.numeric(sheet)) {
        return(which(is_in_sheet_names))
    }
    return(sheet)
}

It is faster to write code this way and it eliminates the else block, which I usually like (see point 2). But as I said previously, I optimize for clarity and in my opinion it somehow hides the intent. Therefore, I invoke Rumpelstilzchen by conditionally creating a new variable called sheet_index and ultimately returns it. Again, not an ambiguous sheet. And actually, the code is still “2 deep”.

Next time I will talk about smells.