Werden wir Helden für einen Tag

Home | About | Archive

2021 Wrapped: R edition

Posted on Dec 2, 2021 by Chung-hong Chan

Spotify releases to the users every year “a Wrapped” on Dec 1st: it is a descriptive summary of listening habit. My Spotify Wrappeds are quite embarassing because they reveal that I listen to music more when I am down.

And then I came across this tweet by Emily Riederer.

Well, it sounds fun. I think it would be possible to generate a Wrapped for my R code because I have all my projects in my ~/dev directory. My first idea was to just check the require and library lines 1. And then I know that it is a stupid idea because Javascript also uses require. So I should focus on R files (*.R and *.Rmd).

My initial try was pure R. The idea is to use the fs package to list out all R files with a modification date after 2021-01-01, select all require and library lines, and then count. Then I knew that it is super slow using the readLines function to read all R files and then extract the lines I need. Even with multicore computation of furrr, it is still slow.

My second try was to not to use R entirely. Instead, I used the wonderful ripgrep to search for require and library lines. I usually use ripgrep from within emacs via the rg mode. So I needed to learn how to use it from the command line.

rg '^library\(|^require\(' -g '*.{R,rmd,r,Rmd}'

So, it searched for all the require and library lines in all of my R files. Then I redirected the result to a text file.

rg '^library\(|^require\(' -g '*.{R,rmd,r,Rmd}' > rgresults.txt

The file is incredibly small and is machine readable. I then analyzed the data from there. Here are my most used R packages in 2021.

I notice one problem here: the large usage of the datasets package. As it turns out, those uses are from the test cases of rio and readODS. So, after filtering out all test cases. I think the following figure is more accurate.

I think it reflects my laziness to load tidyverse instead of individual packages from tidyverse. Actually, it reflects who I am: doing a lot of text analyses with quanteda (and its friends such as quanteda.dictionaries, spacyr, LSX, quanteda.textstats etc.) and Bayesian modeling with brms. You might think that I do twitter analysis with academictwitteR. But the fact is that I don’t do (actually …. like…) twitter research. I just happen to contribute some code. Also, I need to use the package to answer the questions in the dicussion forum.

I do dogfooding too: I use my own packages - oolong, sweater, textplex. rio is kind of that too. The very reason to package these oft-repeat code into packages is for using it in research, right?

It also reveals some packages I really like: here, papaja, text2vec, parameters, etc. These are massive time savers.

  1. From my rstyle project, I know that the most accurate way to do that is to parse the code using base::parse into expressions and then actually count from there. Also, code that use the namespace operator, e.g. rio::import, is not counted as a use. But I am too lazy and I don’t need the most accurate result. 

Powered by Jekyll and profdr theme