Let me chronicle a random discovery of a bug, its impact (in my opinion), and the fixes (or lack thereof).
The story started from here: my R package oolong allows the user to do coding (or annotating). After coding, oolong calculates the inter-rater reliability of the coding. There are several ways to quantify reliability; Krippendorff’s Alpha is the standard in my field of communication science.
For a long time, my R package oolong used the long-existing R package irr (as of writing, version 0.84.1) to calculate Krippendorff’s Alpha. Then one day, I thought it would be a nice idea to write a C++ implementation of Krippendorff’s Alpha, because why not? The thought was purely random. Before reinventing the wheel, I discovered that there is actually one relatively new C++ implementation out there, icr. The natural decision, of course, was to give icr a try. An initial benchmark suggested that icr is much, much, much faster than irr, as expected. But when I used icr in oolong, one unit test did not pass. The Krippendorff’s Alpha calculated by icr (0.072) was off by a lot compared to irr (0.056).
It must mean one of them is wrong. My first suspicion was that icr might be wrong. Then I tried an implementation available online (which has been peer reviewed) and it matched the calculation of icr. Now the evidence pointed to irr being incorrect. I could not say for sure, so I dug up the literature and calculated it by hand.
Recently, I have been reading a lot of the various papers and books by Professor Klaus Krippendorff (1932-2022) on his thinking about content analysis. So I knew where to look to learn how to calculate his namesake reliability coefficient.
The test case is a matrix like this:
|    | U1 | U2 | U3 | U4 | U5 | U6 | U7 | U8 | U9 | U10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| R1 | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 2   |
| R2 | 1  | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 2  | 2   |
| R3 | 1  | 1  | 1  | 1  | 2  | 2  | 2  | 2  | 2  | 2   |
So: three raters, 10 units, two possible nominal values (1 or 2), and no missing values. The first step is to convert this into what Krippendorff called a “Coincidence Matrix” (or \(o\)). It should look like this:
|   | 1 | 2  |
|---|---|----|
| 1 | 1 | 4  |
| 2 | 4 | 21 |
An extremely simple explanation of this: there is 1 pair of \([1,1]\) (in unit 1), 4 pairs of \([1,2]\) or \([2,1]\) (in units 1, 2, 3, and 4), and 21 pairs of \([2,2]\). In total, there are 30 pairable values (10 units \(\times\) 3 raters). \(o\) is symmetric. A nice thing about the calculation of Krippendorff’s Alpha is that it only uses one of the two off-diagonal triangles; and in this simple case, that means just one value: the 4 pairs of \([1,2]\) or \([2,1]\).
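To make the pair counting concrete, here is a minimal R sketch of how such a coincidence matrix can be built (my own illustration, not the code used by irr or icr); it assumes the 3 × 10 rating matrix above, with raters in rows and no missing values:

```r
mat <- rbind(R1 = rep(2, 10),
             R2 = c(1, rep(2, 9)),
             R3 = c(rep(1, 4), rep(2, 6)))
values <- sort(unique(as.vector(mat)))
o <- matrix(0, length(values), length(values),
            dimnames = list(values, values))
m <- nrow(mat)  # every unit is rated by all 3 raters (no missing values)
for (u in seq_len(ncol(mat))) {
  for (i in seq_len(m)) {
    for (j in seq_len(m)) {
      if (i == j) next
      # each ordered pair of values within a unit contributes 1 / (m_u - 1)
      vi <- as.character(mat[i, u])
      vj <- as.character(mat[j, u])
      o[vi, vj] <- o[vi, vj] + 1 / (m - 1)
    }
  }
}
o
#>   1  2
#> 1 1  4
#> 2 4 21
```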
Another set of values we need is \(n_v\), which is just the row (or column) sum of \(o\), i.e. \([5, 25]\). For this case, Krippendorff’s Alpha is just this:

\[\alpha = 1 - \frac{4}{\frac{1}{(5 + 25) - 1}(5 \times 25)} = 0.072\]
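Plugging those numbers into the formula takes only a couple of lines in R (again my own illustration of the nominal-level calculation, using the coincidence matrix derived above):

```r
o   <- matrix(c(1, 4,
                4, 21), nrow = 2, byrow = TRUE)  # the coincidence matrix from above
n_v <- rowSums(o)                   # 5 and 25
n   <- sum(n_v)                     # 30 pairable values
D_o <- o[1, 2]                      # observed disagreement; one triangle suffices
D_e <- n_v[1] * n_v[2] / (n - 1)    # expected disagreement
1 - D_o / D_e
#> [1] 0.072
```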
So, now my hand calculation also pointed to irr being incorrect. I then looked at the code and discovered the bug: the calculation of \(o\) by irr is incorrect.
mat <- structure(c(2, 1, 1,
2, 2, 1,
2, 2, 1,
2, 2, 1,
2, 2, 2,
2, 2, 2,
2, 2, 2,
2, 2, 2,
2, 2, 2,
2, 2, 2),
dim = c(3L, 10L),
dimnames = list(c("R1", "R2", "R3"), NULL))
mat
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> R1 2 2 2 2 2 2 2 2 2 2
#> R2 1 2 2 2 2 2 2 2 2 2
#> R3 1 1 1 1 2 2 2 2 2 2
## The feature of this matrix is that it does not have any NA in it. I think the calculation is incorrect.
res <- irr::kripp.alpha(mat, method = "nominal")
res
#> Krippendorff's alpha
#>
#> Subjects = 10
#> Raters = 3
#> alpha = 0.056
## specifically, the coincidence matrix is incorrect, i.e. doubled
res$cm
#> [,1] [,2]
#> [1,] 2 8
#> [2,] 8 42
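A quick sanity check of where the 0.056 comes from: plugging this doubled coincidence matrix into the same formula as above reproduces irr’s number, presumably because the \(-1\) in the denominator does not scale along with the doubled totals.

\[1 - \frac{8}{\frac{1}{(10 + 50) - 1}(10 \times 50)} = 1 - \frac{8 \times 59}{500} = 0.056\]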
Strangely, and as expected from reading the code, irr gives the correct answer when there are some missing values.
res2 <- irr::kripp.alpha(cbind(mat, matrix(c(NA, NA, NA), ncol = 1)), method = "nominal")
res2
#> Krippendorff's alpha
#>
#> Subjects = 11
#> Raters = 3
#> alpha = 0.072
res2$cm
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 4 21
irr is not on any shared coding platform, such as GitHub or whatnot. The last CRAN release was in 2019, and the evidence from the GitHub mirror of CRAN suggests that that update was actually done by a member of the CRAN Team. The last update by the maintainer was actually version 0.84 in 2012 (Obama vs Romney, do you remember that?). So, the code is now 13 years old.
I e-mailed the maintainer immediately after this random discovery of the bug. My first e-mail bounced because the address was not found. It actually made me wonder: isn’t CRAN regularly checking whether the maintainer’s e-mail address is up to date? I googled and found the up-to-date e-mail address of the maintainer. As of writing, there is radio silence.
Both irr and icr actually do not have any unit tests. It would be nice to add some. I can find some expected values in the Hayes & Krippendorff (2007) paper, which is a “software paper” of its time, announcing the SPSS and SAS implementations of Krippendorff’s Alpha. I also contacted fellow communication researcher Professor Deen Freelon, because he has also published a paper on his web implementation of Krippendorff’s Alpha; in that paper, there are also some expected values. Professor Freelon was also very generous to provide me with all the data.
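For the record, the kind of test I have in mind is nothing fancy. A minimal sketch in testthat syntax, using the hand-calculated value from the example above as the expectation (and assuming, as in irr, that the estimate is stored in `$value`):

```r
library(testthat)

test_that("nominal Krippendorff's Alpha matches the hand calculation", {
  mat <- rbind(R1 = rep(2, 10),
               R2 = c(1, rep(2, 9)),
               R3 = c(rep(1, 4), rep(2, 6)))
  res <- irr::kripp.alpha(mat, method = "nominal")
  # this expectation currently fails with irr 0.84.1, which returns 0.056
  expect_equal(round(res$value, 3), 0.072)
})
```

The expected values from Hayes & Krippendorff (2007) or from Professor Freelon’s data could be slotted in the same way.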
I contacted the maintainers of both irr and icr about submitting an improvement with unit tests. Again, radio silence. Update (2024-10-27): The maintainer of icr gave me a response about adding tests. I will be working with them.
I tested with Professor Freelon’s data. I think the good news is: in general, irr systematically underestimates Krippendorff’s Alpha, which is at least the conservative direction for a reliability coefficient. In most real-life data, the discrepancy is in the second or third decimal place. For many people (e.g. content analysts), the miscalculation should not bring any bad news.
I made the switch for oolong to use icr. But I still want to improve the accuracy of irr and icr upstream. I tried (see above). And I don’t want to fork anything.
I don’t think I am the first to discover exactly this bug in irr; at least one other group should have identified it. This paper reports a modified implementation of Krippendorff’s Alpha which was said to be based on irr (I can also confirm, by reading the code, that it is almost a 1-to-1 copy in some places) and which was actually bug-free. Why the fix was not added upstream is unknown. Or perhaps the response was also radio silence.
If I had not made that random discovery, I would still trust irr and use it. This experience actually makes me wonder: should I still trust any software without tests?