Werden wir Helden für einen Tag

Home | About | Archive

Vectorization of lookup tasks in R

Posted on May 9, 2012 by Chung-hong Chan

A task that come up regularly during data munging is like this: you have a list of user ID (uid). Then, you have another table that has the user ID(lookup.uid) and the value of interest(lookup.value). You would like to fetch the list of value (lookup.value) with the uid = lookup.uid and retaining the order of uid.
I don't know the terminology for this task. I usually call it "lookup".
Programmatically, lookup can be done with a simple for-loop to loop through the uid and match the uid with the lookup.uid one by one. In R, the following is the usual pattern for this for-loop.

[code]
#< -
forlooplookup <- function(uid, lookup) {
uid.value <- rep(NA, length(uid))
for (i in 1:length(uid)) {
uid.value[uid == lookup[i,1]] <- lookup[i,2]
}
return(uid.value)
}
[/code]

However, for loop is slow. The whole point of R programming is vectorization ((I would say the whole point of R programming is functional programming, but I guess OO people may not agree.)) . The above for-loop can be rewritten using the vectorized match() function.

[code]
vectorlookup <- function(uid, lookup) {
return(lookup[match(uid, lookup[,1]),2])
}
[/code]

Let's benchmark the for-loop version and vectorized version of lookup.

[code]
lookup <- data.frame(ID=1:30000, value=rnorm(30000))
uid <- sample(1:30000)
system.time(forlooplookup(uid, lookup))
system.time(vectorlookup(uid, lookup))
[/code]

For-loop version took 29.4s to complete but only 0.007s for vectorized version. The performance was increased by 4200 folds.
We can also assert two functions are doing the same thing to check for correctness.

[code]
sum(forlooplookup(uid, lookup) - vectorlookup(uid, lookup)) == 0
[/code]


Powered by Jekyll and profdr theme