Wednesday, February 11, 2015

R's Tricky == Operator, or "It depends on what the meaning of the word 'is' is"

One scenario where R can trip up a programmer is when using the == operator or its relatives. The help page notes that "NA values are regarded as non-comparable", which introduces some potentially unexpected behavior.
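To see what "non-comparable" means in practice, here is a minimal sketch in base R (the variable names are only for illustration). Comparing anything to NA with == yields NA rather than TRUE or FALSE, and that NA propagates element-wise through a vector comparison.

```r
# Comparing a missing value yields NA, not FALSE
single <- NA == 4          # NA
both   <- NA == NA         # NA as well: two unknowns are not "equal"

# Element-wise comparison carries the NA through
x   <- c(3, NA, 4)
cmp <- x == 4              # FALSE NA TRUE

# %in% is built on match(), so a missing value simply fails to match
inn <- x %in% 4            # FALSE FALSE TRUE

print(cmp)
print(inn)
```

The logical vector with an NA in it is what causes trouble once it is used as a row index.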

As a toy example, look what happens when trying to subset on a column that includes NA values.
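df <- data.frame(a = 11:15, b = c(3, NA, 4, 4, NA))
df
df[df$b == 4, ]   # intended: only the rows where b is 4
df[df$b <= 4, ]   # intended: only the rows where b is at most 4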
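In each case, the result includes an extra row for every NA in the b column, and that row is filled entirely with NA. This is surprising enough on its own, and it is easy to miss when the subset is wrapped inside an aggregation such as nrow or sum. For equality tests, a safer way to accomplish this subsetting is the %in% operator, which returns FALSE rather than NA when it meets a missing value. Like so: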
df[df$b %in% 4,]
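Since %in% only covers equality, the <= case needs a different guard. The sketch below rebuilds the same toy df and compares row counts, then shows two base R options that should drop the NA rows: wrapping the condition in which(), or adding an explicit !is.na() check. The names n_eq, n_in and so on are just for illustration.

```r
df <- data.frame(a = 11:15, b = c(3, NA, 4, 4, NA))

# The NA rows inflate the count when == is used directly
n_eq <- nrow(df[df$b == 4, ])       # counts the two all-NA rows too
n_in <- nrow(df[df$b %in% 4, ])     # counts only the real matches

# For inequalities, which() keeps only the positions that are TRUE
le_which <- df[which(df$b <= 4), ]

# Equivalent: rule out missing values explicitly
le_isna <- df[!is.na(df$b) & df$b <= 4, ]

print(c(n_eq = n_eq, n_in = n_in))
print(le_which)
```

subset(df, b <= 4) also treats NA as FALSE, though its help page recommends it mainly for interactive use.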

SQL Interjections

When I have to re-learn how to use PARTITION BY, the interjections get more colorful