Wednesday, February 11, 2015

R's Tricky == Operator, or "It depends on what the meaning of the word 'is' is"

One scenario where R can trip up a programmer is when using the == operator or its relatives. The help page notes that "NA values are regarded as non-comparable", which introduces some potentially unexpected behavior.
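To see what "non-comparable" means in practice, here is a minimal sketch at the console. The variable names (cmp, miss) are just for illustration and are not from the help page.

```r
# Any comparison involving NA yields NA rather than TRUE or FALSE
na_vs_four <- NA == 4        # NA
na_vs_na   <- NA == NA       # NA, even when comparing NA to itself

# The NA propagates element-wise through a vector comparison
cmp <- c(3, NA, 4) == 4      # FALSE NA TRUE

# is.na() is the way to test for missingness directly
miss <- is.na(c(3, NA, 4))   # FALSE TRUE FALSE
```

The logical vector cmp is what ends up inside the square brackets when subsetting, which is where the trouble starts.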

As a toy example, look at what happens when subsetting on a column that includes NA values.
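df <- data.frame(a = 11:15, b = c(3, NA, 4, 4, NA))
df
df[df$b == 4, ]  # equality test on a column that contains NA
df[df$b <= 4, ]  # same issue with a relational operator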
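In each case, the result also includes a row for every NA in the b column, and those rows come back filled entirely with NA. This can be surprising, and it is easy to miss when the subset is wrapped inside an aggregation such as nrow or sum. For an equality test, a safer way to accomplish this subsetting is the %in% operator, which returns FALSE rather than NA when the value is missing. Like so: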
df[df$b %in% 4,]
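To make the aggregation pitfall concrete, here is a rough sketch using the same df. The which() and is.na() variants are not part of the post itself (which() is suggested in the comments). They are shown as possible options for the <= case, where %in% does not apply. All variable names are illustrative.

```r
df <- data.frame(a = 11:15, b = c(3, NA, 4, 4, NA))

# The NA rows inflate the row count, and sum() propagates the NA
n_naive <- nrow(df[df$b == 4, ])         # 4, although only 2 rows match
s_naive <- sum(df$b == 4)                # NA
s_narm  <- sum(df$b == 4, na.rm = TRUE)  # 2

# %in% never returns NA, so only the true matches survive
n_in <- nrow(df[df$b %in% 4, ])          # 2

# %in% only covers equality; for <= the NAs have to be dropped explicitly
le_which <- df[which(df$b <= 4), ]          # which() skips NA positions
le_isna  <- df[!is.na(df$b) & df$b <= 4, ]  # explicit NA guard
```

Both of the last two lines keep the rows where a is 11, 13 and 14.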

6 comments:

  1. I don't know what b is, but I want you to tell me whether it's equal to four.

  2. df[which(df$b == 4),] is probably faster.

  3. With data.table you don't have this problem.

  4. subset(df, df$b == 4)
    subset(df, df$b <= 4)

    Always safer!!

  5. Careful with subset!
    http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset

    Replies
    1. A good point.

      When using subset, ALWAYS include the data frame name by using df$b or df[['b']].
