Sunday, August 17, 2014

A Look at Random Seeds in R... Or: “85, why can’t you be more like 548?”

Have you ever wondered whether the set.seed() function in R has any quirkiness? This analysis was inspired by a Stack Overflow posting by Wolfgang, and I incorporate some of his code.

For each seed from 1 to 1,000, I took the mean and standard deviation of the first 1,000 random numbers. I then calculated the percentage of overlap between the resulting density and the standard normal curve, as well as the distance of each (mean, sd) point from the shifted origin, which is (0, 1) in this case.
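The per-seed calculation can be sketched roughly as follows. This is not the original code: it assumes the random numbers come from rnorm(), that the overlap is measured between two normal curves (the fitted one and the standard normal) by integrating their pointwise minimum, and that the distance is Euclidean. The names normal_overlap, seed_stats, and seed_summary are made up for illustration.

```r
# Overlap of two normal densities: the integral of their pointwise minimum.
normal_overlap <- function(mu1, sd1, mu2 = 0, sd2 = 1) {
  min_density <- function(x) pmin(dnorm(x, mu1, sd1), dnorm(x, mu2, sd2))
  integrate(min_density, lower = -Inf, upper = Inf)$value
}

# One row per seed. Assumption: the draws are rnorm(n); the post does not say.
seed_stats <- function(seeds = 1:1000, n = 1000) {
  rows <- lapply(seeds, function(s) {
    set.seed(s)
    x <- rnorm(n)
    mu <- mean(x)
    sigma <- sd(x)
    data.frame(
      seed = s,
      mu = mu,
      sd = sigma,
      dist = sqrt(mu^2 + (sigma - 1)^2),    # Euclidean distance from (0, 1)
      intersect = normal_overlap(mu, sigma)  # overlap with the standard normal
    )
  })
  do.call(rbind, rows)
}

seed_summary <- seed_stats(1:1000)
print(head(seed_summary), digits = 3)
```

Under these assumptions, dist is 0 and intersect is 1 only when a seed's sample mean is exactly 0 and its sample sd is exactly 1.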

From the resulting points, I picked out the most interesting ones: the minimum and maximum mean, the minimum and maximum sd, the point farthest from the shifted origin in each of the four quadrants, the point farthest away overall, and the point closest to the center, plus the most and least similar seeds by overlap.
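One way to pick out such points is sketched below. It is only an illustration, not the original code: it again assumes rnorm() draws, defines the quadrants relative to the shifted origin (0, 1), and uses hypothetical names (pts, pick, picks, interesting).

```r
# Hypothetical input: one row per seed with mu, sd, and distance from (0, 1).
# Assumption: the draws are rnorm(1000); the post does not state the generator.
seeds <- 1:1000
draws <- lapply(seeds, function(s) { set.seed(s); rnorm(1000) })
pts <- data.frame(seed = seeds,
                  mu = vapply(draws, mean, numeric(1)),
                  sd = vapply(draws, sd, numeric(1)))
pts$dist <- sqrt(pts$mu^2 + (pts$sd - 1)^2)

# Return the row where `column` is largest (or smallest, with which.min).
pick <- function(df, column, fun = which.max) {
  df[fun(df[[column]]), , drop = FALSE]
}

picks <- list(
  mu_min = pick(pts, "mu", which.min),
  mu_max = pick(pts, "mu", which.max),
  sd_min = pick(pts, "sd", which.min),
  sd_max = pick(pts, "sd", which.max),
  q1 = pick(pts[pts$mu > 0 & pts$sd > 1, ], "dist"),  # high mean, high sd
  q2 = pick(pts[pts$mu < 0 & pts$sd > 1, ], "dist"),  # low mean, high sd
  q3 = pick(pts[pts$mu < 0 & pts$sd < 1, ], "dist"),  # low mean, low sd
  q4 = pick(pts[pts$mu > 0 & pts$sd < 1, ], "dist"),  # high mean, low sd
  out = pick(pts, "dist", which.max),                 # farthest from (0, 1)
  `in` = pick(pts, "dist", which.min)                 # closest to (0, 1)
)

interesting <- cbind(type = names(picks), do.call(rbind, unname(picks)))
rownames(interesting) <- NULL
print(interesting, digits = 3)
```

The same seed can show up under several labels, since the farthest point overall is also the farthest point in its own quadrant.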

Below is the summary of interesting points.

##      type seed     mu    sd  dist intersect
## 1  mu_min   85 -0.110 1.008 0.110     0.956
## 2  mu_max  501  0.104 1.002 0.104     0.959
## 3  sd_min  180 -0.005 0.921 0.079     0.960
## 4  sd_max  168  0.002 1.065 0.065     0.969
## 5      q1  501  0.104 1.002 0.104     0.959
## 6      q2   85 -0.110 1.008 0.110     0.956
## 7      q3  713 -0.075 0.935 0.100     0.957
## 8      q4  394  0.090 0.988 0.091     0.964
## 9     out   85 -0.110 1.008 0.110     0.956
## 10     in  548  0.000 1.000 0.000     1.000
## 11    sim  548  0.000 1.000 0.000     1.000
## 12   diff   85 -0.110 1.008 0.110     0.956


Below is a chart showing the overlap of the most similar point and a chart showing the overlap of the least similar point. Thanks again to Wolfgang for this code chunk.
Top:  Seed 548; Bottom:  Seed 85
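A comparable overlap chart can be sketched in base R. This is not Wolfgang's code chunk; plot_overlap is a hypothetical helper that draws the standard normal against a normal curve with a given mean and sd, using the values from the summary table above, and shades the area under both.

```r
# Draw N(0, 1) against N(mu, sd) and shade the region lying under both curves.
plot_overlap <- function(mu, sd, label = "") {
  x <- seq(-4, 4, length.out = 401)
  ref <- dnorm(x)                      # standard normal reference
  obs <- dnorm(x, mean = mu, sd = sd)  # normal curve with the seed's mean and sd
  shared <- pmin(ref, obs)             # height of the overlapping region
  plot(x, ref, type = "n", ylim = c(0, max(ref, obs)),
       xlab = "x", ylab = "density", main = label)
  polygon(c(x, rev(x)), c(shared, rep(0, length(x))),
          col = "grey80", border = NA)
  lines(x, ref, lwd = 2)
  lines(x, obs, lwd = 2, lty = 2)
  invisible(sum(shared) * (x[2] - x[1]))  # rough Riemann-sum estimate of the overlap
}

op <- par(mfrow = c(2, 1))
plot_overlap(0.000, 1.000, "Seed 548")  # mu and sd taken from the summary table
plot_overlap(-0.110, 1.008, "Seed 85")
par(op)
```

The function also returns a rough numeric estimate of the shaded area, which gives a quick check against the intersect column.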
As you can see, even the worst seed has an overlap of 95.6%, and seed 548 is almost perfect. However, since some seeds could cause issues, it might be good practice to vary the seed over time. You could throw a dart, or a manager could assign the seed as part of the requirements document. This practice might mitigate the risk of an analyst intentionally selecting a biased seed.

3 comments:

  1. I believe the only reason to allow an externally selected seed is to allow the user to verify results (or errors) with the identical input data set. I would never manually select a seed when running actual pseudo-random statistical analyses. Also, keep in mind that a relatively small sample such as you used in fact should not have perfect moments. The statistical probability works out that it won't.

  2. Interesting analysis. Would you expect them to all have a mean of 0 and variance of 1?
    Would a better test be to check that the means and variances for each given seed correspond to their appropriate sampling distributions?

  3. Both good points and food for thought. I don't see any issues for the honest analyst; however, there could be opportunity to "bias" an analysis in marginal cases if so disposed.
