Monday, December 29, 2014

First Day of the Month, Using R

Future-proofing is an important concept when designing automated reports. One thing that can get out of hand over time is the accumulation of so many periods of data that your charts start to look overcrowded. You can solve this by limiting the number of periods to, say, 13 (I like 13 for monthly data, because you get a full year of data and can still see the month-over-month change for the most recent data).

You could approach this by limiting your data to anything in the last 390 days (30 days x 13 months), but your starting period will likely be cut off partway through a month. You can fix this by finding the first day of the month for each record, then going back far enough to get a full 13 months of data.

Here's a quick one-liner to get the first day of the month for a given date:  subtract the day of the month from the full date, then add 1.

# get some dates for the toy example:
df1 <- data.frame(YourDate = as.Date("2012-01-01") + seq(from = 1, to = 900, by = 11))
# first day of the month: subtract the day of the month, then add 1
df1$DayOne <- df1$YourDate - as.POSIXlt(df1$YourDate)$mday + 1
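
To tie this back to the 13-month limit, here is a minimal sketch of one way the filter could look. It rebuilds the same toy data so it runs on its own; the names latest_month, window_start, and df13 are made up for illustration, and it assumes the window should end at the most recent month present in the data rather than at today's date.

```r
# Toy data, same as above
df1 <- data.frame(YourDate = as.Date("2012-01-01") + seq(from = 1, to = 900, by = 11))
df1$DayOne <- df1$YourDate - as.POSIXlt(df1$YourDate)$mday + 1

# First day of the most recent month in the data (assumption: the window ends here)
latest_month <- max(df1$DayOne)

# Step back 12 whole months from that date; the 13th element is the window start.
# Because latest_month is always the 1st, month arithmetic cannot overflow.
window_start <- seq(latest_month, by = "-1 month", length.out = 13)[13]

# Keep only records whose month falls inside the 13-month window
df13 <- df1[df1$DayOne >= window_start, ]
```

Filtering on DayOne instead of YourDate is what keeps the starting month whole: every record from the earliest month is kept, not just those after an arbitrary day.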

Wednesday, December 24, 2014

Democracy, War, & Statistical Modeling

Democracy can be thought of as a technique for summarizing the prevailing sentiment of a population. Democracy may be admirable, but why is its usage so prevalent in modern governments? Why is the prevailing sentiment so important?

Perhaps prevailing sentiment is not directly important, but it closely tracks something else that is. What could be happening under the sheen of participatory democracy is an uglier calculation of who might win a war. This can explain why there seems to be more stability in democratic countries. Rebels are not magically restrained from attempting an overthrow, but their failure becomes more predictable, so an overthrow is less likely to be attempted in the first place.

This theory hinges on how well the election results match the prevailing sentiment. Voters would need candidates that are not too far off from the population's currents of sentiment. Two problems:

  1. It would be impossible to have perfect, universal alignment between voters and candidates.
  2. It can also be difficult to quantify the distance between a voter's sentiment and that of each candidate.

To describe how each of these complications is addressed in practice, the discipline of statistics can be helpful:

Candidates can use the concept of k-means clustering to determine optimal positioning; less cynically, this could happen evolutionarily as the best-positioned candidate survives. A k of 2 typically seems to occur because going to a k of 3 most likely means that the third candidate crowds into the larger cluster's area and splits its support, handing the advantage to the candidate who would otherwise be the second favorite. The two remaining candidates would tend to stake out positions that lead to a roughly equal split of the population.
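
As a rough illustration of the clustering idea, here is a small sketch using kmeans() from base R's stats package. Everything about the data is an assumption: the voters are simulated as random points in a made-up two-issue sentiment space, and the names voters, issue_a, and issue_b are hypothetical. It only shows how two cluster centers could stand in for two candidates' positions, not how real campaigns position themselves.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: 1,000 voters placed in a two-issue sentiment space
voters <- matrix(rnorm(2000), ncol = 2,
                 dimnames = list(NULL, c("issue_a", "issue_b")))

# k = 2: each cluster center plays the role of a candidate's position
fit <- kmeans(voters, centers = 2, nstart = 10)

# Candidate positions (cluster centers)
print(fit$centers)

# Share of voters closest to each candidate
shares <- fit$size / nrow(voters)
print(shares)
```

On symmetric simulated data like this the two shares tend to come out close to each other, but k-means itself does not guarantee an even split.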

Understanding the sentiment space is imperfect because sentiment changes and is difficult to quantify into specific coordinates. An analogue to the voting population's distance measurement technique is the random forest learning method. Each voter can be thought of as one of the decision trees, selecting a different subset of factors to consider. Each voter's output is aggregated and the mode is selected as the system output, or candidate in this case.

Random forests are known for being highly predictive, but not transparent. A professional pollster would probably agree wholeheartedly.
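
The voting analogy can be sketched in a few lines of base R. This is not a real random forest (no trees are fitted); it only mimics the two features the analogy relies on, namely that each voter considers a random subset of factors and that the most common vote wins. The candidate positions, the shared ideal position, and the choice of 3 factors per voter are all invented for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

n_voters  <- 501
n_factors <- 10

# Hypothetical positions of two candidates on each factor, plus the
# position the population would ideally like (all made up)
candidates <- rbind(A = runif(n_factors), B = runif(n_factors))
ideal      <- runif(n_factors)

# One voter: look at a random subset of 3 factors and pick the candidate
# who is closer to the ideal on just those factors
cast_vote <- function() {
  considered <- sample(n_factors, 3)
  gaps  <- sweep(candidates[, considered, drop = FALSE], 2, ideal[considered])
  dists <- rowSums(gaps^2)
  names(which.min(dists))
}

# Aggregate the votes and take the mode as the system output
votes  <- replicate(n_voters, cast_vote())
tally  <- table(votes)
winner <- names(tally)[which.max(tally)]
```

Individual voters can disagree because they weigh different factors, yet the tally still settles on a single winner, and it is hard to say from the tally alone which factors decided it.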