Sunday, March 11, 2018


MS SQL (and various other flavors) gives the user wide latitude in defining the result set. The risk is unintended results based on an incomplete understanding of the underlying data. For example, a straightforward LEFT JOIN with 1,000,000 records in the left table might return 1,000,001 records due to an unexpected partial duplicate in the right table. Worse, the LEFT JOIN might return 1,000,000 records today and 1,000,001 records a few days later, after the user is (wrongly) satisfied that the query is correct. Queries could instead be written more explicitly, with built-in operations that state the intended grain directly. One model of how such an explicit query could look is below, using a fictitious repurposing of the DISTINCT clause and a new LEFT AGGREGATE JOIN join type. Of course these actions can be accomplished with current functionality, but they require much more convoluted language that can lead to unintended consequences if not fully understood.
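A minimal sketch of the pitfall, using hypothetical customer and address tables in SQLite (the table and column names are invented for illustration):

```python
import sqlite3

# Hypothetical tables; a real schema would differ.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customer (custid INTEGER, name TEXT)")
cur.execute("CREATE TABLE address (custid INTEGER, city TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "Ann"), (2, "Bob"), (3, "Cal")])
# Customer 2 has two address rows: a partial duplicate on the join key.
cur.executemany("INSERT INTO address VALUES (?, ?)",
                [(1, "Akron"), (2, "Boston"), (2, "Berlin"), (3, "Chico")])

rows = cur.execute("""
    SELECT c.name, a.city
    FROM customer c LEFT JOIN address a ON c.custid = a.custid
""").fetchall()
print(len(rows))  # 4 rows from a 3-row left table
```

One extra address row silently inflates the result, and nothing in the query text warns the reader that the grain has changed.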

[Tables are aliased as numbers; fields are named as letters]

SELECT (DISTINCT 1.A, 1.B, 1.C)  [this is the grain; no duplicate combinations]
   2.D BY MAX, [if multiple records returned, take the maximum value of 2.D]
   2.E BY COLLAPSE(,) [if multiple records returned, collapse by concatenating & separating with ","]
   2.F BY COUNT [if multiple records returned, take the count of records]
   2.G BY RANDOM [take one random record]
   2.H BY ERROR_IF_>1 [if multiple records returned, throw an error]
FROM TABLE customer 1
   LEFT AGGREGATE JOIN address 2 ON 1.addressid = 2.addressid
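For comparison, here is a sketch of how standard SQL approximates these semantics today with GROUP BY, run in SQLite (GROUP_CONCAT stands in for the fictitious COLLAPSE; the tables and columns are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customer (custid INTEGER, name TEXT)")
cur.execute("CREATE TABLE address (custid INTEGER, city TEXT, zip TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
cur.executemany("INSERT INTO address VALUES (?, ?, ?)",
                [(1, "Akron", "44301"),
                 (2, "Boston", "02108"),
                 (2, "Berlin", "03570")])

# GROUP BY pins the grain to one row per customer; each joined column must
# then declare how multiple matches collapse, like the "BY" clauses above.
rows = cur.execute("""
    SELECT c.custid, c.name,
           MAX(a.zip)                AS max_zip,  -- like 2.D BY MAX
           GROUP_CONCAT(a.city, ',') AS cities,   -- like 2.E BY COLLAPSE(,)
           COUNT(a.custid)           AS n_addr    -- like 2.F BY COUNT
    FROM customer c LEFT JOIN address a ON c.custid = a.custid
    GROUP BY c.custid, c.name
    ORDER BY c.custid
""").fetchall()
print(rows)
```

The difference is that nothing forces the aggregates: drop the GROUP BY and the query silently changes grain, which is exactly what the explicit syntax above would prevent. Something like ERROR_IF_>1 could be approximated with a HAVING COUNT(*) <= 1 filter plus a separate assertion.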

I claim no special expertise with RDBMS languages, so I welcome any feedback on whether this feature exists in other database systems.

Thursday, August 31, 2017

Why Are Racing Drivers Born on March 23?

Wikipedia has pages for each day of the year (e.g., January 1, April 25). Each page contains a list of names of famous people with that birthday along with a short description of each person. I wrote an R script to scrape these lists from each Wikipedia page and then analyze the data to find which days have unusually high occurrences of certain words (based on the tf-idf statistic).
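As a toy illustration of the tf-idf weighting (the words and day pages are made up; natural log is used here for simplicity):

```python
import math

# Invented word lists standing in for three scraped day pages.
days = {
    "March_23": ["racing", "driver", "racing", "actor"],
    "May_2":    ["columnist", "actor", "writer", "actor"],
    "March_16": ["bicycle", "racer", "actor", "writer"],
}

def tf_idf(word, day):
    doc = days[day]
    tf = doc.count(word) / len(doc)             # term frequency within the day
    df = sum(word in d for d in days.values())  # number of days containing the word
    return tf * math.log(len(days) / df)        # idf = log(N / df)

# "racing" appears on only one day, so it scores high there:
print(round(tf_idf("racing", "March_23"), 3))   # 0.549
# "actor" appears on every day, so its idf (and hence tf-idf) is zero:
print(tf_idf("actor", "March_23"))              # 0.0
```

A word gets a high score on a given day only when it is both frequent on that day and rare across the other days, which is exactly the "unusually high occurrence" screen described above.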

Three days stood out:

  • May 2: columnists
  • March 16: bicycle racers
  • March 23: racing drivers
So, why are Racing Drivers so often born on March 23?

R script:

# Prepare -----------------------------------------------------------------
rm(list = ls())
gc()
pkg <- c("tidyverse", "rvest", "tidytext", "ggplot2", "beepr")
inst <- pkg %in% installed.packages()
if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])
lapply(pkg, library, character.only = TRUE)
rm(list = c("inst", "pkg"))
setwd("/Users/danieldunn/Dropbox/DD Cloud/R/birthday_types")
set.seed(4444)

# Download wiki birthday data ---------------------------------------------
# 2004 is a leap year, so February 29 is included
births <- seq(as.POSIXct("2004-01-01"), as.POSIXct("2004-12-31"), by = "days") %>%
  format(format = "%B_%e") %>%
  gsub(pattern = "_ ", replacement = "_") %>%
  as.list()
#births <- births[1:5] # remove when ready for whole year
urls <- paste0("", births) # URL base missing from the original post
ndays <- length(urls)

for(i in 1:ndays) {
  cat("Trying", i, "of", length(births), "urls.\n")
  # Find which content list holds the Births section on this page
  section <- ifelse(urls[i] == "", # special-case URL missing from the original post
           2,
           read_html(urls[i]) %>%
             html_nodes(xpath = '//*[@id="toc"]/ul') %>%
             html_text() %>%
             gsub(pattern = " Births.*$", replacement = "") %>%
             gsub(pattern = "^.*\n", replacement = ""))
  births[[i]][3] <- read_html(urls[i]) %>%
    html_nodes(xpath = paste0('//*[@id="mw-content-text"]/div/ul[', section, ']')) %>%
    html_text()
  births[[i]][2] <- gregexpr(pattern = "\\n", text = births[[i]][3]) %>%
    regmatches(x = births[[i]][3]) %>%
    lengths()
  Sys.sleep(time = 0.1) # be polite to the server
}

# Find outlier word frequencies -------------------------------------------
data(stop_words)
termset <- data.frame(word = as.character(), n = as.integer(),
                      day = as.character(), tf = as.numeric(),
                      stringsAsFactors = FALSE)
for(i in 1:ndays) {
  terms <- tibble(text = births[[i]][3]) %>%
    unnest_tokens(word, text, to_lower = FALSE) %>%
    anti_join(y = stop_words, by = "word") %>%
    count(word, sort = TRUE) %>%
    filter(!grepl(pattern = "[[:digit:]]", x = word)) %>% # drop tokens containing digits
    filter(!grepl(pattern = "[[:upper:]]", x = word)) %>% # drop capitalized tokens (mostly names)
    mutate(day = births[[i]][1], tf = n / sum(n))
  termset <- rbind(termset, terms)
}

full <- termset %>%
  count(word, sort = TRUE) %>%
  mutate(idf = log(x = ndays / nn, base = 10)) # idf uses log base 10

termset <- termset %>%
  left_join(y = full, by = "word") %>%
  mutate(tfidf = round(tf * idf, 3)) %>%
  arrange(desc(tfidf)) %>%
  mutate(dayN = as.Date(paste0(gsub(pattern = "_", replacement = " ", x = day), ", 2004"),
                        format = "%B %e, %Y"))

# Roll up to word level ---------------------------------------------------
wordset <- termset %>%
  group_by(word) %>%
  summarize(frequency = sum(n)) %>%
  ungroup() %>%
  arrange(desc(frequency))

# Display extremes --------------------------------------------------------
topT <- termset %>%
  group_by(day) %>%
  top_n(n = 1, wt = tfidf) %>%
  ungroup() %>%
  filter(n >= 2) %>%
  arrange(dayN) %>%
  print

topW <- wordset %>%
  top_n(n = 10, wt = frequency) %>%
  filter(frequency >= 1) %>%
  arrange(desc(frequency)) %>%
  print

# Finish ------------------------------------------------------------------
beep()

Saturday, March 4, 2017

Monopoly War

Can you imagine a future in which nations agree to an alternative to war? Disputes may never go away, but the resolution mechanisms might change. The mechanism's effect would have to be quite similar to the would-be effect of the forgone war for both nations to agree.

Presumably, some wars happen when each side believes it would win even though both cannot be correct. The inputs would be the quantities of soldiers, armaments, ammunition, technology, etc. Who has the higher total? Now, go back one step: who has an economy capable of generating more of these inputs? Now, go back another step: who has a society that is more conducive to superior economic output (think productivity, raw materials, skills, incentives)? Effectively, the better-equipped social system is likely to be the eventual winner. So, how to simulate this without bloodshed?

Allow countries to buy parts of territories adjacent to their borders; similarly, a country could pledge its own adjacent land as collateral when borrowing money. This needs much more thought, but could have some interesting outcomes.

Does Time of Possession Matter in Football?

Does time of possession matter in American football?

Your first thought might envisage a mighty Alabama team running on each play, gobbling up chunks of 4, 5, 10 yards each time. Moving the chains, wearing down the opponent's defense. But does this help them win?

One way to think about time of possession is like baseball's innings and their length. In baseball, each team gets up to nine half-innings to score some runs. In football, we might call a team's half-inning a drive. The team with the ball has a drive to score 3 or 7 points (usually). If a team eats up a lot of clock on a lengthy drive, both teams are left with less total time to possess the ball. If a big-play team scores a touchdown on a mere three plays, only a small amount of time is taken from both teams.

Effectively, the main effect of a drive's length is on the number of drives possible in a half. The better team should want as many drives as possible, since more drives reduce the expected impact of bad luck for itself and lucky breaks for the opponent; the variance in points scored decreases, making it more likely that the better team ends up ahead. Conversely, the worse team should want as few drives as possible. Anything can happen on one drive, but sustaining that luck over several drives is nearly impossible.
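A quick Monte Carlo sketch of the variance argument (the per-drive touchdown probabilities are invented; only the direction of the effect matters):

```python
import random

def win_prob(n_drives, p_good=0.45, p_bad=0.30, trials=20000, seed=1):
    """Estimate P(better team outscores worse team) given n_drives each.

    Each drive is worth 7 points with the team's (invented) touchdown
    probability, else 0; field goals and other details are ignored.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        good = sum(7 for _ in range(n_drives) if rng.random() < p_good)
        bad = sum(7 for _ in range(n_drives) if rng.random() < p_bad)
        wins += good > bad
    return wins / trials

# More drives -> the better team's edge shows up more reliably.
print(win_prob(4), win_prob(12))
```

With only four drives apiece, ties and lucky breaks mute the talent gap; at twelve drives the better team's win probability climbs substantially.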

This concept has long been acknowledged in basketball. Teams such as North Carolina know they are usually superior and play a high-tempo style so that a few lucky three-pointers against them won't sink their hopes. In football, some of the top teams stubbornly persist with the idea that a longer time of possession causes more wins; this analysis suggests they would be even better if they adapted.

The bottom line is that coaches of stronger football teams should emphasize a high-tempo style, while weaker teams should focus on tactics that lead to fewer drives. They can do this by recruiting and game-planning accordingly.

Thursday, December 1, 2016

Election Results vs. Benford's Law and the Return of City-States?

From Wikipedia:  Benford's law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small...
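The predicted frequencies are log10(1 + 1/d) for leading digit d. Here is a sketch checking them against a synthetic multiplicative growth series (not the election data, which is not reproduced here):

```python
import math

# Benford's predicted frequency for each leading digit.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    # First significant digit via the fractional part of log10(x).
    return int(10 ** (math.log10(x) % 1))

# A steadily compounding series is a classic Benford-conforming sample.
values, x = [], 1.0
for _ in range(2000):
    x *= 1.07
    values.append(x)

observed = {d: 0 for d in range(1, 10)}
for v in values:
    observed[leading_digit(v)] += 1

for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d] / len(values), 3))
```

Roughly 30% of the values lead with 1, just as the law predicts, and the frequencies fall off steadily toward 9.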

Benford's Law is a technique to screen for fraud in a large number of numeric records. Let's apply this to the 2016 U.S. election:

This chart of overall county vote totals by leading digit looks like it passes Benford's Law, but what about state-by-state results?

The states are sorted by number of counties, since Benford's law breaks down with fewer records. The last couple of rows of states have zig-zag lines, probably due to this. Four interesting states do not have low county counts:

  • Virginia (VA) looks a bit off [Clinton won this state by 5%]
  • Kentucky (KY), Iowa (IA), and Mississippi (MS) look a bit off [Trump won these states by 30%, 10%, and 19%, respectively]
  • The dataset does not have records for Alaska, by the way.

Basically, the states with the oddest results were typically so lopsided as to scarcely matter.

An interesting observation is that the states with more counties clearly tilt toward Trump, while the states with fewer counties clearly tilt toward Clinton. Why might that be? Maybe the number of counties in a state is a proxy for average population size per county. If true, then the Clinton vote should correlate with the county population size. Does it? Indeed, the correlation is 35%.

Perhaps a theme of the election was the divergence in preferences between high-population counties and the rest of the country. That suggests a potential solution: density-based laws, which might effectively be the return of city-states cooperating under one national flag. Individuals could pick their preferred city based on its basket of laws, obviating the imposition of a heavy blanket of laws from a national government that seesaws left and right, alienating a different half of its citizens each cycle.

Maybe the Constitution even effectively implemented this flavor of federalism in 1789, since state populations were then the size of present day cities. Just food for thought, let's not get carried away.

Wednesday, September 14, 2016

Estate Tax: Give + Receive

The Estate Tax (a/k/a Death Tax) inspires conflicting impulses. On one hand, it treats a lifelong spender's income differently from that of a lifelong saver just because there is leftover income in the bank (excellent exposition here from Greg Mankiw). On the other hand, it seems unfair for the lucky recipient of the funds to start off life with so much of a leg up.

How can this apparent contradiction of unfairness be resolved?

Notice that an inheritance is really a transaction with two parts:  a give and a receive. One potential resolution is to remove the tax on giving, but to enact a tax on receiving funds over a certain threshold (say, $50,000).

Under this scheme, the giver would know that a gift is more effective when less concentrated. Expect givers to give smaller amounts to more people. Be careful not to set the threshold too low, or givers would find it impractical to stay under it and might simply ignore its intended dispersal effect.
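A toy calculation of the mechanism (the $50,000 threshold is from above; the 25% rate and the estate size are invented for illustration):

```python
def receive_tax(amounts, threshold=50_000, rate=0.25):
    # Tax only the portion of each recipient's inheritance above the threshold.
    return sum(max(0, a - threshold) * rate for a in amounts)

# A $1,000,000 estate left to one heir vs. split among four:
print(receive_tax([1_000_000]))    # 237500.0 -- tax on $950,000 over threshold
print(receive_tax([250_000] * 4))  # 200000.0 -- each heir shelters $50,000
```

Splitting the estate shelters the threshold amount once per recipient, so dispersal lowers the total tax, which is the intended incentive.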

There may be some unintended consequences so consider this food for thought.

Thursday, August 11, 2016

Form Follows Function, A-N-G-E-?

Five letters
Starts with A-N-G-E

Could either be ANGEL or ANGER

For ANGEL, the G is soft
For ANGER, the G is hard

I'd post a picture of Samael, but don't want to give you nightmares.