Monday, December 29, 2014

First Day of the Month, Using R

Future-proofing is an important concept when designing automated reports. One thing that can get out of hand over time is when you accumulate so many periods of data that your charts start to look overcrowded. You can solve for this by limiting the number of periods to, say, 13 (I like 13 for monthly data, because you get a full year of data, plus you can compare the month-over-month of the most recent data).


You could approach this by limiting your data to anything in the last 390 days (30 days x 13 months), but your starting period will likely be cut off. You can fix this by finding the first day of the month for each record, then going back to get a full 13 months of data.

Here's a quick one-liner to get the first day of the month for a given date:  subtract the day of the month from the full date, then add 1.

# get some dates for the toy example:
df1 <- data.frame(YourDate=as.Date("2012-01-01")+seq(from=1,to=900,by=11))
# subtract the day of the month, then add 1 to land on the first of that month
df1$DayOne <- df1$YourDate - as.POSIXlt(df1$YourDate)$mday + 1
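From there, anchor the cutoff 12 months behind the most recent month start so that exactly 13 full months remain. Here is a sketch of that follow-up step (the object names below are mine, not from the original post):

# keep only the 13 most recent full months (a sketch; reuses df1 from above)
latest.start <- max(df1$DayOne)                                # first day of the most recent month
cutoff <- seq(latest.start, by = "-1 month", length.out = 13)[13]
df.recent <- df1[df1$DayOne >= cutoff, ]                       # nothing gets cut off mid-month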

Wednesday, December 24, 2014

Democracy, War, & Statistical Modeling

Democracy can be thought of as a technique for summarizing the prevailing sentiment of a population. Democracy may be admirable, but why is its usage so prevalent in modern governments? Why is the prevailing sentiment so important?

Perhaps prevailing sentiment is not directly important, but it closely tracks something else that does. What could be happening under the sheen of participatory democracy is an uglier calculation of who might win a war. This can explain why there seems to be more stability in democratic countries. Rebels are not magically restrained from attempting overthrow, but their failure becomes more predictable and, thus, less likely to be tried in the first place.

This theory hinges on how well the election results match the prevailing sentiment. Voters would need candidates that are not too far off from the population's currents of sentiment. Two problems:

  1. It would be impossible to have perfect, universal alignment between voters and candidates,
  2. It can also be difficult to quantify the distance between a voter's sentiment and that of each candidate.


To describe how each of these complications is addressed in practice, the discipline of statistics can be helpful:

Candidates can use the concept of k-means clustering to determine optimal positioning; less cynically, this could happen evolutionarily as the best-positioned candidate survives. A k of 2 seems to be typical because moving to a k of 3 most likely means the third candidate crowds into the larger cluster's territory, handing the advantage to the candidate holding the other cluster. These two candidates would tend to stake out positions that lead to a roughly equal split of the population.
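As a toy illustration (entirely invented numbers, not a model of any real electorate), two "candidates" positioned by k-means end up splitting a simulated sentiment space roughly in half:

# toy illustration: two "candidates" positioned by k-means in a made-up sentiment space
set.seed(42)
sentiment <- matrix(rnorm(2000), ncol = 2)    # each row: a voter's position on two issues
fit <- kmeans(sentiment, centers = 2)
fit$centers                                   # where the two candidates would stand
table(fit$cluster)                            # a roughly even split of the electorate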



Understanding the sentiment space is imperfect because sentiment changes and it is difficult to quantify into specific coordinates. The voting population's way of measuring that distance has an analogue in the random forest learning method. Each voter can be thought of as one of the decision trees, selecting a different subset of factors to consider. Each voter's output is aggregated, and the mode is selected as the system output, or the winning candidate in this case.
Random forests are known for being highly predictive, but not transparent. A professional pollster would probably agree wholeheartedly.


Monday, September 29, 2014

English & Context

Context is key in English. For example, the most interesting and least interesting thing I can think of is space bar.

Friday, September 26, 2014

FIFA 15 Analysis with R

Several months ago, I used R to analyze professional soccer players based on their attributes from the video game FIFA 14. Now that FIFA 15 is upon us, let's take a similar look.


FIFA 15 is a video game by EA Sports that mimics the experience of managing and playing for a soccer team. The game uses the likenesses and attributes of real players and this is part of the appeal. Although I rarely play video games, I am an avid soccer player and got curious about what could be learned by taking a closer look at the game-assigned player attributes.

www.futhead.com is a good source of FIFA 15 data. I scraped the html from the two hundred-plus pages of player attributes and then munged them into a useful table. Players have an overall rating and six specific stats: pace, shooting, passing, dribbling, defending, and physicality (which replaces last year's "heading"). Each player has an assigned position; I collapsed the positions into a “type” category (Defense, Midfield, Forward). The modern game effectively has four lines of players, but the position names still carry the naming conventions of the days of three-line formations, such as 4-4-2.

Player Positions and Position Types

Below is a chart summarizing player rating by position. The chart is sorted by ascending median rating. There is a great deal of spread, but generally the central midfielders and fullbacks are a bit lower than the wingers and wingbacks.
The collapsed view below corresponds with the above chart:  a slight bias as the position becomes more offensive-minded, but not dramatically different.

Modeling Player Ratings

I built a linear model for each position “type” and found R-squared values ranging from 89% to 96%. Each model used all six attributes as predictors with overall rating as the dependent variable. I speculate that player age/experience may account for the unexplained variance. Below is a look at the performance of each position type’s model. Both images visually support the models’ validity.
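As a rough sketch, one of these per-type models might be fit as follows (this reuses the attribs data frame built by the scraping code at the end of the post; it is not the exact modeling code):

# sketch of a per-position-type linear model (assumes the attribs data frame from the code below)
mid <- subset(attribs, Type == "Midfield")
fit.mid <- lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = mid)
summary(fit.mid)$r.squared   # one of the R-squared values summarized above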



Position Type Models

Each position type’s model naturally has a different mix of attribute weights. Below are charts showing these weights.
Forwards need to be good at shooting and this is expressed in the above graph. Interestingly, passing is actually negatively correlated with a forward’s rating. I can think of several great forwards I have played with that fit this category!
Midfield ratings are more balanced than those for defense, but dribbling and passing are the two most important skills for this position type.

Mismatches

Each player’s position is assigned in the database. This leads to the possibility that a player would theoretically be rated higher at a different position. I found some evidence of this. Below is a table of the top three mismatches by position.
  Best as Forward (assigned Defense):  Leonel Vangioni, Denis Epstein, Mohammed Qasim
  Best as Forward (assigned Midfield): Neymar, Theo Walcott, Taison
  Best as Midfield (assigned Defense): Marcelo, Dani Alves, David Alaba
  Best as Midfield (assigned Forward): Ricardo Alvarez, Sebastian Giovinco, Jérémy Ménez
  Best as Defense (assigned Midfield): Philipp Lahm, Sami Khedira, Sergio Busquets
  Best as Defense (assigned Forward):  Darius Charles, Andrew Weideman, Arkadiusz Gajewski


After comparing with last year's version of this mismatch table, it looks like Neymar and Lahm are hard to pigeonhole:  last year, the model thought they should both be midfielders; this year, it puts them back to forward and defense, respectively. Theo Walcott, once healthy, will want to show these statistics to Arsene Wenger in a bid to move from winger to forward.
As someone who has watched countless matches, I venture that the positions should be thought of in terms of where the player is expected to defend, not necessarily where he is expected to attack; it is common for wingers to cut inside and act like forwards once the opponent’s defenders are occupied by the true forwards. Likewise, the rise of offensive-minded wing backs can cause trouble for defenses that have to cope with a late runner joining the attack.

Model Outliers

The model does a good job of predicting a player’s overall rating, but there are a few exceptions.

  Better than predicted, at assigned position: Stefan Reinartz, Raoul Cedric Loé, José Cañas
  Better than predicted, at best position: Borja Fernández, Jesús Navas, Marco Rojas
  Worse than predicted, at assigned position: Murat Akin, Musharraf Al Ruwaili, Geir Ludvig Fevang
  Worse than predicted, at best position: Francesco Totti, Ian Harte, Ruslan Adzhindzhal
The better-than-predicted players must have magic not captured in the regular six attributes; one might call this the X Factor. Unbelievably, the better-than-predicted group at their assigned position did not change from last year! There must be something about these players beyond what their underlying attributes suggest. Game developer friends, perhaps? They are all defensive midfielders, but I am not sure what other commonalities they share.

Clustering

There is some evidence that the player attributes lead to a few common clusters. Below is a chart showing the within-groups sum of squares (WSS) for a given cluster count. This is a bit of visual confirmation that there are three or four general styles of player; past that, the WSS does not change as much.
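A sketch of how such an elbow chart can be produced (an assumed reconstruction, not the original code):

# sketch of the WSS elbow chart (assumed reconstruction; uses the attribs data frame from below)
stats <- na.omit(attribs[attribs$Type != "Keeper", c("PAC","SHO","PAS","DRI","DEF","PHY")])
wss <- sapply(1:10, function(k) kmeans(scale(stats), centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within-groups sum of squares")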

Player Tree

Finally, I clustered the top field players (overall rating at least 84) hierarchically. What developed was an insightful way to visualize how different players are stylistically related to each other.
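A sketch of that hierarchical clustering step (again an assumed reconstruction; the ggdendro package loaded in the scraping code offers a ggplot2 rendering as well):

# sketch of the player tree (assumed reconstruction; the rating cutoff follows the text above)
top <- na.omit(attribs[attribs$RAT >= 84 & attribs$Type != "Keeper", ])
hc <- hclust(dist(scale(top[, c("PAC","SHO","PAS","DRI","DEF","PHY")])))
plot(hc, labels = top$Name, cex = 0.6)   # or ggdendro::ggdendrogram(hc)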


Football / Soccer's very own family tree. The most interesting leaves are the players positionally mixed in with other positions.  Philipp Lahm resurfaces as the midfielder who should be a defender, or vice-versa.  Maybe Germany should consider moving Mats Hummels to midfield and restoring Lahm to defense. Sounds crazy, but this same analysis pointed to moving Lahm to midfield even though many thought him to be the best right back in the world and, sure enough, Pep moved him to midfield. I'm sure Pep had been thinking about that move far earlier.

Santi Cazorla and Vincent Kompany look to be the furthest apart. That sounds spot on!


Below is the code for screen-scraping. Be nice to others' sites when scraping.

 ### Prepare ####  
 pkg <- c("cluster","fpc","digest","ggplot2","foreign","ggdendro","reshape2")  
 inst <- pkg %in% installed.packages()  
 if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])  
 lapply(pkg,library,character.only=TRUE)  
 rm(inst,pkg)  
 set.seed(4444)  
 ### Control panel for screen-scraping ####
 sleep.time <- 0.01      # seconds to pause between page downloads
 pagecount <- 237        # number of player-list pages on the site
 pc.ignore <- 0          # full pages to skip at the end, if any
 names.page <- 48        # players listed on each full page
 names.lastpage <- 38    # players listed on the final, partial page
 name.gaplines <- 71     # html lines between consecutive players' entries
 namLine1 <- 1723        # html line where the first player's name appears
 posLine1 <- 1723+5      # fixed offsets from the name line to position and each attribute
 RATLine1 <- 1723+10
 PACLine1 <- 1723+15
 SHOLine1 <- 1723+20
 PASLine1 <- 1723+25
 DRILine1 <- 1723+30
 DEFLine1 <- 1723+34
 PHYLine1 <- 1723+37
 ### Create custom urls to scrape ####  
 pageSeq <- seq(from=1,to=pagecount,by=1)  
 urls.df <- data.frame(pageSeq)  
 for(i in 1:length(urls.df$pageSeq)){  
  urls.df$url[i] <- paste0("http://www.futhead.com/15/players/?page=",  
               urls.df$pageSeq[i],  
               "&sort_direction=desc")  
 }  
 ### Scrape html from custom urls ####  
 pages <- as.list("na")  
 for(j in 1:length(urls.df$pageSeq)){  
  pages[[j]] <- urls.df$pageSeq[j]  
 }  
 for(j in 1:length(urls.df$pageSeq)){  
  download.file(urls.df$url[j],destfile=paste0(urls.df$pageSeq[j],".txt"))  
  Sys.sleep(sleep.time)  
 }  
 ### Identify which lines store player statistics ####  
 namSeq <- seq(from=namLine1,by=name.gaplines,length.out=names.page)  
 posSeq <- seq(from=posLine1,by=name.gaplines,length.out=names.page)  
 RATSeq <- seq(from=RATLine1,by=name.gaplines,length.out=names.page)  
 PACSeq <- seq(from=PACLine1,by=name.gaplines,length.out=names.page)  
 SHOSeq <- seq(from=SHOLine1,by=name.gaplines,length.out=names.page)  
 PASSeq <- seq(from=PASLine1,by=name.gaplines,length.out=names.page)  
 DRISeq <- seq(from=DRILine1,by=name.gaplines,length.out=names.page)  
 DEFSeq <- seq(from=DEFLine1,by=name.gaplines,length.out=names.page)  
 PHYSeq <- seq(from=PHYLine1,by=name.gaplines,length.out=names.page)  
 ### Create empty dataframe for storing player stats  
 attribs <- data.frame(matrix(nrow=names.page*(pagecount-1)+names.lastpage,ncol=9))  
 colnames(attribs) <- c("Name","Position","RAT","PAC","SHO","PAS","DRI","DEF","PHY")  
 ### Store lines from full pages containing player stats to dataframe ####  
 for(m in 1:(pagecount-1-pc.ignore)){  
  page <- readLines(paste0(urls.df$pageSeq[m],".txt"))  
  for(k in 1:names.page){  
   n <- (m-1)*names.page+k  
   attribs$Name[n] <- page[namSeq[k]]  
   attribs$Position[n] <- page[posSeq[k]]  
   attribs$RAT[n] <- page[RATSeq[k]]  
   attribs$PAC[n] <- page[PACSeq[k]]  
   attribs$SHO[n] <- page[SHOSeq[k]]  
   attribs$PAS[n] <- page[PASSeq[k]]  
   attribs$DRI[n] <- page[DRISeq[k]]  
   attribs$DEF[n] <- page[DEFSeq[k]]  
   attribs$PHY[n] <- page[PHYSeq[k]]  
  }  
 }  
 ### Store lines from partial last page containing player stats to dataframe ####  
 pagelast <- readLines(paste0(urls.df$pageSeq[pagecount],".txt"))  
 for(p in 1:names.lastpage){  
  q <- (pagecount-1)*names.page+p  
  attribs$Name[q] <- pagelast[namSeq[p]]  
  attribs$Position[q] <- pagelast[posSeq[p]]  
  attribs$RAT[q] <- pagelast[RATSeq[p]]  
  attribs$PAC[q] <- pagelast[PACSeq[p]]  
  attribs$SHO[q] <- pagelast[SHOSeq[p]]  
  attribs$PAS[q] <- pagelast[PASSeq[p]]  
  attribs$DRI[q] <- pagelast[DRISeq[p]]  
  attribs$DEF[q] <- pagelast[DEFSeq[p]]  
  attribs$PHY[q] <- pagelast[PHYSeq[p]]  
 }  
 ### Remove html wrapped around player stats in each line ####  
 attribs$Name <- gsub("^.*<span class=\"name\">","",attribs$Name)  
 attribs$Name <- gsub("</span>.*$","",attribs$Name)  
 attribs$Name <- gsub("^\\s+|\\s+$","",attribs$Name)  
 attribs$Position <- gsub("^ *","",attribs$Position)  
 attribs$Position <- gsub("^\\s+|\\s+$","",attribs$Position)  
 attribs$RAT <- gsub("^.*<span>","",attribs$RAT)  
 attribs$RAT <- gsub("</span>.*$","",attribs$RAT)  
 attribs$RAT <- gsub("^\\s+|\\s+$","",attribs$RAT)  
 attribs$PAC <- gsub("^.*<span class=\"attribute\">","",attribs$PAC)  
 attribs$PAC <- gsub("</span>.*$","",attribs$PAC)  
 attribs$PAC <- gsub("^\\s+|\\s+$","",attribs$PAC)  
 attribs$SHO <- gsub("^.*<span class=\"attribute\">","",attribs$SHO)  
 attribs$SHO <- gsub("</span>.*$","",attribs$SHO)  
 attribs$SHO <- gsub("^\\s+|\\s+$","",attribs$SHO)  
 attribs$PAS <- gsub("^.*<span class=\"attribute\">","",attribs$PAS)  
 attribs$PAS <- gsub("</span>.*$","",attribs$PAS)  
 attribs$PAS <- gsub("^\\s+|\\s+$","",attribs$PAS)  
 attribs$DRI <- gsub("^.*<span class=\"attribute\">","",attribs$DRI)  
 attribs$DRI <- gsub("</span>.*$","",attribs$DRI)  
 attribs$DRI <- gsub("^\\s+|\\s+$","",attribs$DRI)  
 attribs$DEF <- gsub("^.*<span class=\"attribute\">","",attribs$DEF)  
 attribs$DEF <- gsub("</span>.*$","",attribs$DEF)  
 attribs$DEF <- gsub("^\\s+|\\s+$","",attribs$DEF)  
 attribs$PHY <- gsub("^.*<span class=\"attribute\">","",attribs$PHY)  
 attribs$PHY <- gsub("</span>.*$","",attribs$PHY)  
 attribs$PHY <- gsub("^\\s+|\\s+$","",attribs$PHY)  
 ### Remove statistics from duplicated players ####  
 attribs <- attribs[!(attribs$Name=="Cristiano Ronaldo"&attribs$RAT=="93"),]  
 attribs <- attribs[!duplicated(attribs$Name),]  
 rownames(attribs) <- NULL  
 ### Clean up foreign characters in names ####  
 Encoding(attribs$Name) <- "UTF-8"  
 attribs$Name <- iconv(attribs$Name,"UTF-8","UTF-8",sub='')  
 ### Create general position type ####  
 attribs$Type[attribs$Position %in% c("CF","LF","RF","ST")] <- "Forward"  
 attribs$Type[attribs$Position %in% c("LM","RM","CDM","CM","CAM","LW","RW")] <- "Midfield"  
 attribs$Type[attribs$Position %in% c("LB","RB","CB","LWB","RWB")] <- "Defense"  
 attribs$Type[attribs$Position %in% c("GK")] <- "Keeper"  
 ### Change each stat to the appropriate data type ####  
 attribs$Name <- as.character(attribs$Name)  
 attribs$Position <- as.factor(attribs$Position)  
 attribs$RAT <- as.integer(attribs$RAT)  
 attribs$PAC <- as.integer(attribs$PAC)  
 attribs$SHO <- as.integer(attribs$SHO)  
 attribs$PAS <- as.integer(attribs$PAS)  
 attribs$DRI <- as.integer(attribs$DRI)  
 attribs$DEF <- as.integer(attribs$DEF)  
 attribs$PHY <- as.integer(attribs$PHY)  
 attribs$Type <- ordered(attribs$Type,levels=c("Forward","Midfield","Defense","Keeper"))  

Sunday, August 17, 2014

A Look at Random Seeds in R... Or: “85, why can’t you be more like 548?”

Have you ever wondered whether the set.seed() function in R has any quirkiness? This analysis was inspired by a Stack Overflow posting by Wolfgang, and I incorporate some of his code.

For each seed (1 through 1,000, for this analysis), I took the mean and standard deviation of the first 1,000 random numbers. I then computed the percentage of the resulting density function that overlaps the standard normal curve, as well as its distance from the target point (mean 0, standard deviation 1).

With the resulting points, I found the most interesting ones based on the min/max mean, the min/max standard deviation, the maximum distance from the target point within each quadrant, the overall maximum distance, and the point closest to the target.
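For the curious, here is a rough sketch of the per-seed summary. It is an assumed reconstruction rather than the original code (in particular, the choice of rnorm and the exact distance formula are my guesses):

# assumed reconstruction of the per-seed summary (not the original code)
seed.summary <- function(seed, n = 1000) {
  set.seed(seed)
  x <- rnorm(n)                               # the first n random numbers for this seed
  mu <- mean(x)
  s <- sd(x)
  # overlap of N(mu, s) with the standard normal: integrate the pointwise minimum of the two densities
  intersect <- integrate(function(z) pmin(dnorm(z, mu, s), dnorm(z)), -Inf, Inf)$value
  data.frame(seed = seed, mu = mu, sd = s,
             dist = sqrt(mu^2 + (s - 1)^2),   # distance from the target point (mean 0, sd 1)
             intersect = intersect)
}
seed.results <- do.call(rbind, lapply(1:1000, seed.summary))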

Below is the summary of interesting points.

##      type seed     mu    sd  dist intersect
## 1  mu_min   85 -0.110 1.008 0.110     0.956
## 2  mu_max  501  0.104 1.002 0.104     0.959
## 3  sd_min  180 -0.005 0.921 0.079     0.960
## 4  sd_max  168  0.002 1.065 0.065     0.969
## 5      q1  501  0.104 1.002 0.104     0.959
## 6      q2   85 -0.110 1.008 0.110     0.956
## 7      q3  713 -0.075 0.935 0.100     0.957
## 8      q4  394  0.090 0.988 0.091     0.964
## 9     out   85 -0.110 1.008 0.110     0.956
## 10     in  548  0.000 1.000 0.000     1.000
## 11    sim  548  0.000 1.000 0.000     1.000
## 12   diff   85 -0.110 1.008 0.110     0.956


Below is a chart showing the overlap of the most similar point and a chart showing the overlap of the least similar point. Thanks again to Wolfgang for this code chunk.
Top:  Seed 548; Bottom:  Seed 85
As you can see, even the worst seed has an overlap of 95.6%. Seed 548 is almost perfect. However, since some seeds could cause issues, it might be good practice to pick different seeds over time. You could throw a dart, or a manager could assign the seed as part of the requirements document. This practice might mitigate the risk of an analyst’s intentionally biased seed selection.

Sunday, May 18, 2014

R is short for SSIS

Data scientists often identify a need to join data from different, unlinked servers. One standard tool for accomplishing this is an SSIS package that consolidates the data onto one of the servers. For the analyst who wants to keep everything in one file for simplicity and repeatability, there is another option: the RODBC package (authored by Brian Ripley).

To avoid issues of uninstalled packages, I use this general method.
pkg <- c("RODBC", "ggplot2")
inst <- pkg %in% installed.packages()
if (length(pkg[!inst]) > 0) install.packages(pkg[!inst])
lapply(pkg, library, character.only = TRUE)
First query data from the first server:
channel1 <- odbcDriverConnect(connection = "Driver={SQL Server};Server=yourserver;Database=yourdatabase;Trusted_Connection=Yes;")
query1 <- "select * from customers where contractID IS NOT NULL"
data1 <- sqlQuery(channel1, query1)
odbcClose(channel1)
Then query data from the second server:
channel2 <- odbcDriverConnect(connection = "Driver={SQL Server};Server=yourserver;Database=yourdatabase;Trusted_Connection=Yes;")
query2 <- "select * from products"
data2 <- sqlQuery(channel2, query2)
odbcClose(channel2)
Join (merge) the two resulting dataframes:
data.merge <- merge(data1, data2, by = "InvoiceID")
Do something interesting and save the results:
p <- ggplot(data = data.merge) + something_worth_graphing...
p
ggsave(filename = "NeatChart.pdf", plot = p)
Now you will have one R file that pulls all of the data you need, processes it, and saves the output.
Bonus idea: The tables to query may be quite large. Peel out the limiting factor (such as a list of customer IDs) and use the paste0 command to assemble a dynamic second query (using the WHERE clause).
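For example, a second query restricted to the customers returned by the first one might be assembled like this (a sketch; the CustomerID column name is hypothetical):
# build a dynamic WHERE clause from the first result set (CustomerID is a made-up column name)
ids <- unique(data1$CustomerID)
query2 <- paste0("select * from products where CustomerID in (",
                 paste(ids, collapse = ", "), ")")
# then run sqlQuery(channel2, query2) as before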

Wednesday, April 23, 2014

Dislike Facebook's News Feed

Facebook recently changed its News Feed to display certain postings based on factors besides chronological posting time. This is a major negative for users. A major source of Facebook's value was that one could keep up with all of one's friends. Now, since the posts with the most activity are pushed toward the top, one gets information in proportion to who is loudest, most extroverted, or most friended. However, quality of content does not necessarily correlate with extroversion. Additionally, a major input to a specific post's activity is the number of likes. Serious postings and postings revealing bad news are much less likely to get liked because "liking" is an inappropriate response. Again, Facebook's algorithm prioritizes easily likable news to the detriment of other posts. Bad move, Facebook.

Friday, April 18, 2014

Insight from FIFA 14’s Player Attributes (Using R)

FIFA 14 is a video game by EA Sports that mimics the experience of managing and playing for a soccer team. The game uses the likenesses and attributes of real players and this is part of the appeal. Although I rarely play video games, I am an avid soccer player and got curious about what could be learned by taking a closer look at the game-assigned player attributes.

www.futhead.com is a good source of FIFA 14 data. I scraped the html from the two hundred-plus pages of player attributes and then munged them into a useful table. Players have an overall rating and they have six specific stats (pace, shooting, passing, dribbling, defending, and heading). Each player has an assigned position; I collapsed the positions into a “type” category (Defense, Midfield, Forward). The modern game effectively has four lines of players but the position names still carry the naming conventions of the days of the three line formations, such as 4-4-2.

Player Positions and Position Types

Below is a chart summarizing player rating by position. The chart is sorted by ascending median rating. There is a great deal of spread, but generally the central midfielders and fullbacks are a bit lower than the wingers and wingbacks.
The collapsed view below corresponds with the above chart:  a slight bias as the position becomes more offensive-minded.

Modeling Player Ratings

I built a linear model for each position “type” and found R-squared values ranging from 88% to 99%. Each model used all six attributes as predictors with overall rating as the dependent variable. I speculate that player age/experience may account for the unexplained variance. Below is a look at the performance of each position type’s model. Both images visually support the models’ validity.

Position Type Models

Each position type’s model naturally has a different mix of attribute weights. Below are charts showing these weights.
Forwards need to be good at shooting and this is expressed in the above graph. Interestingly, passing is actually negatively correlated with a forward’s rating. I can think of several great forwards I have played with that fit this category!
Midfield ratings are more balanced than those for defense, but dribbling and passing are the two most important skills for this position type.
It is clear that defending is far and away the most important skill for defenders; this is less an insight than an indictment of the game developers for not breaking defense into its own attributes such as tackling and positioning.
Goalkeepers are specialists so these skills are not as directly relevant, but I included them for completeness.

Mismatches

Each player’s position is assigned in the database. This leads to the possibility that a player would theoretically be rated higher at a different position. I found some evidence of this. Below is a table of the top three mismatches by position.
  Best as Forward (assigned Defense):  Craig Gardner, Guillaume Gillet, Steven Reed
  Best as Forward (assigned Midfield): Cristiano Ronaldo, Arjen Robben, Thomas Müller
  Best as Midfield (assigned Defense): Philipp Lahm, Dani Alves, Marcelo
  Best as Midfield (assigned Forward): Neymar, Antonio Cassano, Sebast. Giovinco
  Best as Defense (assigned Midfield): Yaya Touré, Sergio Busquets, Xabi Alonso
  Best as Defense (assigned Forward):  Karim Guédé, Lee McCulloch, Mikael Dahlberg

Defenders better as midfielders are an impressive crew:  Lahm, Alves, and Marcelo are three of the top players. Midfielders better as defenders are known for their holding prowess and their enforcer reputation. Midfielders better as forwards are often impressive wingers who can use their speed as a weapon in the open wide spaces. Forwards better as midfielders are represented by two Italians and Neymar, which is surprising since he is viewed as a potent striker. As someone who has watched countless matches, I venture that the positions should be thought of in terms of where the player is expected to defend, not necessarily where he is expected to attack; it is common for wingers to cut inside and act like forwards once the opponent’s defenders are occupied by the true forwards. Likewise, the rise of offensive-minded wing backs can cause trouble for defenses that have to cope with a late runner joining the attack.

Model Outliers

The model does a good job of predicting a player’s overall rating, but there are a few exceptions.

  Better than predicted, at assigned position: Raoul Cedric Loé, Stefan Reinartz, Cañas
  Better than predicted, at best position: Mesut Özil, Franck Ribéry, Luca Toni
  Worse than predicted, at assigned position: Greg Tempest, Musharraf Al Ruwaili, Jacob Shoop
  Worse than predicted, at best position: Nicholas Gotfredsen, Don Anding, Josh Ford
Most of these players are lesser known, with the exception of the better-than-predicted group at their best position (Özil, Ribéry, and Toni). These players must have magic not captured in the regular six attributes; one might call this the X Factor.

Clustering

There is some evidence that the player attributes lead to a few common clusters. Below is a chart showing the within-groups sum of squares (WSS) for a given cluster count. This is a bit of visual confirmation that there are three or four general styles of player; past that, the WSS does not change as much.

Player Tree

Finally, I clustered the top field players (overall rating at least 85) hierarchically. What developed was an insightful way to visualize how different players are stylistically related to each other.
Football / Soccer's very own family tree. The forward Gareth Bale is mixed in between the midfielders and defenders. The forward Lionel Messi is mixed in with the midfielders. These are two of the most talked about players today. Maybe being mixed in with players of different positions in the tree is predictive of being an important, interesting player. If so, keep your eyes on Thomas Müller.

Saturday, April 5, 2014

Should e-Commerce Ad Spend per Sale Decrease?

Moe's Bar Graph?
You run an e-commerce website with one durable product. With a proud smile, your marketing spend guru, Spike, shares a chart showing how ad spend per unit sold is steadily dropping. He gets confused when he sees a look of frustration on your face. What's the problem? This seems like 100% good news.

Well, Spike is comparing his results to prior months. Seems reasonable. What happens when you compare his results to the optimal scenario, though?

First, let's flesh out what "optimal" may mean. You know, with certainty, the following:

  1. the list of people who will buy your product,
  2. the marketing mix strategy (content, site, cadence) that will trigger a purchase from each of these people, and
  3. the cost of this ad strategy for each customer.
Knowing all of this and given a monthly budget to spend on marketing, Spike should target sales from the people with the cheapest required marketing mix. Next month, he should target sales from the REMAINING people with the cheapest required marketing mix. These remaining people are, by definition, not as cheap to market to as the first set. Continuing this process, Spike's results would actually show an INCREASING cost per sale!
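A tiny simulation makes the point (all of the numbers below are invented):

# toy simulation: optimal targeting buys the cheapest prospects first (all numbers invented)
set.seed(1)
cost.per.customer <- sort(rexp(1200, rate = 1/50))           # cheapest required marketing mix first
monthly.budget <- 5000
month <- ceiling(cumsum(cost.per.customer) / monthly.budget) # month in which each sale is bought
sales.per.month <- table(month)
round(monthly.budget / sales.per.month, 2)                   # ad spend per sale by month

Setting aside the final, partially funded month, the cost per sale climbs steadily because each month's budget buys sales from an ever more expensive pool.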

Thursday, March 27, 2014

Assign n Email Addresses to x Cells, Intrinsically (Part II)

Part I showed the concept and general technique of a method of assigning n email addresses to x cells pseudo-randomly, without the need for maintaining a log of each assignment.

The earlier post considered the basic case of each cell being assigned approximately the same quantity of email addresses. In practice, cell sizes often vary. Below is a technique that works well when the total number of email addresses needed is less than the product of the cell sizes' greatest common divisor and the average email address length. For example, cell sizes are 500, 500, & 1,000; so 2,000 < 500*25ish.

Assign n Email Addresses to x Cells, Intrinsically; Part 2 (Variable Cell Sizes)

Sample Use Case:
Marketing requests that an email address list be divided randomly into a given number of cells so that each cell would receive a different version of copy.
Below is a technique that takes n email addresses and pseudo-randomly assigns each to one of x cells. The advantage of this method is that the user does not need to maintain a log of each email address's assigned cell since the cell assignment can be reproduced at any time.
This technique is extended from Part 1 to accommodate cells of varying sizes.
First, load in a randomly generated list of email addresses.
set.seed(4444)
library(numbers)

fict.email <- function(n = 5) {
    fict.emails <- data.frame(email = NA)
    for (i in 1:n) {
        fict.emails[i, "email"] <- paste0(paste(sample(letters, sample(3:25, 
            1, TRUE), TRUE), collapse = ""), "@", paste(sample(letters, sample(3:15, 
            1, TRUE), TRUE), collapse = ""), ".", paste(sample(letters, sample(2:3, 
            1, TRUE), TRUE), collapse = ""))
    }
    fict.emails
}
emails <- sample(fict.email(10000))
Next, assign the cell sizes.
cell.sizes <- c(500, 500, 1500, 2000)
Get the number of characters of each email address; this is important because this will remain constant for each entry. Next, find the greatest common divisor of the cell sizes. Use the modulo function to calculate the remainders.
cells <- length(cell.sizes)
cell.gcd <- mGCD(cell.sizes)
em.len <- nchar(emails$email)
# shift the remainders to 1..(sum/gcd) so every value falls inside one of the ranges below
em.mod <- em.len%%(sum(cell.sizes)/cell.gcd) + 1
Combine mod values into cell numbers.
ranges <- data.frame(start = 0, end = 0)
for (j in 1:cells) {
    ranges[j, "start"] <- (sum(cell.sizes[1:j]) - cell.sizes[j])/cell.gcd + 
        1
    ranges[j, "end"] <- sum(cell.sizes[1:j])/cell.gcd
}

emails$cell <- NA  # initialize the cell column before filling it in
for (k in 1:cells) {
    emails$cell[em.mod >= ranges$start[k] & em.mod <= ranges$end[k]] <- k
}
Split the data frame into the required cell sizes. These lists are the final output.
email.lists <- split(emails, emails$cell)
for (l in 1:cells) {
    email.lists[[l]] <- email.lists[[l]][[1]][1:cell.sizes[l]]
}
Now each email address has been assigned to a specific cell.
Each email address will always belong to the current cell because the number of characters it has will not change.
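To illustrate, a quick check (a sketch reusing the objects above) recovers an address's cell from nothing but its character count:
# recompute the cell for a single address from its length alone
addr <- email.lists[[1]][1]                               # any address from cell 1
mod.addr <- nchar(addr)%%(sum(cell.sizes)/cell.gcd) + 1
which(ranges$start <= mod.addr & mod.addr <= ranges$end)  # returns 1: the same cell, every time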