Friday, September 26, 2014

FIFA 15 Analysis with R

Several months ago, I used R to analyze professional soccer players based on their attributes from the video game FIFA 14. Now that FIFA 15 is upon us, let's take a similar look.

FIFA 15 is a video game by EA Sports that mimics the experience of managing and playing for a soccer team. The game uses the likenesses and attributes of real players, and this is part of the appeal. Although I rarely play video games, I am an avid soccer player and got curious about what could be learned by taking a closer look at the game-assigned player attributes.

I found a good source of FIFA 15 data online. I scraped the html from the two hundred-plus pages of player attributes and then munged them into a useful table. Players have an overall rating and six specific stats: pace, shooting, passing, dribbling, defending, and physicality (replacing last year's "heading"). Each player has an assigned position; I collapsed the positions into a “type” category (Defense, Midfield, Forward). The modern game effectively has four lines of players, but the position names still carry the conventions of the days of three-line formations, such as 4-4-2.

Player Positions and Position Types

Below is a chart summarizing player rating by position. The chart is sorted in ascending order of median rating. There is a great deal of spread, but generally the center midfielders and fullbacks are a bit lower than the wingers and wingbacks.
The collapsed view below corresponds with the above chart:  a slight bias as the position becomes more offensive-minded, but not dramatically different.
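The two summaries described above can be sketched in a few lines of base R. This is only an illustration: the toy `attribs` rows below are made up, and the `type.map` lookup is a hypothetical shorthand for the position-to-type mapping used in the scraping code at the end of this post.

```r
# Toy stand-in for the scraped `attribs` table (values are made up).
attribs <- data.frame(
  Position = c("ST", "ST", "CM", "CM", "CB", "CB", "LW", "LW"),
  RAT      = c(80, 76, 70, 72, 74, 78, 79, 81)
)

# Hypothetical lookup that collapses positions into position types.
type.map <- c(ST = "Forward", CM = "Midfield", LW = "Midfield", CB = "Defense")
attribs$Type <- unname(type.map[as.character(attribs$Position)])

# Median overall rating per position, sorted ascending like the chart.
med.pos <- sort(tapply(attribs$RAT, attribs$Position, median))
print(med.pos)

# Collapsed view: median overall rating per position type.
med.type <- tapply(attribs$RAT, attribs$Type, median)
print(med.type)
```

Passing `Position` through `reorder(Position, RAT, FUN = median)` before plotting is one way to get a boxplot sorted by median.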

Modeling Player Ratings

I built a linear model for each position “type” and found R-squared values ranging from 89% to 96%. Each model used all six attributes as predictors with overall rating as the dependent variable. I speculate that player age/experience may account for the unexplained variance. Below is a look at the performance of each position type’s model. Both images visually support the models’ validity.
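For readers who want to try the per-type modeling, here is a minimal sketch under stated assumptions. The data are simulated (the weights in `w` are invented purely to generate plausible ratings), so the R-squared values it prints are not the ones reported above; only the column names match the scraped table.

```r
set.seed(1)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
n <- 300

# Simulated stand-in for the scraped `attribs` table; all numbers are invented.
attribs <- as.data.frame(matrix(sample(40:90, n * 6, replace = TRUE), ncol = 6,
                                dimnames = list(NULL, stats)))
attribs$Type <- rep(c("Forward", "Midfield", "Defense"), each = n / 3)

# Invented weights per type, used only to generate an overall rating.
w <- list(Forward  = c(0.15, 0.45, 0.05, 0.20, 0.00, 0.15),
          Midfield = c(0.10, 0.15, 0.30, 0.30, 0.10, 0.05),
          Defense  = c(0.10, 0.00, 0.10, 0.05, 0.50, 0.25))
attribs$RAT <- NA_real_
for (ty in names(w)) {
  rows <- attribs$Type == ty
  attribs$RAT[rows] <- drop(as.matrix(attribs[rows, stats]) %*% w[[ty]]) +
    rnorm(sum(rows), sd = 1.5)
}

# One linear model per position type: overall rating on all six attributes.
models <- lapply(split(attribs, attribs$Type), function(d) {
  lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = d)
})

# R-squared of each model.
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(round(r2, 3))
```

Plotting `fitted(models$Forward)` against the actual ratings is a quick visual check of fit.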

Position Type Models

Each position type’s model naturally has a different mix of attribute weights. Below are charts showing these weights.
Forwards need to be good at shooting, and the graph above shows it. Interestingly, passing actually carries a negative weight in the forward model. I can think of several great forwards I have played with who fit this category!
Midfield ratings are more balanced than those of defense, but dribbling and passing are the two most important skills for this position type.
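The attribute weights discussed above are simply the fitted coefficients of each model. The sketch below shows how they could be read off for one position type; the forward data and the "true" weights in `true.w` (including the small negative passing weight) are invented for illustration and are not the values behind the charts.

```r
set.seed(2)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")

# Invented forward data: 200 players, six attributes between 40 and 90.
fw <- as.data.frame(matrix(sample(40:90, 200 * 6, replace = TRUE), ncol = 6,
                           dimnames = list(NULL, stats)))

# Invented generating weights, with a small negative one for passing.
true.w <- c(PAC = 0.15, SHO = 0.50, PAS = -0.05, DRI = 0.25, DEF = 0.00, PHY = 0.15)
fw$RAT <- drop(as.matrix(fw[, stats]) %*% true.w) + rnorm(200, sd = 1)

fit <- lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = fw)

# Attribute weights, largest first (the intercept is left out).
weights <- sort(coef(fit)[stats], decreasing = TRUE)
print(round(weights, 2))
```

A negative coefficient here means that, holding the other five attributes fixed, a higher passing score goes with a lower overall rating.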


Each player’s position is assigned in the database. This opens the possibility that a player would theoretically rate higher in a different position. I found some evidence of this. Below is a table of the top three mismatches by position.
Best Rating
Leonel Vangioni
Denis Epstein
Mohammed Qasim
Theo Walcott

Best Rating
Dani Alves
David Alaba

Ricardo Alvarez
Sebastian Giovinco
Jérémy Ménez
Best Rating

Philipp Lahm
Sami Khedira
Sergio Busquets
Darius Charles
Andrew Weideman
Arkadiusz Gajewski



After comparing with last year's version of this mismatch table, it looks like Neymar and Lahm are hard to pigeonhole:  last year, the model thought they should both be midfielders; this year, it puts them back to forward and defense, respectively. Theo Walcott, once healthy, will want to show these statistics to Arsene Wenger in a bid to move from winger to forward.
As someone who has watched countless matches, I venture that the positions should be thought of in terms of where the player is expected to defend, not necessarily where he is expected to attack; it is common for wingers to cut inside and act like forwards once the opponent’s defenders are occupied by the true forwards. Likewise, the rise of the offensive-minded wingbacks can cause trouble for defenses that have to cope with a late runner joining the attack.
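One plausible way to find such mismatches is to score every player with every position type's model and compare the best score against the score at the assigned type. The sketch below does that on simulated data; the weights in `w` are invented, and the helper columns `BestType`, `BestRAT`, `AtAssigned` and `Gain` are hypothetical names, not necessarily what was used for the table above.

```r
set.seed(3)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
n <- 300

# Simulated players; all numbers are invented.
attribs <- as.data.frame(matrix(sample(40:90, n * 6, replace = TRUE), ncol = 6,
                                dimnames = list(NULL, stats)))
attribs$Type <- rep(c("Forward", "Midfield", "Defense"), each = n / 3)

# Invented weights (one column per type), used only to generate ratings.
w <- cbind(Forward  = c(0.15, 0.45, 0.05, 0.20, 0.00, 0.15),
           Midfield = c(0.10, 0.15, 0.30, 0.30, 0.10, 0.05),
           Defense  = c(0.10, 0.00, 0.10, 0.05, 0.50, 0.25))
scores <- as.matrix(attribs[, stats]) %*% w
own <- cbind(seq_len(n), match(attribs$Type, colnames(w)))
attribs$RAT <- scores[own] + rnorm(n)

# Fit one model per assigned type.
models <- lapply(split(attribs, attribs$Type), function(d) {
  lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = d)
})

# Predicted rating of every player under every type's model (one column per type).
pred <- sapply(models, predict, newdata = attribs)

attribs$BestType   <- colnames(pred)[max.col(pred, ties.method = "first")]
attribs$BestRAT    <- apply(pred, 1, max)
attribs$AtAssigned <- pred[cbind(seq_len(n), match(attribs$Type, colnames(pred)))]
attribs$Gain       <- attribs$BestRAT - attribs$AtAssigned

# Players whose best type differs from the assigned one, largest gain first.
mismatch <- attribs[attribs$BestType != attribs$Type, ]
mismatch <- mismatch[order(-mismatch$Gain), ]
print(head(mismatch[, c("Type", "BestType", "AtAssigned", "BestRAT", "Gain")], 3))
```

Because the simulated attributes are unrelated to the assigned type, this toy data produces far more mismatches than real data would.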

Model Outliers

The model does a good job of predicting a player’s overall rating, but there are a few exceptions.

                        At Assigned Position    At Best Position
Better than Predicted   Stefan Reinartz         Borja Fernández
                        Raoul Cedric Loé        Jesús Navas
                        José Cañas              Marco Rojas
Worse than Predicted    Murat Akin              Francesco Totti
                        Musharraf Al Ruwaili    Ian Harte
                        Geir Ludvig Fevang      Ruslan Adzhindzhal
The players in the top row must have some magic not captured in the six regular attributes; one might call this the X Factor. Unbelievably, the top left box did not change from last year! These players must have something that their underlying attributes do not capture. Game developer friends, perhaps? They are all defensive midfielders, but I am not sure what else they have in common.
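Outliers like these fall out of the model residuals (actual rating minus predicted rating). Here is a minimal sketch on simulated data; the player names are placeholders, the weights are invented, and two outliers are planted on purpose so the result is easy to check.

```r
set.seed(4)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")
n <- 150

# Simulated midfielders with placeholder names; all numbers are invented.
players <- as.data.frame(matrix(sample(40:90, n * 6, replace = TRUE), ncol = 6,
                                dimnames = list(NULL, stats)))
players$Name <- paste0("Player", seq_len(n))
w <- c(0.10, 0.15, 0.30, 0.30, 0.10, 0.05)
players$RAT <- drop(as.matrix(players[, stats]) %*% w) + rnorm(n, sd = 1)

# Plant one over-rated and one under-rated player.
players$RAT[1] <- players$RAT[1] + 8
players$RAT[2] <- players$RAT[2] - 8

fit <- lm(RAT ~ PAC + SHO + PAS + DRI + DEF + PHY, data = players)
players$Residual <- residuals(fit)  # actual minus predicted

# Top three in each direction.
better <- head(players[order(-players$Residual), c("Name", "RAT", "Residual")], 3)
worse  <- head(players[order(players$Residual),  c("Name", "RAT", "Residual")], 3)
print(better)
print(worse)
```

A large positive residual marks a player rated better than the six attributes predict; a large negative one marks a player rated worse.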


There is some evidence that the player attributes lead to a few common clusters. Below is a chart showing the within-cluster sum of squares (WSS) for a given cluster count. This is a bit of visual confirmation that there are three or four general styles of player; past that, the WSS does not change as much.
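The WSS-by-cluster-count curve can be sketched with base R's `kmeans()`. The data below are simulated around three invented player "styles" (the `centers` values are made up), so the elbow lands at three by construction; with the real attributes the bend is less clean.

```r
set.seed(5)
stats <- c("PAC", "SHO", "PAS", "DRI", "DEF", "PHY")

# Three invented style centers: attacker, playmaker, defender.
centers <- rbind(c(85, 80, 65, 80, 35, 65),
                 c(70, 70, 85, 82, 55, 60),
                 c(65, 40, 60, 55, 85, 82))

# 60 simulated players around each center.
x <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(60 * 6, mean = rep(centers[i, ], each = 60), sd = 4), ncol = 6)
}))
colnames(x) <- stats

# Total within-cluster sum of squares for 1 to 8 clusters.
wss <- numeric(8)
wss[1] <- (nrow(x) - 1) * sum(apply(x, 2, var))  # one cluster: total sum of squares
for (k in 2:8) {
  wss[k] <- kmeans(x, centers = k, nstart = 20)$tot.withinss
}
print(round(wss))
```

Calling `plot(1:8, wss, type = "b")` draws the elbow chart; the useful cluster count is where the curve stops dropping sharply.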

Player Tree

Finally, I clustered the top field players (overall rating at least 84) hierarchically. What developed was an insightful way to visualize how different players are stylistically related to each other.

Football / Soccer's very own family tree. The most interesting leaves are the players positionally mixed in with other positions.  Philipp Lahm resurfaces as the midfielder who should be a defender, or vice-versa.  Maybe Germany should consider moving Mats Hummels to midfield and restoring Lahm to defense. Sounds crazy, but this same analysis pointed to moving Lahm to midfield even though many thought him to be the best right back in the world and, sure enough, Pep moved him to midfield. I'm sure Pep had been thinking about that move far earlier.

Santi Cazorla and Vincent Kompany look to be the furthest apart. That sounds spot on!
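A tree like this can be built with `dist()` and `hclust()`. The sketch below uses six hypothetical players with invented attribute values (not real FIFA 15 numbers); standardizing the attributes and using the default complete linkage are assumptions on my part, since the post does not state which settings were used.

```r
# Invented attribute rows for six hypothetical top players.
top <- data.frame(
  PAC = c(90, 88, 70, 72, 65, 68),
  SHO = c(80, 78, 72, 70, 40, 42),
  PAS = c(75, 77, 88, 86, 60, 62),
  DRI = c(88, 86, 85, 84, 55, 58),
  DEF = c(35, 38, 55, 58, 88, 86),
  PHY = c(60, 62, 60, 63, 85, 83),
  row.names = c("Winger A", "Winger B", "Playmaker A", "Playmaker B",
                "Stopper A", "Stopper B")
)

# Euclidean distance on standardized attributes, then hierarchical clustering.
d <- dist(scale(top))
tree <- hclust(d)

# Cut the tree into three branches.
groups <- cutree(tree, k = 3)
print(groups)

# The two players that are furthest apart stylistically.
dm <- as.matrix(d)
far <- which(dm == max(dm), arr.ind = TRUE)[1, ]
farthest <- rownames(dm)[far]
print(farthest)
```

`plot(tree)` draws the dendrogram; the `ggdendro` package listed in the setup code is another option for a ggplot2 version.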

Below is the code for screen-scraping. Be nice to others' sites when scraping.

 ### Prepare ####  
 pkg <- c("cluster","fpc","digest","ggplot2","foreign","ggdendro","reshape2")  
 inst <- pkg %in% installed.packages()  
 if(length(pkg[!inst]) > 0) install.packages(pkg[!inst])  
 ### Control panel for screen-scraping ####
 sleep.time <- 0.01      # pause between page requests, in seconds
 pagecount <- 237        # number of listing pages
 pc.ignore <- 0          # trailing full pages to skip when parsing
 # NOTE: the name of the next variable is missing from the listing; names.fullpage is a
 # reconstructed stand-in for the number of players on a full page
 names.fullpage <- 48
 names.lastpage <- 38    # players on the partial last page
 name.gaplines <- 71     # html lines between consecutive players
 namLine1 <- 1723        # html line holding the first player's name
 posLine1 <- 1723+5
 RATLine1 <- 1723+10
 PACLine1 <- 1723+15
 SHOLine1 <- 1723+20
 PASLine1 <- 1723+25
 DRILine1 <- 1723+30
 DEFLine1 <- 1723+34
 PHYLine1 <- 1723+37
 ### Create custom urls to scrape ####
 pageSeq <- seq(from=1,to=pagecount,by=1)
 urls.df <- data.frame(pageSeq)
 # the base url is blank in this listing; put the listing url prefix inside the quotes
 urls.df$url <- paste0("",urls.df$pageSeq)
 ### Scrape html from custom urls ####
 # the bodies of the next two loops are missing from the listing; readLines() and
 # writeLines() are assumed here, saving each page as <page number>.txt
 pages <- as.list("na")
 for(j in 1:length(urls.df$pageSeq)){
  pages[[j]] <- readLines(urls.df$url[j],warn=FALSE)
  Sys.sleep(sleep.time)
 }
 for(j in 1:length(urls.df$pageSeq)){
  writeLines(pages[[j]],paste0(urls.df$pageSeq[j],".txt"))
 }
 ### Identify which lines store player statistics ####
 # length.out is reconstructed: one line index per player on a full page
 namSeq <- seq(from=namLine1,by=name.gaplines,length.out=names.fullpage)
 posSeq <- seq(from=posLine1,by=name.gaplines,length.out=names.fullpage)
 RATSeq <- seq(from=RATLine1,by=name.gaplines,length.out=names.fullpage)
 PACSeq <- seq(from=PACLine1,by=name.gaplines,length.out=names.fullpage)
 SHOSeq <- seq(from=SHOLine1,by=name.gaplines,length.out=names.fullpage)
 PASSeq <- seq(from=PASLine1,by=name.gaplines,length.out=names.fullpage)
 DRISeq <- seq(from=DRILine1,by=name.gaplines,length.out=names.fullpage)
 DEFSeq <- seq(from=DEFLine1,by=name.gaplines,length.out=names.fullpage)
 PHYSeq <- seq(from=PHYLine1,by=name.gaplines,length.out=names.fullpage)
 ### Create empty dataframe for storing player stats
 attribs <- data.frame(matrix(nrow=names.fullpage*(pagecount-1)+names.lastpage,ncol=9))
 colnames(attribs) <- c("Name","Position","RAT","PAC","SHO","PAS","DRI","DEF","PHY")
 ### Store lines from full pages containing player stats to dataframe ####
 for(m in 1:(pagecount-1-pc.ignore)){
  page <- readLines(paste0(urls.df$pageSeq[m],".txt"))
  for(k in 1:names.fullpage){
   n <- (m-1)*names.fullpage+k   # row index of player k on page m
   attribs$Name[n] <- page[namSeq[k]]
   attribs$Position[n] <- page[posSeq[k]]
   attribs$RAT[n] <- page[RATSeq[k]]
   attribs$PAC[n] <- page[PACSeq[k]]
   attribs$SHO[n] <- page[SHOSeq[k]]
   attribs$PAS[n] <- page[PASSeq[k]]
   attribs$DRI[n] <- page[DRISeq[k]]
   attribs$DEF[n] <- page[DEFSeq[k]]
   attribs$PHY[n] <- page[PHYSeq[k]]
  }
 }
 ### Store lines from partial last page containing player stats to dataframe ####
 pagelast <- readLines(paste0(urls.df$pageSeq[pagecount],".txt"))
 for(p in 1:names.lastpage){
  q <- (pagecount-1)*names.fullpage+p   # row index of player p on the last page
  attribs$Name[q] <- pagelast[namSeq[p]]
  attribs$Position[q] <- pagelast[posSeq[p]]
  attribs$RAT[q] <- pagelast[RATSeq[p]]
  attribs$PAC[q] <- pagelast[PACSeq[p]]
  attribs$SHO[q] <- pagelast[SHOSeq[p]]
  attribs$PAS[q] <- pagelast[PASSeq[p]]
  attribs$DRI[q] <- pagelast[DRISeq[p]]
  attribs$DEF[q] <- pagelast[DEFSeq[p]]
  attribs$PHY[q] <- pagelast[PHYSeq[p]]
 }
 ### Remove html wrapped around player stats in each line ####  
 attribs$Name <- gsub("^.*<span class=\"name\">","",attribs$Name)  
 attribs$Name <- gsub("</span>.*$","",attribs$Name)  
 attribs$Name <- gsub("^\\s+|\\s+$","",attribs$Name)  
 attribs$Position <- gsub("^ *","",attribs$Position)  
 attribs$Position <- gsub("^\\s+|\\s+$","",attribs$Position)  
 attribs$RAT <- gsub("^.*<span>","",attribs$RAT)  
 attribs$RAT <- gsub("</span>.*$","",attribs$RAT)  
 attribs$RAT <- gsub("^\\s+|\\s+$","",attribs$RAT)  
 attribs$PAC <- gsub("^.*<span class=\"attribute\">","",attribs$PAC)  
 attribs$PAC <- gsub("</span>.*$","",attribs$PAC)  
 attribs$PAC <- gsub("^\\s+|\\s+$","",attribs$PAC)  
 attribs$SHO <- gsub("^.*<span class=\"attribute\">","",attribs$SHO)  
 attribs$SHO <- gsub("</span>.*$","",attribs$SHO)  
 attribs$SHO <- gsub("^\\s+|\\s+$","",attribs$SHO)  
 attribs$PAS <- gsub("^.*<span class=\"attribute\">","",attribs$PAS)  
 attribs$PAS <- gsub("</span>.*$","",attribs$PAS)  
 attribs$PAS <- gsub("^\\s+|\\s+$","",attribs$PAS)  
 attribs$DRI <- gsub("^.*<span class=\"attribute\">","",attribs$DRI)  
 attribs$DRI <- gsub("</span>.*$","",attribs$DRI)  
 attribs$DRI <- gsub("^\\s+|\\s+$","",attribs$DRI)  
 attribs$DEF <- gsub("^.*<span class=\"attribute\">","",attribs$DEF)  
 attribs$DEF <- gsub("</span>.*$","",attribs$DEF)  
 attribs$DEF <- gsub("^\\s+|\\s+$","",attribs$DEF)  
 attribs$PHY <- gsub("^.*<span class=\"attribute\">","",attribs$PHY)  
 attribs$PHY <- gsub("</span>.*$","",attribs$PHY)  
 attribs$PHY <- gsub("^\\s+|\\s+$","",attribs$PHY)  
 ### Remove statistics from duplicated players ####  
 attribs <- attribs[!(attribs$Name=="Cristiano Ronaldo"&attribs$RAT=="93"),]  
 attribs <- attribs[!duplicated(attribs$Name),]  
 rownames(attribs) <- NULL  
 ### Clean up foreign characters in names ####  
 Encoding(attribs$Name) <- "UTF-8"  
 attribs$Name <- iconv(attribs$Name,"UTF-8","UTF-8",sub='')  
 ### Create general position type ####  
 attribs$Type[attribs$Position %in% c("CF","LF","RF","ST")] <- "Forward"  
 attribs$Type[attribs$Position %in% c("LM","RM","CDM","CM","CAM","LW","RW")] <- "Midfield"  
 attribs$Type[attribs$Position %in% c("LB","RB","CB","LWB","RWB")] <- "Defense"  
 attribs$Type[attribs$Position %in% c("GK")] <- "Keeper"  
 ### Change each stat to the appropriate data type ####  
 attribs$Name <- as.character(attribs$Name)  
 attribs$Position <- as.factor(attribs$Position)  
 attribs$RAT <- as.integer(attribs$RAT)  
 attribs$PAC <- as.integer(attribs$PAC)  
 attribs$SHO <- as.integer(attribs$SHO)  
 attribs$PAS <- as.integer(attribs$PAS)  
 attribs$DRI <- as.integer(attribs$DRI)  
 attribs$DEF <- as.integer(attribs$DEF)  
 attribs$PHY <- as.integer(attribs$PHY)  
 attribs$Type <- ordered(attribs$Type,levels=c("Forward","Midfield","Defense","Keeper"))  


  1. Congratulations! Amazing post
    Could you share the R code?

  2. Could you share the R code?

  3. Hello, thank you for sharing the code. However, looking at the attribs data (before or after removing the html wrapper), there is something strange going on. At the least, as far as I can tell, not all players are in the scraped data. Just wanted to let you know.

  4. Amazing work!

    I got a warning message (after trying to remove the html wrapper) and don't see the mistake. Maybe you could help me ://

    Error: unexpected symbol in "attribs$Name <- gsub("^.*<span class="name"

  5. Just include a \ before each quotation mark so that R knows you are looking for a " and not trying to end a string. And thank you.


  6. Can we have your entire code for learning purpose?

  7. Hello there, I am a student from Malaysia doing research on clustering player positions using R. Could you let me have the entire R code, please? I hope you can help me :)

    1. Here is the link to my code on github (it's a couple of years old, so it may need a few tweaks).
