Read about how we analyzed the data set and what we found.
We hypothesized that features that contribute to the most variability within the data also best explain our outcome of interest. To test this hypothesis, we performed principal component analysis (PCA).
In PCA, the first principal component explains about a third of the variability we see in our features (35.6%), while the second to sixth principal component all explain between 5 to 10%. Together, the first three components cover over 50% of the variation we see in our data. Plotting the top 3 PCs and coloring the points by finish percentile (Figures 1-3 on the right), we see that higher values on PC 1 correspond to higher win place percentages. As hypothesized, the largest variability we see in our features also appears to be predictive of our outcome variable.
From the visualization of each feature's contribution to the first two principal components (Figure 4), we conclude that there are two primary dimensions along which our observations vary in terms of their features. The first dimension relates to player actions, specifically how much damage you dealt to other players. The second dimension relates to match duration and, to a lesser extent, distance travelled. From a player perspective, we are more interested in the first principal component, since how many players you attack is something that is more within your control.
Additionally, we were hoping to identify which playstyles lead to the greatest chance of winning. Since the probability of winning increases as the number of players decrease, it is to the players' advantage to eliminate players. Thus, we can hypothesize that successful players may take two approaches to the game:
We first attempted to identify these strategies in our data by looking at the distribution of the percentage of total players killed by winners, as shown below.
Most winners kill a substantial proportion of their opponents -- on average, they eliminate ~ 7% of players. In comparison, non-winners eliminate ~1% of players on average. If there was a clear distinction between the two strategies in terms of proportion of players killed, we would expect to see a multi-modal distribution in the proportion of players killed for match winners. It is worth noting that there is a slight bump for winners with 0 to 0.1 proportion of players killed, but further data analysis revealed that this small cluster consists of games with fewer than fifty players.
Upon further consideration, even if the playing styles we described accurately reflect reality, we may not have the necessary data that would’ve helped us cluster them. Some examples include drop location and time of kills. Since we only have data on the total number of kills (relative and absolute), that may not be a sufficient signal to separate playing strategies. On the other hand, it may also be possible that clusters for playing strategies don’t exist at all. Perhaps all winners pursue what we would consider an aggressive strategy, in which case we would not see separate clusters either.
Thus, while we were unable to identify strategies due to data limitations, our data analysis indicates that aggressive players who have more kills relative to other players do perform better on average.
Upon further consideration, even if the playing styles we described accurately reflect reality, we may not have the necessary data that would’ve helped us cluster them. Some examples include drop location and time of kills. Since we only have data on the total number of kills (relative and absolute), that may not be a sufficient signal to separate playing strategies. On the other hand, it may also be possible that clusters for playing strategies don’t exist at all. Perhaps all winners pursue what we would consider an aggressive strategy, in which case we would not see separate clusters either.
Thus, while we were unable to identify strategies due to data limitations, our data analysis indicates that winners tend to kill a high number of players.
For our data analysis, we were primarily interested in answering:
To answer these questions, we will fit the following models, test our model on an independent set, and examine feature importance scores:
We fit all models using five-fold cross validation and included all features except for the match ID and player ID, since neither feature is going to generalize for predicting new games, which is what we are interested in. Additionally, since the win place percentage is a value between 0 and 1, we apply a log transformation (adding 1 to ensure all values are defined), so that it better fulfills the assumption that the outcome is a continuous variable on the real line. However, this still doesn't guarantee that the predicted values will be between 0 and 1, so we constrain the predicted value to be 0 if it is negative and 1 if it is above 1.
A visual inspection of plots of the predicted win place percentile versus the actual win place percentile suggests that the random forest model performs the best of the three models. In the random forest model, the spread of points is smaller for players that place lower (actual win place percentage approaches 0), indicating that we are able to predict the finish percentile more accurately for players that place lower.
To quantitatively compare our three models, we used the following metrics on an independent validation set:
x
. If the predicted outcome is within x
% of the actual win place percentage, we classify it as a "correct" prediction. Otherwise, it is an incorrect prediction. x
%. For different cutoff values, we can then compute the sensitivity (true positive rate, or the proportion of actual winners we classify as such) and specificity (true negative rate, or the proportion of actual losers we classify as such).The performance of the three models for each metric is shown below.
From the above plots, we can see that random forest performs best on all three metrics. Its MAE is around 0.05, which means that on average, the predicted win place percentage is 5% off from the true finish percentile. Its SDAM is consistently higher than the SDAM for linear regression or elastic net regression. For example, its SDAM(5) is around 0.65, which means that for 65% of observations in our validation set, the predicted value is within 5% of the true win place percentage. Lastly, the area under its ROC curve is the largest, which indicates that it performs best at classifying who the winners are (sensitivity approximately 0.94, specificity approximately 0.86).
We can look at the relative importance of features within each model as shown below.
Unexpectedly, there is not much agreement in the features that each model regards to be important. One explanation for this is the presence of highly correlated features in our data; for example, if we look at the features kill_place
and kill_streaks
, the former is rated as highly important by the linear regression and random forest, while the latter is rated as highly important by elastic net regression. However, the two features are known to be highly correlated with each other. While random forest performs best in prediction, the caveat with its importance score metric is that if multiple features are all predictive of the outcome but are highly correlated with the outcome, the importance score of each feature is going to be suppressed.
Since elastic net regression accounts for correlation between features, we will primarily use its variable importance values to summarize our findings for ranking the importance of player characteristics and actions:
kill_streaks
as the most predictive among the different features related to kills. What this suggests is that while some players may succeed with less confrontational playing strategies, you need to kill other players or you will be eliminated.weapons_acquired
and boosts
are the strongest predictors of outcome. This is interesting because one would expect that among the features related to item acquisition and usage, weapons would be the dominant predictor. However, the acquisition of weapons may plateau over time once you have strong weapons. On the other hand, boosts enable increased health regeneration over time and have a small movement speed bonus when the amount of boosts consumed is beyond a particular threshold. Players tend to save boosts as an additional advantage in the later stages of the game when most players have powerful weapons. Consequently, the number of boosts consumed is also a strong indicator of a successful player. Our data analysis has a number of limitations. The first is that for computational purposes, we only used around 20% of the total amount of solo-player data. In the future, we could consider repeating the analysis with a larger data set. Second, we only examined solo player data, so our conclusions are not generalizable to group gameplay, which is complicated by aspects of teamwork such as revival and positioning. Finally, we could have explored using the Beta distribution to model our outcome of interest since the final win place percentile is a value between 0 and 1.
We were unable to identify clusters of different playing strategies among winners, due to the limitations of the data we have. However, our data analysis indicates that aggressive players who have more kills relative to other players do perform better on average.