Analysis

Dimension Reduction

We hypothesized that features that contribute to the most variability within the data also best explain our outcome of interest. To test this hypothesis, we performed principal component analysis (PCA).

In PCA, the first principal component explains about a third of the variability we see in our features (35.6%), while the second to sixth principal component all explain between 5 to 10%. Together, the first three components cover over 50% of the variation we see in our data. Plotting the top 3 PCs and coloring the points by finish percentile (Figures 1-3 on the right), we see that higher values on PC 1 correspond to higher win place percentages. As hypothesized, the largest variability we see in our features also appears to be predictive of our outcome variable.

From the visualization of each feature's contribution to the first two principal components (Figure 4), we conclude that there are two primary dimensions along which our observations vary in terms of their features. The first dimension relates to player actions, specifically how much damage you dealt to other players. The second dimension relates to match duration and, to a lesser extent, distance travelled. From a player perspective, we are more interested in the first principal component, since how many players you attack is something that is more within your control.

Identifying Successful Playstyles

Additionally, we were hoping to identify which playstyles lead to the greatest chance of winning. Since the probability of winning increases as the number of players decrease, it is to the players' advantage to eliminate players. Thus, we can hypothesize that successful players may take two approaches to the game:

Aggressive Strategy: Players drop in a populated location and attempt to take control of the location. This is a high-risk and high-reward strategy as players are more likely to die early in the game due to an increased number of encounters but are rewarded with superior weapons and boosts.
Passive Strategy: Rather than eliminating other players yourself, the passive strategy relies on surviving until other players have been eliminated. This typically involves hiding to minimize encounters until only a few players remain, then selectively taking fights until the player is the only one remaining.

We first attempted to identify these strategies in our data by looking at the distribution of the percentage of total players killed by winners, as shown below.

Most winners kill a substantial proportion of their opponents -- on average, they eliminate ~ 7% of players. In comparison, non-winners eliminate ~1% of players on average. If there was a clear distinction between the two strategies in terms of proportion of players killed, we would expect to see a multi-modal distribution in the proportion of players killed for match winners. It is worth noting that there is a slight bump for winners with 0 to 0.1 proportion of players killed, but further data analysis revealed that this small cluster consists of games with fewer than fifty players.

Upon further consideration, even if the playing styles we described accurately reflect reality, we may not have the necessary data that would’ve helped us cluster them. Some examples include drop location and time of kills. Since we only have data on the total number of kills (relative and absolute), that may not be a sufficient signal to separate playing strategies. On the other hand, it may also be possible that clusters for playing strategies don’t exist at all. Perhaps all winners pursue what we would consider an aggressive strategy, in which case we would not see separate clusters either.

Thus, while we were unable to identify strategies due to data limitations, our data analysis indicates that aggressive players who have more kills relative to other players do perform better on average.

Thus, while we were unable to identify strategies due to data limitations, our data analysis indicates that winners tend to kill a high number of players.

Predicting Players' Final Placement Percentile

For our data analysis, we were primarily interested in answering:

Model Prediction: How well can we predict a player's finish placement? How well can we classify winners?
Feature Ranking: What player actions or statistics are most predictive of their finish placement?

To answer these questions, we will fit the following models, test our model on an independent set, and examine feature importance scores:

Linear regression
Elastic net regression: as we noted earlier in the data exploration, some of the features are highly correlated and/or seem to provide little signal (such as distance traveled). To solve these issues, we implement elastic net regression which uses a ridge-regression-like penalty to adjust for correlated features and lasso penalty to shrink non-informative features to zero.
Random forest: to account for interactions between features, we can use a random forest model. Ensemble methods like random forest are known to generally perform better than regression models.

We fit all models using five-fold cross validation and included all features except for the match ID and player ID, since neither feature is going to generalize for predicting new games, which is what we are interested in. Additionally, since the win place percentage is a value between 0 and 1, we apply a log transformation (adding 1 to ensure all values are defined), so that it better fulfills the assumption that the outcome is a continuous variable on the real line. However, this still doesn't guarantee that the predicted values will be between 0 and 1, so we constrain the predicted value to be 0 if it is negative and 1 if it is above 1.

A visual inspection of plots of the predicted win place percentile versus the actual win place percentile suggests that the random forest model performs the best of the three models. In the random forest model, the spread of points is smaller for players that place lower (actual win place percentage approaches 0), indicating that we are able to predict the finish percentile more accurately for players that place lower.

Comparing Model Performance

To quantitatively compare our three models, we used the following metrics on an independent validation set:

Mean absolute error (MAE): the average absolute deviation between the predicted and actual win place percentages.
Self-defined accuracy metric (SDAM(x)): This metric is a function of a cutoff value, x. If the predicted outcome is within x% of the actual win place percentage, we classify it as a "correct" prediction. Otherwise, it is an incorrect prediction.
Classification of Winners: We can compute the ROC curve by turning our predictions into a classification problem. Given a predicted win place percentage, we classify the player as a winner if its predicted value is less than a cutoff value x%. For different cutoff values, we can then compute the sensitivity (true positive rate, or the proportion of actual winners we classify as such) and specificity (true negative rate, or the proportion of actual losers we classify as such).

The performance of the three models for each metric is shown below.

From the above plots, we can see that random forest performs best on all three metrics. Its MAE is around 0.05, which means that on average, the predicted win place percentage is 5% off from the true finish percentile. Its SDAM is consistently higher than the SDAM for linear regression or elastic net regression. For example, its SDAM(5) is around 0.65, which means that for 65% of observations in our validation set, the predicted value is within 5% of the true win place percentage. Lastly, the area under its ROC curve is the largest, which indicates that it performs best at classifying who the winners are (sensitivity approximately 0.94, specificity approximately 0.86).

Feature Ranking

We can look at the relative importance of features within each model as shown below.

Unexpectedly, there is not much agreement in the features that each model regards to be important. One explanation for this is the presence of highly correlated features in our data; for example, if we look at the features kill_place and kill_streaks, the former is rated as highly important by the linear regression and random forest, while the latter is rated as highly important by elastic net regression. However, the two features are known to be highly correlated with each other. While random forest performs best in prediction, the caveat with its importance score metric is that if multiple features are all predictive of the outcome but are highly correlated with the outcome, the importance score of each feature is going to be suppressed.

Since elastic net regression accounts for correlation between features, we will primarily use its variable importance values to summarize our findings for ranking the importance of player characteristics and actions: