Machine Learning and Sports: Data Science’s Best Example of a Class Imbalance



As a budding data scientist and incoming MS candidate at the University of Chicago studying Applied Data Science, I wanted to engage in a project where I could combine my love of sports with the power of analytics. Specifically, given 20 years of athlete data spanning college, draft, and the National Football League (NFL), I wanted to see if I could develop a machine learning model to predict how wide receivers would perform, and uncover which factors most influence NFL success.

I sourced my data from two open-source APIs, which together provided three datasets of college football, draft, and NFL statistics for wide receivers between 2004 and 2023. Initially, I was convinced there would be a correlation between college and NFL performance across the key quantitative metrics by which wide receivers are primarily evaluated: receiving yards, receptions, touchdowns, and average yards per reception. However, when I plotted athletes' college composite z-scores (summed across those metrics) against their NFL composites, I quickly realized that the two did not share a strong linear trend; the relationship between college and NFL performance required a more subtle analysis. Pivoting away from a regression-led project, I instead used K-Means clustering to group the athletes, where an elbow chart revealed three performance classes: low, average, and high performers.
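For readers who want to reproduce the clustering step, here is a minimal sketch of the elbow analysis with scikit-learn. The file name and column names are placeholders, not the actual identifiers from my datasets.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Placeholder file and column names for the per-athlete NFL statistics.
metrics = ["rec_yards", "receptions", "touchdowns", "yards_per_reception"]
df = pd.read_csv("nfl_wide_receivers.csv")

# Standardize the four evaluation metrics so each contributes equally.
X = StandardScaler().fit_transform(df[metrics])

# Elbow chart: within-cluster sum of squares (inertia) for k = 1..10.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# The elbow at k = 3 suggests three performance classes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
df["performance_class"] = kmeans.labels_
```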

However, as I explored these three performance clusters further, it was quickly apparent this grouping would result in a severe class imbalance. As seen by the pie chart below, which represents the proportion of athletes per performance class, using these three clusters resulted in nearly six times as many low performer data points as high performers, arguably the most important class to predict.
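Checking the imbalance itself is quick; this continues from the sketch above.

```python
import matplotlib.pyplot as plt

# Proportion of athletes per performance class (df from the sketch above).
counts = df["performance_class"].value_counts(normalize=True)
print(counts)

counts.plot.pie(autopct="%.1f%%")
plt.ylabel("")
plt.show()
```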

At first, I thought that the problem was with how I had grouped the athletes. To test this hypothesis, I analyzed the within-cluster performance further by developing a two-dimensional visualization of the clusters. Specifically, the figure below depicts the K-Means clustering by plotting each athlete's performance within each class for every unique pair of the included standardized metrics (receiving yards, receptions, touchdowns, and average yards per reception).
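A figure like this can be produced with a pairwise scatterplot matrix; here is a minimal sketch using seaborn, again continuing from the variables above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatterplots of the standardized metrics, colored by cluster
# (X, metrics, and kmeans come from the earlier sketch).
plot_df = pd.DataFrame(X, columns=metrics)
plot_df["performance_class"] = kmeans.labels_
sns.pairplot(plot_df, hue="performance_class", corner=True)
plt.show()
```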

From the figure, it is clear that the K-Means classes did a good job of creating distinct clusters of performance. Specifically, the defined separation between the three clusters of data points within each scatterplot reinforced the validity of this partitioning of player performance. For football fans like me, it also highlights just how wide the margin of success is between the best and worst wide receivers: the centroids indicate that the performance of the NFL's best wide receivers since 2004 is roughly 3.33 standard deviations higher than that of the NFL's worst.
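One way to quantify that gap is to compare the cluster centroids directly in z-score space. A minimal sketch, reusing the fitted kmeans object from above; measuring the gap as the average per-metric difference is my assumption here.

```python
import numpy as np

# Cluster centers live in standardized (z-score) space.
centroids = kmeans.cluster_centers_

# Rank clusters by mean standardized performance; first is worst, last is best.
order = centroids.mean(axis=1).argsort()
low, high = centroids[order[0]], centroids[order[-1]]

# Average per-metric gap, in standard deviations, between the two centroids.
print((high - low).mean())  # roughly 3.33 in my data
```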

Having confirmed that the way I had clustered the athletes was not the problem, I realized that properly grouping athletes by performance means a class imbalance is actually very natural. In reality, the athletes who become the most successful, and are retrospectively worth drafting, will always be in the minority.

Thus, confident in my clusters but still cognizant of the class imbalance that lay ahead, I knew that developing an accurate model would depend on my ability to address that issue. This was confirmed by the baseline decision tree and random forest models I constructed:

The two confusion matrices above visualize how often the decision tree and random forest models correctly and incorrectly predicted the class of each athlete in my dataset. As the matrices show, without addressing the class imbalance, these supervised learning models were unable to correctly classify any of the high performers in Class 1. Because the open APIs already limited my total number of data points, discarding majority-class rows through undersampling was unappealing; instead, I chose SMOTE (Synthetic Minority Oversampling Technique), which oversamples the minority classes by synthesizing new examples rather than simply duplicating existing ones, as sketched below.
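Here is a minimal sketch of the baseline-versus-SMOTE comparison using scikit-learn and imbalanced-learn. The feature names are illustrative stand-ins for my actual columns, and the oversampling is applied only to the training split so the test set remains untouched.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE

# Illustrative feature set (df and performance_class from the sketches above).
features = df[["draft_round", "college_rec_yards", "height_in",
               "espn_draft_grade", "espn_wr_rank"]]
y = df["performance_class"]

X_train, X_test, y_train, y_test = train_test_split(
    features, y, stratify=y, random_state=42)

# Baseline: train on the imbalanced data as-is.
base_rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(confusion_matrix(y_test, base_rf.predict(X_test)))

# SMOTE: synthesize minority-class examples in the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
smote_rf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(classification_report(y_test, smote_rf.predict(X_test)))
```

Resampling after the split matters: applying SMOTE before splitting would leak synthetic neighbors of test-set points into training and inflate the evaluation.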

Ultimately, this decision drastically improved the accuracy of the models, which was particularly clear when comparing the performance of the random forests. This was especially true for the class of high-performing athletes, which the base random forest model failed to classify correctly at all without SMOTE. As the compared classification reports below show (per-class summaries of precision, recall, and F1 derived from the confusion matrix), using SMOTE improved both precision and recall for nearly every class and, more importantly, ensured that at least some instances of Class 1 were correctly identified.

Interestingly, among the available features such as draft round, college receiving yards, and height, the attributes that contributed the most to the SMOTE random forest's accuracy were ESPN's annual pre-draft grade and wide receiver positional ranking.
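Extracting that ranking from the trained model is straightforward; a short sketch, again with illustrative column names:

```python
import pandas as pd

# Impurity-based importances from the SMOTE-trained forest (see sketch above).
importances = pd.Series(smote_rf.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))
# In my runs, the ESPN pre-draft grade and WR positional rank led the list.
```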

While the SMOTE random forest was certainly much more accurate than the base version, the takeaway from this project is not that SMOTE is the best way to overcome a class imbalance (although I would strongly encourage you to consult ESPN's grades when considering which wide receiver your favorite team should draft next year). Instead, the primary takeaway from my project journey is that accurate athlete prediction in sports analytics rests on the ability to handle a class imbalance.

Because of how success is defined, any classification of athletes is going to mean there are far more data points representing those who are unsuccessful. While model accuracy will of course improve merely by obtaining more total data points, even with hundreds of thousands of rows, a class distribution such as mine will always constrain a supervised learning model's full potential.

Conclusion:

In this project, I applied only one technique to handle the class imbalance, and there are many others out there (the decision among them forms the basis of an entirely separate, yet interesting, question). However, this project serves as a reminder that addressing class imbalances can drastically improve supervised learning outcomes, and that when it comes to predicting athlete success in sports analytics, doing so is simply unavoidable.

For football fans and lovers of data science like me, additional interesting insights can be found in my project GitHub.

Bradley Stoller is an aspiring data scientist and MS in Applied Data Science candidate at the University of Chicago.
