Batting average is not a particularly useful tool for player analysis, but it is something that fans and commentators often use. I thought adding an end-of-season projection to that conversation would be interesting, so I attempted to build a model that uses a player's data from March and April of a season to predict his end-of-year batting average.
This post will summarize my initial attempt, some things I have learned since then, and ideas to further improve model performance.
Understanding the Data
I had data for 300 random MLB players from the 2018 season. The only restriction on the list of players was that it contained no pitchers. The following features were available for analysis:
In addition, I had the actual EoY batting averages for model evaluation. Before starting, I created a distribution plot of the March/April batting averages to check whether this was a truly random selection of the league. The distribution below looks good, and I assumed each of the other statistics was normally distributed as well.
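As a rough sketch of that check, something like the following could be used with seaborn, assuming the data sits in a CSV with a "MarApr_AVG" column (the file name and column label are illustrative, not the actual dataset schema):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the player data (file name and column label are assumptions for illustration).
players = pd.read_csv("march_april_2018.csv")

# Plot the distribution of March/April batting averages to eyeball normality.
sns.histplot(players["MarApr_AVG"], kde=True)
plt.xlabel("March/April Batting Average")
plt.title("Distribution of March/April Batting Averages (2018)")
plt.show()
```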
General Approach and Assumptions
The first month of data for a batter is not a great indicator of full-season success, so I needed an approach that accounts for players getting off to different starts. To do this, I applied the theory of regression to the mean. In this situation it means that everyone's final statistics will land somewhere between the league average and their March/April performance. Applying this to batting average alone would be too simplistic, but applying it to the other statistics that are good predictors of batting average gives a more robust way to project full-season batting average. This assumption of regression to the mean gave me the general approach for the project:
Step 1 - Identify statistics that are good predictors of batting average using March/April statistics and March/April batting averages
Step 2 - Build a predictive model using those predictive statistics
Step 3 - Regress those statistics from March/April to the mean as a way of projecting full season performance
Step 4 - Use the regressed statistics to predict EoY batting averages
Step 5 - Use the actual EoY batting averages to evaluate model performance
Following these steps, I created and evaluated an XGBoost model to predict batting averages. I chose MAE as my key performance metric because it penalizes each error equally. This matters because a single player who significantly outperforms or underperforms expectations in a year is not uncommon, and a metric that emphasizes large errors would make an otherwise solid model look poor. Additionally, MAE is easy to use when explaining the model's performance: an MAE of 0.010 means each prediction is off by 0.010 on average.
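For reference, a minimal illustration of how MAE is computed with scikit-learn (the numbers below are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up predicted vs. actual end-of-year batting averages for three players.
actual = np.array([0.275, 0.301, 0.248])
predicted = np.array([0.265, 0.315, 0.250])

# Average absolute miss, in batting-average points.
mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.3f}")
```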
Summary of Model Built for Submission
I followed the steps listed above to build an XGBoost model. I chose XGBoost because it typically outperforms simpler models and I have had success using it in the past.
Step 1 - Identifying Predictive Features
Of the statistics available, 16 were relevant to hitting performance and might be useful for model creation. I didn't want to use all 16, as that would make the model too hard to understand. To determine the best features, I built an XGBoost regression model with all 16 features and evaluated the gain each feature provided. I then looked for natural cutoffs in the feature importance. The feature importance from the full 16-feature model is shown below:
There were two natural cutoffs I could identify, and I tested them to see how the MAE changed compared to using all 16 features. The first combination was OBP and BABIP only. All of the other features had much less gain than these two, but the MAE worsened in this model from -0.0213 to -0.032. My next combination was OBP, BABIP, K%, ISO, and BB%. This combination had an MAE of -0.0205, which was better than using all 16 features. To ensure this was the best combination, I tried adding and removing a few features but ultimately decided the top five features were best.
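A sketch of how this feature ranking and subset comparison could be scripted is below. The file name and data layout are assumptions, and the negative MAE values quoted above are consistent with scikit-learn's neg_mean_absolute_error scoring convention, though that is my guess at how the scores were produced:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Hypothetical layout: one column per March/April statistic plus the
# March/April batting average as the target.
players = pd.read_csv("march_april_2018.csv")
target = "MarApr_AVG"
all_features = [c for c in players.columns if c != target]
X, y = players[all_features], players[target]

# Fit on all 16 features and rank them by the gain each contributes.
model = XGBRegressor()
model.fit(X, y)
gain = model.get_booster().get_score(importance_type="gain")
for feature, score in sorted(gain.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {score:.1f}")

# Compare candidate subsets by cross-validated MAE (reported as a negative score).
for subset in (all_features, ["OBP", "BABIP"], ["OBP", "BABIP", "K%", "ISO", "BB%"]):
    scores = cross_val_score(XGBRegressor(), players[subset], y,
                             scoring="neg_mean_absolute_error", cv=5)
    print(subset, round(scores.mean(), 4))
```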
Step 2 - Building and Tuning the Model
Now that I had my 5 features to use in the model, I needed to build and tune it so it would be ready to make predictions. Using some strategies for model tuning from a Towards Data Science article, I tuned the model using trial and error until MAE was optimized. This gave me the following parameters for my XGBoost Regression Model:
learning rate = 0.05
number of estimators = 1000
max depth = 5
subsample = 0.8
gamma = 0
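These parameters map directly onto the scikit-learn-style XGBoost API. A minimal sketch of the model instantiation is below; the training call is only indicated in a comment, since the actual training data layout is not shown here:

```python
from xgboost import XGBRegressor

# Tuned hyperparameters listed above, expressed through the scikit-learn-style API.
model = XGBRegressor(
    learning_rate=0.05,
    n_estimators=1000,
    max_depth=5,
    subsample=0.8,
    gamma=0,
)

# model.fit(X_train, y_train) would then be called with the five selected
# March/April statistics as features and March/April batting average as the target.
```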
Step 3 - Regressing Statistics to the Mean
With a model built, I needed end of year statistics to feed into it and make batting average predictions. I found league average values for each statistic online:
OBP - 0.320
BABIP - 0.300
K% - 0.200
BB% - 0.080
ISO - 0.140
I then created a per-player weighting constant called the 'Multiplier', based on the number of plate appearances a player had in March/April and the average number of plate appearances in a full season, which I estimated to be 500. The Multiplier determined how much weight the March/April stats would have compared to league averages.
This was important because a player with 100 plate appearances and good statistics is more likely to stay above average than a player with 10 plate appearances and good statistics. A higher Multiplier created this effect, regressing stats of those with high plate appearances less.
For example, Elias Diaz had only 34 plate appearances while Manny Machado had 125. Diaz's multiplier was 0.068 while Machado's was 0.250. This means Diaz's regressed stats are 6.8% his March/April performance and 93.2% league averages, while Machado's are 25% his March/April performance and 75% league averages.
I calculated a weighted average of the March/April stats and the league averages, using each player's Multiplier as the weight on the March/April stats.
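A minimal sketch of this regression step, assuming the league averages listed above and an expected 500 plate appearances (the Diaz stat line in the example is made up for illustration):

```python
LEAGUE_AVG = {"OBP": 0.320, "BABIP": 0.300, "K%": 0.200, "BB%": 0.080, "ISO": 0.140}
EXPECTED_SEASON_PA = 500  # initial estimate of a full season of plate appearances


def regress_to_mean(march_april_stats, plate_appearances, expected_pa=EXPECTED_SEASON_PA):
    """Blend a player's March/April stats with league averages.

    The multiplier is the weight given to the player's own early-season numbers;
    the remainder of the weight goes to the league average.
    """
    multiplier = plate_appearances / expected_pa
    return {
        stat: multiplier * value + (1 - multiplier) * LEAGUE_AVG[stat]
        for stat, value in march_april_stats.items()
    }


# Elias Diaz had 34 plate appearances, so his multiplier is 34 / 500 = 0.068.
# The stat line itself is made up for illustration.
diaz = {"OBP": 0.360, "BABIP": 0.310, "K%": 0.150, "BB%": 0.090, "ISO": 0.180}
print(regress_to_mean(diaz, plate_appearances=34))
```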
Step 4 - Making Final Predictions
With my model from Step 2 and regressed statistics from Step 3, I made my predictions for EoY batting average. However, before comparing them to the actual results I wanted to make sure they made sense.
I was concerned about the regression technique I used and how it would affect the stats for each player. To see how it affected the predictions, I looked at the range of values predicted and a distribution plot of the predictions. Ideally, the predictions would have a normal distribution and a range similar to the 2017 range, which was 0.203 to 0.342 (a range of 0.139).
The values were more heavily regressed to the mean than I had expected and the distribution was not normal. This told me I was undervaluing the March/April stats. To adjust the regression effect, I changed the expected plate appearances from Step 3 to 100, which produced a good normal distribution. Then I pushed the value up slowly until I started to lose that normal distribution. I wanted the value to be as high as possible, since an expected total of 100 plate appearances weights the March/April stats very heavily. I settled on a value of 180, which adjusted the range of predictions to 0.182 to 0.321 (a range of 0.139) while keeping a normal distribution.
To compare to the original constant, Diaz's March/April stats went from having a weight of 6.8% to a weight of 18.8% and Machado's March/April weight increased from 25% to 69.4%.
Once the weighting constant was adjusted, I reran the model to get my final predictions. The new predictions had a normal distribution and a range of 0.139. Below is the distribution plot and range for the adjusted weighting:
Step 5 - Evaluating Model Performance
The final step was to compare the predictions to the actual full-season batting averages. My primary evaluation metric was MAE, which came out to 0.022.
To help visualize my predictions I plotted the predicted values vs the actual values.
I also wanted to see if the model tended to overestimate or underestimate batting averages. The overestimate/underestimate ratio was 55%/45% indicating the model doesn't have any strong bias in one direction.
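A sketch of this evaluation step, with made-up prediction and actual arrays standing in for the real 300-player results:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up predicted and actual EoY batting averages standing in for the real data.
predicted = np.array([0.261, 0.288, 0.243, 0.302, 0.275])
actual = np.array([0.249, 0.301, 0.255, 0.289, 0.270])

# Share of players the model overestimates vs. underestimates.
over = np.mean(predicted > actual)
print(f"Overestimated {over:.0%} / underestimated {1 - over:.0%}")

# Predicted vs. actual scatter, with the y = x line for reference.
plt.scatter(actual, predicted)
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
plt.plot(lims, lims, linestyle="--")
plt.xlabel("Actual EoY AVG")
plt.ylabel("Predicted EoY AVG")
plt.show()
```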
The performance of the model wasn't great; an average miss of 22 batting average points can be significant over the course of a season. However, Sarah R. Bailey, Jason Loeppky, and Tim B. Swartz did a study comparing Statcast and PECOTA predictions of 2017 batting averages, and those models performed similarly: PECOTA's MAE was 0.0236 and Statcast's was 0.0209. Those values indicate this model is performing in the same range as other batting average prediction models.
Learning from the Experience
Since completing the first XGBoost model, I have learned more about modeling and reflected on how I could have done a better job. Input from others, along with reading about how others have built useful prediction models from baseball data, led me to the following main lessons:
- Check simpler models to consider the tradeoff between model performance and interpretability with XGBoost. (More on this below)
- Feature selection can be done iteratively and visualized to look for a point of diminishing returns. I happened to pick a good combination, but the process could have been easier and more mathematically sound.
- Model tuning can be done efficiently using Random Grid Search instead of guess-and-check techniques (a sketch follows this list). This both saves time and makes it more likely that the best combination of parameters is found.
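As a rough sketch of that idea, scikit-learn's RandomizedSearchCV can sample hyperparameter combinations at random; the parameter ranges below are illustrative rather than tuned recommendations:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Illustrative search space around the kinds of parameters tuned by hand above.
param_distributions = {
    "learning_rate": uniform(0.01, 0.19),  # samples from 0.01 to 0.20
    "n_estimators": randint(200, 1500),
    "max_depth": randint(3, 9),
    "subsample": uniform(0.6, 0.4),        # samples from 0.6 to 1.0
    "gamma": uniform(0, 5),
}

search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions=param_distributions,
    n_iter=50,
    scoring="neg_mean_absolute_error",
    cv=5,
    random_state=42,
)

# search.fit(X, y) would then expose the best settings through
# search.best_params_ and the corresponding score through search.best_score_.
```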
The main takeaway from my learning is to check the performance of easier-to-interpret models before using XGBoost. XGBoost can improve model performance, but in situations like this it can be very helpful to understand what is driving a model to predict the way it does. There isn't a set rule on when to choose which model, but it is good to check whether the opacity of XGBoost is worth the increased performance.
This is especially true for this project: after checking the performance of some multiple linear regression models, I found them to perform as well as, if not better than, my XGBoost model. These models are much easier to understand, so using one of them would have been a better decision. Below is a comparison of the model performance for my original XGBoost model, a multiple linear regression model using all 16 features, and a multiple linear regression model using the same 5 features as the XGBoost model:
Although it is disappointing to see that I clearly made the wrong choice for a model in this situation, it has provided a valuable lesson for me moving forward.
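For completeness, here is a rough sketch of how such a head-to-head comparison could be scripted; the file name and column layout are the same assumptions used in the earlier sketches, not the actual dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Same hypothetical layout as earlier: one column per statistic plus the target.
players = pd.read_csv("march_april_2018.csv")
target = "MarApr_AVG"
all_features = [c for c in players.columns if c != target]
top5 = ["OBP", "BABIP", "K%", "ISO", "BB%"]
y = players[target]

candidates = {
    "XGBoost (top 5 features)": (XGBRegressor(), top5),
    "Linear regression (all 16 features)": (LinearRegression(), all_features),
    "Linear regression (top 5 features)": (LinearRegression(), top5),
}

# Cross-validated MAE for each candidate model (lower is better).
for name, (estimator, features) in candidates.items():
    scores = cross_val_score(estimator, players[features], y,
                             scoring="neg_mean_absolute_error", cv=5)
    print(name, round(-scores.mean(), 4))
```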
Further Model Improvements
The MAE for all three of the models I created was 0.022. This would not be an ideal score if the model were to be implemented within an organization. To improve the model, two changes could be made:
1) Use each player's historical averages as the regression target instead of league averages. This would work especially well for OBP, K%, BB%, and ISO, which have been shown to have strong year-to-year correlations: they stay consistent for each player, so we can regress each player toward his own averages with more confidence than toward the league average. For a stat like BABIP, the year-to-year correlation is much lower, so a league average remains the better target. This more personalized model would likely give better results, especially for players whose stats differ significantly from league averages. (A rough sketch of this idea follows this list.)
2) Add additional stats and see whether they become top features and improve the final predictions. Statcast data like exit velocity and launch angle would be the first two features I would try, as they have been shown to be good indicators of hitter talent.
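A minimal sketch of the first idea, choosing a regression target per statistic; the stable/unstable split follows the year-to-year correlation discussion above, and the example numbers are made up:

```python
LEAGUE_AVG = {"OBP": 0.320, "BABIP": 0.300, "K%": 0.200, "BB%": 0.080, "ISO": 0.140}

# Stats with strong year-to-year correlation regress toward the player's own
# career averages; BABIP, which is far less stable, still regresses to the league mean.
STABLE_STATS = {"OBP", "K%", "BB%", "ISO"}


def regression_target(stat, player_career_avg=None):
    """Pick the value a player's early-season stat should be regressed toward."""
    if stat in STABLE_STATS and player_career_avg is not None:
        return player_career_avg
    return LEAGUE_AVG[stat]


# Made-up example: a career 0.380 OBP hitter regresses toward 0.380 rather than
# the league-wide 0.320, while his BABIP still regresses toward 0.300.
print(regression_target("OBP", 0.380))    # 0.380
print(regression_target("BABIP", 0.305))  # 0.300
```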
The following sources were used in my creation of the original model:
becominghuman.ai (https://becominghuman.ai/understand-regression-performance-metrics-bdb0e7fcc1b3) - using MAE for model performance evaluation
Beyond the Boxscore (https://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year) - year-over-year correlation of hitting statistics
Beyond the Boxscore (https://www.beyondtheboxscore.com/2017/12/26/16815098/babip-mlb-batting-average-on-balls-in-play-stats-statcast) - BABIP league average
Fangraphs (https://library.fangraphs.com/offense/rate-stats/) - K% and BB% league averages
Fangraphs (https://library.fangraphs.com/offense/iso/) - ISO league average
Fangraphs (https://library.fangraphs.com/offense/obp/) - OBP league average
Medium (https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) - choosing MAE for model performance evaluation
The Prediction of Batting Averages in Major League Baseball by Sarah R. Bailey, Jason Loeppky and Tim B. Swartz (http://people.stat.sfu.ca/~tim/papers/sarah.pdf) - using Statcast data to predict batting averages and a comparison for MAE performance
pydata.org (https://seaborn.pydata.org/tutorial/distributions.html) - distribution plot code
Towards Data Science (https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e) - tuning an XGBoost model