Batting average is not a particularly useful tool for player analysis, but it is something that fans and commentators often use. I thought adding an end-of-season projection to that conversation would be interesting, so I attempted to build a model that uses data from March and April of a season to predict end-of-year batting average.
This post will summarize my initial attempt, some things I have learned since then, and ideas to further improve model performance.
Understanding the Data
I had data for 300 random MLB players from the 2018 season. The only restriction on the list of players was that it contained no pitchers. The following features were available for analysis:
In addition, I had the actual end-of-year (EoY) batting averages for model evaluation. Before starting, I created a distribution plot of the March/April batting averages to see if this was a truly random selection of the league. The distribution below looks good, and I assumed each of the other statistics had a normal distribution as well.
General Approach and Assumptions
The first month of data for a batter is not a great indicator of full season success, so I needed to think about how I was going to approach this problem to account for players getting off to different starts. To do this, I applied the theory of regression to the mean. In this situation it means that everyone's final statistics will land somewhere between the league average and their performance in March/April. Applying this to just batting average was too simple, but applying it to the other statistics that are good predictors of batting average provides a more robust way to predict full season batting average. This assumption of regression to the mean gave me the general approach to this project:
Step 1 - Identify statistics that are good predictors of batting average using March/April statistics and March/April batting averages
Step 2 - Build a predictive model using those predictive statistics
Step 3 - Regress those statistics from March/April to the mean as a way of projecting full season performance
Step 4 - Use the regressed statistics to predict EoY batting averages
Step 5 - Use the actual EoY batting averages to evaluate model performance
Following these steps, I created and evaluated an XGBoost model to predict batting averages. I chose MAE as my key performance metric because it weights every error in proportion to its size. This is ideal because a single player who significantly outperforms or underperforms expectations in a year is not uncommon, but under a metric that squares the errors that one miss would make a model that otherwise predicts batting average well look poor. Additionally, MAE is easy to use to explain the model's performance: an MAE of 0.010 means the predictions are off by 0.010 on average.
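To make the reasoning about MAE concrete, here is a minimal sketch that compares MAE with RMSE on made-up numbers. The batting averages below are illustrative only and do not come from the dataset. Four predictions are close and one player has a breakout season, and that single miss moves RMSE far more than MAE.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: each miss counts in proportion to its size."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Illustrative values only: four close predictions and one breakout season.
actual = [0.250, 0.260, 0.270, 0.240, 0.330]
predicted = [0.255, 0.255, 0.265, 0.245, 0.260]

print(f"MAE:  {mae(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.4f}")
```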
Summary of Model Built for Submission
I followed the steps listed above to build an XGBoost model. I chose an XGBoost model because it typically outperforms simpler models and I have had success using it in the past.
Step 1 - Identifying Predictive Features
Of the statistics available, 16 were relevant to hitting performance and might be useful for model creation. I didn't want to use all 16 as that would make the model too hard to understand. To determine the best features, I built an XGBoost regression model with all 16 features and evaluated the gain each feature provided. I then looked for natural cutoffs in the feature importance. The feature importance from the full 16-feature model is shown below:
There were two natural cutoffs I could identify, and I tested them to see how the MAE changed compared to using all 16 features. The scores here are reported as negative MAE, so values closer to zero are better. The first combination was OBP and BABIP only. All of the other features had much less gain than these two, but the score worsened in this model from -0.0213 to -0.032. My next combination was OBP, BABIP, K%, ISO, and BB%. This combination had a score of -0.0205, which was better than using all 16 features. To ensure this was the best combination, I tried adding and subtracting a few features but ultimately decided the top 5 features were best.
Step 2 - Building and Tuning the Model
Now that I had my 5 features to use in the model, I needed to build and tune it so it would be ready to make predictions. Using some strategies for model tuning from a Towards Data Science article, I tuned the model using trial and error until MAE was optimized. This gave me the following parameters for my XGBoost Regression Model:
learning rate = 0.05
number of estimators = 1000
max depth = 5
subsample = 0.8
gamma = 0
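A minimal sketch of how these settings might be written down in code, assuming XGBoost's scikit-learn wrapper and its keyword names (`learning_rate`, `n_estimators`, `max_depth`, `subsample`, `gamma`). The mapping from the labels above to these keywords is an assumption.

```python
# Assumed mapping of the tuned settings onto XGBRegressor keyword names.
xgb_params = {
    "learning_rate": 0.05,
    "n_estimators": 1000,
    "max_depth": 5,
    "subsample": 0.8,
    "gamma": 0,
}

# With xgboost installed, the dictionary could be unpacked into the regressor:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**xgb_params)
for name, value in xgb_params.items():
    print(f"{name} = {value}")
```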
Step 3 - Regressing Statistics to the Mean
With a model built, I needed end of year statistics to feed into it and make batting average predictions. I found league average values for each statistic online:
OBP - 0.320
BABIP - 0.300
K% - 0.200
BB% - 0.080
ISO - 0.140
I then calculated a weight called 'Multiplier' for each player: the player's March/April plate appearances divided by the number of plate appearances in a typical full season, which I estimated to be 500. The Multiplier determined how much weight the March/April stats would have compared to league averages.
This was important because a player with 100 plate appearances and good statistics is more likely to stay above average than a player with 10 plate appearances and good statistics. A higher Multiplier created this effect, regressing stats of those with high plate appearances less.
For example, Elias Diaz only had 34 plate appearances while Manny Machado had 125. Diaz's Multiplier was 0.068 while Machado's was 0.250. This means Diaz's regressed stats are weighted 6.8% toward his March/April performance and 93.2% toward league averages. Machado's stats are weighted 25% toward his March/April performance and 75% toward league averages.
The regressed value of each statistic is a weighted average of the March/April value and the league average, with the Multiplier as the weight on the March/April value.
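The weighting described in this step can be sketched as follows. The league averages, the 500 plate appearance estimate, and the plate appearance counts for Diaz and Machado come from the text. The function names and the March/April stat line are hypothetical and only there to show the arithmetic.

```python
LEAGUE_AVG = {"OBP": 0.320, "BABIP": 0.300, "K%": 0.200, "BB%": 0.080, "ISO": 0.140}
EXPECTED_SEASON_PA = 500  # estimated plate appearances in a full season

def multiplier(plate_appearances, expected_pa=EXPECTED_SEASON_PA):
    """Weight given to a player's March/April stats."""
    return plate_appearances / expected_pa

def regress_to_mean(stats, plate_appearances, expected_pa=EXPECTED_SEASON_PA):
    """Weighted average of March/April stats and league averages."""
    m = multiplier(plate_appearances, expected_pa)
    return {name: m * value + (1 - m) * LEAGUE_AVG[name]
            for name, value in stats.items()}

# Hypothetical March/April line, not a real player's numbers.
early_stats = {"OBP": 0.400, "BABIP": 0.350, "K%": 0.150, "BB%": 0.100, "ISO": 0.200}

print(f"Multiplier at 34 PA:  {multiplier(34):.3f}")
print(f"Multiplier at 125 PA: {multiplier(125):.3f}")
for name, value in regress_to_mean(early_stats, 34).items():
    print(f"{name}: {value:.3f}")
```

With only 34 plate appearances, the regressed line stays very close to the league averages, which is the intended effect.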
Step 4 - Making Final Predictions
With my model from Step 2 and regressed statistics from Step 3, I made my predictions for EoY batting average. However, before comparing them to the actual results I wanted to make sure they made sense.
I was concerned about the regression techniques I used and how they would affect the stats for each player. To see how they affected the predictions, I looked at the range of values predicted and a distribution plot of the predictions. Ideally, the predictions would have a normal distribution and a range similar to the 2017 range, which was 0.203 to 0.342 (0.139 range).
The values were more heavily regressed to the mean than I had expected and the distribution was not normal. This told me I was undervaluing the March/April stats. To adjust the regression effects, I changed the expected plate appearances from Step 3 to 100 which produced a good normal distribution. Then I pushed the value up slowly until I started to lose that normal distribution. I wanted the value to be as high as possible since 100 plate appearances weighs the March/April stats very heavily. I settled on a value of 180 which adjusted the range of predictions to 0.182 to 0.321 (0.139 range) while keeping a normal distribution.
Compared to the original constant, Diaz's March/April stats went from having a weight of 6.8% to a weight of 18.9% (34/180), and Machado's March/April weight increased from 25% to 69.4% (125/180).
Once the weighting constant was adjusted, I reran the model to get my final predictions. The new predictions had a normal distribution and a range of 0.139. Below is the distribution plot and range for the adjusted weighting:
Step 5 - Evaluating Model Performance
The final step was to compare the predictions to the actual full season batting averages. My primary evaluation metric was MAE, which came out to 0.022.
To help visualize my predictions I plotted the predicted values vs the actual values.
I also wanted to see if the model tended to overestimate or underestimate batting averages. The overestimate/underestimate ratio was 55%/45% indicating the model doesn't have any strong bias in one direction.
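A minimal sketch of how an overestimate/underestimate split like this could be computed. The function name and the sample values are hypothetical and are not the real predictions.

```python
def over_under_split(actual, predicted):
    """Return the share of predictions above and below the actual value."""
    over = sum(p > a for a, p in zip(actual, predicted))
    under = sum(p < a for a, p in zip(actual, predicted))
    total = over + under  # exact matches are left out of the split
    if total == 0:
        return 0.0, 0.0
    return over / total, under / total

# Illustrative values only.
actual = [0.250, 0.280, 0.230, 0.300]
predicted = [0.262, 0.270, 0.241, 0.288]

over_share, under_share = over_under_split(actual, predicted)
print(f"overestimated: {over_share:.0%}, underestimated: {under_share:.0%}")
```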
The performance of the model wasn't great. A difference of 22 batting average points can be significant over the course of a season. However, a study by Sarah R. Bailey, Jason Loeppky, and Tim B. Swartz compared Statcast and PECOTA predictions for 2017 batting average, and both performed similarly. PECOTA's MAE was 0.0236 and Statcast's was 0.0209. Those values indicate this model is performing in the same range as other batting average prediction models.
Learning from the Experience
After completing the first XGBoost model I have learned more about modeling and reflected on how I could have done a better job. Input from others as well as reading about how others have made useful prediction models using baseball data have led me to the following main points of learning.
- Check simpler models to consider the tradeoff between model performance and interpretability with XGBoost. (More on this below)
- Feature selection can be done iteratively and visualized looking for a point of diminishing returns. I happened to pick a good combination but the process could have been easier and more mathematically sound.
- Model tuning can be done efficiently using Random Grid Search instead of guess and check techniques. This is all about efficiency and ensuring the best combination of parameters is selected.
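To show the idea behind a random search, here is a minimal sketch using only the standard library. The search space values and the `stand_in_mae` scoring function are hypothetical placeholders. A real run would train a model and return its cross-validated MAE at that point.

```python
import random

# Hypothetical search space; the candidate values are illustrative.
search_space = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.6, 0.8, 1.0],
}

def sample_params(space, rng):
    """Draw one random combination from the search space."""
    return {name: rng.choice(values) for name, values in space.items()}

def random_search(score_fn, space, n_iter, seed=0):
    """Score n_iter random combinations and keep the lowest score (e.g. MAE)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = sample_params(space, rng)
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def stand_in_mae(params):
    """Placeholder score; a real run would fit a model and measure MAE here."""
    return abs(params["learning_rate"] - 0.05) + abs(params["max_depth"] - 5) / 100

best_params, best_score = random_search(stand_in_mae, search_space, n_iter=10)
print(best_params, round(best_score, 4))
```

In practice, scikit-learn's `RandomizedSearchCV` offers the same kind of search with cross-validation built in, so a hand-written loop like this is mainly useful for seeing what the search does.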
The main takeaway from my learning is to check the performance of easier-to-interpret models before using XGBoost. XGBoost can improve model performance, but in situations like this it can be very helpful to understand what is driving a model to predict things the way it does. There isn't a set rule on when to choose which model, but it is good to check whether the opacity of XGBoost is worth the increased performance.
This is especially true for this project: I checked the performance of some multiple linear regression models and found that they performed as well as, if not better than, my XGBoost model. These models are much easier to understand, so using one of them would have been a better decision. Below is a comparison of the model performance for my original XGBoost model, a multiple linear regression model using all 16 features, and a multiple linear regression model using the same 5 features as the XGBoost model:
Although it is disappointing to see that I clearly made the wrong choice for a model in this situation, it has provided a valuable lesson for me moving forward.
Further Model Improvements
The MAE score for all three of the models I created was 0.022. This would not be an ideal score if this model were to be implemented within an organization. To improve the model, two changes could be made:
1) Use each player's historical averages for regression instead of league averages. This would work especially well with OBP, K%, BB%, and ISO because they have been shown to have a strong year-to-year correlation. They stay consistent for each player, so we can expect each player to regress to their own averages with more confidence than to the league average. For a stat like BABIP, the year-to-year correlation is much lower, so the league average would be the better target. This more personalized model would likely give better results, especially for players whose stats differ significantly from league averages.
2) Add additional stats and see if they become top features and improve final predictions. Statcast data like exit velocity and launch angle would be two features I would start with as they have been shown to be good indicators of hitter talent.
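The first improvement could be sketched roughly as follows: stats with strong year-to-year correlation regress toward the player's own history, while BABIP keeps the league average as its target. The function names, the career averages, and the early-season line are hypothetical. Only the league averages and the split between stats come from the text.

```python
LEAGUE_AVG = {"OBP": 0.320, "BABIP": 0.300, "K%": 0.200, "BB%": 0.080, "ISO": 0.140}

# Stats described above as consistent from year to year for a given player.
PLAYER_BASELINE_STATS = {"OBP", "K%", "BB%", "ISO"}

def regression_target(stat, career_avg):
    """Use the player's own average when it is available and the stat is stable."""
    if stat in PLAYER_BASELINE_STATS and stat in career_avg:
        return career_avg[stat]
    return LEAGUE_AVG[stat]

def regress(stats, career_avg, weight):
    """Weighted average of early-season stats and the chosen target."""
    return {stat: weight * value + (1 - weight) * regression_target(stat, career_avg)
            for stat, value in stats.items()}

# Hypothetical player: numbers are illustrative only.
early_stats = {"OBP": 0.400, "BABIP": 0.350}
career_avg = {"OBP": 0.360, "BABIP": 0.330}

projected = regress(early_stats, career_avg, weight=0.25)
print({stat: round(value, 4) for stat, value in projected.items()})
```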
The following sources were used in my creation of the original model:
becominghuman.ai (https://becominghuman.ai/understand-regression-performance-metrics-bdb0e7fcc1b3) - using MAE for model performance evaluation
Beyond the Boxscore (https://www.beyondtheboxscore.com/2011/9/1/2393318/what-hitting-metrics-are-consistent-year-to-year) - year-over-year correlation of hitting statistics
Beyond the Boxscore (https://www.beyondtheboxscore.com/2017/12/26/16815098/babip-mlb-batting-average-on-balls-in-play-stats-statcast) - BABIP league average
Fangraphs (https://library.fangraphs.com/offense/rate-stats/) - K% and BB% league averages
Fangraphs (https://library.fangraphs.com/offense/iso/) - ISO league average
Fangraphs (https://library.fangraphs.com/offense/obp/) - OBP league average
Medium (https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d) - choosing MAE for model performance evaluation
The Prediction of Batting Averages in Major League Baseball by Sarah R. Bailey, Jason Loeppky and Tim B. Swartz (http://people.stat.sfu.ca/~tim/papers/sarah.pdf) - using Statcast data to predict batting averages and a comparison for MAE performance
pydata.org (https://seaborn.pydata.org/tutorial/distributions.html) - distribution plot code
Towards Data Science (https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e) - tuning an XGBoost model
The Houston Astros are a league leader in using data analytics to identify and develop talent. Recently, their use of spin rate has become more public and the success they have had finding pitching talent speaks for itself. There is more to the Astros' success than finding players with certain spin rate characteristics, but it is likely the first step they take when evaluating pitchers. With the 2020 season wrapping up, I wanted to look ahead to the free agent pitchers and see if any are good fits when using this method of assessment.
Using Past Astros Acquisitions to Identify an Ideal Pitcher Profile
Numerous articles and reports offer a starting point for building the profile the Astros prefer. Here are a few articles that discuss the Astros' preferences and how they use those pitch characteristics to be successful:
Highest Team Spin in 2017-2018 - https://www.crawfishboxes.com/2018/5/3/17316840/digging-into-the-data-astros-spin-rates
Ryan Pressly's Transformation - https://www.theringer.com/mlb/2019/6/3/18644512/mvp-machine-how-houston-astros-became-great-scouting
Gerrit Cole's Transformation - https://fivethirtyeight.com/features/how-gerrit-cole-went-from-so-so-to-unhittable/
From this information I identified some characteristics the Astros are using to look for pitching potential:
- Above average four-seam fastball spin
- Above average curveball spin
- High use of a below average two-seam fastball/sinker, especially with poor performance
To validate these characteristics and their utility in identifying pitching talent, I looked at the pitchers on the Astros' 2019 ALCS roster who were acquired after 2015, when spin rate data became available. Seven pitchers met these criteria; I removed Joe Smith because he throws from a sidearm slot, leaving six. Below is a breakdown of the pitchers used for the initial analysis and when they were acquired by the Astros:
Pitchers Traded for:
Justin Verlander - 2017
Gerrit Cole - 2018
Ryan Pressly - 2018
Roberto Osuna - 2018
Zack Greinke - 2019
Pitchers Signed in Free Agency:
Hector Rondon - 2017
To analyze the Astros' preference for a high-spin four-seam fastball, I used Statcast data for each pitcher from the two years before he joined the Astros and all available data since. I also used league-wide pitch data from 2017 to 2019 to get a league average and standard deviation for spin rate (2270 +/- 175 rpm). The figure below shows the results:
The data on four-seam fastball spin rates provides the following insights:
- 4 of the 6 pitchers had spin rates greater than the league average before joining the Astros, and 3 had spin rates more than one standard deviation greater than the league average.
- 5 of the 6 pitchers had spin rates greater than the league average after joining the Astros, and 4 had spin rates more than one standard deviation greater than the league average.
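The comparison against the league average and standard deviation can be sketched as a simple z-score check. The league figures (2270 +/- 175 rpm) come from the text. The function names and the three sample spin rates are hypothetical and do not belong to any pitcher discussed here.

```python
LEAGUE_MEAN_RPM = 2270  # league average four-seam spin, 2017 to 2019
LEAGUE_SD_RPM = 175     # league standard deviation

def spin_z_score(spin_rpm, mean=LEAGUE_MEAN_RPM, sd=LEAGUE_SD_RPM):
    """Number of standard deviations a spin rate sits above the league average."""
    return (spin_rpm - mean) / sd

def classify_spin(spin_rpm):
    """Bucket a spin rate the same way the insights are phrased."""
    z = spin_z_score(spin_rpm)
    if z > 1:
        return "more than one SD above average"
    if z > 0:
        return "above average"
    return "at or below average"

# Illustrative spin rates only.
for rpm in (2200, 2350, 2500):
    print(rpm, f"{spin_z_score(rpm):+.2f}", classify_spin(rpm))
```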
I used the same analysis methods to look at curveball spin rates. To make the analysis a little easier, I combined all pitches tagged as a Knuckle Curveball into this category. The league average values were 2505 +/- 295 rpm. The following figure shows the results:
Hector Rondon and Roberto Osuna don't throw a curveball, so their spin rates were zero, leaving only 4 pitchers for comparison. The data on curveball spin rates provides the following insights:
- 3 of the 4 pitchers had spin rates greater than the league average before joining the Astros, and 1 had a spin rate more than one standard deviation greater than the league average.
- 3 of the 4 pitchers had spin rates greater than the league average after joining the Astros, and 3 had spin rates more than one standard deviation greater than the league average.
The final characteristic identified in the articles focused on two-seam fastballs/sinkers. This characteristic really has three parts:
- Spin rate. With a two-seam fastball, pitchers actually want lower spin to generate more downward movement. The league average values were 2148 +/- 175 rpm.
- Use rate. A pitcher who leans heavily on a two-seam fastball with above-average spin has more to gain by using it less.
- Poor performance. I used wOBA because it is readily available with Statcast data.
I analyzed two-seam fastball data for each of the six Astros pitchers in those three categories, seen in the figures below.
Justin Verlander doesn't throw a two-seam fastball so there are only 5 pitchers to compare. Looking at the three sets of data together we can see some common trends:
- 3 of the 5 pitchers had two-seam fastball spin rates that were higher than the league average, and 2 had spin rates more than one standard deviation (175 rpm) above it.
- All 5 of the pitchers have decreased their two-seam fastball use, and everyone except Hector Rondon has decreased his use significantly.
- All 5 of the pitchers had high wOBA on two-seam fastballs before joining the Astros indicating the pitch was not effective.
Using the four-seam fastball, curveball, and two-seam fastball data together we can validate the pitcher profile identified in the articles. With the exception of Justin Verlander and Hector Rondon, all of the pitchers were using a two-seam fastball that was ineffective. Spin rates were high, use was high, and wOBA was high for that pitch. After joining the Astros, two-seam fastball use decreased. The Astros instead had those pitchers rely on their high spin four-seam fastball and curveball. Justin Verlander doesn't throw a two-seam fastball but had exceptional spin rates on his four-seam fastball and curveball. Even Zack Greinke, whose spin rates were around league average, decreased use of his least effective pitch. These characteristics helped the Astros identify pitching potential and build their pitching staff.
Identifying Free Agents that Match the Pitcher Profile
Now that the characteristics were validated, I wanted to see if any of the 2020 free agents matched this profile and could benefit from a change in approach. I wanted to find pitchers who met all three criteria. These pitchers will have the most to gain and therefore offer the most value.
There are 47 starters and 53 relievers on the mlb.com list of potential 2020 free agents so I broke them into groups of 10 to keep the graphs readable. I will present them below (as a slide show) using the past two seasons of data for each pitcher and the same methods as above. You can look through each figure yourself or skip to the end and see the pitchers I identified and a closer look at their numbers.
Four-seam fastball spin rate data:
Curveball spin rate data:
Two-seam fastball spin rate:
Two-seam fastball use data:
Two-seam fastball wOBA data:
To filter through all 100 pitchers, I tried various filter combinations that gave me a short list of pitchers. Ultimately I landed on parameters that required a pitcher to have either an elite four-seam fastball or curveball and two-seam fastball characteristics that indicate decreased use would benefit them. The parameters used were:
- Four-seam fastball spin in the 75th percentile or better (> 2388 rpm) OR
Curveball spin in the 75th percentile or better (> 2704 rpm)
- Two-seam fastball spin greater than the league average (> 2148 rpm)
- Two-seam fastball use greater than 10%
- Two-seam fastball wOBA greater than 0.300
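The filter parameters above can be expressed as a small predicate. The thresholds come from the list. The field names (`ff_spin`, `cu_spin`, `si_spin`, `si_use`, `si_woba`) and the three sample pitchers are hypothetical and only illustrate how the OR and AND conditions combine.

```python
FOUR_SEAM_P75_RPM = 2388        # 75th percentile four-seam spin
CURVEBALL_P75_RPM = 2704        # 75th percentile curveball spin
TWO_SEAM_LEAGUE_AVG_RPM = 2148  # league average two-seam spin
MIN_TWO_SEAM_USE = 0.10
MIN_TWO_SEAM_WOBA = 0.300

def fits_profile(pitcher):
    """Elite four-seam OR curveball spin, AND a heavily used, ineffective two-seamer."""
    elite_pitch = (pitcher["ff_spin"] > FOUR_SEAM_P75_RPM
                   or pitcher["cu_spin"] > CURVEBALL_P75_RPM)
    weak_two_seam = (pitcher["si_spin"] > TWO_SEAM_LEAGUE_AVG_RPM
                     and pitcher["si_use"] > MIN_TWO_SEAM_USE
                     and pitcher["si_woba"] > MIN_TWO_SEAM_WOBA)
    return elite_pitch and weak_two_seam

# Made-up pitchers; a curveball spin of 0 means the pitch is not thrown.
pitchers = [
    {"name": "Pitcher A", "ff_spin": 2450, "cu_spin": 2600,
     "si_spin": 2250, "si_use": 0.25, "si_woba": 0.360},
    {"name": "Pitcher B", "ff_spin": 2300, "cu_spin": 0,
     "si_spin": 2200, "si_use": 0.30, "si_woba": 0.350},
    {"name": "Pitcher C", "ff_spin": 2200, "cu_spin": 2800,
     "si_spin": 2300, "si_use": 0.05, "si_woba": 0.400},
]

shortlist = [p["name"] for p in pitchers if fits_profile(p)]
print(shortlist)
```

Pitcher B fails because neither primary pitch is elite, and Pitcher C fails because the two-seam fastball is used too rarely for a change to matter.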
This cut the list down to 10:
This group's four-seam fastball data:
Adam Wainwright and Trevor Cahill's four-seam fastball spin is below the league average which could be problematic for this profile. I will eliminate them from further analysis.
All of the other pitchers have good spin on their four-seam fastballs which meets the ideal pitcher profile. Yu Darvish, Rick Porcello, and Tommy Hunter have especially good four-seam fastball spin.
Curveball spin data:
Steve Cishek and Arodys Vizcaino don't throw a curveball. They don't exactly fit the Astros profile I've identified and will be eliminated from further analysis. Further analysis into their other pitches could still make them attractive to a similar process, but that is outside the scope of this post.
Rick Porcello and Tommy Hunter have good four-seam fastball spin and also have good curveball spin. Jake Arrieta also has curveball spin that is well above the league average.
Two-seam fastball spin data:
Everyone's two-seam fastball spin is high because of the filter I used, so this metric doesn't provide a ton of insight into potential gains. However, Rick Porcello and Tommy Hunter have especially poor (high) spin on their two-seam fastballs. Combining this information with two-seam fastball use and wOBA will be more insightful.
Two-seam fastball use data:
Yu Darvish uses his two-seam fastball the least out of this group and therefore might have the least to gain from throwing it less. Jake Arrieta used his two-seam fastball more than 50% of the time and would require the greatest change in approach.
Two-seam fastball wOBA:
wOBA on two-seam fastballs was poor across the group, again because of my filtering process, but Edinson Volquez has especially poor performance. Everyone in the group had a higher wOBA on two-seam fastballs than overall. Replacing the two-seam fastball with a more effective pitch or pitches would benefit their performance.
Based on this assessment I've ranked the top 5 'Astros Type' free agent pitchers for 2020. Each pitcher has above average spin on four-seam fastballs and curveballs and would benefit greatly from decreasing use of a poor two-seam fastball.
1 - Rick Porcello
2 - Tommy Hunter
3 - Edinson Volquez
4 - Yu Darvish
5 - Jake Arrieta
The ability to teach higher or lower spin rates would give a team a huge edge in player development and in identifying high-value players who could improve their spin rates on certain pitches. Gerrit Cole saw increased spin rate on his four-seam fastball after joining the Astros. He attributes this to adjustments in grip and release, but the spike has caused controversy around the league. Spin is currently considered a trait that isn't alterable, but more research with high-speed cameras like Edgertronic or pitch-tracking devices like Rapsodo could crack the code to perfecting a pitch's spin rate.
Identifying these characteristics is far from all the Astros do. Pitchers have to commit to changing their pitch selection. Additionally, locating a fastball up in the zone, like the Astros prefer, requires solid command and confidence from a pitcher. Maybe the Astros or another team will sign one of these pitchers this offseason and convince them to change their pitching style. I think it would benefit their career and in turn the team who signs them.
More analysis could be done on this group of pitchers. Steve Cishek and Arodys Vizcaino might have great spin on their sliders and adding slider spin could reveal other potential pitchers. Additionally, the Cubs have implemented an approach that is exactly the opposite, focusing on pitchers with especially low spin two-seam fastballs and changeups which leads to better movement in the bottom of the zone. A club could optimize each pitcher based on their spin profiles and find great value in the free agent market.
Who do you think I missed as a high spin pitching target this offseason?
Each play in the playoffs holds extra weight compared to the regular season. An error can change a game and a loss can doom a series. In close games and series, it is often the team that executes the small plays that comes out on top.
A particular play in Game 2 of the NLDS between Washington and Los Angeles stood out in this context: Asdrúbal Cabrera singled to right field, driving in Ryan Zimmerman. However, the throw from the outfield held up Kurt Suzuki at third base and Cabrera was thrown out trying to advance to second base on the throw. Although the Nationals still won the game, the base running error was not inconsequential in the series.
Evaluating the Result with WE and RE24
Two statistics - Win Expectancy (WE) and RE24 - can be used to show why trying to advance was a bad decision.
Win Expectancy is the probability a team will win given the specific circumstances. Greg Stoll's Win Expectancy Calculator shows how different base running outcomes by Cabrera change Washington's WE in Table 1, below.
The Nationals' Win Expectancy before Cabrera's single was 82.7%. The highest WE, 93.4%, results when Cabrera reaches second base, but staying at first lowers Washington's WE by only 0.9 percentage points. In comparison, getting thrown out lowers WE by 6.7 percentage points relative to staying at first. A 0.9-point increase in WE is not worth risking 6.7 points, especially in a playoff game where you have the lead.
RE24 captures the change in Run Expectancy (RE) while accounting for runs scored during the play. RE is the same concept as WE, except that it measures the number of runs a team is expected to score in the rest of the inning rather than the probability of winning the game. The equation for RE24 is:
RE24 = RE at the end of the play - RE at the start of the play + runs scored on the play
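One way to quantify this tradeoff is the break-even success rate: the probability of reaching second safely at which the expected change in WE is zero. The sketch below uses only the two figures quoted above (0.9 points gained, 6.7 points lost, both relative to staying at first). The function name is hypothetical.

```python
def break_even_success_rate(gain, loss):
    """Success probability p that solves p * gain - (1 - p) * loss = 0."""
    return loss / (gain + loss)

WE_GAIN_IF_SAFE = 0.9  # percentage points gained by reaching second
WE_LOSS_IF_OUT = 6.7   # percentage points lost by being thrown out

rate = break_even_success_rate(WE_GAIN_IF_SAFE, WE_LOSS_IF_OUT)
print(f"break-even success rate: {rate:.1%}")
```

Under these numbers, Cabrera would need to be safe roughly 88% of the time just to break even on Win Expectancy.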
Using Fangraphs' RE24 matrix the change in run expectancy can be evaluated. Results are shown in Table 2.
RE24 also shows how risky Cabrera's base running was. Making it to second increases his RE24 by 0.212, but getting thrown out decreases it by 0.727. Staying at first would have given the Nationals a good opportunity to score more runs and pad their lead in a big road playoff game.
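For reference, RE24 for a single play is the run expectancy at the end of the play, minus the run expectancy at the start, plus any runs scored on the play. A minimal sketch of that calculation follows. The run expectancy values are placeholders chosen for illustration and are not taken from the Fangraphs matrix.

```python
def re24(re_start, re_end, runs_scored):
    """Change in run expectancy on a play, plus runs scored on that play."""
    return re_end - re_start + runs_scored

# Placeholder run expectancies, NOT real values from the RE24 matrix.
re_before_play = 0.9
re_after_play = 1.1
runs_on_play = 1

value = re24(re_before_play, re_after_play, runs_on_play)
print(f"RE24 for the play: {value:.3f}")
```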
Burning Scherzer in Relief
It is easy to dismiss this base running error because the Nationals won the game and were never really at risk of losing afterward. In fact, their WE never dropped below 80% for the rest of the game. But the difference between a close win and a blowout win can be huge later in the series.
Dave Martinez used ace Max Scherzer (2.45 FIP) in relief in the bottom of the 8th inning to shut down the Dodgers. This puts his projected start Sunday in question and could cause Anibal Sanchez (4.44 FIP) to start in Scherzer's place. A bump Sunday means everyone gets pushed back and either Patrick Corbin (3.49 FIP) or Stephen Strasburg (3.25 FIP) only makes one appearance in the series.
If Cabrera stays at first, Washington has a better chance to score more runs in the 8th inning. If they do, maybe Martinez skips Scherzer in relief and the three-man rotation remains intact for Games 3-5.
There is no way to know the direct effect of Cabrera's base running error, but in the playoffs execution of baseball fundamentals becomes all the more important. Washington will need to avoid more mistakes like these if they want to knock off the NL favorite Dodgers.
 Greg Stoll's Win Expectancy Calculator -
 Adjusting RE24 for baserunning -
 RE24 -