The Idiot’s Guide to Baseball Projections: By Nate Rawlings
SECTION I: NEWTON’S LAW
Patterns are an inevitable, even inescapable, fact of life. If the patterns of people are carefully observed, certain outcomes can be quite accurately anticipated (and often gambled on) before they occur.
When it comes to baseball, much of the same logic applies. Sometimes, what our “gut” is telling us about the team we follow is merely a subconscious observation of a pattern we don’t even know we caught onto. Much of what I do revolves around unearthing patterns in the game and finding (or creating) stats that illustrate what our guts haven't been able to put into words for so long.
With respect to any sort of statistical model, the K.I.S.S. method (Keep It Simple, Stupid) is always ideal. The more extraneous or unnecessary variables you introduce, the more potential points of failure your model has. Nothing gets introduced unless the results of its inclusion can justify its use.
Projection models are always an interesting pursuit, since the “best” design doesn’t always yield the most correct results. Not every season boils down to the most predictable (and statistically likely) result. The top overall seed doesn't win every March Madness tournament, the favorite in Vegas doesn't win every Super Bowl, and the most statistically probable outcome doesn’t unfold every baseball season. However, unlike sports with much shorter seasons (where statistically unlikely results have a much greater impact), Major League Baseball’s 162-game slate provides a large enough sample size that the end result is much more likely to fall within an expected range.
Newton's First Law states that, “An object at rest tends to remain at rest, and an object in motion tends to remain in motion, unless it is acted on by some other force.”
Generally, this law holds true, not just for objects in the vacuum of space, but across a broad array of topics in day-to-day life. A broke college kid will remain broke until a gracious employer overlooks the fact that the kid doesn’t have the preferred 10 years of experience for an entry-level position and hires them anyway.
In baseball, a playoff-bound franchise is likely to remain playoff-bound, and a losing franchise is likely to continue its losing ways, until acted upon by some other force (e.g., an injury, a trade, or player development or decline).
So, when it comes to projecting this season’s results, the best place to start is by simply taking the results from last season. If this year were to occur in a vacuum where the results were immune to the impact of aging, unforeseen injuries, and offseason player movement, the outcome would simply be a repeat of last season. It’s as simple as saying if this year’s results equal “X”, and last year’s results equal “Y”, then X = Y.
When comparing actual results in the win/loss column with a team’s performance, it doesn't take an original stat or mind-numbing algorithm to let the average fan know that the record isn't always an accurate reflection of a team’s overall effort. The first adjustment we can apply is to change last season’s win total to one that more accurately reflects overall performance. To win a game, a ball club must score more runs than it allows. A simple way to compare actual wins to expected wins is by examining a team’s run differential.
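This piece doesn't spell out how run totals get converted into expected wins. One widely used conversion is Bill James's Pythagorean expectation, so the minimal sketch below assumes that formula with its classic exponent of 2. The function name and the sample run totals are hypothetical and purely for illustration.

```python
def expected_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Estimate wins from run totals.

    Assumption: uses the Pythagorean expectation,
    win% = RS^exp / (RS^exp + RA^exp), which is one common choice
    and not necessarily the adjustment used in this article.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals cannot be negative")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    if rs + ra == 0:
        raise ValueError("at least one run total must be positive")
    return games * rs / (rs + ra)


# Hypothetical team that scores exactly as many runs as it allows.
print(round(expected_wins(700, 700)))  # 81
```

A team with a run differential of zero lands at exactly .500, which matches the intuition behind picking Team A in the next section.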
SECTION II: EXAMPLES FROM LAST SEASON
Let's examine a real-life example from last season. In the following chart, we can see the run differentials of four unidentified teams.
Now, if I were to tell you that one of these teams finished with a record that was nearly .500 (80-82), which one would you pick? Do you go with Team A, which scored about as many runs as it gave up, or with another team that scored significantly fewer runs than it gave up?
If you guessed Team A, you’d be correct!
However, the results are not always so cut and dried. As you are about to find out, they can be quite misleading.
What if I were to tell you that a second team ALSO finished with the same record as Team A? Which one would you pick? Every other team on this list gave up far more runs than Team A while scoring significantly fewer.
Furthermore, what if I told you the remaining two teams finished 10 games apart in the standings, in spite of near-identical run differentials?
As bizarre as this all sounds, this is exactly what happened last season for the Angels, Royals, Blue Jays, and Phillies.
Before we lose all faith in the ability of run differential to accurately reflect (and project) wins for a team, let’s not forget that the Angels and the Blue Jays have records that correlate with their run differentials. As for the other two teams, the Royals and the Phillies dealt with several mid-season variables that significantly impacted their trajectories, leading to skewed results. Teams like the 2017 Royals and Phillies are why I am painstakingly going over all of this. While the superficial results make sense for half of these teams, there is a deeper layer of analysis that, when uncovered, helps us make sense of the remaining results (and teaches us what to look for when predicting future outcomes).
The Royals faded down the stretch and dealt with their top starter, Danny Duffy, missing a quarter of the season (making only 24 of 32 starts). The Royals scored a similar number of runs to the Angels, who won 80 games, but allowed 82 more. This can easily be accounted for by having the best pitcher on the staff for only 75% of the season, with replacement-level pitching yielding a disproportionate number of runs relative to the losses the team accrued. On top of that, the Royals lost their star catcher just a week after the trade deadline.
The Phillies shipped out one of their top three starting pitchers at the August 1st trade deadline, along with one of their most statistically productive position players. The team was also dealing with injuries to two of its top three remaining starters.
When solid contributors on offense and defense are only intermittently available, a team’s win total can easily diverge from that of a different team with nearly identical offensive and defensive totals.
These variables can create volatility and an apparent inaccuracy in a team’s overall results. They make teams streaky, and over a large enough stretch of games they can significantly alter the outcome of an entire season. While it is insultingly simplistic to assume that a team’s year-to-year production (and results) will be similar even when significant contributors were limited by injury or traded away, the overall volatility of production is often constant. Injuries are ever-present, especially among certain players. Different front offices deploy different strategies and styles that influence volatility as well. If we can spot a pattern among teams that consistently yield superior results with inferior production, or inferior results with superior production, we can apply that anticipated skew on a team-specific basis.
SECTION III: MATH
So, to recap, we have now taken our projections from simply being…
X = Y
(X is this year, Y is last year)
to
X = y
(lower case ‘y’ is the adjusted wins of last year)
to
X = y(V)
(‘V’ is a volatility coefficient that reflects a trend in skewed results for a specific team)
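How ‘V’ gets calculated isn't defined here, so the following is only one possible reading: treat it as the average ratio of a team's actual wins to its adjusted wins over several past seasons. The function name and the win totals in the example are hypothetical.

```python
def volatility_coefficient(seasons):
    """Average ratio of actual wins to adjusted wins for one team.

    Assumption: V is modeled as a simple mean of per-season ratios;
    the article does not specify how V is actually derived.
    `seasons` is a list of (actual_wins, adjusted_wins) pairs.
    """
    ratios = [actual / adjusted for actual, adjusted in seasons if adjusted > 0]
    if not ratios:
        raise ValueError("need at least one season with positive adjusted wins")
    return sum(ratios) / len(ratios)


# Hypothetical team that keeps outperforming its adjusted win total.
history = [(80, 72), (85, 80), (90, 86)]
v = volatility_coefficient(history)

# X = y(V): scale last year's adjusted wins (a made-up 75 here) by V.
print(75 * v)
```

A value above 1 would describe a team that tends to win more than its production suggests, and a value below 1 the opposite.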
The formula now takes into account team production, and the ability to capitalize on that production, in making projections from one year to the next. However, two very obvious variables remain unaddressed. The first is significant player transactions made in the offseason. The second is time.
Addressing roster turnover is simple enough. Simply cut and paste the past production of new players in and remove the existing production of former players, while adjusting for any change in roles between their new and former teams. For example, if a player goes from being a starting pitcher on their former team to slotting in as a long reliever on their new one, it would be inaccurate to credit their new team with 150-200 innings of production from a guy who is likely to throw only 60 to 100 innings in the coming season.
As for the effects of time, I will write a separate piece later that examines the many ways to account for aging in anticipating player growth or decline. For now, let’s simply acknowledge its impact for the purpose of this narrative.
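As a rough illustration of the role adjustment, the sketch below scales a pitcher's past production by the share of innings he is expected to throw in his new role. The linear scaling, the function name, and the numbers are all assumptions made for the example; in practice a pitcher's per-inning results may also change when he moves between starting and relieving.

```python
def role_adjusted_production(past_production, past_innings, projected_innings):
    """Scale past production to a new workload.

    Assumption: production scales linearly with innings pitched.
    The unit of production (wins, runs, etc.) is whatever the model uses.
    """
    if past_innings <= 0:
        raise ValueError("past_innings must be positive")
    return past_production * (projected_innings / past_innings)


# Hypothetical starter worth 3.0 wins over 180 innings who is
# expected to throw 90 innings as a long reliever.
print(role_adjusted_production(3.0, 180, 90))  # 1.5
```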
We have now gone from…
X = y(V)
to
X = y(V) + Pn - Pf
(Pn represents new players, Pf represents former players)
to
X = y(V) + R(Pn) - Pf
(R represents the change in roles of new and existing players)
to
X = [ y(V) + R(Pn) - Pf ]^T
(T represents the impact of time on the entire projection as a whole)
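For readers who prefer code to algebra, here is a minimal sketch of the final formula. It reads y(V) and R(Pn) as plain multiplication and T as a literal exponent, which is an assumption about the notation rather than a confirmed method. All parameter names and sample values are hypothetical.

```python
def project_wins(adjusted_wins, v=1.0, new_production=0.0,
                 role_factor=1.0, former_production=0.0, t=1.0):
    """Evaluate X = [ y(V) + R(Pn) - Pf ]^T.

    Assumptions: y(V) and R(Pn) are products, and T is an exponent
    where 1.0 means time has no effect. With every default left in
    place, the result collapses back to X = y.
    """
    base = adjusted_wins * v + role_factor * new_production - former_production
    if base < 0:
        raise ValueError("projected base wins cannot be negative")
    return base ** t


# Hypothetical inputs: 80 adjusted wins, a team that slightly
# outperforms its production, 4 wins added at half the prior role,
# and 3 wins of departing production.
print(project_wins(80, v=1.05, new_production=4.0,
                   role_factor=0.5, former_production=3.0))
```

Leaving each argument at its default and switching them on one at a time retraces the same progression as the formulas above, from X = y through to the full expression.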
My apologies to anyone who had repressed nightmares from high school algebra dredged up by the depictions of this whole process in a step-by-step formula, but there is a reason mathematicians use them. I wanted a concise way to sum up everything covered here in a way that illustrated how all these variables related to each other for the reader. Stay tuned for when I apply what we discussed here to analyze and project the upcoming season!