Introduction: Reviewing the Model

It’s been nearly three weeks since the election, and it seems we finally have results (even if many do not accept them). It’s time to assess how my model did at predicting the election. To start, I’ll remind you of the model that I built. There were four parts:

  1. Estimating turnout via a Poisson regression

  2. Estimating vote share for each candidate using a two-sided binomial regression

  3. Estimating a national polling error based on polling averages from past data

  4. Converting the information from steps 1-3 into probability distributions, and then making draws to simulate election outcomes. Turnout was distributed as a Poisson random variable, vote share as multivariate normal (with covariance between states based on a scaled similarity matrix), and national polling error as two stable distributions. A rough sketch of this simulation step is shown below.
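
To make the final simulation step more concrete, here is a minimal sketch in Python (using numpy and scipy) of what one pass through the simulator might look like. Every input below (the per-state Poisson means, the vote share means, the scaled-similarity covariance, the stable distribution parameters, and the electoral vote counts) is a placeholder standing in for the fitted quantities described above, not the actual numbers from my model.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2020)
n_sims = 10_000
n_states = 51  # 50 states plus DC

# Placeholder inputs standing in for the fitted model outputs described above.
turnout_mean = np.full(n_states, 2_500_000.0)   # step 1: Poisson regression predictions
dem_mean = np.full(n_states, 0.52)              # step 2: challenger (Biden) vote share predictions
rep_mean = np.full(n_states, 0.48)              # step 2: incumbent (Trump) vote share predictions
state_cov = np.eye(n_states) * 0.03**2          # scaled similarity matrix (placeholder)
electoral_votes = np.full(n_states, 10)         # placeholder electoral vote counts per state

dem_ev = np.empty(n_sims)
for s in range(n_sims):
    # Turnout: one Poisson draw per state, with the regression prediction as the mean.
    turnout = rng.poisson(turnout_mean)

    # Vote shares: correlated across states via the similarity-based covariance matrix.
    dem_share = rng.multivariate_normal(dem_mean, state_cov)
    rep_share = rng.multivariate_normal(rep_mean, state_cov)

    # National polling error: one fat-tailed stable draw per party, applied to every state.
    dem_share += levy_stable.rvs(alpha=1.8, beta=0.0, scale=0.01, random_state=rng)
    rep_share += levy_stable.rvs(alpha=1.8, beta=0.0, scale=0.01, random_state=rng)

    # Convert shares to vote counts and award each state winner-take-all.
    dem_votes, rep_votes = turnout * dem_share, turnout * rep_share
    dem_ev[s] = electoral_votes[dem_votes > rep_votes].sum()

print(f"Average simulated Biden electoral votes: {dem_ev.mean():.0f}")
```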

We can look at the map of the average electoral college outcome over 10,000 simulations and compare it to the actual results.

Classification

I incorrectly predicted five states: Iowa, North and South Carolina, Ohio, and Florida. I’m honestly not particularly surprised about South Carolina, given that conventional wisdom held it was unlikely for Biden to win the state. My incorrect predictions in Iowa and Ohio are tied together: Biden winning both would have been an indicator of a very strong performance overall, which did not happen. They are also tied to South Carolina being predicted incorrectly, as South Carolina could only be in play if Biden won in a landslide.

That leaves Florida and North Carolina, which many forecasts missed, including FiveThirtyEight’s and The Economist’s, two of the more sophisticated models out there. In my model, Biden’s predicted decisive wins in those states were a symptom of the overall forecast: a Biden blowout.

By about 10 pm on election night, it was clear that was not going to happen. It was immediately clear that my model was wrong; the only questions were by how much, and whether the actual results fell within the predicted range of outcomes. One interesting thing to look at is the distribution of classification accuracies: as it turns out, the model was never exactly right.

In fact, the maximum classification accuracy was 49 out of 51 states (including DC), which strikes me as quite poor for a probabilistic model. One would hope that the model would get exactly the right result at least some of the time, and that was not the case here. One thing to note is the relatively large probability mass below 45 states predicted correctly: it suggests something was causing large groups of states to be wrong all at once.

The Predicted Range

In the end, Biden won the election with 306 electoral votes to Trump’s 232. We can look at the distribution of electoral results predicted by the model, and see where those results fall. The dashed lines are the actual electoral college vote counts for each candidate.

The actual results were on the lower extreme of predicted outcomes for Biden, and on the upper extreme of predicted outcomes for Trump. This is consistent with the story from the maps - the model predicted a Biden blowout which did not come to pass.

State Level Accuracy

We can also look to see how I did at predicting the state level outcomes, both the turnout and the vote share for each candidate. We can begin with the vote shares.

The dashed lines represent the 45-degree line, which corresponds to a perfect prediction. The red and blue lines are the trend lines for the actual versus predicted values. We can see that the model tended to underestimate both Trump and Biden in states they won handily[1]. Given the clustering of this data, it is a little difficult to read, so we can look at the same information in map form. These secondary plots also give us a better sense of how badly the model missed some states, on average.

The color schemes are intentionally inverted between the maps, so that red represents a shift away from Biden or towards Trump, and blue represents a shift towards Biden or away from Trump. We can see that the map is fairly uniformly red, indicating a widespread miss by the model in Biden’s favor. Also, note the gradient: there are blocks of states in the Midwest and South that show a wider miss in favor of Trump.

One other interesting thing to note is that overall, my model was more accurate at predicting Trump’s vote shares than Biden’s. There are two possible explanations for this phenomenon: first, the underlying model for incumbents had a better fit than the one for challengers, and second, the data suggested a better result for Biden than reality delivered. In my predictions post, you can see that the residual deviance for the challenger model, which was used to predict Biden’s vote shares, is higher than for the incumbent model, which was used to predict Trump’s vote shares.

We can also look at the predicted turnout against the actual turnout. For the predicted values, I am using the predicted values from the Poisson regression, which were used as the means (and, equivalently, the variances) of the Poisson draws that simulated turnout.

Again, the dashed line is the 45-degree line, indicating perfect prediction, and the solid line is the trend line of actual turnout regressed on predicted turnout. In this case, the error is almost uniform: the trend line is nearly parallel to the 45-degree line. Also note that turnout was uniformly higher than predicted. Given that this election had some of the highest voter turnout in American history, in large part due to the candidates and the messaging surrounding the election, it is not surprising that my model failed to capture the voting surge[2].

Model Statistics

We can also look at a couple of statistics that give an overall sense of the model accuracy.

Model Statistic            Value
Classification Accuracy    0.9020
Turnout RMSE               909637.9334
Biden RMSE                 0.0669
Trump RMSE                 0.0524
Brier Score                0.0726

The classification accuracy is good, but not great. The turnout root mean squared error is relatively large, which is to be expected given the scale of the units: being off by an average of roughly nine hundred thousand votes per state does not seem out of the ordinary, given the massive turnout surge this election. We can see that, on average, the model was off by more for Biden than for Trump with regard to vote shares, though the two are relatively close. Taken together, they indicate an average swing of roughly five percentage points from Biden to Trump.
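
For reference, these two summary numbers are simple to compute. Here is a small sketch with made-up per-state arrays; the names and values are purely illustrative, not the data behind the table above.

```python
import numpy as np

# Hypothetical per-state arrays; the real ones would have 51 entries.
predicted_turnout = np.array([2_400_000, 5_100_000, 1_200_000])
actual_turnout    = np.array([2_900_000, 5_800_000, 1_500_000])

predicted_winner = np.array([1, 0, 1])   # 1 = Biden predicted to carry the state
actual_winner    = np.array([1, 0, 0])   # 1 = Biden actually carried the state

# Root mean squared error of turnout, and fraction of states called correctly.
rmse = np.sqrt(np.mean((predicted_turnout - actual_turnout) ** 2))
accuracy = np.mean(predicted_winner == actual_winner)
print(f"Turnout RMSE: {rmse:,.0f}   Classification accuracy: {accuracy:.4f}")
```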

Finally, we have the Brier score for the model. The Brier score is a cost function for probabilistic models with binary outcomes: in this case, it takes in the probability of a candidate winning each state and whether they actually won the state. A lower Brier score is better: it means the predicted probabilities were closer to the actual results. The model’s Brier score is relatively low, in large part because the majority of states were predicted to go one way or the other with overwhelming odds. The largest Brier score contributions came from Florida, North Carolina, and Iowa, three states that the model was confident Biden would win but that Trump carried.
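
Concretely, the Brier score is just the mean squared difference between each predicted win probability and the 0/1 outcome. A quick sketch with made-up numbers:

```python
import numpy as np

# Hypothetical inputs: per-state Biden win probabilities from the simulations,
# and whether Biden actually carried each state (1 = yes, 0 = no).
p_biden_win = np.array([0.98, 0.85, 0.91, 0.55])
biden_won   = np.array([1,    0,    0,    1])

brier = np.mean((p_biden_win - biden_won) ** 2)
print(f"Brier score: {brier:.4f}")  # lower is better; confident misses dominate the score
```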

What Went Wrong?

All in all, my model did not do a great job of predicting the election, no matter which way you cut it[3]. So what happened? There are two sayings I’ve heard about predictive modeling that sum up what I think went wrong:

  1. Garbage in, garbage out

  2. KISS: Keep it simple, stupid

Explaining what I mean by these sayings will give an overview of what went wrong with my model, and more importantly, why.

Garbage In, Garbage Out

This saying is a less polite way of saying that a model is only as good as its data. Based on the evidence we have, it looks like there were definitely some problems with the data I was using.

On the day of the election, FiveThirtyEight projected Biden to win the national popular vote with 53.4 percent, with Trump winning 45.4 percent. My model predicted a slightly wider margin: 55.8 percent for Biden and 44.2 percent for Trump on average. According to the Cook Political Report, Biden ended up winning 51.0 percent of the vote and Trump won 47.2 percent.

On the day of the election, the FiveThirtyEight model goes into “polls-only” mode, where it relies only on polling data to determine the forecast. Based on the gap between their forecast and the actual result, the polls overestimated Biden by about two and a half percentage points, while Trump outperformed his polls by about the same margin. Put into historical context, this is roughly the same size polling error as in 2016.

In my model, the polls were a key input for estimating the mean popular vote share for each candidate. For both the incumbent and challenger models, polling had the biggest impact on the output point prediction, as the coefficient on polling was an order of magnitude larger than the others. I also double-dipped with the polls and used their variances as well. Because polling was relatively stable, the variances used to simulate individual elections were relatively small. And because the mean was off, there was no way to get lucky with a large variance.

Understanding why the polls were wrong is a complex task, and something that is outside of the scope of this model reflection. However, I’ll include a couple of quick theories:

  1. In states where the election is not close, the winner of the state tends to outperform polling. On average, this adds up and moves the popular vote average.

  2. Partisan non-response bias: Republicans are less likely to pick up the phone because they are less trusting of the media.

  3. COVID-19: Because Democratic-leaning voters were more likely to take the pandemic seriously, they were more likely to be at home and therefore more likely to be polled.

There has also been a lot made of changes in how Hispanic voters voted, but there is evidence that this phenomenon is really just a symptom of the urban-rural political divide. All of this is to say, there’s a lot more work to be done to figure out what went wrong with polling, and why.

Also in the garbage in, garbage out category: in some cases, I was using somewhat outdated data. My demographic data was from 2018, and even slight shifts could be a big deal. It is also not clear that the economic data I used was the best choice. I used second-quarter quarter-over-quarter GDP growth at the national level, which showed one of the biggest declines in history due to the ongoing pandemic. At the same time, I used second-quarter real disposable income growth, which had a massive spike due to the CARES Act.

All of this is to say, there were a lot of possible problems with just the data for my model.

KISS

Keeping it simple is often a good strategy for predictive modeling. Building a complex model introduces a lot of choices, and each choice is another avenue for modeling mistakes. It means you need to take greater care, which requires an immense amount of time, and time was something I did not have enough of before the election. There are a host of issues that I think I did a poor job of addressing.

One worry is overfitting, especially with such a small dataset: six elections is not a lot of data to work with[4]. One method to avoid overfitting that I did not use was regularization. The basic idea behind regularization is that when you estimate a linear (or generalized linear) model, you add a cost term based on the magnitude of the coefficient values. With an L1 (lasso) penalty, this has the effect of driving some coefficients all the way to zero, removing those variables from the model. Because parsimonious models tend to perform better with limited data, it may have been smart to regularize both the turnout and vote share models. I used a large number of predictors, and it may be that some had little actual impact.
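
As a rough illustration of what that could have looked like, below is a sketch of an L1-penalized (lasso) Poisson fit using statsmodels. The design matrix and response here are simulated placeholders rather than my actual covariates; the point is just that the penalty can push the coefficients on uninformative predictors to zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-ins for the real design matrix (state-level covariates)
# and response (historical turnout counts); only the first covariate matters here.
X = sm.add_constant(rng.normal(size=(300, 8)))
y = rng.poisson(lam=np.exp(1.0 + X[:, 1]))

# L1-penalized (lasso) Poisson regression: the penalty shrinks the coefficients
# on uninformative predictors, often all the way to exactly zero.
model = sm.GLM(y, X, family=sm.families.Poisson())
fit = model.fit_regularized(alpha=0.05, L1_wt=1.0)  # L1_wt=1.0 means a pure lasso penalty
print(fit.params)  # the noise predictors' coefficients should be at or near zero
```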

There are two other crucial modeling errors that I believe hurt the performance of my model.

First, recall that I simulated the national polling error for each party as an independent stable distribution. I intentionally picked stable distributions because they have fat tails, meaning that larger polling errors are more likely. The problem here was the independence. Because both parties beat their polling marks on average, both parties had an expectation of beating their polling average by about 2 percentage points[5].

I noted in a footnote in the model that the national polling error values actually have negative covariance, which makes a lot of sense: if one party beats its polls, that almost always means the other party underperformed. Because I was using independent distributions, this was not accounted for.
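
One simple way to get that negative covariance while keeping the fat tails would be to share a single stable “national swing” term between the two parties with opposite signs, plus a smaller independent stable term for each. This is only a sketch with placeholder parameters, not the error model I actually fit:

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2020)
n_sims = 10_000

# Placeholder stable parameters; the alphas and scales here are illustrative only.
shared    = levy_stable.rvs(alpha=1.8, beta=0.0, scale=0.015, size=n_sims, random_state=rng)
dem_noise = levy_stable.rvs(alpha=1.8, beta=0.0, scale=0.005, size=n_sims, random_state=rng)
rep_noise = levy_stable.rvs(alpha=1.8, beta=0.0, scale=0.005, size=n_sims, random_state=rng)

dem_error = shared + dem_noise    # when the shared swing favors Biden...
rep_error = -shared + rep_noise   # ...it works against Trump, and vice versa

print(np.corrcoef(dem_error, rep_error)[0, 1])  # the sample correlation should come out strongly negative
```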

Second, while I spent a lot of time getting the covariance between results in different states to work, I did not spend enough time on this portion of the model. Something I noticed in the histogram of my model’s classification accuracy was that in some cases, whole blocks of states were predicted wrong. My instinct is that this has to do with the correlation matrix I was using to predict vote shares. In my predictions post, I linked a blog post by Andrew Gelman about how he believes FiveThirtyEight did not handle state-to-state correlations well. That post was not the only discussion of state-to-state correlation on Gelman’s blog. In a post critiquing strange behavior in the FiveThirtyEight model, Gelman noticed that certain states had negative correlations between them. That led to this comment:

“Maybe there was a bug in the code, but more likely they just took a bunch of state-level variables and computed their correlation matrix, without thinking carefully about how this related to the goals of the forecast and without looking too carefully at what was going on.”

Unfortunately for me, that’s a very accurate description of what I did for my model. Gelman made the point that swings in vote shares between states generally should not have negative correlations, something that did occur in my model. He also made the point that the biggest swings should be national, something I agree with. As mentioned previously, this is correctable with more time, which I unfortunately did not have. Combined with the problems I had with national polling errors, this is a recipe for some very strange behavior.
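
One cheap repair along these lines would be to clamp the negative entries of the state-to-state correlation matrix to zero and then project the result back to a valid (positive semidefinite) correlation matrix. The sketch below is one possible way to do that, not what my model actually did:

```python
import numpy as np

def clip_and_repair(corr: np.ndarray) -> np.ndarray:
    """Zero out negative correlations, then project back to a valid
    (positive semidefinite) correlation matrix by clipping eigenvalues."""
    c = np.clip(corr, 0.0, 1.0)          # no negative state-to-state correlation
    c = (c + c.T) / 2                    # keep it symmetric
    vals, vecs = np.linalg.eigh(c)
    vals = np.clip(vals, 0.0, None)      # drop any negative eigenvalues introduced by clipping
    c = vecs @ np.diag(vals) @ vecs.T
    d = np.sqrt(np.diag(c))
    c = c / np.outer(d, d)               # rescale so the diagonal is exactly 1
    np.fill_diagonal(c, 1.0)
    return c

# Hypothetical example: a small correlation matrix with one negative entry.
raw = np.array([[ 1.0, 0.6, -0.2],
                [ 0.6, 1.0,  0.3],
                [-0.2, 0.3,  1.0]])
print(clip_and_repair(raw))
```

Whether a repair like this actually improves the forecasts is exactly the kind of question the checks in the next section are meant to answer.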

Potential Checks

Below is a non-exhaustive list of things I could try in order to improve the model:

  1. Start by validating the current edition of the model on previous elections. For a number of reasons, this election was unique, and this would give a good sense of whether the problems with my model were idiosyncratic to this year or more systematic. Ideally, I would look at all the same accuracy metrics I examined in this post.

  2. Run the model with augmented polling data. A major question I have is how much more accurate the polls would have needed to be for my model to make an accurate prediction. I would start with a national polling error adjustment, and if that does not produce more accurate results, move on to augmenting particular states. For example, if polling had been more accurate in Florida specifically, would that improve the model’s overall accuracy?

  3. Check whether regularizing the models would have helped. My gut feeling is that it would have, but I’d be very curious to see.

  4. Make the national vote share errors correlated across the candidates, and see if that creates more accurate outcomes. Assuming independence was a short-sighted move on my part, especially given how things turned out.

  5. Rerun the model with poll variances based on the individual polls rather than the polling averages. This may have given a wider range of outcomes, one that covered the actual election results.

  6. Re-scale the correlation matrix so that all the values fall between 0 and 1, and try other ways of building a correlation matrix. My big takeaway from the blog post I linked is that states generally should not have negative correlation; at the lowest, they should have no impact on each other. My instinct is that this would allow for more freedom between the Southern and Midwestern states, which may allow the model to get Ohio, Florida, and North Carolina right.

While these are not exactly statistical tests, they would give me evidence as to whether my instincts about the shortcomings of my model are correct. In general, assessing a model with so many parts is tricky, and hard to do with a single test. Something for me to think about overall is how to make my model more mathematically rigorous, making sure the data meets the assumptions behind all of the modeling choices I made early on.

Hindsight is 2020

I think there are some concrete changes I would want to make to my model no matter the results of those tests, along with many things I should have done differently in hindsight.

  1. Run all of the proposed tests, especially those involving the state-to-state correlation matrix. Running a series of robustness checks on the multiple ways I could have formulated the correlation matrix would likely have helped improve the model’s overall accuracy by making sure that states were coupled together in ways that make sense.

  2. Make the national vote share errors correlated across parties, and probably switch from party-based to incumbency-based errors, just for the sake of consistency. Given all my exposition earlier in this post, I feel this is fairly self-explanatory.

  3. Include even more covariates in the turnout and vote share models. In terms of data I would want to add, I would start with measures of how urban or rural each state is, along with more economic data. The urban-rural data is meant to control for the ongoing trend of serious political splits between urban and rural areas. Something like the fraction of a state’s population living in cities could be a good choice. Second, I would want to include more economic data, primarily longer-term information like housing prices, long-term income growth, and which industries are prevalent. The reason I want longer-term data is to give more of a sense of class divides: it should be apparent that Massachusetts is a wealthier state than Alabama, and that Minnesota is wealthier than its Midwestern neighbors.

  4. I should have regularized the turnout and vote share models once I had all the inputs. The explanation is relatively straightforward: with so many predictors and so little data, I would run the risk of overfitting, and regularization could help mitigate that. Parsimonious models tend to generalize better, and given all the things that make 2020 unique, generalization is extremely important.

  5. Check that all the assumptions built into the many stages of my model actually make sense. This is a broad category, and probably the hardest thing to change. It would include spending more time making sure the assumptions required for Poisson and binomial regressions are met, and making sure the cumulative variances of the many random variables work out as well. I know that for the turnout model the assumptions were most certainly not met, so exploring other avenues there would have been a good idea.

This list is probably not exhaustive, but it is clear that my model left a lot to be desired. Looking back, I am happy with the model that I built, but I recognize that it had many shortcomings. In a lot of ways, my biggest problem was a lack of time. I started on this final model about a week before the election took place, and given its complexity, that was not nearly enough time to check every last detail.


  1. One exception is New York, where vote counting has been extremely slow. It is likely that it will be much closer to the trend line once all votes are counted.

  2. At one point while building my prediction model, I toyed with the idea of including early voting rates in the turnout model, but decided against it. My feeling is that had I included that information, it would have forecast the massive voting surge.

  3. Except for getting the winner right, I suppose. But that lacks nuance.

  4. There is an argument to be made that using longer-term data could be problematic due to party realignment, changes in media consumption, and changes in suffrage.

  5. This almost exactly matches the difference between Biden’s predicted national popular vote share in my model and the FiveThirtyEight polling data on which it was based. Interestingly, it does not at all match Trump’s predicted vote share versus the polls.