It appears that we have some resolution to the election, and the race has been called for Joe Biden. As of right now, Joe Biden is on track to win 306 electoral votes. If current counts hold, I will have predicted Ohio, Iowa, the Carolinas, and Florida incorrectly. Over the next couple of weeks, I’ll be diving into what went wrong with this model and why.
At the time of writing, we are two days away from the 2020 presidential election between Donald Trump and Joe Biden. Many believe that this may be the most important election in the history of the United States. The question on everyone’s mind is the same: who is going to win? I’ll discuss the model I have built, and then show my prediction. There are four parts to the model:
1. Estimating turnout in each state.
2. Estimating vote share for each candidate in each state.
3. Estimating national polling error.
4. Simulating the election based on the estimated parameters.
I’ll go through each section in turn, examine the results, and then decide what I think of this model.
The end goal of this model is to estimate the number of votes for each candidate in each state. Given that I want to do this with a probabilistic model, the natural choice is to use draws from binomial random variables:
\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic})\]
The subscript \(i\) denotes the state, and the subscript \(c\) denotes the candidate. Each simulation of the election is a draw from a set of 102 random variables: two binomial distributions (one for Trump and one for Biden) for each of the 50 states and Washington, D.C., with 51 different turnout values and 102 different probability values. Simulating these draws is exactly step four in the process that I explained in the introduction. In my model, both \(n_i\) and \(p_{ic}\) are also random variables, making this a sort of hierarchical model. Let’s begin by looking at how turnout is estimated.
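To make the binomial building block concrete, here is a minimal sketch of the draw for a single state and candidate. The turnout and probability values are made up for illustration and are not outputs of my model.

```r
set.seed(2020)

# Hypothetical inputs for one state and one candidate (illustrative only)
n_i  <- 5e6    # turnout in state i
p_ic <- 0.52   # probability that a voter in state i picks candidate c

# 10,000 simulated vote totals for candidate c in this state
votes <- rbinom(10000, size = n_i, prob = p_ic)

mean(votes) / n_i   # simulated vote share, close to p_ic
sd(votes) / n_i     # spread of the vote share from binomial noise alone
```

With turnout in the millions, the binomial noise in the vote share is tiny when \(n_i\) and \(p_{ic}\) are held fixed, so nearly all of the uncertainty in the simulation has to come from treating \(n_i\) and \(p_{ic}\) as random variables themselves.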
I estimate turnout using a pooled model, across all states and across every election since 1992. After spending some time looking at the data, I settled on using a Poisson regression to estimate turnout. Poisson models are a form of generalized linear model, based on the Poisson distribution. Because the data is a set of discrete counts, this seemed like a reasonable choice to me[1]. I used a large number of covariates to estimate the turnout model. These include demographic data, the polling margin in the state[2], lagged variables for the previous election’s turnout and voting margin, and a state fixed effect indicator. For a full description of the data, see Appendix: Data at the end of this post. We can look at the full output from the model:
##
## Call:
## glm(formula = total ~ last_vote_margin + poll_margin + Black +
## Hispanic + Asian + White + Male + age20 + age3045 + age4565 +
## last_turnout + state, family = "poisson", data = turnout_scaled %>%
## filter(year < 2020))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -443.77 -123.94 3.72 122.53 460.54
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.295e+01 1.581e-03 8191.23 <2e-16 ***
## last_vote_margin -5.570e-02 8.545e-05 -651.81 <2e-16 ***
## poll_margin 4.040e-01 6.211e-04 650.54 <2e-16 ***
## Black 3.908e-01 7.021e-04 556.58 <2e-16 ***
## Hispanic -2.587e-01 7.425e-04 -348.47 <2e-16 ***
## Asian 5.088e-01 1.305e-03 389.86 <2e-16 ***
## White -2.752e-01 1.024e-03 -268.78 <2e-16 ***
## Male -1.139e-01 3.463e-04 -328.92 <2e-16 ***
## age20 -3.277e-02 2.585e-04 -126.80 <2e-16 ***
## age3045 4.824e-02 1.653e-04 291.94 <2e-16 ***
## age4565 -3.644e-02 1.680e-04 -216.85 <2e-16 ***
## last_turnout 2.967e-01 4.702e-04 631.11 <2e-16 ***
## stateAL 7.985e-01 2.253e-03 354.38 <2e-16 ***
## stateAR 1.004e+00 1.911e-03 525.58 <2e-16 ***
## stateAZ 2.314e+00 2.247e-03 1029.84 <2e-16 ***
## stateCA 2.054e+00 2.675e-03 767.72 <2e-16 ***
## stateCO 2.195e+00 1.735e-03 1265.15 <2e-16 ***
## stateCT 1.495e+00 1.852e-03 807.11 <2e-16 ***
## stateDC -1.689e+00 4.003e-03 -421.83 <2e-16 ***
## stateDE -3.156e-01 2.136e-03 -147.75 <2e-16 ***
## stateFL 2.378e+00 1.945e-03 1222.99 <2e-16 ***
## stateGA 8.744e-01 2.108e-03 414.69 <2e-16 ***
## stateHI -3.613e+00 6.673e-03 -541.47 <2e-16 ***
## stateIA 1.947e+00 1.894e-03 1028.43 <2e-16 ***
## stateID 1.335e+00 1.867e-03 715.27 <2e-16 ***
## stateIL 1.977e+00 1.664e-03 1187.82 <2e-16 ***
## stateIN 1.939e+00 1.805e-03 1074.48 <2e-16 ***
## stateKS 1.533e+00 1.754e-03 873.86 <2e-16 ***
## stateKY 1.706e+00 1.877e-03 909.04 <2e-16 ***
## stateLA 5.204e-01 2.422e-03 214.82 <2e-16 ***
## stateMA 2.069e+00 1.944e-03 1064.08 <2e-16 ***
## stateMD 5.749e-01 2.114e-03 271.95 <2e-16 ***
## stateME 1.352e+00 1.998e-03 676.44 <2e-16 ***
## stateMI 1.985e+00 1.700e-03 1167.64 <2e-16 ***
## stateMN 2.155e+00 1.694e-03 1272.04 <2e-16 ***
## stateMO 1.810e+00 1.822e-03 993.42 <2e-16 ***
## stateMS -1.236e-01 2.828e-03 -43.69 <2e-16 ***
## stateMT 1.024e+00 1.657e-03 618.03 <2e-16 ***
## stateNC 1.462e+00 1.858e-03 787.27 <2e-16 ***
## stateND 7.494e-01 1.721e-03 435.49 <2e-16 ***
## stateNE 1.299e+00 1.846e-03 703.50 <2e-16 ***
## stateNH 1.155e+00 1.893e-03 610.01 <2e-16 ***
## stateNJ 1.593e+00 1.722e-03 925.10 <2e-16 ***
## stateNM 1.745e+00 3.842e-03 454.11 <2e-16 ***
## stateNV 1.045e+00 1.715e-03 609.68 <2e-16 ***
## stateNY 1.869e+00 1.890e-03 989.31 <2e-16 ***
## stateOH 2.121e+00 1.883e-03 1126.47 <2e-16 ***
## stateOK 1.483e+00 1.685e-03 879.97 <2e-16 ***
## stateOR 1.980e+00 1.699e-03 1165.46 <2e-16 ***
## statePA 2.225e+00 1.894e-03 1174.67 <2e-16 ***
## stateRI 7.081e-01 2.230e-03 317.52 <2e-16 ***
## stateSC 5.790e-01 2.322e-03 249.35 <2e-16 ***
## stateSD 7.291e-01 1.749e-03 416.99 <2e-16 ***
## stateTN 1.456e+00 1.867e-03 779.54 <2e-16 ***
## stateTX 2.470e+00 2.292e-03 1077.71 <2e-16 ***
## stateUT 1.713e+00 2.182e-03 785.10 <2e-16 ***
## stateVA 1.297e+00 1.641e-03 789.89 <2e-16 ***
## stateVT 6.315e-01 2.068e-03 305.40 <2e-16 ***
## stateWA 1.998e+00 1.573e-03 1270.18 <2e-16 ***
## stateWI 2.224e+00 1.673e-03 1329.34 <2e-16 ***
## stateWV 1.268e+00 1.970e-03 643.69 <2e-16 ***
## stateWY 5.211e-01 1.848e-03 281.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 628278481 on 324 degrees of freedom
## Residual deviance: 8371435 on 263 degrees of freedom
## (32 observations deleted due to missingness)
## AIC: 8376793
##
## Number of Fisher Scoring iterations: 4
The key numbers for in-sample fit are the null and residual deviances, at the very bottom of the model output. The residual deviance (8,371,435) is far smaller than the null deviance (628,278,481), which suggests a substantial improvement over the null[3] model. Because of the scaling discussed in the Appendix: Data section, the interpretation of coefficients is not particularly enlightening. One thing to remember is that because this is a Poisson regression, the sign of each coefficient indicates the direction in which the log of expected turnout moves. The coefficient on the poll margin is positive, which actually goes against some research: it suggests that a greater poll margin leads to higher turnout. It could be that higher poll margins are the result of more people deciding to vote for a particular party, meaning that turnout is influencing the margin, and so the regression may not properly capture cause and effect.
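As a quick worked check on those two deviances, the short sketch below computes the share of the null deviance that the covariates account for, using the numbers copied from the model output. It also exponentiates the poll margin coefficient, since a Poisson regression with the default log link acts multiplicatively on expected turnout. The variable names are my own.

```r
# Deviances copied from the glm output above
null_dev  <- 628278481
resid_dev <- 8371435

# Proportion of the null deviance accounted for by the covariates
dev_explained <- 1 - resid_dev / null_dev
round(dev_explained, 3)

# With a log link, a one-unit increase in the (scaled) poll margin
# multiplies expected turnout by exp(coefficient)
poll_margin_coef <- 4.040e-01
exp(poll_margin_coef)
```

The deviance ratio comes out to roughly 0.987, which is another way of stating the large gap between the two numbers. It is an in-sample measure only, so it says nothing about how well the model predicts a year it has not seen.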
We can also look at out-of-sample validation. In this case, I conducted leave-one-out validation by year, and then calculated the sum of the squares of the residuals across states as a measure of fit.
Sum of Squares of Residuals | Year Predicted |
---|---|
8.260468e+12 | 1992 |
1.246439e+13 | 1996 |
1.052389e+13 | 2000 |
1.405148e+12 | 2004 |
6.479571e+12 | 2008 |
3.681644e+12 | 2012 |
9.025660e+12 | 2016 |
While these numbers do seem quite high, it is important to remember that these values are squared. Sums of squares on the order of \(10^{12}\) to \(10^{13}\) correspond to a combined error (the square root of the sum) of roughly one to three and a half million votes, which in an election of roughly 120 million votes is quite good. The final thing to do with this model is predict the turnout for each state in 2020, which we will use later when we simulate the election.
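For readers who want to see the mechanics of leave-one-out validation by year, here is a minimal sketch on synthetic data. The five states, the single `poll_margin` covariate, and the `state_effect` values are all hypothetical stand-ins; the real model uses the full covariate set shown in the output above.

```r
set.seed(1)

# Synthetic panel: 5 hypothetical states x 7 election years (not the real data)
years  <- seq(1992, 2016, by = 4)
states <- c("A", "B", "C", "D", "E")
panel  <- expand.grid(state = states, year = years, stringsAsFactors = FALSE)
panel$poll_margin <- rnorm(nrow(panel))

# Made-up log-scale state effects used to generate turnout counts
state_effect <- c(A = 13, B = 14, C = 15, D = 13.5, E = 14.5)
panel$total  <- rpois(nrow(panel),
                      exp(state_effect[panel$state] + 0.1 * panel$poll_margin))

# Hold out one year at a time, fit on the rest, and score the held-out year
loyo <- sapply(years, function(y) {
  train    <- panel[panel$year != y, ]
  held_out <- panel[panel$year == y, ]
  fit  <- glm(total ~ poll_margin + state, family = "poisson", data = train)
  pred <- predict(fit, newdata = held_out, type = "response")
  sum((held_out$total - pred)^2)
})

# Sum of squared residuals per held-out year, plus a per-state RMSE
data.frame(year = years, ssr = loyo,
           rmse_per_state = sqrt(loyo / length(states)))
```

Dividing the sum of squares by the number of states before taking the square root gives an error in votes per state, which is easier to read than the raw sums.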
Much has been made of the possibility of national polling error. In 2016, a roughly 3 point national polling error meant that many were blindsided by Trump’s narrow victory in a number of Midwestern states. To simulate that, I calculated the mean and variance of polling errors for both Democrats and Republicans in elections since 1992. Interestingly, both parties tend to outperform their polls, by 2.40 and 2.11 points for Democrats and Republicans respectively.
Returning to the basic framework of the model, we have:
\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic})\]
Based on our predictions, we can write down \(n_i\) and \(p_{ic}\) with distributions as well:
\[n_i \sim \textrm{Pois}(\lambda_i)\]
\[p_{ic} \sim \textrm{Beta}(\alpha_{ic}, \beta_{ic})\]
In this case, \(\lambda_i\) is the predicted mean turnout in state \(i\). As before, we can re-parameterize \(\alpha_{ic}\) and \(\beta_{ic}\) and instead input the mean and variance. One thing to note is that the \(\alpha\) and \(\beta\) values are quite large, meaning that we can approximate the beta distributions with high accuracy using a normal distribution. This is key, as it gives an easy way to account for correlation between states: the multivariate normal distribution.
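The re-parameterization can be sketched with the standard method-of-moments formulas for the beta distribution. The helper name `beta_params` and the example mean and standard deviation are hypothetical, chosen only to show how large the shape parameters become and how close the normal approximation is.

```r
# Convert a mean and variance into Beta shape parameters (method of moments).
# Requires 0 < mu < 1 and v < mu * (1 - mu).
beta_params <- function(mu, v) {
  stopifnot(mu > 0, mu < 1, v > 0, v < mu * (1 - mu))
  k <- mu * (1 - mu) / v - 1
  list(alpha = mu * k, beta = (1 - mu) * k)
}

# Hypothetical vote share: mean 0.52 with a standard deviation of 1 point
pars <- beta_params(mu = 0.52, v = 0.01^2)
pars

# Compare Beta quantiles with the matching normal approximation
q   <- c(0.025, 0.5, 0.975)
cmp <- rbind(beta   = qbeta(q, pars$alpha, pars$beta),
             normal = qnorm(q, mean = 0.52, sd = 0.01))
cmp
```

With a standard deviation this small, both shape parameters are above one thousand, and the beta and normal quantiles agree closely, which is what makes the switch to a multivariate normal reasonable.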
Instead of drawing 102 vote share values one at a time, we can draw two sets of 51, one set per candidate. The mean of each value will be the predicted vote share from the vote share models previously discussed. We also need a covariance matrix, which describes how vote shares in different states are related to one another.
We can calculate a covariance matrix in two steps. First, we build a similarity matrix that tells us how similar each pair of states is, using all of the non-categorical data from the vote share model. The key detail of this similarity matrix is that values are between -1 and 1. Values near 1 indicate high similarity (or correlation), and negative values indicate that states are very dissimilar. Second, we scale this makeshift correlation matrix by the standard deviations in the polls over the past several weeks to get a covariance matrix. The standard deviations in the polling averages that I use are quite small, suggesting a very stable race, and therefore a relatively small amount of uncertainty[6].
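The two steps can be sketched as follows. I am assuming here that similarity is measured as the correlation between states' feature vectors, which is one natural choice that keeps values between -1 and 1; the four states, their features, and the polling standard deviations are all made up.

```r
library(MASS)   # for mvrnorm; MASS ships with standard R installations
set.seed(42)

# Hypothetical standardized features for 4 states (rows) on 6 covariates
X <- matrix(rnorm(4 * 6), nrow = 4,
            dimnames = list(c("A", "B", "C", "D"), NULL))

# Step 1: state-by-state similarity, taken here to be the correlation
# between the states' feature vectors (an assumption of this sketch)
R <- cor(t(X))

# Step 2: scale by per-state polling standard deviations: Sigma = D R D
poll_sd <- c(A = 0.010, B = 0.015, C = 0.008, D = 0.012)   # made-up values
Sigma   <- diag(poll_sd) %*% R %*% diag(poll_sd)

# Correlated vote share draws around hypothetical predicted means
mu    <- c(A = 0.52, B = 0.47, C = 0.55, D = 0.50)
draws <- mvrnorm(n = 10000, mu = mu, Sigma = Sigma)

# Sample correlations of the draws should be close to R
round(cor(draws) - R, 2)
```

A correlation matrix built this way is positive semi-definite by construction, which `mvrnorm` requires. A hand-built similarity score would not carry that guarantee and might need adjusting before it could be used as a covariance matrix.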
The final step is the national polling error, which I draw from two independent stable distributions, one for Biden and one for Trump. I intentionally picked stable distributions because they are “fat-tailed,” meaning that events far from the center happen more frequently than under a normal distribution with a comparable scale. This choice is meant to allow for a potentially large polling error, and therefore more variance in the model. Note that these terms are independent for each candidate: both Biden and Trump could have polling errors in their favor in this model[7].
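To illustrate what “fat-tailed” means here, the sketch below draws from a symmetric stable distribution using the Chambers-Mallows-Stuck method and compares its tails with a normal distribution of matching scale. Base R has no stable sampler, so `rstable_sym` is a helper written for this example, and the values `alpha = 1.7` and `scale = 0.01` are hypothetical rather than the parameters used in my model.

```r
set.seed(7)

# Symmetric alpha-stable draws via the Chambers-Mallows-Stuck method.
# alpha = 2 gives a normal distribution (with variance 2 * scale^2);
# smaller alpha gives fatter tails.
rstable_sym <- function(n, alpha, scale = 1, location = 0) {
  stopifnot(alpha > 0, alpha <= 2)
  u <- runif(n, -pi / 2, pi / 2)
  w <- rexp(n)
  x <- sin(alpha * u) / cos(u)^(1 / alpha) *
       (cos(u - alpha * u) / w)^((1 - alpha) / alpha)
  location + scale * x
}

n <- 100000
err_stable <- rstable_sym(n, alpha = 1.7, scale = 0.01)   # hypothetical parameters
err_normal <- rnorm(n, sd = 0.01 * sqrt(2))               # the alpha = 2 case

# Share of draws larger than 5 points in absolute value
tail_stable <- mean(abs(err_stable) > 0.05)
tail_normal <- mean(abs(err_normal) > 0.05)
c(stable = tail_stable, normal = tail_normal)
```

Stable distributions with `alpha` below 2 have infinite variance, so the comparison fixes the scale parameter rather than the variance. Large errors show up far more often in the stable draws than in the normal ones.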
One caveat is that including variance at both the state and national level is, in a sense, double counting variance, since variance at the national level is a direct function of variance at the state level. However, in the uncertain times brought on by COVID-19 and one candidate actively trying to discredit the election, I decided that increasing the variance seems reasonable. I’ll denote the national polling error for candidate \(c\) by \(s_{c}\). We can then rewrite the basic structure of the model as:
\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic}+s_c)\]
This leaves us with a complete model, from which we can draw simulated values in order:
1. Draw the vote shares from the multivariate normal distributions for each candidate.
2. Draw the turnout values for each state from the Poisson distributions.
3. Draw the national polling error for each candidate and add it to the vote shares.
4. Draw from the binomial in each state to get the votes in each state.
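The four draws above can be sketched end to end on a toy example. Everything below is hypothetical: three made-up states, invented electoral votes, turnout means, vote shares, and covariance matrix. A scaled Student t distribution stands in for the stable distribution in the national error step to keep the example short, so this is an illustration of the ordering rather than my actual simulation code.

```r
library(MASS)   # for mvrnorm
set.seed(2020)

n_sims <- 10000

# Hypothetical inputs for three made-up states (not the post's estimates)
ev       <- c(A = 10, B = 20, C = 15)          # electoral votes
lambda   <- c(A = 3e6, B = 6e6, C = 4.5e6)     # predicted mean turnout
mu_biden <- c(A = 0.53, B = 0.49, C = 0.51)    # predicted Biden share
mu_trump <- 1 - mu_biden                       # predicted Trump share
Sigma    <- 0.01^2 * matrix(c(1.0, 0.6, 0.4,
                              0.6, 1.0, 0.5,
                              0.4, 0.5, 1.0), nrow = 3)

# 1. Vote shares: one multivariate normal draw per candidate per simulation
p_biden <- mvrnorm(n_sims, mu_biden, Sigma)
p_trump <- mvrnorm(n_sims, mu_trump, Sigma)

# 2. Turnout: one Poisson draw per state per simulation (columns are states)
turnout <- matrix(rpois(n_sims * 3, rep(lambda, each = n_sims)), nrow = n_sims)

# 3. National polling error, independent by candidate. A scaled Student t
#    stands in for the stable distribution (an assumption of this sketch).
#    Adding a length-n_sims vector to the matrix applies the same error
#    to every state within a simulation.
s_biden <- 0.01 * rt(n_sims, df = 3)
s_trump <- 0.01 * rt(n_sims, df = 3)

# 4. Binomial votes, with probabilities clipped to [0, 1]
clip <- function(p) pmin(pmax(p, 0), 1)
votes_biden <- matrix(rbinom(n_sims * 3, turnout, clip(p_biden + s_biden)),
                      nrow = n_sims)
votes_trump <- matrix(rbinom(n_sims * 3, turnout, clip(p_trump + s_trump)),
                      nrow = n_sims)

# Electoral votes go to whoever has more votes in each state
biden_wins <- 1 * (votes_biden > votes_trump)
biden_ev   <- as.vector(biden_wins %*% ev)
mean(biden_ev > sum(ev) / 2)   # share of simulations Biden wins
```

Because the national error is a single draw per candidate per simulation, it shifts every state in the same direction at once. That is what lets a polling miss move the whole map rather than one state at a time.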
We can finally look at the results from the model over 10,000 draws.
We can first look at the average electoral map based on the average vote shares in each state.
We can also look at the actual vote shares:
State | Trump Vote Share | Biden Vote Share |
---|---|---|
AK | 0.5251388 | 0.4748612 |
AL | 0.5783282 | 0.4216718 |
AR | 0.5752348 | 0.4247652 |
AZ | 0.4725368 | 0.5274632 |
CA | 0.3484660 | 0.6515340 |
CO | 0.4244236 | 0.5755764 |
CT | 0.3607708 | 0.6392292 |
DC | 0.2794655 | 0.7205345 |
DE | 0.3660884 | 0.6339116 |
FL | 0.4580901 | 0.5419099 |
GA | 0.4839587 | 0.5160413 |
HI | 0.3438294 | 0.6561706 |
IA | 0.4850483 | 0.5149517 |
ID | 0.5707706 | 0.4292294 |
IL | 0.3981593 | 0.6018407 |
IN | 0.5359081 | 0.4640919 |
KS | 0.5188326 | 0.4811674 |
KY | 0.5726321 | 0.4273679 |
LA | 0.5549075 | 0.4450925 |
MA | 0.3055022 | 0.6944978 |
MD | 0.3163091 | 0.6836909 |
ME | 0.4166781 | 0.5833219 |
MI | 0.4485951 | 0.5514049 |
MN | 0.4442557 | 0.5557443 |
MO | 0.5170086 | 0.4829914 |
MS | 0.5560577 | 0.4439423 |
MT | 0.5187822 | 0.4812178 |
NC | 0.4654500 | 0.5345500 |
ND | 0.5820149 | 0.4179851 |
NE | 0.5096579 | 0.4903421 |
NH | 0.4354918 | 0.5645082 |
NJ | 0.3959544 | 0.6040456 |
NM | 0.4186697 | 0.5813303 |
NV | 0.4576305 | 0.5423695 |
NY | 0.3306714 | 0.6693286 |
OH | 0.4926390 | 0.5073610 |
OK | 0.5886615 | 0.4113385 |
OR | 0.3915786 | 0.6084214 |
PA | 0.4472588 | 0.5527412 |
RI | 0.3367492 | 0.6632508 |
SC | 0.4981731 | 0.5018269 |
SD | 0.5561404 | 0.4438596 |
TN | 0.5354477 | 0.4645523 |
TX | 0.5022584 | 0.4977416 |
UT | 0.5336834 | 0.4663166 |
VA | 0.4167673 | 0.5832327 |
VT | 0.3134606 | 0.6865394 |
WA | 0.3687515 | 0.6312485 |
WI | 0.4461054 | 0.5538946 |
WV | 0.6036859 | 0.3963141 |
WY | 0.6663434 | 0.3336566 |
This model shows a blowout for Biden in a number of key states: he is up by more than ten points in Michigan and Pennsylvania, and by more than eight points in Florida. He also narrowly wins South Carolina, and nearly pulls out a victory in Texas. Overall, Biden wins 389.5774 electoral votes on average, with Trump picking up the remaining 148.4226. The distribution of electoral college outcomes fits exactly with this theme.
We can see that there is very, very little overlap in the electoral college results. In fact, this model predicts a Biden victory in 9,989 of 10,000 simulations, while Trump wins the remaining 11. As extreme as this result is, it’s not totally out of line with The Economist’s prediction. We can also look at the plot of the popular vote shares, and we can again see very little overlap.
This plot shows Biden with a commanding lead in the national popular vote, winning above 50% of the vote in essentially every simulation. There is an electoral college and popular vote split in 0 of the 10,000 simulations, meaning that Trump only wins when he also wins the popular vote. Texas and South Carolina coming down to razor-thin margins fits with the heavily Democratic-leaning national environment. However, in some senses, looking at the average election outcomes is deceptive. A number of races in battleground states are actually predicted to be quite close.