It appears that we have some resolution with the election, and the race has been called for Joe Biden. As of right now, Joe Biden is on track to win 306 electoral votes. If current counts hold, I will have predicted Ohio, Iowa, the Carolinas, and Florida incorrectly. Over the next couple of weeks, I’ll be diving into what went wrong with this model, and why.

Introduction

At the time of writing, we are two days away from the 2020 presidential election between Donald Trump and Joe Biden. Many believe that this may be the most important election in the history of the United States. The question on everyone’s mind is the same: who is going to win? I’ll discuss the model I have built, and then show my prediction. There are four parts to the model:

  1. Estimating turnout in each state.

  2. Estimating vote share for each candidate in each state.

  3. Estimating national polling error.

  4. Simulating the election based on the estimated parameters.

I’ll go through each section in turn, examine the results, and then decide what I think of this model.

The Basic Structure

The end goal of this model is to estimate the number of voters that vote for each candidate in each state. Given that I want to do this with a probabilistic model, the natural choice is to use draws from binomial random variables:

\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic})\]

The subscript \(i\) denotes each state, and the subscript \(c\) denotes each candidate. Each simulation of the election is a draw from a set of 102 random variables: two binomial distributions for each state (one for Trump and one for Biden), with 51 different turnout values and 102 different probability values. Simulating these draws is exactly step four in the process that I explained in the introduction. In my model, both \(n_i\) and \(p_{ic}\) are also random variables, making this a sort of hierarchical model. Let’s begin by looking at how turnout is estimated.
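As a minimal sketch of a single draw from this structure, here is one simulated election for three hypothetical states. The turnout and vote share values are invented for illustration and are not outputs of the model.

```r
# Toy illustration of Votes_ic ~ Bin(n_i, p_ic).
# All numbers below are invented for illustration.
set.seed(1)

turnout <- c(A = 500000, B = 1200000, C = 300000)   # n_i for three hypothetical states
p_biden <- c(A = 0.52, B = 0.47, C = 0.55)          # p_ic for one candidate
p_trump <- c(A = 0.46, B = 0.51, C = 0.43)          # p_ic for the other candidate

# One simulated election: one binomial draw per state per candidate.
votes_biden <- rbinom(length(turnout), size = turnout, prob = p_biden)
votes_trump <- rbinom(length(turnout), size = turnout, prob = p_trump)

data.frame(state = names(turnout), votes_biden, votes_trump)
```

With turnout this large, each binomial draw lands very close to \(n_i p_{ic}\), which is why most of the uncertainty in the full model has to come from treating \(n_i\) and \(p_{ic}\) as random.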

Estimating Turnout

I estimate turnout using a pooled model, across all states and across every election since 1992. After spending some time looking at the data, I settled on using a Poisson regression to estimate turnout. Poisson models are a form of generalized linear model, based on the Poisson distribution. Because the data is a set of discrete counts, this seemed like a reasonable choice to me[1]. I used a great number of covariates to estimate the turnout model. These include demographic data, the polling margin in the state[2], lagged variables for the previous election’s turnout and voting margin, and a state fixed effect indicator. For a full description of the data, see Appendix: Data at the end of this post. We can look at the full output from the model:

## 
## Call:
## glm(formula = total ~ last_vote_margin + poll_margin + Black + 
##     Hispanic + Asian + White + Male + age20 + age3045 + age4565 + 
##     last_turnout + state, family = "poisson", data = turnout_scaled %>% 
##     filter(year < 2020))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -443.77  -123.94     3.72   122.53   460.54  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       1.295e+01  1.581e-03 8191.23   <2e-16 ***
## last_vote_margin -5.570e-02  8.545e-05 -651.81   <2e-16 ***
## poll_margin       4.040e-01  6.211e-04  650.54   <2e-16 ***
## Black             3.908e-01  7.021e-04  556.58   <2e-16 ***
## Hispanic         -2.587e-01  7.425e-04 -348.47   <2e-16 ***
## Asian             5.088e-01  1.305e-03  389.86   <2e-16 ***
## White            -2.752e-01  1.024e-03 -268.78   <2e-16 ***
## Male             -1.139e-01  3.463e-04 -328.92   <2e-16 ***
## age20            -3.277e-02  2.585e-04 -126.80   <2e-16 ***
## age3045           4.824e-02  1.653e-04  291.94   <2e-16 ***
## age4565          -3.644e-02  1.680e-04 -216.85   <2e-16 ***
## last_turnout      2.967e-01  4.702e-04  631.11   <2e-16 ***
## stateAL           7.985e-01  2.253e-03  354.38   <2e-16 ***
## stateAR           1.004e+00  1.911e-03  525.58   <2e-16 ***
## stateAZ           2.314e+00  2.247e-03 1029.84   <2e-16 ***
## stateCA           2.054e+00  2.675e-03  767.72   <2e-16 ***
## stateCO           2.195e+00  1.735e-03 1265.15   <2e-16 ***
## stateCT           1.495e+00  1.852e-03  807.11   <2e-16 ***
## stateDC          -1.689e+00  4.003e-03 -421.83   <2e-16 ***
## stateDE          -3.156e-01  2.136e-03 -147.75   <2e-16 ***
## stateFL           2.378e+00  1.945e-03 1222.99   <2e-16 ***
## stateGA           8.744e-01  2.108e-03  414.69   <2e-16 ***
## stateHI          -3.613e+00  6.673e-03 -541.47   <2e-16 ***
## stateIA           1.947e+00  1.894e-03 1028.43   <2e-16 ***
## stateID           1.335e+00  1.867e-03  715.27   <2e-16 ***
## stateIL           1.977e+00  1.664e-03 1187.82   <2e-16 ***
## stateIN           1.939e+00  1.805e-03 1074.48   <2e-16 ***
## stateKS           1.533e+00  1.754e-03  873.86   <2e-16 ***
## stateKY           1.706e+00  1.877e-03  909.04   <2e-16 ***
## stateLA           5.204e-01  2.422e-03  214.82   <2e-16 ***
## stateMA           2.069e+00  1.944e-03 1064.08   <2e-16 ***
## stateMD           5.749e-01  2.114e-03  271.95   <2e-16 ***
## stateME           1.352e+00  1.998e-03  676.44   <2e-16 ***
## stateMI           1.985e+00  1.700e-03 1167.64   <2e-16 ***
## stateMN           2.155e+00  1.694e-03 1272.04   <2e-16 ***
## stateMO           1.810e+00  1.822e-03  993.42   <2e-16 ***
## stateMS          -1.236e-01  2.828e-03  -43.69   <2e-16 ***
## stateMT           1.024e+00  1.657e-03  618.03   <2e-16 ***
## stateNC           1.462e+00  1.858e-03  787.27   <2e-16 ***
## stateND           7.494e-01  1.721e-03  435.49   <2e-16 ***
## stateNE           1.299e+00  1.846e-03  703.50   <2e-16 ***
## stateNH           1.155e+00  1.893e-03  610.01   <2e-16 ***
## stateNJ           1.593e+00  1.722e-03  925.10   <2e-16 ***
## stateNM           1.745e+00  3.842e-03  454.11   <2e-16 ***
## stateNV           1.045e+00  1.715e-03  609.68   <2e-16 ***
## stateNY           1.869e+00  1.890e-03  989.31   <2e-16 ***
## stateOH           2.121e+00  1.883e-03 1126.47   <2e-16 ***
## stateOK           1.483e+00  1.685e-03  879.97   <2e-16 ***
## stateOR           1.980e+00  1.699e-03 1165.46   <2e-16 ***
## statePA           2.225e+00  1.894e-03 1174.67   <2e-16 ***
## stateRI           7.081e-01  2.230e-03  317.52   <2e-16 ***
## stateSC           5.790e-01  2.322e-03  249.35   <2e-16 ***
## stateSD           7.291e-01  1.749e-03  416.99   <2e-16 ***
## stateTN           1.456e+00  1.867e-03  779.54   <2e-16 ***
## stateTX           2.470e+00  2.292e-03 1077.71   <2e-16 ***
## stateUT           1.713e+00  2.182e-03  785.10   <2e-16 ***
## stateVA           1.297e+00  1.641e-03  789.89   <2e-16 ***
## stateVT           6.315e-01  2.068e-03  305.40   <2e-16 ***
## stateWA           1.998e+00  1.573e-03 1270.18   <2e-16 ***
## stateWI           2.224e+00  1.673e-03 1329.34   <2e-16 ***
## stateWV           1.268e+00  1.970e-03  643.69   <2e-16 ***
## stateWY           5.211e-01  1.848e-03  281.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 628278481  on 324  degrees of freedom
## Residual deviance:   8371435  on 263  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 8376793
## 
## Number of Fisher Scoring iterations: 4

The key numbers for in-sample fit are the null and residual deviances, at the very bottom of the model output. The residual deviance is far smaller than the null deviance, 8371435 versus 628278481, which suggests a significant improvement over the null[3] model. Because of the scaling discussed in the Appendix: Data section, the interpretation of coefficients is not particularly enlightening. One thing to remember is that because this is a Poisson regression, the signs of the coefficients indicate which direction the log of expected total turnout would move. The coefficient on the poll margin is positive, which actually goes against some research - it suggests that a greater poll margin leads to a higher turnout. It could be that higher poll margins are the result of more people deciding to vote for a particular party, meaning that turnout is influencing the margin, and that the regression may not properly capture cause and effect.
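To make the deviance comparison concrete, here is a small sketch on simulated data (not the turnout data) showing where the two deviances live in a fitted `glm` object and how to read coefficients on the log scale. For the turnout model above, the same ratio works out to 1 - 8371435/628278481, or roughly 0.987.

```r
# Toy Poisson regression; the data are simulated, not the turnout data.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(3 + 0.5 * x))

fit <- glm(y ~ x, family = poisson)

null_dev  <- fit$null.deviance   # deviance of the intercept-only model
resid_dev <- fit$deviance        # deviance of the fitted model

# Share of the null deviance explained by the covariates (a pseudo R-squared).
1 - resid_dev / null_dev

# Coefficients act on the log of the expected count, so exp() gives
# the multiplicative change in the expected count per unit of x.
exp(coef(fit))
```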

We can also look at out-of-sample validation. In this case, I conducted leave-one-out validation by year, and then calculated the sum of the squares of the residuals across states as a measure of fit.

Sum of Squares of Residuals Year Predicted
8.260468e+12 1992
1.246439e+13 1996
1.052389e+13 2000
1.405148e+12 2004
6.479571e+12 2008
3.681644e+12 2012
9.025660e+12 2016

While these numbers do seem quite high, it is important to remember that these values are squared. Once the square root is taken, numbers on the order of ten to the twelfth represent being off by millions of votes, which in an election of roughly 120 million votes is quite good. The final thing to do with this model is predict the turnout for each state in 2020, which we will use later when we simulate the election.
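The validation loop can be sketched as follows. This is an illustration on simulated data with hypothetical column names, not the code behind the table. It also converts each sum of squares into a per-state root mean squared error, which is easier to read: assuming roughly 50 states in the held-out year, a sum of squares of 1.4e12 corresponds to a typical miss of about 170,000 votes per state.

```r
# Leave-one-year-out validation for a Poisson model on simulated data.
# `toy` and its columns are hypothetical stand-ins for the turnout data.
set.seed(7)
years  <- seq(1992, 2016, by = 4)
states <- paste0("S", 1:10)
toy <- expand.grid(state = states, year = years, stringsAsFactors = FALSE)
toy$x     <- rnorm(nrow(toy))
toy$total <- rpois(nrow(toy), lambda = exp(10 + 0.3 * toy$x))

loyo_ssr <- sapply(years, function(yr) {
  train <- toy[toy$year != yr, ]
  test  <- toy[toy$year == yr, ]
  fit   <- glm(total ~ x + state, family = poisson, data = train)
  pred  <- predict(fit, newdata = test, type = "response")
  sum((test$total - pred)^2)            # sum of squared residuals for the held-out year
})

n_states <- length(states)
data.frame(year = years,
           ssr  = loyo_ssr,
           rmse = sqrt(loyo_ssr / n_states))  # typical per-state miss, in votes
```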

Estimating Vote Share

Estimating vote share happens in roughly the same way as estimating turnout, just with a different generalized linear model. Similar to many of the models I have used throughout this blog, I am using a two sided model based on party incumbency status[4]. I still use a pooled model across all states and years since the 1992 election.

For both the incumbent and the challenger models, I use a binomial regression, estimating the fraction of the total votes that each candidate will win. The regression uses demographic data, polling data, previous election results, economic data, along with party and state fixed effects. We can take a look at both the incumbent and challenger models. First, the incumbent model:

## 
## Call:
## glm(formula = cbind(inc_votes, total - inc_votes) ~ rdi_q2 + 
##     gdp + state + avg_poll + Black + Hispanic + Asian + White + 
##     Male + age20 + age3045 + age4565 + last_vote_margin + party, 
##     family = binomial, data = full_votes %>% filter(year < 2020, 
##         incumbent_party == TRUE))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -207.335   -31.862     0.452    33.982   205.117  
## 
## Coefficients:
##                    Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)      -1.965e+00  2.759e-03 -712.394  < 2e-16 ***
## rdi_q2            5.055e-03  9.465e-05   53.406  < 2e-16 ***
## gdp               4.751e-02  1.575e-04  301.562  < 2e-16 ***
## stateAL          -1.854e-01  3.226e-03  -57.464  < 2e-16 ***
## stateAR          -1.578e-01  3.101e-03  -50.884  < 2e-16 ***
## stateAZ           2.535e-02  3.391e-03    7.475 7.72e-14 ***
## stateCA          -5.681e-02  3.233e-03  -17.570  < 2e-16 ***
## stateCO           3.198e-02  2.838e-03   11.268  < 2e-16 ***
## stateCT          -6.234e-02  3.146e-03  -19.818  < 2e-16 ***
## stateDC           1.621e-01  7.239e-03   22.392  < 2e-16 ***
## stateDE          -1.170e-01  3.355e-03  -34.887  < 2e-16 ***
## stateFL          -5.842e-02  3.530e-03  -16.550  < 2e-16 ***
## stateGA          -1.784e-01  2.975e-03  -59.949  < 2e-16 ***
## stateHI          -7.962e-01  8.487e-03  -93.810  < 2e-16 ***
## stateIA          -9.766e-02  3.325e-03  -29.373  < 2e-16 ***
## stateID          -1.158e-01  3.279e-03  -35.306  < 2e-16 ***
## stateIL          -9.170e-02  2.881e-03  -31.834  < 2e-16 ***
## stateIN          -1.293e-01  3.045e-03  -42.479  < 2e-16 ***
## stateKS          -1.616e-01  3.036e-03  -53.223  < 2e-16 ***
## stateKY          -1.520e-01  3.165e-03  -48.029  < 2e-16 ***
## stateLA          -1.334e-01  3.263e-03  -40.894  < 2e-16 ***
## stateMA          -1.985e-01  3.251e-03  -61.044  < 2e-16 ***
## stateMD          -2.509e-01  3.153e-03  -79.573  < 2e-16 ***
## stateME          -9.035e-02  3.542e-03  -25.506  < 2e-16 ***
## stateMI          -9.126e-02  2.843e-03  -32.096  < 2e-16 ***
## stateMN          -1.241e-01  2.881e-03  -43.067  < 2e-16 ***
## stateMO          -1.632e-01  3.091e-03  -52.803  < 2e-16 ***
## stateMS          -2.079e-01  3.584e-03  -58.005  < 2e-16 ***
## stateMT          -8.205e-02  3.071e-03  -26.713  < 2e-16 ***
## stateNC          -1.596e-01  2.942e-03  -54.246  < 2e-16 ***
## stateND          -7.464e-02  3.188e-03  -23.412  < 2e-16 ***
## stateNE          -1.362e-01  3.172e-03  -42.923  < 2e-16 ***
## stateNH          -4.071e-02  3.317e-03  -12.273  < 2e-16 ***
## stateNJ          -1.374e-02  2.985e-03   -4.603 4.17e-06 ***
## stateNM           7.810e-02  4.897e-03   15.950  < 2e-16 ***
## stateNV          -2.362e-02  2.712e-03   -8.708  < 2e-16 ***
## stateNY          -2.682e-01  3.175e-03  -84.468  < 2e-16 ***
## stateOH          -1.647e-01  3.084e-03  -53.415  < 2e-16 ***
## stateOK          -1.782e-01  2.877e-03  -61.948  < 2e-16 ***
## stateOR          -5.122e-02  2.962e-03  -17.294  < 2e-16 ***
## statePA          -1.374e-01  3.188e-03  -43.106  < 2e-16 ***
## stateRI          -7.388e-02  3.692e-03  -20.010  < 2e-16 ***
## stateSC          -2.318e-01  3.198e-03  -72.458  < 2e-16 ***
## stateSD          -6.394e-02  3.163e-03  -20.216  < 2e-16 ***
## stateTN          -1.966e-01  3.009e-03  -65.346  < 2e-16 ***
## stateTX           2.490e-02  3.713e-03    6.708 1.98e-11 ***
## stateUT          -2.157e-01  3.482e-03  -61.944  < 2e-16 ***
## stateVA          -1.461e-01  2.669e-03  -54.754  < 2e-16 ***
## stateVT          -1.571e-01  3.714e-03  -42.292  < 2e-16 ***
## stateWA          -1.208e-01  2.580e-03  -46.825  < 2e-16 ***
## stateWI          -9.412e-02  2.942e-03  -31.991  < 2e-16 ***
## stateWV          -7.054e-02  3.482e-03  -20.256  < 2e-16 ***
## stateWY          -1.308e-02  3.536e-03   -3.699 0.000216 ***
## avg_poll          4.347e+00  1.272e-03 3417.178  < 2e-16 ***
## Black             1.853e-02  7.166e-04   25.851  < 2e-16 ***
## Hispanic         -4.198e-02  1.023e-03  -41.017  < 2e-16 ***
## Asian             1.269e-01  1.490e-03   85.156  < 2e-16 ***
## White             3.057e-03  1.577e-03    1.938 0.052569 .  
## Male             -4.947e-02  4.588e-04 -107.833  < 2e-16 ***
## age20             2.957e-02  3.864e-04   76.519  < 2e-16 ***
## age3045          -7.568e-03  3.272e-04  -23.132  < 2e-16 ***
## age4565           7.222e-03  3.357e-04   21.512  < 2e-16 ***
## last_vote_margin  1.411e-02  1.932e-04   73.043  < 2e-16 ***
## partyrepublican  -3.777e-03  1.579e-04  -23.916  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22644500  on 324  degrees of freedom
## Residual deviance:  1032104  on 261  degrees of freedom
## AIC: 1037001
## 
## Number of Fisher Scoring iterations: 3

We can also look at the challenger model:

## 
## Call:
## glm(formula = cbind(chl_votes, total - chl_votes) ~ rdi_q2 + 
##     gdp + state + avg_poll + Black + Hispanic + Asian + White + 
##     Male + age20 + age3045 + age4565 + last_vote_margin + party, 
##     family = binomial, data = full_votes %>% filter(year < 2020, 
##         incumbent_party == FALSE))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -271.370   -40.194     2.738    37.709   201.321  
## 
## Coefficients:
##                    Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)      -2.342e+00  2.719e-03 -861.459  < 2e-16 ***
## rdi_q2           -6.305e-03  9.533e-05  -66.135  < 2e-16 ***
## gdp              -3.723e-02  1.524e-04 -244.309  < 2e-16 ***
## stateAL           1.023e+00  3.197e-03  320.068  < 2e-16 ***
## stateAR           5.654e-01  3.065e-03  184.456  < 2e-16 ***
## stateAZ          -4.461e-01  3.385e-03 -131.800  < 2e-16 ***
## stateCA          -7.895e-01  3.263e-03 -241.991  < 2e-16 ***
## stateCO          -2.689e-01  2.792e-03  -96.294  < 2e-16 ***
## stateCT           1.487e-01  3.103e-03   47.922  < 2e-16 ***
## stateDC          -3.357e-02  9.859e-03   -3.405 0.000662 ***
## stateDE           6.807e-01  3.327e-03  204.594  < 2e-16 ***
## stateFL           4.524e-02  3.517e-03   12.862  < 2e-16 ***
## stateGA           1.014e+00  2.952e-03  343.635  < 2e-16 ***
## stateHI          -6.106e-01  8.411e-03  -72.599  < 2e-16 ***
## stateIA           1.829e-01  3.251e-03   56.270  < 2e-16 ***
## stateID           2.176e-02  3.260e-03    6.673 2.51e-11 ***
## stateIL           2.862e-01  2.848e-03  100.475  < 2e-16 ***
## stateIN           4.032e-01  2.990e-03  134.848  < 2e-16 ***
## stateKS           1.304e-01  2.993e-03   43.587  < 2e-16 ***
## stateKY           4.395e-01  3.101e-03  141.720  < 2e-16 ***
## stateLA           1.274e+00  3.244e-03  392.765  < 2e-16 ***
## stateMA           1.515e-01  3.205e-03   47.263  < 2e-16 ***
## stateMD           1.005e+00  3.118e-03  322.349  < 2e-16 ***
## stateME           2.063e-01  3.479e-03   59.299  < 2e-16 ***
## stateMI           5.964e-01  2.795e-03  213.382  < 2e-16 ***
## stateMN           2.109e-01  2.806e-03   75.148  < 2e-16 ***
## stateMO           5.119e-01  3.032e-03  168.850  < 2e-16 ***
## stateMS           1.415e+00  3.565e-03  396.982  < 2e-16 ***
## stateMT           1.560e-01  3.013e-03   51.764  < 2e-16 ***
## stateNC           7.867e-01  2.907e-03  270.648  < 2e-16 ***
## stateND           1.614e-01  3.135e-03   51.484  < 2e-16 ***
## stateNE           1.316e-01  3.134e-03   41.981  < 2e-16 ***
## stateNH           1.598e-01  3.243e-03   49.276  < 2e-16 ***
## stateNJ           1.461e-01  2.956e-03   49.442  < 2e-16 ***
## stateNM          -9.707e-01  4.922e-03 -197.239  < 2e-16 ***
## stateNV          -3.158e-01  2.714e-03 -116.387  < 2e-16 ***
## stateNY           1.922e-02  3.158e-03    6.086 1.16e-09 ***
## stateOH           5.232e-01  3.021e-03  173.187  < 2e-16 ***
## stateOK           2.760e-01  2.839e-03   97.223  < 2e-16 ***
## stateOR          -5.718e-02  2.913e-03  -19.632  < 2e-16 ***
## statePA           4.676e-01  3.131e-03  149.364  < 2e-16 ***
## stateRI           1.191e-01  3.665e-03   32.487  < 2e-16 ***
## stateSC           1.150e+00  3.172e-03  362.467  < 2e-16 ***
## stateSD           1.741e-01  3.096e-03   56.240  < 2e-16 ***
## stateTN           7.769e-01  2.964e-03  262.155  < 2e-16 ***
## stateTX          -4.870e-01  3.724e-03 -130.767  < 2e-16 ***
## stateUT          -8.295e-02  3.460e-03  -23.975  < 2e-16 ***
## stateVA           6.481e-01  2.630e-03  246.447  < 2e-16 ***
## stateVT           2.219e-01  3.668e-03   60.495  < 2e-16 ***
## stateWA          -6.884e-02  2.540e-03  -27.106  < 2e-16 ***
## stateWI           2.497e-01  2.870e-03   86.995  < 2e-16 ***
## stateWV           3.522e-01  3.433e-03  102.593  < 2e-16 ***
## stateWY          -6.133e-02  3.491e-03  -17.570  < 2e-16 ***
## avg_poll          4.509e+00  1.362e-03 3311.292  < 2e-16 ***
## Black            -2.727e-01  7.082e-04 -385.020  < 2e-16 ***
## Hispanic          3.020e-01  1.053e-03  286.786  < 2e-16 ***
## Asian             1.143e-01  1.524e-03   74.979  < 2e-16 ***
## White             4.211e-02  1.591e-03   26.466  < 2e-16 ***
## Male              5.686e-02  4.530e-04  125.506  < 2e-16 ***
## age20            -3.036e-02  3.800e-04  -79.903  < 2e-16 ***
## age3045          -2.662e-02  3.244e-04  -82.053  < 2e-16 ***
## age4565          -3.531e-02  3.365e-04 -104.933  < 2e-16 ***
## last_vote_margin  2.320e-02  2.007e-04  115.567  < 2e-16 ***
## partyrepublican   6.894e-02  1.599e-04  431.082  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22636314  on 324  degrees of freedom
## Residual deviance:  1257654  on 261  degrees of freedom
## AIC: 1262550
## 
## Number of Fisher Scoring iterations: 3

Again, for both models, the residual deviances are more than an order of magnitude smaller than the null deviances. One thing to note is that the incumbent model seems to have a better in-sample fit than the challenger model, with a residual deviance of 1032104 against 1257654. Unsurprisingly, the coefficient on the polling average is large and positive, relative to other coefficients. Interestingly, the economic data coefficients are positive in the incumbent model, but negative in the challenger model, which makes some intuitive sense. An incumbent party with a good economy is likely to have better success with voters[5].
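For readers unfamiliar with the `cbind(successes, failures)` form used in the calls above, here is a minimal sketch of a binomial regression on simulated vote counts. The variable names and the true coefficients are assumptions chosen for illustration.

```r
# Binomial regression on vote counts, using simulated data.
# Variable names (`votes`, `total`, `avg_poll`) are hypothetical.
set.seed(3)
n <- 100
avg_poll <- runif(n, 0.35, 0.65)                 # polling average as a fraction
total    <- rpois(n, 1e5)                        # turnout in each state-year
true_p   <- plogis(-2 + 4 * avg_poll)            # true vote share on the logit scale
votes    <- rbinom(n, size = total, prob = true_p)

# cbind(successes, failures) tells glm() how many trials sit behind each share.
fit <- glm(cbind(votes, total - votes) ~ avg_poll, family = binomial)

coef(fit)                                        # should be close to (-2, 4)
c(null = fit$null.deviance, residual = fit$deviance)

# Predicted vote share at a 50% polling average.
predict(fit, newdata = data.frame(avg_poll = 0.5), type = "response")
```

Because each observation is weighted by its number of trials, states with more voters pull harder on the fit than small states do.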

For out of sample fit, we can look at the fraction of states that the model predicts correctly using leave one out validation based on year.

Year State Prediction Accuracy
1992 0.9565217
1996 0.9400000
2000 0.9183673
2004 1.0000000
2008 0.9600000
2012 0.9189189
2016 0.9200000

For the most part, the model is quite accurate. It misses a few states in some elections. As discussed last week, a very similar model misses predictions in swing states quite frequently. This model would have incorrectly called the 2016 election for Hillary Clinton. These incorrect predictions stem in part from reliance on polling data, without accounting for its variance. Therefore, we need to introduce some uncertainty into the model.

National Polling Error

Much has been made of the possibility of national polling error. In 2016, a roughly 3 point national polling error meant that many were blindsided by Trump’s narrow victory in a number of Midwestern states. To simulate that, I calculated the mean and variance of polling errors for both Democrats and Republicans in elections since 1992. Interestingly, both parties tend to outperform their polls, by 2.40 and 2.11 points for Democrats and Republicans respectively.

Simulating Elections

Returning to the basic framework of the model, we have:

\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic})\]

Based on our predictions, we can write down \(n_i\) and \(p_{ic}\) with distributions as well:

\[n_i \sim \textrm{Pois}(\lambda_i)\]

\[p_{ic} \sim \textrm{Beta}(\alpha_{ic}, \beta_{ic})\]

In this case, \(\lambda_i\) is the predicted mean turnout in each state. As previously, we can reparameterize \(\alpha_{ic}\) and \(\beta_{ic}\) and instead input the mean and variance. One thing to note is that the \(\alpha\) and \(\beta\) values are quite large, meaning that we can approximate the beta distributions with high accuracy using a normal distribution. This is key, as it gives an easy way to account for correlation between states: the multivariate normal distribution.
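The reparameterization uses the standard moment relations for the beta distribution: with mean \(\mu\) and variance \(\sigma^2\), set \(\nu = \mu(1-\mu)/\sigma^2 - 1\), then \(\alpha = \mu\nu\) and \(\beta = (1-\mu)\nu\). A short sketch, with an invented mean and standard deviation, shows how close the resulting beta is to a normal distribution:

```r
# Convert a mean and variance into Beta(alpha, beta) parameters.
# Requires 0 < mu < 1 and variance < mu * (1 - mu).
beta_params <- function(mu, variance) {
  stopifnot(mu > 0, mu < 1, variance > 0, variance < mu * (1 - mu))
  nu <- mu * (1 - mu) / variance - 1
  list(alpha = mu * nu, beta = (1 - mu) * nu)
}

# Illustrative values only: a 52% predicted share with a 1-point standard deviation.
mu <- 0.52
sigma <- 0.01
pars <- beta_params(mu, sigma^2)
pars

# With alpha and beta this large, Beta and normal quantiles nearly coincide.
probs <- c(0.025, 0.5, 0.975)
rbind(beta   = qbeta(probs, pars$alpha, pars$beta),
      normal = qnorm(probs, mean = mu, sd = sigma))
```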

Instead of drawing 102 values one at a time for each vote share for each candidate, we can instead draw two sets of 51. The mean of each value will be the predicted vote share from the vote share models previously discussed. We also need a covariance matrix, which explains how vote shares in different states are related to one another.

We can calculate a covariance matrix in two steps. First, we build a similarity matrix that tells us how similar each state is to every other state, using all of the non-categorical data from the vote share model. The key detail of this similarity matrix is that values are between -1 and 1. Values near 1 indicate high similarity (or correlation), and negative values indicate that states are very dissimilar. Second, we scale this makeshift correlation matrix by the standard deviations in the polls over the past several weeks to get a covariance matrix. The standard deviations in the polling averages that I use are quite small, suggesting a very stable race, and therefore a relatively small amount of uncertainty[6].
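A rough sketch of the two steps is below. The similarity measure (correlation between states' standardized feature vectors), the feature values, and the polling standard deviations are all assumptions made for illustration; the post does not specify the exact similarity measure used.

```r
# Build a state-by-state covariance matrix from a similarity matrix.
# The feature values and polling standard deviations are invented.
set.seed(11)
states   <- c("PA", "OH", "MI", "TX", "CA")
features <- matrix(rnorm(5 * 8), nrow = 5,
                   dimnames = list(states, paste0("f", 1:8)))

# Step 1: similarity in [-1, 1], here the correlation between states'
# standardized feature vectors (an assumed choice of similarity measure).
similarity <- cor(t(scale(features)))

# Step 2: scale by each state's polling standard deviation: Sigma = D R D.
poll_sd <- c(PA = 0.010, OH = 0.012, MI = 0.009, TX = 0.011, CA = 0.008)
Sigma   <- diag(poll_sd) %*% similarity %*% diag(poll_sd)
dimnames(Sigma) <- list(states, states)

# Draw correlated vote shares around hypothetical predicted means.
mu  <- c(PA = 0.55, OH = 0.51, MI = 0.55, TX = 0.50, CA = 0.65)
eig <- eigen(Sigma, symmetric = TRUE)
A   <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))   # Sigma = A %*% t(A)
z   <- matrix(rnorm(10000 * 5), ncol = 5)
draws <- sweep(z %*% t(A), 2, mu, "+")
colnames(draws) <- states

round(cor(draws), 2)   # should be close to `similarity`
```

The eigendecomposition is used instead of a Cholesky factor because a similarity matrix built this way can be singular, and clipping tiny negative eigenvalues at zero keeps the draw well defined.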

The final step is the national polling error, which I draw from two independent stable distributions, one for Biden and one for Trump. I intentionally picked stable distributions because they are “fat-tailed,” meaning that events far from the center happen more frequently than in a normal distribution of comparable scale. This choice is meant to introduce a potentially large amount of polling error, and therefore variance, into the model. Something to note is that these terms are independent for each candidate: both Biden and Trump could have polling errors in their favor in this model[7].
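To illustrate what “fat-tailed” means in practice, here is a sketch that draws from a symmetric stable distribution using the Chambers-Mallows-Stuck method and compares its tails with a normal. The stability parameter of 1.7 and the scale of one point are assumptions for illustration, not the values used in the model, and `r_sym_stable` is a hypothetical helper rather than a function from any package.

```r
# Draw symmetric alpha-stable variates with the Chambers-Mallows-Stuck method.
# alpha = 2 is normal, alpha = 1 is Cauchy; smaller alpha means fatter tails.
# The alpha and scale values used below are illustrative assumptions.
r_sym_stable <- function(n, alpha, scale = 1, location = 0) {
  stopifnot(alpha > 0, alpha <= 2, scale > 0)
  u <- runif(n, -pi / 2, pi / 2)
  w <- rexp(n)
  if (alpha == 1) {
    x <- tan(u)
  } else {
    x <- sin(alpha * u) / cos(u)^(1 / alpha) *
      (cos((1 - alpha) * u) / w)^((1 - alpha) / alpha)
  }
  location + scale * x
}

set.seed(2020)
stable_err <- r_sym_stable(1e5, alpha = 1.7, scale = 0.01)
normal_err <- rnorm(1e5, mean = 0, sd = 0.01 * sqrt(2))  # the alpha = 2 member of the family

# Share of draws more than five points from zero under each distribution.
c(stable = mean(abs(stable_err) > 0.05),
  normal = mean(abs(normal_err) > 0.05))
```

Large misses are far more common under the stable draw than under the normal one, which is the behavior the polling error term is meant to capture.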

One thing to note is that, in a sense, including variance at both the state and national level is double counting variance. Variance in a model at the national level is a direct function of variance at the state level. However, in the uncertain times brought on by COVID-19 and one candidate actively trying to discredit the election, I decided that increasing the variance seems reasonable. I’ll denote the national polling error for candidate \(c\) as \(s_{c}\). We can then re-write the basic structure of the model as:

\[\textrm{Votes}_{ic}\sim\textrm{Bin}(n_{i}, p_{ic}+s_c)\]

This leaves us with a complete model, from which we can draw simulated values in order:

  1. Draw the vote shares from the multivariate normal distributions for each candidate.

  2. Draw the turnout values for each state from the Poisson distributions.

  3. Draw the national polling error for each candidate and add to the vote shares.

  4. Draw from the binomial in each state to get the votes in each state.
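The four steps above can be sketched end to end on a toy example. Every input here (the five states, their electoral votes, mean shares, covariance, and error scale) is invented, and t-distributed errors stand in for the stable distributions to keep the example short. Clamping probabilities to the unit interval is also an assumption of this sketch.

```r
# End-to-end toy simulation following the four steps above.
# Every input here (states, electoral votes, means, covariance, error scale)
# is invented; t-distributed errors stand in for the stable distributions.
set.seed(538)
n_sims <- 10000
states <- c("A", "B", "C", "D", "E")
ev     <- c(A = 20, B = 18, C = 16, D = 38, E = 55)                 # electoral votes
lambda <- c(A = 6.5e6, B = 5.8e6, C = 5.2e6, D = 1.0e7, E = 1.6e7)  # mean turnout
mu_b   <- c(A = 0.53, B = 0.50, C = 0.53, D = 0.49, E = 0.64)       # Biden mean share
mu_t   <- 1 - mu_b - 0.02                                           # Trump mean share

# Exchangeable correlation of 0.6 between states, 1.5-point standard deviations.
k     <- length(states)
R     <- matrix(0.6, k, k); diag(R) <- 1
Sigma <- (0.015^2) * R
L     <- chol(Sigma)                                     # Sigma = t(L) %*% L

draw_shares <- function(mu) {
  z <- matrix(rnorm(n_sims * k), ncol = k)
  sweep(z %*% L, 2, mu, "+")
}

# Step 1: correlated vote shares for each candidate.
p_b <- draw_shares(mu_b)
p_t <- draw_shares(mu_t)

# Step 2: turnout for each state (one column per state).
n <- matrix(rpois(n_sims * k, rep(lambda, each = n_sims)), ncol = k)

# Step 3: one national polling error per candidate per simulation,
# added to every state's share in that simulation.
s_b <- 0.01 * rt(n_sims, df = 4)
s_t <- 0.01 * rt(n_sims, df = 4)
clamp <- function(p) pmin(pmax(p, 0), 1)                 # keep probabilities valid
p_b <- clamp(p_b + s_b)
p_t <- clamp(p_t + s_t)

# Step 4: binomial votes in each state.
v_b <- matrix(rbinom(n_sims * k, size = n, prob = p_b), ncol = k)
v_t <- matrix(rbinom(n_sims * k, size = n, prob = p_t), ncol = k)

# Award each state's electoral votes to the candidate with more votes.
ev_b <- as.vector(((v_b > v_t) * 1) %*% ev)
ev_t <- as.vector(((v_t > v_b) * 1) %*% ev)

c(biden_win_prob = mean(ev_b > ev_t),
  mean_biden_ev  = mean(ev_b),
  ev_pop_split   = mean((ev_b > ev_t) != (rowSums(v_b) > rowSums(v_t))))
```

The last line reports the three summaries discussed in the results: the share of simulations won, the average electoral vote count, and how often the electoral college and popular vote winners differ.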

We can finally look at the results from the model over 10,000 draws.

Results

We can first look at the average electoral map based on the average vote shares in each state.

We can also look at the actual vote shares:

State Trump Popular Vote Biden Popular Vote
AK 0.5251388 0.4748612
AL 0.5783282 0.4216718
AR 0.5752348 0.4247652
AZ 0.4725368 0.5274632
CA 0.3484660 0.6515340
CO 0.4244236 0.5755764
CT 0.3607708 0.6392292
DC 0.2794655 0.7205345
DE 0.3660884 0.6339116
FL 0.4580901 0.5419099
GA 0.4839587 0.5160413
HI 0.3438294 0.6561706
IA 0.4850483 0.5149517
ID 0.5707706 0.4292294
IL 0.3981593 0.6018407
IN 0.5359081 0.4640919
KS 0.5188326 0.4811674
KY 0.5726321 0.4273679
LA 0.5549075 0.4450925
MA 0.3055022 0.6944978
MD 0.3163091 0.6836909
ME 0.4166781 0.5833219
MI 0.4485951 0.5514049
MN 0.4442557 0.5557443
MO 0.5170086 0.4829914
MS 0.5560577 0.4439423
MT 0.5187822 0.4812178
NC 0.4654500 0.5345500
ND 0.5820149 0.4179851
NE 0.5096579 0.4903421
NH 0.4354918 0.5645082
NJ 0.3959544 0.6040456
NM 0.4186697 0.5813303
NV 0.4576305 0.5423695
NY 0.3306714 0.6693286
OH 0.4926390 0.5073610
OK 0.5886615 0.4113385
OR 0.3915786 0.6084214
PA 0.4472588 0.5527412
RI 0.3367492 0.6632508
SC 0.4981731 0.5018269
SD 0.5561404 0.4438596
TN 0.5354477 0.4645523
TX 0.5022584 0.4977416
UT 0.5336834 0.4663166
VA 0.4167673 0.5832327
VT 0.3134606 0.6865394
WA 0.3687515 0.6312485
WI 0.4461054 0.5538946
WV 0.6036859 0.3963141
WY 0.6663434 0.3336566

This model shows a blowout for Biden in a number of key states: he is up by more than ten points in Michigan and Pennsylvania, and by more than eight points in Florida. He also narrowly wins South Carolina, and nearly pulls out a victory in Texas. Overall, Biden wins 389.5774 electoral votes on average, with Trump picking up the remaining 148.4226. The distribution of the electoral college outcomes fits exactly with this theme.

We can see that there is very, very little overlap in the electoral college results. In fact, this model predicts a Biden victory in 9,989 out of 10,000 simulations, while Trump wins the remaining 11. As extreme as this result is, it’s not totally out of line with The Economist’s prediction. We can also look at the plot of the popular vote shares, and we can again see very little overlap.

This plot shows Biden with a commanding lead in the national popular vote, winning above 50% of the vote in essentially every contest. There is an electoral college and popular vote split in 0 of the 10,000 contests, meaning that Trump only wins when he also wins the popular vote. Texas and South Carolina coming down to razor thin margins fits with the heavily Democratic-leaning national environment. However, in some senses, looking at the average election outcomes is deceptive. A number of races in battleground states are actually predicted to be quite close.

Given how close some of the state elections are supposed to be, in states like Ohio or Iowa, how can this model be so certain in predicting that Biden wins the election? The answer is the state to state correlation in outcomes. In elections where Biden only narrowly wins Pennsylvania, he is likely to do poorly in Ohio, because the states are relatively similar. However, Biden is predicted to win Pennsylvania relatively comfortably, so on average he is likely to win Ohio as well.

This also speaks to another phenomenon that gives Biden a massive advantage: he has many paths to winning the election. Because Biden is favored in so many states, he could lose Georgia or Arizona and still comfortably win the election.

Initial Model Assessment

While this model is much more sophisticated than anything else that I have built this past semester, I think it is still flawed. While the state to state correlation and national polling error terms are the most interesting pieces of any model I have built, I did not spend much time fine tuning them. State to state correlation is certainly something I wish I had spent more time on, as this is a feature that is notoriously hard to get right[8]. I also wish I had the chance to run the full model on previous elections, conducting yet another round of out-of-sample validation. It would be interesting to see how this model performs with data that do not suggest a blowout election. One other note is that this model seems to understate the importance of the electoral college - other models that I have seen do allow for a split between the electoral college and popular vote winners, which did not occur here.

All in all, this model gives Biden an extremely high chance of victory, which does make sense given the national environment. This model is quite certain of the victory, but everyone should remember that events with small probabilities do happen occasionally, no matter how small the probability is. For example, the roughly 0.1% chance that my model gives Trump is about the chance of three randomly selected people all being left handed. Also, frankly, I think my model is way too certain. I think further experimentation with variable selection, or changing how polling variance is thought of, might introduce more uncertainty.
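As a quick check of that comparison, assuming that roughly 10% of people are left handed (an assumed round figure, not something estimated in this post):

\[\frac{11}{10{,}000} = 0.0011 \approx 0.1\%, \qquad 0.1^3 = 0.001 = 0.1\%\]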

What this model suggests is that for Donald Trump to win re-election, he needs a massive polling error in his favor, or he needs courts to allow him to conduct extra-legal shenanigans that throw out votes. One final thing to remember is this: all models are wrong, but some models are useful. Undoubtedly, this model will be wrong in the exact votes, but it can still tell us that Biden has a very, very good chance of victory.

Appendix: Data

Data for this model came from a number of places. In some situations, I also scaled some of the data so that different measures had approximately the same magnitude.

  1. Polling data. For historical data, this was an average of state level polls over the final three weeks of the election. This is a very crude form of poll aggregation, averaging over time. The choice of three weeks comes from striking a balance between using polls that are recent enough, but also making sure not to limit the amount of data I have. Averaging over time is a strategy used by both FiveThirtyEight and The Economist. For 2020 data, I used the smoothed polling average from FiveThirtyEight, again taking the average over the final three weeks. One added benefit is that pollsters tend to herd in the final week or so before the election - meaning that no one wants to be an outlier, so incorporating data from farther back helps negate this effect.

  2. Economic data. This comes both from sources provided in class, and from the Bureau of Economic Analysis. I used two pieces of data: second quarter GDP at the national level, and real disposable income at the state level. Real disposable income data comes from “Quarterly Personal Income By State.” GDP data comes from class. The reasoning for including a mix of both economic indicators and levels is to cover the widest range of possible impact on voting behavior, and to make sure that a single indicator from 2020 does not completely skew the results. Given that we are in a recession, something like GDP will show a massive loss in the second quarter, while real disposable income will show a huge gain (due to the CARES Act). As mentioned in the footnotes, there is significant evidence that the economy does impact how people vote.

  3. Demographic data. I use a range of demographic data at the state level. In both the turnout and vote share estimation regressions, I used the same set of data: fraction of the population that is white, Black, Hispanic, Asian, age 20 to 30, age 30 to 45, and age 45 to 65. There are other race or age categories that I could include, but I wanted to avoid collinearity: you can figure out roughly how many people there are over the age of 65 by adding up all the other categories. This data came from the dataset provided in class. One problem in this case is that the dataset I was using only had demographic data up to 2018, so I substituted the 2018 values for 2020. A common refrain in predictive models is “garbage in, garbage out,” meaning that this could cause problems for predictions. There were two related reasons to include demographic information: first, it acts as a way to have correlation between states, and second, certain demographic groups tend to vote in particular ways. For example, African American women vote overwhelmingly for Democrats.

  4. Indicators. These are dummy variables, primarily for party and state. Given their use in a pooled model, they are meant to capture time-invariant fixed effects. Because I am primarily working with a two-sided model based on incumbency, adding in a control for the party seems like a reasonable way to capture some amount of the effect of political polarization. State fixed effects are meant to account for the particular impact of each state, assuming that it is constant over time, which could account for things like cost of voting.

  5. Lagged data. This includes lagged vote share data, along with lagged turnout data when estimating turnout. The reason for including this information is simple: the best predictor of the future is the past.
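
The three-week polling average described in item 1 can be sketched in a few lines. This is only a minimal illustration, not the code used for the model: the function name `three_week_average`, the `(date, margin)` layout, and the poll values are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def three_week_average(polls, election_day, window_days=21):
    """Average the poll margins from the window_days before election_day.

    `polls` is a list of (date, margin) pairs. This layout is an assumption
    made for the sketch, not the format of the original data.
    """
    start = election_day - timedelta(days=window_days)
    recent = [margin for day, margin in polls if start <= day < election_day]
    if not recent:
        raise ValueError("no polls fall inside the averaging window")
    return mean(recent)


# Hypothetical polls for a single state (made-up margins, in points).
polls = [
    (date(2020, 10, 1), 3.0),   # more than three weeks out, so excluded
    (date(2020, 10, 15), 4.0),
    (date(2020, 10, 25), 6.0),
    (date(2020, 11, 1), 5.0),
]
avg = three_week_average(polls, election_day=date(2020, 11, 3))
print(avg)
```

A shorter window would lean on the most recent polls, where herding is the bigger worry. A longer window adds data at the cost of recency.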

One important thing I did early on in the process was scale a large amount of my data to have a mean of zero and standard deviation of one, within each year. I scaled previous turnout, vote margin, demographic data, and real disposable income data based on each year9. This was primarily to ensure that past turnout data did not completely overwhelm everything else, as it was on a much larger order of magnitude. I intentionally did not scale the polling data, because I needed the variances in raw form to incorporate into the model at other points.
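
As a rough sketch of what within-year scaling looks like, the snippet below standardizes one column separately for each election year. The row layout, the function name `scale_within_year`, and the turnout numbers are hypothetical, and the use of the population standard deviation is an assumption (the sample standard deviation would work equally well for this purpose).

```python
from collections import defaultdict
from statistics import mean, pstdev


def scale_within_year(rows, column):
    """Return copies of `rows` with `column` standardized within each year."""
    values_by_year = defaultdict(list)
    for row in rows:
        values_by_year[row["year"]].append(row[column])

    # Assumption: population standard deviation within each year.
    stats = {y: (mean(v), pstdev(v)) for y, v in values_by_year.items()}

    scaled = []
    for row in rows:
        mu, sigma = stats[row["year"]]
        # A column that is constant within a year has sigma == 0 and
        # carries no within-year information, so it is mapped to 0.
        z = 0.0 if sigma == 0 else (row[column] - mu) / sigma
        scaled.append({**row, column: z})
    return scaled


# Hypothetical data: two states in two election years.
rows = [
    {"year": 2016, "state": "A", "turnout": 1_000_000},
    {"year": 2016, "state": "B", "turnout": 3_000_000},
    {"year": 2020, "state": "A", "turnout": 1_200_000},
    {"year": 2020, "state": "B", "turnout": 3_600_000},
]
scaled = scale_within_year(rows, "turnout")
print([round(r["turnout"], 3) for r in scaled])
```

Scaling within each year, rather than across the pooled data, removes the overall growth in turnout between elections, so each state is measured relative to the other states in the same election.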


  1. There are some assumptions that should hold for a variable to be distributed Poisson, and I believe that most of them do. The first is that individual votes are cast independently of one another (for example, me casting a vote does not make you in particular any more or less likely to cast a vote). Second, the mean should be roughly equal to the variance, which in this case it is not. The mean of total votes cast is roughly \(2\times10^7\), while the variance is roughly \(5.6\times10^{12}\), meaning that we have overdispersion. For simplicity in the simulation process, I decided to stick with using a Poisson regression, even if it does mean that the variance in simulated turnout will be lower than the data suggest.

  2. In general, states that are more competitive see surges in turnout. See Li et al. (2018) and Bursztyn et al. (2017).

  3. A model that predicts just the mean of the outcome variable.

  4. As discussed previously, incumbency matters a great deal. To account for this, I split the data and generate two separate models: one for candidates from the incumbent party, and a second for the challenger party. As explored last week, this performs roughly as well as party-based two-sided models. One thing to note is that the split is by party status, rather than by candidate. Some models, like Abramowitz’s Time for Change model, which has been quite successful, only use actual incumbency status.

  5. See Achen and Bartels (2017) for further evidence of this claim.

  6. One thing to note about this is that another way to measure uncertainty would have been to use the standard deviations from each state in individual polls, rather than in the average. I decided against this because individual polls can be quite noisy, and poll aggregators like FiveThirtyEight do a good job of putting together a trend line that removes a lot of the noise.

  7. This is less of a modelling choice and more a matter of convenience, as generating draws from correlated stable distributions is not straightforward. In fact, the covariance between errors is negative, as expected, meaning that this is a clear limitation of the model. However, it does fit with the idea that both Democrats and Republicans consistently beat their polls.

  8. See G. Elliott Morris, Nate Cohn, and Nate Silver’s Twitter feeds for examples, along with this blog post from Andrew Gelman.

  9. So that within each year, the mean value is 0 and the standard deviation is 1. I did not scale GDP: because it is the same for all states within a year, within-year scaling would set every value to 0.
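
To put the overdispersion from footnote 1 in numbers, taking the rough mean and variance quoted there at face value, the raw variance-to-mean ratio is:

\[\frac{\textrm{Var}(\textrm{Votes})}{\textrm{E}[\textrm{Votes}]}\approx\frac{5.6\times10^{12}}{2\times10^{7}}=2.8\times10^{5}\]

A Poisson random variable has a ratio of exactly 1. This is the unconditional ratio, so it overstates the problem somewhat: the regression only assumes that the variance equals the mean after conditioning on the covariates, which explain part of the raw variation.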