Introduction

This week, I cover three topics. First, I focus on incumbency and how it relates to the models I have built. Then, I turn to logistic regression as a way to build more effective forecasts for the presidential election. I close with a brief discussion of what the models suggest for next steps.

Incumbency

To say Donald Trump’s re-election campaign has been a unique one is an understatement. From filing the campaign paperwork with the Federal Election Commission on the day of his inauguration[^1], to leveraging the power of the federal government, to the particular messaging approach, there are many factors that make it hard to figure out how to model this election. All three of these points relate to incumbency.

A commonly cited fact in forecasting presidential elections is that incumbent presidents have an advantage: eight of the eleven incumbents who have run for re-election since World War II have won their races. There are many explanations for why, and we should think critically about whether each possible reason applies to President Trump’s re-election campaign.

One explanation is money: because Trump was able to officially begin his campaign on the day of his inauguration and avoided a costly primary, he entered the final stretch of the campaign with 187 million dollars more than Joe Biden. However, there is also evidence that this will not play a major role in deciding the election, as the Trump campaign is facing a cash crunch, having spent huge amounts on legal fees, travel expenses for aides, and a number of highly paid consultants. If Trump does not in fact have a cash advantage, this source of incumbency advantage may be weaker for him.

Another possible angle is that because Trump currently has the might of the federal government behind him, he can use that to his advantage in the run-up to the election. This theory is quantified by Kriner and Reeves (2012), who link county-level federal grants to wins by incumbents. As we saw last week, the CARES Act greatly increased people’s real disposable incomes in the second quarter of 2020[^2], which in the fundamentals model leads to a landslide victory for Trump[^3]. However, this relationship may not be quite that straightforward: the majority of Americans do not approve of how Trump has handled the COVID crisis, so they may not respond to increased income in the way we would expect, since they only needed it because of the dire state the country is in.

A third factor to consider is campaign messaging (which we will take a longer look at after the election happens): Trump’s messaging has not been consistent with what most incumbent campaigns look like[^4]. He continues to campaign as if he is the outsider, using messaging that was effective in 2016. A number of advertisements make reference to “Joe Biden’s America” overlaid on images of current events, despite Trump being the current president[^5].

Incumbency in Models So Far

All of this raises the question: how have I dealt with incumbency in the previous two posts? I took a similar approach in both, estimating the incumbent and challenger vote shares separately. One thing to note is that I have primarily been working with incumbent parties, rather than incumbent candidates. Incumbent-party candidates have won only 11 of the 18 elections since World War II, and eight of those 11 wins were by incumbent presidents. This relationship is well documented in the Time for Change Model[^6].

If we take it as given that incumbency has some effect, then because my linear regressions did not include it, incumbency was an omitted variable. Its impact was therefore absorbed either by the coefficients on the predictors or by the error term. There is evidence in both directions. Because I looked at state-level data for real disposable income, it may be that, as in Kriner and Reeves (2012), incumbency is tied to real disposable income levels. However, I believe it is more likely that it was absorbed into the error, which may be problematic from an endogeneity perspective[^7].
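To make the omitted-variable concern concrete, here is a stylized sketch. The symbols are illustrative assumptions rather than notation from the earlier posts: \(y\) is a vote share, \(x\) a single predictor such as real disposable income, \(I\) an incumbency indicator, and \(\gamma\) its effect. Suppose the true relationship is

\[y = \beta_0 + \beta_1 x + \gamma I + \varepsilon\]

but the regression that is actually estimated leaves \(I\) out:

\[y = \beta_0 + \beta_1 x + u, \qquad u = \gamma I + \varepsilon\]

The standard omitted-variable result is then

\[\operatorname{plim} \hat{\beta}_1 = \beta_1 + \gamma \, \frac{\operatorname{Cov}(x, I)}{\operatorname{Var}(x)}\]

If \(x\) and \(I\) are uncorrelated, the slope is unaffected and the incumbency effect sits in the error term (and the intercept). If they are correlated, part of the incumbency effect is attributed to \(x\), which is the first of the two possibilities described above.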

Logistic Regression

One obvious critique of my models thus far is that they are linear regressions that predict vote shares, which does not quite reflect the reality of voting. Many of the premier election forecasters use probabilistic models that instead estimate something like

\[\Pr(\text{VoteShare} = 200000 \mid \text{VoterTurnout} = 5000000) = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots)\]

In practice, this means we want to estimate the probability of a candidate winning an election, or a state, rather than just the vote share. A simple way of doing so is to use a logistic regression. Instead of using vote share as the outcome, we use an indicator for who won the election. The predictions from the regression are then probabilities of winning. A probability greater than 50 percent can be read as a predicted win in that particular election.
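For reference, the standard form of a logistic regression can be written out as follows. Here \(W\) is my own shorthand for the win indicator (an assumption, not notation used elsewhere in this post), and \(f\) is the logistic function:

\[\Pr(W = 1 \mid x) = f(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots), \qquad f(z) = \frac{1}{1 + e^{-z}}\]

Writing \(p = \Pr(W = 1 \mid x)\), this is equivalent to a model that is linear in the log odds:

\[\log \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots\]

Because \(f(0) = 0.5\), a predicted probability above 50 percent is the same as the right-hand side being positive.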

For this particular model, I used a tweaked version of the state-level “polls plus” model from the last post. There are five inputs: indicators for state and period[^8], along with state polling data, second quarter national GDP data, and second quarter state real disposable income. The outcome is an indicator variable for whether the incumbent party’s candidate won the state.

## 
## Call:
## glm(formula = incumbent_win ~ rdi_q2 + gdp + state + avg_poll + 
##     period, family = binomial, data = reg_df %>% filter(year < 
##     2020, incumbent_party == TRUE))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1194  -0.4369   0.0047   0.3491   4.0324  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -10.047750 967.895121  -0.010  0.99172    
## rdi_q2                     -0.081454   0.041669  -1.955  0.05061 .  
## gdp                         0.604665   0.197987   3.054  0.00226 ** 
## stateAlaska                 3.606518   1.961118   1.839  0.06591 .  
## stateArizona                2.397271   1.663887   1.441  0.14965    
## stateArkansas               0.652646   1.665996   0.392  0.69525    
## stateCalifornia             2.591043   1.706732   1.518  0.12898    
## stateColorado               1.255107   1.590700   0.789  0.43010    
## stateConnecticut            2.352093   1.679619   1.400  0.16140    
## stateDelaware               2.119239   1.602776   1.322  0.18609    
## stateDistrict of Columbia  -1.188237 337.417024  -0.004  0.99719    
## stateFlorida                0.556883   1.507437   0.369  0.71181    
## stateGeorgia               -0.564657   1.711350  -0.330  0.74144    
## stateHawaii                 2.675908   1.748457   1.530  0.12591    
## stateIdaho                  4.028510   2.673405   1.507  0.13184    
## stateIllinois               0.852643   1.687768   0.505  0.61343    
## stateIndiana                0.057035   1.768881   0.032  0.97428    
## stateIowa                   1.416683   1.547635   0.915  0.35999    
## stateKansas                 2.290142   2.070068   1.106  0.26859    
## stateKentucky               1.052103   1.669340   0.630  0.52853    
## stateLouisiana              1.259425   1.788380   0.704  0.48129    
## stateMaine                  2.364346   1.767944   1.337  0.18111    
## stateMaryland               0.782121   1.731072   0.452  0.65140    
## stateMassachusetts         -1.155197   2.121012  -0.545  0.58600    
## stateMichigan               1.083185   1.576206   0.687  0.49195    
## stateMinnesota              0.458344   1.528256   0.300  0.76424    
## stateMississippi            0.004684   2.064805   0.002  0.99819    
## stateMissouri               0.225698   1.575021   0.143  0.88605    
## stateMontana                0.959912   1.737058   0.553  0.58053    
## stateNebraska               2.068288   2.095927   0.987  0.32373    
## stateNevada                 2.104432   1.711531   1.230  0.21886    
## stateNew Hampshire          0.333822   1.594812   0.209  0.83420    
## stateNew Jersey             1.158564   1.630102   0.711  0.47725    
## stateNew Mexico             2.862427   1.653733   1.731  0.08347 .  
## stateNew York              -0.023488   1.882502  -0.012  0.99005    
## stateNorth Carolina        -0.457004   1.573595  -0.290  0.77149    
## stateNorth Dakota           3.033664   1.805810   1.680  0.09297 .  
## stateOhio                   0.126982   1.520796   0.083  0.93346    
## stateOklahoma              -0.397948   2.976997  -0.134  0.89366    
## stateOregon                 1.551877   1.614250   0.961  0.33637    
## statePennsylvania           0.397665   1.583857   0.251  0.80176    
## stateRhode Island           3.949134   2.276575   1.735  0.08280 .  
## stateSouth Carolina         0.665866   1.716010   0.388  0.69799    
## stateSouth Dakota           2.925169   1.707404   1.713  0.08667 .  
## stateTennessee             -1.195558   1.636465  -0.731  0.46504    
## stateTexas                  2.444242   1.725369   1.417  0.15659    
## stateUtah                   3.439779   2.694959   1.276  0.20182    
## stateVermont                2.599439   1.855393   1.401  0.16121    
## stateVirginia               0.680591   1.583773   0.430  0.66739    
## stateWashington             2.087474   1.562509   1.336  0.18156    
## stateWest Virginia          2.073452   1.613846   1.285  0.19887    
## stateWisconsin             -0.210621   1.622676  -0.130  0.89673    
## stateWyoming                3.311931   1.960772   1.689  0.09120 .  
## avg_poll                    0.482719   0.049143   9.823  < 2e-16 ***
## period1972-1992            -9.440387 967.890907  -0.010  0.99222    
## period1996-2020           -12.328922 967.890708  -0.013  0.98984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 669.21  on 484  degrees of freedom
## Residual deviance: 284.75  on 429  degrees of freedom
## AIC: 396.75
## 
## Number of Fisher Scoring iterations: 16

Interpreting the coefficients of a logistic regression is a little different than with a linear regression. Each coefficient is the change in the log odds of an incumbent-party win for a one-unit increase in the predictor, holding the others fixed. For example, the polling coefficient of about 0.48 means each additional polling point multiplies the odds of a win by roughly e^0.48 ≈ 1.62. We can see that the coefficients on polling and GDP are both statistically significant. Assessing the fit of a logistic regression is a bit tricky, so one ad hoc way to do it is to see how often the model is correct under leave-one-out validation. The logistic model performs quite well: it predicts the correct outcome 89.69 percent of the time.
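The leave-one-out check can be sketched as follows. This is a minimal illustration in plain Python, not the code behind the 89.69 percent figure (the model above was fit with R's `glm`). The data are made up, the model has a single predictor, and the helper names `fit_logistic` and `loo_accuracy` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit an intercept and slope by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def loo_accuracy(xs, ys):
    """Refit without each observation in turn and score the held-out prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_logistic(train_x, train_y)
        predicted_win = sigmoid(b0 + b1 * xs[i]) > 0.5
        if predicted_win == (ys[i] == 1):
            correct += 1
    return correct / len(xs)


# Made-up data: polling margin for the incumbent party and a win indicator.
margins = [-8.0, -5.0, -3.0, -1.0, 1.0, 2.0, 4.0, 6.0, 9.0, -2.0]
wins = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

accuracy = loo_accuracy(margins, wins)
print(f"Leave-one-out accuracy: {accuracy:.2%}")
```

Each observation is predicted by a model that never saw it, so the share of correct calls is a rough out-of-sample hit rate rather than an in-sample fit statistic.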

We can also look at the predicted electoral map for 2020 using this regression. The predicted map is strange when compared with intuitions about which states are likely to vote for Trump or Biden: for example, it gives Trump only about a 17 percent chance of winning Arkansas.

This model predicts a Biden landslide, with Biden winning 399 electoral college votes and Trump winning the remaining 139. We can also look at the predicted probabilities of Trump winning each state.

| State | Predicted probability of Trump winning (%) |
|---|---|
| Alabama | 92.0493784 |
| Alaska | 95.4408800 |
| Arizona | 19.4945405 |
| Arkansas | 16.6796352 |
| California | 0.1397724 |
| Colorado | 1.5371727 |
| Connecticut | 0.7888357 |
| Delaware | 2.5406055 |
| District of Columbia | 0.0000003 |
| Florida | 6.3400246 |
| Georgia | 5.9399776 |
| Hawaii | 0.0024521 |
| Idaho | 99.9456129 |
| Illinois | 0.1582114 |
| Indiana | 56.0250090 |
| Iowa | 9.1210142 |
| Kansas | 94.8417137 |
| Kentucky | 90.3431760 |
| Louisiana | 77.2482975 |
| Maine | 3.1055703 |
| Maryland | 0.1402991 |
| Massachusetts | 0.0000346 |
| Michigan | 0.4046034 |
| Minnesota | 2.5351588 |
| Mississippi | 47.1825793 |
| Missouri | 31.4928193 |
| Montana | 26.8393642 |
| Nebraska | 97.3217045 |
| Nevada | 1.0527043 |
| New Hampshire | 7.0227494 |
| New Jersey | 0.0129087 |
| New Mexico | 9.1394590 |
| New York | 0.0300054 |
| North Carolina | 7.2794133 |
| North Dakota | 98.1949806 |
| Ohio | 6.2533086 |
| Oklahoma | 83.2468575 |
| Oregon | 0.3187525 |
| Pennsylvania | 1.0283117 |
| Rhode Island | 0.2877758 |
| South Carolina | 56.0953943 |
| South Dakota | 99.7350013 |
| Tennessee | 71.8620977 |
| Texas | 52.8141759 |
| Utah | 85.1715413 |
| Vermont | 0.0138497 |
| Virginia | 3.2344618 |
| Washington | 0.1412413 |
| West Virginia | 99.5234850 |
| Wisconsin | 1.0891186 |
| Wyoming | 99.9987465 |
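As a check on how the electoral vote totals follow from these probabilities, here is a small Python sketch of the 50 percent rule. The probabilities are rounded from the table above, and the electoral vote counts are the 2020 allocations. Treating Nebraska as winner-take-all is a simplifying assumption, and the names `LEANING_STATES` and `tally` are illustrative.

```python
# (state, predicted probability of a Trump win in percent, electoral votes)
# Only states at or near the 50 percent line are listed. Every state not
# listed has a predicted probability well below 50 percent in the table,
# so its electoral votes are assigned to Biden.
LEANING_STATES = [
    ("Alabama", 92.05, 9),
    ("Alaska", 95.44, 3),
    ("Idaho", 99.95, 4),
    ("Indiana", 56.03, 11),
    ("Kansas", 94.84, 6),
    ("Kentucky", 90.34, 8),
    ("Louisiana", 77.25, 8),
    ("Mississippi", 47.18, 6),
    ("Missouri", 31.49, 10),
    ("Nebraska", 97.32, 5),  # assumption: treated as winner-take-all
    ("North Dakota", 98.19, 3),
    ("Oklahoma", 83.25, 7),
    ("South Carolina", 56.10, 9),
    ("South Dakota", 99.74, 3),
    ("Tennessee", 71.86, 11),
    ("Texas", 52.81, 38),
    ("Utah", 85.17, 6),
    ("West Virginia", 99.52, 5),
    ("Wyoming", 100.00, 3),
]

TOTAL_ELECTORAL_VOTES = 538


def tally(states, threshold=50.0):
    """Give a state's electoral votes to Trump when his probability exceeds the threshold."""
    trump = sum(votes for _, prob, votes in states if prob > threshold)
    return trump, TOTAL_ELECTORAL_VOTES - trump


trump_votes, biden_votes = tally(LEANING_STATES)
print(f"Trump {trump_votes}, Biden {biden_votes}")  # prints: Trump 139, Biden 399
```

Note how sensitive the total is to the cutoff: Texas alone, at roughly 53 percent, accounts for 38 of Trump's 139 votes.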

Reflections

Looking at this version of a probabilistic model, there appear to be similar problems to the linear regressions from last week. In general, it seems that too much weight is given to the economy this year. This suggests that I should spend some time with ensemble models and with tuning their hyperparameters. However, given that these models are, for the most part, fairly good at predicting past elections in leave-one-out validation, it may be that I have to experiment with ad hoc weighting.

In addition, it may be that I need to move away from linear models. The underlying assumption of all of the models I have used so far is that the relationship between the outcome and polling, the economy, and the various indicators is linear (linear in the log odds, in the case of the logistic regression). One possible way to change the models is to experiment with interaction terms. Another method, which I think may be more promising, is to look at some sort of Bayesian updating model. Using a fundamentals model as a prior, and then updating with polling and information on shocks, could be a much more effective way to predict elections.
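One simple version of that idea is a normal-normal update, sketched below in Python. This is only an illustration of the mechanics, not a model I have fit: every number is hypothetical, and the function name `normal_update` is my own.

```python
import math


def normal_update(prior_mean, prior_sd, obs_mean, obs_sd):
    """Combine a normal prior with a normal observation (conjugate update)."""
    prior_precision = 1.0 / prior_sd ** 2
    obs_precision = 1.0 / obs_sd ** 2
    post_precision = prior_precision + obs_precision
    # The posterior mean is a precision-weighted average of prior and data.
    post_mean = (prior_precision * prior_mean + obs_precision * obs_mean) / post_precision
    post_sd = math.sqrt(1.0 / post_precision)
    return post_mean, post_sd


# Hypothetical numbers for illustration only (not estimates from this post):
# a fundamentals forecast of the incumbent's vote share, with wide uncertainty,
# and a polling average with narrower uncertainty.
fundamentals_mean, fundamentals_sd = 48.0, 5.0
poll_mean, poll_sd = 45.0, 2.0

posterior_mean, posterior_sd = normal_update(
    fundamentals_mean, fundamentals_sd, poll_mean, poll_sd
)
print(f"Posterior vote share: {posterior_mean:.2f} (sd {posterior_sd:.2f})")
```

The more precise source pulls the posterior toward itself, so the polls dominate here. That is one principled alternative to picking ad hoc weights for fundamentals and polling.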


  1. Many news outlets have shown that Trump’s campaigning never really stopped.

  2. There has been evidence that despite abysmal GDP and jobs numbers, the CARES Act was able to cause incomes to increase during the second quarter of 2020.

  3. This relationship makes me wonder why the Senate Republicans have not been willing to pass further stimulus, given that it would greatly help their re-election campaigns.

  4. This article from the Washington Post gives a wonderful overview.

  5. While this piece from the New Yorker is satire, it makes the point well.

  6. Abramowitz (2016)

  7. Correlation between the error term and an explanatory variable is known as endogeneity. Many econometric methods have been developed to deal with this issue for causal inference. Up to this point, because I have focused on prediction, these methods have not been strictly necessary, though I may explore them in the future.