Introduction: Why Polls?

In the last blog entry, we briefly discussed how a fundamentals model cannot account for everything when predicting an election, especially at the state level. In fact, using fundamentals seems like a rather indirect way of predicting an election given that polling exists; in a sense, polling attempts to predict an election by asking a representative sample of people what they think about it.

Many of the most sophisticated election models, such as those from FiveThirtyEight and The Economist, use polls as their backbone. In this week’s lab section, we looked at models that use polls at the national level, but as we’ve discussed before, the United States president is decided by the states and the electoral college.

Comparing Three Models

For the duration of this blog post, I will be working with three separate but related models. The first is a fundamentals-based model that looks very similar to models previously discussed on this blog. The second is a polls only model that relies solely on state polling to predict vote shares. The third is what I refer to as the “polls-plus” model, which incorporates both the fundamentals and the polling data. In all three cases, the model is two-sided, meaning it independently predicts outcomes for the incumbent and the challenger1. The models also shift to using raw vote share, rather than two party vote share, as that is how the majority of polls are conducted. Output for omitted regressions can be found in the appendix.

Fundamentals Model

To start, we need a basic fundamentals model for comparison. We’ll use a variant of the model from last week, this time relying only on second quarter real disposable income growth at the state level and second quarter national GDP, while controlling for state and general era2. The data comes from the Bureau of Economic Analysis’s “Quarterly Personal Income by State” series.
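As a rough sketch, the specification looks something like the following in R. The fundamentals column names (rdi_q2_growth, gdp_q2_growth, era) are placeholders for my actual variable names, and reg_df is the same regression data frame that appears in the output later in this post.

library(dplyr)

# Fundamentals-only model (sketch). Placeholder columns:
#   inc_pv / chl_pv  raw vote share for the incumbent / challenger party
#   rdi_q2_growth    Q2 state-level real disposable income growth
#   gdp_q2_growth    Q2 national GDP growth
#   state, era       fixed effects for state and general era
fund_inc <- lm(inc_pv ~ rdi_q2_growth + gdp_q2_growth + state + era,
               data = reg_df %>% filter(year < 2020, incumbent_party == TRUE))
fund_chl <- lm(chl_pv ~ rdi_q2_growth + gdp_q2_growth + state + era,
               data = reg_df %>% filter(year < 2020, incumbent_party == FALSE))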

Similar to last week, the R-squared values from these regressions are quite low. The incumbent fundamentals model has an R-squared of 0.175, while for the challenger it is 0.163, both relatively close to the low values from previous fundamentals models on this blog.

The mean squared error for the incumbent model is 81.855, and for the challenger it is 71.281, which is consistent with the low R-squared values. For out of sample fit, we find that the average absolute error on the vote margin3, across both models, is 13.079.
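Footnote 3 describes how that out of sample number is computed; a minimal sketch of the leave-one-out loop is below, assuming the incumbent and challenger data frames (inc_df and chl_df, hypothetical names) are row-aligned by state and year.

# Leave-one-out sketch: for each state-year, refit both regressions without
# that observation, then compare the predicted vote margin (incumbent minus
# challenger) against the actual margin.
loo_abs_errors <- sapply(seq_len(nrow(inc_df)), function(i) {
  m_inc <- lm(inc_pv ~ rdi_q2_growth + gdp_q2_growth + state + era,
              data = inc_df[-i, ])
  m_chl <- lm(chl_pv ~ rdi_q2_growth + gdp_q2_growth + state + era,
              data = chl_df[-i, ])
  pred_margin   <- predict(m_inc, newdata = inc_df[i, ]) -
                   predict(m_chl, newdata = chl_df[i, ])
  actual_margin <- inc_df$inc_pv[i] - chl_df$chl_pv[i]
  abs(pred_margin - actual_margin)
})
mean(loo_abs_errors)  # ~13.1 for the fundamentals model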

Polls Only Model

For the polls only model, I work with state level polling from 1972 onward. In many cases, polling is relatively sparse; especially in earlier years, not every state has polls. In addition, because many polls are conducted, I use an average of the polling averages as a single input. This average is calculated by taking the historical polling averages produced between six months before the election and the current week and then taking their mean4. The incumbent’s or challenger’s vote share is then regressed on that value. Because this model is new, we will take a look at the full regression output.
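A sketch of how that single input is constructed is below; state_polls, poll_avg, party, and weeks_left are hypothetical names for the historical polling-average table and its columns.

library(dplyr)

# Average of the polling averages (sketch). The window follows footnote 4:
# averages produced between 24 and 5 weeks before the election.
avg_polls <- state_polls %>%
  filter(weeks_left <= 24, weeks_left >= 5) %>%
  group_by(year, state, party) %>%
  summarise(avg_poll = mean(poll_avg)) %>%
  ungroup()

# avg_poll is then joined onto reg_df and used as the single predictor below.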

We can first look at the linear regression for the incumbent.

## 
## Call:
## lm(formula = inc_pv ~ avg_poll, data = reg_df %>% filter(year < 
##     2020, incumbent_party == TRUE))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3626  -3.1061  -0.5668   2.3519  20.4655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.09468    1.04303   5.843 9.59e-09 ***
## avg_poll     0.96636    0.02358  40.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.664 on 469 degrees of freedom
## Multiple R-squared:  0.7817, Adjusted R-squared:  0.7812 
## F-statistic:  1679 on 1 and 469 DF,  p-value: < 2.2e-16

A few key pieces of information jump out. First, the R-squared value of 0.782 is much, much larger than in the fundamentals model, suggesting a much better fit. Second, the coefficient on avg_poll is 0.966, which suggests a nearly one-to-one relationship between the polling average and the actual vote share; if the average polling value for an incumbent increased by 1 percentage point, their expected vote share would increase by 0.966 percentage points. Then, we can look at the challenger regression.

## 
## Call:
## lm(formula = chl_pv ~ avg_poll, data = reg_df %>% filter(year < 
##     2020, incumbent_party == FALSE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.094  -2.935  -0.229   2.605  15.293 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.02625    1.10526   6.357 4.88e-10 ***
## avg_poll     0.95593    0.02641  36.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 469 degrees of freedom
## Multiple R-squared:  0.7363, Adjusted R-squared:  0.7358 
## F-statistic:  1310 on 1 and 469 DF,  p-value: < 2.2e-16

The results are relatively similar: the model has an R-squared of 0.736, and the coefficient on the average poll value is 0.956. For all the concern about polling accuracy, this model suggests polls are in fact quite predictive, even more than five weeks from the election5.

In terms of in sample fit, the mean squared error for the incumbent is 21.659, and for the challenger it is 22.464, both clearly much better than the fundamentals model.

To examine out of sample fit, we look at the average absolute error on the vote margin, which is 5.844. This value is significantly lower than that of the fundamentals model, continuing the pattern that polls are much more predictive than fundamentals.

Polls Plus Model

The polls plus model takes the fundamentals model and adds the historical polling average as another predictor. Because this model has the most robust set of inputs, one might expect it to also be the most accurate. Due to this model’s length and relative similarity to previous models, full details can be found in the technical appendix.
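In sketch form, the specification is simply the fundamentals regression with avg_poll added (same placeholder column names as in the earlier sketches).

library(dplyr)

# Polls plus model (sketch): fundamentals predictors plus the polling average.
plus_inc <- lm(inc_pv ~ avg_poll + rdi_q2_growth + gdp_q2_growth + state + era,
               data = reg_df %>% filter(year < 2020, incumbent_party == TRUE))
plus_chl <- lm(chl_pv ~ avg_poll + rdi_q2_growth + gdp_q2_growth + state + era,
               data = reg_df %>% filter(year < 2020, incumbent_party == FALSE))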

Unsurprisingly, the R-squared for both the incumbent and challenger is higher, at 0.815 for the incumbent and 0.796 for the challenger6. Interestingly, for the incumbent, the coefficients on both GDP and real disposable income are negative, while the coefficient on the polling average is almost exactly 1.00. This suggests that for incumbents, the relationship between the polling average and actual results in each state is almost exactly one to one. For the challenger, the coefficient on GDP is negative, and the coefficient on the polling average is 0.983, suggesting that a similar relationship holds.

In terms of in sample fit, we find that the regression for the incumbent has a mean squared error of 18.375, while it is 17.404 for the challenger. Consistent with having the highest R-squared, the polls plus model appears to fit the data best in sample.

To examine out of sample fit, we look at the average absolute error on the vote margin, which is 5.407. This value is the lowest of the three models, again following the pattern that the polls plus model fits the data best.

Predictions and the Electoral College

After seeing the varying degrees of accuracy of these three models, it is time to move to prediction. We have second quarter GDP and real disposable income, so to predict the 2020 election we just need polling averages. I used polling averages from FiveThirtyEight as the polling data input7. Using this data, we can predict the outcomes of the 2020 election and look at the results from each model in turn.
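The prediction step, sketched below with the polls plus model as the example, is repeated for each model. Here inc_2020 and chl_2020 are assumed one-row-per-state data frames holding the 2020 fundamentals and FiveThirtyEight polling averages for Trump and Biden, and ev is an electoral-vote lookup table; none of these are shown here.

library(dplyr)

# Predict 2020 state vote shares and tally electoral votes (sketch).
tibble(state    = inc_2020$state,
       inc_pred = predict(plus_inc, newdata = inc_2020),
       chl_pred = predict(plus_chl, newdata = chl_2020)) %>%
  mutate(winner = ifelse(inc_pred > chl_pred, "Trump", "Biden")) %>%
  left_join(ev, by = "state") %>%
  group_by(winner) %>%
  summarise(total_ev = sum(electoral_votes))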

In stark contrast to last week, the fundamentals model predicts a landslide for Trump. This change is directly due to the release of second quarter real disposable income growth by state. Due to the CARES Act, stimulus payments greatly increased many people’s incomes, which tips the scales towards Trump. In this model, Trump is expected to win in a landslide, with 512 electoral votes to Biden’s 26.

The polls only model’s electoral map looks much more reasonable, and is in line with predictions from experts. Biden wins the electoral college comfortably, with 352 electoral votes to Trump’s 186. For Biden, this map represents a relatively feasible path to victory: he wins both Michigan and Pennsylvania, two key historical tipping-point states. Biden also wins North Carolina, a state that FiveThirtyEight projects him to win by 0.06 percentage points and The Economist projects him to win by 1.0 percentage points.

Interestingly enough, the polls plus model projects a Biden landslide. In this model, Biden wins 431 electoral votes, while Trump wins only 107. Biden expands his electoral map, winning Texas, Georgia, Missouri, and Arkansas. While Texas and Georgia seem like plausible Biden wins, Missouri and Arkansas do not. In this model, Biden greatly benefits from state fixed effects, along with period effects. As shown in the appendix, the state fixed effects tend to be either more positive or less negative for the challenger in many states. However, the biggest shift comes from the dummy variable marking elections from 2016 onward.

Because Trump greatly outperformed his polls in 2016, and Clinton underperformed hers, the model has a strong bias towards challengers. The coefficient on this dummy for the incumbent is -3.823, while for the challenger it is -0.086, representing a roughly 3.7-point swing. In 2016, there was good reason for this discrepancy between models and results: Clinton underperformed by about three points due to systematic biases in polling, as non-college-educated white people were under-represented in polls relative to those who turned out to vote. The majority of those people voted for Trump, swinging the election. Some have theorized that such an error may repeat itself in 2020, though many disagree on why: some believe Trump voters are unlikely to admit they would vote for him, but there is little evidence to back up that claim.

From a modelling perspective, what this suggests is that I should not include fixed effects by period of time. Instead, it may be that I should look at other controls that are slow to change, like demographics, to get a similar effect without building in a massive bias for predictions in 2020.
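As a rough, purely hypothetical sketch of what that alternative specification might look like (the demographic columns here are placeholders I have not yet built):

library(dplyr)

# Hypothetical polls plus variant: drop the era dummies and control for
# slow-moving state demographics instead. Demographic columns are placeholders.
alt_inc <- lm(inc_pv ~ avg_poll + rdi_q2_growth + gdp_q2_growth + state +
                pct_white_noncollege + pct_over_65 + pct_urban,
              data = reg_df %>% filter(year < 2020, incumbent_party == TRUE))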


  1. The distinction between incumbent and challenger is made by the party that last held the White House, not by the candidate.

  2. These are defined as 1972 through 1984, 1988 through 1996, 2000 through 2012, and 2016 through 2020. Because 2016 and 2020 are the only two elections in their period, this can also be thought of as controlling for Trump, in a sense.

  3. This is done using leave-one-out validation, applying both the incumbent and challenger regressions to each available state and year. I average across both the incumbent and challenger to avoid throwing more numbers at the reader.

  4. At the time of writing, there are five weeks to go until the election. Therefore, I use historical polling averages produced between 24 and 5 weeks before the election, giving each polling average equal weight. A more sophisticated model (which may be implemented in the future) would weight polls by time until the election, as suggested by Gelman and King (2003). Their findings suggest that polls closer to the election should be weighted more heavily in this average.

  5. In a sense, this fits with Galton (1907), which suggests that polls in aggregate will be quite accurate due to the law of large numbers.

  6. Much higher and I would start to worry about overfitting.

  7. Data from FiveThirtyEight was last pulled on 9/26/20. Nebraska, Illinois, Rhode Island, South Dakota, Wyoming, and the District of Columbia are not included in the polling averages, as there have been no state polls there that FiveThirtyEight deems trustworthy. As a stand-in, I used their demographic-based vote share projection.