Measuring Election Accuracy

One of the most important criteria we use to evaluate voting methods is "Accuracy," but how do we determine if a voting method is accurate? Does it elect the candidates who should win? Are the winners representative? 

There are a number of tools that researchers use to answer these questions, and like all lovers of the scientific method, we advocate taking a close look from multiple perspectives.

 

Evaluating Specific Election Outcomes

For evaluating the accuracy of individual elections when voter preferences are known, one approach is to find the "Condorcet Winner," the candidate who would beat all others in a head-to-head race, if one exists. For voting methods with a more expressive ballot, looking for the highest scoring or highest rated candidate is another common-sense approach. Preference order and level of support can be thought of as measuring quality of support and quantity of voters respectively. The best results are found when both are considered in tandem. In cases where both agree, the election result was certainly correct. When they disagree, or when there are ties, the question of who should have won may become a philosophical debate.

An ideal candidate may not exist, but if they do, it is generally accepted that they would be the candidate who is closest to the ideological center of the electorate, or the candidate who would make as many voters as possible as satisfied as possible with the election outcome.


Election Simulation Models

For comparing the accuracy of various voting methods across multiple elections, most voting scientists turn to election simulations. "Bayesian Regret" (1) is one such type of model and the Ka-Ping Yee Diagrams (2, 3), which are explained in this video from Equal Vote founder Mark Frohnmayer (4), are another. One of the most sophisticated and realistic is "Voter Satisfaction Efficiency" (5), from Dr. Jameson Quinn. Quinn was Vice-Chair at the Center for Election Science and was finishing his PhD in Statistics at Harvard when this study came out in 2017, and he joined the Equal Vote Coalition Board of Directors in 2020. Modeling from John Huang (6, 7) and others (8, 9, 10, 11) represents a newer addition to the body of evidence. Though the specific numbers may differ slightly in some cases, all of this data is in general agreement as to the relative conclusions comparing voting methods. Collectively, this body of simulations represents an invaluable addition to the data we can collect from real world elections.

 

Real World Elections

Elected officials and reform advocates who are considering adopting a new voting method may prefer to wait until the method has been "used in the real world" a number of times. This is helpful for looking at considerations beyond "accuracy," such as implementation logistics, voter education campaigns, and more, but reformers may find themselves in a catch-22 if they are waiting for real world elections to prove that a given method "works."

Real world, empirical election data is one source of information, but unfortunately, it has its limitations.

"In discussions comparing election methods, people often argue for one method or another by presenting examples of cases where a particular method fails or behaves strangely. There are five commonly cited criteria (called universality, non-imposition, non-dictatorship, monotonicity, and independence of irrelevant alternatives) for "reasonable behaviour" of an election method. But it has been mathematically proven that no single-winner election method can meet all five of these criteria (12), so one can always invent situations where a particular method violates one of these criteria. Thus, presenting individual cases of strange behaviour proves little." - Ka-Ping Yee

The fact is that no voting method is perfect 100% of the time, and any method will yield the correct result in non-competitive elections. For this reason, a single election, or even hundreds or thousands of elections may not represent a statistically relevant sample. In much of the world, elections are two party dominated, and many elections don't have multiple competitive candidates or parties. Every voting method will elect the majority preferred winner in a two candidate race, so election data comparing voting methods which doesn't include a robust sample may give a false sense of security.

Further complicating the issue, the choose-one-only style ballot doesn't give us much data to go on. These ballots are not expressive enough to collect the voters' full opinions, and we have no way to definitively determine if the votes cast were honest or dishonest. We also have no way of knowing if factors like vote splitting and the Spoiler Effect distorted the election outcome. For determining if a Plurality election picked the right winner, polling data needs to be considered as well.

 

Notorious Failed Elections

For assessing voting methods with less expressive ballots, pre-voting-day polling and exit polls can be a valuable addition to election results and ballot data. In many cases, ratings are used in this kind of polling because a rating is able to collect the kind of data needed to assess less expressive ballot data and election results. Despite these less expressive ballots, we can draw firm conclusions from the data, polling, and other observations and trends.

For example, failed elections due to vote splitting and the Spoiler Effect can be glaringly obvious. The 2000 presidential election with George W. Bush (Republican) vs. Al Gore (Democrat) and Ralph Nader (Green Party) is a classic example, even if we ignore the electoral college. In that election, a majority of voters were from the left end of the political spectrum. Based on polling we can safely conclude that many Green Party voters would have preferred Gore, and if we had a more expressive voting system, the election would have elected Gore. In 1996 the same scenario happened in reverse where the Republican Bob Dole was likely the candidate preferred overall, but he lost the election to Bill Clinton after voters on the right were split between Bob Dole and Ross Perot.

Among voting scientists there is full consensus that our choose-one-only voting method is wildly inaccurate with more than two candidates in the race. 

“The fact is that FPTP, the voting method we use in most of the English-speaking world, is absolutely horrible, and there is reason to believe that reforming it would substantially (though not of course completely) alleviate much political dysfunction and suffering.” -Jameson Quinn in “A Voting Theory Primer for Rationalists” (13)

Real world data is more insightful when we are looking at election results from voting methods that do use expressive ballots. For example, Instant Runoff Voting (14) (aka Ranked Choice) uses an expressive ballot, but also uses a multi-round, tournament style elimination process which doesn't count all the rankings. When we go back and look at the full ballot data, sometimes we find elections, like the 2009 Burlington, VT mayoral race, where the candidate who won wasn't actually preferred by the voters... according to the ballots cast. The system was repealed shortly thereafter.

 

Condorcet Winners and Ranked Voting

The Condorcet winner (15) is the candidate who was preferred over all others head-to-head, and a ranked ballot or any other ballot that shows preference order is all that is needed to find the Condorcet winner if one exists. Thus, for ranked ballot elections the Condorcet winner is the best way to evaluate election results.

Unfortunately Condorcet has its limitations as well:

  • First, there isn't always a single winner that was preferred over all others. Sometimes preferences are cyclical. (A>B, B>C, C>A.)
  • Second, Condorcet only looks at preference order, not level of support, so there are cases where the Condorcet winner wasn't actually the candidate with the most support. A candidate who is your second choice may be just as good as your favorite, or they could be almost as bad as your worst-case-scenario. Ranked ballots don't have the resolution to allow voters to make those distinctions.
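
To make the head-to-head idea concrete, here is a minimal sketch in Python of how a Condorcet winner could be found from ranked ballots, including the cyclical case from the first bullet. The function name, the ballot format, and the treatment of unranked candidates are assumptions made for illustration, not part of any official tabulation procedure.

```python
from itertools import combinations

def condorcet_winner(ballots, candidates):
    """Return the candidate who beats every other head-to-head, or None.

    `ballots` is a list of (count, ranking) pairs; each ranking lists
    candidates from most to least preferred. Unranked candidates are
    treated as tied for last (an assumption of this sketch).
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_votes = b_votes = 0
        for count, ranking in ballots:
            pos_a = ranking.index(a) if a in ranking else len(ranking)
            pos_b = ranking.index(b) if b in ranking else len(ranking)
            if pos_a < pos_b:
                a_votes += count
            elif pos_b < pos_a:
                b_votes += count
        # Only a strict pairwise majority counts as a win.
        if a_votes > b_votes:
            wins[a] += 1
        elif b_votes > a_votes:
            wins[b] += 1
    for c in candidates:
        if wins[c] == len(candidates) - 1:
            return c
    return None

# Three voters whose preferences form a cycle: A>B, B>C, C>A.
cycle = [(1, ["A", "B", "C"]), (1, ["B", "C", "A"]), (1, ["C", "A", "B"])]
# A small electorate with a clear head-to-head favorite.
clear = [(2, ["A", "B", "C"]), (1, ["B", "A", "C"])]

print(condorcet_winner(cycle, ["A", "B", "C"]))  # None
print(condorcet_winner(clear, ["A", "B", "C"]))  # A
```

Note that the sketch only ever compares positions in a ranking. Nothing in the input says how strongly a voter prefers one candidate to another, which is exactly the second limitation described above.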

Advocates of Instant Runoff Voting (IRV) often argue this point to defend the results of the 2009 Burlington, VT IRV election, which failed to elect the candidate who was preferred over all others. In order to make that argument convincingly, however, we would need to know more than just voters' preference orders. We would need to know how much each voter liked each candidate.

In the Burlington Mayor's race there were three viable candidates-- a Democrat, Republican, and a Progressive-- and all three had significant support. The Democrat was preferred over all the others (the Condorcet Winner) but came in third place after voters' first choice votes were counted in the first round. The Progressive won. This result was widely regarded as a failed election and the Ranked Choice system was repealed shortly thereafter.
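
The pattern described above can be reproduced with a small sketch. The ballot counts below are invented for illustration and are not the actual Burlington results; they are chosen only so that the Democrat places third on first choices while still winning every head-to-head comparison. The function names are hypothetical, and ties are broken arbitrarily.

```python
def irv_winner(ballots):
    """Instant runoff: repeatedly eliminate the candidate with the fewest
    top-choice votes until someone holds a majority of the counted ballots."""
    remaining = {c for _, ranking in ballots for c in ranking}
    while True:
        tallies = {c: 0 for c in remaining}
        for count, ranking in ballots:
            # Each ballot counts for its highest-ranked remaining candidate.
            for c in ranking:
                if c in remaining:
                    tallies[c] += count
                    break
        total = sum(tallies.values())
        leader = max(tallies, key=tallies.get)
        if tallies[leader] * 2 > total or len(remaining) == 1:
            return leader
        remaining.remove(min(tallies, key=tallies.get))

def head_to_head(ballots, a, b):
    """Votes for a over b and for b over a, assuming complete rankings."""
    a_votes = sum(n for n, r in ballots if r.index(a) < r.index(b))
    b_votes = sum(n for n, r in ballots if r.index(b) < r.index(a))
    return a_votes, b_votes

# Hypothetical electorate of 100 voters (NOT real Burlington data).
ballots = [
    (37, ["Republican", "Democrat", "Progressive"]),
    (34, ["Progressive", "Democrat", "Republican"]),
    (18, ["Democrat", "Progressive", "Republican"]),
    (11, ["Democrat", "Republican", "Progressive"]),
]

print(irv_winner(ballots))                                # Progressive
print(head_to_head(ballots, "Democrat", "Progressive"))   # (66, 34)
print(head_to_head(ballots, "Democrat", "Republican"))    # (63, 37)
```

In this made-up example the Democrat is eliminated first with 29 first-choice votes, yet would have beaten either opponent one-on-one.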

Especially problematic was the fact that voters had been told a number of things which proved, in stark terms, to be false:

  1. Voters were told that if their first choice was eliminated, their next choice would be counted. In Burlington, the Republican voters' 2nd choice wasn't counted, because that candidate had already been eliminated.
  2. Voters were told it was safe to vote their conscience. In reality, these voters should have strategically ranked their 2nd choice first, knowing that their first choice wasn't going to win. Voting lesser evil would have gotten them a better outcome. 
  3. Voters were told IRV would elect the majority preferred winner. The Democrat, who was eliminated first, had a larger majority than the Progressive, who won.

Did IRV elect the wrong winner? Many Republican voters ranked the Democrat as their second choice, showing that they preferred the Democrat to the Progressive candidate. If these voters would have actually been significantly more satisfied if the Democrat had won, then the Condorcet winner should have won. On the other hand, if Republicans would have been almost equally dissatisfied with either the Democrat or the Progressive, then the Progressive was probably the candidate with the most support after all.

The point is that in these kinds of close three-way races it's critical to have expressive ballot data in order to determine if the candidate who won had the most support or not. An expressive ballot shows level of support, preference order, and allows voters to express "no preference" if desired. In Burlington, the ballots clearly showed that the Democrat was preferred over all others. The Democrat was the Condorcet winner and so, according to the ballots cast, he clearly deserved to win.

It's important to note that this issue doesn't always favor Progressives or 3rd parties when it fails. IRV could just as easily elect a Republican where a Democrat was preferred, or any other candidate, but when Instant Runoff voting fails, it tends to elect the most polarizing candidate from the largest coalition. Even in situations where the Condorcet winner didn't deserve to win, these kinds of outcomes cast doubt on the legitimacy of the winner, as well as the method itself. 

To learn more about Burlington, read more from Equal Vote here (16), from The Center for Election Science here (17), and from the Center for Range Voting here (18).

Was this election a fluke or did it represent a serious flaw in the system?

 

Simulating Election Accuracy

In order to answer that question, electoral scientists, mathematicians, and political scientists turned to simulations and to the work of Weber in 1978 and Merrill in 1984. The quest to answer this question launched a new era in the science of comparing voting methods.

Merrill, Samuel (1984). "A Comparison of Efficiency of Multicandidate Electoral Systems" (19). American Journal of Political Science. Note that Ranked Choice Voting aka Instant Runoff is labeled as Hare.

One such paper, "Frequency of monotonicity failure under Instant Runoff Voting: Estimates based on a spatial model of elections" (20), begins by stating:

"It has long been recognized that Instant Runoff Voting (IRV) suffers from a defect known as nonmonotonicity, wherein increasing support for a candidate among a subset of voters may adversely affect that candidate’s election outcome. The expected frequency of this type of behavior, however, remains an open and important question, and limited access to detailed election data makes it difficult to resolve empirically. In this paper, we develop a spatial model of voting behavior to approach the question theoretically. We conclude that monotonicity failures in three-candidate IRV elections may be much more prevalent than widely presumed (results suggest a lower bound estimate of 15% for competitive elections)."

This study, from Dr. Joseph T. Ornstein of the University of Michigan and Dr. Robert Z. Norman of Dartmouth College, came out in 2013. These results were seen by many as a red flag, but for the researchers who had been pioneering the work of bringing advances in computer simulations and statistics into the field, these findings only confirmed warnings which had been raised long before.
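
A monotonicity failure is easier to see with numbers. The sketch below uses a textbook-style, invented three-candidate electorate (not data from the paper) in which candidate A wins under IRV, and then loses after ten voters move A from last place to first. The `irv_winner` helper is a simplified, hypothetical implementation with arbitrary tie-breaking.

```python
def irv_winner(ballots):
    """Simplified instant runoff over (count, ranking) pairs."""
    remaining = {c for _, ranking in ballots for c in ranking}
    while True:
        tallies = {c: 0 for c in remaining}
        for count, ranking in ballots:
            # Count each ballot for its top remaining choice.
            top = next(c for c in ranking if c in remaining)
            tallies[top] += count
        leader = max(tallies, key=tallies.get)
        if tallies[leader] * 2 > sum(tallies.values()) or len(remaining) == 1:
            return leader
        remaining.remove(min(tallies, key=tallies.get))

before = [
    (39, ["A", "B", "C"]),
    (35, ["B", "C", "A"]),
    (26, ["C", "A", "B"]),
]

# Ten of the B>C>A voters change their ballots to A>B>C,
# which is strictly more support for A and nothing else.
after = [
    (49, ["A", "B", "C"]),
    (25, ["B", "C", "A"]),
    (26, ["C", "A", "B"]),
]

print(irv_winner(before))  # A  (C is eliminated, C's votes transfer to A)
print(irv_winner(after))   # C  (B is eliminated, B's votes transfer to C)
```

The extra first-choice votes for A change which candidate is eliminated first, and the resulting transfers hand the election to C.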

 

Bayesian Regret

In 2000, Dr. Warren Smith of the Center for Range Voting applied the game theory concept of Bayesian Regret (21) to voting theory, breaking new ground by systematically varying election methods, voter utility models, and strategy models. The chart below, which predates the invention of STAR Voting by over a decade, showed that Score voting when combined with a Top-Two general election topped the charts, even if some voters were strategic. STAR Voting (Score Then Automatic Runoff) is essentially this method, but with a single election rather than a primary and general.

Note that Instant Runoff Voting, the method used in Australia, Ireland, and in some parts of the United States came in 42nd place, and the ubiquitous Choose-One Plurality voting method came in 50th.

Simulations from the Center for Range Voting assessing frequency of Bayesian Regret (lowest is best) and frequency of Condorcet winners (highest is best)
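
As a rough illustration of what a Bayesian Regret simulation measures, the sketch below computes average regret for two simple methods: the total utility of the best available candidate minus the total utility of the candidate actually elected, averaged over many simulated elections. Everything here is a simplifying assumption (uniformly random utilities, fully honest voters, the function names, the parameter values). Smith's simulations varied utility models and strategy far more systematically.

```python
import random

def plurality(utilities):
    """Each voter votes for their single highest-utility candidate."""
    tally = [0] * len(utilities[0])
    for voter in utilities:
        tally[voter.index(max(voter))] += 1
    return tally.index(max(tally))

def honest_score(utilities):
    """Each voter rescales their utilities to 0..1; scores are summed."""
    totals = [0.0] * len(utilities[0])
    for voter in utilities:
        lo, hi = min(voter), max(voter)
        for c, u in enumerate(voter):
            totals[c] += (u - lo) / (hi - lo) if hi > lo else 0.0
    return totals.index(max(totals))

def best_possible(utilities):
    """Reference 'method' that always elects the highest-utility candidate."""
    social = [sum(v[c] for v in utilities) for c in range(len(utilities[0]))]
    return social.index(max(social))

def average_regret(method, n_elections=500, n_voters=25, n_cands=4, seed=0):
    """Mean of (best candidate's total utility - winner's total utility)."""
    rng = random.Random(seed)
    total_regret = 0.0
    for _ in range(n_elections):
        utilities = [[rng.random() for _ in range(n_cands)]
                     for _ in range(n_voters)]
        social = [sum(v[c] for v in utilities) for c in range(n_cands)]
        total_regret += max(social) - social[method(utilities)]
    return total_regret / n_elections

for method in (plurality, honest_score, best_possible):
    print(method.__name__, round(average_regret(method), 3))
```

Lower regret is better, and a method that always elected the utility-maximizing candidate would score exactly zero.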


These statistics foreshadowed the next revelation in voting theory, and in 2014 STAR Voting was invented (22) at the Equal Vote Conference at the University of Oregon.

The Equal Vote Conference, like most events on voting reform, featured presentations from advocates of both Instant Runoff and the Score/Approval voting camps. The two camps (one favoring Cardinal or scored methods, and the other favoring Ordinal or ranked methods) have long been at odds, with both sides citing details of the other's proposals as deal breakers.

STAR Voting combines the two approaches. The realization was that a scoring ballot includes both level of support and preference order, which means that it could be counted both ways-- with a scoring round, and then an instant runoff. This hybrid approach unlocks the simplicity and benefits of tabulation by addition, while also achieving the honest voting incentives which are gained by a preference ballot and top-two runoff.
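
The two counting steps can be sketched in a few lines of Python. This is a simplified illustration: the ballot format, the function name, and especially the tie handling are assumptions of this sketch and do not reproduce the official STAR tie-breaking protocol.

```python
def star_winner(ballots, candidates):
    """Score Then Automatic Runoff over ballots mapping candidate -> 0..5.

    Scoring round: add up all scores and keep the two highest totals.
    Runoff: each ballot counts as one vote for whichever finalist it
    scored higher; ballots scoring both equally count for neither.
    """
    totals = {c: sum(b.get(c, 0) for b in ballots) for c in candidates}
    first, second = sorted(candidates, key=lambda c: totals[c], reverse=True)[:2]
    prefer_first = sum(1 for b in ballots if b.get(first, 0) > b.get(second, 0))
    prefer_second = sum(1 for b in ballots if b.get(second, 0) > b.get(first, 0))
    # Simplification: a tied runoff goes to the finalist with the higher total.
    winner = second if prefer_second > prefer_first else first
    return winner, totals

ballots = (
    [{"A": 5, "B": 4, "C": 0}] * 3 +   # three voters narrowly prefer A to B
    [{"A": 0, "B": 5, "C": 1}] * 2     # two voters strongly prefer B
)

winner, totals = star_winner(ballots, ["A", "B", "C"])
print(totals)   # {'A': 15, 'B': 22, 'C': 2}
print(winner)   # A
```

In this invented example B has the highest total score, but A wins the runoff because three of the five voters scored A above B. The same ballots are simply counted two ways.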

The theory was that STAR voting may offer a compromise that could outperform both Approval and IRV, addressing major criticisms of both, and if so, that the new method may have the power to unite the fractured reform movement.

 

The Ka-Ping Yee Simulations

In 2006, a young researcher named Ka-Ping Yee, who was completing his PhD in Computer Science at UC Berkeley, introduced a way to examine single-winner election methods via computer graphics. Yee Diagrams (2), as they are now widely known, show candidates and voter blocs in a two-dimensional political space. Yee published all his code as open source, and many other researchers since have been able to build on his work (23).

These kinds of visualizations are useful in that you can see exactly how ideologically close or far each voter is from each candidate. The color of the background represents which candidate would win under each method if a randomized electorate, centered at that point, were to vote. For example, in the Plurality diagram even if the electorate was centered right next to the green candidate they would lose. In Approval the results look fair, and in the IRV chart you can see extreme distortions where if the center of public opinion is close to green the winner looks almost random. Yee Diagrams represent a simplification of our complex political spectrum, but do a good job at illustrating common phenomena that can affect election outcomes.
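
The basic recipe behind such a diagram can be sketched as follows. This is not Yee's code: the candidate positions are made up, the electorate is a small fixed ring of voters rather than a randomized cloud, only Plurality is shown, and the output is a grid of letters rather than a colored image.

```python
import math

# Hypothetical candidate positions in a two-dimensional opinion space.
CANDIDATES = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}

# A deterministic stand-in for a randomized electorate: one voter at the
# center plus eight voters on a unit circle around it.
OFFSETS = [(0.0, 0.0)] + [
    (math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)
]

def plurality_winner(center):
    """Winner when an electorate centered at `center` votes for the nearest candidate."""
    tally = {name: 0 for name in CANDIDATES}
    for dx, dy in OFFSETS:
        voter = (center[0] + dx, center[1] + dy)
        nearest = min(CANDIDATES, key=lambda n: math.dist(voter, CANDIDATES[n]))
        tally[nearest] += 1
    return max(tally, key=tally.get)

def yee_grid(xs, ys):
    """One row of winner labels per y value, one column per x value."""
    return ["".join(plurality_winner((x, y)) for x in xs) for y in ys]

grid = yee_grid([0, 5, 10], [0, 5, 10])
for row in grid:
    print(row)
```

Swapping `plurality_winner` for a function implementing another method, and coloring each cell by its winner, gives the kind of side-by-side comparison described above.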

Yee's diagrams illustrated some serious pathologies with the Plurality and Instant Runoff methods, but didn't include STAR Voting, which hadn't been invented at the time. In 2017 Mark Frohnmayer, using Yee's code, created a video called "Animated Voting Methods" (24) which adds Score Voting, STAR Voting (aka Score Runoff Voting) as well as a one-voter "ideal winner" model for comparison. These animations showed that where plurality and IRV tend to squeeze out candidates in the center, favoring more polarizing candidates, Score and Approval may give an advantage to candidates who are positioned in between others, though to a lesser extent. STAR Voting consistently performs closest to the ideal model of the systems visualized.

These findings corroborated those from Warren Smith in his paper "Pro-Extremist versus Pro-Centrist bias in Voting Methods" (25). This point often brings up a philosophical debate, with a number of advocates for IRV, especially those from the Green Party, arguing that this is a feature, not a bug. On the other side, many advocates of Approval voting consider a slight centrist bias to be an advantage which could translate to more homogenized legislatures who may be less stagnated by infighting and thus could be more effective and efficient.

These voting reform advocates fall for the common trap of preferring the system that they believe will give them an advantage, but they miss a key point. These biases depend on the position of the candidates relative to each other, not the voters, and may not correlate at all to the right-left political spectrum. A bias which favors the Green Party is just as likely to favor the Alt-Right in a red district, or even a centrist in a deep blue district. Furthermore, these flaws can be exploited by strategically nominating candidates, much as intentionally running a spoiler is used today to rig an election.

Of course, at the Equal Vote Coalition we prefer unbiased, accurate, and representative elections. Voting methods in this category according to the analysis we've seen include STAR voting, the Condorcet methods, and Score Voting if combined with a Top-Two general election. Adding a Top-Two general election to Approval may address this concern, depending on voter behavior (26).

 

Voter Satisfaction Efficiency

One of the most cutting-edge tools for measuring election accuracy is VSE, or Voter Satisfaction Efficiency (5), which came out in 2016. Developed by Dr. Jameson Quinn, who at the time was completing his PhD in Statistics at Harvard and was Vice-Chair at the Center for Election Science, VSE analyzes voting methods using thousands of simulated elections across a wide variety of scenarios. Factors and variables like strategic voters, voter blocs who cluster on issues, number of candidates, polarization in the electorate, and more are considered to help us determine when and how often an election system elects the best candidate. In VSE the candidate who should win is defined as the "candidate that would make as many voters as possible, as satisfied as possible with the election outcome."
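
As we understand the metric, VSE rescales average voter satisfaction so that always electing the ideal candidate scores 100% and picking a candidate at random scores 0%. The sketch below shows that calculation on two invented elections. The function name, the input format, and the numbers are assumptions for illustration only; the real study averages over thousands of simulated electorates.

```python
from statistics import mean

def voter_satisfaction_efficiency(elections):
    """VSE over a list of (social_utilities, winner_index) pairs.

    social_utilities[i] is the total satisfaction of all voters if
    candidate i is elected; winner_index is who the method elected.
    """
    chosen = mean(social[w] for social, w in elections)
    best = mean(max(social) for social, _ in elections)
    # Expected satisfaction from picking a candidate uniformly at random.
    rand = mean(mean(social) for social, _ in elections)
    return (chosen - rand) / (best - rand)

# Two made-up elections: the method picks a decent but not ideal
# candidate in the first, and the ideal candidate in the second.
elections = [
    ([70, 80, 30], 0),
    ([50, 90, 40], 1),
]

print(voter_satisfaction_efficiency(elections))  # 0.8
```

Because it is built from satisfaction totals, a VSE of 80% does not mean the best candidate won 80% of the time. It means the method captured 80% of the available improvement over a random pick.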

Voter Satisfaction Efficiency makes a strong case for STAR Voting. In VSE, STAR topped the charts, coming in as more accurate than all other voting systems that are being seriously advocated for, many of them by large margins. The only voting system that was close to on par was a Condorcet Method called Ranked Pairs (27), which had previously set the bar for accuracy but which has been considered too complex for real world elections. 

Here are some of the findings that we can extrapolate from the VSE graphs:

  • STAR is among the very best of the best. When voters are honest, STAR delivers its best results with a VSE of over 98%. Under less than ideal circumstances, such as elections where a large portion of voters are strategic, STAR was still highly accurate with a VSE of over 91%.

  • STAR Voting at worst was basically just as accurate as the best-case-scenario for IRV (commonly referred to as Ranked Choice Voting) and was much better than Plurality Voting (our current system) in every scenario. For comparison, IRV had a VSE of 80-91%, and Plurality voting had a VSE of only 71-86%.

  • STAR showed a high resiliency to strategic voting, with results closely clustered regardless of voter strategy. This means that tactical voting has a much smaller impact on overall election accuracy compared to other systems. Even if many people try to "game the system," the election will still come out in relatively good shape. In this category 3-2-1 Voting (28) is another method which did particularly well.

  • Though STAR does well even if voters are strategic, VSE strategy simulations showed that STAR doesn't incentivize strategic voting. While no voting method can eliminate all opportunities for strategic voting (29), strategic and dishonest voting is just as likely to backfire as it is to help the individual voter in STAR. In contrast, strategic voting under Instant Runoff Voting was found to be incentivized almost three times as often as it backfired. 

The VSE chart below shows how often strategic voting works compared to how often it backfires. Better results are higher and further to the left. This chart shows how extremely dependent on strategic voting Plurality voting is, with a ratio of 17:1 (strategy works 17 times for every time it backfires), by far the worst out of all voting methods studied.

Of the methods which are the subject of active campaigns in the U.S., STAR Voting boasts a 1:1 ratio, indicating that strategic voting will not give voters an edge. Ideal Approval comes next with 2.6:1, then comes IRV (Ranked Choice) with 2.7:1, and then Score Voting with a ratio of 3:1. Note that Score voting (which gets considerable criticism for being "gameable") is only slightly worse than IRV, and still over 5 times better than the current system on this metric.
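
The comparison above is simple arithmetic, sketched here for readers who want to check it. The figures are the ones quoted in the text, read as the number of times strategic voting helps for each time it backfires. That reading is our interpretation of the quoted ratios, and nothing here is recomputed from the underlying VSE data.

```python
# Times strategy helps per one time it backfires, as quoted in the text.
works_per_backfire = {
    "Plurality": 17.0,
    "Score": 3.0,
    "IRV": 2.7,
    "Ideal Approval": 2.6,
    "STAR": 1.0,
}

def share_of_attempts_that_help(ratio):
    """Convert a works:backfires ratio into the fraction of cases that help."""
    return ratio / (ratio + 1.0)

for method, ratio in works_per_backfire.items():
    print(f"{method}: {share_of_attempts_that_help(ratio):.0%} help")

# How many times worse Plurality is than Score on this metric.
plurality_vs_score = works_per_backfire["Plurality"] / works_per_backfire["Score"]
print(round(plurality_vs_score, 2))  # 5.67
```

A 1:1 ratio corresponds to strategy helping in half of the cases where it changes the outcome, which is why it offers no expected edge.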

You can learn more about Voter Satisfaction Efficiency here (5).

 

Accuracy and Other Key Considerations

Advances in voting theory have given the modern era of voting reform a huge advantage compared to the reformers of yesteryear. While a few of the older methods stand up to the test of time (including Condorcet and Score voting) and do deliver outcomes significantly better than the ubiquitous Choose-One Voting, modern simulations have revealed serious flaws such as unrepresentative outcomes in methods like Instant Runoff Voting, which is the subject of reform efforts around the world. Simulations have allowed for high level analysis of strategic incentives in ways that were not previously possible.

While "Accuracy" is of course one of the most important considerations in the quest to find the best voting method, and it makes a strong case for STAR Voting, there are other factors that absolutely warrant consideration as well. At the Equal Vote Coalition (30), a nonpartisan nonprofit focused on voting reform, we've identified five overarching pillars which need to be maximized for better voting, healthy political discourse, and fair representation: Equality, Honesty, Accuracy, Simplicity, and Expressiveness (31).

 

Accurate: Winners are representative and accurately reflect the will of the people. Election accuracy is assessed using a variety of metrics including Voter Satisfaction Efficiency. 

Equal: Does not give some types of voters an unfair advantage. Voting methods which ensure an equally weighted vote (32) eliminate vote splitting and the spoiler effect by definition. Does not bias in favor of some types of candidates over others. See Center Squeeze (33), Center Expansion (34), and Electability Biases (35).

Honest: Safe to vote your conscience; strategic voting is not incentivized (36).

Simple: Easy to understand, easy to tabulate, easy to implement, easy to audit.

Expressive: Voters are able to express their full nuanced opinion. 

 


Sources:

1.) "Range voting with mixtures of honest and strategic voters" from Dr. Warren Smith, PhD.

2.) Ka-Ping Yee Diagrams are used to illustrate the behavior of election methods, given a fixed set of candidates in a two-dimensional preference space.

3.) Yee Diagrams on Electowiki.

4.) "Animated Voting Methods" is a video from Equal Vote founder Mark Frohnmayer

5.) "Voter Satisfaction Efficiency," from Harvard PhD in Statistics Dr. Jameson Quinn.

6.) Modeling from John Huang, a newer addition to the body of evidence, which is in general agreement as to the relative conclusions comparing voting methods.

7.) "Strategic Voter Simulations" John Huang, 2 February 2021

8.) Dave Gallets Simulations: "Simulation on average distance of winner from consensus candidate and number of failed elections by number of candidates for FPTP, STAR, and IRV" by Dave Gallets.

9.) "STAR voting vs other systems on a 2D political compass" by Psephomancy.

10.) Yee Diagrams by Essenzia.

11.) "Are Condorcet and minimax voting systems the best?" by Richard B. Darlington of Cornell University (2021). ArXiv

12.) Arrow's Theorem on Wikipedia

13.) “A Voting Theory Primer for Rationalists” by Dr. Jameson Quinn PhD, Harvard Statistics, Board Member at the Equal Vote Coalition

14.) Instant Runoff Voting on Wikipedia

15.) Condorcet criterion on Wikipedia

16.) "What the heck happened in Burlington?" by Mark Frohnmayer, Equal Vote Founder

17.) "The Spoiler Effect" by Aaron Hamlin, Executive Director at the Center for Election Science

18.) "Burlington Vermont 2009 IRV mayor election: Thwarted-majority, non-monotonicity & other failures (oops)" from Dr. Warren Smith, PhD. Princeton, the Center for Range Voting

19.) Merrill, Samuel (1984). "A Comparison of Efficiency of Multicandidate Electoral Systems" American Journal of Political Science

20.) "Frequency of monotonicity failure under Instant Runoff Voting: Estimates based on a spatial model of elections" from Dr. Joseph T Ornstein of the University of Michigan and Dr. Robert Z. Norman of Dartmouth College, 2013

21.) "Bayesian Regret for dummies" from Dr. Warren Smith, PhD. Princeton, the Center for Range Voting.

22.) "About" from the Equal Vote Coalition website

23.) "To Build a Better Ballot: an interactive guide to alternative voting systems," by Nicky Case

24.) "Animated Voting Methods" by Mark Frohnmayer, Founder of the Equal Vote Coalition, 2017

25.) Dr. Warren Smith, PhD, in his paper "Pro-Extremist versus Pro-Centrist bias in Voting Methods."

26.) The Center for Election Science, "Approval Voting Tactics"

27.) Ranked Pairs on Wikipedia

28.) 3-2-1 Voting on Electowiki

29.) Gibbard–Satterthwaite theorem on Wikipedia

30.) The Equal Vote Coalition website

31.) "Criteria: Evaluating voting systems and the criteria we judge them by" by Sara Wolk, Executive Director of the Equal Vote Coalition

32.) "What is an Equal Vote?" by Sara Wolk, Executive Director of the Equal Vote Coalition

33.) "Center Squeeze" on Electowiki

34.) "Yee Diagram" on Electowiki

35.) "Strategic Voting with STAR?" by Mark Frohnmayer, Founder of the Equal Vote Coalition

36.) "2020 Vision: Could STAR Voting slay the “electability” dragon?" by Sara Wolk, Executive Director of the Equal Vote Coalition