Tetlock's Good Judgment Project—Can Politics Be Predicted?
12/27/2013

U. of Pennsylvania psychologist Philip Tetlock has been studying "expert political judgment" for decades. An early finding was that people who are employed to go on TV and make exciting forecasts about the future aren't very accurate. Simple extrapolation models—things are going to keep on keeping on, only more so—tend to be a little more accurate than media experts (who, in their defense, are on TV to be interesting—the notion that the near future is probably going to be a lot like the recent past is just about the definition of Bad TV).

Then he determined that people who are ideological one-trick ponies (hedgehogs, to use Isaiah Berlin's terminology) are worse at forecasting than people who have more arrows in their quiver (foxes). (Here are some other Tetlock findings.)

A few years ago Tetlock started the Good Judgment Project in which anybody on the Internet can try their hand at forecasting the upcoming year's events. (It's subsidized by federal spooks at IARPA.)

Tetlock recently wrote in the Economist:

In the late 1980s one of us (Philip Tetlock) launched such a tournament. It involved 284 economists, political scientists, intelligence analysts and journalists and collected almost 28,000 predictions. The results were startling. The average expert did only slightly better than random guessing. Even more disconcerting, experts with the most inflated views of their own batting averages tended to attract the most media attention. Their more self-effacing colleagues, the ones we should be heeding, often don’t get on to our radar screens.

That project proved to be a pilot for a far more ambitious tournament currently sponsored by the Intelligence Advanced Research Projects Activity (IARPA), part of the American intelligence world. Over 5,000 forecasters have made more than 1m forecasts on more than 250 questions, from euro-zone exits to the Syrian civil war. Results are pouring in and they are revealing. We can discover who has better batting averages, not take it on faith; discover which methods of training promote accuracy, not just track the latest gurus and fads; and discover methods of distilling the wisdom of the crowd. 

The big surprise has been the support for the unabashedly elitist “super-forecaster” hypothesis. The top 2% of forecasters in Year 1 showed that there is more than luck at play. If it were just luck, the “supers” would regress to the mean: yesterday’s champs would be today’s chumps. But they actually got better. When we randomly assigned “supers” into elite teams, they blew the lid off IARPA’s performance goals. They beat the unweighted average (wisdom-of-overall-crowd) by 65%; beat the best algorithms of four competitor institutions by 35-60%; and beat two prediction markets by 20-35%. 

To avoid slipping back to business as usual—believing we know things that we don’t—more tournaments in more fields are needed, and more forecasters. So we invite you, our readers, to join the 2014-15 round of the IARPA tournament. Current questions include: Will America and the EU reach a trade deal? Will Turkey get a new constitution? Will talks on North Korea’s nuclear programme resume? To volunteer, go to the tournament’s website at www.goodjudgmentproject.com. We predict with 80% confidence that at least 70% of you will enjoy it—and we are 90% confident that at least 50% of you will beat our dart-throwing chimps.
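The regression-to-the-mean argument in the passage above can be checked with a toy simulation. This is only a sketch under invented assumptions: the forecaster count, the question count, the 60% baseline hit rate and the `skill_sd` parameter are all hypothetical, and nothing here reproduces the tournament's actual scoring. It shows persistence of skill, not the further claim that the "supers" improved.

```python
import random
import statistics

def simulate(n_forecasters=2000, n_questions=100, skill_sd=0.0, seed=0):
    """Return (year-1 accuracy of the year-1 top 2%,
               year-2 accuracy of that same group,
               year-2 accuracy of everyone)."""
    rng = random.Random(seed)
    # Hypothetical model: each forecaster has a fixed chance of calling a
    # yes/no question correctly. skill_sd=0 means everyone is identical,
    # so any spread in results is pure luck.
    skills = [min(0.95, max(0.05, rng.gauss(0.6, skill_sd)))
              for _ in range(n_forecasters)]

    def season(p):
        # Fraction of questions called correctly in one season.
        return sum(rng.random() < p for _ in range(n_questions)) / n_questions

    year1 = [season(p) for p in skills]
    year2 = [season(p) for p in skills]

    n_top = max(1, n_forecasters // 50)  # top 2% by year-1 results
    top = sorted(range(n_forecasters), key=lambda i: year1[i], reverse=True)[:n_top]

    return (statistics.mean(year1[i] for i in top),
            statistics.mean(year2[i] for i in top),
            statistics.mean(year2))

if __name__ == "__main__":
    print("pure luck :", simulate(skill_sd=0.0))
    print("real skill:", simulate(skill_sd=0.1))
```

In the pure-luck run, last year's champs fall back to the crowd average in year 2. In the run with real differences in skill, they stay well ahead, which is the pattern the tournament organizers report.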

One of my readers is a super-forecaster in Tetlock's tournaments, and he writes:

Hi Steve,  

I saw your question at WaPo pages from two days ago and, since no one answered and I am not about to make an account with them, I thought I'd answer by email. As a regular reader of your blog, it's my pleasure. I am one of the 120 "superforecasters", doing this for the third year (they upgraded our numbers from 60 as of last year).

Are you allowed to pick and choose which questions to answer? 

Yes, you are, but with some limitations:  

1. You must answer 1/3 (for regular participants) or 1/2 (for "supers") of all questions during the season. This year, it's about 150 between August 1 and May 1. If you don't, they drop you and don't pay a laughable (considering the time spent) honorarium of $150 ($250 for those "retained" from past year).

2. You will be scored for all questions no matter what. As the very first solid finding of the project was that teams of 12-15 give better forecasts than individual forecasters (duh!), almost all participants today work in teams. For all questions that you did not forecast (and for all days that you didn't forecast any particular question), your score will be the median of your team's. The team's score is the median of the individual forecasters' scores.
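The fill-in rule in point 2 can be sketched in a few lines. This is a guess at the mechanics, not the project's actual code: the function names, the member names and the example numbers are hypothetical, and it assumes each forecaster gets a single numeric error score per question (lower is better), which the email does not spell out. It also ignores the per-day version of the rule.

```python
from statistics import median

def fill_missing(scores_by_member):
    """Given {member: score or None} for one question, give every member
    who skipped the question the median score of those who answered."""
    answered = [s for s in scores_by_member.values() if s is not None]
    if not answered:
        raise ValueError("nobody on the team forecast this question")
    fallback = median(answered)
    return {name: (fallback if s is None else s)
            for name, s in scores_by_member.items()}

def team_score(scores_by_member):
    """Team score for one question: the median of the individual scores."""
    return median(fill_missing(scores_by_member).values())

if __name__ == "__main__":
    # Hypothetical four-person team; "cy" skipped this question.
    question = {"ann": 0.10, "bob": 0.30, "cy": None, "dee": 0.50}
    print(fill_missing(question))  # cy is assigned 0.3, the median of the other three
    print(team_score(question))
```

Under this reading, skipping a question never makes you look better or worse than your typical teammate, so the incentive is to answer only where you think you can beat the team median.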

 As Tetlock's team keeps saying, doing well in this weird competition involves more than sheer luck. (I suppose that's their biggest finding to date and they are doing all kinds of silly psychometric tests on us to see what they can correlate it to). Two examples:  

- In the first year, I finished #4 in my "experimental condition" that had about 240 participants. All forecasts were individual in this condition. Top 5 from each group became "supers", others were allowed to keep going as usual. Majority, I imagine, dropped out because it truly takes a lot of time. At least three others in my group who were in Top 10 but didn't make it to "supers" did well enough in Season 2 to achieve the "super" status in this year. Even if they "competed" with a pool of >3,000.  

- Last year, a particular group of "supers" beat everyone in four other groups by a largish margin. I don't know how they were able to be so good! Today, this same team still has the best score even if "supers" competition is now among eight groups. 

And yes, the "supers" consistently beat everyone else but I think it has a lot to do with self-selection for folks willing to google on a regular basis information pertaining to completely weird stuff like this:

Will China seize control of the Second Thomas Shoal before 1 January 2014 if the Philippines structurally reinforces the BRP Sierra Madre beforehand? (The answer is supposed to come as probability and can be updated daily if desired.) 


The BRP Sierra Madre is the rusting hulk of a ship that the Filipino navy ran onto a reef in the Spratly Islands in 1999. It has maintained a half-starved platoon on board ever since in an attempt to establish a legal precedent for underwater oil and gas rights in the South China Sea. Very Waterworldy, except with a far smaller budget. (By the way, I had never heard of this ship until two hours ago.)

My prediction is: This Probably Won't Happen by January 1, 2014, considering it's already December 26th (assuming it hasn't already happened and I didn't notice).

Now that I think about it, I wouldn't be surprised if a fair amount of competence in this tournament derives from having a sense of just how long it takes for stuff to happen. Since the game looks at typically annual time frames so that it can determine winners and losers in a reasonable amount of time, I bet a lot of losers have a tendency to say, "Yeah, that will probably happen" without estimating how long it could take for it to happen.

For example, say there is a question that asks if the coalition government in Britain will come undone. In the long run, the answer is surely Yes. But will it happen within the next year?
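To put rough numbers on the "how long does stuff take" point, here is a back-of-the-envelope sketch. It assumes the event is equally likely to happen at any moment (a constant-hazard, exponential waiting-time model), which is my simplification and not anything the tournament uses, and the three-year expected wait is a made-up figure.

```python
import math

def prob_within(window_years, expected_wait_years):
    """Chance the event happens within the window, assuming a constant
    hazard rate (exponentially distributed waiting time)."""
    rate = 1.0 / expected_wait_years
    return 1.0 - math.exp(-rate * window_years)

if __name__ == "__main__":
    wait = 3.0  # hypothetical: the event takes three years on average
    print(round(prob_within(10, wait), 2))        # about 0.96: "surely Yes" in the long run
    print(round(prob_within(1, wait), 2))         # about 0.28 within the next year
    print(round(prob_within(6 / 365, wait), 3))   # well under 1% with six days left
```

The same event that is a near-certainty over a decade is a long shot within a one-year scoring window, and almost a non-starter with less than a week to go.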

Another trap is that players in the real world are also making the same calculations as you are. For example, say you figure there is a high probability the Chinese will immediately seize control of the Second Thomas Shoal in retaliation if the Filipino government fixes up its rusting hulk. After all, the Filipinos can't afford to stop the Chinese.

But what if the Filipino foreign minister is of the same mind as you about the unstoppability of Chinese retaliation? Perhaps he or she reasons: at present, we can't stop the Chinese from retaliating if we fix up our Waterworld set, so let's not fix it up in 2013. Maybe in 2014 or later we will be able to put together a coalition of powers to deter the Chinese from seizing the Second Thomas Shoal, but we can't do it yet, so let's not cause a confrontation now that we are sure to lose.

Thus, on this question which is conditional upon the Filipinos setting the ball rolling, the only way to win if you think the Chinese would retaliate is for the Filipino government to have worse judgment than you have. You only get credit for getting this question right if the Chinese agree with you and the Filipinos disagree.

 As you can imagine, it requires more or less the same mentality as the one demonstrated by those tireless Wikipedia editors. As such, I am sure that you won't be shocked to learn that there are very few women among the "supers" (and among all participants, I suspect). Last year we had one out of 12 and she dropped out. This year we have one out of 15 and she is quite good but not overly enthusiastic, devoting minimum amount of time to the project. Sexism leaves its indelible mark again :-) 


It's interesting that Tetlock and Co. randomly assign the best forecasters to all-star teams. I wonder if voluntary teams of stars would be even better. Theoretically, you'd want your team to be made up of different specialists, like a comic book universe superhero squad such as the Avengers or the Justice League, so negotiating the makeup of your own team ought to be best. But perhaps personalities would get in the way?

With so many questions, the role of inside information is likely minimized. I mean, do you know anybody who is a big wheel in Spratly Islands circles? Probably not. And if you do, you probably don't know too many people in the Turkish constitution-writing business.

Yet, much of the traditional role of diplomats was to collect inside information at social gatherings by charming and lulling other diplomats into spilling the beans about their governments' intentions regarding Constantinople.

Of course, with inside information in the financial sphere, there has long been a metaphysical debate over the prime mover exception. In the early 1990s, the feds prosecuted Michael Milken for making profitable forecasts about stock prices based on inside information about upcoming takeover bids his stooges were launching. Milken's defenders argued that logically there had to be an exception for the ultimate insider in a takeover bid, and that Milken was obviously the main man, not the nominal corporate raider whom he was financing. An interesting point, but the feds put him in prison for a couple of years, anyway. (That seems like a long time ago.)

I recently read the latest Lawrence of Arabia biography, Lawrence in Arabia, and it seems like T.E. Lawrence for a few years had the knack of prediction when it came to the Middle East: e.g., don't land at Gallipoli, land at Alexandretta (not that we can tell what would have happened in the counterfactual). Of course, to help make some of his predictions come true—e.g., Prince Feisal looks like a winner—he would get on his camel and go blow something up in Prince Feisal's name, which participants in the Good Judgment Project are probably discouraged from doing.

Another question would be how big a g-factor there is in world affairs forecasting. Do people who specialize in southeast Asia outpredict global generalists on the Second Thomas Shoal question? Or to be a global generalist, do you just have to be better overall than the regional specialists?

Consider major league baseball players by way of analogy. Often they have fairly specialized roles in the majors such as closer or utility infielder. Yet in high school, they typically played shortstop or centerfield (when they weren't pitching), and they almost all batted third or cleanup in the lineup. In other words, they were just the best all-around ballplayers on their high school teams. There is a substantial baseball g-factor.

On the other hand, the minor leagues are full of good all-around baseball players who lack the special skill—a 95 mph fastball rather than a 90 mph one, or 20-13 vision rather than 20-18, or being a plodding 240 pound lefty first baseman rather than a 240 pound righty first baseman—that would make them useful in the majors.

It's almost tautological that there will be both a g-factor and specific subfactors (such as language knowledge) in international affairs forecasting. The question will be what is the balance and what are the most important subfactors.
