Teddy's Rat Lab: COMMENT: Statistics

Thursday, November 1, 2012

COMMENT: Statistics

http://teddysratlab.blogspot.com [Full link to blog for email clients.]

UPDATE 1: Thanks to INSTAPUNDIT and Insta-ssistant Sarah A. Hoyt!

Update 1a: WOW! >6000 page views. Thanks, Insty!

UPDATE 2: I've corrected the "N-factorial" error - it's really summed N. Thanks to the folks who pointed it out.

Update 2a: We're still thrashing about that N factor on number of tests. I get 1/2 * N * (N+1) which is the summation I have listed. Several other readers have convinced me it should be N-squared - if so, though, it should be N-squared minus one, since the final combination (leaving out all 2012 coefficients, and replacing with 2008 coefficients) is essentially the 2008 model we are comparing against.

Many, many, MANY thanks to the readers! - s2la

[NEWS: Sorry, I've been sick, and it really affects my ability to sit and write, but I think that's out of the way. In better news, the Carpal Tunnel surgery on left wrist went well, I now am eager to get the right wrist done after the first of the year. I promise to build up a back-log of posts to bridge any down-time. Best news, I have figured out that The Lab Rats' Guide to the Brain needs an additional chapter - one that fits in with my recent research and conferences. Starting next week - a multi-blog series on "Interfacing the Brain." We'll start with biofeedback and jump to modern EEG-driven brain-to-machine interfaces, talk about neural prosthetics, the "inverse problem," and discuss the difficulty in deciphering and mapping brain activity to specific "thoughts" and actions. To wrap up the section, I'll delve into Dr. Travis S. Taylor's quantum theories of the brain (and Mind).]

SO, first post back in two weeks, and I didn't really mean to get political - but I've been asked about statistics, polls and models. So I'm going to talk about statistical modeling and assessment thereof. I promise to go easy on the politics.

Dr. Nate Silver is a "sabermetrics" modeler whose rise to fame revolves around statistical analysis of sports (http://www.baseballprospectus.com/news/?author=59). He is now in the news for his models of the 2008 and now 2012 U.S. Presidential elections. There are many supporters and detractors, and I'm not really going to discuss merits except in terms of statistical methodology. Currently Silver's model simulation yields a win for Obama 79% of the time, and for Romney 21% of the time.

In a recent discussion elsewhere, Matt P. mentioned that Silver says that if O wins, then of course his model was right by majority result - if R wins, then of course, the model predicted a long-shot, but still possible win. Further, it would be very hard to discern that difference.

Matt then asked:

"Is there, in fact, some way to tease out the difference after the results are in? is there some analysis he can perform to try to determine whether his model was valid and we saw the long shot come in, or that his model had Romney at too low a probability all along? If so, can you give me the idiot's guide explanation of that methodology?"

So, at the risk of a looooooooong explanation, I'll start with methodology, and then move on to the question about how to tease out differences. This is long, you are forewarned! BTW, CAPS are not shouting, I'm using them for certain key statistical concepts.

There's only one statistically valid way to test a predictive model, but there are variations for later application, so we'll start with the critical assessment and call it METHOD 1:

METHOD 1: Start with an "empirically" designed model (i.e. designed from equations for observed trends in natural data) in which you have a bunch of equations that supposedly can be fitted to datasets of the desired type. For continuous data that create a series or waveform like electromagnetic spectra, you can use Fourier transforms, wavelets, power spectra, principal components analysis, etc. For discontinuous data of discrete measurements such as demographics, epidemiology, medical trends, you use factor analysis, canonical correlation (if you presume correlation of any inputs), logistic regression, nonlinear analyses (Laguerre linear approximations of Volterra-Wiener kernels to give the technical term). These are the technical terms for the most common "models." Each is a set of equations that can describe any set of data. What makes the model fit the data is the weightings and coefficients.

Given a model, you "teach" it with one-half of your data set. Note that all data in the set is collected at the same time, same subjects and under the same conditions - but you divide in half by any method of your choice. This also means, in the event of polling data, that you are using data from an already completed event - with both the pre-election polls and election results. Teaching a model means that you know the input and the output, and you use some version of iterative approximation to find coefficients to the equations that force the model to match the known result.

Once your model is "taught" and you have the coefficients and/or weightings that make the generic equations "fit" your data, you then apply the remaining half of your starting data set to the equations and derived coefficients. Do not adjust the coefficients, and do not reveal or apply the known results to the model. Compare the predicted results with the actual results and assess with one of the accepted statistical tests (for those taking notes, a Kolmogorov-Smirnov, KS, test is a good one to use here). If you get >80% correct or a correlation of >0.7 [*] on this, the "INTERNAL" test, then you can proceed.
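For the code-minded, here is a minimal sketch of the INTERNAL test, assuming a toy one-coefficient straight-line model and made-up data (nothing to do with polls or with anyone's real model). The function names and the noise level are my own illustrative choices, and I use the correlation criterion rather than a KS test to keep it short.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def pearson_r(us, vs):
    """Pearson correlation between two equal-length sequences."""
    n = len(us)
    mu = sum(us) / n
    mv = sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = sum((u - mu) ** 2 for u in us) ** 0.5
    sv = sum((v - mv) ** 2 for v in vs) ** 0.5
    return cov / (su * sv)

def internal_test(data, seed=0):
    """Teach on one half, predict the other half with frozen coefficients."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)      # divide in half "by any method"
    half = len(shuffled) // 2
    train, held_out = shuffled[:half], shuffled[half:]
    a, b = fit_line([x for x, _ in train], [y for _, y in train])
    predicted = [a + b * x for x, _ in held_out]   # no re-fitting here
    actual = [y for _, y in held_out]
    return pearson_r(predicted, actual)

# Made-up data set: a known linear trend plus Gaussian noise (assumption).
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)))

r = internal_test(data, seed=0)
print(f"held-out correlation r = {r:.3f}, r-squared = {r * r:.3f}")
print("proceed to EXTERNAL test" if r > 0.7 else "model fails INTERNAL test")
```

The point of the sketch is the bookkeeping: the coefficients come only from the teaching half, and the held-out half is touched exactly once, for scoring.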

METHOD 2 can then only be performed if you successfully pass an INTERNAL test.

METHOD 2: Once a model is successful, you can now test it with an EXTERNAL dataset: one in which you again know both the inputs and the results, but which was collected at a time and from an event separate from the one used to build your model. For example - build your model using Ohio's last election, then test it with Pennsylvania's data, or build with 2008 numbers and try to predict the 2004 result. [To jump out of the election context - this is also the challenge for Climate Change models in terms of producing a model which can not only predict current & future trends, but also account for the Medieval Warm Period and mid-twentieth century warm (1930's)-to-cool (1970's) cycle.]

Once this EXTERNAL test is successful, then the model is ready for PREDICTIVE testing. Note, however, if the coefficients of the model had to be tweaked in METHOD 2 to get a successful result, the dataset can NO LONGER BE CONSIDERED AN EXTERNAL DATASET. The entire model testing reverts to the METHOD 1 INTERNAL test condition, and EXTERNAL testing must be repeated with the new dataset.

ONLY when a model passes METHOD 2 without change to any correlates can we move on to METHOD 3 - PREDICTION. Note, that once a dataset is used for EXTERNAL testing and fails - it becomes an INTERNAL dataset and a new one must be used for each and every EXTERNAL test.

METHOD 3: This is the PREDICTION stage and is the current status of Nate Silver's model and all polling models for the 2012 election. All data known to date is input and we will not know the actual result until after November 6th. In many experimental settings, statistical power (or validity) assessments are set at 90% power - in other words, the model should not result in an error in outcome in 9 out of 10 "BLIND" tests (where the model is run before the results are known, so that there is no possibility of biasing the result). This is not "model simulations" as Silver currently reports, but actual Presidential elections.

By the way, successfully applying and passing these tests is what is sometimes termed "statistical rigor" - at least as I am using it here. With only one Presidential election every 4 years, as of this date, Silver's model can't truly be considered "accurate with sufficient statistical power and rigor." One successful outcome in one attempt is not statistically significant - in fact it is not even insignificant. A single point (successful prediction from a BLIND test) has no dimensions, no width, breadth or variance. He needs many more BLIND tests, and if we set the criterion at "9 out of 10 Presidential elections" that means the model won't reach the criterion until at least 2044!
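To put rough numbers on that, here is a small sketch of the arithmetic, assuming each BLIND test is an independent trial and comparing against a pure coin-flip "model" (both are simplifying assumptions of mine, and the function name is just for illustration).

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k successes in n independent trials, each with probability p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Chance that a coin-flip "model" passes purely by luck.
one_of_one = prob_at_least(1, 1, 0.5)     # a single correct call
nine_of_ten = prob_at_least(9, 10, 0.5)   # the 9-out-of-10 criterion

# Earliest year with ten BLIND Presidential tests, counting 2008 as the first.
first_blind_test = 2008
tests_needed = 10
earliest_year = first_blind_test + 4 * (tests_needed - 1)

print(f"lucky 1 of 1:  {one_of_one:.3f}")
print(f"lucky 9 of 10: {nine_of_ten:.4f}")
print(f"tenth blind test falls in {earliest_year}")
```

A single hit is something a coin gets right half the time; nine of ten is something a coin almost never manages, which is why the criterion needs that many elections.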

---

Now, as to Matt P.'s question about how to tease out the difference and determine whether the model in fact predicted the "sucker bet" or the "long shot" - there is a variation on METHOD 1 testing called "LEAVE-ONE-OUT" which we will call METHOD 4.

METHOD 4: In this test, we need (A) Silver's original 2012 predictive model (2012P) as of his most recent model run that yielded 4 out of 5 simulations with an O win; and (B) a revised model from after the election in which he re-fits the coefficients to the actual 2012 election results (2012A). Ideally, the model will then yield the actual result in 100% of simulation runs (but is more likely to be around 98%), rather than the current 79% O, 21% R. To achieve this will naturally require changing some coefficients and weightings.

Now starts a long and tedious analysis where we take the new model (2012A) and, with respect to coefficients, we "leave one out" and substitute instead the 2012P coefficient. We rerun the simulation and compare accuracy. Note that we can only change one coefficient at a time - when we move on to the next coefficient, we reset the previous one to the 2012A value. In this manner, each coefficient is checked to see if it was the one that accounted for the most "variance" - in this case, the greatest increase in accuracy. Once all coefficients are tested independently, we can start by changing two coefficients at a time, then three, then four, etc. [Note - corrected - s2la] If there are "N" coefficients or factors in the model, we will need to perform "Sigma-N" (The sum of 1-through-N = [N]+[N-1]+[N-2]+[N-3]+...+3+2+1) tests to complete the procedure. At this point we rank the factors, and might want to run an "eigenvector decomposition" (factor analysis) to decide which ones are "significant" to producing a successful or failed match to the actual 2012 election result.
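Here is a toy sketch of the swap-one-coefficient bookkeeping, assuming a three-coefficient linear model with invented numbers (the names coeffs_2012p and coeffs_2012a, the inputs, and the error measure are all hypothetical, not anything from Silver's model). It also counts, by brute-force enumeration, how many runs are needed if every combination of swapped coefficients is tried, under the assumption that each coefficient is either swapped or not; treat that count as a cross-check on the tally above, not a settled answer.

```python
from itertools import combinations

def predict(coeffs, inputs):
    """Toy linear model: weighted sum of the inputs."""
    return sum(c * x for c, x in zip(coeffs, inputs))

def error(coeffs, cases):
    """Mean absolute error over (inputs, actual_result) cases."""
    return sum(abs(predict(coeffs, xs) - y) for xs, y in cases) / len(cases)

def swap_in(refit, original, indices):
    """Refit coefficients with the chosen positions reverted to the original values."""
    return [original[i] if i in indices else c for i, c in enumerate(refit)]

def rank_single_swaps(refit, original, cases):
    """Swap one coefficient at a time; largest loss of accuracy first."""
    base = error(refit, cases)
    damage = []
    for i in range(len(refit)):
        trial = swap_in(refit, original, {i})   # others stay at refit values
        damage.append((error(trial, cases) - base, i))
    return sorted(damage, reverse=True)

def count_swap_runs(n):
    """Runs needed to try every swap set of size 1..n-1 (all n swapped is the old model)."""
    return sum(1 for k in range(1, n) for _ in combinations(range(n), k))

# Hypothetical coefficients: predictive (2012P) and refit-to-results (2012A).
coeffs_2012p = [0.50, 1.00, -0.30]
coeffs_2012a = [0.50, 1.40, -0.25]
inputs = [(1.0, 2.0, 3.0), (2.0, 1.0, 0.0), (0.0, 3.0, 1.0)]
cases = [(xs, predict(coeffs_2012a, xs)) for xs in inputs]  # 2012A fits exactly

ranking = rank_single_swaps(coeffs_2012a, coeffs_2012p, cases)
for loss, index in ranking:
    print(f"coefficient {index}: error rises by {loss:.3f} when reverted")
print("runs for all combinations, N=3:", count_swap_runs(3))
```

In this toy case the coefficient that changed most between the two fits does the most damage when reverted, which is the ranking the procedure is after.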

[For reference, when it comes to modeling the rodent or primate brain with "only" 50,000 coefficients, it takes about 6 hours using a computer cluster equivalent to 256 Pentium 4 PCs. I expect Silver's model has millions if not hundreds of millions of data points and coefficients, and figure it will take many months to complete the METHOD 4 analysis.]

Conclusion: when all is said and done, we - and Silver - can know what aspects of his model contributed to the success or failure of his simulations/predictions.

---

This has been a long methodology explanation, but it brings me back to my statistical concerns with relying too heavily on predictions based on such a model:

(1) to paraphrase Tom K., "success in 2008 may simply mean Silver was lucky, not that his model is valid." [Yes, I know that Silver is quite successful with sports predictions, but baseball models can be BLIND tested hundreds of times per year - Presidential elections occur with a frequency of 0.25 per year - we're talking three orders of magnitude difference in frequency.]

(2) with only one election - i.e. only one BLIND test (2008 result), there is no way to confirm sufficient statistical power nor sufficient statistical rigor.

(3) the only way to assess validity as factors change is with appropriately applied METHOD 2. Each iteration that cannot be completed without tweaking the coefficients requires a new, "CLEAN" DATASET.

(4) if every possible factor and existing data has been used for tweaking a model, there will be no CLEAN DATASETS left for passing METHOD 2 - and remember that only a success at METHOD 2 before moving to METHOD 3 conveys statistical rigor.

---

Nate Silver's model may be correct, but given that right now it is generating one opposing result out of every five simulations, it is indeed very difficult to assess its value until it can be thoroughly evaluated from the perspective of historical accuracy.

---

[*] (Correlation "r" > 0.7 yields an r-square of at least 0.49, thus indicating that roughly 50% or more of the variability in observed results is accounted for by your model.)

[A casual note - the next time you hear an animal-rights activist ranting about how stupid animal researchers are, how we "waste" animal lives and do useless work full of contradiction and falsification - consider this - the "statistical power and rigor" analysis I performed here is a standard requirement of an Institutional Animal Care and Use protocol. We do this all the time, and hold our results to these standards, just to get our own departments and institutions to allow us to touch a mouse or rat (let alone a cat, dog or monkey!).]

15 comments:

Anonymous, November 2, 2012 at 10:15 AM
Put simply, if Silver's model predicts Obama to win with a 75% likelihood, but Obama loses anyway, that could mean either that Silver's model is wrong, or that it's correct with Obama having experienced an expected 1 in 4 loss.

Without dozens or hundreds of other otherwise similar elections to validate Silver's model against, there is no way to know if his model is correct or not. In practice, since every election is unique, Silver's model probably CANNOT be validated this way in the real world.

That doesn't make the model wrong, but with such a small sample size to compare the model against, this leaves open the very real possibility that any earlier electoral prediction success using it was due to dumb luck, rather than sophisticated modelling skill. This would be true, even if Obama won. . .were his "true" odds really 75%?

Now, all that said, even if we can't apply rigorous mathematical testing to Silver's model for lack of numerous national elections to compare against, there are still real world 'common sense' measures that we could apply.

For example, hypothetically, if Romney were to completely blow out Obama, winning a decisive victory with several points of the popular vote, that would, at the very least, suggest that a model predicting a 75% likelihood of an Obama victory was significantly flawed.

The House, November 2, 2012 at 10:37 AM
The fact is, Silver's "probability" is nothing more than what a Las Vegas odds maker does. His claim has no statistical value whatsoever.
Anonymous, November 2, 2012 at 10:39 AM
We do have another test of the Silver predictive equations. In 2010, he used his model to predict the 2010 house results. His conclusion prior to the election was that there was a 70%+ chance that the GOP would win less than 60 house seats (they won 63 that year). The house prediction was using different equations, but they are related. So if we say he had a hit with 2008 and a miss with 2010, then you at least have some "breadth and depth" of results to measure and at 50-50 he isn't exactly hitting on all cylinders.
Jason, November 2, 2012 at 11:25 AM
You said, "we will need to perform "N!" (N-factorial = [N]+[N-1]+[N-2]+[N-3]+...+3+2+1) tests to complete the procedure," but that is not what N! means. The factorial is the product of all positive integers <= N, not the sum. So 6! is 720, not 21.
Anonymous, November 2, 2012 at 1:06 PM
Looks like you'll need a correction to the correction. The number of subsets of the N parameters is 2^N (two to the Nth power) so the number of re-tests to see what subsets of parameters account for the differences would be 2^N-2 (minus two because neither the empty set nor the complete set are meaningful here).

The original factorial was a better guess than the sum because at least the factorial function is exponential. But it is wrong because it counts *ordered* subsets and obviously the parameters must have a fixed order.
sane_voter, November 2, 2012 at 7:04 PM
I do not have a problem with the premise behind the Silver model per se, it's just that he is captive to data (polls) of unknown provenance. He has no idea how good or bad the polls are or if the assumptions of partisan turnout are correct. He must take them at face value and run them through the model. I would bet Silver a large sum of money Romney exceeds the electoral vote count he is predicting (234).

Also, note that Sean Davis (@seanmdav) very closely replicates Silver's results using a simple Monte Carlo simulation in Excel.
Borf, November 3, 2012 at 8:23 AM
The "probability" says more about Silver's model than anything else. The probability of Obama winning is either 0 or 1.

What he's saying is that, 75% of the time, his model can predict the outcome of an election, given the set of parameters he's using, and that, in this case, it is predicting a win for Obama.

What he may not be quantifying is the sources of uncertainty: His parameters have to be measured. How certain are those measurements? How stable is his model in the prediction region (do small changes in parameters result in large changes to the output)?

Personally, I'd be willing to place a bet with Mr. Silver using his model to price it.
JZ, November 3, 2012 at 11:47 AM
Not to split hairs, but since the election hasn't happened yet, without question the "probability" of Obama winning is somewhere in the range of values between 0 and 1, not just one of the two extremes.

Anyhow -- isn't Silver performing a state-by-state analysis and then (more or less) summing up those results to determine a winner? I realize there's quite a bit of independence issues that crop up (ex. if Romney takes Pennsylvania then his probability of winning Ohio must be increased too) -- and perhaps he's using one gigantic complex model instead of 20-50 smaller ones -- but if it's the state-by-state case then couldn't his methods be examined at much finer detail? I'd always assumed he'd just set up models for 50 states or 20 regions or whatever and then just ran the simulation millions of times.

Please forgive me if my impression is wrong; it's safe to say I don't visit the NYT all that often. :)