UPDATE 1: Thanks to INSTAPUNDIT and Insta-ssistant Sarah A. Hoyt!
Update 1a: WOW! >6000 page views. Thanks, Insty!
UPDATE 2: I've corrected the "N-factorial" error - it's really summed N. Thanks to the folks who pointed it out.
Update 2a: We're still thrashing about that N factor on number of tests. I get 1/2 * N * (N+1) which is the summation I have listed. Several other readers have convinced me it should be N-squared - if so, though, it should be N-squared minus one, since the final combination (leaving out all 2012 coefficients, and replacing with 2008 coefficients) is essentially the 2008 model we are comparing against.
Many, many, MANY thanks to the readers! - s2la
[NEWS: Sorry, I've been sick, and it really affects my ability to sit and write, but I think that's out of the way. In better news, the Carpal Tunnel surgery on left wrist went well, I now am eager to get the right wrist done after the first of the year. I promise to build up a back-log of posts to bridge any down-time. Best news, I have figured out that The Lab Rats' Guide to the Brain needs an additional chapter - one that fits in with my recent research and conferences. Starting next week - a multi-blog series on "Interfacing the Brain." We'll start with biofeedback and jump to modern EEG-driven brain-to-machine interfaces, talk about neural prosthetics, the "inverse problem," and discuss the difficulty in deciphering and mapping brain activity to specific "thoughts" and actions. To wrap up the section, I'll delve into Dr. Travis S. Taylor's quantum theories of the brain (and Mind).]
SO, first post back in two weeks, and I didn't really mean to get political - but I've been asked about statistics, polls and models. So I'm going to talk about statistical modeling and assessment thereof. I promise to go easy on the politics.
Nate Silver is a "Sabermetrics" modeler whose rise to fame revolves around statistical analysis of sports (http://www.baseballprospectus.com/news/?author=59). He is now in the news for his model of the 2008 and now 2012 U.S. Presidential elections. There are many supporters and detractors, and I'm not really going to discuss merits except in terms of statistical methodology. Currently Silver's model simulation yields a win for Obama 79% of the time, and for Romney 21% of the time.
In a recent discussion elsewhere, Matt P. mentioned that Silver says that if O wins, then of course his model was right by majority result - if R wins, then of course, the model predicted a long-shot, but still possible win. Further, it would be very hard to discern that difference.
Matt then asked:
"Is there, in fact, some way to tease out the difference after the results are in? is there some analysis he can perform to try to determine whether his model was valid and we saw the long shot come in, or that his model had Romney at too low a probability all along? If so, can you give me the idiot's guide explanation of that methodology?"
So, at the risk of a looooooooong explanation, I'll start with methodology, and then move on to the question about how to tease out differences. This is long, you are forewarned! BTW, CAPS are not shouting, I'm using them for certain key statistical concepts.
There's only one statistically valid way to test a predictive model, but there are variations for later application, so we'll start with the critical assessment and call it METHOD 1:
METHOD 1: Start with an "empirically" designed model (i.e. designed from equations for observed trends in natural data) in which you have a bunch of equations that supposedly can be fitted to datasets of the desired type. For continuous data that create a series or waveform like electromagnetic spectra, you can use Fourier transforms, wavelets, power spectra, principal components analysis, etc. For discontinuous data of discrete measurements such as demographics, epidemiology, medical trends, you use factor analysis, canonical correlation (if you presume correlation of any inputs), logistic regression, nonlinear analyses (Laguerre linear approximations of Volterra-Wiener kernels to give the technical term). These are the technical terms for the most common "models." Each is a set of equations that can describe any set of data. What makes the model fit the data is the weightings and coefficients.
Given a model, you "teach" it with one-half of your data set. Note that all data in the set is collected at the same time, same subjects and under the same conditions - but you divide in half by any method of your choice. This also means, in the event of polling data, that you are using data from an already completed event - with both the pre-election polls and election results. Teaching a model means that you know the input and the output, and you use some version of iterative approximation to find coefficients to the equations that force the model to match the known result.
Once your model is "taught" and you have the coefficients and/or weightings that make the generic equations "fit" your data, you then apply the remaining half of your starting data set to the equations and derived coefficients. Do not adjust the coefficients, and do not reveal or apply the known results to the model. Compare the predicted results with the actual results and assess with one of the accepted statistical tests (for those taking notes, a Kolmogorov-Smirnov, KS, test is a good one to use here). If you get >80% correct or a correlation of >0.7 [*] on this, the "INTERNAL" test, then you can proceed.
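For those who like to see it in code, here is a minimal sketch of the split-half INTERNAL test. It rests on several assumptions that are not part of the discussion above: the "model" is a one-variable straight line standing in for the real equations, the data are synthetic, the function names (fit_line, pearson_r, internal_test) are made up for illustration, and the pass criterion is the correlation > 0.7 rule rather than a KS test.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b (the "teaching" step)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def pearson_r(us, vs):
    """Pearson correlation between two equal-length lists."""
    n = len(us)
    mu = sum(us) / n
    mv = sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = sum((u - mu) ** 2 for u in us) ** 0.5
    sv = sum((v - mv) ** 2 for v in vs) ** 0.5
    return cov / (su * sv)

def internal_test(xs, ys, seed=0, threshold=0.7):
    """Teach on a random half, predict the other half with frozen coefficients."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    # Coefficients come from the training half only.
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    # The held-out results are used only for the comparison, never for fitting.
    predicted = [a * xs[i] + b for i in test]
    actual = [ys[i] for i in test]
    r = pearson_r(predicted, actual)
    return r, r > threshold

# Synthetic stand-in data (not real polling numbers).
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 1.0) for x in xs]

r, passed = internal_test(xs, ys)
print(f"held-out correlation r = {r:.3f}, pass = {passed}")
```

The point of the sketch is the bookkeeping: the coefficients are fixed before the second half is touched, and the known results for that half appear only in the final comparison.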
METHOD 2 can then only be performed if you successfully pass an INTERNAL test.
METHOD 2: Once a model is successful, you can now test it with an EXTERNAL dataset: one in which you again know both the inputs and the results, but which was collected at a time and from an event separate from the one used to build your model. For example - build your model using Ohio's last election, then test it with Pennsylvania's data, or build with 2008 numbers and try to predict the 2004 result. [To jump out of the election context - this is also the challenge for Climate Change models in terms of producing a model which can not only predict current & future trends, but also account for the Medieval Warm Period and mid-twentieth century warm (1930's)-to-cool (1970's) cycle.]
Once this EXTERNAL test is successful, then the model is ready for PREDICTIVE testing. Note, however, if the coefficients of the model had to be tweaked in METHOD 2 to get a successful result, the dataset can NO LONGER BE CONSIDERED AN EXTERNAL DATASET. The entire model testing reverts to the METHOD 1 INTERNAL test condition, and EXTERNAL testing must be repeated with the new dataset.
ONLY when a model passes METHOD 2 without change to any coefficients can we move on to METHOD 3 - PREDICTION. Note that once a dataset is used for EXTERNAL testing and fails, it becomes an INTERNAL dataset, and a new one must be used for each and every EXTERNAL test.
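The EXTERNAL-test bookkeeping can be sketched as follows. Everything named here is hypothetical: the two-coefficient linear score, the dataset name, the rows of made-up numbers, and the helper names accuracy and external_test. The only things taken from the rules above are that the coefficients stay frozen, that the pass mark is >80% correct, and that a dataset which fails is no longer a clean EXTERNAL dataset.

```python
def accuracy(coeffs, rows):
    """Fraction of rows where a frozen linear score calls the known outcome."""
    hits = 0
    for inputs, outcome in rows:
        score = sum(c * x for c, x in zip(coeffs, inputs))
        predicted = 1 if score > 0 else 0
        if predicted == outcome:
            hits += 1
    return hits / len(rows)

def external_test(coeffs, name, rows, spent, threshold=0.8):
    """Run one EXTERNAL test; a failed dataset is recorded as spent."""
    if name in spent:
        raise ValueError(f"{name} is no longer a clean EXTERNAL dataset")
    frozen = tuple(coeffs)  # a tuple, so nothing in here can tweak the coefficients
    acc = accuracy(frozen, rows)
    passed = acc > threshold
    if not passed:
        # After a failure the dataset can only serve as INTERNAL data.
        spent.add(name)
    return acc, passed

# Made-up example: inputs are two poll shares, outcome 1 means candidate A won.
coeffs = (1.0, -1.0)
rows = [
    ((0.52, 0.48), 1),
    ((0.45, 0.55), 0),
    ((0.51, 0.49), 1),
    ((0.47, 0.53), 0),
    ((0.50, 0.505), 1),  # the frozen model misses this one
]
spent = set()
acc, passed = external_test(coeffs, "state_B_example", rows, spent)
print(acc, passed, spent)
```

In this made-up run the model gets 4 of 5 right, which is not more than 80%, so the dataset is marked as spent and a second call with the same name raises an error instead of quietly allowing a re-test.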
METHOD 3: This is the PREDICTION stage and is the current status of Nate Silver's model and all polling models for the 2012 election. All data known to date is input and we will not know the actual result until after November 6th. In many experimental settings, statistical power (or validity) assessments are set at 90% power - in other words, the model should give the correct outcome in at least 9 out of 10 "BLIND" tests (where the model is run before the results are known, so that there is no possibility of biasing the result). The tests that count here are not "model simulations" as Silver currently reports, but actual Presidential elections.
By the way, successfully applying and passing these tests is what is sometimes termed "statistical rigor" - at least as I am using it here. With only one Presidential election every 4 years, as of this date, Silver's model can't truly be considered "accurate with sufficient statistical power and rigor." One successful outcome in one attempt is not statistically significant - in fact it is not even insignificant. A single point (successful prediction from a BLIND test) has no dimensions, no width, breadth or variance. He needs many more BLIND tests, and if we set the criterion at "9 out of 10 Presidential elections" that means the model won't reach the criterion until at least 2044!
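The arithmetic behind the 2044 figure can be sketched in a few lines, assuming the first BLIND test is the 2008 election and one test arrives every 4 years. The second function is an added illustration, not part of the argument above: it assumes a coin-flip guesser (a deliberately crude baseline) and asks how often pure luck would produce a given record.

```python
from math import comb

FIRST_BLIND_YEAR = 2008   # first election the model was run on blind
YEARS_BETWEEN = 4         # Presidential elections come every 4 years
TESTS_NEEDED = 10         # the "9 out of 10" criterion needs 10 blind tests

def year_of_nth_blind_test(n, first_year=FIRST_BLIND_YEAR, spacing=YEARS_BETWEEN):
    """Election year in which the n-th blind test becomes available."""
    return first_year + (n - 1) * spacing

def chance_by_luck(min_correct, trials):
    """Probability that a coin-flip guesser gets at least min_correct of trials right."""
    ways = sum(comb(trials, k) for k in range(min_correct, trials + 1))
    return ways / 2 ** trials

print(year_of_nth_blind_test(TESTS_NEEDED))   # 2044
print(chance_by_luck(1, 1))                   # 0.5
print(chance_by_luck(9, 10))                  # 11/1024, about 0.011
```

Under that coin-flip assumption, one correct call out of one happens half the time by luck alone, while 9 or more out of 10 happens about 1% of the time.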
Now, as to Matt P.'s question about how to tease out the difference and determine whether the model in fact predicted the "sucker bet" or the "long shot" - there is a variation on METHOD 1 testing called "LEAVE-ONE-OUT" which we will call METHOD 4.
METHOD 4: In this test, we need (A) Silver's original 2012 predictive model (2012P) as of his most recent model run that yielded 4 out of 5 simulations with an O win; and (B) a revised model from after the election in which he re-fits the coefficients to the actual 2012 election results (2012A). Ideally, the model will then yield the actual result in 100% of simulation runs (but is more likely to be around 98%), rather than the current 79% O, 21% R. To achieve this will naturally require changing some coefficients and weightings.
Now starts a long and tedious analysis where we take the new model (2012A) and, with respect to coefficients, we "leave one out" and substitute instead the 2012P coefficient. We rerun the simulation and compare accuracy. Note that we can only change one coefficient at a time - when we move on to the next coefficient, we reset the previous one to the 2012A value. In this manner, each coefficient is checked to see if it was the one that accounted for the most "variance" - in this case, the greatest increase in accuracy. Once all coefficients are tested independently, we can start by changing two coefficients at a time, then three, then four, etc. [Note - corrected - s2la] If there are "N" coefficients or factors in the model, we will need to perform "Sigma-N" (the sum of 1-through-N = [N]+[N-1]+[N-2]+[N-3]+...+3+2+1) tests to complete the procedure. At this point we rank the factors, and might want to run an "eigenvector decomposition" (factor analysis) to decide which ones are "significant" to producing a successful or failed match to the actual 2012 election result.
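A minimal sketch of the substitution loop, under loud assumptions: the three-coefficient "model", the inputs, the scoring function and all function names are invented for illustration, and nothing here reflects how Silver's model is actually built. It swaps 2012P values into the 2012A coefficients one subset at a time, resetting between runs, and ranks the single swaps by how much accuracy they cost. As a side observation, if every combination of coefficients is enumerated exactly once and the swap-everything case is skipped, the count comes out to 2^N - 2, which is worth comparing with the Sigma-N and N-squared figures discussed in the updates.

```python
from itertools import combinations

def substitution_sets(n_coeffs, max_size=None):
    """Yield index tuples of coefficients to swap back to their 2012P values."""
    # By default stop before swapping all N: that is just the 2012P model again.
    top = n_coeffs - 1 if max_size is None else max_size
    for size in range(1, top + 1):
        yield from combinations(range(n_coeffs), size)

def swap_in(fitted, predictive, indices):
    """Copy the 2012A coefficients, replacing the given indices with 2012P values."""
    mixed = list(fitted)  # fresh copy each time, so earlier swaps are reset
    for i in indices:
        mixed[i] = predictive[i]
    return mixed

def rank_single_swaps(fitted, predictive, score):
    """Rank coefficients by the accuracy lost when each one alone is swapped."""
    base = score(fitted)
    drops = []
    for (i,) in substitution_sets(len(fitted), max_size=1):
        drops.append((base - score(swap_in(fitted, predictive, (i,))), i))
    return sorted(drops, reverse=True)

# Hypothetical numbers: a linear model scored by closeness to the actual result.
inputs = [1.0, 2.0, 4.0]
coeffs_2012A = [0.5, 0.3, 0.2]    # re-fitted after the election
coeffs_2012P = [0.5, 0.1, 0.25]   # pre-election predictive values

def total(coeffs):
    return sum(c * x for c, x in zip(coeffs, inputs))

actual = total(coeffs_2012A)

def score(coeffs):
    # Higher is better; 0 means an exact match to the actual result.
    return -abs(total(coeffs) - actual)

ranking = rank_single_swaps(coeffs_2012A, coeffs_2012P, score)
print([i for _, i in ranking])                # most influential coefficient first
print(len(list(substitution_sets(3))))        # number of runs for N = 3
```

In this toy case coefficient 1 matters most, because swapping it alone moves the result furthest from the actual outcome, while coefficient 0 did not change between the two models and so costs nothing.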
[For reference, when it comes to modeling the rodent or primate brain with "only" 50,000 coefficients, it takes about 6 hours using a computer cluster equivalent to 256 Pentium 4 PCs. I expect Silver's model has millions if not hundreds of millions of data points and coefficients, and figure it will take many months to complete the METHOD 4 analysis.]
Conclusion: when all is said and done, we - and Silver - can know what aspects of his model contributed to the success or failure of his simulations/predictions.
This has been a long methodology explanation, but it brings me back to my statistical concerns with relying too heavily on predictions based on such a model:
(1) to paraphrase Tom K. "success in 2008 may simply mean Silver was lucky, not that his model is valid." [Yes, I know that Silver is quite successful with sports predictions, but baseball models can be BLIND tested hundreds of times per year - Presidential elections occur with a frequency of 0.25 per year - we're talking three orders of magnitude difference in frequency.]
(2) with only one election - i.e. only one BLIND test (2008 result), there is no way to confirm sufficient statistical power nor sufficient statistical rigor.
(3) the only way to assess validity as factors change is with appropriately applied METHOD 2. Each iteration that cannot be completed without tweaking the coefficients requires a new, "CLEAN" DATASET.
(4) if every possible factor and existing data has been used for tweaking a model, there will be no CLEAN DATASETS left for passing METHOD 2 - and remember that only a success at METHOD 2 before moving to METHOD 3 conveys statistical rigor.
Nate Silver's model may be correct, but given that right now it is generating one opposing result out of every five simulations, it is indeed very difficult to assess its value until it can be thoroughly evaluated from the perspective of historical accuracy.
[*](Correlation "r" > 0.7 yields an r-square of at least 0.49, thus indicating that roughly 50% or more of the variability in observed results is accounted for by your model.)
[A casual note - the next time you hear an animal-rights activist ranting about how stupid animal researchers are, how we "waste" animal lives and do useless work full of contradiction and falsification - consider this - the "statistical power and rigor" analysis I performed here is a standard requirement of an Institutional Animal Care and Use protocol. We do this all the time, and hold our results to these standards, just to get our own departments and institutions to allow us to touch a mouse or rat (let alone a cat, dog or monkey!).]