A reader passed along a citation to a very interesting article, Peter C. Austin et al., Testing Multiple Statistical Hypotheses Resulted in Spurious Associations: A Study of Astrological Signs and Health, 59 J. Clinical Epidemiology 964 (2006). The basic point of the article is that, as an accompanying editorial (59 J. Clinical Epidemiology 871) notes, "spurious P-values arise at a surprisingly high frequency if a researcher has sufficient creativity and a large database." The lesson, I think, is the importance of adhering to the Bradford Hill criteria before drawing any conclusions from an epidemiological study; also, before scientific journals publish such studies, they should ask the authors how many potential associations they investigated. Given the standard 95% confidence threshold (a 5% significance level), a researcher who tests one hundred different associations can expect to find about five of them "significant" by chance alone.
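The arithmetic of that last point is easy to check by simulation. The sketch below is purely illustrative (the data and variable names are mine, not the article's): it runs one hundred two-group comparisons on random data, so every "association" it finds is spurious by construction.

```python
import math
import random
import statistics

random.seed(0)
N_TESTS = 100   # one hundred candidate associations, all truly null
N_OBS = 50      # observations per group

def z_stat(a, b):
    """Two-sample z statistic for a difference in means (adequate for n = 50)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

false_positives = 0
for _ in range(N_TESTS):
    a = [random.gauss(0, 1) for _ in range(N_OBS)]
    b = [random.gauss(0, 1) for _ in range(N_OBS)]  # same distribution: no real effect
    if abs(z_stat(a, b)) > 1.96:  # |z| > 1.96 corresponds to p < 0.05
        false_positives += 1

print(false_positives, "of", N_TESTS, "null comparisons came out 'significant'")
```

On a typical run the count hovers around five, which is exactly the one-in-twenty false-positive rate the 5% significance level promises.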

I can't say I'm any sort of expert in statistics, but from what I've seen at academic workshops, economists (including law and economics scholars) tend to be even sloppier than epidemiologists about (a) exhibiting naivete about the importance of finding a "statistically significant" result; (b) concluding that they have discovered causation when they've only discovered, at best, correlation; and (c) coming up with unconvincing post-hoc rationalizations as to why they manipulated their data set in a particular way that just happened to help them achieve a "positive" result. That's not to say that there isn't some fine empirical work out there in the law and economics literature, just that appropriate cautions and safeguards don't seem to me to be as built into professional culture as they should be. One safeguard I'd like to see: to prevent data dredging and other forms of manipulation, researchers should publish on the Internet their research protocol, what factors they are going to consider, and why, BEFORE they start looking for results. For (random) example, if one is going to do a paper on the effect of mandatory sentencing on crime rates, one should announce in advance which crimes one is going to consider and why, and not instead be able to first eliminate robberies, then eliminate rapes, then add back robberies and eliminate burglaries, etc., until one comes up with an "interesting" result, after which one can post-hoc rationalize why one chose the crimes one did.

On the second "BEFORE" point, though, I have mixed opinions. This post facto revision is often called data mining, and some adamantly oppose it. But I think the discovery of unexpected associations can make a contribution, especially since a single result should not be considered conclusive until confirmed. I would find it objectionable if the author declined to report the negative results for the other crimes, though.

The problem here is confusion between developing a hypothesis and testing it. An unexpected association is interesting and suggestive, but it is wrong to apply a statistical test to it, precisely because when there are many associations there will inevitably be some "unexpected" ones.

Take Bernstein's example. Mandatory sentencing is introduced and bank robberies, say, drop. You test this and find it significant. But that's nonsense. There are lots of crimes and normal variation will produce some drops (and increases). The statistical test is meaningless.

Now it would be reasonable to hypothesize that mandatory sentencing has a strong effect on bank robberies. You could then test this in a different jurisdiction. The point is that you cannot, in general, develop and test a hypothesis on the same data. That's data mining.

If 10,000 studies are done in a year at 95% confidence, we know 500 things as true that are in fact false.

The local version only adds one, which is a drop in the bucket of our bogus knowledge production.

Einstein made public predictions based on his theory of general relativity. As I recall, he stated that if those predictions were not borne out, his theory would be wrong. That is courageous and admirable.

As to data mining, one of the problems in research--which is beginning to be addressed--is that negative results, although useful, are not published. Publishing negative results would relieve some of the downside of data mining.

We don't "know" things as true because one study finds a statistically significant association. If the odds of one study finding a spurious statistically significant association are 5%, the odds of two perfectly independent studies finding that association are .0025, and the odds of three finding the association are .000125. It is only with this sort of replication and checking that we should claim to "know" anything.
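Those figures are just powers of the 5% false-positive rate; a two-line check (purely illustrative):

```python
alpha = 0.05  # chance that one study finds a spurious "significant" association
for k in (1, 2, 3):
    # probability that k independent studies all find the same spurious association
    print(k, round(alpha ** k, 6))
# prints 1 0.05, then 2 0.0025, then 3 0.000125 -- matching the figures above
```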

People are far too quick to jump on a single finding and promote it as truth, but that's not a problem with the research itself.

A rule that a researcher MUST make his dataset available to other academics so they can (if possible) replicate his results would go a long way toward eliminating "junk science." For example, a prolific anti-gun "researcher" will NOT share his datasets with other academics.

The reader has to take all of his basic numbers "on trust." All you can examine are the conclusions he chooses to disclose in his articles. The article may have "feet of clay," but no one can expose him.

As you mention, it is very easy to run afoul of this because multicollinearity is a matter of degree and not a yes-or-no dichotomy. OLS point estimates are still valid under extreme multicollinearity, but the significance tests will be too generous, leading to an increased possibility of rejecting a true null hypothesis.

Good researchers will report when this is a problem.

-nj

I am honestly surprised by this statement - the entire focus of empirical labor economics, public economics, health economics, and most of the other branches of empirical microeconomics these days seems to be on identification, finding natural experiments, finding suitable exogenous instruments, and so on. I spend considerable time each week in their workshops and seminars and absolutely nobody would get away with claiming causation based on a correlation. I suspect you are slandering fields you know little about.

The Law and Economics literature, at least the parts of it that are being published in law reviews, is an unfortunate exception to my description and mostly serves as an example of how empirical research should not be done. The quality differences are truly striking, and I don't know exactly what the reasons for it are. I suspect it is created by a lack of training for some of the researchers, and the lack of a competent refereeing process at the law reviews.

It is genuinely frustrating to referee a truly badly done empirical paper for an econ journal, and to then see it appear in a prestigious law review a year later. That has happened to me twice in the last four years, and I am afraid that it may be characteristic of the quality of non-legal articles published in law reviews. I should add that the paper selection process at law reviews may work perfectly well for papers on legal topics, I am certainly unqualified to judge that.

Multicollinearity (which is what the problem you describe is called) isn't usually as serious a problem as you make it sound, even though it can be. When one runs a regression, one usually has one or two variables on the right-hand side one is really interested in, and a set of other so-called control variables. There can be fairly high correlations among the control variables; as long as they are not closely correlated with the variables you are interested in, your analysis should still be all right.
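The "how correlated is too correlated" question has a standard diagnostic: the variance inflation factor. A minimal sketch (synthetic data and variable names of my own choosing) showing how a near-duplicate predictor inflates coefficient variance:

```python
import math
import random
import statistics

random.seed(1)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.1) for v in x1]  # x2 is x1 plus a little noise: nearly collinear

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)  # variance inflation factor for a two-predictor regression
print("r =", round(r, 3), " VIF =", round(vif, 1))
```

A VIF near 1 means the correlated regressors are doing no harm; here it comes out well into the double or triple digits, meaning the variance of either coefficient estimate is many times what it would be with uncorrelated predictors.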

(Thomas MC, Lyons BD, Walker RJ. John Thomas sign: common distraction or useful pointer? Med J Aust. 1998 Dec 7-21;169(11-12):670.)

On a more serious note, this is a much bigger problem, to me, when people quote risk ratios (aka relative risk) for cancer, sudden death, etc. It has been noted in a number of places that risk ratios of less than 2 are really not meaningful (the risk ratio of smoking with lung cancer, in contrast, is around 11-20 depending on the study population). We have a *lot* of policy and regulation based on very low relative risks, many of which are simply artifactual.

Sorry, but very often I read parenthetical remarks in papers along the lines of "we should repeat this work using more variables in our regression".

That much work is nonetheless useful and valuable comes largely from researchers lacking the material on hand to be able to just dump more variables into their analysis.

i.e., it is the barriers to collecting data, not any deep understanding of the techniques, that make many publications relatively sound.

Even among my friends and coworkers at Caltech--hardly a lightweight institution--many very talented researchers are nonetheless shockingly naive about some of the statistical techniques they use.

Nice greetings from the better engineering school in the North-East. :-)

I am not at all disagreeing with what you are saying. Dumping a lot of variables onto the right-hand side of a regression without either theoretical or common sense justification for their inclusion is certainly not the way to go.

A researcher who tests many hypotheses and only reports the winners, without describing or correcting for the multiple comparisons, is reporting fraudulent results, even if out of ignorance.

The real problem is one of attitude. A critical approach to any theory is absolutely necessary no matter what kind of research you are doing and how you do it. A good scientist doesn't come up with a hypothesis and "try to prove it" - a good scientist comes up with a hypothesis and tries to prove it false. Because if it's really worth its salt, it will withstand this effort.

Unfortunately, there are too many researchers who use statistics "like a drunkard uses a lamppost - for support, rather than illumination."

"I am not at all disagreeing with what you are saying. Dumping a lot of variables onto the right-hand side of a regression without either theoretical or common sense justification for their inclusion is certainly not the way to go."

.... but it *is* the way to get published :-)

Not when I am the referee :-).

I'm sorry to say that the notion that a relative risk (or odds ratio) less than two is not really meaningful is simply nonsense. The magnitude of these relative measures is entirely dependent on the scale of the exposures and the definition of the reference group (the denominator). Without the context of what the relative risk refers to, no magnitude of relative risk is meaningful.

With respect to "data mining," there are some newer statistical techniques that might make data mining a bit more rigorous than it used to be--of course, one's hypothesis should be based in a priori theory, but with confirmatory factor analysis programs such as LISREL, EQS, and AMOS, it is possible to data mine, say, half of a data set for exploratory purposes, create a hypothesis, and then test it with the unused half of the data set. Modern statistical techniques make increasing use of iterative computer techniques--for more specific examples, check out the Journal of Multi-variate Statistics.

That's not too bad. What's bad is if the researchers were doing a little data mining. If those studies in total calculated 100,000 different correlations, and only reported the 10,000 that came out as statistically significant, then half of them are false.

"The point is that you cannot, in general, develop and test a hypothesis on the same data. That's data mining."

Sure you can. The problem you're worried about can be solved by randomly selecting a subset of observations and setting them aside before doing any analysis of the data. Then, once you've developed and tested your hypothesis, you test it out on the data you left out at the beginning.
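That split-sample idea is easy to illustrate. The sketch below (synthetic data, hypothetical variable names) mines one half of a dataset, in which every candidate variable is pure noise, for the variable most correlated with an outcome, and then checks that "discovery" on the held-out half:

```python
import math
import random
import statistics

random.seed(3)
n, n_vars = 400, 50

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_vars)]  # all truly unrelated to y

half = n // 2
# "Discover" the candidate most correlated with y in the exploratory half...
best = max(range(n_vars), key=lambda j: abs(pearson_r(X[j][:half], y[:half])))
r_explore = pearson_r(X[best][:half], y[:half])
# ...then test that same candidate on the untouched holdout half.
r_holdout = pearson_r(X[best][half:], y[half:])
print("explore r =", round(r_explore, 2), " holdout r =", round(r_holdout, 2))
```

The best of fifty null correlations looks respectable in the exploratory half and will typically collapse toward zero in the holdout, which is exactly the discipline the split enforces.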

The more serious problem is that all you're testing is correlation. If you want to discover causes, you have to do a lot more work, e.g. conducting randomized, controlled experiments. Observational data just won't get you the result you want, no matter what kind of statistical testing you do.

The second big problem is that most people don't really understand what a significance test is. It gets used as if it means "scientifically significant," when all it tells you is that the correlation you're observing is not likely due to random chance.

For that matter, it only indirectly tells you about the size of the correlation you're looking at. Get a big enough N, and even very small correlations can appear significant beyond .000001 (or whatever).
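The point about N is just that the test statistic for a correlation grows like the square root of the sample size; a quick illustration (the numbers are mine, and the formula is the usual large-sample approximation):

```python
import math

r = 0.01  # a substantively negligible correlation
for n in (1_000, 100_000, 10_000_000):
    z = r * math.sqrt(n)  # approximate test statistic for H0: true correlation is zero
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n={n:>10,}  z={z:5.2f}  {verdict}")
```

The same r = .01 goes from invisible to "significant" purely because N grew; the size of the correlation never changed.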

"In my experience, the most common error is to presume that a regression against more variables is per-se better than a regression against a few--this is because the variables of the regression ought to be independent. It is very easy to run afoul of this rule (can someone recall the name of this?): e.g., gender and income are not independent variables."You're confusing two different things. The problem of having correlated variables is multicollinearity, and as someone else pointed out, the variables have to be quite highly correlated before it's a problem.

A more common problem is that by putting too many variables (and hence parameters) into your model, you can increase the variance in the parameter estimates, even while it appears that you are getting a better fit (e.g. a higher R-squared).

On the other hand, if you leave out some important variables, you may be introducing bias into your parameter estimates.

This is commonly known as the "bias-variance tradeoff".

There are techniques to optimize the tradeoff, such as cross-validation. They tend to rely on the same sort of technique I mentioned above: leaving out some of your data, fitting the model, then coming back to test the model on the data you left out.
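The mechanics of that leave-out-and-refit loop are simple; here is a minimal k-fold index generator (a sketch of the general idea, not any particular package's API):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        # the last fold absorbs any remainder when n is not divisible by k
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[(k - 1) * fold:]
        test_set = set(test)
        train = [j for j in idx if j not in test_set]
        yield train, test

# fit on each `train`, score on the matching `test`, then average the k scores
folds = list(kfold_indices(10, 5))
print(len(folds), folds[0])
```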

"with respect to "data mining," there are some newer statistical techniques that might make data mining a bit more rigorous than it used to be--of course, one's hypothesis should be based in a priori theory but with confirmatory factor analysis programs such as LISREL, EQS, and AMOS, it is possible to data mine, say, half of a data set for exploratory purposes, create an hypothesis, and then test it with the unused half of the data set."Right, but you don't need to go to LISREL, you can do this with any kind of model-fitting exercise, including ordinary least-squares regression.

For my dissertation, I invented a genetic algorithm to test out literally billions of models on a given dataset to select the most predictive model (using internal cross-validation techniques). Then I validated it on an entirely different dataset. This works extremely well. It wasn't a regression model either, it was a nonparametric classification model.

But you can do that with any model where you're trying to predict something.

"The problem you're worried about can be solved by randomly selecting a subset of observations and setting them aside before doing any analysis of the data. Then, once you've developed and tested your hypothesis, you test it out on the data you left out at the beginning."

Sure. But then you're testing the hypothesis on different data (the subset) than you used to develop it. My point is that you can't look at a bunch of data on, say, the height of children, notice that the 12-year-olds in Cleveland are much taller than expected, and claim some sort of statistical significance.

Some group of n-year-olds in some city is going to be much taller than average. So what?

"But then you're testing the hypothesis on different data (the subset) than you used to develop it.""Different data" -- different in that the exact observations may be different, but the same data in the sense that taking a simple random sample for your test set gives you a representative sample. Thus the expected value of any given parameter estimate (which is what you really care about) will be the same.

"One way to get around these issues is to use an explicitly Bayesian model and codify the prior infommation directly."As a favorite stats professor of mine once said, "It sounds like a great idea - unfortunately, I've never actually met a genuine Bayesian."

"I suspect you are slandering fields you know little about."

As a professional outrage merchant, Bernstein must do this on a regular basis.

"If 10,000 studies are done in a year at 95% confidence, we know 500 things as true that are in fact false."

I would think a bunch of lawyers would understand that "not proven to be false" is not the same as "proven to be true".

Employing a 95% confidence level will necessarily imply that 1 out of 20 results will show causality where none exists. That is a necessary requirement of using any probability-based statistical method.

A smart researcher implicitly realizes that sometimes you'll observe false statistical significance. That's why people do follow-on work: further examination of the spurious false positive will uncover its spuriousness.

"Different data" -- different in that the exact observations may be different, but the same data in the sense that taking a simple random sample for your test set gives you a representative sample. Thus the expected value of any given parameter estimate (which is what you really care about) will be the same.Mahan,

I'm not sure I follow what you are saying here. Could you clarify a bit? Thanks.

Bayesian analysis probably has a better theoretical footing, but has a practical problem. Who has a prior distribution for a previously unknown model?

"Employing a 95% confidence level will necessarily imply that 1 out of 20 results will show causality where none exists."Confidence levels don't tell you anything about causality in any case (at least not where you're using observational data, and that's what social science usually has).

Even then, the 1 out of 20 number (for a false positive indication of a non-random correlation -- not causation) isn't quite right.

The 1-out-of-20 number would be true if people were randomly searching for correlations, but generally, people aren't doing studies where they randomly throw variables into a regression formula. More often than not, they have some reason to think that some of the variables they're studying have some connection to the dependent variable.

"I'm not sure I follow what you are saying here. Could you clarify a bit?"What you're usually interested in measuring is the value of some parameter. You have some estimator of the parameter (e.g. the ordinary least squares estimate of a regression coefficient).

If you take out a random subset of your data with a simple random sample, then a parameter estimate based on that subset of data will, on average, be equal to the parameter estimate calculated on the remaining data.
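A quick sanity check of that claim using a simple mean as the parameter estimate (synthetic numbers and the seed are mine):

```python
import random
import statistics

random.seed(4)
population_like = [random.gauss(10, 2) for _ in range(10_000)]

random.shuffle(population_like)
held_out = population_like[:2_000]   # simple random subset set aside up front
remainder = population_like[2_000:]

# both pieces estimate the same underlying mean, so the estimates agree on average
print(round(statistics.mean(held_out), 2), round(statistics.mean(remainder), 2))
```

The two estimates come out nearly identical, and over repeated random splits they are equal in expectation.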

"In general, the use of regression is accompanied by a lot of handwaving at the statistical assumptions."Yes.

"In addition to the ones mentioned above, there are the questions of whether the data has normal (Gaussian) error, whether a linear model is appropriate, etc."Careful -- What assumptions you need depend on what questions you're trying to answer. For example, you don't need to assume normality to get an unbiased estimated of a regression coefficient. Nor do you need it to get a standard error.

"Student's "Exact Statistical Inference" only applies if all the hypothesis are valid, which is approximately never."No, not "never".

Inference is fine if you're working off a random sample, and trying to make inferences about the population from which your sample was drawn. It's also fine if you're working with data taken from a randomized, controlled experiment.

But I agree, people ignore these assumptions way, way too often. I've made that point many times before on this site: people calculate p-values when they have data that isn't drawn with a random sample, and where there are no other sources of randomness anywhere in sight.

Try to tell people this, and they'll think you're crazy.

"Even then, the 1 out of 20 number (for a false positive indication of a non-random correlation -- not causation) isn't quite right."

Would you believe me if I say that I meant to say "correlation" and not "causation" in my sentence? :-)

Because, of course, I know that regression only gives us correlation - for causation you need theory.

BTW, I agree with what you're saying - there is a boatload of assumptions underlying my overly generalized statement. Besides, I'm generally speaking from my experience in the context of calculating autocorrelation functions, where sometimes you will find a statistically significant large-lag correlation which is clearly a spurious correlation that is deemed significant only because of the use of a 95% confidence level.

Only a mindless automaton would believe this correlation to be meaningful or indicative of anything because it passes some statistical test - almost any sentient being is able to use his judgment (and other tools) to assess the meaningfulness of the correlation.
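That autocorrelation situation is easy to reproduce. The sketch below (pure white noise, so every correlation is spurious by construction) computes the sample autocorrelation at forty lags against the usual 1.96/sqrt(N) white-noise band:

```python
import math
import random

random.seed(5)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]  # white noise: no true autocorrelation at any lag

def acf(series, lag):
    """Sample autocorrelation at a given lag."""
    m = sum(series) / len(series)
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, len(series)))
    den = sum((v - m) ** 2 for v in series)
    return num / den

band = 1.96 / math.sqrt(n)  # approximate 95% band for white noise
breaches = [k for k in range(1, 41) if abs(acf(x, k)) > band]
print("lags breaching the 95% band:", breaches)
```

Testing forty lags at the 95% level, about two breaches are expected by chance alone; judgment, not the band, has to decide whether any particular breach means something.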

We all understand "lies, damned lies and statistics". I believe that the number of people using statistics to mislead (or lie outright) is smaller than the number Bernstein believes exists, but I'm not a sinister-minded polemicist who sees disingenuity (or an anti-semitic gun control advocate) under every rock.