# Weak Statistical Standards Implicated In Scientific Irreproducibility

#### Soulskill posted about a year ago | from the nobody-who-needs-to-understand-statistics-understands-statistics dept.

182
ananyo writes *"The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."*

## 2 + 2 = 5 !! (0)

## Anonymous Coward | about a year ago | (#45407071)

I heard it more than once !!

## Or you know.. (1)

## tanujt (1909206) | about a year ago | (#45407139)

## Re:Or you know.. (5, Informative)

## Anonymous Coward | about a year ago | (#45407181)

This would have the same problems, maybe even worse. The problem with statistics is usually that the model is wrong, and Bayesian stats offers two chances to fuck that up: in the prior, and in the generative model (=likelihood). Bayesian statistics still requires models (yes, you can do non-parametric Bayes, but you can do non-parametric frequentist stats also).

Contrary to the hype and buzzwords, Bayesian statistics is not some magical solution. It is incredibly useful when done right, of course.

## Re:Or you know.. (5, Insightful)

## hde226868 (906048) | about a year ago | (#45407273)

## Re:Or you know.. (5, Interesting)

## Anonymous Coward | about a year ago | (#45407439)

Yes, I agree. If a p-value of 0.05 actually "means" 0.20 when evaluated, then any sane frequentist will tell you that things are fucked, since the limiting probability does not match the nominal probability (this is the definition of frequentism).

The power of Bayesian stats is largely in being able to easily represent hierarchical models, which are very powerful for modeling dependence in the data through latent variables. But it's not the Bayesianism per se that fixes things, it's the breadth of models it allows. A mediocre modeler using Bayesian statistics will still create mediocre models, and if they use a bad prior, then things will be worse than they would be for a frequentist.

Consider that if Bayesian statisticians are doing a better job than frequentists at the moment, it may be because Bayesian stats hasn't yet been drilled into the minds of the mediocre, as frequentist stats has been for decades. People doing Bayesian stats tend to be better modelers to begin with.

## Re:Or you know.. (4, Insightful)

## Daniel Dvorkin (106857) | about a year ago | (#45408853)

The problem with frequentist statistics as used in the article is that its "recipe" character often results in people using statistics without understanding their limitations (a good example is assuming a normal distribution when there is none). The Bayesian approach does not suffer from this problem, also because it forces you to think a little bit more about the problem you are trying to solve compared to the frequentist approach.

If only. The number of people who think "sprinkle a little Bayes on it" is the solution to everything is frighteningly large, and growing exponentially AFAICT. There's now a Bayesian recipe counterpart to just about every non-Bayesian recipe, and the only difference between them, as a practical matter, is that the people using the former think they're doing something special and better. One might say that their prior is on the order of P(correct|Bayes) = 1, which makes it very hard to convince them otherwise ...

## you sound like you know what you're talking about (1)

## raymorris (2726007) | about a year ago | (#45409493)

It sounds like you have a clue about statistics. Do you know of a good forum to ask a fairly involved statistics question? I have a set of measured variables A-E which all tend to indicate the likelihood of X. The relationships are a bit complex and unknown, though, so I need help with how I should analyze the historical data in order to come up with parameters to use in the future for making "predictions" of X based on known values of A-E.

## That book about the bell curve (-1)

## Anonymous Coward | about a year ago | (#45407141)

That, and the fact that all of statistics is a joke. It's all based on the assumption that data is distributed in a bell curve. Sure, a bell curve does fit a lot of data, but we blindly assume it fits everything which just can't be true.

## Re:That book about the bell curve (3, Informative)

## Derec01 (1668942) | about a year ago | (#45407201)

That is because of the central limit theorem (http://en.wikipedia.org/wiki/Central_limit_theorem), which indicates that for a large number of independent samples, it doesn't matter what the original distribution was, and we certainly can reliably use the normal distribution. It is NOT unfounded.

## Re:That book about the bell curve (2, Insightful)

## Will.Woodhull (1038600) | about a year ago | (#45407613)

Unless of course we happen to be working in a chaotic system where strange attractors mean there can be no centrality to the data.

Chaos theory is a lot younger than the central limit theorem. The situation might be similar to the way Einstein's theory of relativity has moved Newton's three laws from a position of central importance in all physics to something that works well enough in a small subset. A subset that is extremely important in our daily life, but still a subset.

Some portions of a chaotic system will be consistent with what the central limit theorem would predict. Other data sets from the same system, uh, no.

An important question I do not believe has been answered yet (I am an armchair follower of this stuff, neither expert nor student) is whether all the systems we work with where the CLT does seem to hold are merely subsets of larger systems. A related question would be whether there is any test that can be applied to a discrete data set that rules out its being a subset of a larger chaotic process.

## Re:That book about the bell curve (0)

## Anonymous Coward | about a year ago | (#45407667)

You hardly need chaos theory to come up with examples where a statistical estimator is not normally-distributed.

Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics. Even one example would be great.

## Re:That book about the bell curve (1)

## Will.Woodhull (1038600) | about a year ago | (#45407875)

Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.

When I turn that around, it seems to say that statistics is only of value in systems that have fully matured. Which sounds like most of the time statistics have no value.

Is that correct? Or is there some other way to reverse the quotation?

## Re:That book about the bell curve (1)

## Anonymous Coward | about a year ago | (#45408001)

Well, there's really nothing to turn around... you're spouting a lot of pseudo-science here, and still nothing that you've said has even suggested why statistics wouldn't work on "immature" (whatever that means) systems. The central limit theorem can apply to dynamic systems, and even if the CLT didn't hold, that doesn't mean that statistics is impossible. There are many estimators which do not obey the CLT.

Just google "statistics of chaotic systems" or whatever. You'll find plenty of work on the subject. Admittedly, they are using "statistics" the way physicists do, but it's still the same idea: a mathematical characterization.

Basically, whenever there is a probabilistic model for something, statistics happens when you are ignorant of (certain aspects of) the model, and try to infer what you don't know from the data. Again, google "dynamic statistical models"; you'll find a lot.

## Re:That book about the bell curve (1)

## Will.Woodhull (1038600) | about a year ago | (#45408525)

you're spouting a lot of pseudo-science here

I agree that there IS a lot of pseudo-science here, and that I have fallen into a nasty trap.

What can I say? This is not the first time an AC troll has gotten me good, and it probably will not be the last.

Now get thee back under that dark, damp, cobwebby bridge where thou belongest! Or I shall sprinkle thee with Troll-B-Gone powder and there will be nothing left around here but some grins and giggles.

## Re:That book about the bell curve (0)

## Anonymous Coward | about a year ago | (#45409555)

Chaos also occurs in the dynamic evolution of a system, so it's hard to see the connection you're implying with statistics.

When I turn that around, it seems to say that statistics is only of value in systems that have fully matured.

Statistics in today's world is more a financial tool. That system of manipulation *has* fully matured. I promise you they know *exactly* what they are doing with that tool.

## Yes and no (4, Interesting)

## golodh (893453) | about a year ago | (#45407805)

So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.

But there's the crux: it remains an assumption; a hypothesis, and fortunately it's usually a *testable* hypothesis. It's the responsibility of a researcher to check if it holds, and to see how problematic it is when it doesn't.

If something has a normal distribution, its square or its square root (or another power) doesn't have a Normal distribution. Take for example the diameter, surface area, and volume of berries. The diameter goes with the radius r, the surface area with r^2, and the volume with r^3. They cannot all be Normally distributed at the same time, so assuming any of them is starts you out on a shaky foundation.
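The berry point can be checked with a quick simulation. This is only a sketch under made-up assumptions: the radius is taken to be Normal with mean 5 mm and standard deviation 1 mm (numbers invented for illustration), and skewness is used as a rough indicator of departure from a symmetric bell shape.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: mean cubed z-score (about 0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

random.seed(0)  # fixed seed so the run is repeatable
# Hypothetical berries: radius in mm, Normal(5, 1); the numbers are made up.
radii = [random.gauss(5.0, 1.0) for _ in range(100_000)]
areas = [r ** 2 for r in radii]    # proportional to surface area
volumes = [r ** 3 for r in radii]  # proportional to volume

skew_r, skew_a, skew_v = skewness(radii), skewness(areas), skewness(volumes)
print(f"skewness: r={skew_r:.2f}, r^2={skew_a:.2f}, r^3={skew_v:.2f}")
```

Under these assumptions the radius comes out essentially symmetric, while the area and volume are visibly right-skewed, so at most one of the three can be Normal.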

## Re:Yes and no (1)

## colinrichardday (768814) | about a year ago | (#45408225)

Maybe none of them is normally distributed, but if we take distributions of sample means of 50 berries, then those distributions might all be close to the normal distribution.
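A rough simulation of the "means of 50 berries" idea, again assuming a hypothetical Normal(5, 1) mm radius so that individual volumes (proportional to r^3) are skewed. It compares the skewness of single volumes with the skewness of averages over 50 berries.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: mean cubed z-score (about 0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

random.seed(1)  # fixed seed so the run is repeatable

def berry_volume():
    # Hypothetical berry: radius ~ Normal(5, 1) mm, volume proportional to r^3.
    return random.gauss(5.0, 1.0) ** 3

single = [berry_volume() for _ in range(20_000)]
means_of_50 = [statistics.fmean(berry_volume() for _ in range(50))
               for _ in range(20_000)]

skew_single = skewness(single)
skew_means = skewness(means_of_50)
print(f"skewness of single volumes: {skew_single:.2f}")
print(f"skewness of means of 50:    {skew_means:.2f}")
```

The averages are much closer to symmetric than the individual volumes, which is the central limit theorem at work on the sample mean rather than on the raw data.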

## Re: No (1)

## stymy (1223496) | about a year ago | (#45408999)

## Re:Yes and no (1)

## umafuckit (2980809) | about a year ago | (#45409237)

As you say, there is the Central Limit Theorem (a whole bunch of them actually) that says that the Normal distribution is the asymptotic limit that describes unbelievably many averaging processes.

So it gives you a very valid excuse to assume that the value distribution of some quantity occurring in nature will follow a Normal distribution when you know nothing else about it.

If your sample distribution is non-normal and you're using tests that assume normality then you're fucked regardless of the central limit theorem. Anyway, the central limit theorem tells you that the means of repeated samples will be normally distributed, but this isn't usually what you're applying your test to. You're usually applying your test to a single sample and that may well be very non-normal (which is the point of the central limit theorem).

## Re:Yes and no (0)

## Anonymous Coward | about a year ago | (#45409907)

Actually, for very large n, the *estimators* of each one are approximately normal. You can show it quite easily with a first-order Taylor expansion justified by the law of large numbers, which says that the probability mass of the estimator is concentrated enough that the linear expansion is valid. A linear function of a normal random variable is still normal.

However, this only works when the variance is asymptotically zero.
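The Taylor-expansion argument is usually called the delta method. A sketch, assuming i.i.d. observations with mean $\mu$ and finite variance $\sigma^2$, sample mean $\bar{X}_n$, and a function $g$ that is differentiable at $\mu$ with $g'(\mu) \neq 0$ (these symbols are introduced here for illustration and do not appear in the comment):

```latex
g(\bar{X}_n) \approx g(\mu) + g'(\mu)\,(\bar{X}_n - \mu)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\bar{X}_n) - g(\mu)\bigr)
\xrightarrow{d} \mathcal{N}\!\bigl(0,\; g'(\mu)^2\,\sigma^2\bigr)
```

The linear approximation is good because the variance of $\bar{X}_n$ is $\sigma^2/n$, which shrinks to zero as $n$ grows.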

## Re:That book about the bell curve (2)

## Daniel Dvorkin (106857) | about a year ago | (#45408945)

The CLT is one of the most elegant and powerful results in all of mathematics, and can be used, quite appropriately, to justify normal models for all sorts of measurements. That being said, its usefulness has led to the dumbed-down idea of "the bell curve" being *the* appropriate model for all sorts of things where it's clearly not--I don't know how many times I've seen a normal curve superimposed on a histogram or kernel density estimation of data that are clearly non-normal. As another poster pointed out, there are simple and well-understood tests for normality, and failure to apply them when constructing a normal model is just ridiculous.

## Re:That book about the bell curve (0)

## Anonymous Coward | about a year ago | (#45407229)

That, and the fact that all of statistics is a joke. It's all based on the assumption that data is distributed in a bell curve. Sure, a bell curve does fit a lot of data, but we blindly assume it fits everything which just can't be true.

We do not assume everything fits a bell curve.

STA-101: When using a normal curve, there needs to be a good reason for it.

In many cases, that good reason is the Central Limit Theorem [wikipedia.org].

## Re:That book about the bell curve (2, Informative)

## Anonymous Coward | about a year ago | (#45407293)

Statistics does not, by any means, make that assumption. If it did, the entire field of statistics would have been completed by 1810.

Mediocre (actually, sub-mediocre) practitioners of statistics make that assumption.

It is true that many *estimators* tend to a normal distribution as the sample size gets large, but this is not the same as assuming that the *data itself* comes from the normal distribution.

## Re:That book about the bell curve (2)

## Entropius (188861) | about a year ago | (#45407359)

No, statisticians certainly do not assume that. If everything in my field were normally distributed then my life would be a lot easier, but it's not, and we're aware that it's not.

## Re:That book about the bell curve (0)

## Anonymous Coward | about a year ago | (#45408281)

## Doubt it (1)

## Anonymous Coward | about a year ago | (#45407183)

Doubt it makes a difference; the root of these problems is systematic errors.

## Re:Doubt it (0)

## Anonymous Coward | about a year ago | (#45407265)

That's not a very scientific response... Doubt.

The real question is:

What P value did he use to come to this conclusion?

## Five Sigma or Bust (2)

## upmufa (702569) | about a year ago | (#45407233)

## Re:Five Sigma or Bust (3, Interesting)

## mysidia (191772) | about a year ago | (#45407289)

Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.

Of proof yes... that makes sense.

Other fields should probably use a threshold of 0.005 or 0.001.

If they move to five sigma... 2013 might be the last year that scientists get to keep their jobs.

What are you supposed to do if no research in any field is admissible because the bar is so high that no one can meet it, even with meaningful research?
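For reference, the five-sigma figure quoted in this comment can be reproduced from the normal tail. A minimal sketch, assuming a standard normal test statistic; the comment does not say which tail convention it has in mind, so both are shown.

```python
import math

def two_sided_p(n_sigma):
    """P(|Z| >= n_sigma) for a standard normal Z."""
    return math.erfc(n_sigma / math.sqrt(2))

def one_sided_p(n_sigma):
    """P(Z >= n_sigma) for a standard normal Z."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

for k in (2, 3, 5):
    print(f"{k} sigma: two-sided p = {two_sided_p(k):.2e}, "
          f"one-sided p = {one_sided_p(k):.2e}")
```

The two-sided value at five sigma is close to the 0.0000006 quoted; the one-sided value is half of that.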

## Re:Five Sigma or Bust (3, Insightful)

## Will.Woodhull (1038600) | about a year ago | (#45407703)

Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations, and you had to carry a couple of spare nine volt batteries for the thing if you expected to keep it running through an afternoon lab period.

We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We *should* have come a long way since then. But have we?

## Re:Five Sigma or Bust (1)

## Kjella (173770) | about a year ago | (#45408117)

Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We should have come a long way since then. But have we?

In engineering? Yes [rt.com]. Science? Well...

## Re:Five Sigma or Bust (2, Insightful)

## Anonymous Coward | about a year ago | (#45408871)

Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations

Um, the issue is not that it is difficult to calculate P-values less than 0.05. Obtaining a low p-value requires either a better signal to noise ratio in the effect you're attempting to observe, or more data. Improving the signal to noise ratio is done by improving experimental design, removing sources of measurement error like rater reliability, measurement noise, covariates, etc. It should be done to the extent feasible, but you can't wave a magic wand and say "computers" to fix it. Likewise, data collection is also expensive, and if you have to have an order of magnitude more subjects, it will substantially raise the cost of doing research.

There does exist a tradeoff between research output and research quality. It may be (I think so at least) that we ought to push the bar a bit toward quality over quantity, but there is a cost. In the extreme, we might miss out on many discoveries because we could only afford the time and cost of going after a handful of sure things.
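To put a rough number on the data-collection cost, here is a sketch using the textbook normal approximation for a two-sided z-test: for a fixed effect size and noise level, the required sample size scales with (z_{1-alpha/2} + z_{power})^2. The 80% power target and the function name are assumptions chosen for illustration.

```python
from statistics import NormalDist

def relative_sample_size(alpha, power=0.8):
    """Sample size, up to a constant factor, for a two-sided z-test.

    For a fixed effect size and noise level, the required n is proportional
    to (z_{1-alpha/2} + z_{power})^2 under the usual normal approximation.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

baseline = relative_sample_size(0.05)
# The last threshold is roughly five sigma, two-sided.
for alpha in (0.05, 0.005, 0.001, 5.7e-7):
    ratio = relative_sample_size(alpha) / baseline
    print(f"alpha = {alpha:g}: about {ratio:.1f}x the subjects needed at 0.05")
```

Under these assumptions, moving from 0.05 to 0.005 costs roughly 1.7 times as many subjects, which is a real cost but well short of an order of magnitude; five sigma is several times more expensive again.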

## Re:Five Sigma or Bust (3, Insightful)

## umafuckit (2980809) | about a year ago | (#45409273)

We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

But that stuff isn't the limiting factor. The limiting factor is usually getting enough high quality data. In certain fields that's very hard because measurements are hard or expensive to make and the signal to noise is poor. So you do the best you can. This is why criteria aren't tighter now than before: because stuff at the cutting edge is often hard to do.

## Re:Five Sigma or Bust (0)

## Anonymous Coward | about a year ago | (#45408035)

Five sigma probably isn't possible in the medical field for example. What sample size would you need to use to get that?

## Re:Five Sigma or Bust (1)

## LoRdTAW (99712) | about a year ago | (#45408153)

"What are you supposed to do; if no research in any field is admissible, because the bar is so high no one can meet it, even with meaningful research?"

James Cameron could reach the bar.

## Re:Five Sigma or Bust (1)

## mysidia (191772) | about a year ago | (#45408319)

James Cameron could reach the bar.

Hm... James Cameron is a deep-sea explorer and film director... he directed Titanic.

In what way does that make him a researcher who could be sure of meeting five sigma in all his research, even when infeasibly massive datasets would be required?

## Re:Five Sigma or Bust (0)

## mdsolar (1045926) | about a year ago | (#45407971)

## Scarcely productive (4, Interesting)

## fey000 (1374173) | about a year ago | (#45407255)

Such an admonishment is fine for the computational fields, where a few more permutations can net you a p-value of 0.0005 (assuming that you aren't crunching on a 4-month cluster problem). However, biological laborations are often very expensive and take a lot of time. Furthermore, additional tests are not always possible, since it can be damn hard to reproduce specific mutations or knockout sequences without altering the surrounding interactive factors.

So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45407355)

If the author's assertion is true and a P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writing from a more sober perspective. Of course, any such change would need to come with an academia culture change from the 'publish or perish' mindset.

## Re:Scarcely productive (4, Insightful)

## hawguy (1600213) | about a year ago | (#45407453)

If the author's assertion is true and a P value of 0.05 or less means that 17–25% of such findings are probably false, then what is the point of publishing the findings? Or at least come at the writing from a more sober perspective. Of course, any such change would need to come with an academia culture change from the 'publish or perish' mindset.

Because I'd rather use a drug found to be 75-83% effective at treating my disease than die while waiting for someone to come up with one that's 99.9% effective.

## Re:Scarcely productive (3, Informative)

## Anonymous Coward | about a year ago | (#45407515)

This is a fallacious understanding of p-value.

Something closer to (but still not quite) correct would be: that there is a 75-83% chance that the *claimed* efficacy of the drug is within the stated error bars. For example, there may be a 75-83% chance that the drug is between 15% and 45% effective at treating your disease.

That's much worse, isn't it?

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45407921)

And more importantly, a 17-25% chance that it's completely ineffective, no better than a placebo.

## Re:Scarcely productive (3, Interesting)

## hawguy (1600213) | about a year ago | (#45407973)

And more importantly, a 17-25% chance that it's completely ineffective, no better than a placebo.

My sister went through 4 different drugs before she found one that made her condition better. One made her (much) worse.

Yet she likely wouldn't be alive today if none of those 4 drugs worked.

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45407539)

## Re:Scarcely productive (1)

## evilviper (135110) | about a year ago | (#45408133)

The problem comes when you're treating a non-life-threatening ailment with a drug that turns out to:

1) Not help at all, ever.

2) Have other, life-threatening side effects.

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45407551)

It's not an assertion, it's basic math--0.05 is 20%. If the chance of any result being found by chance is 20%, it stands to reason that about 20% of all results were found by chance and, therefore, not to be expected to be reproducible.

The author isn't the first to point this out. There was a really good paper published about 5 years ago which delved much deeper into our approaches to basic science, especially in biology and the social sciences.

For example, we assume that if a hypothesis holds up that we've learned something substantive about the structure of the particular system being studied. But some systems might be riddled with a huge number of coincidental and misleading relationships, so that even if a study is reproducible it may only add noise to the field. This is why so many results are a dead end. So even by switching to a tiny P-value, you still haven't necessarily improved the productivity of the field. In fact, you may drive researchers to chase even narrower hypotheses which are more likely to be valid but nonetheless worthless.

## Re:Scarcely productive (1)

## petermgreen (876956) | about a year ago | (#45407947)

If the chance of any result being found by chance is 20%, it stands to reason that about 20% of all results were found by chance and, therefore, not to be expected to be reproducible.

Statistical significance levels only tell us about the chance of a study producing a false positive; they say nothing about the chance of a study producing a true positive.

So if the chance of a true positive is low, then the false positives could easily outnumber the true positives.
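The arithmetic behind that last point can be sketched in a few lines. The prior fractions and the 80% power figure below are illustrative assumptions, not numbers from the article; the function name is made up for this example.

```python
def false_positive_share(prior_true, alpha=0.05, power=0.8):
    """Fraction of 'significant' results that are false positives.

    prior_true: fraction of tested hypotheses that are actually true (assumed).
    alpha:      false-positive rate per test of a false hypothesis.
    power:      chance that a test of a true hypothesis comes out significant.
    """
    true_pos = prior_true * power
    false_pos = (1 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01):
    share = false_positive_share(prior)
    print(f"{prior:.0%} of hypotheses true -> {share:.0%} of positives are false")
```

With only 1 in 100 tested hypotheses true, most of the "significant" results are false even though every individual test used the 0.05 threshold correctly.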

## Re:Scarcely productive (1)

## colinrichardday (768814) | about a year ago | (#45408251)

It's not an assertion, it's basic math--0.05 is 20%.

0.05 is 2%, not 20%

## oops, my bad (1)

## colinrichardday (768814) | about a year ago | (#45408289)

Ahhh!! it's 1/20, not two percent. Of course, it's 5%.

## Re:Scarcely productive (1)

## theqmann (716953) | about a year ago | (#45408295)

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45407823)

## Re:Scarcely productive (1)

## Anonymous Coward | about a year ago | (#45409157)

After three decades working in a National Laboratory, and after having been involved in several fundamental discoveries, I just have to ask:

What the hell is a "laboration"? Is it a new made-up word to go along with what frequently appears now to be just made-up science?

## Re:Scarcely productive (0)

## Anonymous Coward | about a year ago | (#45409471)

What you said is a fine representation of the problem: most scientists that have no mathematical background (and even many that should have) don't understand what they're doing in classical hypothesis testing.

## Economic Impact (3, Insightful)

## Anonymous Coward | about a year ago | (#45407269)

Truth is expensive.

## Re:Economic Impact (1)

## Anonymous Coward | about a year ago | (#45408505)

Truth is expensive.

Not as expensive as ignorance.

## Not first post (-1)

## Anonymous Coward | about a year ago | (#45407271)

This article is statistically a waste of time, and certainly Irreproducibility is implicated with some other concept involving 6 or 7 syllable words that most certainly attract freshmen geeks with bowl haircuts

## Not going to happen (4, Insightful)

## Anonymous Coward | about a year ago | (#45407281)

If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.

## Re:Not going to happen (3, Insightful)

## Anubis IV (1279820) | about a year ago | (#45407623)

...and nothing of value would be lost. Seriously, have you read the papers coming from that 90% of journals and conference proceedings outside of the big ones in $field_of_study? The vast majority of them suck, have extraordinarily low standards, and are oftentimes barely readable. There's a reason why the major conferences/journals that researchers actually pay attention to routinely turn away between 80-95% of papers being submitted: it's because the vast majority of research papers are unreadable crap with marginal research value being put out to bolster someone's published paper count so that they can graduate/get a grant/attain tenure.

If the lesser 90% of journals/conferences disappeared, I'd be happy, since it'd mean wading through less cruft to find the diamonds. I still remember doing weekly seminars with my research group in grad school, where we'd get together and have one person each week present a contemporary paper. Every time one of us tried to branch out and use a paper from a lesser-known conference (this was in CS, where the conferences tend to be more important than the journals), we ended up regretting it, since they were either full of obvious holes, incomplete (I once read a *published* paper that had empty Data and Results sections...just, nothing at all, yet it was published anyway), or relied on lots of hand-waving to accomplish their claimed results. You want research that's worth reading, you stick to the well-regarded conferences/journals in your field, otherwise the vast majority of your time will be wasted.

## Re:Not going to happen (0)

## Anonymous Coward | about a year ago | (#45407753)

The p < 0.05 is the standard for *statistically* meaningful results, not *scientifically* meaningful results. You know, the standard 'correlation is not causation' sort of thing.

One of the many difficulties we have in the sciences is the difficulty in publishing studies showing a *lack* of statistical effect, as is mentioned in the Nature blurb. I don't often see the results of power tests, used to avoid Type II errors, but those would be welcome additions to results sections. The main Nature article does not mention the use of power tests for some reason, and if I were a statistician I might have an inkling why that is.

## Re:Not going to happen (0)

## Anonymous Coward | about a year ago | (#45407959)

Awesome. Then, maybe instead of judging scientists based on the volume of papers they have published, we could judge them based on the quality of their research.

## Interpretation of the 0.05 threshold (5, Insightful)

## Michael Woodhams (112247) | about a year ago | (#45407325)

Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.

If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.

Obligatory XKCD [xkcd.com]

## Re:Interpretation of the 0.05 threshold (1)

## Black Parrot (19622) | about a year ago | (#45408245)

Obligatory XKCD [xkcd.com]

FWIW, tests like the Tukey HSD ("Honestly Significant Difference") are designed to avoid that problem.

I suspect that's how the much-discussed "Jupiter Effect" for astrology came about: Throw in a big pile of names and birth signs, turn the crank, and watch a bogus correlation pop out.
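The "turn the crank" problem is easy to quantify. The sketch below assumes independent tests with no real effect anywhere, and it uses the simple Bonferroni correction as a stand-in; that is a different and cruder method than Tukey's HSD, shown only because it fits in two lines.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive in n independent null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

for n in (1, 12, 20):
    print(f"{n:2d} tests: P(at least one false positive) = {familywise_error(n):.2f}, "
          f"Bonferroni per-test threshold = {bonferroni_alpha(n):.4f}")
```

Twelve birth signs or twenty jelly bean colours are enough to make a spurious "hit" at 0.05 more likely than not unless the threshold is adjusted.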

## Re:Interpretation of the 0.05 threshold (1)

## theqmann (716953) | about a year ago | (#45408329)

## Re:Interpretation of the 0.05 threshold (1)

## Anonymous Coward | about a year ago | (#45408989)

Doesn't a p-value of 0.05 mean that 95% of your data samples (2 sigma) support the hypothesis? Wouldn't 1 sigma be more of a "suggestive" level? 95% seems pretty good.

The best way I've found to understand p-values is to consider the situation where you have an experiment. You're attempting to observe an effect, but in reality there is no effect to observe and all you're seeing are random fluctuations. If your criterion for declaring you've observed your effect is a p-value below 0.05, it means you will be convinced you've seen something there that isn't really there one time in twenty. Can you imagine if you had a one in twenty chance of believing a traffic light was green when it was in fact red? I think "suggestive" is an appropriate label for that level of confidence.
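That thought experiment can be run directly. A minimal sketch, assuming each "experiment" is 30 draws from a Normal(0, 1) with a known standard deviation, so a plain z-test applies; the sample size and trial count are arbitrary choices.

```python
import math
import random

def null_experiment_p(n=30):
    """Two-sided z-test p-value for data with no real effect (true mean 0, sd 1)."""
    sample_mean = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)  # sd is known to be 1, so this is a z-score
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(2)  # fixed seed so the run is repeatable
trials = 20_000
hits = sum(null_experiment_p() < 0.05 for _ in range(trials))
false_alarm_rate = hits / trials
print(f"'significant' results with no effect present: {false_alarm_rate:.3f}")
```

The fraction of false alarms lands near one in twenty, as the comment describes.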

## Obligatory XKCD (2, Funny)

## Anonymous Coward | about a year ago | (#45407327)

http://xkcd.com/882/

## A universal standard for significance... (3, Insightful)

## Anonymous Coward | about a year ago | (#45407369)

Authors need to read this: http://www.deirdremccloskey.com/articles/stats/preface_ziliak.php

It explains quite clearly why a p value of 0.05 is a fairly arbitrary choice, as it cannot possibly be the standard for every possible study out there. Or, to put it another way, be very skeptical when one sole number (namely 0.05) is supposed to be a universal threshold to decide on the significance of all possible findings, in all possible domains of science. The context of any finding still matters for its significance.

## Student's T-test (1)

## The Real Dr John (716876) | about a year ago | (#45407407)

## Re:Student's T-test (1)

## Will.Woodhull (1038600) | about a year ago | (#45407761)

The bad news is that it is getting harder and harder to sort the science reported in journals from the papers whose purpose is to generate or preserve revenue streams for the researchers (or the corporations for which they are agents).

## Re:Student's T-test (1)

## The Real Dr John (716876) | about a year ago | (#45408455)

## Impossible! (-1, Troll)

## ralphbecket (225429) | about a year ago | (#45407487)

Is the author mad? p < 0.05 would completely invalidate climate models! That simply can't be true, ergo p >= 0.05 is absolutely necessary in (post-normal) science.

## Re: Impossible! (1)

## KeensMustard (655606) | about a year ago | (#45408219)

## Re: Impossible! (1)

## ralphbecket (225429) | about a year ago | (#45409311)

Climate models are currently, at best, when treated as an ensemble (if you buy that as legitimate), skirting along the p 0.05 level of significance in the validation period.

Pointing this out is considered trolling -- it probably offends some religious sensibilities.

Tightening the threshold as the article suggests would mean the model results are not "significant" (i.e., not reasonably distinguishable from natural variation -- note that I am not a "denier" and that I do accept that CO2 is a greenhouse gas etc. etc.; I am however hugely skeptical of most climate and environmental science that I have investigated).

## The Economist just had an article on this (2)

## Beeftopia (1846720) | about a year ago | (#45407491)

Unreliable research

Trouble at the lab

Scientists like to think of science as self-correcting. To an alarming degree, it is not

Oct 19th 2013 |From the print edition

The Economist

First, the statistics, which if perhaps off-putting are quite crucial. Scientists divide errors into two classes. A type I error is the mistake of thinking something is true when it is not (also known as a “false positive”). A type II error is thinking something is not true when in fact it is (a “false negative”). When testing a specific hypothesis, scientists run statistical checks to work out how likely it would be for data which seem to support the idea to have come about simply by chance. If the likelihood of such a false-positive conclusion is less than 5%, they deem the evidence that the hypothesis is true “statistically significant”. They are thus accepting that one result in 20 will be falsely positive—but one in 20 seems a satisfactorily low rate.

In 2005 John Ioannidis, an epidemiologist from Stanford University, caused a stir with a paper showing why, as a matter of statistical logic, the idea that only one such paper in 20 gives a false-positive result was hugely optimistic. Instead, he argued, “most published research findings are probably false.” As he told the quadrennial International Congress on Peer Review and Biomedical Publication, held this September in Chicago, the problem has not gone away.

Dr Ioannidis draws his stark conclusion on the basis that the customary approach to statistical significance ignores three things: the “statistical power” of the study (a measure of its ability to avoid type II errors, false negatives in which a real signal is missed in the noise); the unlikeliness of the hypothesis being tested; and the pervasive bias favouring the publication of claims to have found something new.

http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble [economist.com]
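The interplay of statistical power and the unlikeliness of the hypothesis can be put into a short calculation. This is a minimal sketch, not the method of the paper or the article: it assumes a fraction `prior` of tested hypotheses are actually true, ignores publication bias entirely, and uses illustrative numbers (10% true hypotheses, 80% power) that are assumptions of this example.

```python
def positive_predictive_value(prior, power, alpha):
    """Share of 'significant' results that reflect a real effect.

    prior: assumed fraction of tested hypotheses that are true
    power: probability of detecting a real effect (1 - type II error rate)
    alpha: significance threshold (type I error rate)
    """
    true_pos = power * prior            # real effects that are detected
    false_pos = alpha * (1.0 - prior)   # null effects that pass the threshold
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prior=0.10, power=0.80, alpha=0.05)
print(f"share of significant results that are real: {ppv:.2f}")
```

With these assumed inputs the share comes out at 0.64, so about a third of the "significant" findings would be false even before any bias is considered. Lower power or a less plausible hypothesis pushes the share down further.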

## d'oh (1)

## Iniamyen (2440798) | about a year ago | (#45407603)

## What is a p Value? (1)

## mrsquid0 (1335303) | about a year ago | (#45407617)

A significant problem is that many of the people who quote p values do it without understanding what a p value actually means. Getting p = 0.05 does not mean that there is only a 5% chance that the model is wrong. That is one of the fundamental misunderstandings in statistics, and I suspect that it is behind a lot of the cases of scientific irreproducibility.

## Too little too late. (-1)

## Anonymous Coward | about a year ago | (#45407621)

Too many scientists in fields like geophysics and biology and other fields have been implicated in too many coverups and hoaxes with regards to liberal frauds like evolution and global warming. At this point the public has become almost completely opposed to almost all science and instead are relying on engineering to move society forward. Ultimately the only way to avoid this kind of problem is to appoint some layperson judges who can use simple common sense to pick out those scientific theories that should be funded and those that should not, rather than allowing corrupt liberal scientists to pick and choose their own pet agendas at taxpayer expense.

## I defer to Feynman (2)

## xski (113281) | about a year ago | (#45407745)

## Re:I defer to Feynman (-1)

## Anonymous Coward | about a year ago | (#45407943)

I like Feynman as much as the next geek but he's frighteningly wrong here. Wrong in the way that very many people on the hard science spectrum tend to be. The assumption that not working with numbers all of the time invalidates what you're doing as a pursuit.

Feynman saying social science isn't a science because it has no laws. Preposterous.

## The real issue (5, Interesting)

## Okian Warrior (537106) | about a year ago | (#45407845)

Okay, here's the real problem with scientific studies.

All science is data compression, and all studies are intended to compress data so that we can make future predictions. If you want to predict the trajectory of a cannonball, you don't need an almanac cross referencing cannonball weights, powder loads, and cannon angles - you can calculate the arc to any desired accuracy with a set of equations that fit on half a page. The half-page compresses the record of all prior experience with cannonball arcs, and allows us to predict future arcs.

Soft science studies typically make a set of observations which relate two measurable aspects. When plotted, the data points suggest a line or curve, and we accept the linear-regression (line or polynomial) as the best approximation for the data. The theory being that the underlying mechanism *is* the regression, and unrelated noise in the environment or measurement system causes random deviations of observation.

This is the wrong method. Regression is based on minimizing squared error, which was chosen by Laplace for no other reason than that it is easy to calculate. There's lots of "rationalization" explanations of why it works and why it's "just the best possible thing to do", but there's no fundamental logic that can be used to deduce least squares from fundamental assumptions.

Least squares introduces several problems:

1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org] ).

2) There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.

3) The resultant function frequently looks "off" to the human eye, humans can frequently draw better matching curves; meaning: curves which better predict future data points.

4) There is no way to measure the predictive value of the results. Linear regression will *always* return the best line to fit the data, even when the data is random.

The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell *how* relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.

Scientific studies should have a threshold of "compresses data by N bits", rather than "1-in-20 of all studies are due to random chance".

## Re:The real issue (1)

## colinrichardday (768814) | about a year ago | (#45408665)

1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org])

Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

## Re:The real issue (1)

## Okian Warrior (537106) | about a year ago | (#45409047)

Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

There's no algorithm that will identify the outliers in this example [dropbox.com] .

But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

What value of correlation coefficient distinguishes pattern data from random data in this image [wikimedia.org] ?

## Re:The real issue (1)

## colinrichardday (768814) | about a year ago | (#45409535)

There's no algorithm that will identify the outliers in this example [dropbox.com].

So there's no algorithm for comparing observed values to modeled (predicted) values? The absolute value of the difference between the two can't be calculated? Hmm. . .

What value of correlation coefficient distinguishes pattern data from random data in this image [wikimedia.org]?

Are the data in that image random? Also, the data without the four points at the bottom would have a higher correlation coefficient.

## Re:The real issue (1)

## colinrichardday (768814) | about a year ago | (#45409819)

Also, you may want to account for the difference between the *x* coordinate of the point and the average of the *x*s, as having an *x* coordinate far from the mean contributes to being farther away from the regression line.

## Re:The real issue (0)

## Anonymous Coward | about a year ago | (#45409095)

1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org])

Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

Outliers are often so extreme and rare that despite being statistically unbiased, they nevertheless severely skew statistics which aren't robust to them.

4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

Even random data will show significant correlation coefficients at the rate determined by the p-value threshold for significance (typically 0.05). A set of random data with a significant correlation coefficient is indistinguishable from a genuine correlation.

The whole point of statistics is not to give us any certainty as to the validity of conclusions. Certainty is the one thing statistics can never provide. Rather, it's to keep people from throwing up their hands when presented with noisy data that doesn't lie exactly on a line or parabola or whatever. It gives us a knob to turn to control the tradeoff between the ability to discover new knowledge and the risk of misleading ourselves into believing things that aren't true.

## Re:The real issue (1)

## colinrichardday (768814) | about a year ago | (#45409489)

Outliers are often so extreme and rare that despite being statistically unbiased, they nevertheless severely skew statistics which aren't robust to them.

If outliers are unbiased, they can affect the results, but how can they *skew* the results? Also, if they're rare, how much effect can they have?

A set of random data with a significant correlation coefficient is indistinguishable from a genuine correlation.

Not on a scatterplot. It's pretty clear how close the data are to the line. Also, how probable is it that random data would have a statistically significant correlation coefficient?

## Re:The real issue (0)

## Anonymous Coward | about a year ago | (#45409073)

Some of the things you mentioned are good reasons to use Robust Statistics
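As a rough illustration of what a robust estimator buys, the sketch below compares the ordinary least-squares slope with the Theil-Sen slope (the median of all pairwise slopes) on made-up data containing one gross outlier. Theil-Sen is just one example of a robust method, chosen here because it fits in a few lines; the data and function names are assumptions of this example, not something from the thread.

```python
from itertools import combinations
from statistics import mean, median

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points (Theil-Sen estimator)."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)

# Made-up data: y = x exactly, except for one corrupted reading at the end.
xs = list(range(10))
ys = [float(x) for x in xs]
ys[-1] = 100.0

print(f"least-squares slope: {ols_slope(xs, ys):.2f}")
print(f"Theil-Sen slope:     {theil_sen_slope(xs, ys):.2f}")
```

On this data a single bad point drags the least-squares slope far from 1, while the median of pairwise slopes stays at 1 because most pairs do not involve the outlier.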

## Re:The real issue (1)

## stymy (1223496) | about a year ago | (#45409081)

## Variance of error is not what we want (1)

## Okian Warrior (537106) | about a year ago | (#45409757)

Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously.

This is the wrong place for an argument (you want room 12-A [youtube.com] ) and I don't want to get into a contest, but for illustration here is the problem with this explanation.

A rule learned from experience should minimize the *error*, not the *variance* of error.

It's a valid conclusion from the mathematics, but based on a faulty assumption.

## An example (2)

## Michael Woodhams (112247) | about a year ago | (#45408193)

Having quickly skimmed the paper, I'll give an example of the problem.

I couldn't quickly find a real data set that was easy to interpret, so I'm going to make up some data.

Chance to die before reaching this age

| Age | woman | man |
|-----|-------|-----|
| 80  | .54   | .65 |
| 85  | .74   | .83 |
| 90  | .88   | .96 |
| 95  | .94   | .98 |

We have a person who is 90 years old. Taking the null hypothesis to be that this person is a man, we can reject the hypothesis that this is a man with greater than 95 percent confidence (p=0.04). However, if we do a Bayesian analysis assuming prior probabilities of 50 percent for the person being a man or a woman, we find that there is a 25 percent chance that the person is a man after all (as women are 3 times more likely to reach age 90 than men are.)
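The 25 percent figure follows from Bayes' rule applied to the made-up numbers in the post (a chance of dying before 90 of .96 for men and .88 for women). The sketch below just reproduces that arithmetic; the function name is hypothetical and the 50/50 prior is the one assumed in the post.

```python
def posterior_man(p_reach_given_man, p_reach_given_woman, prior_man=0.5):
    """P(man | reached this age) via Bayes' rule."""
    num = p_reach_given_man * prior_man
    den = num + p_reach_given_woman * (1.0 - prior_man)
    return num / den

# Chance of reaching 90 = 1 - chance of dying before 90 (made-up data from the post).
p_man = posterior_man(p_reach_given_man=1 - 0.96, p_reach_given_woman=1 - 0.88)
print(f"P(man | reached 90) = {p_man:.2f}")
```

So a result that rejects "this is a man" at p=0.04 still leaves a one-in-four chance that the person is a man, given the assumed equal prior.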

(Having 11 percent signs in my post seems to have given /. indigestion so I've had to edit them out.)

## Re:An example (0)

## Anonymous Coward | about a year ago | (#45409181)

Unfortunately, that's not quite right, but it's in the spirit of the right way of thinking at least. If you know a person is 90 years old, then that is prior information that could be used in a Bayesian estimate of the gender of your mystery person. Given that the person is 90, we could say that 12 percent of women, and 4 percent of men make it to that age. This gives the person a 25 percent chance of being a man, and a 75 percent chance of being a woman.

If you ignore the Bayesian prior probabilities, you don't have much of an example of anything. You could measure the gender of the population and conclude that on average 50 percent (maybe not quite) of people are male, which would not be anywhere near sufficient evidence to conclude anything about a particular person.

## Well, duh. (1)

## Black Parrot (19622) | about a year ago | (#45408217)

Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).


Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - *by definition*?

More interesting, IMO, is that statistical significance doesn't tell you what the *scale* of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

## Re:Well, duh. (1)

## ColdWetDog (752185) | about a year ago | (#45408409)

There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

This is especially prevalent in medicine (especially drug advertising). If you look at medical journals, they are replete with ads touting that Drug A is 'statistically better' than Drug B. Even looking at the 'best case' data (the pretty graph in the advert) you quickly see that the lines very nearly converge. Statistically significant. Clinically insignificant.

Lies, Damned Lies and Statistics
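The gap between statistical and clinical significance is easy to reproduce numerically. The sketch below assumes a two-sample z-test with a known common standard deviation and a fixed observed difference of 0.01 standard deviations, a difference chosen for this example to be practically negligible. All names and numbers are illustrative.

```python
import math

def two_sample_z_p(diff, sd, n_per_group):
    """Two-sided p-value for an observed mean difference between two equal-size groups."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny difference, very different sample sizes.
p_small = two_sample_z_p(diff=0.01, sd=1.0, n_per_group=100)
p_huge = two_sample_z_p(diff=0.01, sd=1.0, n_per_group=1_000_000)

print(f"n = 100 per group:       p = {p_small:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge:.2e}")
```

The effect is identical in both cases; only the sample size changes. With enough subjects even a difference nobody would care about clears any p-value threshold, which is why the size of the effect has to be reported alongside the p-value.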

## Re:Well, duh. (2)

## Mr. Slippery (47854) | about a year ago | (#45408953)

A P-value [wikipedia.org] of 0.05 means by definition that there is a 0.05, or 5%, or 1 in 20, probability that the result could be obtained by chance even though there's no actual relationship.

## Integrating to x sigma (1)

## mdsolar (1045926) | about a year ago | (#45408243)

Medicine has a formal means to end a trial early if a medicine turns out to be dangerous or particularly helpful. This is an ethical consideration. But, it does make the trial results void.

## not the real problem (3, Insightful)

## ganv (881057) | about a year ago | (#45408285)

## The bigger problem (1)

## msobkow (48369) | about a year ago | (#45408429)

The bigger problem is the habit of confusing correlation with cause.

## Let's get something straight you non-statisticians (4, Insightful)

## j33px0r (722130) | about a year ago | (#45408553)

This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.

"p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, let's say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results turn out that my hypothesis is correct then I have a 5 percent chance that the software is wrong. In that particular imaginary case, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.

Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.

Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," then the risk of committing a type II error increases. A type II error when determining what games 15 year olds like to play doesn't really matter if we are wrong but if we start talking about drugs and false negatives then the increased risk of a type II error really can make things ugly.

Next, there are guidelines for determining how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need, let's say, 35 subjects/people per treatment group at p=.05, whereas .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
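For a sense of how the required group size scales with alpha, here is a sketch using the standard normal-approximation formula for comparing two group means, n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2 per group. The effect size d = 0.5 and the power of 0.80 are assumptions picked for the example, so the outputs are not meant to match the rough figures quoted above; the real numbers depend heavily on both choices.

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power=0.80, effect_size=0.5):
    """Approximate subjects per group for a two-sided, two-sample comparison of means.

    effect_size is the assumed true difference in units of the standard deviation.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the significance threshold
    z_beta = z(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

sizes = {alpha: n_per_group(alpha) for alpha in (0.05, 0.005, 0.0005)}
for alpha, n in sizes.items():
    print(f"alpha = {alpha}: about {n} per group")
```

Whatever the inputs, the direction is the same as described above: holding power fixed, each tightening of alpha demands more participants, and smaller assumed effects inflate the counts much faster than smaller alphas do.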

Finally, there is a statistical concept of power, that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between the group A and group B can be assigned to the experimental treatment. This takes precedence in many people's minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.

Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.

## And has been so for, oh, 50 years? (1)

## zedrdave (1978512) | about a year ago | (#45409527)

Back when I was a wee researchling, this [berkeley.edu] is literally one of the first papers I was told to read and internalise (published 20 years ago, and not even particularly breakthrough at the time).

There is absolutely no need for new evidence or further discussion of the limitations of statistical testing thresholds: anybody who cares is keenly aware of them. People who don't (particularly in some areas of social science) are just looking for a way to get their next paper out the door by any means possible.

## Such as (0)

## ksemlerK (610016) | about a year ago | (#45409785)

## Hah, could smell it! (0)

## Anonymous Coward | about a year ago | (#45409795)

I always knew there was something wrong with their Pies...