Retire statistical significance (Retire SS)
Scientists rise up against statistical significance
(Translation: Google Translate / Chen Ligong)
Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.
Nature 567, 305-307 (2019). Published 20 March 2019. Picture source: V. Amrhein et al.
When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?
If your experience matches ours, there’s a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.
How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.
We have some proposals to keep scientists from falling prey to these misconceptions.
Pervasive problem
Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.
For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.
Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).
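The P values quoted above as "our calculation" can be approximately recovered from the reported risk ratios and interval limits. The sketch below assumes the log risk ratio is roughly normally distributed and that the limits are symmetric on the log scale. The function name `p_from_ratio_interval` is illustrative, not something from the original studies, and the rounded inputs mean the output is only approximate.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_interval(ratio, lower, upper, level=0.95):
    """Approximate two-sided P value for a null ratio of 1.

    Assumes the log ratio is normally distributed, so the standard
    error can be backed out from the width of the interval.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    se = (log(upper) - log(lower)) / (2 * z_crit)  # SE of the log ratio
    z = log(ratio) / se                            # distance from the null in SEs
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Later study: risk ratio 1.2, interval from 3% decrease to 48% increase
p_recent = p_from_ratio_interval(1.2, 0.97, 1.48)
# Earlier study: same risk ratio, interval from 9% to 33% increase
p_earlier = p_from_ratio_interval(1.2, 1.09, 1.33)

print(f"later study:   P = {p_recent:.3f}")
print(f"earlier study: P = {p_earlier:.4f}")
```

Both calls use the same point estimate of 1.2. Only the interval width differs, which is why the two P values fall on opposite sides of 0.05 even though the observed effects are identical.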
It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see ‘Beware false conclusions’).
These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).
In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.
We agree, and call for the entire concept of statistical significance to be abandoned.
We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a “surgical strike against thoughtless testing of statistical significance” and “an opportunity to register your voice in favour of better scientific practices”.
We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis5.
Quit categorizing
The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.
Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.
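The claim that selecting on significance biases estimates can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: every study estimates the same true effect (`TRUE_EFFECT`) with the same standard error (`STD_ERROR`), and a result counts as statistically significant when its absolute z-score exceeds 1.96. None of these numbers come from the article.

```python
import random
from statistics import mean

random.seed(0)

TRUE_EFFECT = 0.2   # hypothetical true effect shared by all studies
STD_ERROR = 0.1     # hypothetical standard error of each estimate
Z_CRIT = 1.96       # conventional two-sided 5% threshold

# Each simulated study yields one unbiased, normally distributed estimate.
estimates = [random.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(50_000)]

# Split the magnitudes by whether the study crossed the threshold.
significant = [abs(e) for e in estimates if abs(e) / STD_ERROR > Z_CRIT]
non_significant = [abs(e) for e in estimates if abs(e) / STD_ERROR <= Z_CRIT]

mean_all = mean(estimates)
mean_sig = mean(significant)
mean_nonsig = mean(non_significant)

print(f"true effect:                   {TRUE_EFFECT:.3f}")
print(f"mean of all estimates:         {mean_all:.3f}")
print(f"mean magnitude, significant:   {mean_sig:.3f}")
print(f"mean magnitude, non-signific.: {mean_nonsig:.3f}")
```

The full set of estimates averages close to the true effect, but the two subsets do not: conditioning on the threshold pushes the significant group above it and the non-significant group below it.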
The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan9. This occurs even with the best of intentions.
Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.
One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.
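How much P values can differ between two exact replications can be worked out directly under a simple model. The sketch below assumes each study's test statistic is normal with unit variance and a mean chosen so that the two-sided test at 0.05 has 80% power. That normal model and the helper name `prob_p_below` are assumptions made for illustration.

```python
from statistics import NormalDist

norm = NormalDist()

ALPHA = 0.05
POWER = 0.80

# Mean of the z statistic that gives (almost exactly) 80% power at ALPHA.
z_alpha = norm.inv_cdf(1 - ALPHA / 2)
shift = z_alpha + norm.inv_cdf(POWER)


def prob_p_below(p):
    """Probability that the two-sided P value is below p when z ~ Normal(shift, 1)."""
    z = norm.inv_cdf(1 - p / 2)
    return (1 - norm.cdf(z - shift)) + norm.cdf(-z - shift)


power_check = prob_p_below(0.05)     # should be about 0.80
p_small = prob_p_below(0.01)         # chance a single study gets P < 0.01
p_large = 1 - prob_p_below(0.30)     # chance a single study gets P > 0.30

# Two independent replications: one lands below 0.01, the other above 0.30.
prob_discordant = 2 * p_small * p_large

print(f"power at 0.05:            {power_check:.3f}")
print(f"P(P < 0.01), one study:   {p_small:.3f}")
print(f"P(P > 0.30), one study:   {p_large:.3f}")
print(f"P(one < 0.01, one > 0.30): {prob_discordant:.3f}")
```

Under these assumptions the discordant pair turns up in a few percent of replication pairs, which is rare but far from impossible given how many replications are run.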
We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval7,10. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.
We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.
When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.
Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval. For example, the authors above could have written: ‘Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.’ Interpreting the point estimate, while acknowledging its uncertainty, will keep you from making false declarations of ‘no difference’, and from making overconfident claims.
Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application. And, as in the anti-inflammatory-drugs example, interval estimates can perpetuate the problems of statistical significance when the dichotomization they impose is treated as a scientific standard.
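To see how much depends on the choice of level, one can recompute the interval for the anti-inflammatory-drugs example at several levels. This sketch assumes a normal approximation on the log risk-ratio scale and backs out the standard error from the reported, rounded 95% limits (0.97 to 1.48), so the figures are approximate. The helper `ratio_interval` is illustrative.

```python
from math import exp, log
from statistics import NormalDist

RATIO, LOWER95, UPPER95 = 1.2, 0.97, 1.48

# Standard error of the log risk ratio, recovered from the 95% limits.
se = (log(UPPER95) - log(LOWER95)) / (2 * NormalDist().inv_cdf(0.975))


def ratio_interval(level):
    """Interval for the risk ratio at the given level (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log(RATIO) - z * se), exp(log(RATIO) + z * se)


intervals = {level: ratio_interval(level) for level in (0.80, 0.90, 0.95, 0.99)}

for level, (low, high) in intervals.items():
    print(f"{level:.0%} interval: {low:.3f} to {high:.3f}")
```

With these inputs the 90% interval sits just above a risk ratio of 1 while the 95% interval dips just below it. The data are the same in both cases, so a verdict of 'association' or 'no association' that flips with the level reflects the convention, not the evidence.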
(Translator's note on the paragraph above: in the translator's view, the default 95% is not arbitrary but is chosen so that results are sufficiently well supported, and the idea described as ‘false’ is more precisely a matter of operational convenience.)
Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty7,8,10. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results.
Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.
The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.
What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them, for example by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13), without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.
Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go.
References
1. Fisher, R. A. Nature 136, 474 (1935).
2. Schmidt, M. & Rothman, K. J. Int. J. Cardiol. 177, 1089–1090 (2014).
3. Wasserstein, R. L., Schirm, A. & Lazar, N. A. Am. Stat. https://doi.org/10.1080/00031305.2019.1583913 (2019).
4. Hurlbert, S. H., Levine, R. A. & Utts, J. Am. Stat. https://doi.org/10.1080/00031305.2018.1543616 (2019).
5. Lehmann, E. L. Testing Statistical Hypotheses 2nd edn 70–71 (Springer, 1986).
6. Gigerenzer, G. Adv. Meth. Pract. Psychol. Sci. 1, 198–218 (2018).
7. Greenland, S. Am. J. Epidemiol. 186, 639–645 (2017).
8. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Am. Stat. https://doi.org/10.1080/00031305.2018.1527253 (2019).
9. Gelman, A. & Loken, E. Am. Sci. 102, 460–465 (2014).
10. Amrhein, V., Trafimow, D. & Greenland, S. Am. Stat. https://doi.org/10.1080/00031305.2018.1543137 (2019).