Essays and Comments by Year:

Why "Redefining Statistical Significance" Will Not Improve Reproducibility and Might Make the Replication Crisis Worse

Harry Crane (2017).

Abstract

A recent proposal to "redefine statistical significance" (Benjamin et al., Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and that replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P<0.05 to P<0.005. I analyze the veracity of these claims, focusing especially on how Benjamin et al. neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for, the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) the claimed improvements to the false positive rate and replication rate in Benjamin et al. (2017) are exaggerated and misleading; (ii) there are plausible scenarios under which the lower cutoff will make the replication crisis worse.
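
To make the disagreement concrete, here is a minimal sketch of the standard false positive rate arithmetic underlying both the proposal and the critique. The code is my own illustration, not taken from Benjamin et al. or from the paper; the prior probability of a true effect, the power, and the number of P-hacking "tries" are all illustrative assumptions.

```python
def false_positive_rate(alpha, power=0.8, prior_true=0.1):
    """Fraction of 'significant' findings that are false positives,
    given significance level alpha, test power, and the prior
    probability that a tested hypothesis is a real effect."""
    false_pos = alpha * (1 - prior_true)
    true_pos = power * prior_true
    return false_pos / (false_pos + true_pos)

# The comparison with no P-hacking:
print(false_positive_rate(0.05), false_positive_rate(0.005))
# ~0.36 vs ~0.05: the headline improvement

# One crude way to model P-hacking: the effective alpha is inflated
# because researchers try several analyses until one 'works'.
def hacked_alpha(alpha, k=10):
    """Effective level after k independent tries at level alpha."""
    return 1 - (1 - alpha) ** k

print(false_positive_rate(hacked_alpha(0.05)),
      false_positive_rate(hacked_alpha(0.005)))
# ~0.82 vs ~0.35: under hacking, the stricter cutoff performs about
# as badly as the unhacked P<0.05 cutoff did to begin with
```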

Commentary by Andrew Gelman.

Response by one of the authors (E.J. Wagenmakers) on the Bayesian Spectacles blog.

More discussion by Tim van der Zee.

Comment on F. Caron and E.B. Fox. Sparse graphs using exchangeable random measures.

Harry Crane. (2017). Journal of the Royal Statistical Society, Series B, 79, Part 5.

Caron and Fox tout their proposal as

"the first fully generative and projective approach to sparse graph modelling [...] with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure." (p. 12, emphasis added).

In calling theirs the first such approach, the authors brush aside prior work of Barabasi and Albert (1999), whose model is also generative, projective, and produces sparse graphs. The Barabasi–Albert model is not exchangeable, but neither is the authors’. And while the Barabasi–Albert model is inadequate for most statistical purposes, the proposed model is not obviously superior, especially with respect to the highlighted criteria above.

Generative. Though the model is amenable to simulation, the obscure connection between Kallenberg's theory of exchangeable CRMs and the manner in which real networks form makes it hard to glean much practical insight from it. At least Barabasi and Albert's preferential attachment mechanism offers a clear explanation for how sparsity and power laws might arise in nature. I find no such clarity in the Caron–Fox model.
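
As an aside for readers unfamiliar with that mechanism, here is a minimal sketch of preferential attachment (an illustration of the Barabasi–Albert model only; nothing here comes from the Caron–Fox paper, and the parameters n and m are arbitrary):

```python
import random
from collections import Counter

def barabasi_albert(n, m=2, seed=0):
    """Grow a graph by preferential attachment: each new node
    attaches m edges to existing nodes chosen with probability
    proportional to their current degree."""
    rng = random.Random(seed)
    # start from a small complete seed graph on m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # list of edge endpoints with repetition; sampling uniformly
    # from it is equivalent to degree-proportional sampling
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend([new, t])
    return edges

edges = barabasi_albert(10_000)
degrees = Counter(v for e in edges for v in e)
print(len(edges) / len(degrees))    # ~m: edges grow linearly (sparsity)
print(sorted(degrees.values())[-5:])  # a few hubs with very large degree
```

Running this for a few thousand nodes exhibits both properties at issue: the edge count grows linearly in the node count (sparsity), and a handful of early nodes accumulate very large degrees (the heavy tail).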

Projective. Projectivity is important for relating the observed network to the unobserved population, and is therefore crucial in applications for which inferences extend beyond the sample. Without a credible sampling interpretation, however, the statistical salience of projectivity is moot. Here projectivity involves restricting point processes in the real plane to bounded rectangles, whose best-known interpretation, via p-sampling (Veitch and Roy, 2016), seems unnatural for most conceivable network applications, including those in Section 8.
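
To fix ideas, here is a toy rendering of the sampling scheme in question; the function name and retention rule are my own illustrative assumptions, loosely following the vertex-level p-sampling idea of Veitch and Roy (2016), not code from either paper:

```python
import random

def p_sample(edges, p, seed=0):
    """Keep each vertex independently with probability p and
    return the induced edge set; vertices that lose all their
    edges simply vanish from the sample."""
    rng = random.Random(seed)
    vertices = {v for e in edges for v in e}
    kept = {v for v in vertices if rng.random() < p}
    return [e for e in edges if e[0] in kept and e[1] in kept]

population = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
print(p_sample(population, p=0.7))
```

The statistical question raised above is whether any real network dataset is plausibly collected this way.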

Exchangeability and Sparsity. A certain nonchalance about whether and how this model actually models real networks betrays an attitude that sees sparsity as an end in itself and exchangeability as a means to that end. Even the authors acknowledge that "exchangeability of the point process [...] does not imply exchangeability of the associated adjacency matrix" (p. 3). So why all the fuss about exchangeability if its primary role here is purely utilitarian? To me, the pervasiveness of "exchangeability" throughout the article is but a head fake for unsuspecting statisticians who, unlike many modern machine learners, understand that exchangeability is far more than a computational contrivance.

Final Comment. The Society’s Research Section is all too familiar with the Crane–Dempsey edge exchangeable framework, which meets all four of the above criteria while staying true to its intended application to interaction networks. For lack of space, I refer the reader to Crane and Dempsey (2015, 2016) for further discussion.

Comment on A. Gelman and C. Hennig. Beyond subjective and objective in statistics.

Harry Crane. (2017). Journal of the Royal Statistical Society, Series A, 180, Part 4.

I applaud the authors’ advocacy for subjectivity in statistical practice and appreciate the overall attitude of their proposal. But I worry that the proposed virtues will ultimately serve as a shield to deflect criticism, much as objectivity and subjectivity often do now. In other words, won’t acceptance of 'virtue' as a research standard be supplanted, in short order, by the "pursuit to merely appear" virtuous?

I believe Gelman and Hennig when they assert, "[W]e repeatedly encounter publications in top scientific journals that fall foul of these virtues" (p. 27). I’m less convinced, however, that this "indicates [...] that the underlying principles are subtle". This conclusion seems to conflate doing science with publishing science. In fact, I suspect that most scientists are more or less aware of these virtues, and many would agree that these virtues are indeed virtuous for doing science. But I’d expect those same scientists to acknowledge that some of these virtues may be regarded as vices in the publishing game. Just think of the lengths to which journals go to maintain the appearance of objectivity. They achieve this primarily through peer review, which promises transparency, consensus, and impartiality, three of Gelman and Hennig's 'virtues', but rarely delivers any of them. It should be no surprise that a system so obsessed with appearances also tends to reward research that 'looks the part'. As "communication is central to science" (p. 6) and publication is the primary means of scientific communication, is it any wonder that perverse editorial behaviors heavily influence which virtues are practiced and which are merely preached?

Finally, I ask: just as statistical practice is plagued by the "pursuit to merely appear objective", is science not also plagued by the pursuit to 'appear statistical'? Judging from well-publicized issues, such as P-hacking (Gelman and Loken, 2014; Nuzzo, 2014; Wasserstein and Lazar, 2016), and from my own conversations with scientists, I’d say so. To borrow from Feyerabend (2010, p. 7), "The only principle that does not inhibit progress is: anything goes". So why not simply encourage scientists to make convincing, cogent arguments for their hypotheses however they see fit, without having to check off a list of 'virtues' or run a battery of statistical tests?

Wasserman (2012) invites us to imagine "a world without referees". Instead, I’m envisioning a world without editors, journals, or statistics lording over science and society. Without 'objectivity' obscuring the objective, and without 'virtues' standing in the way of ideals. That world looks pretty good to me.

The modern-day snake oil salesman

Harry Crane. December 2, 2016.

Description:

Commentary on how probabilistic predictions can be simultaneously correct and meaningless, with a focus on the 2016 presidential election.

Rejoinder: The ubiquitous Ewens sampling formula

Harry Crane. (2016). Statistical Science, 31(1):37-39.

Description:

Some concluding remarks regarding my 2016 article, "The ubiquitous Ewens sampling formula," which was discussed by Arratia, Barbour & Tavare, Favaro & James, Feng, McCullagh, and Teh.
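
For reference, the formula in question, in its standard form (my transcription, not reproduced from the article): a random partition of n governed by the Ewens distribution with parameter \theta > 0 has a_j blocks of size j, where \sum_j j a_j = n, with probability

```latex
P(a_1, \dots, a_n)
  = \frac{n!}{\theta(\theta + 1)\cdots(\theta + n - 1)}
    \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j}\, a_j!}
```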