Reproducibility Crisis in Data Science

Last week on the podcast I interviewed Clare Gollnick, CTO of Terbium Labs, on the reproducibility crisis in science and its implications for data scientists. We also got into an interesting conversation about the philosophy of data, a topic I hadn’t previously thought much about. The interview seemed to really resonate with listeners, judging by the number of comments we’ve received via the show notes page and Twitter. I think there are several reasons for this.

I’d recommend listening to the interview if you haven’t already. It’s incredibly informative, and Clare does an excellent job explaining the main points of the reproducibility crisis. The short of it, though, is that many researchers in the natural and social sciences report not being able to reproduce each other’s findings. In a 2016 “Nature” survey, more than 70% of researchers said they had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own. This concerning finding has far-reaching implications for the way scientific studies are performed.

Gollnick suggests that one contributing factor is “p-hacking,” that is, examining one’s experimental data until patterns are found that meet the criteria for statistical significance, before committing to a specific hypothesis about the underlying causal relationship. P-hacking is also known as “data fishing” for a reason: you’re working backward from your data to a pattern, which breaks the assumptions upon which statistical significance is determined in the first place. A significance threshold like p < 0.05 is calibrated for a single test chosen in advance; run enough tests on the same data and some will clear the bar by chance alone.
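To make the data-fishing problem concrete, here’s a minimal simulation sketch. It’s my own illustration, not something from the interview. It assumes a simple two-sided z-test with a known standard deviation, and every “experiment” compares two groups drawn from the same distribution, so any significant result is a false positive by construction. The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05  # conventional significance threshold


def two_sample_p_value(a, b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means (sigma assumed known)."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests, n_per_group=50, seed=0):
    """Run n_tests comparisons in which there is no real effect to find."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups come from the same N(0, 1) distribution.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_p_value(group_a, group_b) < ALPHA:
            hits += 1
    return hits


def family_wise_error_rate(n_tests, alpha=ALPHA):
    """Chance that at least one of n_tests independent null tests looks significant."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    n = 2000
    print(f"{count_false_positives(n)} of {n} no-effect tests were 'significant'")
    print(f"Chance of at least one hit in 20 tests: {family_wise_error_rate(20):.2f}")
```

Roughly one in twenty of the no-effect comparisons should pass the threshold, and with 20 independent looks at the data the chance of finding at least one “significant” pattern is about 64%. Picking the hypothesis after seeing which comparison passed is what turns that noise into a published finding.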

Clare points out, though, that data fishing is exactly what machine learning algorithms do: they work backward from data to patterns or relationships. Data scientists can thus fall victim to the same errors made by natural scientists. P-hacking in the sciences, in particular, is similar to developing overfitted machine learning models. Fortunately for data scientists, it is well understood that validating on held-out data is a necessary practice: a hypothesis (the model) is generated on a training dataset and then tested on a separate validation dataset, and cross-validation repeats this over several different splits. As Gollnick points out, testing on the validation set is a lot like making a very specific prediction that’s unlikely to occur unless your hypothesis is true, which is essentially the scientific method at its purest.
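Here’s a small sketch of why the held-out test matters. It’s my own toy example under stated assumptions: the data is pure noise (labels are independent of the features), the “model” is a 1-nearest-neighbor classifier that simply memorizes the training set, and there’s a single holdout split rather than full k-fold cross-validation. All names are hypothetical.

```python
import random


def make_noise_dataset(n_rows, n_features, rng):
    """Features and labels are independent, so there is nothing real to learn."""
    rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return rows, labels


def predict_1nn(train_rows, train_labels, x):
    """Return the label of the training row closest to x (squared Euclidean distance)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, x))

    nearest = min(range(len(train_rows)), key=lambda i: sq_dist(train_rows[i]))
    return train_labels[nearest]


def accuracy(train_rows, train_labels, rows, labels):
    """Fraction of rows whose 1-NN prediction matches the given label."""
    correct = sum(
        predict_1nn(train_rows, train_labels, x) == y for x, y in zip(rows, labels)
    )
    return correct / len(rows)


def run_holdout_experiment(n_train=200, n_val=200, n_features=5, seed=0):
    """Fit on one noise dataset, then score on both it and a fresh one."""
    rng = random.Random(seed)
    train_x, train_y = make_noise_dataset(n_train, n_features, rng)
    val_x, val_y = make_noise_dataset(n_val, n_features, rng)
    train_acc = accuracy(train_x, train_y, train_x, train_y)
    val_acc = accuracy(train_x, train_y, val_x, val_y)
    return train_acc, val_acc


if __name__ == "__main__":
    train_acc, val_acc = run_holdout_experiment()
    print(f"training accuracy:   {train_acc:.2f}")
    print(f"validation accuracy: {val_acc:.2f}")
```

On the training set the memorizing model looks perfect, because every point is its own nearest neighbor. On the held-out set it should do about as well as a coin flip. The training score is the p-hacked result; the validation score is the specific prediction that fails when the hypothesis is wrong.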

Beyond the sciences, there’s growing concern about a reproducibility crisis in machine learning as well. A recent blog post by Pete Warden speaks to some of the core reproducibility challenges faced by data scientists and other practitioners. Warden points to the iterative nature of current approaches to machine and deep learning and the fact that data scientists are not easily able to record their steps through each iteration. Furthermore, the data science stack for deep learning has a lot of moving parts, and a change in any of these layers (the deep learning framework, GPU drivers, or training or validation datasets) can impact results. Finally, with opaque models like deep neural networks, it’s difficult to understand the root cause of differences between expected and observed results. These problems are further compounded by the fact that many published papers fail to explicitly mention many of their simplifying assumptions or implementation details, making it harder for others to reproduce their work.
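One low-tech way to chip away at the record-keeping problem is to save a small snapshot of the environment next to every result. The sketch below is my own illustration, not something Warden’s post prescribes. It uses only the Python standard library, the function names are hypothetical, and it deliberately leaves out things the standard library can’t see, such as GPU driver versions.

```python
import hashlib
import json
import platform
import random
import sys
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def snapshot_run(seed, dataset_bytes, packages=()):
    """Seed Python's RNG and describe the run well enough to compare it later."""
    random.seed(seed)
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        # A content hash notices silent changes to the training or validation data.
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
    }


if __name__ == "__main__":
    # The package names and the inline dataset are placeholders for illustration.
    snapshot = snapshot_run(
        seed=42,
        dataset_bytes=b"feature,label\n0.1,0\n0.9,1\n",
        packages=("numpy", "torch"),
    )
    print(json.dumps(snapshot, indent=2))
```

Note that this only seeds Python’s built-in random module; a real deep learning run would also need to seed its framework’s own generators, and even then some GPU operations may not be deterministic. Still, when two runs disagree, comparing their snapshots at least tells you which layer of the stack changed.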

Efforts to reproduce deep learning results are further confounded by the fact that we really don’t know why, when or to what extent deep learning works. During an award acceptance speech at the 2017 NIPS conference, Google’s Ali Rahimi likened modern machine learning to alchemy for this reason. He explained that while alchemy gave us metallurgy, modern glass making, and medications, alchemists also believed they could cure illnesses with leeches and transmute base metals into gold. Similarly, while deep learning has given us incredible new ways to process data, Rahimi called for the systems responsible for critical decisions in healthcare and public policy to be “built on top of verifiable, rigorous, thorough knowledge.”

Gollnick and Rahimi are united in advocating for a deeper understanding of how and why the models we use work. Doing so might mean a trip back to basics, as far back as the foundations of the scientific method. Gollnick mentioned in our conversation that she’s been fascinated recently with the “philosophy of data,” that is, the philosophical exploration of scientific knowledge, what it means to be certain of something, and how data can support such claims. It stands to reason that any thought exercise that forces us to face tough questions about issues like explainability, causation, and certainty could be of great value as we broaden our application of modern machine learning methods. Guided by the work of philosophers of science like Karl Popper and Thomas Kuhn, and going as far back as David Hume, this type of deep introspection into our methods could prove useful for the field of AI as a whole.

What do you think? Does AI have a reproducibility crisis? Should we bother philosophizing about the new tools we’ve made, or just get to building with them?