A Conversation About Public Datasets for AI Research

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

This post is an excerpt from the July 15, 2016 edition of the This Week in Machine Learning & AI podcast. You can listen to or subscribe to the podcast below.

In this post I want to revisit some comments I made last week while discussing the news that Google DeepMind was granted access to a collection of one million eye scan images by the UK's National Health Service (NHS). If you'll recall, I asked whether this data, collected by a government-funded public health organization, should be made publicly available to all researchers rather than handed over exclusively to a single research organization.

Well, I wasn’t the only person thinking this thought. This week I came across a really interesting article by Natasha Lomas over on TechCrunch that takes this question a few steps further. While the focus of my question was on data accessibility, a key underlying issue, which Natasha very nicely articulates, is the issue of data value.

To be clear, the issue here is that while Google DeepMind says it will be publishing the results of its research, and if you're a regular listener here you know that this is very likely the case, the company hasn't committed to sharing, via open source or otherwise, the models it creates as a result of the work.

As an example of a likely outcome, Google could turn around and license its models, which are based on public data, to one of the vendors of the eye scanners used by physicians. Sure, they created the models, but exclusive access to the data gives them quite a head start.

An article on the topic in New Scientist magazine paraphrases a University of Pittsburgh eye doctor as saying that DeepMind may get free access to valuable patient data, but the alternative is to keep potential insight locked up in the Moorfields dataset, inaccessible to analysis.

You can imagine the NHS saying the same thing, but this is obviously a false dichotomy. Who's to say that if the data were made public, another research organization, such as a public university, wouldn't take up the challenge?

Natasha asks a few good questions in her piece, namely:
– Why do governments and public bodies fail to see the value locked up in publicly funded datasets?
– Why aren't they coming up with ways to maintain public ownership of public assets?
– How could they do so in a way that distributes benefits equally, rather than disproportionately rewarding the company with the slickest sales pitch?

Natasha compares the NHS DeepMind arrangement to other transactions involving the privatization of public resources, suggesting that these amount to a transfer of wealth from citizens to corporate interests.

She suggests that “we, the public, really need to get our act together and demand a debate about who should own the value locked up in our data. And preferably do so before we’ve handed over any more sets of keys.”

What occurred to me in thinking about this a bit more is that perhaps one piece of the puzzle is a new type of licensing model for data: something viral like the GPL, but whose copyleft provisions extend to derivative works, which in this case would mean models created by training on the data. So, if you used data licensed under such a license to train a model, you would need to publish the source code for the model should you choose to make it public via services or executables.

I'm just thinking aloud here. Let me know what you think in the comments, or on Twitter.
