Taming arXiv with Natural Language Processing w/ John Bohannon

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

In this episode I’m joined by John Bohannon, Director of Science at AI startup Primer.

As you all may know, a few weeks ago we released my interview with Google legend Jeff Dean, which, by the way, you should definitely check out if you haven’t already. In that interview, Jeff mentions the recent explosion of machine learning papers on arXiv, and I responded by jokingly asking whether Google had already developed an AI system to help them summarize and track all of them. While Jeff didn’t have anything specific to offer, a listener reached out and let me know that John was in fact already working on this problem. In our conversation, John and I discuss his work on Primer Science, a tool that harvests content uploaded to arXiv, sorts it into natural topics using unsupervised learning, then gives relevant summaries of the activity happening in different innovation areas. We spend a good amount of time on the inner workings of Primer Science, including their data pipeline and some of the tools they use, how they determine “ground truth” for training their models, and the use of heuristics to supplement NLP in their processing.
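To make the "sorts it into natural topics using unsupervised learning" idea a bit more concrete, here is a minimal sketch in plain Python. It is not Primer's method; the episode description doesn't spell out their algorithm. It assumes TF-IDF vectors, cosine similarity, and a simple greedy "leader" clustering pass. The sample abstracts, the stopword list, the 0.2 threshold, and all function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "for", "and", "with", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_topic(docs, threshold=0.2):
    """Greedy leader clustering: each document joins the cluster whose first
    member (the leader) is most similar, or starts a new cluster if no leader
    reaches the threshold. Returns a list of lists of document indices."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Made-up titles standing in for harvested arXiv abstracts.
abstracts = [
    "Neural machine translation with attention",
    "Attention models for neural machine translation of low resource languages",
    "Policy gradient methods for reinforcement learning",
    "Reinforcement learning with deep policy networks",
]

clusters = group_by_topic(abstracts)
for cluster in clusters:
    print([abstracts[i] for i in cluster])
```

A real pipeline would work on thousands of full abstracts and would likely use a more robust clustering or topic-modeling approach, plus a summarization step on top. The sketch only shows the basic shape: no labels are supplied, and the groups emerge from word overlap alone.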

Event Season!

Tomorrow, 5/7, I’m keynoting at the Prepare AI event here in Saint Louis and then making my way out to San Francisco for Figure Eight’s Train AI conference. The event agenda looks great, and I’ll be on-site all day podcasting, so if you’re in the Bay Area you should definitely plan to stop by. Of course if you do, use the discount code TWIMLAI for 30% off of registration. Be sure to give me a shout if you’re planning to be around!

About John

Mentioned in the Interview

“More On That Later” by Lee Rosevere, licensed under CC BY 4.0

