NLP for Mapping Physics Research with Matteo Chinazzi

EPISODE 353

MARCH 2, 2020

About this Episode

Renowned 19th-century biologist Louis Pasteur once said, "Science knows no country, because knowledge belongs to humanity, and is the torch which illuminates the world." Yet we can learn much about the evolution of science by exploring its local origins, and with the help of AI, we can use this local knowledge to predict its future.

Predicting the future of science, particularly physics, is the task that Matteo Chinazzi, an associate research scientist at Northeastern University, focused on in his paper Mapping the Physics Research Space: a Machine Learning Approach, along with co-authors including former TWIML AI Podcast guest Bruno Gonçalves.

"The idea is essentially to look under the microscope of how science works, meaning for example, how it evolves over time, how collaboration occurs between different scientists, in between different fields. How scientists pick their research problems, how they, for example, move across different institutions, how nations develop expertise in different fields of research and so on."

In addition to predicting the trajectory of physics research, Matteo is also active in the computational epidemiology field. His work in that area involves building simulators that can model the spread of diseases like Zika or the seasonal flu at a global scale.

Science of Science

Matteo's background in economics and his interest in human behavior sparked his desire to explore the "science of science." Physics was the natural starting point since he already worked with many people in the field.

To build his models, Matteo uses a core dataset of papers published in the journals of the American Physical Society. This dataset was chosen in part because of the robustness of its classification scheme, the Physics and Astronomy Classification Scheme (PACS), which provides references to affiliated topics, authors and publications for each of the papers in the archive. PACS also provides a consistent set of keywords for each of the papers.
These keywords are used to relate the various physics researchers to one another using an embedding model. The model Matteo and his co-authors use is StarSpace, developed by Facebook AI Research. As Matteo puts it, "We are treating each author as a bag of topics, a bag of research fields in which that author has worked. Then we use this bag of topics to infer the embeddings for each specific research sub-area."

Having created an embedding that relates the various research topics to one another, Matteo and his co-authors then use it to create what they call the Research Space Network (RSN). The RSN is a "mapping of the research space [created] by essentially looking at the expertise of authors to guide us on what it means for two topics to be similar to each other."
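
To make the idea of a topic network concrete, here is a minimal sketch of how a map like the RSN could be derived once topic embeddings exist. The topic names and vectors are made up for illustration (they are not StarSpace output), and both the use of cosine similarity and the 0.5 threshold are assumptions; the episode does not say how the paper defines its edges.

```python
import math
from itertools import combinations

# Toy topic embeddings: made-up numbers, NOT output from StarSpace.
topic_vectors = {
    "quantum_optics": [0.9, 0.1, 0.0],
    "laser_physics": [0.8, 0.2, 0.1],
    "galaxy_formation": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_research_space(vectors, threshold=0.5):
    """Return {(topic_a, topic_b): similarity} for pairs at or above threshold.

    The threshold is an assumed, illustrative cutoff.
    """
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine_similarity(vectors[a], vectors[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


edges = build_research_space(topic_vectors)
for (a, b), sim in edges.items():
    print(f"{a} <-> {b}: {sim:.2f}")
```

With these toy vectors, only the two optics-related topics end up connected, which is the kind of clustering one would hope a real embedding reproduces at scale.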

Principle of Relatedness

One of the main findings from the research so far is what Matteo refers to as a "fingerprint" of the scientific production of cities. The work is based on the Principle of Relatedness, an idea from economics that looks at the relationship between a nation's overall production, exports, expertise, and trade partners to predict what items the country should export next.

In applying this idea to their research, Matteo looks at all of the scientific publications from a city and uses the embedding space to measure the level of relatedness and predict the direction of the city's scientific knowledge. You can use a network to visually show the interactions between different vectors (science topics) and rank the probability that a city will enter a specific field. That ranking becomes your "classifier" and allows you to determine whether a field will or will not be developed next in that city.

If you were to plot the topics of existing research in a city, you could see where the "knowledge density" collects and note where the density is high to predict the trajectory of research. If a country is in an intermediate stage of development, there's a higher chance of "jumping" to a different space.

Focus and Limitations

The focus, for now, is to find the best way of creating embeddings for a very specific problem, not for a variety of tasks. For example, there is no weighting of a researcher's volume of work or its relative importance; the associations include anything they've been active in. Likewise, for some analyses, you might want to identify where the scientist is most active and remove any side projects or abandoned subjects. None of these refinements are considered in this paper.

Rather, Matteo approaches the problem from the simplest possible scenario, effectively asking "What if we are blind?" "We...get a big pile of papers from an author. We just list all the topics in which he has worked on and train on that."
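
As a rough illustration of that "blind" setup, the sketch below collapses each author's papers into an unweighted set of topic codes and writes one line per author. The author names and codes are hypothetical, and the one-line-per-bag layout is only an assumption about what a training file might look like; the exact input format StarSpace expects should be checked against its own documentation.

```python
from collections import defaultdict

# Hypothetical (author, topic code) records, one per paper-topic pair.
paper_records = [
    ("author_a", "42.50"),
    ("author_a", "42.55"),
    ("author_a", "42.50"),  # repeated topic: collapsed, no weighting by volume
    ("author_b", "98.62"),
    ("author_b", "42.50"),
]


def authors_as_bags(records):
    """Map each author to the sorted set of topics they have ever worked on."""
    bags = defaultdict(set)
    for author, topic in records:
        bags[author].add(topic)
    return {author: sorted(topics) for author, topics in bags.items()}


def to_training_lines(bags):
    """One whitespace-separated line of topics per author (assumed layout)."""
    return [" ".join(topics) for _, topics in sorted(bags.items())]


bags = authors_as_bags(paper_records)
training_lines = to_training_lines(bags)
for line in training_lines:
    print(line)
```

Using a set rather than a count is what makes the approach "blind": a topic an author touched once carries the same weight as one they published on repeatedly.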
They want to show that you do not need to perform manual checks and optimizations to get useful results.

Performance Metrics

Matteo tested the results using two kinds of validation.

The first approach was to visualize the RSN and regional fingerprints for assessment. This made it easy to see the macro areas where the PACS classification distinguishes the different subfields of physics. This hierarchy was not used at training time, yet the algorithm was still able to determine the right classification.

The second method was to measure the predictive power of the algorithm by looking at each city at a given time period and listing the topics where it had a competitive advantage. They then scored the model's predictions with a standard metric such as an ROC curve to see whether the model performed better than a random model.

What's Next?

While the goal is to eventually expand and apply these techniques to entire papers (vs. just the PACS keywords), having a predetermined taxonomy and hierarchical structure laid out gives them a benchmark to validate their own observations.

Scaling this approach to other fields is something they are starting to work on. They've made some progress using the Microsoft Academic Graph, which includes all the different fields in science. As of now, they can't replicate the results they get when they apply the algorithm to physics, but the embedding space could potentially be extended to track things like how the semantics of a term change over time, or how authors tend to move in this space. There's also the possibility of finding gaps in the science and making connections that the field might not know to make.
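
For readers who want to see how the ranking and the ROC check described under Performance Metrics fit together, here is a small self-contained sketch. The density formula is one common formulation from the economic complexity literature (the similarity-weighted share of related topics a city is already active in); the episode does not give the paper's exact formula. The topics, similarities, and labels are all invented.

```python
# Invented pairwise topic similarities (symmetric); not from the paper.
pair_similarity = {
    ("A", "B"): 0.9, ("A", "C"): 0.2, ("A", "D"): 0.1,
    ("B", "C"): 0.3, ("B", "D"): 0.1, ("C", "D"): 0.8,
}
similarity = {}
for (a, b), s in pair_similarity.items():
    similarity.setdefault(a, {})[b] = s
    similarity.setdefault(b, {})[a] = s


def knowledge_density(topic, active_topics, similarity):
    """Similarity-weighted share of neighbouring topics the city is active in."""
    neighbours = similarity[topic]
    total = sum(neighbours.values())
    active = sum(s for t, s in neighbours.items() if t in active_topics)
    return active / total if total else 0.0


def roc_auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical city: active in topic A now; later it enters B but not C or D.
active_now = {"A"}
candidates = ["B", "C", "D"]
entered_later = [1, 0, 0]

scores = [knowledge_density(t, active_now, similarity) for t in candidates]
auc = roc_auc(scores, entered_later)
print(dict(zip(candidates, scores)), "AUC:", auc)
```

An AUC of 0.5 corresponds to the random baseline mentioned above, so anything meaningfully higher suggests the density ranking carries real signal about where a city goes next.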