Principle of Relatedness
One of the main findings from the research so far is what Matteo refers to as a "fingerprint" of the scientific production of cities. The work builds on the Principle of Relatedness, a concept from economics that measures the relationships between a nation's overall production, exports, expertise, and trade partners in order to predict which products the country is likely to export next.
Applying this idea to their research, Matteo looks at all of the scientific publications from a city and uses the embedding space to measure relatedness and predict the direction of the city's scientific knowledge. You can use a network to visually show the interactions between the different vectors (science topics) and rank the probability that the city will enter a specific field. That ranking becomes your "classifier," letting you determine where a field will or will not be developed next.
If you were to plot the topics of existing research in a city, you could see where the "knowledge density" collects; the regions where density is high point to the likely trajectory of research. If a country is at an intermediate stage of development, there is a higher chance of it "jumping" to a different part of the space.
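To make the idea concrete, here is a minimal sketch of how such a relatedness-based ranking could be computed from topic embeddings. Everything here is an assumption for illustration: the topic vectors, the city's topic set, and the use of cosine similarity as the relatedness measure are stand-ins, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical inputs: one embedding vector per science topic, plus the set of
# topics in which a given city is already active. All values are illustrative.
topic_vectors = {
    "superconductivity": np.array([0.9, 0.1, 0.0]),
    "quantum computing": np.array([0.8, 0.3, 0.1]),
    "fluid dynamics":    np.array([0.1, 0.9, 0.2]),
}
city_topics = {"superconductivity"}

def cosine(u, v):
    # Cosine similarity stands in for "relatedness" between two topics.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def density(candidate, city_topics, topic_vectors):
    # Knowledge density of a candidate topic: its average closeness, in the
    # embedding space, to the topics the city already works on.
    sims = [cosine(topic_vectors[candidate], topic_vectors[t])
            for t in city_topics]
    return sum(sims) / len(sims)

# Rank every topic the city has not yet entered; this ranking is the "classifier".
candidates = [t for t in topic_vectors if t not in city_topics]
ranking = sorted(candidates,
                 key=lambda t: density(t, city_topics, topic_vectors),
                 reverse=True)
print(ranking)  # most likely next fields first
```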
Focus and Limitations
The focus, for now, is to find the best way of creating embeddings for a very specific problem, not for a variety of tasks.
For example, there is no weighting of a researcher's volume of work or its relative importance; the associations include anything they have been active in. Likewise, for some analyses you might want to identify where a scientist is most active and remove any side projects or abandoned subjects.
None of these are considered in this paper. Rather, Matteo approaches the problem from the simplest possible scenario, effectively asking "What if we are blind?"
"We...get a big pile of papers from an author. We just list all the topics in which he has worked on and train on that."
They want to prove that you do not need to perform manual checks and optimizations to get useful results.
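In code, that "blind" setup maps naturally onto a word2vec-style model in which each author's raw topic list plays the role of a sentence. The sketch below assumes gensim and skip-gram training; the PACS codes, list contents, and hyperparameters are all illustrative.

```python
from gensim.models import Word2Vec

# Each "sentence" is simply the unfiltered list of topics (e.g. PACS codes)
# attached to one author's papers: no weighting, no manual cleanup.
author_topic_lists = [
    ["74.25.-q", "03.67.-a", "74.20.Fg"],  # author 1 (codes are illustrative)
    ["47.27.-i", "47.10.-g"],              # author 2
    # ... one list per author
]

model = Word2Vec(
    sentences=author_topic_lists,
    vector_size=100,  # dimensionality of the embedding space
    window=5,
    min_count=1,      # keep every topic, even if it appears once
    sg=1,             # skip-gram
)

# Related topics end up near each other in the learned space.
vector = model.wv["74.25.-q"]
```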
Performance Metrics
Matteo tested the results using two different validations:
One approach was to visualize the RSN and regional fingerprints and assess them by eye. This made it easy to see the macro areas into which the PACS classification groups the different subfields of physics. That hierarchy was not used at training time, yet the algorithm was able to recover the right classification.
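A sketch of what such a visual check could look like: project the learned topic vectors to 2D and color each point by its top-level PACS group, which was withheld during training. The random vectors and the crude macro-area mapping below are stand-ins so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical learned vectors, one per PACS code (e.g. from the word2vec
# sketch above); random vectors stand in here for illustration.
rng = np.random.default_rng(0)
codes = ["74.25.-q", "74.20.Fg", "03.67.-a", "47.27.-i", "47.10.-g"]
vectors = rng.normal(size=(len(codes), 100))

# The digits before the first dot give the top-level PACS group, a crude
# stand-in for the macro-area hierarchy that was withheld at training time.
macro_area = [c.split(".")[0] for c in codes]

coords = PCA(n_components=2).fit_transform(vectors)
for area in sorted(set(macro_area)):
    pts = coords[[i for i, a in enumerate(macro_area) if a == area]]
    plt.scatter(pts[:, 0], pts[:, 1], label=f"PACS {area}")
plt.legend()
plt.show()  # well-separated clusters would suggest the macro areas emerged
```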
The second method was to measure the predictive power of the algorithm: look at each city in a given time period, list the topics in which it had a competitive advantage, and then compare the model's predictions against those outcomes using a standard metric such as an ROC curve, to see whether the model performs better than a random one.
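As a sketch of that second check, scikit-learn's roc_auc_score compares the model's relatedness scores against what actually happened in the next time window. Both arrays below are made-up examples.

```python
from sklearn.metrics import roc_auc_score

# 1 = the city actually gained a competitive advantage in that topic during
# the next time window; 0 = it did not. Scores are the model's rankings.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # illustrative outcomes
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]  # illustrative scores

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.5 is the random baseline; closer to 1.0 is better
```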
What's Next?
While the goal is to eventually expand these techniques to entire papers (rather than just the PACS keywords), having a predetermined taxonomy and hierarchical structure laid out gives them a benchmark against which to validate their own observations.
Scaling this approach to other fields is something they are starting to work on. They have made some progress using the Microsoft Academic Graph, which covers all of the fields of science. As of now, they cannot replicate the physics results on that broader data, but the embedding space could evolve to track things like the semantics of a term over time, or how authors tend to move through the space. There is also the possibility of finding gaps in the science and making connections that the field might not know to make.