Saw an article from MIT News (Using machine learning to predict high-impact research) on how researchers there were able to train an AI model to predict which scientific research was going to be the most impactful (foundational) over time. The news article was reporting on research written up in a Nature article (Learning on knowledge graph dynamics provides an early warning of impactful research, behind paywall). The researchers proposed that institutions and VCs should use their new DELPHI (Dynamic Early-warning by Learning to Predict High Impact [research]) tool to find foundational research to invest in.
Attempts to identify good research have been going on for years. For example, CiteSeerX and others like it use an article's citation index to rank research. Citation indexes are sort of like Google's PageRank and use a count of how many citations a research paper has garnered since publication as their metric of importance.
Although a citation index is a single, easy-to-calculate metric, it doesn't seem to be a foolproof way to identify foundational research, and it can take a number of years for an article's importance to become evident. The researchers at MIT decided to see whether using an AI model to identify high-impact research would work better.
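As a rough illustration (this is a toy, not the researchers' code), a citation index in its simplest form is just a per-article count of incoming citations. The sketch below, using entirely hypothetical article metadata, ranks articles by that count:

```python
from collections import Counter

# Hypothetical metadata: each article lists the article IDs it cites.
articles = {
    "A": {"cites": ["B", "C"]},
    "B": {"cites": ["C"]},
    "C": {"cites": []},
    "D": {"cites": ["C", "B"]},
}

# A simple citation index: count incoming citations per article.
citation_counts = Counter(
    cited for meta in articles.values() for cited in meta["cites"]
)

# Rank articles by citation count (most-cited first).
ranking = sorted(articles, key=lambda a: citation_counts[a], reverse=True)
print(ranking)  # ['C', 'B', 'A', 'D'] -- C has 3 citations, B has 2
```

The obvious weakness, as noted above, is that these counts start at zero and take years to accumulate, which is exactly the lag DELPHI tries to get ahead of.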
How DELPHI works
Apparently, DELPHI uses article metadata, such as what one can see for the Nature article behind this research (linked to above), to create a knowledge graph. They then use the knowledge graph and an AI model to predict whether the research will become high impact or not. The threshold they used for their publication was any research DELPHI predicts will land in the top 5% of all research in a domain.
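Without access to the paper we can only guess at the details, but the basic shape of that last step (score every article in a domain, then flag anything above the 95th percentile) is easy to sketch. Everything below is an assumption: the scores are stand-ins for whatever the model actually outputs, and the cutoff function is just a plain percentile:

```python
import math

# Hypothetical per-article scores, standing in for whatever DELPHI's
# model computes over the metadata knowledge graph.
scores = {f"article_{i}": i * 0.01 for i in range(100)}

def top_percent_threshold(values, pct=5.0):
    """Return the cutoff above which a value is in the top pct%."""
    ordered = sorted(values)
    # Index of the (100 - pct)th-percentile element.
    idx = math.ceil(len(ordered) * (1 - pct / 100.0)) - 1
    return ordered[idx]

cutoff = top_percent_threshold(scores.values())
high_impact = [a for a, s in scores.items() if s > cutoff]
print(len(high_impact))  # 5 -- i.e., 5% of the 100 hypothetical articles
```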
Not having access to their paper (or code, see below), we can't determine whether they used a DNN or some other AI/data-analytics approach to come up with their predictions.
The input data (article metadata) came from Lens.org, a website which provides metadata for ~230M research articles and ~130M patent filings. The researchers focused on the life sciences as the domain to analyze for impact, but presumably their approach would work in any scientific domain.
The research analyzed all scientific articles from 42 life-sciences journals (listed in the article's supplementary information). They used articles written prior to 2017 as their training set, and then used their model to predict the impact of articles published since 2018.
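That split is a standard temporal holdout: train only on the past, predict on the future. A minimal sketch, with hypothetical article records and the year cutoffs as stated above:

```python
# Hypothetical article metadata with publication years.
articles = [
    {"id": "a1", "year": 2015},
    {"id": "a2", "year": 2016},
    {"id": "a3", "year": 2018},
    {"id": "a4", "year": 2020},
]

# Temporal holdout: train on articles written prior to 2017,
# predict impact for articles published 2018 onward.
train_set = [a for a in articles if a["year"] < 2017]
test_set = [a for a in articles if a["year"] >= 2018]

print(len(train_set), len(test_set))  # 2 2
```

The point of splitting by year rather than at random is to avoid leaking future citation information into the training data.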
In the Nature article’s supplementary information they provide a table (Table 2) listing some of the life-sciences articles since 2018 that DELPHI predicts will have high (top 5%) impact. There are ~50 articles listed in the table, and they supply the full (knowledge) graph citation count as well as citation counts for each article.
The Nature article’s home page also lists links to the researchers’ code and data on one of the researchers’ GitHub repos. When I attempted to download the trained model and sample dataset, it generated a “links had expired” error message from Dropbox. The repo README file suggested reaching out to the researcher if this happened. We did that, but had not received any response prior to this post’s publication.
In any case, the GitHub repository contains a sample Jupyter notebook and a Dockerfile used to create a container to run the notebook in. The data they supplied is supposedly a sample of 206 articles’ metadata, and the notebook uses their model to predict the impact level for those sample articles.
I would have liked to see more information on their model’s layer structure, hyper-parameters, and other model details, as well as prediction-reliability statistics. But perhaps this is outlined in the Nature article or provided with the model download.
But the approach seems sound enough, and even if the researchers didn’t use a DNN, it would easily lend itself to a DNN prediction, assuming you could:
1. Algorithmically create the knowledge graph from article metadata,
2. Digitize and quantify the metadata knowledge graph for all the articles, and
3. Obtain an independent assessment of impact levels for all research in the training set.
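The three steps above can be sketched end to end. To be clear, everything here is a placeholder, not DELPHI's actual method: the graph is a bare citation graph, the "features" are just in/out degree, and a fixed linear score stands in for a trained model:

```python
from collections import Counter

# Step 1: algorithmically create a (toy) knowledge graph from metadata:
# nodes are article IDs, edges are citations.
def build_graph(metadata):
    return {art: set(meta["cites"]) for art, meta in metadata.items()}

# Step 2: digitize/quantify the graph into per-article feature vectors
# (here: out-degree and in-degree, stand-ins for real graph features).
def featurize(graph):
    in_deg = Counter(c for cites in graph.values() for c in cites)
    return {a: (len(cites), in_deg[a]) for a, cites in graph.items()}

# Step 3: independent impact labels per training article would let you
# fit a real model; a fixed linear score stands in for one here.
def score(features, w_out=0.1, w_in=1.0):
    return {a: w_out * f[0] + w_in * f[1] for a, f in features.items()}

metadata = {
    "A": {"cites": ["B", "C"]},
    "B": {"cites": ["C"]},
    "C": {"cites": []},
}
print(score(featurize(build_graph(metadata))))
```

Swap the fixed weights for parameters learned against step 3's labels and you have the skeleton of a trainable predictor, DNN or otherwise.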
Now if we could just do this for blog posts and podcasts it might be even more useful (for us).