The webinar 'An intro to text and data mining for data scientists' explored advanced techniques that combine TDM with AI, showcasing four case studies drawn from a range of disciplines. These examples (two from biomedical sciences, one from materials science, and one from financial technology, or fintech) illustrate the potential of applying TDM and AI tools to Springer Nature's body of published research, accessed via a powerful Application Programming Interface (API). In this webinar summary, Dr. Prathik Roy and Eddie Bates explain how you can use the API for these diverse projects.
Initially, TDM was mainly about finding insights hidden in published research, because the volume of published material is far too large for any human to read in full. But with the growth of large foundation models and other machine learning and deep learning models, data scientists can now use the corpus of published research to train their own models. These models can then provide predictive and prescriptive analytics, rather than just descriptive analytics. Google DeepMind's AlphaFold tool, for example, which predicts how proteins fold, shows how powerful these tools can be.
When you combine data from multiple sources, including clinical trial reports, patents, journal and book publications, and patient records, you can find over 1 billion relationships between genes, symptoms, diseases, proteins, tissues, species, and candidate drugs.
BenevolentAI, a company that describes itself as applying advanced AI to accelerate biopharma drug discovery, has been using these data sets to train and build models that find the genes associated with medical conditions and then link those genes to candidate compounds that can act on those conditions. The company was even able to identify possible candidates for treating Covid-19 symptoms.
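To make that idea concrete, here is a minimal, hypothetical sketch of how mined relationships might be represented as a knowledge graph and queried for gene-disease-compound links. The entities, relations, and the networkx representation below are illustrative assumptions, not BenevolentAI's actual pipeline.

```python
# Hypothetical sketch: represent mined biomedical relationships as a small
# knowledge graph and look for compounds acting on disease-linked genes.
# All entity and relation names are invented for illustration.
import networkx as nx

kg = nx.DiGraph()

# Relationship triples as they might be extracted from papers, patents,
# clinical trial reports, and patient records.
triples = [
    ("GENE_A", "associated_with", "Disease_X"),
    ("Compound_1", "inhibits", "GENE_A"),
    ("GENE_A", "encodes", "Protein_B"),
    ("Compound_2", "binds", "Protein_B"),
    ("Disease_X", "presents_with", "Symptom_Y"),
]
for subject, relation, obj in triples:
    kg.add_edge(subject, obj, relation=relation)

# Find genes associated with Disease_X, then compounds that act on those genes.
disease_genes = [u for u, v, d in kg.in_edges("Disease_X", data=True)
                 if d["relation"] == "associated_with"]
for gene in disease_genes:
    for compound, _, d in kg.in_edges(gene, data=True):
        print(f"{compound} --{d['relation']}--> {gene} (linked to Disease_X)")
```

At the scale mentioned above, the same pattern would run over a graph store rather than an in-memory object, but the shape of the query is the same: disease to gene to candidate compound.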
CiteAb, a company that describes itself as a reagent search engine, built and trained models to extract reagent information from the literature. They did this by first using human experts to curate a sample set of reagent and antibody information, and then training AI models on that sample. They can now continue to enhance and refine the model, which quickly finds and extracts reagent and antibody information from the literature.
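As a rough illustration of what this kind of extraction involves, here is a minimal rule-based sketch in Python. The patterns and the example methods sentence are invented, and CiteAb's production system relies on models trained on expert-curated examples rather than hand-written rules like these.

```python
# Minimal, rule-based sketch of pulling antibody/reagent mentions out of
# methods text. The regular expressions and the example sentence are invented
# for illustration only; a trained model would replace these patterns.
import re

methods_text = (
    "Cells were stained with anti-CD4 antibody (VendorCo, cat. no. AB-12345, "
    "1:200 dilution) and counterstained with DAPI."
)

# Very rough patterns for antibody names, vendor catalogue numbers, and dilutions.
antibody_pattern = re.compile(r"anti-[A-Za-z0-9]+ antibody")
catalog_pattern = re.compile(r"cat\.? no\.? ([A-Z]{1,4}-?\d{3,7})")
dilution_pattern = re.compile(r"1:(\d+)\s*dilution")

print(antibody_pattern.findall(methods_text))   # ['anti-CD4 antibody']
print(catalog_pattern.findall(methods_text))    # ['AB-12345']
print(dilution_pattern.findall(methods_text))   # ['200']
```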
AI models applied to TDM on materials data have extended what's possible with these data. At first, you could use TDM to find crystal structure data in comprehensive materials databases and derive materials properties from it.
The next stage was to use a material's composition to find both its structure and properties. Now, AI models can use predictive analytics to generate statistically driven materials design. That means you can use chemical and physical data to design a material with the desired composition, structure, and properties, and then run virtual experiments on it before you ever synthesize and evaluate it in the real world.
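Here is a minimal sketch of what such a composition-to-property "virtual experiment" could look like with a surrogate model. The data are synthetic and the descriptor names are assumptions made for illustration; real workflows train on properties mined from the literature and from materials databases.

```python
# Minimal sketch of composition-to-property screening with a surrogate model.
# The data are synthetic and the descriptors are invented; this only shows
# the shape of the workflow, not any specific published method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical composition descriptors: fraction of element A, fraction of
# element B, and a mean atomic radius (arbitrary units).
X = rng.random((200, 3))
# Synthetic target property that depends on the descriptors, plus noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.05, 200)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# "Virtual experiment": screen candidate compositions before any synthesis.
candidates = rng.random((5, 3))
for comp, pred in zip(candidates, model.predict(candidates)):
    print(f"composition={np.round(comp, 2)} -> predicted property={pred:.2f}")
```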
So far, this approach has proven highly impactful in semiconductor design, which in turn fuels integrated circuit (IC) and chip design. It has reduced the schedule overrun for IC design to less than 10% and cut project duration by up to 10%.
Financial firms, too, have become interested in applying TDM to published research. Applying these models and TDM to a research corpus has allowed these companies to understand and analyse supply chains, especially for chemical manufacturing. It has also helped show how the research patterns of R&D-driven companies can predict how those companies' stocks might perform in the market.
These use cases are built on Springer Nature's research corpus, and the API provides access to the data these models use. That means there are two parts to this: the quality of the data, and the access to it.
Springer Nature has built — and continues to build — this data meticulously. We attract authors for our journals and books with frontline support for authors, reviewers, and editors across all stages, rigorously validate those submissions with first-rate peer review, and transform those manuscripts into best-in-class published articles, books, and databases.
Of course, those data can't power these models if the models can't easily get at them. That's why Springer Nature works to make our whole database FAIR, which means Findable, Accessible, Interoperable, and Reusable.
The next puzzle piece is the API that provides the link between Springer Nature’s data and the models and machines that use it.
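As a starting point, a query against the API might look something like the sketch below. It assumes the Springer Nature Metadata API's JSON endpoint with `q`, `p`, and `api_key` parameters and a `records` array in the response; check the developer portal (dev.springernature.com) for the authoritative specification, and treat the key, query, and field names here as placeholders.

```python
# Minimal sketch of querying published-research metadata over the API and
# handing the results to a TDM workflow. Endpoint, parameters, and field
# names are assumptions based on the public Metadata API documentation;
# the API key and query are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key from the developer portal
BASE_URL = "https://api.springernature.com/metadata/json"

params = {
    "q": 'keyword:"crystal structure"',  # example query; adjust to your project
    "p": 20,                             # number of records per request
    "api_key": API_KEY,
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
records = response.json().get("records", [])

# Pass the metadata on to whatever mining or model-training step comes next.
for record in records:
    print(record.get("doi"), "-", record.get("title"))
```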
The webinar — which runs for a little less than an hour including Q&A — walks you through these case studies, to show what could be possible for you and your institution. Watch the webinar here, and then find out how to learn more and take the next steps.