White Paper

Managing Institutional Knowledge and Insight

Text and data mining (TDM) tools are enabling researchers and information managers to enrich internal and external content and discover relationships across varied content and disciplinary fields. Information and knowledge managers can play key roles in the development and implementation of TDM projects: acquiring and licensing the right tools and content, managing and linking knowledge models, and identifying data silos and specialized resource collections. This white paper describes how information managers can contribute to TDM projects and what questions they need to ask before a TDM project is initiated.

TDM, Information Management and Discovery

Text and data mining, broadly defined, is the automated process of selecting and analyzing large volumes of text or data in a way that can provide valuable information needed for studies and research projects. TDM enables researchers to conduct more comprehensive searches, identify patterns and trends, and learn how content relates to ideas and needs. Taxonomies—both those developed in-house and from authoritative third-party sources—can be applied to both internal and external content to facilitate the discovery of relevant internal documents, technical standards, reports, and published literature. Using TDM for enterprise knowledge management is a role often taken on by information managers and information service directors.
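
As a rough illustration of how a taxonomy can be applied to content to support discovery, the following minimal Python sketch tags a handful of documents with taxonomy concepts and builds a simple concept-to-document index. The taxonomy entries, document identifiers, and function names are all hypothetical and chosen only for illustration; a production TDM pipeline would typically rely on richer ontologies and more robust language processing than whole-phrase matching.

```python
import re
from collections import defaultdict

# Hypothetical in-house taxonomy: concept -> synonyms (illustrative only).
TAXONOMY = {
    "clean energy": ["clean energy", "renewable energy"],
    "rheumatoid arthritis": ["rheumatoid arthritis"],
}

# Hypothetical mix of internal and external documents (toy text).
DOCUMENTS = {
    "report-1": "Internal report on renewable energy storage.",
    "article-7": "A trial in rheumatoid arthritis patients.",
    "memo-3": "Quarterly budget memo.",
}


def tag_document(text, taxonomy):
    """Return the sorted taxonomy concepts whose synonyms appear in the text."""
    found = set()
    for concept, synonyms in taxonomy.items():
        for term in synonyms:
            # Whole-phrase, case-insensitive match on word boundaries.
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept)
                break
    return sorted(found)


def build_index(documents, taxonomy):
    """Map each concept to the ids of the documents tagged with it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for concept in tag_document(text, taxonomy):
            index[concept].append(doc_id)
    return dict(index)


if __name__ == "__main__":
    print(build_index(DOCUMENTS, TAXONOMY))
```

In this sketch, the report that mentions "renewable energy" is retrievable under the concept "clean energy" even though it never uses that phrase, which is the kind of discoverability gain that a shared taxonomy is meant to provide.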


In addition to increasing discoverability of articles and other research material, TDM is used to define and predict relationships or patterns among concepts. This enables researchers to generate hypotheses and predictions such as which therapies are most likely to help with a specific disease or which materials are most likely to meet the required physical parameters.

Impact of TDM on Content Acquisition and Management

These two distinct uses of TDM—to improve discoverability of individual material and to discern patterns among content—have significant implications for both information managers and content producers. Publishers are shifting their business and pricing models from delivering information as published articles aggregated in journal issues read by subscribers to licensing machine-readable content that can be queried to discover new insights. This "atomization" of information requires a different value model, as the value of TDM applications cannot be calculated by the number of article downloads or other metrics tied to the full text content.


Acquiring content has changed as new TDM tools are developed and new use cases are found. A TDM license is an analysis and discovery tool rather than a subscription to content; in fact, it is often the case that not all content from a publisher is available for this type of license. One consequence of using TDM for insight is that there is often a loss of awareness of the sources being used for a TDM project; users see the results of data analysis, not the individual publications from which each data point was extracted. This effect will become even more pronounced as more apps are able to consume licensed content and the end user is even less able to discern what sources are being used or to identify any gaps or biases in the content being analyzed. That said, the proliferation of apps that can work with metadata to meet users' needs in ways that a traditional publisher cannot will change how people consume information.


As information managers and publishers sit down to negotiate a TDM license, both parties need to realize that they may see the value of a TDM license differently. A customer may view TDM as simply another way to access the information they already subscribe to; they may want to apply their own ontological framework rather than use the semantic enrichments provided by the publisher, and may balk at any TDM licensing fee above a nominal amount. Publishers, on the other hand, may view TDM as a transformative service that stands entirely separate from the underlying content, and the value assigned to a TDM license may vary from one publisher to the next.

The Role of (Human) Knowledge Models

Most organizations with robust TDM initiatives rely on internally developed ontologies that organize and structure knowledge in the subject domains most important to their research. In addition, consortia of subject experts, industry groups and associations, and publishers create ontologies to represent the relationships between concepts and entities in a particular field.


These knowledge models help address the need for researchers to conduct as comprehensive a scan as possible of the literature while still remaining focused on a particular field. In the past, researchers were able to conduct online or manual scans of the published literature with a reasonable assurance of comprehensive results. As sub-specializations within fields have proliferated and the number of STM publications continues to grow, researchers have greater concerns that they will miss a relevant study appearing in a niche publication. Using a knowledge model that reflects all the aspects of a domain of knowledge, a TDM algorithm can scan a much broader range of content and identify relevant content that appears in sources outside a researcher's area of focus. As scientists and researchers tackle complex and interconnected societal problems—COVID-19, climate change, clean energy—the need to find relevant insights in a wide range of domains is critical.


Information managers and KM professionals have found creative ways to use TDM to ensure the right coverage of a fragmented field of knowledge. One director of information services for a pharmaceutical company uses TDM to monitor a wide range of content, including peer-reviewed articles, company news, conference abstracts, drug pipeline data, and other resources, for an internal newsletter. The news is distributed through an internal social network and a conversation often develops among the researchers as to why a particular item was included. These debates are used to educate the TDM algorithm to better reflect the focus of the scientists; the director commented that this iterative process has greatly enhanced both the TDM system and knowledge sharing within the organization.

TDM and the Coronavirus Knowledge Graph

In the race to develop a vaccine for COVID-19, researchers are using TDM and other AI tools to analyze the relevant data. One initiative, the COVID GRAPH project, is using open datasets of over 44,000 articles as well as patents, gene data, clinical studies, molecular data and other resources to develop a COVID-19 knowledge graph. Scientists and researchers are able to run TDM algorithms to find new relationships and identify possible therapeutics.


In another initiative, the UK-based AI firm BenevolentAI has built a drug discovery and development platform that uses machine learning to extract both structured and unstructured biomedical data from published material and build a knowledge graph of relationships between diseases, genes and drugs. Researchers using its AI algorithm have identified six compounds that appear to reduce the ability of the virus to replicate. In February 2020, BenevolentAI reported that its platform had identified baricitinib, a drug approved for the treatment of rheumatoid arthritis, as a potential treatment for COVID-19. The company continues to leverage its knowledge graph and work with pharmaceutical companies to identify targets for a number of diseases.
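
To make the idea of a knowledge graph more concrete, the minimal Python sketch below links entities that are mentioned in the same sentence and then lists everything connected to a given entity. It is a simplified stand-in based on co-occurrence counting, not the method used by the COVID GRAPH project or BenevolentAI, neither of which is described in detail here. The entity dictionary, the toy sentences, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary: lowercase surface term -> entity type.
ENTITIES = {
    "baricitinib": "drug",
    "rheumatoid arthritis": "disease",
    "covid-19": "disease",
    "jak1": "gene",
}

# Toy sentences written for this example, not extracted from real articles.
SENTENCES = [
    "Baricitinib is approved for the treatment of rheumatoid arthritis.",
    "Baricitinib was identified as a potential treatment for COVID-19.",
    "Baricitinib inhibits JAK1.",
]


def extract_entities(sentence, entities):
    """Return the sorted known entity terms found in the sentence."""
    lowered = sentence.lower()
    return sorted(term for term in entities if term in lowered)


def build_graph(sentences, entities):
    """Count how often each pair of entities appears in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        found = extract_entities(sentence, entities)
        # Pairs are emitted in sorted order, so each edge has a single key.
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, term):
    """Return the entities linked to a term, with co-occurrence counts."""
    linked = {}
    for (a, b), weight in edges.items():
        if a == term:
            linked[b] = weight
        elif b == term:
            linked[a] = weight
    return linked


if __name__ == "__main__":
    graph = build_graph(SENTENCES, ENTITIES)
    print(neighbors(graph, "baricitinib"))
```

Even in this toy graph, a drug node connects two diseases and a gene that never appear together in a single sentence. Surfacing indirect links of this kind across a large body of literature is what allows researchers to propose candidate therapeutics for further investigation.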

Information Managers' Role in TDM Initiatives

While information managers are often not the instigators of TDM initiatives, their role can be pivotal in the ultimate success of the project. In fact, the value of any TDM project depends on knowing what questions are being addressed, what sources should—and can—be included, and what ontologies and taxonomies to use. Information managers can raise the functionality of a TDM project by bringing in the right tools to create linkages among pieces of information. As one information manager noted, knowledge management without the structure and tools of TDM is simply more data – "The magic happens once the content is brought in-house and we figure out how to make it useful; TDM brings intelligence to the data."


One of the assets that information managers bring to a TDM project is their connection to the various user groups within their enterprise. Since many organizations develop their APIs and other finding tools internally rather than relying on outside vendors, information managers can be key to understanding how users are likely to be querying the resources, what kinds of challenges they are encountering, and what kind of customization and annotation is needed for each project team. Information managers also serve to bring together all the groups and stakeholders who could benefit from the use of a licensed dataset or the development of an API to query a content collection, coordinating licensing and facilitating collaboration among user groups.


This issue of resource coordination is particularly important in companies that have experienced mergers or other disruptive events; there is often more of a reluctance to share internal information and resources when other user groups are unfamiliar or unknown. One drug information manager noted that one consequence of acquisitions within their industry was the growth of data silos and the resulting lack of cross-platform searchability. His response, by necessity, was to move forward with what he had access to and, as he said, "we just had to leave the data silos in the dust and let them shut down once we were able to offer resources with semantic enrichment and improved findability." In fact, he noted that his biggest obstacle to institutionalizing an enterprise-wide knowledge management program is raising awareness within the organization of how the information center can support their research with TDM tools and resources.


Among the roles for information managers in TDM projects are:

  • Raising awareness of TDM as a research tool, particularly following an acquisition or other influx of new employees. Information centers can develop AI labs or sandbox areas, in which researchers can try out TDM methods using a combination of open access content, licensed metadata, and available TDM services. Learning events and user meet-ups can be hosted by the information center. The University of Rhode Island's AI Lab (https://web.uri.edu/ai/) is an example of how a library can support exploration of new technology within the larger organization.
  • Including TDM in content licensing discussions. Every publisher takes a slightly different approach to licensing their content for TDM, and these conversations can be lengthy and technical. Both the information managers and the vendors are breaking new ground; the more that information managers bring this to the negotiation table, the more familiar all parties will be with the issues and concerns involved in TDM licensing.
  • Identifying data silos and specialized resource collections. As noted above, this is particularly an issue in enterprises that experience mergers. In addition, information managers can work to identify specialized ontologies that internal groups have developed, and use them to enhance other internal and licensed content for more relevant retrieval. APIs created or licensed by one team may be of value to other groups within the organization as well.
  • Ensuring that the right information sources are included in a TDM project. Information managers have a unique perspective on available resources, both internally and externally. Whether that is identifying an authoritative public ontology, licensing access to technical standards or patents, or incorporating customer data, other internal documents, clinical trial data, and scientific and technical publications into a KM project, information managers have a unique understanding of the best resources for each use case.

These roles require that all information center staff develop a familiarity with TDM issues and that staff members build expertise in specific areas of licensing, access, storage and preservation, training, tools and methods, and outreach. This may require a significant investment of staff time and focus, which needs to be factored into any TDM initiative. At the same time, this commitment to building and maintaining TDM skill sets within the information center can be highlighted when conducting outreach and describing what the information center brings to a project. Strategic support for TDM projects both expands the impact of the information center within the enterprise and leverages the unique expertise that information managers bring.