Literature Mining

The information retrieval system SCAIView enables semantic searches in large text collections by combining free-text search with ontological representations of entities identified by text mining systems. We have developed a dedicated system, COVID-19 SCAIView, for research on the novel coronavirus SARS-CoV-2 and the resulting disease, COVID-19. This system is tuned to answer questions such as “Which genes/proteins are related to COVID-19?”, “Which drugs are relevant in the context of COVID-19?” or “What is the biology behind the novel coronavirus SARS-CoV-2?”.

COVID-19 SCAIView's key features are:

  • A user-friendly search environment with a query builder supporting semantic queries with biomedical entities
  • Fast and accurate search and retrieval, based on current semantic search engine technology
  • Color-coded visualization and ranking of the most relevant entities and documents
  • Export of the search results in various file formats

Currently, COVID-19 SCAIView indexes corpora from PubMed, PubMed Central and the CORD-19 dataset, which also includes bioRxiv and medRxiv articles. These corpora are regularly updated in our system so that researchers can browse the latest research articles. Documents are retrieved through precisely formulated questions that use ontological representations of biomedical entities. The entities include genes/proteins, phenotypes, drug compounds, and more. COVID-19 SCAIView supports the selection of suitable entities through autocompletion and a knowledge base entry for each entity. Each entry includes a textual description of the entity, alternative names, an entity identifier, and links to relevant biomedical databases.
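
To make the idea of an entity-based query more concrete, the following is a minimal Python sketch, not the actual SCAIView implementation or API. It assumes a toy in-memory setup: the `Entity` and `Document` classes, the `autocomplete` and `semantic_search` functions, the `EX:` identifiers, and the three example documents are all hypothetical and exist only for illustration. It shows how a knowledge base entry (description, alternative names, identifier, links) might back an autocompletion step, and how a free-text term might be combined with an entity annotation to filter documents.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Hypothetical knowledge base entry; field names are illustrative only."""
    identifier: str
    name: str
    description: str
    synonyms: tuple = ()
    links: tuple = ()

    def all_names(self):
        # Preferred name followed by alternative names.
        return (self.name, *self.synonyms)


@dataclass
class Document:
    """Toy document with entity annotations as a text mining step might produce."""
    doc_id: str
    text: str
    entity_ids: set = field(default_factory=set)


def autocomplete(prefix, entities):
    """Return entities whose name or any synonym starts with the prefix."""
    p = prefix.lower()
    return [e for e in entities
            if any(n.lower().startswith(p) for n in e.all_names())]


def semantic_search(docs, free_text, entity_id):
    """Keep documents that contain the free-text term AND carry the entity annotation."""
    term = free_text.lower()
    return [d.doc_id for d in docs
            if term in d.text.lower() and entity_id in d.entity_ids]


# Made-up identifiers and example texts, for illustration only.
ENTITIES = [
    Entity("EX:ACE2", "ACE2", "Example gene/protein entry",
           synonyms=("angiotensin-converting enzyme 2",)),
    Entity("EX:REMDESIVIR", "remdesivir", "Example drug compound entry",
           synonyms=("GS-5734",)),
]

DOCS = [
    Document("doc1", "SARS-CoV-2 uses ACE2 as its entry receptor.", {"EX:ACE2"}),
    Document("doc2", "Remdesivir was evaluated in hospitalized patients.", {"EX:REMDESIVIR"}),
    Document("doc3", "ACE2 expression varies across tissues.", {"EX:ACE2"}),
]

if __name__ == "__main__":
    # Typing "angio" suggests the ACE2 entry via its alternative name.
    for entity in autocomplete("angio", ENTITIES):
        print(entity.identifier, entity.name)
    # Free-text term "receptor" restricted to documents annotated with ACE2.
    print(semantic_search(DOCS, "receptor", "EX:ACE2"))
```

In this sketch the entity filter works on annotations rather than on surface strings, so a document mentioning only an alternative name would still match as long as the text mining step had assigned it the same identifier.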

In the near future, we will focus on introducing chemistry as well as triples/relations that capture drug-target and drug-side-effect information. Furthermore, we will develop text mining modules that extract information linking clinical treatments to outcomes.