A Data Knowledge Base for Studies at Mega-Science-Class Experiments
One of the main challenges of modern data-intensive science is the rapid growth of data volumes produced by experimental infrastructures, together with the meta-information related to data processing and analysis, such as software releases, conditions data, information about data validity, etc. This problem is particularly relevant for scientific research, ongoing and planned, at the largest research facilities, such as the LHC, XFEL, NICA, ITER, FAIR, and others. In addition to the scientific data, these research complexes accumulate vast volumes of supporting meta-information describing all stages of the life cycle of an experimental study. However, there is a lack of connectivity between the metadata describing the data processing cycle and the metadata representing the life cycle of scientific research in general, including annotation, indexing and publication of results. As a consequence, it becomes tedious and time-consuming to reproduce scientific results, even though reproducibility is among the most important criteria of the validity of scientific knowledge.
It is difficult to estimate the full scale of the problem of scientific result verification. However, sampling studies in some narrow areas, for example in biomedical research, show that the majority of research results published even in leading academic journals cannot be reproduced over time.
Researchers of the "Big Data" Laboratory have analysed the methods used by the international ATLAS collaboration at CERN to store meta-information. The task management system database stores meta-information about computing tasks. The Distributed Data Management database keeps information about data samples and controls data transfers in a distributed computing environment. Internal notes, publications and conference proceedings are stored in CERN's document management system. Some meta-information is available in the form of TWiki pages and JIRA tickets. These storage systems are independent, and the information between them is only loosely synchronized.
To verify a result of the ATLAS experiment, a researcher, guided by the data analysis described in the publication, must identify the data samples on which the research was based, as well as reproduce the original state of the hardware and software environment. Currently, this process is not automated and is carried out manually because of the loose coupling between the data sources.
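The provenance chain a researcher reconstructs by hand can be illustrated with a small sketch. All system names, record fields and identifiers below are hypothetical stand-ins for the real ATLAS services described above, not their actual schemas:

```python
# Toy snapshots of the loosely synchronized metadata sources (illustrative only).
TASK_DB = {  # task management system: computing tasks
    "task_101": {"dataset": "data_sample_A", "release": "sw-release-21.2"},
}
DDM_DB = {  # distributed data management: data samples and their replicas
    "data_sample_A": {"replicas": ["site1", "site2"], "size_gb": 120},
}
DOCS_DB = {  # document system: publications referencing the tasks they used
    "paper_42": {"tasks": ["task_101"]},
}

def trace_publication(paper_id):
    """Follow a publication back to its data samples and software release,
    i.e. the chain that currently has to be reconstructed manually."""
    provenance = []
    for task_id in DOCS_DB[paper_id]["tasks"]:
        task = TASK_DB[task_id]
        sample = DDM_DB[task["dataset"]]
        provenance.append({
            "task": task_id,
            "dataset": task["dataset"],
            "software_release": task["release"],
            "replicas": sample["replicas"],
        })
    return provenance
```

With connected metadata, such a lookup chain could run automatically; without it, each hop between systems is a manual search.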
The proposed scientific knowledge base will help to fill the existing gap in the holistic view of the research life cycle. During its development, the description of a scientific experiment in ATLAS will be formalized and an ontology will be developed to provide connectivity between the various sources of meta-information. Processing, synchronization and aggregation of data obtained from the different metadata sources, including documentary ones, will be implemented in a distributed repository (Hadoop). In the current prototype of the knowledge base, this repository contains a representative sample of ATLAS experiment metadata.
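An ontology that connects metadata sources can be pictured as a set of subject-predicate-object statements. The following minimal sketch uses illustrative predicate names, not the actual ontology under development:

```python
# A tiny triple store: each statement links records from independent systems.
triples = set()

def add(subject, predicate, obj):
    triples.add((subject, predicate, obj))

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Records from separate metadata systems become connected statements.
add("paper_42", "describesAnalysisOf", "data_sample_A")
add("task_101", "produced", "data_sample_A")
add("task_101", "usedSoftwareRelease", "release_21.2")
```

Once the statements are in one store, a single pattern query, e.g. `query(obj="data_sample_A")`, returns every publication and task connected to a data sample, regardless of which source system the record came from.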
The annotation of scientific publications will be implemented using machine learning technologies, and a system for extracting information from documentary sources will be developed. The result will be a software system that integrates data about physics experiments from different sources into a single ontological repository and presents all information about an experiment in a coherent, automated view.
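As a simplified stand-in for such an extraction step, a first pass over publication text might pull out candidate dataset identifiers by pattern matching. The naming pattern below is illustrative only; the system described here would rely on trained machine learning models rather than a fixed regular expression:

```python
import re

# Simplified, ATLAS-like dataset naming pattern (illustrative, not the real schema).
DATASET_PATTERN = re.compile(r"\b(?:data|mc)\d{2}_13TeV\.[\w.]+\b")

def annotate(text):
    """Return the set of dataset-like identifiers mentioned in a text fragment."""
    return set(DATASET_PATTERN.findall(text))
```

Identifiers extracted this way could then be linked to the corresponding task and data management records in the ontological repository.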