The ATLAS PanDA Workload Management System Evaluation For Genome Sequencing Data Analysis On NRC “Kurchatov Institute” Computing Facilities
Modern biology operates with large amounts of data and investigates very complex systems. Scientists use complicated algorithms and sophisticated software that cannot be run without access to significant computing resources. In addition, modern biology requires efficient processing of large data volumes, such as DNA sequences, protein structures, genome-scale models, and molecular dynamics simulations. Recent advances in Next Generation Genome Sequencing technology have led to a significant increase in the amount of sequencing data that has to be processed, analysed and made available to bioinformaticians worldwide. Ancient DNA analysis is one of the most challenging and CPU-intensive scientific problems.
It can take a couple of months on a supercomputer at NRC KI to analyse ancient genome sequencing data with the widely used PALEOMIX package. The issue of data processing at a large scale has been addressed in the past by the LHC experiments at CERN, and in our studies we have evaluated and adapted the PanDA Workload Management System (WMS), initially developed and deployed by the ATLAS experiment for LHC data processing and analysis. To improve the performance of the PALEOMIX workflow, we split the input data into chunks, process the chunks simultaneously, and merge the intermediate results at the end. This scenario allowed us to run many parallel tasks on the distributed computing resources (supercomputer and academic cloud) at the NRC “Kurchatov Institute”.
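The split, parallel-process, and merge pattern described above can be illustrated with a minimal Python sketch. It is not the PanDA or PALEOMIX implementation and uses neither tool's API: the function names (`split_into_chunks`, `process_chunk`, `merge_results`, `run_scatter_gather`) are hypothetical, the per-chunk payload is a toy stand-in (counting bases in reads), and a local thread pool takes the place of the distributed resources.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Hypothetical per-chunk payload: count how often each base occurs.

    In a real workflow this step would be an independent job running
    the analysis software on one chunk of the input data.
    """
    counts = Counter()
    for read in chunk:
        counts.update(read)  # counts the individual characters of the read
    return counts


def merge_results(partials):
    """Merge the intermediate per-chunk results into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts from each partial result
    return total


def run_scatter_gather(records, chunk_size, max_workers=4):
    """Split the input, process the chunks concurrently, then merge."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return merge_results(partials)


if __name__ == "__main__":
    reads = ["ACGT", "AACC", "GGTT", "ACGT", "TTTT"]
    print(run_scatter_gather(reads, chunk_size=2))
```

The sketch only works because the toy payload is decomposable: merging the per-chunk counts gives the same answer as a single monolithic pass over all reads. Whether a given analysis step can be split this way is an assumption that has to be checked for each real workflow.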
A dedicated PanDA instance has been installed, adapted and used at NRC KI to define data processing tasks and to broker them to the available computing resources. PanDA also managed task execution during the whole payload life cycle. This approach dramatically decreased the total walltime in comparison with the traditional monolithic workflow execution. Task execution time was reduced from several weeks to several days, as demonstrated for the analysis of mammoth DNA samples.
In this paper we describe our recent accomplishments in running bioinformatics applications with the PanDA Workload Management System on supercomputers and clouds.