
Machine Learning Technologies to Predict the ATLAS Production System Behaviour

Name
Maksim
Surname
Gubin
Scientific organization
Tomsk Polytechnic University
Academic degree
engineer
Position
junior research assistant
Scientific discipline
Information technologies
Topic
Machine Learning Technologies to Predict the ATLAS Production System Behaviour
Abstract
The ATLAS Production System (ProdSys2) is an automated scheduling system that is responsible for data processing, data analysis and Monte Carlo production on the Grid, supercomputers and clouds.
We proposed using a machine learning (ML) approach together with ProdSys2 job execution information to predict the behavior of the system, starting with estimates of task completion times. The WLCG ML R&D project was started in 2016; we will present our first results on how ProdSys2 behavior can be predicted and simulated.
Keywords
ProdSys, ATLAS, GRID, distributed computing, machine learning, anomaly detection
Summary

The second generation of the ATLAS Production System (ProdSys2) is an automated scheduling system that is responsible for data processing, data analysis and Monte Carlo production on the Grid, supercomputers and clouds. The ProdSys2 project was started in 2014 and commissioned in 2015, just before LHC Run 2. It now handles O(2M) tasks per year and O(2M) jobs per day, running on more than 250,000 cores; each task is split into many jobs. ProdSys2 continues to evolve to accommodate a growing number of users and new requirements from the ATLAS Collaboration, physics groups and individual users. ATLAS Distributed Computing in its current state is a large and heterogeneous facility, running on the WLCG, academic and commercial clouds, and supercomputers. In this cyber-infrastructure, contention for resources among high-priority data analyses happens routinely. Inevitably, over-utilized computing resources cause degradation of services or significant interruptions to workload and data handling. For these and other reasons, grid data management and processing must tolerate a continuous stream of failures, errors and faults. This makes simulating ProdSys2 behavior a very challenging task that would require an infeasibly large amount of computing power. However, the behavior of the system appears to contain regularities that can be modeled using machine learning (ML) algorithms. We proposed using an ML approach together with ProdSys2 job execution information to predict the behavior of the system, starting with estimates of task completion times. The WLCG ML R&D project was started in 2016; we will present our first results on how ProdSys2 behavior can be predicted and simulated. In the next phase we will use ML algorithms to predict ProdSys2 behavior and to detect anomalies in it.