Deployment of job priority mechanisms in the Italian Cloud of the ATLAS experiment
Journal of Physics: Conference Series

To cite this article: Alessandra Doria et al 2010 J. Phys.: Conf. Ser.
Deployment of Job Priority mechanisms in the Italian Cloud of the ATLAS experiment

Alessandra Doria 1, Alex Barchiesi 2, Simone Campana 3, Gianpaolo Carlino 1, Claudia Ciocca 4, Alessandro De Salvo 1, Alessandro Italiano 4, Elisa Musto 1, Laura Perini 5, Massimo Pistolese 5, Lorenzo Rinaldi 4, Davide Salomoni 4, Luca Vaccarossa 5, Elisabetta Vilucchi 6

1 INFN Napoli, Complesso Universitario di Monte Sant'Angelo, Napoli, Italy
2 INFN Roma I, Piazzale A. Moro 2, Roma, Italy
3 CERN, Geneva 23, Switzerland
4 INFN CNAF, Viale Berti Pichat 6/2, Bologna, Italy
5 INFN Milano, via Celoria 16, Milano, Italy
6 INFN Lab. Naz. di Frascati, via Enrico Fermi 40, Frascati, Italy

Corresponding author: Alessandra.Doria@na.infn.it

Abstract. An optimized use of the Grid computing resources in the ATLAS experiment requires the enforcement of a mechanism of job priorities and of resource sharing among the different activities inside the ATLAS VO. This mechanism has been implemented through the publication of VOViews in the information system and a per-group fair share in the batch system. The VOView concept consists of publishing resource information, such as running and waiting jobs, as a function of VO groups and roles. The ATLAS Italian Cloud is composed of the CNAF Tier1 and the Roma Tier2, with farms based on the LSF batch system, and the Tier2s of Frascati, Milano and Napoli, based on PBS/Torque. In this paper we describe how the job priority mechanism was tested and deployed in the cloud, where the VOMS-based regional group /atlas/it has been created. We show that the VOViews are published and correctly handled by the WMS and that the resources allocated to generic VO users, users with the production role and users of the /atlas/it group correspond to the defined shares.
1. Introduction

In large experiments like ATLAS at the LHC, where thousands of physicists carry out many different activities, systems are needed to organize the use of computing resources and to assign the appropriate share to each activity. This paper describes the fairshare and VOViews mechanisms, deployed and tested in the Italian Computing Federation of the ATLAS [1] experiment, which grant different priorities and predefined shares of computing resources to the various activities in the Grid. This mechanism can be applied throughout ATLAS and can be exported to any WLCG Grid infrastructure where different types of activities contend for resources. The ATLAS computing model and the tools used for the submission of ATLAS jobs are briefly introduced to describe the context of the work. The VOViews concept, present in the glite middleware, is illustrated, and the tests performed to verify that the middleware handles VOViews properly are described.

© 2010 IOP Publishing Ltd
Several tests were also performed to check that the Local Resource Management Systems (LRMS), with the proper configuration, are able to tune the priority of incoming jobs so as to obtain a fair resource sharing among users belonging to different VOs, or to the same VO with different roles and groups (corresponding to different activities). The results of the tests are plotted, showing that the system behaves as required.

2. The ATLAS computing model

ATLAS (A Toroidal LHC ApparatuS) is one of the four particle physics experiments at the LHC. It is composed of more than 2000 physicists from hundreds of research institutes and universities all over the world, and its layout has been designed to exploit the LHC capabilities at best. Its technical characteristics allow efficient reconstruction of charged particle tracks, identification of photons and electrons, and high precision measurements of muon momentum, even using the external muon spectrometer alone. The impossibility of concentrating the needed computing power and storage capacity in a single place required the development of a world-wide distributed computing system, which allows efficient data access and makes use of all available computing resources. The ATLAS Collaboration therefore adopted a computing model [2] based on the WLCG Grid paradigm [3], which provides a high degree of decentralization and sharing of computing resources. Such a model requires that data are distributed to the various computing sites and that user jobs are sent to the data. A complex set of tools and distributed services, enabling the automatic distribution of data and their distributed processing, has been developed by ATLAS in cooperation with the middleware providers of the three large Grid infrastructures: EGEE [4] in most of Europe and the rest of the world, NorduGrid [5] in the Scandinavian countries and OSG [6] in the United States. The Computing Model designed by ATLAS is a hierarchical structure organized in different layers called Tiers.
The Tier0 facility is based at CERN and is responsible for the first-pass processing and archiving of the primary RAW data and for their distribution to the Tier1s. The 10 world-wide distributed Tier1 facilities store and guarantee long-term access to RAW and derived data and provide all the reprocessing activities. Each Tier1 heads a set of Tier2 sites, grouped on a regional basis into federations called Clouds. The Tier2s are the centres designed for user analysis, and they provide all the Monte Carlo simulation capacity. In particular, the Italian Tier1 is located at INFN-CNAF in Bologna, while there are four Tier2 facilities: Frascati, Milano, Napoli and Roma, hosted by the local universities and INFN departments and laboratories.

3. Job submission frameworks

The ATLAS jobs on the Grid mainly belong to two categories: user analysis jobs and production jobs. A very small fraction of jobs is of a different kind, like monitoring and software management jobs. In order to understand how computing resources are shared among the main activities, let us review the main features of the ATLAS submission frameworks for production and user analysis jobs (Fig. 1).

3.1. PanDA - Production ANd Distributed Analysis system

The PanDA system [7, 8] has been developed to meet the ATLAS requirements for a data-driven workload management system for production and distributed analysis processing. In EGEE, and in particular in the Italian Cloud, PanDA is mainly used as a centralized production system. The Pathena tool allows users to distribute and run analysis jobs (within the Athena experiment analysis framework) using the PanDA system as a backend; Pathena is mainly used in OSG.
The PanDA production system uses pilot jobs for the acquisition of processing resources. Pilot submission is made directly to the Tier1/Tier2 sites, and workload management is performed by PanDA itself, without WMS matchmaking. Workload jobs are assigned to successfully activated and validated pilots based on PanDA-managed brokerage criteria. Prioritization among production jobs is also managed by the PanDA server: the jobs with the highest priority are assigned to the first available pilots. Pilot jobs always run with the ATLAS production role.

3.2. Using Ganga for distributed user analysis

The Ganga job management system [9, 10, 11] provides a simple way of preparing, organizing and executing user analysis tasks within the experiment analysis framework Athena. The data of interest are stored in files, grouped together as datasets in the ATLAS Distributed Data Management system (DQ2) and distributed among the sites following the Computing Model rules. DQ2 is queried by Ganga during job submission in order to find the sites where the data are located and to send the analysis jobs there. When data are available at more than one site, job brokerage is delegated to the backend services. Ganga can run both locally and on a variety of batch systems and Grids; in particular, it relies on the glite WMS to submit to EGEE sites.

Figure 1. User Distributed Analysis Frameworks

Ganga user jobs are submitted with the user's VOMS credentials, allowing the WMS and the site LRMS to enforce the ranking and priority policies described in the remainder of this work.

4. Job priority mechanisms

To achieve a proper handling of job priorities, both in a multi-VO environment and among different activities of a single VO, two requirements have to be fulfilled:

1. Site Computing Elements must be configured to map different VOMS FQANs (Fully Qualified Attribute Names, consisting of user VO, group and role) to separate groups of users, associated to different resource shares or priorities in the underlying Local Resource Management System.
2. For each group, the corresponding information should be published to the Information System, to let the Workload Management System estimate the rank of a CE based on the information available for the share the user will be mapped to. This is accomplished by means of the VOViews.

The implementations that fulfil these points are illustrated in the next sections. The basis for this work lies in the studies of the EGEE Job Priorities Working Group, which analyzed the experiment requirements and proposed solutions and plans [12, 13].
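The FQAN-to-group mapping of requirement 1 can be illustrated with a small sketch. This is not the actual site configuration: the local group names and the first-match rule order are illustrative assumptions, in the spirit of an LCMAPS-style mapping where more specific FQANs must precede the catch-all VO entry.

```python
# Hypothetical FQAN-to-local-group mapping rules: first match wins, so the
# most specific FQANs come first and the bare VO entry acts as a catch-all.
# Group names (atlasprd, atlasit, atlas) are illustrative, not a real config.
MAPPING_RULES = [
    ("/atlas/Role=production", "atlasprd"),
    ("/atlas/it",              "atlasit"),
    ("/atlas",                 "atlas"),
]

def map_fqan(primary_fqan: str) -> str:
    """Return the local group for a user's primary FQAN."""
    for pattern, group in MAPPING_RULES:
        if primary_fqan == pattern or primary_fqan.startswith(pattern + "/"):
            return group
    raise ValueError(f"no mapping for {primary_fqan}")

print(map_fqan("/atlas/Role=production"))  # atlasprd
print(map_fqan("/atlas/it"))               # atlasit
print(map_fqan("/atlas"))                  # atlas
```

With this ordering, a production proxy, an Italian-group proxy and a generic ATLAS proxy land in three distinct local groups, which the LRMS can then attach to separate shares.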
5. VOViews concept

When jobs are submitted through the glite Workload Management System, as in the case of Ganga, resource selection is based on the attributes published in the information system. Before the introduction of the VOViews, the granularity of the published information was per queue. Sites usually define a separate queue for each VO, so they are able to publish the number of running and waiting jobs, free slots, etc. as a function of the VO. To separate the information of different VOs without the need to define separate queues, the VOView concept (introduced in GLUE Schema 1.2) has been used, allowing the publication of the information as a function of the VO. This is not enough: to distinguish VO activities that are performed under different groups or roles, we need to publish separate information also according to the user's primary FQAN. VOViews can be used to represent share-specific information, with each VOView publishing an Access Control Base Rule (ACBR) containing the FQANs that are mapped to that share by LCMAPS. These VOViews must be mutually exclusive: the glite WMS must not match more than one for the same CE, otherwise there would be an unsolvable ambiguity. This is achieved via an appropriate use of DENY tags in the Access Control Base Rule attribute [13].

5.1. VOViews in the Information System

The information published for the GlueCE object (with GlueCEUniqueID corresponding to a batch system queue) contains the number of running and waiting jobs in the queue, and the GlueCEAccessControlBaseRule indicates the VOs enabled on the queue. This information is not sufficient to distinguish intra-VO activities.
The information specific to the ATLAS production role is published in the object

dn: GlueVOViewLocalID=/atlas/production, GlueCEUniqueID=atlasce01.na.infn.it:2119/jobmanager-lcgpbs-atlas, Mds-Vo-name=INFN-NAPOLI-ATLAS, o=grid

and its ACBR is

GlueCEAccessControlBaseRule: VOMS:/atlas/Role=production

This VOView will be the only one to be matched if the primary FQAN in the user proxy is /atlas/Role=production, and similar information is published for any other defined ATLAS subgroup. All jobs which do not have a primary FQAN matching one of the defined subgroups will run using a third sub-share for ATLAS, described by the following VOView:

dn: GlueVOViewLocalID=atlas, GlueCEUniqueID=atlasce01.na.infn.it:2119/jobmanager-lcgpbs-atlas, Mds-Vo-name=INFN-NAPOLI-ATLAS, o=grid
GlueCEAccessControlBaseRule: VO:atlas
GlueCEAccessControlBaseRule: DENY:/atlas/Role=production
GlueCEAccessControlBaseRule: DENY:/atlas/it

The ACBR for this view contains three items:

- VO:atlas: all VOMS proxies having ATLAS as VO are allowed;
- DENY:/atlas/Role=production: VOMS proxies with the /atlas/Role=production FQAN are forbidden;
- DENY:/atlas/it: VOMS proxies with the /atlas/it group are forbidden.

The DENY rules take precedence. This guarantees that all the VOViews are exclusive by construction, i.e. no more than one will be matched by the glite WMS. Running our tests, we verified at run time that the Information System publishes the correct values of WaitingJobs and RunningJobs for each GlueVOViewLocalID, in particular for the production role, the Italian users group and generic ATLAS proxies.

5.2. Yaim configuration

The configuration enabling the CE to publish VOViews to the Information System was done with Yaim. Once a sufficiently recent version of the Information Provider was installed, we just had to enable the subgroups for the VO in the configuration file and set the variable FQANVOVIEWS=yes. See references [14] and [15] for the configuration procedures.

5.3. Use of VOViews by the WMS

In this work we only refer to the glite WMS, since the LCG Resource Broker does not understand VOViews. The glite WMS performs the actual matchmaking against only the CE resources accessible to the primary FQAN of the user proxy, and chooses the best CE according to a rank calculated from the corresponding information. While jobs of different subgroups were running on our site CE, we created proxies with different FQANs and performed the WMS ranking with each proxy. We could verify that the ranking differed according to the proxy used.

6. Job Priorities and Fairshare

The experiment requirement is to assign to each activity a given share of resources, defined by the policies of the experiment. These shares are not supposed to be respected at every moment, but calculated on the integrated usage of resources over time. At the level of the local batch system scheduler, we can assign absolute priorities to users, groups and queues, affecting the execution order of queued jobs regardless of the relative resource usage over time.
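The mutual exclusivity of the three ATLAS VOViews can be restated as a short executable sketch. This is not the WMS matchmaking code, only a minimal model of the semantics described above (DENY rules checked first and taking precedence over the VOMS: and VO: allow rules):

```python
# Minimal model of ACBR evaluation with DENY precedence (not the WMS code).
def matches(acbr_list, vo, primary_fqan):
    """True if a proxy with the given VO and primary FQAN matches this VOView."""
    # DENY rules take precedence over any allow rule.
    for rule in acbr_list:
        if rule.startswith("DENY:"):
            denied = rule[len("DENY:"):]
            if primary_fqan == denied or primary_fqan.startswith(denied + "/"):
                return False
    for rule in acbr_list:
        if rule.startswith("VOMS:") and primary_fqan == rule[len("VOMS:"):]:
            return True
        if rule.startswith("VO:") and vo == rule[len("VO:"):]:
            return True
    return False

# The three VOViews as published for the ATLAS queue.
views = {
    "production": ["VOMS:/atlas/Role=production"],
    "atlasit":    ["VOMS:/atlas/it"],
    "atlas":      ["VO:atlas", "DENY:/atlas/Role=production", "DENY:/atlas/it"],
}

# Each primary FQAN matches exactly one view.
for fqan in ["/atlas/Role=production", "/atlas/it", "/atlas"]:
    matched = [name for name, acbr in views.items() if matches(acbr, "atlas", fqan)]
    assert len(matched) == 1
    print(fqan, "->", matched[0])
```

The loop at the bottom verifies the exclusivity-by-construction property: without the two DENY items, a production or /atlas/it proxy would also match the generic atlas view.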
More useful to satisfy our requirements is the fairshare configuration, which allows historical resource utilization information to be incorporated into job feasibility and priority decisions. This feature allows site administrators to set system utilization targets for any credential (i.e. user, group, class, etc.) which they wish to have affected by this information. Administrators can also specify the timeframe over which resource utilization is evaluated in determining whether or not the goal is being reached. In the Italian Cloud, the lcg-CE is used with two different batch systems: PBS/Torque+Maui at the LNF, Milano and Napoli Tier2s, and LSF at the Roma1 Tier2 and the CNAF Tier1.
Both systems were configured to deploy a group-based fairshare policy among the following groups, each corresponding to a different VOView:

- atlas, without any role or group, representing a generic user;
- /atlas/Role=production, for central production activities;
- /atlas/it, for the Italian analysis group;
- any other supported VO.

To test the scheduler behavior, we filled the system queues with jobs of the different groups and recorded the resource occupancy over time, in order to verify that the system shares the computing time as expected.

6.1. Deployment and tests with PBS/Torque + MAUI

The MAUI scheduler (version 3.2.6p20 was used) allows a large number of parameters to contribute to the calculation of job priorities, with different weights. To focus our tests on fairshare only, we disabled the contribution of any other priority component. In real system operation some other components are taken into account with lower weight, for example the amount of time a job stays queued. The fairshare-based priority is regulated by the following parameters:

- FSTARGET: percentage of resources to be used by a credential;
- FSINTERVAL: duration of each fairshare window;
- FSDEPTH: number of fairshare windows factored into the current fairshare utilization;
- FSDECAY: decay factor applied in weighting the contribution of each fairshare window (the decay has been set to 1 for all tests, to simplify the interpretation of the results);
- FSPOLICY: metric to use when tracking fairshare usage. Different policies are possible, e.g. usage of CPU time or wall clock time; we set FSPOLICY to job wall clock time for all tests.

MAUI recalculates the job priorities at the end of each window. The fairshare contribution increases the priority of a job if the resource utilization by the job credential is lower than the FSTARGET, and decreases it otherwise. Tests with PBS/Torque + MAUI have been performed at the Napoli and Milano Tier2s. The first one (Fig.
2) was made with two credentials, assigning a target of 70% to the atlas group and 30% to the production subgroup. The relative usage of resources is shown in Fig. 2: it is represented by the average wall clock time calculated from the beginning of the test. We observe that the trend over the first 10 windows is opposite to the expected behavior, due to the past resource usage, when only the production group was running. The asymptotic trend corresponds to the expected fairshare targets.

Figure 2. MAUI test: relative share for 2 subgroups
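The windowed fairshare calculation described above can be sketched as follows. This is an illustrative model, not MAUI's implementation: the normalization and the overall weight (here a hypothetical FSWEIGHT constant) are assumptions, but the structure mirrors the parameters listed above, with FSDECAY = 1 so that all windows weigh equally, as in the tests.

```python
# Illustrative model of MAUI-style fairshare priority (not MAUI source code).
FSDEPTH = 10      # number of past windows factored in
FSDECAY = 1.0     # decay 1.0, as in the tests: all windows weigh equally
FSWEIGHT = 100.0  # hypothetical weight of the fairshare priority component

def fairshare_priority(target_pct, usage_windows):
    """usage_windows: per-window wall-clock share (%) used by the credential,
    most recent window first. Returns the fairshare priority contribution."""
    n = min(FSDEPTH, len(usage_windows))
    weights = [FSDECAY ** i for i in range(n)]
    used = sum(w * u for w, u in zip(weights, usage_windows)) / sum(weights)
    # Positive when the credential is under its target, negative when above:
    # under-target groups gain priority, over-target groups lose it.
    return FSWEIGHT * (target_pct - used) / 100.0

# atlas (target 70%) has only used ~30% recently -> priority boost;
# production (target 30%) has used ~70% -> priority penalty.
print(fairshare_priority(70, [30, 28, 35]))
print(fairshare_priority(30, [70, 72, 65]))
```

Recomputing this quantity at every window edge also reproduces the oscillatory behavior seen in the plots: a subgroup below target is boosted until its measured usage crosses the target, after which the sign of its contribution flips.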
The second test (Fig. 3) with MAUI involves three subgroups, with targets of 40% for the atlas credential, 30% for the production role and 30% for the /atlas/it credential. In this test we used shorter jobs and a shorter window size, thus reaching the expected shares faster than in the previous test. The oscillations in the plots are due to the fact that the scheduler recalculates all job priorities at the end of each window. When a subgroup is under its target, the priorities of all its jobs are increased and its jobs run as soon as a slot becomes free. This goes on until the subgroup exceeds its target; at the next window edge the subgroup priority starts to be decreased.

Figure 3. MAUI test: relative share for 3 subgroups

6.2. Deployment and tests with LSF

Tests with the Platform LSF batch system (version 7.0.2) were performed at CNAF, the Italian Tier1. For this specific test activity we set up a dedicated cluster with just 100 slots available, so that the accounted resources were not affected by the production activity. In this case the accounted resources are an important aspect, because in order to differentiate user job priorities we chose "Hierarchical Fairshare" as the scheduling algorithm, which uses the resources accounted to each user to calculate the user job priorities. The fairshare implementation in LSF of course uses other parameters, such as the "group share" and the "history time window", which is the time period used by the scheduler to compute the total resources used by every user. In this specific test the history time window was set to 5 hours, a short value. In general, a constantly running activity on the cluster benefits from a short time window, because all the resources used before the time window are not taken into account; in this way we simulated a production environment. The group shares were configured like those of the production cluster.
For a multi-VO Tier1, resource contention between different VOs is a common situation, so two different tests were performed: intra-VO fairshare for a single VO, and multiple intra-VO fairshare.
Intra-VO FairShare for a single VO. Only two user subgroups, belonging to the same VO, submitted jobs, so they could use all the available resources. The shares were set to 30% for the atlas user and 70% for the production role. The plot in Fig. 4 shows the wall clock time used every hour by each subgroup. The trend in the first part of the plot is not linear because the jobs have different durations (from a couple of minutes to a few hours), while in the second part the trend becomes more linear because all jobs have the same length (ten minutes). However, in case of contention, the way the batch system splits the resources among the subgroups is clear: 30% is assigned to the atlas credential and 70% to the atlas production role, as expected.

Figure 4. LSF test: Wall Clock Time for 2 subgroups

Multiple intra-VO FairShare. Two VOs have been involved in this test, ATLAS and LHCb, each with two user subgroups submitting jobs. In this way we have verified how the batch system splits the available resources among the VOs and the four user subgroups. The plot in Fig. 5 shows the absolute percentage of wall clock time used every hour by each user subgroup. In case of contention, the batch system grants the same amount of wall clock time to each VO, because the configured share is 25% for each. At the same time the batch system splits each VO's available resources among its user subgroups according to the configured shares. When there isn't contention, whoever is running is free to use all the available resources.
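Conceptually, hierarchical fairshare resolves shares top-down: a VO's slice of the cluster is split among its subgroups according to their configured fractions. The sketch below illustrates this arithmetic only; the numbers mirror the test setup (25% per VO, a 30/70 split within each VO), and the actual LSF configuration syntax is not reproduced here.

```python
# Sketch of hierarchical share resolution under contention (conceptual only).
def resolve_shares(tree, total=100.0):
    """tree: {vo: (vo_share_pct, {subgroup: subgroup_pct})}.
    Returns a flat {(vo, subgroup): absolute % of the cluster} mapping."""
    flat = {}
    for vo, (vo_pct, subgroups) in tree.items():
        vo_slice = total * vo_pct / 100.0           # VO's slice of the cluster
        for sub, sub_pct in subgroups.items():
            flat[(vo, sub)] = vo_slice * sub_pct / 100.0
    return flat

# Two VOs at 25% each, split 30/70 between generic users and production.
shares = resolve_shares({
    "atlas": (25.0, {"atlas": 30.0, "production": 70.0}),
    "lhcb":  (25.0, {"lhcb": 30.0, "production": 70.0}),
})
for key, pct in sorted(shares.items()):
    print(key, pct)   # e.g. ('atlas', 'production') gets 17.5% of the cluster
```

These are targets under full contention; as noted above, when there is no contention the running subgroups are free to use all available slots.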
Figure 5. LSF test: relative share for 2 VOs, 2 subgroups each

7. Conclusions

In the WLCG infrastructures, the contention for resources between different VOs and, within each VO, between experiment central activities and user analysis, requires mechanisms to determine resource sharing according to site, cloud and/or VO level policies. The proposed system can be applied to all Grid infrastructures adopting the glite WMS and has been tested with LSF and Torque/Maui, two of the Local Resource Management Systems most widely used with glite. The Job Priorities and Fairshare mechanisms have been deployed and tested in the Tier1 and the Tier2s of the ATLAS Italian Cloud, providing feedback to the collaboration for tuning an optimal configuration for the ATLAS sites. The information about the VO activities performed under different groups or roles is published with the VOViews mechanism, but at the time of our tests this mechanism was not yet widely used. The correct handling of the VOViews by the Information System and by the glite WMS has been successfully tested and validated in the Italian Cloud for use in the ATLAS collaboration.

References

[1] G. Aad et al. [ATLAS Collaboration], The ATLAS Experiment at the CERN Large Hadron Collider, JINST 3 (2008) S
[2] ATLAS Computing Group, ATLAS Computing Technical Design Report, CERN-LHCC,
[3] LHC Computing Grid Technical Design Report, CERN-LHCC,
[4] E. Laure and B. Jones, Enabling Grids for e-Science: The EGEE Project, EGEE-PUB,
[5] P. Eerola et al., The NorduGrid architecture and tools, Proceedings of CHEP2004, eds A. Aimar, J. Harvey and N. Knoors, CERN, Vol. 2, 2005
[6] R. Pordes, The Open Science Grid (OSG), Proceedings of CHEP2004, eds A. Aimar, J. Harvey and N. Knoors, CERN, Vol. 2, 2005
[7] T. Maeno, PanDA: distributed production and distributed analysis system for ATLAS, 2008 J. Phys.: Conf. Ser.
[8] PanDA TWiki webpage
[9] Ganga project webpage & ATLAS TWiki webpage
[10] F. Brochu et al., Ganga: a tool for computational-task management and easy access to Grid resources, submitted to CPC
[11] J. Elmsheuser et al., Distributed Analysis in ATLAS using GANGA & D. Van Der Ster et al., Functional and Large-Scale Testing of the ATLAS Distributed Analysis Facilities with Ganga, these proceedings
[12] Job Priorities WG
[13] S. Campana and A. Sciabà, Job Priorities Implementation Plan, on behalf of ATLAS and CMS, Private Communication, July 2008
[14]
[15]
More informationAgent based parallel processing tool for Big Scientific Data. A. Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille
Agent based parallel processing tool for Big Scientific Data A. Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille IHEP, Beijing, 5 May 2015 Plan The problem of the HEP data processing DIRAC Project Agent based
More informationMulti-VO support in IHEP s distributed computing environment
Multi-VO support in IHEP s distributed computing environment T Yan 1, B Suo 1, X H Zhao 1, X M Zhang 1, Z T Ma 1, X F Yan 1, T Lin 1, Z Y Deng 1, W D Li 1, S Belov 2, I Pelevanyuk 2, A Zhemchugov 2 and
More informationGlance traceability Web system for equipment traceability and radiation monitoring for the ATLAS experiment
Journal of Physics: Conference Series Glance traceability Web system for equipment traceability and radiation monitoring for the ATLAS experiment To cite this article: L H R A Évora et al 2010 J. Phys.:
More informationTesting SLURM batch system for a grid farm: functionalities, scalability, performance and how it works with Cream-CE
Testing SLURM batch system for a grid farm: functionalities, scalability, performance and how it works with Cream-CE DONVITO GIACINTO (INFN) ZANGRANDO, LUIGI (INFN) SGARAVATTO, MASSIMO (INFN) REBATTO,
More informationFederating Grid Resources. Michael Ernst DESY Seminar May 3rd 2004
Federating Grid Resources Michael Ernst DESY Seminar May 3rd 2004 Approaching New Energy Frontiers ` 10000 Constituent Center of Mass Energy (GeV) 1000 100 10 1 1960 1970 1980 1990 2000 2010 2020 Year
More informationA History-based Estimation for LHCb job requirements
st International Conference on Computing in High Energy and Nuclear Physics (CHEP) IOP Publishing Journal of Physics: Conference Series 66 () 6 doi:.88/7-696/66/6/6 A History-based Estimation for LHCb
More informationReal Time Monitor of Grid job executions
Journal of Physics: Conference Series Real Time Monitor of Grid job executions To cite this article: D J Colling et al 2010 J. Phys.: Conf. Ser. 219 062020 View the article online for updates and enhancements.
More informationHTCondor: CMS Experience
: CMS Experience Farrukh Aftab Khan (on behalf of CMS Submission Infrastructure team) May 19, 2016 1 Collaborators Apologies for the lack of logos! CMS is a huge collaboration of multiple institutes and
More informationCMS Software Distribution on the LCG and OSG Grids
CHEP06, Mumbai, India CMS Software Distribution on the LCG and OSG Grids Stefano Argiro, CERN Shaun Ashby, CERN Volker Buege, FZK/University of Karlsruhe Marco Corvo, CERN/Padova Nikolay Darmenov, SOFIA-INRNE/CERN
More information- Improving the quality of EMI Releases by. leveraging the EMI Testing Infrastructure. Journal of Physics: Conference Series.
Journal of Physics: Conference Series Improving the quality of EMI Releases by leveraging the EMI Testing Infrastructure To cite this article: C Aiftimiei et al 2012 J. Phys.: Conf. Ser. 396 052030 View
More informationGlobus and glite Platforms
Globus and glite Platforms Outline glite introduction services functionality Globus ToolKit overview components architecture glite - middleware a layer between services and resources glite simple view
More informationarxiv:cs/ v1 [cs.dc] 13 Jun 2003
Computing in High Energy and Nuclear Physics, California, March, 2003 1 A Model for Grid User Management Richard Baker, Dantong Yu, and Tomasz Wlodek RHIC/USATLAS Computing Facility Department of Physics
More informationA brief history of glite Past, present and future
A brief history of glite Past, present and future Maria Alandes Pradillo, CERN ESAC E-Science Workshop '10 www.eu-egee.org EGEE and glite are registered trademarks Contents What is glite? Origins of glite
More informationGrid computing for UK particle physics
Journal of Physics: Conference Series Grid computing for UK particle physics To cite this article: J Coles 2008 J. Phys.: Conf. Ser. 119 052011 View the article online for updates and enhancements. Related
More informationCompute Canada Resource Allocation Competition Bryan Caron
Compute Canada Resource Allocation Competition 2015 October 1-2, 2014 Bryan Caron bryan.caron@mcgill.ca bryan.caron@calculquebec.ca McGill University / Calcul Québec / Compute Canada Montréal, QC Canada
More informationThe ATLAS Computing Model
The ATLAS Computing Model Roger Jones Lancaster University International ICFA Workshop on Grid Activities Sinaia,, Romania, 13/10/06 Overview Brief summary ATLAS Facilities and their roles Analysis modes
More informationThe architecture of the AliEn system
www.eu-egee.org The architecture of the AliEn system Predrag Buncic A. J. Peters, P.Saiz, J-F. Grosse-Oetringhaus CHEP 2004 EGEE is a project funded by the European Union under contract IST-2003-508833
More informationJob Invocation Interoperability between NAREGI Middleware Beta and. glite
Job Invocation Interoperability between NAREGI Middleware Beta and glite Hidemoto Nakada (AIST), Kazushige Saga (NII), Hitoshi Sato(Titech), Masayuki Hatanaka (Fujitsu), Yuji Saeki (NII), Satoshi Matsuoka
More informationFeatures and Capabilities. Assess.
Features and Capabilities Cloudamize is a cloud computing analytics platform that provides high precision analytics and powerful automation to improve the ease, speed, and accuracy of moving to the cloud.
More informationA Scalability and Performance Evaluation of a Distributed Usage SLA-based Broker in Large Grid Environments
1 A Scalability and Performance Evaluation of a Distributed Usage SLA-based Broker in Large Grid Environments *Computer Science Department The University of Chicago {cldumitr,iraicu}@cs.uchicago.edu Catalin
More informationOGF Europe Tutorial. How to make sustainable a grid enabled e Infrastructure
4th EGEE User Forum/OGF 25 and OGF Europe's 2nd International Event Le Ciminiere, Catania, Sicily, Italy 2-6 March 2009 OGF Europe Tutorial How to make sustainable a grid enabled e Infrastructure by Pasquale
More informationDB12 Benchmark + LHCb Benchmarking. Andrew McNab University of Manchester LHCb and GridPP
DB12 Benchmark + LHCb Benchmarking Andrew McNab University of Manchester LHCb and GridPP DIRAC 2012 fast benchmark ( DB12 ) What is benchmarking about? DB12 origins and current status: within DIRAC; within
More informationDashboard for the LHC experiments
Grid Support Dashboard for the LHC experiments Irina Sidorova CERN, JINR On behalf of the Dashboard team J. Andreeva, A. Ivanchenko, B. Gaidioz, G. Maier, R. Rocha, P. Saiz Table of content The goal of
More informationSky computing. A grid of clouds Sander Klous, Nikhef
Sky computing A grid of clouds Sander Klous, Nikhef 29-06-2009 http://indico.cern.ch/conferencedisplay.py?confid=56353 Content Use Cases Classification of Virtual Machines Security issues Virtual Machine
More informationOpen Science Grid. Frank Würthwein OSG Application Coordinator Experimental Elementary Particle Physics UCSD
Open Science Grid Frank Würthwein OSG Application Coordinator Experimental Elementary Particle Physics UCSD Particle Physics & Computing Science Driver Event rate = Luminosity x Crossection LHC Revolution
More informationA standards-based Grid resource brokering service supporting advance reservations, coallocation and cross-grid interoperability
To be submitted to CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE November 26 A standards-based Grid resource brokering service supporting advance reservations, coallocation and cross-grid interoperability
More informationARC Middleware and its deployment in the distributed Tier1 center by NDGF. Oxana Smirnova Lund University/NDGF Grid 2008, July , Dubna
ARC Middleware and its deployment in the distributed Tier1 center by NDGF Oxana Smirnova Lund University/NDGF Grid 2008, July 1 2008, Dubna Outlook ARC Classic overview NDGF Tier1 Future of ARC: next generation
More informationTalking Points on US CMS Analysis Plans
Talking Points on US CMS Analysis Plans LATBauerdick, US CMS S&C Project Manager Caltech Workshop on Grid Enabled Analysis June 23, 2003 Lothar A T Bauerdick Fermilab Caltech WS June 23, 2003 2 And We
More informationTowards Modelling-Based Self-adaptive Resource Allocation in Multi-tiers Cloud Systems
Towards Modelling-Based Self-adaptive Resource Allocation in Multi-tiers Cloud Systems Mehdi Sliem 1(B), Nabila Salmi 1,2, and Malika Ioualalen 1 1 MOVEP Laboratory, USTHB, Algiers, Algeria {msliem,nsalmi,mioualalen}@usthb.dz
More informationManageEngine Applications Manager in Financial Domain
ManageEngine Applications Manager in Financial Domain Abstract: A leading bank with thousands of branch offices deployed Applications Manager to monitor their back office applications used in different
More informationStatus of LHC computing activities at CCIN2P3
Status of LHC computing activities at CCIN2P3 Renaud Vernet Jun. 2014 Outline Organization and setup Resource usage Performance (view from the outside) New concepts Conclusion 2 CCIN2P3 and WLCG ATLAS
More informationSome history Enabling Grids for E-sciencE
Outline Grid: architecture and components (on the example of the glite middleware) Slides contributed by Dr. Ariel Garcia Forschungszentrum Karlsruhe Some history Grid and the middleware glite components,
More informationContext. ATLAS is the lead user of the NeIC-affiliated resources LHC upgrade brings new challenges ALICE ATLAS CMS
Context ATLAS is the lead user of the NeIC-affiliated resources LHC upgrade brings new challenges ALICE ATLAS CMS Number of jobs per VO at Nordic Tier1 and Tier2 resources during last 2 months 2013-05-15
More informationOpen Science Grid Ecosystem
Open Science Grid Ecosystem Consortium Infrastructures Project Satellites Services: Consulting Production Software Mission: The Open Science Grid aims to promote discovery and collaboration in data-intensive
More informationInfrastructure Hosting Service. Service Level Expectations
November 2016 Shared Infrastructure Service TOC Service Level Expectation Documents Cloud Premier Data Center Hosting Cloud Essentials Public Cloud Brokerage Managed Database Raw Storage Cloud Premier
More informationThe EU DataGrid Workload Management System: towards the second major release
CHEP 2003, UC San Diego, March 24-28, 2003 1 The EU DataGrid Workload Management System: towards the second major release G. Avellino, S. Beco, B. Cantalupo, A. Maraschini, F. Pacini, A. Terracina DATAMAT
More informationTop six performance challenges in managing microservices in a hybrid cloud
Top six performance challenges in managing microservices in a hybrid cloud Table of Contents Top six performance challenges in managing microservices in a hybrid cloud Introduction... 3 Chapter 1: Managing
More informationHTCondor at the RACF. William Strecker-Kellogg RHIC/ATLAS Computing Facility Brookhaven National Laboratory April 2014
HTCondor at the RACF SEAMLESSLY INTEGRATING MULTICORE JOBS AND OTHER DEVELOPMENTS William Strecker-Kellogg RHIC/ATLAS Computing Facility Brookhaven National Laboratory April 2014 RACF Overview Main HTCondor
More informationCPU efficiency. Since May CMS O+C has launched a dedicated task force to investigate CMS CPU efficiency
CPU efficiency Since May CMS O+C has launched a dedicated task force to investigate CMS CPU efficiency We feel the focus is on CPU efficiency is because that is what WLCG measures; punishes those who reduce
More informationIBM A Assessment: IBM Architectural Design of SOA Solutions. Download Full Version :
IBM A2160-667 Assessment: IBM Architectural Design of SOA Solutions Download Full Version : https://killexams.com/pass4sure/exam-detail/a2160-667 D. The requirement to use the same credentials throughout
More informationHierarchical scheduling strategies for parallel tasks and advance reservations in grids
J Sched (2013) 16:349 368 DOI 10.1007/s10951-011-0254-9 Hierarchical scheduling strategies for parallel tasks and advance reservations in grids Krzysztof Kurowski Ariel Oleksiak Wojciech Piatek Jan Węglarz
More informationCHAPTER 6 RESULTS AND DISCUSSION
76 CHAPTER 6 RESULTS AND DISCUSSION This chapter discusses the details of agent based model for monitoring and describes about the mobile agent s role in the monitoring. This chapter discusses experimental
More informationarxiv: v1 [physics.ins-det] 10 Nov 2016
arxiv:1611.03215v1 [physics.ins-det] 10 Nov 2016 University of Nebraska-Lincoln E-mail: kenbloom@unl.edu The CMS offline software and computing system has successfully met the challenge of LHC Run 2. In
More informationIn Cloud, Can Scientific Communities Benefit from the Economies of Scale?
PRELIMINARY VERSION IS PUBLISHED ON SC-MTAGS 09 WITH THE TITLE OF IN CLOUD, DO MTC OR HTC SERVICE PROVIDERS BENEFIT FROM THE ECONOMIES OF SCALE? 1 In Cloud, Can Scientific Communities Benefit from the
More informationParallels Remote Application Server and Microsoft Azure. Scalability and Cost of Using RAS with Azure
Parallels Remote Application Server and Microsoft Azure and Cost of Using RAS with Azure Contents Introduction to Parallels RAS and Microsoft Azure... 3... 4 Costs... 18 Conclusion... 21 2 C HAPTER 1 Introduction
More informationEUROPEAN MIDDLEWARE INITIATIVE
Date: 31/05/ EUROPEAN MIDDLEWARE INITIATIVE MSA1.1 EMI SUPPORT UNITS INTEGRATED IN GGUS EC MILESTONE: MS17 Document identifier: EMI_MS17_Final_Update_Jan2011.doc Date: 31/05/ Activity: Lead Partner: Document
More informationTechnical Brief. Kony MobileFabric: Over a decade of SAP NetWeaver Certification
Technical Brief Kony MobileFabric: Over a decade of SAP NetWeaver Certification Table of contents Background...3 Why an ABAP add-on?...3 The SAP Certification Process... 4 Add-on Certification... 4 The
More informationThe Evolution of CERN EDMS
Journal of Physics: Conference Series PAPER OPEN ACCESS The Evolution of CERN EDMS To cite this article: Aleksandra Wardzinska et al 2015 J. Phys.: Conf. Ser. 664 032032 View the article online for updates
More informationAccenture* Integrates a Platform Telemetry Solution for OpenStack*
white paper Communications Service Providers Service Assurance Accenture* Integrates a Platform Telemetry Solution for OpenStack* Using open source software and Intel Xeon processor-based servers, Accenture
More informationThe open science grid
Journal of Physics: Conference Series The open science grid To cite this article: Ruth Pordes et al 2007 J. Phys.: Conf. Ser. 78 012057 View the article online for updates and enhancements. Related content
More informationFebruary 14, 2006 GSA-WG at GGF16 Athens, Greece. Ignacio Martín Llorente GridWay Project
February 14, 2006 GSA-WG at GGF16 Athens, Greece GridWay Scheduling Architecture GridWay Project www.gridway.org Distributed Systems Architecture Group Departamento de Arquitectura de Computadores y Automática
More informationOSG Extensions The Plan and the Priorities
OSG Extensions The Plan and the Priorities Torre Wenaus (BNL), Frank Würthwein (UCSD) OSG Consortium Meeting U Washington, Seattle August 22, 2006 Middleware Extensions In OSG Part of the OSG Roadmap:
More informationSoftware Management for the NOνAExperiment
Journal of Physics: Conference Series PAPER OPEN ACCESS Software Management for the NOνAExperiment To cite this article: G.S Davies et al 2015 J. Phys.: Conf. Ser. 664 062011 View the article online for
More informationFractal Exercise. Fractals, task farm and load imbalance
Fractal Exercise Fractals, task farm and load imbalance 2 Contents 1 Introduction and Aims... 3 1.1 Mandelbrot Set... 3 2 Looking at the concepts... 4 2.1 What is a task farm?... 4 2.1.1 Using a task farm...
More informationData Analytics and CERN IT Hadoop Service. CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB
Data Analytics and CERN IT Hadoop Service CERN openlab Technical Workshop CERN, December 2016 Luca Canali, IT-DB 1 Data Analytics at Scale The Challenge When you cannot fit your workload in a desktop Data
More informationIBM Tealeaf Customer Experience on Cloud
Service Description IBM Tealeaf Customer Experience on Cloud This Service Description describes the Cloud Service IBM provides to Client. Client means the contracting party and its authorized users and
More informationUAB Condor Pilot UAB IT Research Comptuing June 2012
UAB Condor Pilot UAB IT Research Comptuing June 2012 The UAB Condor Pilot explored the utility of the cloud computing paradigm to research applications using aggregated, unused compute cycles harvested
More informationDIRAC Services for Grid and Cloud Infrastructures. A.Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille, 29 January 2018, CC/IN2P3, Lyon
DIRAC Services for Grid and Cloud Infrastructures A.Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille, 29 January 2018, CC/IN2P3, Lyon Plan DIRAC in a nutshell DIRAC communities Services for multi-community installations
More informationMQ on Cloud (AWS) Suganya Rane Digital Automation, Integration & Cloud Solutions. MQ Technical Conference v
MQ on Cloud (AWS) Suganya Rane Digital Automation, Integration & Cloud Solutions Agenda CLOUD Providers Types of CLOUD Environments Cloud Deployments MQ on CLOUD MQ on AWS MQ Monitoring on Cloud What is
More informationA (very) brief introduction to. R. Graciani Universidad de Barcelona
A (very) brief introduction to R. Graciani (graciani@ecm.ub.es) Universidad de Barcelona What is DIRAC? Distributed Infrastructure with Remote Agent Control A software Framework to Manage Distributed computing
More informationComputing efforts supporting Physics Analyses
Computing efforts supporting Physics Analyses reminder on the main aim of the Analysis Support Task Force survey of existing tools available to monitor/diagnose/communicate survey of how we use our GRID-like
More informationBUSINESS PROCESS MODELING WITH SIMPROCESS. Maya Binun. CACI Products Company 3333 North Torrey Pines Court La Jolla, CA 92037, U.S.A.
Proceedings of the 1996 Winter Simulation Conference ed. J. M. Cbarnes, D. J. Morrice, D. T. Brunner, and J. J. Swain BUSINESS PROCESS MODELING WITH SIMPROCESS Maya Binun CACI Products Company 3333 North
More informationIBM Grid Offering for Analytics Acceleration: Customer Insight in Banking
Grid Computing IBM Grid Offering for Analytics Acceleration: Customer Insight in Banking customers. Often, banks may purchase lists and acquire external data to improve their models. This data, although
More informationBusiness Process Management 2010
Business Process Management 2010 Ing. Federico Senese WebSphere Technical Specialist IBM Southwest Europe federico.senese@it.ibm.com About me: Federico Senese Joined IBM in 2000 after earning an University
More informationBPIC 13: Mining an Incident Management Process
BPIC 13: Mining an Incident Management Process Emmy Dudok, Peter van den Brand Perceptive Software, 22701 West 68th Terrace, Shawnee, KS 66226, United States {Emmy.Dudok, Peter.Brand}@perceptivesoftware.com
More information