A worldwide e-Infrastructure for NMR and structural biology

# Automated grid submission and polling daemons

This page describes the setup of the grid submission and polling daemons implemented at the BCBR site behind various web portals such as HADDOCK, CS-Rosetta and Gromacs.

Successfully running web portals requires proper machinery to handle requests. This machinery involves various steps that can be grouped into three layers of operation:

• The server level handles service requests, either from direct human interaction or from another machine, and includes input type checking.
• The second level covers the preparation, trafficking and monitoring of jobs between the server and the Grid.
• The third, core level comprises the process(es) to be run on a worker node.

The tasks associated with these levels are conceptually unrelated, which allows for a component-based development approach in which distinct tasks are programmed in their most generic form. This has the advantage that such building blocks can be easily maintained, adapted and reused.

This article describes the generic second level, namely the preparation, trafficking and monitoring of jobs between the server and the Grid. At this level, a daemon job runs periodically, scanning the job pool for jobs that are ready to be run on the Grid and submitting these when found. In principle, this daemon does not require information regarding the nature of the job, although in practice different instances are run, each linked to one type of job, to better control the workload associated with the different tasks.

Figure: Grid job submission management using job pooling. The figure shows a general scheme for managing job trafficking to and from the Grid, using server-side job pooling. This scheme is characterized by a separation of three layers of operation, between which there is no direct communication. Green boxes indicate user interaction, whereas yellow boxes indicate jobs that run periodically as daemon jobs and that use an eToken-based robot certificate for generating a Grid proxy. The blue ellipses represent ‘pools’, which are used for storage of job or result packages. User service requests are processed on the server, up to the point of generating a job package that is stored on disk. On the Grid UI (User Interface) a daemon job (grid-submission) runs on a scheduled basis, scanning the ‘job pool’ for job packages and submitting these to the Grid when found. Another daemon job (grid-polling) periodically checks running jobs for their status, retrieves the results when ready and places these in a result pool. Finally, results are presented back to the user, possibly after post-processing (results-processing). Currently the HADDOCK, CS-ROSETTA, UNIO, CYANA, MARS and MDD-NMR portals, which all send jobs to the Grid, are implemented following this model.

A separate daemon job also runs periodically, checking the status of the jobs running on the Grid and retrieving the results when finished. Alternatively, this process can resubmit a job when it has failed. The results are put back, after validation, in a place where they can be accessed through a web page. Like the submission process, the polling and retrieval process is in principle application-independent, since all information regarding the job, such as the directory in which to place the results, is contained in the job package.
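The status check at the heart of the polling daemon amounts to parsing the `Current Status:` line that `glite-wms-job-status` prints. The pattern can be sketched as follows; the here-document stands in for a real status file, and the sample output text is illustrative:

```shell
#!/bin/sh
# Sketch: extract the job state from glite-wms-job-status-style output.
# The awk test is the same pattern the polling daemon uses to detect
# Done / Aborted / Waiting / Cleared states.
cat > status.txt <<'EOF'
Status info for the Job : https://wms-enmr.science.uu.nl:9000/abc123
Current Status:     Done (Success)
Exit code:          0
EOF

STATE=$(awk '$1 == "Current" && $2 == "Status:" {print $3}' status.txt)
echo "$STATE"   # prints: Done
```

The daemon runs one such check per state of interest and branches on the result.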

Submission, polling and retrieval of output are handled using a standard toolbox for Grid operation, which, in the case of WeNMR, is the gLite  3.2 suite. Accordingly, the jobs that operate at this level require the use of a valid proxy. To facilitate proxy management, all of the processes at the second level of operation are running using an eToken-based robot certificate, in accordance with the security requirements for data portals formulated by the Joint Security Policy Group (https://www.jspg.org/wiki/VO_Portal_Policy).

As an example, the submission and polling daemons used by the HADDOCK portal are provided below. These assume that the following files are placed in the pool directory for submission:

• gridjob-XXXX.jdl : the jdl script for submission to the Grid
• gridjob-XXXX.sh: the shell script to be executed on the Grid (application specific)
• gridjob-XXXX.tar.gz: the gzipped tar archive containing the data for the job
• gridjob-XXXX.dir: a file containing the local directory and filename to which the results of the job will be moved once it has successfully completed
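The portal-side code that generates such a job package only needs to drop these four files into the pool directory. A rough sketch is given below; the pool path, job id, JDL contents and payload are all illustrative, not the HADDOCK portal's actual code:

```shell
#!/bin/sh
# Sketch: assemble a job package for the submission daemon to pick up.
POOL=./pool          # illustrative pool directory
ID=0001              # illustrative job id
mkdir -p "$POOL" results

# Application-specific payload and the script run on the worker node
echo "input data" > input.dat
printf '#!/bin/sh\ntar xzf gridjob-%s.tar.gz\n' "$ID" > "$POOL/gridjob-$ID.sh"

# Gzipped tar archive containing the data for the job
tar czf "$POOL/gridjob-$ID.tar.gz" input.dat

# Minimal JDL describing the job (illustrative sandbox contents)
cat > "$POOL/gridjob-$ID.jdl" <<EOF
Executable    = "gridjob-$ID.sh";
InputSandbox  = {"gridjob-$ID.sh", "gridjob-$ID.tar.gz"};
OutputSandbox = {"gridjob-$ID-result.tar.gz", "gridjob-$ID.out", "gridjob-$ID.err"};
EOF

# Where the polling daemon should move the result once retrieved
echo "results/gridjob-$ID-result.tar.gz" > "$POOL/gridjob-$ID.dir"
```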

Once a job has been processed and submitted to the Grid, the following additional files will appear in the pool directory:

• gridjob-XXXX.process: this file tells the submission daemon that this job has been processed
• gridjob-XXXX.status: this file is generated by the polling daemon and indicates the status of the job

The two daemons run every 5 to 10 minutes as cron jobs. Below you will find the example files for the HADDOCK server. These contain some application-specific settings, such as the time after which a job is automatically resubmitted if it is not yet running; this time should be adapted for each application.
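The scheduling itself is plain cron. Entries along the following lines would do; the script paths, log files and exact intervals are illustrative and depend on the installation:

```shell
# Illustrative crontab entries: run the submission daemon every 5 minutes
# and the polling daemon every 10 minutes (paths are hypothetical).
*/5 * * * *  /home/haddock/grid/gridsubmit-daemon.csh >> /home/haddock/grid/gridsubmit.log 2>&1
*/10 * * * * /home/haddock/grid/gridpoll.csh >> /home/haddock/grid/gridpoll.log 2>&1
```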

The scripts are also provided as attachments for download (change the extension from txt to csh).

### Grid submission daemon:

```csh
#!/bin/csh
#
set running=`ps -ef | grep gridsubmit-daemon | grep -v grep | wc -l`
#
# Limit the number of submission daemons running
#
if ($running <= 3) then
  set ncount = `ls -1 ./ | grep gridjob | grep -v csrgrid | grep jdl | wc -l` >& /dev/null
  setenv HOMED `pwd`
  #
  # IS THERE WORK TO BE DONE?
  #
  if ($ncount > 0) then
    #
    # YES - PROCESS THE JDL FILES IF NOT LOCKED ALREADY (.process)
    #
    foreach i (gridjob*.jdl)
      set jobroot = $i:r
      if (-e $jobroot.process) continue
      if (! -e $jobroot.tar.gz || ! -e $jobroot.sh || ! -e $jobroot.dir) then
        \rm $jobroot.*
        continue
      endif
      touch $jobroot.process
      #
      # SUBMIT THE JOB (multiple WMS defined to avoid problems)
      #
      foreach WMS ( https://wms-enmr.science.uu.nl:7443/glite_wms_wmproxy_server \
                    https://graspol.nikhef.nl:7443/glite_wms_wmproxy_server \
                    https://prod-wms-01.pd.infn.it:7443/glite_wms_wmproxy_server \
                    https://wms-enmr.cerm.unifi.it:7443/glite_wms_wmproxy_server \
                    https://graszode.nikhef.nl:7443/glite_wms_wmproxy_server )
        setenv GLITE_WMS_WMPROXY_ENDPOINT $WMS
        date
        glite-wms-job-submit -o $jobroot.jobid -a $jobroot.jdl
        if (! -e $jobroot.jobid) then
          set submitted = 0
        else
          set submitted = 1
          break
        endif
      end
      #
      # SUBMISSION FAILED ON ALL WMS - RELEASE THE LOCK FILE
      #
      if ($submitted == 0) then
        \rm $jobroot.process >& /dev/null
      endif
    end
  endif
  #
  # REMOVE LOCK FILES FOR JOBS THAT WERE NEVER SUBMITTED
  #
  set ncount = `ls -1 ./ | grep process | wc -l`
  if ($ncount > 0) then
    foreach i (*.process)
      if (! -e $i:r.jobid) \rm $i
    end
  endif
else
  echo `date` "gridsubmit-daemon already running... exiting"
endif
```



### Grid polling daemon:

```csh
#!/bin/csh
#
set running=`ps -ef | grep gridpoll | grep -v grep | grep -v rosetta | wc -l`
if ($running <= 3) then
  set rundir = `echo $0 | awk -v d=$0:h"/" '{i = index($0,"/") } {if (i == 0) $0 = "./"; else $0 = d } {print $0}'`
  set ncount = `ls -1 $rundir | grep jobid | grep -v csr | wc -l`
  if ($ncount > 0) then
    #
    # CHECK THE STATUS OF THE RUNNING JOBS
    #
    foreach i (`ls -1 -tr {$rundir}gridjob*.jobid`)
      set jobroot = $i:r
      set nline=`wc -l $i | awk '{print $1}'`
      if ($nline > 2) then
        head -2 $i > tmpp
        \mv tmpp $i
      endif
      set ok = 0
      set aborted = 0
      set waiting = 0
      set cleared = 0
      glite-wms-job-status -i $i > $i:r.status
      set ok      = `cat $i:r.status | awk 'BEGIN{ok=0}      $1 == "Current" && $2 == "Status:" && $3 == "Done"    {ok = 1}      END{print ok}'`
      set aborted = `cat $i:r.status | awk 'BEGIN{aborted=0} $1 == "Current" && $2 == "Status:" && $3 == "Aborted" {aborted = 1} END{print aborted}'`
      set waiting = `cat $i:r.status | awk 'BEGIN{waiting=0} $1 == "Current" && $2 == "Status:" && $3 == "Waiting" {waiting = 1} END{print waiting}'`
      set cleared = `cat $i:r.status | awk 'BEGIN{cleared=0} $1 == "Current" && $2 == "Status:" && $3 == "Cleared" {cleared = 1} END{print cleared}'`
      #
      # JOB FINISHED SUCCESSFULLY
      #
      if ($ok == 1) then
        glite-wms-job-output --nosubdir -i $i --dir `pwd`/$jobroot >& /dev/null
        #
        # CHECK IF RESULT FILE IS PRESENT
        #
        if (-e $jobroot/${jobroot}-result.tar.gz) then
          set errsize = `grep -v grid-env $jobroot/${jobroot}.err | grep -v AAvocachedir | wc -l | awk '{print $1}'`
          set tarsize = `du -ks $jobroot/${jobroot}-result.tar.gz | awk '{print $1}'`
          #
          # CHECK IF THE RESULT FILE IS NOT EMPTY AND ERROR FILE IS NOT TOO LARGE
          #
          if ($errsize < 500 && $tarsize > 0) then
            echo '==========================='
            echo `date` "GRID job finished successfully"
            echo 'JOBID '$jobroot
            grep 'Destination' $jobroot.status
            grep 'https' $jobroot.jobid
            echo '==========================='
            \rm $i
            \rm $jobroot.jdl
            \rm $jobroot.sh
            set outname=`cat $jobroot.dir`
            \mv $jobroot/${jobroot}-result.tar.gz `cat $jobroot.dir`
            chmod g+rw `cat $jobroot.dir`
            \rm $jobroot.dir
            \rm -rf $jobroot
            \rm $jobroot.tar.gz
            \rm $i:r.process $jobroot.process $jobroot.status >& /dev/null
          #
          # PROBLEM DETECTED - MAKE JOB READY FOR RESUBMISSION
          #
          else
            echo '==========================='
            echo `date` "Error detected... resubmitting"
            if ($tarsize == 0) then
              echo "Empty result file"
            endif
            grep HOSTNAME $jobroot/${jobroot}.out
            echo 'JOBID '$jobroot
            grep 'Destination' $jobroot.status
            grep 'https' $jobroot.jobid
            echo '==========================='
            echo `date` "Error detected... resubmitting" >> /home/haddock/grid/gridpoll.err
            grep HOSTNAME $jobroot/${jobroot}.out >> /home/haddock/grid/gridpoll.err
            echo 'JOBID ' $jobroot >> /home/haddock/grid/gridpoll.err
            grep 'Destination' $jobroot.status >> /home/haddock/grid/gridpoll.err
            grep 'https' $jobroot.jobid >> /home/haddock/grid/gridpoll.err
            cat $jobroot/${jobroot}.err >> /home/haddock/grid/gridpoll.err
            echo '===========================' >> /home/haddock/grid/gridpoll.err
            \rm -rf $jobroot
            \rm $jobroot.jobid
            \rm $i:r.process $jobroot.process $jobroot.status >& /dev/null
          endif
        #
        # PROBLEM DETECTED - MISSING RESULT FILE - MAKE JOB READY FOR RESUBMISSION
        #
        else
          echo '==========================='
          echo `date` "Missing result file... resubmitting"
          echo 'JOBID ' $jobroot
          grep 'Destination' $jobroot.status
          grep 'https' $jobroot.jobid
          echo '==========================='
          echo '===========================' >> /home/haddock/grid/gridpoll.err
          echo `date` "Missing result file... resubmitting" >> /home/haddock/grid/gridpoll.err
          echo 'JOBID '$jobroot >> /home/haddock/grid/gridpoll.err
          grep 'Destination' $jobroot.status >> /home/haddock/grid/gridpoll.err
          grep 'https' $jobroot.jobid >> /home/haddock/grid/gridpoll.err
          \rm -rf $jobroot
          \rm $jobroot.jobid
          \rm $i:r.process $jobroot.process $jobroot.status >& /dev/null
        endif
      endif
      #
      # JOB WAS ABORTED - MAKE JOB READY FOR RESUBMISSION
      #
      if ($aborted == 1) then
        echo '==========================='
        echo `date` "Aborted job... resubmitting"
        echo 'JOBID '$jobroot
        grep 'Destination' $jobroot.status
        grep 'https' $jobroot.jobid
        echo '==========================='
        \rm $jobroot.jobid
        \rm $i:r.process $jobroot.process $jobroot.status >& /dev/null
      endif
      #
      # JOB WAS CLEARED - MAKE JOB READY FOR RESUBMISSION
      #
      if ($cleared == 1) then
        echo '==========================='
        echo `date` "Cleared job... resubmitting"
        echo 'JOBID '$jobroot
        grep 'Destination' $jobroot.status
        grep 'https' $jobroot.jobid
        echo '==========================='
        \rm $jobroot.jobid
        \rm $i:r.process $jobroot.process $jobroot.status >& /dev/null
      endif
    end
    #
    # CHECK NOW FOR JOBS THAT HAVE BEEN SUBMITTED A LONG TIME AGO - FOR HADDOCK > 10 HOURS
    #
    foreach j (`/usr/bin/find /home/haddock/grid/ -mmin +600 -name gridjob\*.process -prune`)
      glite-wms-job-status -i $j > $j:r.status
      set ok = `cat $j:r.status | awk 'BEGIN{ok=0} $1 == "Current" && $2 == "Status:" && ($3 == "Done" || $3 == "Running") {ok = 1} END{print ok}'`
      #
      # IF NOT FINISHED (OK) THEN CANCEL AND MAKE READY FOR RESUBMISSION
      #
      if ($ok != 1) then
        glite-wms-job-cancel -i $j:r.jobid <<_Eod_
y
_Eod_
        echo '==========================='
        echo 'JOBID '$j:r
        echo `date` "Waited for more than 10 hours... resubmitting"
        grep 'Destination' $j:r.status
        grep 'https' $j:r.jobid
        echo '==========================='
        \rm -rf $j:r.jobid
        \rm -rf $j $j:r.process $j:r.status >& /dev/null
      endif
    end
  else
    goto exit
  endif
endif
exit:
```

Attachments:

• gridsubmit-daemon.txt (1.61 KB)
• gridpoll.txt (6.53 KB)

## Cite WeNMR/WestLife

Usage of the WeNMR/WestLife portals should be acknowledged in any publication:

"The FP7 WeNMR (project# 261572) and H2020 West-Life (project# 675858) European e-Infrastructure projects are acknowledged for the use of their web portals, which make use of the EGI infrastructure and DIRAC4EGI service with the dedicated support of CESNET-MetaCloud, INFN-PADOVA, NCG-INGRID-PT, RAL-LCG2, TW-NCHC, SURFsara and NIKHEF, and the additional support of the national GRID Initiatives of Belgium, France, Italy, Germany, the Netherlands, Poland, Portugal, Spain, UK, South Africa, Malaysia, Taiwan and the US Open Science Grid."

And the following article describing the WeNMR portals should be cited:
Wassenaar et al. (2012). WeNMR: Structural Biology on the Grid. J. Grid Comp., 10:743-767.

## EGI-approved

The WeNMR Virtual Research Community was the first to be officially recognized by the EGI.

## European Union

WeNMR is an e-Infrastructure project funded under the 7th Framework Programme of the EU (contract no. 261572).

WestLife, the follow-up project of WeNMR, is a Virtual Research Environment e-Infrastructure project funded under Horizon 2020 (contract no. 675858).