A worldwide e-Infrastructure for NMR and structural biology

Automated grid submission and polling daemons

This page describes the setup of the grid submission and polling daemons implemented at the BCBR site behind various web portals such as HADDOCK, CS-Rosetta and Gromacs.

Successfully running web portals requires proper machinery to handle requests. This machinery involves various steps that can be categorized into three layers of operation. The first, server level involves the handling of service requests, either through direct human interaction or through requests from another machine; this stage includes input type checking. The second level involves the preparation, trafficking and monitoring of jobs between the server and the Grid. The third, core layer involves the process(es) to be run on a worker node. The tasks associated with these different levels are conceptually unrelated, which allows for a component-based development approach in which distinct tasks are programmed in their most generic form. This has the advantage that such building blocks can be easily maintained, adapted and reused.

This article describes the generic second level, namely the preparation, trafficking and monitoring of jobs between the server and the Grid. At this level, a daemon job runs periodically, scanning the job pool for jobs that are ready to be run on the Grid and submitting these when found. In principle, this daemon job does not require information regarding the nature of the job, although in practice different instances are run, each linked to one type of job, to better control the workload associated with the different tasks.
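The scan-and-lock idea described above can be sketched in a few lines. The sketch below is POSIX sh (the production daemons shown later on this page are csh), and `submit_to_grid` is a hypothetical stand-in for the actual glite-wms-job-submit call:

```shell
# Hypothetical stand-in for the real glite-wms-job-submit call.
submit_to_grid() {
  echo "would submit $1"
}

# Scan a pool directory for job packages that have not been picked up
# yet; a .process file acts as the lock marking a job as handled.
scan_pool() {
  pool=$1
  for jdl in "$pool"/gridjob-*.jdl; do
    [ -e "$jdl" ] || continue            # glob matched nothing
    root=${jdl%.jdl}
    [ -e "$root.process" ] && continue   # already picked up earlier
    touch "$root.process"                # lock before submitting
    submit_to_grid "$jdl"
  done
}
```

Because the lock file is created before submission, a later run of the daemon (or a second instance) will skip the job even if the submission itself is still in progress.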

Figure: Grid job submission management using job pooling. The figure shows a general scheme for managing job trafficking to and from the Grid, using server-side job pooling. This scheme is characterized by a separation of three layers of operation, between which there is no direct communication. Green boxes indicate user interaction, whereas yellow boxes indicate jobs that run periodically as daemons and that use an eToken-based robot certificate for generating a Grid proxy. The blue ellipses represent ‘pools’, which are used for storage of job or result packages. User service requests are processed on the server, up to the point of generating a job package that is stored on disk. On the Grid UI (User Interface), a daemon job (grid-submission) runs on a scheduled basis, scanning the ‘job pool’ for job packages and submitting these to the Grid when found. Another daemon job (grid-polling) periodically checks running jobs for their status, retrieves the results when ready and places these in a result pool. Finally, results are presented back to the user, possibly after post-processing (results-processing). Currently the HADDOCK, CS-ROSETTA, UNIO, CYANA, MARS and MDD-NMR portals, which all send jobs to the Grid, are implemented following this model.

 

 

A separate daemon job also runs periodically, checking the status of the jobs running on the Grid and retrieving the results when finished. Alternatively, this process can resubmit a job when it has failed. The results are put back, after validation, in a place where they can be accessed through a web page. Like the submission process, the polling and retrieval process is in principle application independent, since all information regarding the job, such as the directory in which to place the results, is contained in the job package.
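The decision the polling daemon takes for each job can be summarised as a mapping from the gLite "Current Status" value to an action. The following POSIX-sh sketch (the real daemon below is csh, and `classify_job` is an illustrative helper, not part of the WeNMR code) covers the statuses the HADDOCK daemon acts on:

```shell
# Map a gLite job status to the action the polling daemon would take:
# finished jobs are retrieved, aborted or cleared jobs are made ready
# for resubmission, anything else is left alone until the next cycle.
classify_job() {
  case $1 in
    Done)            echo retrieve ;;
    Aborted|Cleared) echo resubmit ;;
    *)               echo wait ;;
  esac
}
```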

Submission, polling and retrieval of output are handled using a standard toolbox for Grid operation, which, in the case of WeNMR, is the gLite 3.2 suite. Accordingly, the jobs that operate at this level require a valid proxy. To facilitate proxy management, all processes at the second level of operation run using an eToken-based robot certificate, in accordance with the security requirements for data portals formulated by the Joint Security Policy Group (https://www.jspg.org/wiki/VO_Portal_Policy).
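Before touching the Grid, such a daemon run needs a sufficiently fresh proxy. A minimal guard could look as follows; `proxy_is_fresh` is a hypothetical helper whose argument would in practice come from `voms-proxy-info --timeleft` run against the robot-certificate proxy, and the 3600-second threshold is an example value only:

```shell
# Return success only when the proxy has more than an hour left.
# $1: seconds of validity remaining, e.g. from `voms-proxy-info --timeleft`;
# an empty or missing value is treated as an expired proxy.
proxy_is_fresh() {
  [ "${1:-0}" -gt 3600 ]
}

# Typical use at the top of a daemon run:
#   proxy_is_fresh "$(voms-proxy-info --timeleft)" || exit 0
```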

 

As an example, the submission and polling daemons used by the HADDOCK portal are provided below. These assume that the following files are placed in the pool directory for submission:

  • gridjob-XXXX.jdl: the JDL script for submission to the Grid
  • gridjob-XXXX.sh: the shell script to be executed on the Grid (application specific)
  • gridjob-XXXX.tar.gz: the gzipped tar archive containing the data for the job
  • gridjob-XXXX.dir: a file containing the local directory and filename to which the results of the job will be moved once successfully completed

Once a job has been processed and submitted to the Grid, the following additional files will appear in the pool directory:

  • gridjob-XXXX.process: this file tells the submission daemon that this job has already been processed
  • gridjob-XXXX.status: this file is generated by the polling daemon and indicates the status of the job
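As an illustration of the file convention above, a portal-side staging step might look like the following POSIX-sh sketch. `stage_job` and its placeholder JDL and worker script are hypothetical, not part of the WeNMR code; the real portal generates application-specific versions of both:

```shell
# Stage a job package into the pool directory following the
# gridjob-XXXX.{jdl,sh,tar.gz,dir} naming convention.
stage_job() {
  pool=$1; id=$2; datadir=$3; dest=$4
  # archive the input data for the job
  tar czf "$pool/gridjob-$id.tar.gz" -C "$datadir" .
  # record where the result tarball must be moved on success
  printf '%s\n' "$dest" > "$pool/gridjob-$id.dir"
  # minimal placeholders; the real worker script and JDL are
  # application specific
  printf '#!/bin/sh\n' > "$pool/gridjob-$id.sh"
  printf '[ Executable = "gridjob-%s.sh"; ]\n' "$id" > "$pool/gridjob-$id.jdl"
}
```

Once all four files are present, the submission daemon will pick the job up on its next pass.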

The two daemons run every 5 to 10 minutes as cron jobs. Below you will find the example files for the HADDOCK server. These contain some application-specific settings, such as the time after which a job is automatically resubmitted if it is not yet running; this time should be adapted for each application.
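The cron scheduling could, for instance, look like the following crontab fragment (the 10-minute interval and log file names are examples only; the pool directory path is the one used in the HADDOCK scripts below):

```
*/10 * * * * cd /home/haddock/grid && ./gridsubmit-daemon >> gridsubmit.log 2>&1
*/10 * * * * cd /home/haddock/grid && ./gridpoll >> gridpoll.log 2>&1
```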

The scripts are also provided as attachments for download (change the extension from txt to csh).

 


Grid submission daemon:

#!/bin/csh
#
set running=`ps -ef | grep gridsubmit-daemon |grep -v grep | wc -l`
#
# Limit the number of submission deamons running
#
if ($running  <= 3) then
  set ncount = `ls -1 ./ |grep gridjob |grep -v csrgrid |grep jdl | wc -l` >&/dev/null
  setenv HOMED `pwd`
#
# IS THERE WORK TO BE DONE?
#
  if ($ncount > 0) then
#
#   YES - PROCESS THEN THE JDL FILES IF NOT LOCKED ALREADY (.process)
#
    foreach i(gridjob*.jdl)
      set jobroot = $i:r
      if (-e $jobroot.process) continue
      if (! -e $jobroot.tar.gz || ! -e $jobroot.sh || ! -e $jobroot.dir) then
        \rm $jobroot.*
        continue
      endif
      touch $jobroot.process
#
#     SUBMIT THE JOB (multiple WMS defined to avoid problems)
#
      foreach WMS ( https://wms-enmr.science.uu.nl:7443/glite_wms_wmproxy_server \
                    https://graspol.nikhef.nl:7443/glite_wms_wmproxy_server \
                    https://prod-wms-01.pd.infn.it:7443/glite_wms_wmproxy_server \
                    https://wms-enmr.cerm.unifi.it:7443/glite_wms_wmproxy_server \
                    https://graszode.nikhef.nl:7443/glite_wms_wmproxy_server)
          setenv  GLITE_WMS_WMPROXY_ENDPOINT $WMS
          date
          glite-wms-job-submit -o $jobroot.jobid -a $jobroot.jdl
          if (! -e $jobroot.jobid) then
            set submitted=0
          else
            set submitted=1
            break
          endif
      end
      if ($submitted == 0) then
        \rm $jobroot.process >&/dev/null
      endif
    end
  endif
  set ncount = `ls -1 ./ |grep process | wc -l`
  if ($ncount > 0) then
    foreach i (*.process)
      if (! -e $i:r.jobid) \rm $i
    end
  endif
else
  echo `date` "gridsubmit-daemon already running... exiting"
endif


Grid polling daemon:

#!/bin/csh
#
set running=`ps -ef | grep gridpoll |grep -v grep |grep -v rosetta | wc -l`
if ($running <= 3) then
  set rundir = `echo $0 | awk -v d=$0:h"/" '{i = index($0,"/") } {if (i == 0) $0 = "./"; else $0 = d } {print $0}'`
  set ncount = `ls -1 $rundir |grep jobid |grep -v csr | wc -l`
  if ($ncount > 0) then
#
# CHECK THE STATUS OF THE RUNNING JOBS
#
    foreach i(`ls -1 -tr ${rundir}gridjob*.jobid`)
      set jobroot = $i:r
      set nline=`wc -l $i |awk '{print $1}'`
      if ($nline > 2) then
        head -2 $i >tmpp
        \mv tmpp $i
      endif
      set ok = 0
      set aborted = 0
      set waiting = 0
      set cleared = 0
      glite-wms-job-status -i $i >$i:r.status
      set ok = `cat $i:r.status | awk 'BEGIN{ok=0}$1 == "Current" && $2 == "Status:" && $3 == "Done" {ok = 1} END{print ok}'`
      set aborted = `cat $i:r.status | awk 'BEGIN{aborted=0}$1 == "Current" && $2 == "Status:" && $3 == "Aborted" {aborted = 1} END{print aborted}'`
      set waiting = `cat $i:r.status | awk 'BEGIN{waiting=0}$1 == "Current" && $2 == "Status:" && $3 == "Waiting" {waiting = 1} END{print waiting}'`
      set cleared = `cat $i:r.status | awk 'BEGIN{cleared=0}$1 == "Current" && $2 == "Status:" && $3 == "Cleared" {cleared = 1} END{print cleared}'`
#
# JOB FINISHED SUCCESSFULLY
#
      if ($ok == 1) then
        glite-wms-job-output --nosubdir -i $i --dir `pwd`/$jobroot    >&/dev/null
#
#       CHECK IF RESULT FILE IS PRESENT
#
        if (-e $jobroot/${jobroot}-result.tar.gz) then
          set errsize = `grep -v grid-env $jobroot/${jobroot}.err |grep -v AAvocachedir | wc -l |awk '{print $1}'`
          set tarsize = `du -ks $jobroot/${jobroot}-result.tar.gz | awk '{print $1}'`
#
#         CHECK IF THE RESULT FILE IS NOT EMPTY AND ERROR FILE IS NOT TOO LARGE
#
          if ($errsize < 500 && $tarsize > 0) then
            echo '==========================='  
            echo `date` "GRID job finished successfully"
            echo 'JOBID ' $jobroot 
            grep 'Destination' $jobroot.status
            grep 'https' $jobroot.jobid
            echo '==========================='  
            \rm $i 
            \rm $jobroot.jdl
            \rm $jobroot.sh
            set outname = `cat $jobroot.dir`
            \mv $jobroot/${jobroot}-result.tar.gz $outname
            chmod g+rw $outname
            \rm $jobroot.dir
            \rm -rf $jobroot    
            \rm $jobroot.tar.gz
            \rm $jobroot.process $jobroot.status >&/dev/null
#
#         PROBLEM DETECTED - MAKE JOB READY FOR RESUBMISSION
#
          else
            echo '==========================='  
            echo `date` "Error detected... resubmitting"
            if ($tarsize == 0) then
              echo "Empty result file"
            endif
            grep HOSTNAME $jobroot/${jobroot}.out 
            echo 'JOBID ' $jobroot  
            grep 'Destination' $jobroot.status
            grep 'https' $jobroot.jobid
            echo '==========================='  
            echo '==========================='  >> /home/haddock/grid/gridpoll.err
            echo `date` "Error detected... resubmitting" >> /home/haddock/grid/gridpoll.err
            grep HOSTNAME $jobroot/${jobroot}.out >>/home/haddock/grid/gridpoll.err
            echo 'JOBID ' $jobroot >> /home/haddock/grid/gridpoll.err
            grep 'Destination' $jobroot.status >> /home/haddock/grid/gridpoll.err
            grep 'https' $jobroot.jobid  >> /home/haddock/grid/gridpoll.err
            cat  $jobroot/${jobroot}.err >> /home/haddock/grid/gridpoll.err
            echo '==========================='  >> /home/haddock/grid/gridpoll.err
            \rm -rf $jobroot
            \rm $jobroot.jobid
            \rm $jobroot.process $jobroot.status >&/dev/null
          endif
#
#         PROBLEM DETECTED - MISSING RESULT FILE - MAKE JOB READY FOR RESUBMISSION
#
        else
          echo '==========================='  
          echo `date` "Missing result file... resubmitting" 
          echo 'JOBID ' $jobroot  
          grep 'Destination' $jobroot.status
          grep 'https' $jobroot.jobid
          echo '===========================' 
          echo '==========================='  >> /home/haddock/grid/gridpoll.err
          echo `date` "Missing result file... resubmitting" >> /home/haddock/grid/gridpoll.err
          echo 'JOBID ' $jobroot  >> /home/haddock/grid/gridpoll.err
          grep 'Destination' $jobroot.status  >> /home/haddock/grid/gridpoll.err
          grep 'https' $jobroot.jobid >> /home/haddock/grid/gridpoll.err
          echo '==========================='  >> /home/haddock/grid/gridpoll.err
          \rm -rf $jobroot
          \rm $jobroot.jobid
          \rm $jobroot.process $jobroot.status >&/dev/null
        endif
      endif   
#
#     JOB WAS ABORTED - MAKE JOB READY FOR RESUBMISSION
#
      if ($aborted == 1) then
        echo '===========================' 
        echo `date` "Aborted job... resubmitting"
        echo 'JOBID ' $jobroot  
        grep 'Destination' $jobroot.status
        grep 'https' $jobroot.jobid
        echo '===========================' 
        \rm $jobroot.jobid
        \rm $jobroot.process $jobroot.status >&/dev/null
      endif
#
#     JOB WAS CLEARED - MAKE JOB READY FOR RESUBMISSION
#
      if ($cleared == 1) then
        echo '===========================' 
        echo `date` "Cleared job... resubmitting"
        echo 'JOBID ' $jobroot  
        grep 'Destination' $jobroot.status
        grep 'https' $jobroot.jobid
        echo '===========================' 
        \rm $jobroot.jobid
        \rm $jobroot.process $jobroot.status >&/dev/null
      endif
    end
#
# CHECK NOW FOR JOBS THAT HAVE BEEN SUBMITTED A LONG TIME AGO - FOR HADDOCK > 10 HOURS
#
    foreach j (`/usr/bin/find /home/haddock/grid/ -mmin +600 -name gridjob\*.process -prune`)
      glite-wms-job-status -i $j:r.jobid >$j:r.status
      set ok = `cat $j:r.status | awk 'BEGIN{ok=0}$1 == "Current" && $2 == "Status:" && ($3 == "Done" || $3 == "Running") {ok = 1} END{print ok}'`
#
#     IF NOT FINISHED (OK) THEN CANCEL AND MAKE READY FOR RESUBMISSION
#
      if ($ok != 1) then
        glite-wms-job-cancel -i $j:r.jobid <<_Eod_
y
_Eod_
        echo '===========================' 
        echo 'JOBID ' $j:r  
        echo `date` "Waited for more than 10 hours... resubmitting"
        grep 'Destination' $j:r.status
        grep 'https' $j:r.jobid
        echo '===========================' 
        \rm -rf $j:r.jobid
        \rm -rf $j $j:r.process $j:r.status >&/dev/null
      endif
    end
  else
    goto exit
  endif
endif
exit:
Attachments:
  • gridsubmit-daemon.txt (1.61 KB)
  • gridpoll.txt (6.53 KB)

Cite WeNMR/WestLife

 
Usage of the WeNMR/WestLife portals should be acknowledged in any publication:
 
"The FP7 WeNMR (project# 261572) and H2020 West-Life (project# 675858) European e-Infrastructure projects are acknowledged for the use of their web portals, which make use of the EGI infrastructure and DIRAC4EGI service with the dedicated support of CESNET-MetaCloud, INFN-PADOVA, NCG-INGRID-PT, RAL-LCG2, TW-NCHC, SURFsara and NIKHEF, and the additional support of the national GRID Initiatives of Belgium, France, Italy, Germany, the Netherlands, Poland, Portugal, Spain, UK, South Africa, Malaysia, Taiwan and the US Open Science Grid."
 
And the following article describing the WeNMR portals should be cited:
Wassenaar et al. (2012). WeNMR: Structural Biology on the Grid. J. Grid Comput., 10:743-767.
