A worldwide e-Infrastructure for NMR and structural biology


perusal option: save your job output before reaching walltime

Introduction

When running long jobs on the grid, you might hit the walltime, in which case your job output is completely lost. Or your proxy lifetime may expire, so that your job is cancelled and its output is lost as well.

The perusal options in JDL files are a way to save the output of your job at regular intervals, or to save it before it is destroyed when it reaches the walltime. That way you can potentially resubmit from your last "checkpoint".

Perusal mode exists only since CREAM-CE (and not with previous CE versions). So when you use this option, the WMS will automatically choose a CREAM-CE.
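
The "checkpoint" idea only pays off if the job itself records its progress in a file that perusal can save. A minimal sketch of such a job script is given here; the file name checkpoint.log, the variable TOTAL_STEPS and both function names are hypothetical and not part of gLite. It assumes the job can be split into numbered steps and that the saved file is handed back to the resubmitted job.

```shell
#!/bin/sh
# Hypothetical job script: runs TOTAL_STEPS steps and appends one line per
# finished step to a checkpoint file, so a resubmitted job can skip them.
CHECKPOINT=${CHECKPOINT:-checkpoint.log}
TOTAL_STEPS=${TOTAL_STEPS:-5}

# last_done_step FILE: prints the number of the last recorded step, or 0
# when the file is missing or empty.
last_done_step() {
    if [ -s "$1" ]; then
        tail -n 1 "$1" | awk '{print $2}'
    else
        echo 0
    fi
}

# run_from_checkpoint: resumes right after the last recorded step.
run_from_checkpoint() {
    step=$(last_done_step "$CHECKPOINT")
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        echo "done $step" >> "$CHECKPOINT"
    done
}

run_from_checkpoint
```

Because the script only ever appends lines, the file remains easy to reconstruct from successive perusal retrievals.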

Activating the perusal mode

It is done in 2 steps

1. Modify your jdl

You need to add the PerusalFileEnable and PerusalTimeInterval attributes:

Executable    = "hello-perusal.sh";
StdOutput     = "hello.out";
StdError      = "hello.err";
InputSandbox  = {"/home/christophe/BASIC_TEST-GRID/hello-perusal.sh"};
OutputSandbox = {"hello.err","hello.out"};
PerusalFileEnable = true;
PerusalTimeInterval = 1000;

The PerusalTimeInterval is in seconds. You can set a value lower than 1000, but the WMS most likely enforces a minimum of 1000. So if you choose 60, your file will probably be "backed up" every 1000 seconds anyway.

Try to use a larger value, depending on the size of the data you want to save. For production jobs, I would recommend saving at most once per hour, to avoid too much traffic between the nodes and the WMS machine.
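
The interplay between the requested interval and the probable WMS minimum can be sketched as a small shell helper. The 1000-second floor is only the guess stated above, not a documented constant, and the name effective_interval is hypothetical.

```shell
#!/bin/sh
# Assumed WMS floor, taken from the guess in the text (not a documented value).
WMS_MIN_INTERVAL=1000

# effective_interval REQUESTED
# Prints the interval (in seconds) at which files would likely be saved:
# the larger of the requested value and the assumed WMS minimum.
effective_interval() {
    requested=$1
    if [ "$requested" -lt "$WMS_MIN_INTERVAL" ]; then
        echo "$WMS_MIN_INTERVAL"
    else
        echo "$requested"
    fi
}

effective_interval 60     # prints 1000
effective_interval 3600   # prints 3600 (one save per hour)
```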

2. Tell the WMS which files you want to save at regular intervals

Of course, you first need to submit your job normally and get its JOBID. Apparently, perusal mode works only on CREAM-CE (and probably later versions). To be confirmed...

You tell the WMS which file to save with:

glite-wms-job-perusal --set -f hello.out JOBID

If you decide you also want to save the hello.err file, you should run

glite-wms-job-perusal --set -f hello.out -f hello.err JOBID

This is because if you only set hello.err (after you have already set hello.out), you will no longer be able to retrieve hello.out. So you should use multiple -f flags. See the man page of glite-wms-job-perusal.

You can manually retrieve the file with
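
Since every --set call has to list all the files again, it can help to build the command from a single list. The helper below is a hypothetical sketch (the function name perusal_set_command is not part of gLite); it only prints the command and does not run it, and it assumes file names without spaces.

```shell
#!/bin/sh
# perusal_set_command JOBID FILE...
# Prints (does not run) a glite-wms-job-perusal --set command that gives
# every file its own -f flag, so no earlier selection is dropped.
perusal_set_command() {
    jobid=$1
    shift
    cmd="glite-wms-job-perusal --set"
    for f in "$@"; do
        cmd="$cmd -f $f"
    done
    echo "$cmd $jobid"
}

# JOBID is a placeholder, as in the commands above.
perusal_set_command JOBID hello.out hello.err
# prints: glite-wms-job-perusal --set -f hello.out -f hello.err JOBID
```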

glite-wms-job-perusal --get -f hello.out JOBID

This time you cannot specify multiple files with multiple -f flags. You need to wait at least PerusalTimeInterval seconds before you can actually retrieve the file. If you retrieve the file once, the next retrieval will only contain what has been modified since the last retrieval! What does that mean? If you constantly append lines to the file, it is obvious: you get only the new lines. If you modify the file content in any other way than appending lines, I don't know. Test it and modify this wiki page!
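
Because each retrieval returns only the new part, a full copy has to be rebuilt locally by appending the pieces in order. A minimal sketch, assuming an append-only file and assuming each retrieved piece ends up in its own local file (the names chunk_1, chunk_2 and the function append_chunk are hypothetical):

```shell
#!/bin/sh
# append_chunk CHUNK_FILE FULL_FILE
# Appends one retrieved chunk to the local accumulated copy.
# Assumes the job only appends to the file, so concatenating the chunks
# in retrieval order rebuilds the full output.
append_chunk() {
    chunk=$1
    full=$2
    if [ -f "$chunk" ]; then
        cat "$chunk" >> "$full"
    fi
}

# Demonstration with two fake chunks in a temporary directory.
workdir=$(mktemp -d)
printf 'step 1\nstep 2\n' > "$workdir/chunk_1"
printf 'step 3\n' > "$workdir/chunk_2"
append_chunk "$workdir/chunk_1" "$workdir/hello.out.full"
append_chunk "$workdir/chunk_2" "$workdir/hello.out.full"
cat "$workdir/hello.out.full"   # prints step 1, step 2, step 3 on separate lines
```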

Useful facts

You can retrieve files even if they are not part of the "OutputSandbox" of your JDL.

You can wait until your job has finished, hit the walltime, or been killed because of proxy expiration, and still retrieve the last "backup"!

If your job stops before the PerusalTimeInterval has elapsed (the effective value is probably the maximum of 1000 and what you put in the JDL), then your job will still be reported as "Running" until the end of the PerusalTimeInterval! This looks like a bug.

 

 

