Globus timeout wnels2  2009-10-08 06:46
Status: Closed
 
I'm running a very long X!Tandem search. In my ms2Config.xml the maxWallTime is set to 1440 (24 hrs), and this is working as expected. I'm trying to override this with an input parameter:

<note label="xtandem, globus max cpu-time" type="input">10080</note> (7 days)

but it still times out at 24 hours.

Am I doing this correctly?

Thanks,
Bill
 
 
Brian Connolly responded:  2009-10-08 09:49
Bill,

The maxWallTime variable sets the maxwalltime variable for your cluster. You will also need to change the "Termination Time" setting for the Globus server.

The Globus server is configured by default to allow jobs to run for no more than 24hrs. In 9.1 we made this setting configurable. (See https://www.labkey.org/issues/home/Developer/issues/details.view?issueId=7823)

If you want to increase the Globus termination time for your Enterprise Pipeline, you need to do the following:

1) Open the pipelineConfig.xml file on your webserver
2) Find the bean defined as <bean class="org.labkey.pipeline.api.properties.GlobusClientPropertiesImpl">
3) Add a new property for this bean similar to

<property name="terminationTime" value="500000" />

Note the value is in minutes.

4) Restart the web server.
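
As a quick check on the units (the value is in minutes), the conversion can be sketched in Python. The helper name days_to_minutes is made up for illustration and is not part of LabKey or Globus.

```python
# Hypothetical helper for illustration; not part of LabKey or Globus.
MINUTES_PER_DAY = 24 * 60  # 1440, i.e. the 24-hour default expressed in minutes


def days_to_minutes(days):
    """Convert a number of days to minutes."""
    return days * MINUTES_PER_DAY


print(days_to_minutes(7))                  # 10080
print(round(500000 / MINUTES_PER_DAY, 1))  # 347.2
```

So a seven-day limit is 10080 minutes, and the example value of 500000 minutes is a little under a year.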


Brian
 
wnels2 responded:  2009-10-12 06:53
I must have missed something; it's still timing out after 24 hrs. I'm setting my search input param:
<note label="xtandem, globus max cpu-time" type="input">10080</note>

In my webserver's pipelineConfig.xml I added terminationTime and restarted Tomcat.

<bean class="org.labkey.pipeline.api.properties.GlobusClientPropertiesImpl">
                <property name="jobFactoryType" value="SGE" />
                <property name="queue" value="labkey" />
                <property name="javaHome" value="/share/apps/java/jdk1.5.0_12" />
                <property name="labKeyDir" value="/share/apps/cpas/bin/labkey" />
                <property name="globusServer" value="https://msfcluster.gws.uky.edu:8443" />
                <property name="terminationTime" value="10080" />


Below is the log file from the search:



08 Oct 2009 17:50:14,747 INFO : X! TANDEM 2 (2007.07.01.2)
08 Oct 2009 17:50:14,749 INFO :
08 Oct 2009 17:50:15,532 INFO : Loading spectra .... loaded.
08 Oct 2009 17:50:15,534 INFO : Spectra matching criteria = 7644
08 Oct 2009 17:50:15,537 INFO : Pluggable scoring enabled.
08 Oct 2009 17:50:15,540 INFO : Starting threads . started.
08 Oct 2009 17:50:15,542 INFO : Computing models:
09 Oct 2009 17:50:10,389 ERROR: Fault received from Globus on "submit" command
09 Oct 2009 17:50:10,513 ERROR: The provided 'gass_cache' parameter is invalid
09 Oct 2009 17:50:10,543 ERROR: Fault received from Globus
09 Oct 2009 17:50:10,555 ERROR: org.globus.exec.generated.FaultType
09 Oct 2009 17:50:10,612 ERROR: Fault received from Globus
09 Oct 2009 17:50:10,623 ERROR: ProcessDied
09 Oct 2009 17:50:10,659 ERROR: Fault received from Globus
09 Oct 2009 17:50:10,688 ERROR: java.lang.Exception
09 Oct 2009 17:50:10,696 ERROR: Fault received from Globus
09 Oct 2009 17:50:10,702 ERROR: org.oasis.wsrf.faults.BaseFaultType
09 Oct 2009 17:50:10,709 INFO : Reading log file /home/massspec/cpas/pipeline/projects/UPS/nelson/PTM/Gygi/010319_f16/xtandem/tandemPTM_max_test_7day_Copy1/010319_f16.cluster.out, which is now of size 1288
09 Oct 2009 17:50:10,714 INFO : Content of stdout:
Deploying resources from exploded modules to web app directory...
Module extraction and deployment complete.
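
To check how long the search ran before the fault, the gap between the "Computing models:" line and the first Globus fault can be computed from the two timestamps in the log above. This is a minimal Python sketch; the variable names are illustrative, and parsing "Oct" with %b assumes an English locale.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the pipeline log, e.g. "08 Oct 2009 17:50:15,542".
LOG_TIME_FORMAT = "%d %b %Y %H:%M:%S,%f"

# Timestamps copied from the "Computing models:" line and the first Globus fault.
started = datetime.strptime("08 Oct 2009 17:50:15,542", LOG_TIME_FORMAT)
faulted = datetime.strptime("09 Oct 2009 17:50:10,389", LOG_TIME_FORMAT)

elapsed = faulted - started
limit = timedelta(minutes=1440)  # the 24-hour limit discussed in this thread

print(elapsed)          # 23:59:54.847000
print(limit - elapsed)  # 0:00:05.153000
```

The run ended about five seconds short of 24 hours, which is consistent with a 1440-minute limit still being applied somewhere.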

Thanks,
Bill
 
wnels2 responded:  2009-10-13 06:08
Can you clarify which parameter to use?
[task name], globus max time
[task name], globus max cpu-time
or
[task name], globus max wall-time

Will they override the max wall-time in the ms2Config on the Globus server? Or should they be irrelevant now because the terminationTime has been extended on the web server?
Thanks,
Bill
 
jeckels responded:  2009-10-14 11:24
Hi Bill,

I believe that you'll need to specify the termination time AND a max cpu/wall time.

The termination time is interpreted relative to when the job was submitted. For example, if a job is submitted and all the cluster nodes are busy, the job can be canceled after its termination time is exceeded even if it never started.
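
A rough way to picture this in Python, assuming a simplified model in which the deadline is the submission time plus the termination time, and any job not finished by then is canceled. The function names are hypothetical, and this is not how Globus is implemented internally.

```python
from datetime import datetime, timedelta


def termination_deadline(submitted, termination_minutes):
    """Deadline counted from submission, not from when the job starts running."""
    return submitted + timedelta(minutes=termination_minutes)


def would_be_canceled(submitted, queue_wait_minutes, run_minutes, termination_minutes):
    """True if queue wait plus run time pushes the job past the deadline."""
    finish = submitted + timedelta(minutes=queue_wait_minutes + run_minutes)
    return finish > termination_deadline(submitted, termination_minutes)


submitted = datetime(2009, 10, 8, 17, 50)

# Two hours waiting in the queue plus 23 hours of work is 25 hours after submission.
print(would_be_canceled(submitted, 120, 23 * 60, 1440))   # True
print(would_be_canceled(submitted, 120, 23 * 60, 10080))  # False
```

Under this model, time spent waiting in the queue counts against the termination time, which is why it needs to be set comfortably longer than the expected run time.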

Generally speaking, I've used termination time in conjunction with max wall-time.

Can you include more of your log file so that we can see when it was submitted and when the job (not just the actual invocation of X!Tandem) started running on the cluster node?

Thanks,
Josh
 
wnels2 responded:  2009-10-15 14:22
Well, it's been 26 hours since I started the search and it is still running, so I'll optimistically say that using [task name], globus max cpu-time worked - knock on wood.

Thanks for your help.