Troubleshooting the Condor Batch System at the CMS LPC CAF
There is a search bar in the top right of this page that searches this whole site with Google. In addition, you can search this page for a key phrase from your error.
Failures that indicate you need a valid voms-proxy in the CMS VO
- Error with many different condor commands including submission (one example shown):
[username@cmslpc333 condor]$ condor_submit condor_test.jdl
Querying the CMS LPC pool and trying to find an available schedd...
Attempting to submit jobs to lpcschedd3.fnal.gov
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:28). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
- If you did grid-proxy-init, you will need to destroy your proxy and get a new one with the CMS VO:
voms-proxy-destroy; voms-proxy-init --valid 192:00 -voms cms
- Note that this may also fail if the FNAL CMS LPC CAF doesn't have your CMS grid certificate associated with your username. Follow the directions below to Enable your EOS area.
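- If you script your submissions, you can check for a still-valid proxy before calling condor_submit. This is a minimal sketch (not an official LPC script), assuming the voms-proxy clients are in your PATH; it renews the proxy when less than 24 hours remain:
#!/bin/bash
# Renew the CMS VOMS proxy if it is missing or has less than 24 hours of lifetime left
if ! voms-proxy-info --exists --valid 24:00 > /dev/null 2>&1 ; then
    voms-proxy-destroy > /dev/null 2>&1
    voms-proxy-init --valid 192:00 -voms cms
fi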
- Condor jobs held:
[username@cmslpc333 condor]$ condor_q
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.57:9618?... @ 09/08/22 15:12:37
ID          OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
76596545.0  username  9/8  15:12    0+00:00:00 H  0   0.0  condor_test.sh 76596545 0
76596545.1  username  9/8  15:12    0+00:00:00 H  0   0.0  condor_test.sh 76596545 1
- First, find out why the job is held (it could be for other reasons!):
[username@cmslpc333 condor]$ condor_q -better-analyze 76596545.0
Hold reason: Error from slot1_1@cmswn1518.fnal.gov: SHADOW at 131.225.188.57 failed to send file(s) to <131.225.206.130:36653>: error reading from /uscms/home/username/x509up_u55555: (errno 2) No such file or directory; STARTER failed to receive file(s) from <131.225.188.57:9618>
- Get your proxy:
voms-proxy-init --valid 192:00 -voms cms
and then release your jobs from the scheduler they were submitted to:
[username@cmslpc333 condor]$ condor_release -name lpcschedd3.fnal.gov -all
All jobs have been released
- It could instead be that you were over quota on your home directory NFS disk and were therefore unable to write a proxy file.
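- To check where your proxy file is and whether it still has lifetime left (a quick sketch, not an official procedure):
# print the proxy file location and the remaining validity in seconds
voms-proxy-info --path --timeleft
The path printed should match the proxy file named in the hold reason above.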
- If you get an error as follows, it shows that something is wrong with your authentication:
Submitting job(s)
ERROR: Failed to connect to local queue manager
SECMAN:2007:Failed to received post-auth ClassAd
- All of these errors indicate that you need to get a valid proxy in the CMS VO with:
voms-proxy-init --valid 192:00 -voms cms
Note that this command requests a proxy that is valid for as long as possible (192 hours).
- Usually these authentication errors occur because you have not associated your CMS grid certificate with your Fermilab CMS LPC CAF account; see the next section.
CMS Storage Space Request: Grid certificate at FNAL
- The CMS LPC CAF system must know the association between your grid certificate and your FNAL username (you will also need to register again if your CERN full name or grid certificate identity, such as your name, has changed since you previously registered a certificate). This is usually done as part of the Enable EOS area ticket. You must do this at least once for your grid certificate to be associated with your account (and again if your grid certificate changes); it also lets you write to your EOS area from CRAB.
- Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
- Choose "CMS Storage Space Request", and select "Enable" under "Action Required"
- It will prompt you for your DN (your DN is the result of
voms-proxy-info --identity
) and your CERN username. Submit that to register your DN. It will take up to one business day (FNAL hours) for the registration to propagate everywhere.
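- For illustration only (this is a made-up example, not a real DN), the output of voms-proxy-info --identity looks something like:
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=username/CN=123456/CN=Firstname Lastname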
Condor executable (script) errors: "exec format error"
Symptom: standard_init_linux.go:190: exec user process caused "exec format error"
in your condor stderr.
- In general, this means that the Docker container on the worker node cannot determine how to execute your executable
- The first line of your script should give the executable shell to use, like:
#!/bin/sh
#!/bin/bash
#!/bin/tcsh
- Do not have any extra space or anything else (not even a ###comment) before the first line of your script that condor executes
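- As a minimal sketch, a condor executable script could start like the following (the echo lines are placeholders for your real commands):
#!/bin/bash
# The '#!/bin/bash' line above must be the very first line of the file:
# no blank line, space, or comment may come before it.
echo "Running on host: $(hostname)"
echo "Arguments passed: $@"
# ... your real workload goes here ...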
My jobs failed due to missing libraries on the worker node
- Errors are like the following three examples:
./Analyzer: error while loading shared libraries: libCore.so: cannot open shared object file: No such file or directory
xrdcp: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by xrdcp)
./condor_exec.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./condor_exec.exe)
- Solution: Something is wrong in setting up your CMSSW or other environment on the condor worker node. If you have been relying on the line
getenv = true
in your condor.jdl to pass your interactive environment, set up the environment explicitly inside your job script instead; an example is given on the condor refactor page, and a sketch of such a wrapper script is shown after this list.
- Solution: the
./condor_exec.exe
error above was an example of passing a system-installed executable to the condor job from an SL6 system. The condor schedulers are SL7, so an SL7 executable may be passed to an SL6 container, where it will fail. Call your system executables (if needed) inside a shell script instead of transferring them.
- None of that worked? Fill out a LPC Service Portal ticket/incident (Click on "I'm having a problem!") and give the details of your jobs and the errors they had.
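As referenced above, here is a minimal sketch of a wrapper script that sets up a CMSSW environment from CVMFS on the worker node before running an executable. The CMSSW version, tarball name, and executable name are placeholders (assumptions for illustration), not official LPC values; see the condor refactor page for the supported examples:
#!/bin/bash
# Make the CMS software stack on CVMFS available on the worker node
source /cvmfs/cms.cern.ch/cmsset_default.sh
# Unpack a CMSSW work area shipped with the job (placeholder tarball name)
tar -xf CMSSW_10_6_20.tgz
cd CMSSW_10_6_20/src
# Set up the CMSSW runtime environment (the equivalent of cmsenv)
eval $(scramv1 runtime -sh)
# Go back to the condor scratch directory and run the analysis executable
cd "${_CONDOR_SCRATCH_DIR}"
./Analyzer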
Unable to locate local daemon
- Error in running a script (for example, gridpack generation) looks like:
root: could not import htcondor python API:
Unable to locate local daemon
- In this case the script's
condor_submit
is actually a python wrapper around the real condor_submit command, which first talks to the negotiators and then the remote schedulers. The failing call in the wrapper is:
self.schedd = htcondor.Schedd()
- htcondor.Schedd() called with no arguments looks for a schedd running on the local machine, and the cmslpc interactive nodes do not run one (the schedds are the remote lpcschedd*.fnal.gov machines), hence the "Unable to locate local daemon" message.
General troubleshooting methods
The following techniques may be useful to troubleshoot condor job problems. The condor user's manual guide to managing jobs goes into more depth for troubleshooting.
Web-based condor monitoring
- The Landscape LPC and User batch summary web page is quite useful to understand the status of jobs at the cmslpc. The landscape web page requires a valid CMS grid certificate for authentication.
- In the User Batch summary, choose or type in the username from the pulldown menu
- Note that landscape updates every ~5 minutes so command line condor monitoring will give you quicker results, although neither are instantaneous
- In the user batch summary, you can click on "Cluster" to find out all the jobs with the same Cluster (the first number in the JobID before the period) running on a particular node. In this view you can find the
condor_q -long
information, including the reason a job is held, CPU efficiency, resident set size (memory), and disk usage.
- In the first LPC landscape screen, the average job wait time is shown as well as some brief information about EOS health
- The LPC batch summary screen gives more detailed information about the batch system, including condor user priority
- In all landscape screens, the user can adjust the time range in the upper right corner
- By logging into landscape (Fermilab services password) in the orange swirl menu in the upper left, a user can customize plots for their view, for instance making a log scale, or a white background
Command line job monitoring
- From the command line:
condor_status -submitters
tells you the status of all condor jobs. If your jobs are Idle, they are not yet running and are waiting for an empty slot that matches your job requirements. If your jobs are Held, they are stopped for some reason.
User priority
condor_userprio
tells you the condor user priority of all the recent users of the cmslpc batch cluster. For more about condor user priority, see the condor user's manual guide to user priority.
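- As a quick sketch, you can filter both outputs for your own username (replace username with your FNAL username):
# your submitter entry (running/idle/held job counts)
condor_status -submitters | grep username
# your entry in the priority table
condor_userprio -allusers | grep username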
Job logs
- You can find useful troubleshooting information in your job logs. The
jobname_Cluster_Process.log
file contains information about where the job is running/ran, while the
jobname_Cluster_Process.stdout
and
jobname_Cluster_Process.stderr
files contain the actual executable stdout and stderr. The stdout and stderr are transferred after the job is finished.
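- Those file names are set by the output, error, and log lines of your condor jdl. A hedged sketch (jobname is a placeholder) using the Cluster and Process macros so that every job in a cluster gets unique files:
output = jobname_$(Cluster)_$(Process).stdout
error = jobname_$(Cluster)_$(Process).stderr
log = jobname_$(Cluster)_$(Process).log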
condor_q
- To understand the job status (ST) column from
condor_q
, you can refer to the condor user manual (8.7); see below for condor job troubleshooting to understand why a job is in each status:
- "Current status of the job, which varies somewhat according to the job universe and the timing of updates. H = on hold, R = running, I = idle (waiting for a machine to execute on), C = completed, X = removed, S = suspended (execution of a running job temporarily suspended on execute node), < = transferring input (or queued to do so), and > = transferring output (or queued to do so)."
- Be sure to know which scheduler your job was submitted to and is running on. You can query all the schedulers with
condor_q
. For instance, if your condor jobs were scheduled from
lpcschedd3.fnal.gov
, then you may use the argument
-name
to query that one alone, for instance:
condor_q -name lpcschedd3.fnal.gov
- Find the jobID of the job you are concerned about with
condor_q
; you can then see why it has the status it has with the command below:
Analyze a condor job
The command to use (example):
condor_q -name lpcschedd3.fnal.gov -analyze jobID
Note that
-analyze
and
-better-analyze
may give misleading results. This is because there are two matchmakers (negotiators) in the pool: one is for T1 pilots from MC production and one is for cmslpc jobs. Therefore not only will more machines be reported than you may have access to, but you may occasionally get a message that the negotiator is "not considering your job for matchmaking" (when that's not actually the case for your job; you just landed on the wrong matchmaker).
- In another case, the job requirements didn't allow the job to run on any machines. It is important to note that, due to the way the lpc cluster is partitioned, individual job slots can use more (or less) memory and disk than a standard amount. Therefore, in a busy cmslpc farm, job requirements for memory and disk will needlessly restrict a job. It is best, unless you need them for certain (and are willing to wait for resources to be available), not to add Requirements.
- Here is an example output from
condor_q -name lpcschedd3.fnal.gov -analyze
where an old condor jdl was used:
[username@cmslpc333 condor]$ condor_q -name lpcschedd3.fnal.gov -analyze 76596545
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
Last successful match: Wed Sep 7 14:25:07 2022
Last failed match: Thu Sep 8 14:32:46 2022
Reason for last match failure: no match found
The Requirements expression for your job is:
    ( ( OpSys == "LINUX" ) && ( Arch != "DUMMY" ) ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )
Suggestions:
    Condition                      Machines Matched    Suggestion
    ---------                      ----------------    ----------
1   ( OpSys == "LINUX" )           0                   REMOVE
2   ( Arch != "DUMMY" )            0                   REMOVE
3   ( TARGET.Memory >= 2500 )      1302
4   ( TARGET.Disk >= 1000000 )     3736
5   ( TARGET.HasFileTransfer )     6504
condor_q -name lpcschedd3.fnal.gov -better-analyze
is a better tool. However,
condor_q -name lpcschedd3.fnal.gov -better-analyze
wouldn't print the error about being over quota. It turned out the user above had extra commands in the condor job file to hold the job if it went over requested memory. Additionally, there were some no-longer-supported requirements being implemented, which
analyze
suggested removing.
- Here is an example where a job was held and condor_q -analyze shows the hold reason (in this case, a disk quota problem):
[username@cmslpc333 condor]$ condor_q -submitter username
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?... @ 09/08/22 15:12:37
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
76596545.0 tonjes 9/8 15:12 0+00:00:00 H 0 0.0 condor_test.sh
[username@cmslpc333 condor]$ condor_q -analyze 76596545.0 -name lpcschedd3.fnal.gov
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
---
76596545.000: Request is held.
Hold reason: Error from slot1@cmswn1898.fnal.gov: STARTER at 131.225.191.192 failed
to send file(s) to <131.225.188.235:9618>; SHADOW at 131.225.188.235 failed to write to file
/uscms_data/d1/username/filename.root: (errno 122) Disk quota exceeded
In the above case, the user's quota on /uscms_data, the NFS disk, was exceeded.
- Here is an example with
condor_q -name lpcschedd3.fnal.gov -better-analyze
, where the user requested a machine with more memory than any machine in the cluster has available:
[username@cmslpc333 testJobRestrict]$ condor_q -name lpcschedd3.fnal.gov -better-analyze 76596545
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
User priority for username@fnal.gov is not available, attempting to analyze without it.
---
76596545.000: Run analysis summary. Of 6076 machines,
6076 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
No successful match recorded.
Last failed match: Tu Jan 22 17:02:40 2019
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( TARGET.Arch == "x86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.HasFileTransfer )
Your job defines the following attributes:
RequestDisk = 10000000
RequestMemory = 210000
The Requirements expression for your job reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 6076 TARGET.Arch == "x86_64"
[1] 6076 TARGET.OpSys == "LINUX"
[3] 1718 TARGET.Disk >= RequestDisk
[5] 0 TARGET.Memory >= RequestMemory
Suggestions:
Condition Machines Matched Suggestion
--------- ---------------- ----------
1 ( TARGET.Memory >= 210000 ) 0 MODIFY TO 24028
2 ( TARGET.Disk >= 10000000 ) 1718
3 ( TARGET.Arch == "x86_64" ) 6076
4 ( TARGET.OpSys == "LINUX" ) 6076
5 ( TARGET.HasFileTransfer ) 6076
condor_tail
If you wish to see the end of the stdout of a currently running job, you can use the
condor_tail
command together with the job number. The result will give you ~20 lines of output; it is cut here for space in this example:
condor_tail 76596545.1
== CMSSW: 09-Mar2019 22:26:32 CST Successfully opened file root://cmseos.fnal.gov//store/user/username/myfile.root
== CMSSW: Begin processing the 1st record. Run 1, Event 254, LumiSection 6 at 09-Mar-2019 22:30:12.001 CST
== CMSSW: Begin processing the 101st record. Run 1, Event 353, LumiSection 8 at 09-Mar-2019 22:30:28.673 CST
== CMSSW: Begin processing the 201st record. Run 1, Event 451, LumiSection 10 at 09-Mar-2019 22:30:32.670 CST
Note that you can also query the stderr with the
-stderr
option instead of the default stdout, for instance:
condor_tail -stderr 76596545.1
More options can be found with condor_tail -help
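Two other options that can be useful; this is a sketch, so check condor_tail -help on the cmslpc nodes to confirm they are available in the installed version:
# show a larger chunk of the end of stdout than the default
condor_tail -maxbytes 4096 76596545.1
# keep printing new output as the running job writes it
condor_tail -follow 76596545.1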
condor_ssh
The command
condor_ssh_to_job
and similar ones will not work as condor_ssh is disabled by policy at cmslpc condor servers. Please use other methods to troubleshoot your job.
Querying condor ClassAds
- Querying some or all of the job ClassAds is a useful technique to understand what's wrong with jobs. To find the most information, you can query on the interactive machine the jobs were submitted from with:
condor_q -long JobID
- Another way to find out the reason a job is held from the command line (use the appropriate
-name lpcschedd3.fnal.gov
for your job):
- For all of your jobs on a particular scheduler:
condor_q -af:jh User HoldReason -name lpcschedd3.fnal.gov
- For a single job, put in the jobID, for instance:
condor_q 76596545.0 -af:jh User HoldReason -name lpcschedd3.fnal.gov
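- The same -af (autoformat) query works for other job ClassAds. For example, a hedged sketch to compare requested versus measured resources for your jobs on one scheduler:
condor_q -name lpcschedd3.fnal.gov -af:jh RequestMemory MemoryUsage RequestDisk DiskUsage JobStatus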
What are the system limits for jobs running on the cmslpc condor batch system?
- Time: Jobs are removed for running for more than 2 days, being held for more than 7 days, or being idle for more than 30 days
- Disk: Jobs cannot use more than 40GB of local disk space on the condor worker node
- Memory: The default memory request (if not specified) is 2100 MB; if your job uses more than this, or more than you requested, then it is removed. Go to the Additional condor commands page to learn more about profiling and submitting high memory jobs
- Technical details:
- To find out the complete system default configuration, you can use the command:
condor_config_val -dump -name lpcschedd3.fnal.gov
The most important line is:
SYSTEM_PERIODIC_REMOVE = (JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)) || (JobRunCount > 10) || (JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7) || (DiskUsage > 40000000) || (ResidentSetSize > (RequestMemory * 1000) || (JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30) )
- If your job is removed, you get the following, which states the reason in plainer English:
SYSTEM_PERIODIC_REMOVE_REASON = strcat("LPC job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)), "job running for more than 2 days.", ifThenElse((JobRunCount > 10), "job restarting more than 10 times.", ifThenElse((JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7), "job held for more than 7 days.", ifThenElse((DiskUsage > 40000000), "job exceeding 40GB disk usage.", ifThenElse((ResidentSetSize > (RequestMemory * 1000)), "job exceeding requested memory.", ifThenElse((JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30), "idle more than 30 days", "unknown reasons." ) ) ) ) ) ) )
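If your job legitimately needs more than the default memory or disk mentioned above, request it in your condor jdl rather than letting the job be removed. A minimal sketch (the values are placeholders; larger requests may wait longer for a matching slot):
# in the .jdl: request_memory is in MB, request_disk is in KB
request_memory = 4000
request_disk = 10000000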