Troubleshooting the Condor Batch System at the CMS LPC CAF
NOTE: This page is newly created to be up to date for the Condor refactor.
There is a search bar in the top right of this page that searches the whole site with Google. In addition, you can search this page for a key phrase from your error.
Failures that indicate you need a valid voms-proxy in CMS VO
- Error with many different condor commands including submission (one example shown):
condor_submit condor_test.jdl
Querying the CMS LPC pool and trying to find an available schedd...
Attempting to submit jobs to lpcschedd2.fnal.gov
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:28). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
All of these numbered errors indicate that you need to get a valid proxy in the CMS VO with:
voms-proxy-init --valid 192:00 -voms cms
Note that the --valid 192:00 option in this command requests a proxy that is valid for as long as possible.
- If you did grid-proxy-init, you will need to destroy your proxy and get a new one with the CMS VO:
voms-proxy-destroy; voms-proxy-init --valid 192:00 -voms cms
- Note that this may also fail if the FNAL CMS LPC CAF doesn't have your CMS grid certificate associated with your username. Follow the directions below to Enable your EOS area.
condor_q
-- Schedd: lpcschedd2.fnal.gov : <22.214.171.124:9618?... @ 01/22/19 18:54:16
 ID          OWNER     SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
30000998.0   username  1/22 18:53   0+00:00:00  H   0    0.0   condor_test.sh 30000998 0
30000998.1   username  1/22 18:53   0+00:00:00  H   0    0.0   condor_test.sh 30000998 1
- First, find out why the job is held (it could be for other reasons!):
condor_q -better-analyze 30000998.0
Hold reason: Error from email@example.com: SHADOW at 126.96.36.199 failed to send file(s) to <188.8.131.52:36653>: error reading from /uscms/home/username/x509up_u55555: (errno 2) No such file or directory; STARTER failed to receive file(s) from <184.108.40.206:9618>
- Get your proxy:
voms-proxy-init --valid 192:00 -voms cms
and release your jobs from the scheduler they are submitted to:
condor_release -name lpcschedd2.fnal.gov -all
All jobs have been released
- The CMS LPC CAF system must know about the association between your grid certificate and your FNAL username. This is usually done as part of the Enable EOS area ticket. You must do this at least once for your grid certificate to be associated with your account, which also lets you write to your EOS area from CRAB.
- Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
- And do "CMS Storage Space Request", and select "Enable" under "Action Required"
- It will prompt you for your DN and CERN username (your DN is the result of voms-proxy-info --identity). Submit that to register your DN. It will take a few hours (during weekdays) for it to propagate everywhere.
Submitting job(s) ERROR: Failed to connect to local queue manager SECMAN:2007:Failed to received post-auth ClassAd
CMS Storage Space Request: Grid certificate at FNAL
Condor executable (script) errors:
"exec format error"
standard_init_linux.go:190: exec user process caused "exec format error" in your condor stderr.
- In general, this means that the Docker container on the worker node doesn't know enough to be able to execute your executable
- The first line of your script should give the executable shell to use, like:
- Do not put anything (no blank line, no extra space, not even a ### comment) before the first line of the script that condor executes
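A quick way to check this rule before submitting is to look at the very first characters of the script. The following is a minimal Python sketch and not part of the cmslpc tooling; the function names and the `#!/bin/bash` example contents are assumptions chosen for illustration.

```python
def text_starts_with_shebang(text):
    """Return True only if '#!' are the very first two characters."""
    return text.startswith("#!")


def script_starts_with_shebang(path):
    """Apply the same check to the first two bytes of a script file."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


# Hypothetical script contents, for illustration only.
good_script = "#!/bin/bash\necho hello\n"
# A blank line and a comment before the shebang break the rule above.
bad_script = "\n### my comment\n#!/bin/bash\necho hello\n"

print(text_starts_with_shebang(good_script))
print(text_starts_with_shebang(bad_script))
```

For a script on disk, `script_starts_with_shebang("condor_test.sh")` performs the same check on the file; the file name here is only an example.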
My jobs failed due to missing libraries on the worker node
- Errors are like the following three examples:
./Analyzer: error while loading shared libraries: libCore.so: cannot open shared object file: No such file or directory
xrdcp: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by xrdcp)
./condor_exec.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./condor_exec.exe)
- Solution: Something is wrong in setting up your CMSSW or other environment on the condor worker node. If you have been using getenv = true in your condor jdl to pass your interactive environment, see the example given on the condor refactor page
- Solution: the ./condor_exec.exe error was an example of passing a system-installed executable to the condor job from an SL6 system. The condor schedulers are SL7, so they will pass an SL7 executable to an SL6 container, and it will fail. Call your system executables (if needed) inside a shell script.
- None of that worked? Fill out an LPC Service Portal ticket/incident (click on "I'm having a problem!") and give the details of your jobs and the errors they had.
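When many jobs fail, it can help to scan the stderr files for the three error patterns listed above before opening a ticket. The following is a minimal Python sketch under stated assumptions: the labels, the function name, and the regular expressions are illustrative and only cover the example messages shown on this page.

```python
import re

# Patterns derived from the three example messages above (illustrative only).
LIBRARY_ERROR_PATTERNS = {
    "missing shared library": re.compile(
        r"error while loading shared libraries: (\S+):"
    ),
    "GLIBCXX version not found": re.compile(r"version `(GLIBCXX_[\d.]+)' not found"),
    "GLIBC version not found": re.compile(r"version `(GLIBC_[\d.]+)' not found"),
}


def find_library_errors(stderr_text):
    """Return (label, library-or-version) pairs for each matching stderr line."""
    hits = []
    for line in stderr_text.splitlines():
        for label, pattern in LIBRARY_ERROR_PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((label, match.group(1)))
                break  # at most one label per line
    return hits


example_stderr = "\n".join([
    "./Analyzer: error while loading shared libraries: libCore.so: "
    "cannot open shared object file: No such file or directory",
    "xrdcp: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by xrdcp)",
    "./condor_exec.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found "
    "(required by ./condor_exec.exe)",
])

for label, detail in find_library_errors(example_stderr):
    print(label, detail)
```

Any hit points back to the solutions above: the environment on the worker node is not what the executable expects.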
Unable to locate local daemon
- Error in running script (gridpack generation) looks like:
root: could not import htcondor python API: Unable to locate local daemon
condor_submit is actually a python wrapper around the real condor_submit command, which first talks to negotiators and then the remote schedulers.
self.schedd = htcondor.Schedd()
General troubleshooting methods
The following techniques may be useful to troubleshoot condor job problems. The condor user's manual guide to managing jobs goes into more depth for troubleshooting.
Web-based condor monitoring
- The Landscape LPC and User batch summary web page is quite useful to understand the status of jobs at the cmslpc. The landscape web page requires a valid CMS grid certificate for authentication.
- In the User Batch summary, choose or type in the username from the pulldown menu
- Note that landscape updates every ~5 minutes, so command line condor monitoring will give you quicker results, although neither is instantaneous
- In the user batch summary, you can click on "Cluster" to find out all the jobs with the same Cluster (the first number in JobID before the period) running on a particular node. In this view you can find the condor_q -long information, including the reason a job is held, CPU efficiency, resident set size (memory), and disk usage.
- In the first LPC landscape screen, the average job wait time is shown as well as some brief information about EOS health
- The LPC batch summary screen gives more detailed information about the batch system, including condor user priority
- In all landscape screens, the user can adjust the time range in the upper right corner
- By logging into landscape (Fermilab services password) in the orange swirl menu in the upper left, a user can customize plots for their view, for instance making a log scale, or a white background
Command line job monitoring
- From the command line:
condor_status -submitters tells you the status of all condor jobs. If your jobs are Idle, they are not yet running and are waiting for an empty slot that matches your job requirements. If your jobs are Held, they are stopped for some reason.
condor_userprio tells you the condor user priority of all the recent users of the cmslpc batch cluster. For more about condor user priority, see the condor user's manual guide to user priority.
- You can find useful troubleshooting information in your job logs: the jobname_Cluster_Process.log file contains information about where the job is running or ran, and the jobname_Cluster_Process.stderr files contain the actual executable stdout and stderr. The stdout and stderr are transferred after the job is finished.
- To understand job status (ST) from
condor_q, you can refer to the condor user manual (8.7) - see below for condor job troubleshooting to understand why a job is in each status:
- "Current status of the job, which varies somewhat according to the job universe and the timing of updates. H = on hold, R = running, I = idle (waiting for a machine to execute on), C = completed, X = removed, S = suspended (execution of a running job temporarily suspended on execute node), < = transferring input (or queued to do so), and > = transferring output (or queued to do so)."
- Be sure to know which scheduler your job was submitted to and is running on. You can query all the schedulers with condor_q. For instance, if your condor jobs were scheduled from lpcschedd2.fnal.gov, then you may use the argument -name to query that one alone:
condor_q -name lpcschedd2.fnal.gov
- Find the jobID of the job you are concerned about with condor_q; you can then see why it has the status it has with:
Analyze a condor job
An example of the command to use:
condor_q -name lpcschedd1.fnal.gov -analyze jobID
-better-analyze may give misleading results. This is because there are two matchmakers (negotiators) in the pool: one is for T1 pilots from MC production and one is for cmslpc jobs. Therefore not only will more machines be reported than you may have access to, but you may occasionally get a message that
I am not considering your job for matchmaking
(when that's not actually the case for your job; you just landed on the wrong matchmaker).
- In another case, the job requirements didn't allow the job to run on any machines. Because of the way the lpc cluster is partitioned, individual job slots can use more (or less) memory and disk than a standard amount, so in a busy cmslpc farm, job requirements for memory and disk will needlessly restrict a job. Unless you need them for certain (and are willing to wait for resources to be available), it is best not to add Requirements.
- Here is an example output from condor_q -name lpcschedd1.fnal.gov -analyze where an old condor jdl was used:
condor_q -name lpcschedd1.fnal.gov -analyze 1033325
-- Schedd: lpcschedd1.fnal.gov : <220.127.116.11:9618?...
Last successful match: Tue Jan 22 14:25:07 2019
Last failed match: Wed Jan 23 14:32:46 2019
Reason for last match failure: no match found
The Requirements expression for your job is:
( ( OpSys == "LINUX" ) && ( Arch != "DUMMY" ) ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
Suggestions:
    Condition                     Machines Matched  Suggestion
    ---------                     ----------------  ----------
1   ( OpSys == "LINUX" )          0                 REMOVE
2   ( Arch != "DUMMY" )           0                 REMOVE
3   ( TARGET.Memory >= 2500 )     1302
4   ( TARGET.Disk >= 1000000 )    3736
5   ( TARGET.HasFileTransfer )    6504
condor_q -name lpcschedd1.fnal.gov -better-analyze is a better tool. However, condor_q -name lpcschedd1.fnal.gov -better-analyze wouldn't print the error about being over quota. It turned out the user above had extra commands in the condor job file to hold the job if it went over requested memory. Additionally, there were some no longer supported requirements being implemented.
condor_q -submitter username
-- Submitter: lpcschedd1.fnal.gov : <18.104.22.168:9618?... : lpcschedd1.fnal.gov @ 01/22/19 19:40:11
 ID     OWNER     SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
974.0   username  1/22 19:39   0+00:00:00  I   0    0.0   condor_test.sh 974 0
[username@cmslpc136 condor]$ condor_q -analyze 974.0 -name lpcschedd1.fnal.gov
-- Schedd: lpcschedd1.fnal.gov : <22.214.171.124:9618?...
---
970.000: Request is held.
Hold reason: Error from firstname.lastname@example.org: STARTER at 126.96.36.199 failed to send file(s) to <188.8.131.52:9618>; SHADOW at 184.108.40.206 failed to write to file /uscms_data/d1/username/filename.root: (errno 122) Disk quota exceeded
In the above case, the user's quota on /uscms_data, the NFS disk, is exceeded.
- Here is an example with condor_q -name lpcschedd1.fnal.gov -better-analyze, where the user requested a machine with more memory available than any in the cluster.
condor_q -name lpcschedd1.fnal.gov -better-analyze 1545569
-- Schedd: lpcschedd1.fnal.gov : <220.127.116.11:9618?...
User priority for email@example.com is not available, attempting to analyze without it.
---
1545569.000: Run analysis summary. Of 6076 machines,
   6076 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
No successful match recorded.
Last failed match: Tue Jan 22 17:02:40 2019
Reason for last match failure: no match found
WARNING: Be advised: No resources matched request's constraints
The Requirements expression for your job is:
( TARGET.Arch == "x86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
Your job defines the following attributes:
RequestDisk = 10000000
RequestMemory = 210000
The Requirements expression for your job reduces to these conditions:
Slots Step  Matched  Condition
-----  --------  ---------
        6076  TARGET.Arch == "x86_64"
        6076  TARGET.OpSys == "LINUX"
        1718  TARGET.Disk >= RequestDisk
           0  TARGET.Memory >= RequestMemory
Suggestions:
    Condition                       Machines Matched  Suggestion
    ---------                       ----------------  ----------
1   ( TARGET.Memory >= 210000 )     0                 MODIFY TO 24028
2   ( TARGET.Disk >= 10000000 )     1718
3   ( TARGET.Arch == "x86_64" )     6076
4   ( TARGET.OpSys == "LINUX" )     6076
5   ( TARGET.HasFileTransfer )      6076
If you wish to see the end of the stdout of a currently running job, you can use the condor_tail command together with the job number. The results will give you ~20 lines of output; the output is cut here for space in this example:
== CMSSW: 09-Mar-2019 22:26:32 CST Successfully opened file root://cmseos.fnal.gov//store/user/username/myfile.root
== CMSSW: Begin processing the 1st record. Run 1, Event 254, LumiSection 6 at 09-Mar-2019 22:30:12.001 CST
== CMSSW: Begin processing the 101st record. Run 1, Event 353, LumiSection 8 at 09-Mar-2019 22:30:28.673 CST
== CMSSW: Begin processing the 201st record. Run 1, Event 451, LumiSection 10 at 09-Mar-2019 22:30:32.670 CST
Note that you can also query the -stderr instead of the default stdout, for instance:
condor_tail -stderr 60000042.1
More options can be found with
condor_ssh_to_job and similar commands will not work, as condor_ssh is disabled by policy at cmslpc condor servers. Please use other methods to troubleshoot your job.
Querying condor ClassAds
- Querying some or all of the job ClassAds is a useful technique to understand what's wrong with jobs. To find the most information, you can query on the interactive machine the jobs were submitted from with:
condor_q -long JobID
- Another way to find out the reason a job is held from the command line (use the appropriate -name lpcschedd1.fnal.gov for your job):
- For all the jobs that are on a particular interactive machine:
condor_q -af:jh User HoldReason -name lpcschedd1.fnal.gov
- For a single job, put in the jobID, for instance:
condor_q 3988530.0 -af:jh User HoldReason -name lpcschedd1.fnal.gov
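If you need to run the hold-reason query against more than one scheduler, the command line can be assembled programmatically. The following is a minimal Python sketch; the function name and the list of schedulers are assumptions for illustration (only lpcschedd1.fnal.gov and lpcschedd2.fnal.gov appear on this page), and the sketch only builds the argument list, it does not run condor_q.

```python
def hold_reason_command(schedd_name, job_id=None):
    """Build the condor_q argument list shown above for one scheduler.

    With job_id=None the query covers all jobs on that scheduler;
    otherwise it is restricted to the given jobID.
    """
    command = ["condor_q"]
    if job_id is not None:
        command.append(str(job_id))
    command += ["-af:jh", "User", "HoldReason", "-name", schedd_name]
    return command


# Scheduler names taken from the examples on this page; adjust as needed.
example_schedds = ["lpcschedd1.fnal.gov", "lpcschedd2.fnal.gov"]

for schedd in example_schedds:
    print(" ".join(hold_reason_command(schedd)))

print(" ".join(hold_reason_command("lpcschedd1.fnal.gov", "3988530.0")))
```

On an interactive node, each list could be passed to `subprocess.run(command, capture_output=True, text=True)` to collect the output per scheduler.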
What are the system limits for jobs running on the cmslpc condor batch system?
- Time: Jobs are removed for: Running for more than 2 days, held for more than 7 days, idle for more than 30 days
- Disk: Jobs cannot use more than 40GB of local disk space on the condor worker node
- Memory: The default memory (if not specified) is 2100 MB. If your job uses more than this, or more than you requested, then it is removed. Go to the Additional condor commands page to learn more about profiling and submitting high memory jobs
- Technical details:
- To find out the configuration of the complete system defaults, you can use the command:
condor_config_val -dump -name lpcschedd1.fnal.gov, the most important line is:
SYSTEM_PERIODIC_REMOVE = (JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)) || (JobRunCount > 10) || (JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7) || (DiskUsage > 40000000) || (ResidentSetSize > (RequestMemory * 1000) || (JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30) )
- If your job is removed, you get this reason, which is closer to plain English:
SYSTEM_PERIODIC_REMOVE_REASON = strcat("LPC job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)), "job running for more than 2 days.", ifThenElse((JobRunCount > 10), "job restarting more than 10 times.", ifThenElse((JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7), "job held for more than 7 days.", ifThenElse((DiskUsage > 40000000), "job exceeding 40GB disk usage.", ifThenElse((ResidentSetSize > (RequestMemory * 1000)), "job exceeding requested memory.", ifThenElse((JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30), "idle more than 30 days", "unknown reasons." ) ) ) ) ) ) )
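To make the nested ifThenElse expression easier to read, the same checks can be written out in order. The following is a minimal Python sketch, assuming the job attributes are available as a plain dictionary (the `removal_reason` function and the dictionary layout are illustrative, not an HTCondor API). It also assumes the usual HTCondor units, with DiskUsage and ResidentSetSize in KB and RequestMemory in MB, and it does not reproduce how ClassAd expressions treat undefined attributes.

```python
DAY = 86400  # seconds, as in the 86400*N terms of the expression

# Numeric codes used in the expression above.
IDLE, RUNNING, HELD = 1, 2, 5
VANILLA_UNIVERSE = 5


def removal_reason(job, now):
    """Return the removal reason string, or None if no limit is exceeded.

    The checks follow the same order as SYSTEM_PERIODIC_REMOVE_REASON.
    """
    time_in_status = now - job["EnteredCurrentStatus"]
    if (job["JobUniverse"] == VANILLA_UNIVERSE
            and job["JobStatus"] == RUNNING
            and time_in_status > 2 * DAY):
        return "job running for more than 2 days."
    if job["JobRunCount"] > 10:
        return "job restarting more than 10 times."
    if job["JobStatus"] == HELD and time_in_status > 7 * DAY:
        return "job held for more than 7 days."
    if job["DiskUsage"] > 40000000:
        return "job exceeding 40GB disk usage."
    if job["ResidentSetSize"] > job["RequestMemory"] * 1000:
        return "job exceeding requested memory."
    if job["JobStatus"] == IDLE and time_in_status > 30 * DAY:
        return "idle more than 30 days"
    return None


# Hypothetical job using the default 2100 MB memory request.
example_job = {
    "JobUniverse": 5, "JobStatus": RUNNING, "EnteredCurrentStatus": 0,
    "JobRunCount": 1, "DiskUsage": 1000,
    "ResidentSetSize": 2200000, "RequestMemory": 2100,
}
print(removal_reason(example_job, now=3600))
```

In this example the job has been running for one hour but its resident set size is above 2100 * 1000, so the memory limit is the one that applies.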