Troubleshooting the Condor Batch System at the CMS LPC CAF
There is a search bar in the top right of this page that searches this whole site with Google. In addition, you can search this page for a key phrase from your error.
Failures that indicate you need a valid voms-proxy in the CMS VO
- Error with many different condor commands including submission (one example shown):
[username@cmslpc333 condor]$ condor_submit condor_test.jdl
Querying the CMS LPC pool and trying to find an available schedd...
Attempting to submit jobs to lpcschedd3.fnal.gov
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:28). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
- If you did grid-proxy-init, you will need to destroy your proxy and get a new one with the CMS VO:
voms-proxy-destroy; voms-proxy-init --valid 192:00 -voms cms
- Note that this may also fail if the FNAL CMS LPC CAF doesn't have your CMS grid certificate associated with your username. Follow the directions below to Enable your EOS area.
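- If you script your submissions, you can check for a still-valid proxy before calling condor_submit. This is a minimal sketch (not an official LPC script), assuming the voms-proxy clients are in your PATH; it renews the proxy when less than 24 hours remain:
#!/bin/bash
# Renew the CMS VOMS proxy if it is missing or has less than 24 hours of lifetime left
if ! voms-proxy-info --exists --valid 24:00 > /dev/null 2>&1 ; then
    voms-proxy-destroy > /dev/null 2>&1
    voms-proxy-init --valid 192:00 -voms cms
fi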
- Condor jobs held:
[username@cmslpc333 condor]$ condor_q
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.57:9618?... @ 09/08/22 15:12:37
ID          OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
76596545.0  username  9/8  15:12    0+00:00:00 H  0   0.0  condor_test.sh 76596545 0
76596545.1  username  9/8  15:12    0+00:00:00 H  0   0.0  condor_test.sh 76596545 1
- First, find out why the job is held (it could be for other reasons!):
[username@cmslpc333 condor]$ condor_q -better-analyze 76596545.0
Hold reason: Error from slot1_1@cmswn1518.fnal.gov: SHADOW at 131.225.188.57 failed to send file(s) to <131.225.206.130:36653>: error reading from /uscms/home/username/x509up_u55555: (errno 2) No such file or directory; STARTER failed to receive file(s) from <131.225.188.57:9618>
- Get your proxy:
voms-proxy-init --valid 192:00 -voms cms
and then release your jobs from the scheduler they were submitted to:
[username@cmslpc333 condor]$ condor_release -name lpcschedd3.fnal.gov -all
All jobs have been released
- It could instead be that you were over quota on your home directory NFS disk and were therefore unable to write a proxy file.
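- To check where your proxy file is and whether it still has lifetime left (a quick sketch, not an official procedure):
# print the proxy file location and the remaining validity in seconds
voms-proxy-info --path --timeleft
The path printed should match the proxy file named in the hold reason above.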
- If you get an error as follows, it shows that something is wrong with your authentication:
Submitting job(s)
ERROR: Failed to connect to local queue manager
SECMAN:2007:Failed to received post-auth ClassAd
- All of these errors indicate that you need to get a valid proxy in the CMS VO with:
voms-proxy-init --valid 192:00 -voms cms
Note that this command requests a proxy that is valid for as long as possible (192 hours).
- Usually these authentication errors occur because you have not associated your CMS grid certificate with your Fermilab CMS LPC CAF account; see the next section.
CMS Storage Space Request: Grid certificate at FNAL
- The CMS LPC CAF system must know the association between your grid certificate and your FNAL username (you will also need to register again if your CERN full name or grid certificate identity, such as your name, has changed since you previously registered a certificate). This is usually done as part of the Enable EOS area ticket. You must do this at least once for your grid certificate to be associated with your account (and again if your grid certificate changes); it also lets you write to your EOS area from CRAB.
- Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
- Choose "CMS Storage Space Request", and select "Enable" under "Action Required"
- It will prompt you for your DN (your DN is the result of
voms-proxy-info --identity
) and your CERN username. Submit that to register your DN. It will take up to one business day (FNAL hours) for the registration to propagate everywhere.
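- For illustration only (this is a made-up example, not a real DN), the output of voms-proxy-info --identity looks something like:
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=username/CN=123456/CN=Firstname Lastname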
Condor executable (script) errors: "exec format error"
Symptom: standard_init_linux.go:190: exec user process caused "exec format error"
in your condor stderr.
- In general, this means that the Docker container on the worker node cannot determine how to execute your executable
- The first line of your script should give the executable shell to use, like:
#!/bin/sh
#!/bin/bash
#!/bin/tcsh
- Do not have any extra space or anything else (not even a ###comment) before the first line of your script that condor executes
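- As a minimal sketch, a condor executable script could start like the following (the echo lines are placeholders for your real commands):
#!/bin/bash
# The '#!/bin/bash' line above must be the very first line of the file:
# no blank line, space, or comment may come before it.
echo "Running on host: $(hostname)"
echo "Arguments passed: $@"
# ... your real workload goes here ...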
My jobs failed due to missing libraries on the worker node
- Errors are like the following three examples:
./Analyzer: error while loading shared libraries: libCore.so: cannot open shared object file: No such file or directory
xrdcp: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by xrdcp)
./condor_exec.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./condor_exec.exe)
- Solution: Something is wrong in setting up your CMSSW or other environment on the condor worker node. If you have been relying on the line
getenv = true
in your condor.jdl to pass your interactive environment, set up the environment explicitly inside your job script instead; an example is given on the condor refactor page, and a sketch of such a wrapper script is shown after this list.
- Solution: the
./condor_exec.exe
error above was an example of passing a system-installed executable to the condor job from an SL6 system. The condor schedulers are SL7, so an SL7 executable may be passed to an SL6 container, where it will fail. Call your system executables (if needed) inside a shell script instead of transferring them.
- None of that worked? Fill out a LPC Service Portal ticket/incident (Click on "I'm having a problem!") and give the details of your jobs and the errors they had.
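As referenced above, here is a minimal sketch of a wrapper script that sets up a CMSSW environment from CVMFS on the worker node before running an executable. The CMSSW version, tarball name, and executable name are placeholders (assumptions for illustration), not official LPC values; see the condor refactor page for the supported examples:
#!/bin/bash
# Make the CMS software stack on CVMFS available on the worker node
source /cvmfs/cms.cern.ch/cmsset_default.sh
# Unpack a CMSSW work area shipped with the job (placeholder tarball name)
tar -xf CMSSW_10_6_20.tgz
cd CMSSW_10_6_20/src
# Set up the CMSSW runtime environment (the equivalent of cmsenv)
eval $(scramv1 runtime -sh)
# Go back to the condor scratch directory and run the analysis executable
cd "${_CONDOR_SCRATCH_DIR}"
./Analyzer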
Unable to locate local daemon
- Error in running a script (for example, gridpack generation) looks like:
root: could not import htcondor python API:
Unable to locate local daemon
- In this case the script's
condor_submit
is actually a python wrapper around the real condor_submit command, which first talks to the negotiators and then the remote schedulers. The failing call in the wrapper is:
self.schedd = htcondor.Schedd()
- htcondor.Schedd() called with no arguments looks for a schedd running on the local machine, and the cmslpc interactive nodes do not run one (the schedds are the remote lpcschedd*.fnal.gov machines), hence the "Unable to locate local daemon" message.
General troubleshooting methods
The following techniques may be useful to troubleshoot condor job problems. The condor user's manual guide to managing jobs goes into more depth for troubleshooting.
Web-based condor monitoring
- The Landscape LPC and User batch summary web page is quite useful to understand the status of jobs at the cmslpc. The landscape web page requires a valid CMS grid certificate for authentication.
- In the User Batch summary, choose or type in the username from the pulldown menu
- Note that landscape updates every ~5 minutes so command line condor monitoring will give you quicker results, although neither are instantaneous
- In the user batch summary, you can click on "Cluster" to find out all the jobs with the same Cluster (the first number in the JobID before the period) running on a particular node. In this view you can find the
condor_q -long
information, including the reason a job is held, CPU efficiency, resident set size (memory), and disk usage.
- In the first LPC landscape screen, the average job wait time is shown as well as some brief information about EOS health
- The LPC batch summary screen gives more detailed information about the batch system, including condor user priority
- In all landscape screens, the user can adjust the time range in the upper right corner
- By logging into landscape (Fermilab services password) in the orange swirl menu in the upper left, a user can customize plots for their view, for instance making a log scale, or a white background
Command line job monitoring
- From the command line:
condor_status -submitters
tells you the status of all condor jobs. If your jobs are Idle, they are not yet running and are waiting for an empty slot that matches your job requirements. If your jobs are Held, they are stopped for some reason.
User priority
condor_userprio
tells you the condor user priority of all the recent users of the cmslpc batch cluster. For more about condor user priority, see the condor user's manual guide to user priority.
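- As a quick sketch, you can filter both outputs for your own username (replace username with your FNAL username):
# your submitter entry (running/idle/held job counts)
condor_status -submitters | grep username
# your entry in the priority table
condor_userprio -allusers | grep username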
Job logs
- You can find useful troubleshooting information in your job logs. The
jobname_Cluster_Process.log
file contains information about where the job is running/ran, while the
jobname_Cluster_Process.stdout
and
jobname_Cluster_Process.stderr
files contain the actual executable stdout and stderr. The stdout and stderr are transferred after the job is finished.
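- Those file names are set by the output, error, and log lines of your condor jdl. A hedged sketch (jobname is a placeholder) using the Cluster and Process macros so that every job in a cluster gets unique files:
output = jobname_$(Cluster)_$(Process).stdout
error = jobname_$(Cluster)_$(Process).stderr
log = jobname_$(Cluster)_$(Process).log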
condor_q
- To understand the job status (ST) column from
condor_q
, you can refer to the condor user manual (8.7); see below for condor job troubleshooting to understand why a job is in each status:
- "Current status of the job, which varies somewhat according to the job universe and the timing of updates. H = on hold, R = running, I = idle (waiting for a machine to execute on), C = completed, X = removed, S = suspended (execution of a running job temporarily suspended on execute node), < = transferring input (or queued to do so), and > = transferring output (or queued to do so)."
- Be sure to know which scheduler your job was submitted to and is running on. You can query all the schedulers with
condor_q
. For instance, if your condor jobs were scheduled from
lpcschedd3.fnal.gov
, then you may use the argument
-name
to query that one alone, for instance:
condor_q -name lpcschedd3.fnal.gov
- Find the jobID of the job you are concerned about with
condor_q
; you can then see why it has the status it has with the command below:
Analyze a condor job
The command to use (example):
condor_q -name lpcschedd3.fnal.gov -analyze jobID
Note that
-analyze
and
-better-analyze
may give misleading results. This is because there are two matchmakers (negotiators) in the pool: one is for T1 pilots from MC production and one is for cmslpc jobs. Therefore not only will more machines be reported than you may have access to, but you may occasionally get a message that the negotiator is "not considering your job for matchmaking" (when that's not actually the case for your job; you just landed on the wrong matchmaker).
- In another case, the job requirements didn't allow the job to run on any machines. It is important to note that, due to the way the lpc cluster is partitioned, individual job slots can use more (or less) memory and disk than a standard amount. Therefore, in a busy cmslpc farm, job requirements for memory and disk will needlessly restrict a job. It is best, unless you need them for certain (and are willing to wait for resources to be available), not to add Requirements.
- Here is an example output from
condor_q -name lpcschedd3.fnal.gov -analyze
where an old condor jdl was used:
[username@cmslpc333 condor]$ condor_q -name lpcschedd3.fnal.gov -analyze 76596545
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
Last successful match: Wed Sep 7 14:25:07 2022
Last failed match: Thu Sep 8 14:32:46 2022
Reason for last match failure: no match found
The Requirements expression for your job is:
    ( ( OpSys == "LINUX" ) && ( Arch != "DUMMY" ) ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )
Suggestions:
    Condition                      Machines Matched    Suggestion
    ---------                      ----------------    ----------
1   ( OpSys == "LINUX" )           0                   REMOVE
2   ( Arch != "DUMMY" )            0                   REMOVE
3   ( TARGET.Memory >= 2500 )      1302
4   ( TARGET.Disk >= 1000000 )     3736
5   ( TARGET.HasFileTransfer )     6504
condor_q -name lpcschedd3.fnal.gov -better-analyze
is a better tool. However,
condor_q -name lpcschedd3.fnal.gov -better-analyze
wouldn't print the error about being over quota. It turned out the user above had extra commands in the condor job file to hold the job if it went over requested memory. Additionally, there were some no-longer-supported requirements being implemented, which
analyze
suggested removing.
- Here is an example where a job was held and condor_q -analyze shows the hold reason (in this case, a disk quota problem):
[username@cmslpc333 condor]$ condor_q -submitter username
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?... @ 09/08/22 15:12:37
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
76596545.0 tonjes 9/8 15:12 0+00:00:00 H 0 0.0 condor_test.sh
[username@cmslpc333 condor]$ condor_q -analyze 76596545.0 -name lpcschedd3.fnal.gov
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
---
76596545.000: Request is held.
Hold reason: Error from slot1@cmswn1898.fnal.gov: STARTER at 131.225.191.192 failed
to send file(s) to <131.225.188.235:9618>; SHADOW at 131.225.188.235 failed to write to file
/uscms_data/d1/username/filename.root: (errno 122) Disk quota exceeded
In the above case, the user's quota on /uscms_data, the NFS disk, was exceeded.
- Here is an example with
condor_q -name lpcschedd3.fnal.gov -better-analyze
, where the user requested a machine with more memory than any machine in the cluster has available:
[username@cmslpc333 testJobRestrict]$ condor_q -name lpcschedd3.fnal.gov -better-analyze 76596545
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
User priority for username@fnal.gov is not available, attempting to analyze without it.
---
76596545.000: Run analysis summary. Of 6076 machines,
6076 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
No successful match recorded.
Last failed match: Tu Jan 22 17:02:40 2019
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
The Requirements expression for your job is:
( TARGET.Arch == "x86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.HasFileTransfer )
Your job defines the following attributes:
RequestDisk = 10000000
RequestMemory = 210000
The Requirements expression for your job reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 6076 TARGET.Arch == "x86_64"
[1] 6076 TARGET.OpSys == "LINUX"
[3] 1718 TARGET.Disk >= RequestDisk
[5] 0 TARGET.Memory >= RequestMemory
Suggestions:
Condition Machines Matched Suggestion
--------- ---------------- ----------
1 ( TARGET.Memory >= 210000 ) 0 MODIFY TO 24028
2 ( TARGET.Disk >= 10000000 ) 1718
3 ( TARGET.Arch == "x86_64" ) 6076
4 ( TARGET.OpSys == "LINUX" ) 6076
5 ( TARGET.HasFileTransfer ) 6076
condor_tail
If you wish to see the end of the stdout of a currently running job, you can use the
condor_tail
command together with the job number. The result will give you ~20 lines of output; it is cut here for space in this example:
condor_tail 76596545.1
== CMSSW: 09-Mar2019 22:26:32 CST Successfully opened file root://cmseos.fnal.gov//store/user/username/myfile.root
== CMSSW: Begin processing the 1st record. Run 1, Event 254, LumiSection 6 at 09-Mar-2019 22:30:12.001 CST
== CMSSW: Begin processing the 101st record. Run 1, Event 353, LumiSection 8 at 09-Mar-2019 22:30:28.673 CST
== CMSSW: Begin processing the 201st record. Run 1, Event 451, LumiSection 10 at 09-Mar-2019 22:30:32.670 CST
Note that you can also query the stderr with the
-stderr
option instead of the default stdout, for instance:
condor_tail -stderr 76596545.1
More options can be found with condor_tail -help
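Two other options that can be useful; this is a sketch, so check condor_tail -help on the cmslpc nodes to confirm they are available in the installed version:
# show a larger chunk of the end of stdout than the default
condor_tail -maxbytes 4096 76596545.1
# keep printing new output as the running job writes it
condor_tail -follow 76596545.1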
condor_ssh
The command
condor_ssh_to_job
and similar ones will not work as condor_ssh is disabled by policy at cmslpc condor servers. Please use other methods to troubleshoot your job.
Querying condor ClassAds
- Querying some or all of the job ClassAds is a useful technique to understand what's wrong with jobs. To find the most information, you can query on the interactive machine the jobs were submitted from with:
condor_q -long JobID
- Another way to find out the reason a job is held from the command line (use the appropriate
-name lpcschedd3.fnal.gov
for your job):
- For all of your jobs on a particular scheduler:
condor_q -af:jh User HoldReason -name lpcschedd3.fnal.gov
- For a single job, put in the jobID, for instance:
condor_q 76596545.0 -af:jh User HoldReason -name lpcschedd3.fnal.gov
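- The same -af (autoformat) query works for other job ClassAds. For example, a hedged sketch to compare requested versus measured resources for your jobs on one scheduler:
condor_q -name lpcschedd3.fnal.gov -af:jh RequestMemory MemoryUsage RequestDisk DiskUsage JobStatus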
What are the system limits for jobs running on the cmslpc condor batch system?
- Time: Jobs are removed for running for more than 2 days, being held for more than 7 days, or being idle for more than 30 days
- Disk: Jobs cannot use more than 40GB of local disk space on the condor worker node
- Memory: The default memory request (if not specified) is 2100 MB; if your job uses more than this, or more than you requested, then it is removed. Go to the Additional condor commands page to learn more about profiling and submitting high memory jobs
- Technical details:
- To find out the complete system default configuration, you can use the command:
condor_config_val -dump -name lpcschedd3.fnal.gov
The most important line is:
SYSTEM_PERIODIC_REMOVE = (JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)) || (JobRunCount > 10) || (JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7) || (DiskUsage > 40000000) || (ResidentSetSize > (RequestMemory * 1000) || (JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30) )
- If your job is removed, you get the following, which states the reason in plainer English:
SYSTEM_PERIODIC_REMOVE_REASON = strcat("LPC job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobUniverse == 5 && JobStatus == 2 && ((time() - EnteredCurrentStatus) > 86400*2)), "job running for more than 2 days.", ifThenElse((JobRunCount > 10), "job restarting more than 10 times.", ifThenElse((JobStatus == 5 && (time() - EnteredCurrentStatus) > 86400*7), "job held for more than 7 days.", ifThenElse((DiskUsage > 40000000), "job exceeding 40GB disk usage.", ifThenElse((ResidentSetSize > (RequestMemory * 1000)), "job exceeding requested memory.", ifThenElse((JobStatus == 1 && (time() - EnteredCurrentStatus) > 86400*30), "idle more than 30 days", "unknown reasons." ) ) ) ) ) ) )
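If your job legitimately needs more than the default memory or disk mentioned above, request it in your condor jdl rather than letting the job be removed. A minimal sketch (the values are placeholders; larger requests may wait longer for a matching slot):
# in the .jdl: request_memory is in MB, request_disk is in KB
request_memory = 4000
request_disk = 10000000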