CMS LPC cluster Condor refactor
Implemented on most nodes: January 22, 2019. The condor refactor is implemented on:
- cmslpc-el9.fnal.gov for Alma9 condor submission
- Note: you must obtain your CMS VO proxy on one of these nodes to have the X509_USER_PROXY environment variable set properly to your home directory and to use condor commands!
The required changes for users to understand are the first two items below: the X509_USER_PROXY environment variable and commands needing the -name argument. The remaining changes are ones you may notice; they are documented for your information.
X509_USER_PROXY environment variable and your CMS grid certificate required
- Jobs and all condor queries now require the user to have a valid grid proxy in the CMS VO (see the example commands after this list)
- When you obtain your proxy (voms-proxy-init --valid 192:00 -voms cms), it will be saved in your home directory, where it can be read by the condor job on the worker node
- The condor refactor system will automatically use the following line, and you do NOT need to specify it in your condor.jdl file; it should work without it:
x509userproxy = $ENV(X509_USER_PROXY)
- If you have hard-coded your x509userproxy or X509_USER_PROXY reference to /tmp or some other location, please remove those references
- The CMS LPC CAF system must know the association between your grid certificate and your FNAL username. This is usually done as part of the Enable EOS area ticket. You must do this at least once for your grid certificate to be associated with your account, which also lets you write to your EOS area from CRAB.
- Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
- Do "CMS Storage Space Request", and select "Enable" under "Action Required"
- It will prompt you for your DN (your DN is the result of voms-proxy-info --identity) and CERN username. Submit that to register your DN. It will take a few hours (during weekdays) for it to propagate everywhere
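For example, a typical session on one of these nodes before submitting jobs might look like the sketch below (the long validity is simply the recommended 192 hours from above; output is not shown):
# Obtain a CMS VO proxy; on the refactored nodes it is written to your home directory
voms-proxy-init --valid 192:00 -voms cms
# Confirm where the proxy lives and how long it is valid;
# X509_USER_PROXY should point into your home directory, not /tmp
echo $X509_USER_PROXY
voms-proxy-info --path
voms-proxy-info --timeleft
# The DN to register in the LPC Service Portal
voms-proxy-info --identity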
Commands requiring the -name argument
Commands that need to "read" or "write" to a specific scheduler require -name, for example (use the particular scheduler for your jobs):
condor_hold -name lpcschedd3.fnal.gov -all
condor_release -name lpcschedd3.fnal.gov -all
condor_tail -name lpcschedd3.fnal.gov 76596545.0
condor_qedit -name lpcschedd3.fnal.gov 76596545.0 OriginalCpus 1
condor_rm -name lpcschedd3.fnal.gov 76596545
If you want to remove all your jobs on all the schedulers, you would have to do the following:
condor_rm -name lpcschedd3.fnal.gov -all; condor_rm -name lpcschedd4.fnal.gov -all; condor_rm -name lpcschedd5.fnal.gov -all
You can configure similar commands for other needs; a sketch of a helper that loops over all three schedulers follows below.
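As an illustration only (not an official LPC tool), a small bash wrapper along the following lines runs one condor command against each of the three schedulers; the script name all_schedds.sh is a placeholder:
#!/bin/bash
# all_schedds.sh -- run one condor command against every LPC schedd
# Usage example: ./all_schedds.sh condor_rm -all
CMD="$1"
shift
for SCHEDD in lpcschedd3.fnal.gov lpcschedd4.fnal.gov lpcschedd5.fnal.gov; do
    echo "=== ${SCHEDD} ==="
    "${CMD}" -name "${SCHEDD}" "$@"
done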
Extra information:
CMS LPC cluster Condor refactor background
The condor refactor provides the following improved functionality:
- Jobs can be submitted from (cmslpc-el9.fnal.gov) nodes
- Jobs run in Apptainer/Singularity containers on the worker node. Condor will sense the Operating System (OS) of the submission node and automatically set jobs to run on the same OS on the worker nodes. However, if you wish to specify, you can set (in the condor jdl) this additional line:
+DesiredOS="SL7"
- The schedulers (schedd) are no longer logged into by users, so if a particular interactive node needs to be rebooted, or is inaccessible due to technical problems, a user's condor jobs are not affected

Scientific Linux 7 or other operating system jobs
Note: Updated Apptainer documentation to run Scientific Linux 6 jobs can be found at the main condor batch systems page.
- Condor jobs submitted from the Alma9 nodes (cmslpc-el9.fnal.gov) should automatically run in Alma9 containers.
- To choose SL7 containers, add the following line to your job description file (i.e., condor.jdl):
+DesiredOS="SL7"
- If you are re-purposing condor scripts used in CMS Connect, where you use +REQUIRED_OS = "rhel7", you can add a line to take advantage of that in the cmslpc condor refactor:
+REQUIRED_OS="rhel7"
+DesiredOS = REQUIRED_OS
- To query the DesiredOS of a job, for instance job 376596545 from schedd lpcschedd3:
condor_q -name lpcschedd3 76596545 -af:h desiredos
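Putting the above together, a minimal job description file might look like the sketch below; the file name condor.jdl and the executable sleep.sh are placeholders, and the +DesiredOS line is only needed if you want to override the automatic OS matching described earlier:
# condor.jdl -- minimal sketch, not an official LPC template
universe                = vanilla
executable              = sleep.sh
arguments               = 60
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = sleep.stdout
error                   = sleep.stderr
log                     = sleep.log
# Optional: force SL7 containers instead of matching the submission node OS
+DesiredOS = "SL7"
# Note: x509userproxy is picked up automatically from $ENV(X509_USER_PROXY); do not hard code it
queue 1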
condor_submit changes
- You will continue to use condor_submit condor.jdl (example) as you did before, but the output will change:
[username@cmslpc333 ~/]$ condor_submit multiplication-random.jdl
Querying the CMS LPC pool and trying to find an available schedd...
Attempting to submit jobs to lpcschedd3.fnal.gov
Submitting job(s)...............
15 job(s) submitted to cluster 15.
- If condor_submit fails, the user is advised to pass a -debugfile argument to capture the debug log. This can be sent in an LPC Service Portal ticket to help understand problems. Example:
[username@cmslpc333 ~]$ condor_submit multiplication-random.jdl -debugfile /tmp/username_date.log
Querying the CMS LPC pool and trying to find an available schedd...
Attempting to submit jobs to lpcschedd3.fnal.gov
Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:24).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using FS
[username@cmslpc333 ~]$ ls -al /tmp/username.log
-rw-r--r-- 1 username us_cms 2412 Nov 19 13:55 /tmp/username.log
- Note: the above error mentions grid-proxy-init, but you should do voms-proxy-init --valid 192:00 -voms cms to get a proxy in the CMS VO
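Since the failure above is usually caused by a missing or expired proxy, a simple (unofficial) guard before submitting might look like this sketch, which renews the proxy when less than roughly 8 hours remain:
# Renew the CMS VO proxy if it has less than ~8 hours (28800 s) left, then submit
LEFT=$(voms-proxy-info --timeleft 2>/dev/null || echo 0)
if [ "${LEFT:-0}" -lt 28800 ]; then
    voms-proxy-init --valid 192:00 -voms cms
fi
condor_submit condor.jdl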
condor_q differences
- condor_q and condor_q -allusers will now report results from all schedulers.
- If a shorter report is needed, a user can use condor_q -batch
- Note: at this time, condor_status -submitters will report jobs running both at the cmslpc condor cluster as well as the T1_US_FNAL cluster.
- If there are none of your jobs running on any scheduler, condor_q will return 0 jobs for each scheduler.
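For reference, here are the variants mentioned above as you would type them on an interactive node; none of these need -name, since they aggregate over all schedulers:
condor_q                   # your jobs, summed over all schedulers
condor_q -allusers         # all users' jobs
condor_q -batch            # shorter, batch-style report
condor_status -submitters  # currently also includes T1_US_FNAL jobs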
JobID differences
- Users will note that their condor JobIDs have a different range depending on the scheduler. This allows for unique JobIDs should different jobs land on different schedulers
- Example: the JobID is obtained in a condor jdl with the variable $(Cluster), usually used with log files, for instance TestNode_$(Cluster)_$(Process).stdout (see the sketch after this list)
- Condor v8.7 manual for the variables Cluster and Process
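As an illustration (the TestNode_ prefix comes from the example above; otherwise the names are placeholders), the usual pattern is to build the output, error, and log file names from $(Cluster) and $(Process) so jobs landing on different schedulers never overwrite each other:
# Fragment of a condor.jdl using the Cluster and Process variables
output = TestNode_$(Cluster)_$(Process).stdout
error  = TestNode_$(Cluster)_$(Process).stderr
log    = TestNode_$(Cluster)_$(Process).log
queue 10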
Commands have wrappers
Be aware that many commands have wrapper scripts to allow communication with all the schedulers. You may see, for instance, a different executable location for a command such as condor_tail when doing condor_tail --help compared to which condor_tail.
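To see this for yourself (the actual paths depend on the installation and are not reproduced here):
# Compare the wrapper location in your PATH with the executable the help output reports
which condor_tail
condor_tail --help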
Other differences users have found
$USER environment variable
- The $USER environment variable is no longer set on the worker nodes. To have access to it, place the following into your condor script (for instance a bash script condor.sh):
export USER=$(whoami)
getenv in condor jdl
- Some users set getenv = true in their condor.jdl (the file you condor_submit).
- This is dangerous for a number of reasons. One is that your environment will depend on CMSSW paths on NFS disks like /uscms_data/d1, which are not mounted on the condor worker nodes; the job will therefore look for libraries and may pick them up from a place you don't expect, like the cvmfs CMSSW release without any of your custom compiled libraries.
- Interactive environment variables which are NOT needed on a worker node do not work well with Docker containers. We've already discovered and fixed the LS_COLORS tcsh environment variable, but there may be more.
- If your scripts worked before with this, you will now find you have to set up your working environment on the worker node. Here is just one example for a bash script wrapper.sh:
#!/bin/bash -e
export SCRAM_ARCH=slc7_amd64_gcc630
ls -lrth
source /cvmfs/cms.cern.ch/cmsset_default.sh
eval `scramv1 project CMSSW CMSSW_10_1_11`
cd CMSSW_10_1_11
ls -lrth
eval `scramv1 runtime -sh`
cp ../$@ run.sh
chmod u+x run.sh
./run.sh
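For context, such a wrapper is typically referenced from the submit file along the lines of the hypothetical fragment below; analyze.sh is a placeholder for your actual payload script, passed both as an argument and as a transferred input file so that the cp ../$@ run.sh line above can find it:
# Hypothetical condor.jdl fragment using wrapper.sh
executable              = wrapper.sh
arguments               = analyze.sh
transfer_input_files    = analyze.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 1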
EOS fuse mount
- The EOS fuse mount is not available on the condor worker nodes. This is a good thing. Please review the Things you shouldn't do directions on the EOS page
- Using the EOS fuse mount in batch, or interactively for processing, and things like that can stall or bring down FNAL EOS; you should use xrootd to interact with it instead (a sketch follows below)
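For example, instead of reading or writing through the fuse mount, use xrootd tools; the redirector cmseos.fnal.gov and the /store/user area below follow the usual FNAL EOS conventions, and output.root is a placeholder:
# Copy a job output file to your EOS area via xrootd (assumes USER is set,
# e.g. via export USER=$(whoami) as described above)
xrdcp -f output.root root://cmseos.fnal.gov//store/user/${USER}/output.root
# List the destination directory
xrdfs root://cmseos.fnal.gov ls /store/user/${USER}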
Condor sends emails to users
- If you have any of the following in your condor.jdl file, you will get emails about your jobs (note that the old system didn't email even if you had this configured!):
Notification = Always
Notification = Complete
Notification = Error
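Conversely, if you do not want these emails, the standard HTCondor setting (not specific to the LPC refactor) is:
Notification = Never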