Condor Refactor - Computing Environment Setup - User Software and Computing

CMS LPC cluster Condor refactor

Since January 22, 2019, the condor refactor has been implemented on most nodes. It is available on:

cmslpc-el9.fnal.gov for Alma9 condor submission
Note: you must obtain your CMS VO proxy on one of these nodes so that the X509_USER_PROXY environment variable is set properly to your home directory. Without it you cannot use condor commands!

The changes users must understand are the first two items below: X509_USER_PROXY and commands needing the -name argument. The remaining changes are ones you may notice; they are documented for your information.

`X509_USER_PROXY` environment variable and your CMS grid certificate required

Jobs and all condor queries now require the user to have a valid grid proxy in the CMS VO.
When you obtain your proxy (voms-proxy-init --valid 192:00 -voms cms), it will be saved in your home directory, where it can be read by the condor job on the worker node.
The condor refactor system automatically uses the following line, so you do NOT need to specify it in your condor.jdl file:

x509userproxy = $ENV(X509_USER_PROXY)

If you have hard-coded a reference to x509userproxy or X509_USER_PROXY pointing at /tmp or some other location, please remove those references.
The CMS LPC CAF system must know about the association between your grid certificate and your FNAL username. This is usually done as part of the Enable EOS area ticket. You must do this at least once for your grid certificate to be associated with your account, which also lets you write to your EOS area from CRAB.

Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
- Choose "CMS Storage Space Request", and select "Enable" under "Action Required"
- It will prompt you for your DN (your DN is the result of voms-proxy-info --identity) and CERN username. Submit that to register your DN. It will take a few hours (during weekdays) for it to propagate everywhere
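
As a quick sanity check for the proxy location advice above, the following sketch flags an X509_USER_PROXY value that is unset or still points under /tmp. The function name proxy_path_ok is hypothetical, and the /tmp test only mirrors the guidance in this section; it is not an official check and does not verify that the proxy itself is valid.

```shell
# Hypothetical helper: succeeds only if the given proxy path is non-empty
# and not located under /tmp (a location this page asks you not to use).
proxy_path_ok() {
    local path="$1"
    case "$path" in
        "")      echo "X509_USER_PROXY is not set" >&2; return 1 ;;
        /tmp/*)  echo "proxy is under /tmp: $path" >&2; return 1 ;;
        *)       return 0 ;;
    esac
}

# Check the current environment and print a hint on failure.
if ! proxy_path_ok "${X509_USER_PROXY:-}"; then
    echo "Run on a cmslpc node: voms-proxy-init --valid 192:00 -voms cms" >&2
fi
```

Keeping the check as a function makes it easy to call at the top of a submission script before any condor command runs.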

Commands requiring `-name` argument

Commands that need to "read" or "write" to a specific scheduler require -name, for example (use the particular scheduler for your jobs):

condor_hold -name lpcschedd3.fnal.gov -all
condor_release -name lpcschedd3.fnal.gov -all
condor_tail -name lpcschedd3.fnal.gov 76596545.0
condor_qedit -name lpcschedd3.fnal.gov 76596545.0 OriginalCpus 1
condor_rm -name lpcschedd3.fnal.gov 76596545

If you want to remove all of your jobs from all of the schedulers, you would have to do the following:

condor_rm -name lpcschedd3.fnal.gov -all; condor_rm -name lpcschedd4.fnal.gov -all; condor_rm -name lpcschedd5.fnal.gov -all

You can configure similar commands for other needs.
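
One way to avoid retyping the scheduler names is a small loop, sketched below. The function name on_all_schedds and the DRY_RUN switch are hypothetical, and the scheduler list is simply copied from the examples above, so it is an assumption that may not match the current pool. Because commands such as condor_rm -all are destructive, the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
# Scheduler names copied from the examples above (assumed, may change).
SCHEDDS="lpcschedd3.fnal.gov lpcschedd4.fnal.gov lpcschedd5.fnal.gov"

# Hypothetical helper: apply one condor command to every scheduler.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
on_all_schedds() {
    local cmd="$1"; shift
    local schedd
    for schedd in $SCHEDDS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd -name $schedd $*"
        else
            "$cmd" -name "$schedd" "$@"
        fi
    done
}

# Preview what would be run for removing all of your jobs.
on_all_schedds condor_rm -all
```

Once the printed commands look right, `DRY_RUN=0 on_all_schedds condor_rm -all` runs them for real.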
Extra information:

CMS LPC cluster Condor refactor background

The condor refactor provides the following improved functionality:

Jobs can be submitted from the cmslpc-el9.fnal.gov nodes
Jobs run in Apptainer/Singularity containers on the worker node. Condor will sense the Operating System (OS) of the submission node and automatically set jobs to run on the same OS on the worker nodes. However, if you wish to specify the OS yourself, you can add a line to the condor jdl, as described under "Scientific Linux 7 or other operating system jobs" below
Nodes available in the cmslpc condor pool can be seamlessly switched from T1_US_FNAL worker nodes to T3_US_FNALLPC batch nodes (accessible via condor or CRAB3 to T3_US_FNALLPC)
Users no longer log into the condor scheduler nodes (schedd), so if a particular interactive node needs to be rebooted, or is inaccessible due to technical problems, a user's condor jobs are not affected
A wrapper script sends your condor jobs to the most advantageous schedd
Each schedd can have up to 10,000 condor jobs running in parallel, should that many job slots be available to it

Scientific Linux 7 or other operating system jobs

Note: Updated Apptainer documentation to run Scientific Linux 6 jobs can be found at the main condor batch systems page.

Condor jobs submitted from the Alma9 nodes (cmslpc-el9.fnal.gov) should automatically run in Alma9 containers.

To choose SL7 containers, add the following line to your job description file (e.g., condor.jdl):

+DesiredOS="SL7"

If you are re-purposing condor scripts used in CMS Connect that use +REQUIRED_OS = "rhel7", you can add a line to take advantage of that in the cmslpc condor refactor:

To find out what OS your job is running under, for example for job 76596545 from schedd lpcschedd3:

condor_q -name lpcschedd3 76596545 -af:h desiredos

condor_submit changes

You will continue to use condor_submit condor.jdl (example) as you did before, but the output will change:

[username@cmslpc333 ~/]$ condor_submit multiplication-random.jdl
Querying the CMS LPC pool and trying to find an available schedd...

Attempting to submit jobs to lpcschedd3.fnal.gov

Submitting job(s)...............
15 job(s) submitted to cluster 15.

If condor_submit fails, pass -debugfile to capture a debug log. The log can be sent in an LPC Service Portal ticket to help diagnose the problem. Example:

[username@cmslpc333 ~]$ condor_submit multiplication-random.jdl -debugfile /tmp/username_date.log
Querying the CMS LPC pool and trying to find an available schedd...

Attempting to submit jobs to lpcschedd3.fnal.gov

Submitting job(s)
ERROR: Failed to connect to local queue manager
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:24).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using FS

[username@cmslpc333 ~]$ ls -al /tmp/username_date.log
-rw-r--r-- 1 username us_cms 2412 Nov 19 13:55 /tmp/username_date.log

Note that in this case it prompts you to do grid-proxy-init, but you should do voms-proxy-init --valid 192:00 -voms cms to get a proxy in the CMS VO
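
The debug-log step above can be wrapped in a small helper so that the log name is predictable when you attach it to a ticket. This is only a sketch: the function names debug_log_path and submit_with_debug are hypothetical, and the /tmp/username_date.log naming simply follows the example above.

```shell
# Hypothetical helper: build a log path of the form /tmp/<user>_<date>.log.
debug_log_path() {
    local user="$1" stamp="$2"
    echo "/tmp/${user}_${stamp}.log"
}

# Hypothetical wrapper: submit a jdl file and keep the debug log.
# Only -debugfile, as shown above, is passed to condor_submit.
submit_with_debug() {
    local jdl="$1"
    local log
    log=$(debug_log_path "$(whoami)" "$(date +%Y%m%d)")
    echo "Debug log will be written to $log"
    condor_submit "$jdl" -debugfile "$log"
}
```

A typical call would be `submit_with_debug multiplication-random.jdl`, after which the printed log path can be attached to the ticket.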

condor_q differences

condor_q and condor_q -allusers will now report results from all schedulers.
If a shorter report is needed, a user can use condor_q -batch
Note: at this time, condor_status -submitters will report jobs running both at the cmslpc condor cluster as well as the T1_US_FNAL cluster.
If none of your jobs are running on any scheduler, condor_q will return 0 jobs for each scheduler.

JobID differences

Users will note that their condor JobIDs fall in a different range depending on the scheduler. This keeps JobIDs unique when different jobs land on different schedulers.

Example: the JobID is available in a condor jdl through the variable $(Cluster), typically used in log file names, for instance: TestNode_$(Cluster)_$(Process).stdout

See the Condor v8.7 manual for the variables Cluster and Process.

Commands have wrappers

Be aware that many commands have wrapper scripts to allow communication with all the schedulers. You may see, for instance, a different executable location for a command such as condor_tail when doing condor_tail --help compared to which condor_tail.

Other differences users have found

`$USER` environment variable

The $USER environment variable is no longer set on the worker nodes. To have access to it, place the following into your condor script (for instance a bash script condor.sh):

export USER=$(whoami)

Users may notice a longer delay (compared to before the refactor) to get reports from commands, no more than a minute, as commands are querying all the schedulers

getenv in condor jdl

Do NOT use getenv = true in your condor.jdl (the file you pass to condor_submit).

This is dangerous for a number of reasons. One is that your environment will depend on CMSSW paths on NFS disks like /uscms_data/d1, which are not mounted on the condor worker nodes. The job will then look for libraries elsewhere and may pick them up from a place you don't expect, like the cvmfs CMSSW release without any of your custom compiled libraries.
Interactive environment variables which are NOT needed on a worker node do not work well with Docker containers. We've already discovered and fixed the LS_COLORS tcsh environment variable, but there may be more.
If your scripts worked before with this setting, you will now have to set up your working environment on the worker node yourself. Here is just one example, for a bash script wrapper.sh:

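#!/bin/bash -e
# Select the architecture before setting up the CMSSW environment
export SCRAM_ARCH=slc7_amd64_gcc630
ls -lrth
source /cvmfs/cms.cern.ch/cmsset_default.sh
# Create a fresh CMSSW area on the worker node and enter it
eval `scramv1 project CMSSW CMSSW_10_1_11`
cd CMSSW_10_1_11
ls -lrth
# Load the CMSSW runtime environment
eval `scramv1 runtime -sh`
# Copy the script named by the first argument and run it
cp "../$1" run.sh
chmod u+x run.sh
./run.sh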
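
To find job description files that still set getenv = true, a simple pattern search is usually enough. The sketch below rests on assumptions: the function name jdl_uses_getenv is hypothetical, and it only matches the literal form getenv = true (any letter case, optional spaces), not every way a jdl might pass the submit environment along.

```shell
# Hypothetical helper: succeeds if the jdl file contains "getenv = true".
jdl_uses_getenv() {
    grep -Eiq '^[[:space:]]*getenv[[:space:]]*=[[:space:]]*true' "$1"
}

# Report every jdl file in the current directory that still uses it.
for jdl in *.jdl; do
    [ -e "$jdl" ] || continue
    if jdl_uses_getenv "$jdl"; then
        echo "$jdl: remove 'getenv = true' and set up the environment in your wrapper"
    fi
done
```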

EOS fuse mount

EOS fuse mount is not on the condor refactor schedulers

This is a good thing. Please review the "Things you shouldn't do" directions on the EOS page.
Using the EOS fuse mount in batch, or interactively for processing and similar work, will stall or bring down FNAL EOS. Use xrootd to interact with it instead.
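
When converting scripts away from the fuse mount, it can help to translate paths mechanically. The sketch below makes two assumptions that are not stated on this page and should be checked against the EOS page: that the fuse mount prefix is /eos/uscms and that the xrootd redirector is root://cmseos.fnal.gov. The function name eos_to_xrootd is hypothetical.

```shell
# Assumed values, not taken from this page: verify them on the EOS page.
EOS_FUSE_PREFIX="/eos/uscms"
EOS_REDIRECTOR="root://cmseos.fnal.gov"

# Hypothetical helper: turn a fuse-mount path into an xrootd URL.
eos_to_xrootd() {
    local path="$1"
    local rest
    case "$path" in
        "$EOS_FUSE_PREFIX"/*)
            # Strip the mount prefix; the remainder keeps its leading slash.
            rest="${path#$EOS_FUSE_PREFIX}"
            echo "${EOS_REDIRECTOR}/${rest}"
            ;;
        *)
            echo "not an EOS fuse path: $path" >&2
            return 1
            ;;
    esac
}

# Example use with xrdcp (commented out so the sketch has no side effects):
# xrdcp "$(eos_to_xrootd /eos/uscms/store/user/username/input.root)" .
```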

Condor sends emails to users

If you have any of the following in your condor.jdl file, you will get emails about your jobs (note that the old system didn't email even if you had this configured!):

Notification = Always
Notification = Complete
Notification = Error

Get help to modify your workflow

Contact LPC Computing support or email the lpc-howto community support mailing list to get help modifying your workflow.

Troubleshooting the condor refactor and condor in general

Follow this link (also in the side menu): Troubleshooting Condor Batch System.
