
CMS LPC cluster Condor refactor

Implemented on most nodes on January 22, 2019. The condor refactor is implemented on:
  • cmslpc-sl7.fnal.gov for SL7 condor submission
  • Note: you must obtain your CMS VO proxy on one of these nodes to have the X509_USER_PROXY environment variable set properly to your home directory and use condor commands!

Users must understand the changes described in the first two items below: X509_USER_PROXY and commands needing the -name argument. The remaining changes are ones you may notice, and they are documented for your information.

X509_USER_PROXY environment variable and your CMS grid certificate required

  • Jobs and all condor queries now require the user to have a valid grid proxy in the CMS VO
  • When you obtain your proxy (voms-proxy-init --valid 192:00 -voms cms), it will be saved in your home directory where it can be read by the condor job on the worker node
  • The condor refactor system will automatically use the following line, and you do NOT need to specify it in your condor.jdl file. It should work without it.
    • x509userproxy = $ENV(X509_USER_PROXY)
  • If you have hard-coded x509userproxy or X509_USER_PROXY to /tmp or some other location, please remove those references
  • The CMS LPC CAF system must know about the association between your grid certificate and your FNAL username. This is usually done as part of the "Enable EOS area" ticket. You must do this at least once for your grid certificate to be associated with your account, which also lets you write to your EOS area from CRAB.
    • Go to the LPC Service Portal: https://fermi.servicenowservices.com/lpc
      • Choose "CMS Storage Space Request", and select "Enable" under "Action Required"
      • It will prompt you for your DN (your DN is the result of voms-proxy-info --identity) and CERN username. Submit that to register your DN. It will take a few hours (during weekdays) for it to propagate everywhere
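
As a quick sanity check before submitting, the proxy-location requirement above can be sketched in bash. The function name proxy_in_home is hypothetical (it is not a condor or LPC tool); it only tests whether a path sits under a home directory, on the assumption stated above that the refactor expects X509_USER_PROXY to point there. It does not check that the proxy is valid or in the CMS VO.

```shell
#!/bin/bash
# Hypothetical helper (not an LPC or condor tool): succeed only if the
# given proxy path is located under the given home directory.
proxy_in_home() {
    local proxy_path="$1"
    local home_dir="${2:-${HOME:-}}"
    # An empty or unset path can never be valid
    [ -n "$proxy_path" ] || return 1
    case "$proxy_path" in
        "$home_dir"/*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Example use on an interactive node:
if proxy_in_home "${X509_USER_PROXY:-}"; then
    echo "X509_USER_PROXY points into your home directory"
else
    echo "X509_USER_PROXY is unset or outside your home directory" >&2
fi
```

If the check fails because of a hard-coded /tmp location, removing that setting and obtaining a new proxy on one of the nodes listed above is the fix this page describes.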

Commands requiring -name argument

Commands that need to "read" or "write" to a specific scheduler require -name, for example (use the particular scheduler for your jobs):
  • condor_hold -name lpcschedd3.fnal.gov -all
  • condor_release -name lpcschedd3.fnal.gov -all
  • condor_tail -name lpcschedd2.fnal.gov 30000041.0
  • condor_qedit -name lpcschedd1.fnal.gov 14.0 OriginalCpus 1
  • condor_rm -name lpcschedd3.fnal.gov 60000042
If you want to remove all your jobs from all the schedulers, you would have to do the following:
condor_rm -name lpcschedd1.fnal.gov -all; condor_rm -name lpcschedd2.fnal.gov -all; condor_rm -name lpcschedd3.fnal.gov -all
You can configure similar commands for other needs.
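
One way to configure such commands is a small loop over the three scheduler names used on this page. This is a minimal sketch: on_all_schedds and the DRY_RUN variable are hypothetical names introduced here, not part of condor. By default it only prints the commands it would run; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/bash
# Scheduler names as listed in this document.
SCHEDDS="lpcschedd1.fnal.gov lpcschedd2.fnal.gov lpcschedd3.fnal.gov"

# Hypothetical helper: run one condor command against every scheduler.
# With DRY_RUN=1 (the default here) it only prints what it would run.
on_all_schedds() {
    local cmd="$1"; shift
    local schedd
    for schedd in $SCHEDDS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd -name $schedd $*"
        else
            "$cmd" -name "$schedd" "$@"
        fi
    done
}

# Prints the three condor_rm commands shown in the text above.
on_all_schedds condor_rm -all
```

The same helper works for the other commands in this section, for example on_all_schedds condor_hold -all.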
Extra information:

CMS LPC cluster Condor refactor background

The condor refactor provides the following improved functionality:
  • Jobs can be submitted from SL7 (cmslpc-sl7.fnal.gov) nodes
  • Jobs run in Docker containers on the worker node. Condor will sense the Operating System (OS) of the submission node and automatically set jobs to run on the same OS on the worker nodes. However, if you wish to specify the OS yourself, you can add a +DesiredOS line to your condor jdl, as described in the "Scientific Linux 7 or Scientific Linux 6 jobs" section of this page
  • Nodes available in the cmslpc condor pool can be seamlessly switched from T1_US_FNAL worker nodes to T3_US_FNALLPC batch nodes (accessible via condor or CRAB3 to T3_US_FNALLPC)
  • Condor scheduler nodes (schedd) are no longer logged into by users, so if a particular interactive node needs to be rebooted, or is inaccessible due to technical problems, a user's condor jobs are not affected
  • A wrapper script sends your condor jobs to the most advantageous schedd
  • Each schedd can run up to 10,000 condor jobs in parallel, should that many job slots be available to it

Scientific Linux 7 or Scientific Linux 6 jobs

Note: Updated SL6 Singularity documentation for running Scientific Linux 6 jobs can be found on the main condor batch systems page.
Condor jobs submitted from the Scientific Linux 7 nodes (cmslpc-sl7.fnal.gov) should automatically run in SL7 containers.
  • To choose SL7 Docker containers, add the following line to your job description file (i.e., condor.jdl)
    • +DesiredOS="SL7"
  • If you are re-purposing condor scripts used in CMS Connect that set +REQUIRED_OS = "rhel7", you can add a line to take advantage of that in the cmslpc condor refactor:
      +DesiredOS = REQUIRED_OS
  • To understand what OS your job is running under, for example for job 30000093 from schedd lpcschedd2:
    • condor_q -name lpcschedd2 30000093 -af:h desiredos

    condor_submit changes

    • You will continue to use condor_submit condor.jdl (example) as you did before, but the output will change:
        [username@cmslpc42 ~/]$ condor_submit multiplication-random.jdl
        Querying the CMS LPC pool and trying to find an available schedd...
        Attempting to submit jobs to lpcschedd2.fnal.gov
        Submitting job(s)...............
        15 job(s) submitted to cluster 15.
    • If condor_submit fails, the user is advised to pass a -debugfile to capture the debug log. This can be sent in an LPC Service Portal ticket to help understand problems. Example:
      • [username@cmslpc42 ~]$ condor_submit multiplication-random.jdl -debugfile /tmp/username_date.log
        Querying the CMS LPC pool and trying to find an available schedd...
        Attempting to submit jobs to lpcschedd2.fnal.gov
        Submitting job(s)
        ERROR: Failed to connect to local queue manager
        AUTHENTICATE:1003:Failed to authenticate with any method
        AUTHENTICATE:1004:Failed to authenticate using GSI
        GSI:5003:Failed to authenticate.  Globus is reporting error (851968:24).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
        AUTHENTICATE:1004:Failed to authenticate using FS
        [username@cmslpc42 ~]$ ls -al /tmp/username_date.log
        -rw-r--r-- 1 username us_cms 2412 Nov 19 13:55 /tmp/username_date.log
      • Note that in this case it prompts you to do grid-proxy-init, but you should do voms-proxy-init --valid 192:00 -voms cms to get a proxy in the CMS VO

    condor_q differences

    • condor_q and condor_q -allusers will now report results from all schedulers.
    • If a shorter report is needed, a user can use condor_q -batch
    • Note: at this time, condor_status -submitters will report jobs running both at the cmslpc condor cluster as well as the T1_US_FNAL cluster.
    • If there are no jobs running on any scheduler, condor_q will not return anything

    JobID differences

    • Users will note that their condor JobIDs have a different range depending on the scheduler. This allows for unique jobID should different jobs land on different schedulers
      • Example: JobID is obtained in a condor jdl with the variable $(Cluster), usually used with log files, for instance: TestNode_$(Cluster)_$(Process).stdout
    • Condor v8.7 manual for the variables Cluster and Process
    • The schedulers are set up with the following JobID ranges:
      • lpcschedd1.fnal.gov: 1 - 29999999
      • lpcschedd2.fnal.gov: 30000000 - 59999999
      • lpcschedd3.fnal.gov: 60000000 - 89999999
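
Because the ranges do not overlap, a JobID is enough to work out which scheduler to pass to -name. The following is a minimal sketch of that lookup, using only the ranges listed above; schedd_for_jobid is a hypothetical helper name, not a condor command.

```shell
#!/bin/bash
# Hypothetical helper: print the scheduler that owns a given JobID,
# using the ranges listed above. Accepts "Cluster" or "Cluster.Process".
schedd_for_jobid() {
    local cluster="${1%%.*}"          # drop the ".Process" suffix, if any
    case "$cluster" in
        ''|*[!0-9]*) echo "not a JobID: $1" >&2; return 1 ;;
    esac
    if   [ "$cluster" -ge 1 ]        && [ "$cluster" -le 29999999 ]; then
        echo "lpcschedd1.fnal.gov"
    elif [ "$cluster" -ge 30000000 ] && [ "$cluster" -le 59999999 ]; then
        echo "lpcschedd2.fnal.gov"
    elif [ "$cluster" -ge 60000000 ] && [ "$cluster" -le 89999999 ]; then
        echo "lpcschedd3.fnal.gov"
    else
        echo "JobID $1 is outside the documented ranges" >&2
        return 1
    fi
}

schedd_for_jobid 30000041.0   # prints lpcschedd2.fnal.gov
```

It could then be combined with the commands that require -name, for instance condor_tail -name "$(schedd_for_jobid 30000041.0)" 30000041.0.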

    Commands have wrappers

    Be aware that many commands have wrapper scripts to allow communication with all the schedulers. You may see, for instance, a different executable location for a command such as condor_tail when doing condor_tail --help compared to which condor_tail.

    Other differences users have found

      $USER environment variable

    • The $USER environment variable is no longer set on the worker nodes. To have access to it, place the following in your condor script (for instance the bash script condor.sh):
      • export USER=$(whoami)
      • Users may notice a longer delay (compared to before the refactor) to get reports from commands, no more than a minute, as commands are querying all the schedulers
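
For the $USER note above, a slightly more defensive variant sets the variable only when it is missing, so the same script behaves identically on interactive nodes and worker nodes. This is a sketch, assuming the id command is available in the job's container; id -un prints the same name as whoami.

```shell
#!/bin/bash
# Set USER only when the worker node has not provided it.
# `id -un` prints the effective user name, the same value as `whoami`.
if [ -z "${USER:-}" ]; then
    USER="$(id -un)"
    export USER
fi
echo "Running as: $USER"
```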

      getenv in condor jdl

    • Do NOT use getenv = true in your condor.jdl (the file you pass to condor_submit).
      • This is dangerous for a number of reasons. One is that your environment will depend on CMSSW paths on NFS disks like /uscms_data/d1, which are not mounted on the condor worker nodes. The job will then look for libraries elsewhere and may pick them up from a place you don't expect, like cvmfs CMSSW without any of your custom compiled libraries
      • Interactive environment variables which are NOT needed on a worker node do not work well with Docker containers. We've already discovered and fixed the LS_COLORS tcsh environment variable, but there may be more
      • If your scripts worked before with this, you will now find you have to set up your working environment on the worker node. Here is just one example for a bash script wrapper.sh:
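      • #!/bin/bash -e
        # Select the architecture that matches the CMSSW release below
        export SCRAM_ARCH=slc7_amd64_gcc630
        ls -lrth
        # Set up the CMS environment from cvmfs
        source /cvmfs/cms.cern.ch/cmsset_default.sh
        # Create a fresh CMSSW area on the worker node and enter it
        eval `scramv1 project CMSSW CMSSW_10_1_11`
        cd CMSSW_10_1_11
        ls -lrth
        # Load the CMSSW runtime environment (the equivalent of cmsenv)
        eval `scramv1 runtime -sh`
        # Copy the script named by the wrapper's argument from the job's starting directory
        cp ../$@ run.sh
        chmod u+x run.sh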

      EOS fuse mount

    • EOS fuse mount is not on the condor refactor schedulers
      • This is a good thing. Please review the "Things you shouldn't do" directions on the EOS page
      • Using the EOS fuse mount in batch, interactively for processing, and in similar ways will stall or bring down FNAL EOS. You should use xrootd to interact with it instead
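
To illustrate replacing a fuse-mount path with an xrootd one, here is a minimal sketch. It rests on two assumptions that are not stated on this page: that FNAL EOS is reachable over xrootd at root://cmseos.fnal.gov, and that the fuse mount point is /eos/uscms. Check both against the EOS page before relying on them. The helper name eos_to_xrootd is hypothetical.

```shell
#!/bin/bash
# ASSUMPTION: FNAL EOS is served over xrootd at root://cmseos.fnal.gov and
# the fuse mount point is /eos/uscms. Neither is stated in this document.
EOS_REDIRECTOR="root://cmseos.fnal.gov"
EOS_MOUNT="/eos/uscms"

# Hypothetical helper: turn a fuse-mount path into an xrootd URL.
eos_to_xrootd() {
    local path="$1"
    case "$path" in
        "$EOS_MOUNT"/*) echo "${EOS_REDIRECTOR}/${path#"$EOS_MOUNT"}" ;;
        *) echo "not under $EOS_MOUNT: $path" >&2; return 1 ;;
    esac
}

# The placeholder path below is for illustration only.
eos_to_xrootd /eos/uscms/store/user/username/file.root
```

The resulting URL can be passed to an xrootd client such as xrdcp inside a job, instead of reading the file through the fuse mount.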

    Condor sends emails to users

    • If you have any of the following in your condor.jdl file, you will get emails about your jobs (note that the old system didn't email even if you had this configured!):
      • Notification = Always
      • Notification = Complete
      • Notification = Error

    Get help to modify your workflow

    Contact LPC Computing support or email the lpc-howto community support mailing list to get help modifying your workflow.

    Troubleshooting the condor refactor and condor in general

    Follow this link (also in the side menu): Troubleshooting Condor Batch System.
    Webmaster | Last modified: Tuesday, 08-Sep-2020 13:29:18 CDT