
CMS LPC cluster Condor refactor


Test Phase: As of December 4, 2018, the condor refactor is implemented on:
  • cmslpc37.fnal.gov for SL6 condor submission (Note the node is removed from the cmslpc-sl6.fnal.gov round robin)
  • cmslpc-sl7.fnal.gov for SL7 condor submission
  • Note: you must obtain your CMS VO proxy on one of these nodes to have the X509_USER_PROXY environment variable set properly to your home directory

CMS LPC cluster Condor refactor background


The condor refactor provides the following improved functionality:
  • Jobs can be submitted from SL6 (cmslpc-sl6.fnal.gov - see note above about cmslpc37 in test phase) or SL7 (cmslpc-sl7.fnal.gov) nodes
  • Jobs run in Docker containers on the worker node. Condor will sense the Operating System (OS) of the submission node and automatically set jobs to run on the same OS on the worker nodes. However, if you wish to specify the OS explicitly, you can add this line to the condor jdl:
      +DesiredOS="SL7"
  • Nodes available in the cmslpc condor pool can be seamlessly switched from T1_US_FNAL worker nodes to T3_US_FNALLPC batch nodes (accessible via condor or CRAB3 to T3_US_FNALLPC)
  • Condor scheduler nodes (schedd) are no longer logged into by users, so if a particular interactive node needs to be rebooted, or is inaccessible due to technical problems, a user's condor jobs are not affected
  • A wrapper script sends your condor jobs to the most advantageous schedd
  • Each schedd can run up to 10,000 condor jobs in parallel, should that many job slots be available to it

The required changes for users to understand concern the first two items below: X509_USER_PROXY and commands needing the -name argument. The remaining changes are ones you may notice and are documented for your information.

X509_USER_PROXY and your CMS grid certificate required

  • Jobs, and all condor queries, now require the user to have a valid grid proxy in the CMS VO
  • When you obtain your proxy (voms-proxy-init --valid 192:00 -voms cms), it will be saved in your home directory where it can be read by the condor job on the worker node
  • The condor refactor system automatically applies the following line, so you do NOT need to specify it in your condor.jdl file; it should work without it.
    • x509userproxy = $ENV(X509_USER_PROXY)
  • If you have hard-coded references to x509userproxy or X509_USER_PROXY pointing to /tmp or some other location, please remove them
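For reference, a typical session to obtain and verify a proxy might look like the following sketch (standard VOMS client commands; exact output and file paths will differ per user):
      voms-proxy-init --valid 192:00 -voms cms   # obtain a CMS VO proxy
      echo $X509_USER_PROXY                      # should point to a file in your home directory
      voms-proxy-info --all                      # check the proxy attributes and remaining lifetime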

Commands requiring -name argument

Commands that need to "read" or "write" to a specific scheduler require the -name argument, for example (use the particular scheduler that holds your jobs; see also the sketch after this list):
  • condor_hold -name lpcschedd3.fnal.gov -all
  • condor_release -name lpcschedd3.fnal.gov -all
  • condor_tail -name lpcschedd2.fnal.gov 30000041.0
  • condor_qedit -name lpcschedd1.fnal.gov 14.0 OriginalCpus 1
  • condor_rm -name lpcschedd3.fnal.gov 60000042
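If you need to act on your jobs on every scheduler at once, one possible pattern (a sketch, not an official wrapper provided by the refactor) is to loop over the three schedds:
      # remove all of your own jobs on each schedd in turn; substitute the condor command you need
      for schedd in lpcschedd1.fnal.gov lpcschedd2.fnal.gov lpcschedd3.fnal.gov; do
          condor_rm -name "$schedd" $USER
      done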
Extra information:

Scientific Linux 7 or Scientific Linux 6 jobs

Condor jobs submitted from the Scientific Linux 7 nodes (cmslpc-sl7.fnal.gov) should automatically run in SL7 containers.
SL6 jobs submitted from the SL6 nodes will automatically run in SL6 containers.
  • To choose SL7 Docker containers, add the following line to your job description file (e.g., condor.jdl)
    • +DesiredOS="SL7"
  • To choose SL6 Docker containers, add the following line to your job description file (e.g., condor.jdl)
    • +DesiredOS="SL6"
  • If you are re-purposing condor scripts used in CMS Connect, which use +REQUIRED_OS = "rhel7", you can add a line to take advantage of that in the cmslpc condor refactor:
      +REQUIRED_OS="rhel7"
      +DesiredOS = $(REQUIRED_OS)
  • To understand what OS your job is running under, for example for job 30000093 from schedd lpcschedd2:
    • condor_q -name lpcschedd2 30000093 -af:h desiredos
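Putting this together, a minimal condor.jdl requesting SL7 containers might look like the following sketch (the executable and file names here are hypothetical; note that no x509userproxy line is needed, as described above):
      universe                = vanilla
      executable              = run_analysis.sh
      output                  = TestNode_$(Cluster)_$(Process).stdout
      error                   = TestNode_$(Cluster)_$(Process).stderr
      log                     = TestNode_$(Cluster)_$(Process).log
      should_transfer_files   = YES
      when_to_transfer_output = ON_EXIT
      +DesiredOS              = "SL7"
      queue 1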

condor_submit changes

    • You will continue to use condor_submit condor.jdl (example) as you did before, but the output will change:
        [username@cmslpc42 ~/]$ condor_submit multiplication-random.jdl
        Querying the CMS LPC pool and trying to find an available schedd...
        
        Attempting to submit jobs to lpcschedd2.fnal.gov
        
        Submitting job(s)...............
        15 job(s) submitted to cluster 15.
        
    • If condor_submit fails, the user is advised to pass the -debugfile option to capture a debug log. This can be sent in an LPC Service Portal ticket to help understand problems. Example:
      • [username@cmslpc42 ~]$ condor_submit multiplication-random.jdl -debugfile /tmp/username_date.log
        Querying the CMS LPC pool and trying to find an available schedd...
        
        Attempting to submit jobs to lpcschedd2.fnal.gov
        
        Submitting job(s)
        ERROR: Failed to connect to local queue manager
        AUTHENTICATE:1003:Failed to authenticate with any method
        AUTHENTICATE:1004:Failed to authenticate using GSI
        GSI:5003:Failed to authenticate.  Globus is reporting error (851968:24).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)
        AUTHENTICATE:1004:Failed to authenticate using FS
        
        [username@cmslpc42 ~]$ ls -al /tmp/username_date.log 
        -rw-r--r-- 1 username us_cms 2412 Nov 19 13:55 /tmp/username_date.log
        
      • Note that in this case the error message suggests grid-proxy-init, but you should instead run voms-proxy-init --valid 192:00 -voms cms to obtain a proxy in the CMS VO
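    Authentication failures like the one above usually mean there is no valid CMS VO proxy; a simple pre-submit check (a sketch using standard VOMS client options) could be:
        voms-proxy-info -exists -valid 8:0 || voms-proxy-init --valid 192:00 -voms cms
        condor_submit multiplication-random.jdl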

condor_q differences

    • condor_q and condor_q -allusers will now report results from all schedulers.
    • If a shorter report is needed, a user can use condor_q -batch
    • Note: at this time, condor_status -submitters will report jobs running both at the cmslpc condor cluster as well as the T1_US_FNAL cluster.
    • If there are no jobs running on any scheduler, condor_q will not return anything
    • For the time being (Nov. 28, 2018), condor_q with the -totals argument doesn't work with the wrapper
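    For example, to look at a single scheduler directly or to get a compact summary (these are standard condor_q options; the schedd names are those listed in the JobID section below):
        condor_q -name lpcschedd2.fnal.gov             # only your jobs on lpcschedd2
        condor_q -name lpcschedd2.fnal.gov -allusers   # everyone's jobs on lpcschedd2
        condor_q -batch                                # compact per-batch summary across all schedds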

JobID differences

    • Users will note that their condor JobIDs fall in a different range depending on the scheduler. This keeps JobIDs unique should different jobs land on different schedulers
      • Example: the JobID is obtained in a condor jdl with the variable $(Cluster), usually used for log files, for instance: TestNode_$(Cluster)_$(Process).stdout
    • See the Condor v8.7 manual for the variables Cluster and Process
    • The schedulers are set up with the following JobID ranges:
      • lpcschedd1.fnal.gov: 1 - 29999999
      • lpcschedd2.fnal.gov: 30000000 - 59999999
      • lpcschedd3.fnal.gov: 60000000 - 89999999
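    Because the ranges do not overlap, the scheduler that owns a job can be inferred from its JobID alone. A small hypothetical bash helper illustrating this (not part of the cluster tools):
        #!/bin/bash
        # infer_schedd.sh (hypothetical): print the schedd that owns a given JobID
        id=${1%%.*}   # strip any .Process suffix, e.g. 30000041.0 -> 30000041
        if   [ "$id" -lt 30000000 ]; then echo lpcschedd1.fnal.gov
        elif [ "$id" -lt 60000000 ]; then echo lpcschedd2.fnal.gov
        else                              echo lpcschedd3.fnal.gov
        fi
    For example, ./infer_schedd.sh 30000041.0 prints lpcschedd2.fnal.gov, which is the value to pass to -name in the commands above.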

Commands have wrappers

    Be aware that many commands have wrapper scripts to allow communication with all the schedulers. You may see, for instance, a different executable location for a command such as condor_tail when doing condor_tail --help compared to which condor_tail.
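    A quick way to see this (purely informational, not required for normal use):
        which condor_tail     # location of the wrapper script found in your PATH
        condor_tail --help    # the help text may show a different underlying executable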

Other differences users have found

    • The $USER environment variable is no longer set on the worker nodes. To have access to it, place the following into your condor script (for instance a bash script condor.sh):
      • export USER=$(whoami)
    • Users may notice a longer delay (compared to before the refactor), of up to a minute, when getting reports from commands, as the commands query all the schedulers
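    As a minimal sketch of such a script (the payload lines are placeholders):
        #!/bin/bash
        # condor.sh (sketch): the executable named in your jdl
        export USER=$(whoami)   # restore $USER, which is not set inside the container
        echo "Running as $USER on $(hostname)"
        # ... your actual job commands go here ...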

Get help to modify your workflow

    Contact LPC Computing support or email the lpc-howto community support mailing list to get help modifying your workflow.