
System Status: Condor Batch Monitoring

Landscape

  • Landscape provides LPC and user batch summaries for the LPC farm
    • Authenticate to the Landscape "CMS LPC Grafana" pages with your CERN grid certificate
    • "LPC Job Summary" shows the overall jobs for all users
    • "User Job Summary" lets you choose a user and look at their currently running jobs. From that page, you can also open "User Batch History"
      • The "User Job Summary" and "User Batch History" pages are very useful for finding out things like average job time and CPU efficiency

Basic Command line

  • From the command line on cmslpc-el9, run condor_status -submitters and condor_userprio, as described on the cmslpc batch systems troubleshooting web page

  • More command line tricks (each command is one line at the prompt; a short sketch combining them follows this list):
    • condor_q: reports only your batch jobs from recent submissions on the current schedd
    • condor_status -submitters: reports jobs in both the T1 and cmslpc batch (T3) together
    • condor_status -schedd: lists the schedd (submission) nodes with their running, idle, and held job totals
    • condor_status -run: lists the machines that are currently running jobs
    • condor_userprio: shows the current effective user priorities used for fair-share scheduling
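  • A short sketch combining these (a minimal example; -totals, -hold, and -af are standard HTCondor options, not cmslpc-specific):

        # Summary counts of your running/idle/held jobs
        condor_q -totals

        # List your held jobs with the reason HTCondor recorded for each hold
        condor_q -hold -af ClusterId ProcId HoldReason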

Threshold for the Tier1/T3 (cmslpc batch) workers

  • First you need to know where the threshold is for LPC workers so you can filter out T1 usage. To find it, query the collector:
    • condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM
      2071
    • That means the threshold is cmswn2071. This threshold may have changed since this documentation was written, so for the commands below you will need to check this number.
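  • Since the commands below reuse this number, it is convenient to capture it once in a shell variable (a minimal sketch using the exact query above; the variable name MAXWN is just illustrative):

        # Store the current T1/T3 boundary so later queries stay up to date
        MAXWN=$(condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM)
        echo "T1/T3 threshold: cmswn$MAXWN"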

    What is the cluster occupancy, in terms of the CPUs and memory people are using?

  • Once you have the T1/T3 threshold number (2071 in the example), you can construct a command to find cmslpc batch cluster occupancy (use the value you extracted above, as it may change):
    • condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
    •       1 1    10000 
         1153 1    2048  
         2656 1    2100  
          208 1    7000  
           17 1    8400  
          299 8    15000 
            2 8    2048  
            1 cpus memory
      
    • Do not be alarmed by the 8 core, 15 GB slots. Those are most likely pilots coming from the CMS global pool, and each pilot can contain more than one CRAB job sent to T3_US_FNALLPC
    • Results are shown as: #jobs RequestCpus RequestMemory
    • In the example above there are two jobs running in 8 core slots using 2048 MB of memory each (a variant that totals the claimed CPUs and memory is sketched below)
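    • To reduce this listing to a single total, the same query can be piped through awk instead of uniq (a minimal sketch, assuming the MAXWN variable captured above; plain -af is used so there is no header row to skip):

        # Total claimed CPUs and memory (MB) on LPC workers
        condor_status -const "FERMIHTC_NODE_NUMBER < $MAXWN && State =?= \"Claimed\"" \
            -pool cmst1mgr2.fnal.gov -af cpus memory \
          | awk '{cpus += $1; mem += $2} END {print cpus, "CPUs and", mem, "MB claimed"}'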

    What are the per-user statistics for jobs running on cmslpc worker nodes?

    • In the command below, we use the T1/T3 threshold number (2071 in the example)
    • condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h accountinggroup cpus memory | sort | uniq -c
    •       1 accountinggroup                cpus memory
           19 group_cmslpc.adatta@fnal.gov   1    2048  
            1 group_cmslpc.algomez@fnal.gov  8    15000 
            9 group_cmslpc.alimena@fnal.gov  1    2048  
            8 group_cmslpc.bburkle@fnal.gov  8    15000 
          220 group_cmslpc.charaf@fnal.gov   1    7000  
            1 group_cmslpc.cmantill@fnal.gov 1    10000 
            2 group_cmslpc.cmantill@fnal.gov 8    15000 
            3 group_cmslpc.dbrehm@fnal.gov   1    2048  
            2 group_cmslpc.jbabbar@fnal.gov  1    2048  
           17 group_cmslpc.jingyu@fnal.gov   1    8400  
         1111 group_cmslpc.jkrupa@fnal.gov   1    2048  
          142 group_cmslpc.kwei726@fnal.gov  8    15000 
            2 group_cmslpc.matteoc@fnal.gov  8    2048  
         2656 group_cmslpc.rsyarif@fnal.gov  1    2100  
            1 group_cmslpc.sbein@fnal.gov    1    2048  
           46 group_cmslpc.sbein@fnal.gov    8    15000 
           28 group_cmslpc.shogan@fnal.gov   1    2048  
           99 group_cmslpc.ssekmen@fnal.gov  8    15000 
            1 group_cmslpc.zwang4@fnal.gov   8    15000 
      
    • Again, do not be alarmed by the 8 core, 15 GB slots; they are most likely pilots coming from the CMS global pool (CRAB jobs going to T3_US_FNALLPC, and each pilot can contain more than one CRAB job). In this example, we see both local batch jobs and pilot (CRAB) batch jobs from user cmantill
    • Results are shown as: #jobs AccountingGroup (which embeds the username) RequestCpus RequestMemory (a per-user CPU total is sketched below)
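    • To rank users by the CPUs they currently hold, feed the same query to a small awk tally (a minimal sketch, again assuming the MAXWN variable from above):

        # Sum claimed CPUs per accounting group, heaviest users first
        condor_status -const "FERMIHTC_NODE_NUMBER < $MAXWN && State =?= \"Claimed\"" \
            -pool cmst1mgr2.fnal.gov -af accountinggroup cpus \
          | awk '{used[$1] += $2} END {for (u in used) print used[u], u}' \
          | sort -rn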

    How many CPUs are free/available for local batch jobs?

    • Be sure to get the updated T1/T3 threshold number (2071 in the example). The (Memory/(1000*cpus))>=2 constraint keeps only unclaimed slots that still have at least 2 GB of memory per core; a sketch that totals the free CPUs follows the output below
    • condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Unclaimed" && (Memory/(1000*cpus))>=2' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
    •       1 1    12614 
            1 1    14206 
          386 1    2048  
            2 1    4665  
            1 1    4717  
            2 1    4769  
            1 1    9409  
            1 1    9721  
            1 1    9773  
            1 2    11509 
            1 2    11665 
            1 2    11717 
            1 2    5417  
            4 3    8709  
            2 3    8761  
           40 8    24109 
            1 cpus memory
      
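    • To total the free CPUs from this listing (a minimal sketch, assuming the MAXWN variable from above):

        # Count unclaimed CPUs that still have at least 2 GB of memory per core
        condor_status -const "FERMIHTC_NODE_NUMBER < $MAXWN && State =?= \"Unclaimed\" && (Memory/(1000*cpus)) >= 2" \
            -pool cmst1mgr2.fnal.gov -af cpus \
          | awk '{free += $1} END {print free, "CPUs free"}'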

    What's the average wait time for a job?
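    • One way to estimate this from the command line is to compare each completed job's submission time (QDate) with its start time (JobStartDate) in condor_history. A minimal sketch (QDate and JobStartDate are standard HTCondor job attributes; restricting to your own username and taking a simple average are illustrative choices):

        # Average queue wait in seconds over your recent jobs in the history file
        condor_history $USER -af QDate JobStartDate \
          | awk '$2 != "undefined" {wait += $2 - $1; n++} END {if (n) print wait/n, "s average wait over", n, "jobs"}'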

    Monitoring CRAB at T3_US_FNALLPC

    Monitoring CMS Connect at T3_US_FNALLPC
