System Status: Condor Batch Monitoring
Landscape
- Landscape provides LPC and per-user batch summaries for the LPC farm
- Authenticate to Landscape "CMS LPC Grafana" with your CERN grid certificate
- "LPC Job Summary" shows the overall jobs for all users
- "User Job Summary" - choose which user, and look at currently running jobs. From that page, also choose "User Batch History"
- The "User Job Summary" and "User Batch History" pages are very useful to find out things like average job time, CPU efficiency
Basic command line
- From the command line on cmslpc-el9: condor_status -submitters and condor_userprio, as described on the cmslpc batch systems troubleshooting web page
- More command line tricks (each command is one line at the prompt; a usage sketch follows this list):
- condor_q : reports only your batch jobs from recent submissions
- condor_status -submitters : reports jobs in both the T1 and cmslpc batch (T3) together
- condor_status -schedd
- condor_status -run
- condor_userprio
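- To dig further into your own jobs, here is a minimal sketch using standard condor_q options (the job ID 12345.0 is hypothetical; substitute one of yours):
# Show your jobs one per line instead of grouped into batches
condor_q -nobatch
# List only held jobs together with the reason they were held
condor_q -constraint 'JobStatus == 5' -af:h ClusterId ProcId HoldReason
# Ask the schedd why a particular job is not starting
condor_q -better-analyze 12345.0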
Threshold for the Tier1/T3 (cmslpc batch) workers
condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM
2071
- That means the threshold is cmswn2071. This threshold may have changed since this documentation was written, so check the current number before using it in the commands below.
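- Rather than hard-coding the number, a minimal sketch (assuming bash and the same collector shown above) that queries the current threshold and reuses it:
# Capture the current T1/T3 boundary in a shell variable
THRESHOLD=$(condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM)
echo "Current threshold: cmswn${THRESHOLD}"
# Reuse it in any of the queries below, for example:
condor_status -const "FERMIHTC_NODE_NUMBER < ${THRESHOLD} && State=?=\"Claimed\"" -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c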
What is the cluster occupancy in terms of CPUs/memory in use?
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
      1 1 10000
   1153 1 2048
   2656 1 2100
    208 1 7000
     17 1 8400
    299 8 15000
      2 8 2048
      1 cpus memory
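- If you just want totals rather than the per-slot breakdown, a small sketch that sums the claimed cores and memory with awk (using -af instead of -af:h so every line is numeric; again, substitute the current threshold):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af cpus memory | awk '{c += $1; m += $2} END {print c " CPUs and " m " MB of memory claimed"}'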
What are the per-user statistics for jobs running on cmslpc worker nodes?
- In the command below, we use the T1/T3 threshold number (2071 in the example)
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h accountinggroup cpus memory | sort | uniq -c
      1 accountinggroup cpus memory
     19 group_cmslpc.adatta@fnal.gov 1 2048
      1 group_cmslpc.algomez@fnal.gov 8 15000
      9 group_cmslpc.alimena@fnal.gov 1 2048
      8 group_cmslpc.bburkle@fnal.gov 8 15000
    220 group_cmslpc.charaf@fnal.gov 1 7000
      1 group_cmslpc.cmantill@fnal.gov 1 10000
      2 group_cmslpc.cmantill@fnal.gov 8 15000
      3 group_cmslpc.dbrehm@fnal.gov 1 2048
      2 group_cmslpc.jbabbar@fnal.gov 1 2048
     17 group_cmslpc.jingyu@fnal.gov 1 8400
   1111 group_cmslpc.jkrupa@fnal.gov 1 2048
    142 group_cmslpc.kwei726@fnal.gov 8 15000
      2 group_cmslpc.matteoc@fnal.gov 8 2048
   2656 group_cmslpc.rsyarif@fnal.gov 1 2100
      1 group_cmslpc.sbein@fnal.gov 1 2048
     46 group_cmslpc.sbein@fnal.gov 8 15000
     28 group_cmslpc.shogan@fnal.gov 1 2048
     99 group_cmslpc.ssekmen@fnal.gov 8 15000
      1 group_cmslpc.zwang4@fnal.gov 8 15000
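- To restrict the same query to a single user, a hedged variation (the username jdoe is a placeholder; use the group_cmslpc.<username>@fnal.gov form shown in the output above):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed" && AccountingGroup =?= "group_cmslpc.jdoe@fnal.gov"' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c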
How many CPUs are free/available for local batch jobs?
- Be sure to get the updated T1/T3 threshold number (2071 in the example)
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Unclaimed" && (Memory/(1000*cpus))>=2' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
      1 1 12614
      1 1 14206
    386 1 2048
      2 1 4665
      1 1 4717
      2 1 4769
      1 1 9409
      1 1 9721
      1 1 9773
      1 2 11509
      1 2 11665
      1 2 11717
      1 2 5417
      4 3 8709
      2 3 8761
     40 8 24109
      1 cpus memory
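- As with the occupancy query, a short sketch that reduces this to a single total of free cores (same assumptions: bash, and the 2071 threshold still being current):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Unclaimed" && (Memory/(1000*cpus))>=2' -pool cmst1mgr2.fnal.gov -af cpus | awk '{free += $1} END {print free " CPUs free for local batch jobs"}'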
What's the average wait time for a job?
- Go to https://landscape.fnal.gov/lpc (authenticate with your CMS grid certificate) to see the average wait time and much more
Monitoring CRAB at T3_US_FNALLPC
- See the T3_US_FNALLPC glidein monitoring page, which may run about 20 minutes behind the actual jobs (use CERN SSO)
- Other CRAB monitors are documented on the SWGuideCRAB twiki
- T3_US_FNALLPC site administrator view of everything running from the global pool: CRAB + CMS Connect
Monitoring CMS Connect at T3_US_FNALLPC
- All of CMS Connect - grafana plot, use CERN SSO