System Status: Condor Batch Monitoring
Landscape
- Landscape provides LPC and per-user batch summaries for the LPC farm
- Authenticate to Landscape "CMS LPC Grafana" with your CERN grid certificate
- "LPC Job Summary" shows the overall jobs for all users
- "User Job Summary" - choose which user, and look at currently running jobs. From that page, also choose "User Batch History"
- The "User Job Summary" and "User Batch History" pages are very useful to find out things like average job time, CPU efficiency
Basic command line
- From the command line on cmslpc-el9: condor_status -submitters and condor_userprio, as described on the cmslpc batch systems troubleshooting web page
- More command line tricks (each command is one line at the prompt; a usage sketch follows this list):
- condor_q : reports only your batch jobs from recent submissions
- condor_status -submitters : reports jobs in both the T1 and cmslpc batch (T3) together
- condor_status -schedd
- condor_status -run
- condor_userprio
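- To dig further into your own jobs, here is a minimal sketch using standard condor_q options (the job ID 12345.0 is hypothetical; substitute one of yours):
# Show your jobs one per line instead of grouped into batches
condor_q -nobatch
# List only held jobs together with the reason they were held
condor_q -constraint 'JobStatus == 5' -af:h ClusterId ProcId HoldReason
# Ask the schedd why a particular job is not starting
condor_q -better-analyze 12345.0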
Threshold for the Tier1/T3 (cmslpc batch) workers
condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM
2071
- That means the threshold is cmswn2071. This threshold may have changed since this documentation was written, so check the current number before using it in the commands below.
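- Rather than hard-coding the number, a minimal sketch (assuming bash and the same collector shown above) that queries the current threshold and reuses it:
# Capture the current T1/T3 boundary in a shell variable
THRESHOLD=$(condor_status -collector cmst1mgr2 -af FERMIHTC_LPC_MAX_WN_NUM)
echo "Current threshold: cmswn${THRESHOLD}"
# Reuse it in any of the queries below, for example:
condor_status -const "FERMIHTC_NODE_NUMBER < ${THRESHOLD} && State=?=\"Claimed\"" -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c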
What is the cluster occupancy in terms of CPUs/memory in use?
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
      1 1 10000
   1153 1 2048
   2656 1 2100
    208 1 7000
     17 1 8400
    299 8 15000
      2 8 2048
      1 cpus memory
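- If you just want totals rather than the per-slot breakdown, a small sketch that sums the claimed cores and memory with awk (using -af instead of -af:h so every line is numeric; again, substitute the current threshold):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af cpus memory | awk '{c += $1; m += $2} END {print c " CPUs and " m " MB of memory claimed"}'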
What are the per-user statistics for jobs running on cmslpc worker nodes?
- In the command below, we use the T1/T3 threshold number (2071 in the example)
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed"' -pool cmst1mgr2.fnal.gov -af:h accountinggroup cpus memory | sort | uniq -c
      1 accountinggroup cpus memory
     19 group_cmslpc.adatta@fnal.gov 1 2048
      1 group_cmslpc.algomez@fnal.gov 8 15000
      9 group_cmslpc.alimena@fnal.gov 1 2048
      8 group_cmslpc.bburkle@fnal.gov 8 15000
    220 group_cmslpc.charaf@fnal.gov 1 7000
      1 group_cmslpc.cmantill@fnal.gov 1 10000
      2 group_cmslpc.cmantill@fnal.gov 8 15000
      3 group_cmslpc.dbrehm@fnal.gov 1 2048
      2 group_cmslpc.jbabbar@fnal.gov 1 2048
     17 group_cmslpc.jingyu@fnal.gov 1 8400
   1111 group_cmslpc.jkrupa@fnal.gov 1 2048
    142 group_cmslpc.kwei726@fnal.gov 8 15000
      2 group_cmslpc.matteoc@fnal.gov 8 2048
   2656 group_cmslpc.rsyarif@fnal.gov 1 2100
      1 group_cmslpc.sbein@fnal.gov 1 2048
     46 group_cmslpc.sbein@fnal.gov 8 15000
     28 group_cmslpc.shogan@fnal.gov 1 2048
     99 group_cmslpc.ssekmen@fnal.gov 8 15000
      1 group_cmslpc.zwang4@fnal.gov 8 15000
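- To restrict the same query to a single user, a hedged variation (the username jdoe is a placeholder; use the group_cmslpc.<username>@fnal.gov form shown in the output above):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Claimed" && AccountingGroup =?= "group_cmslpc.jdoe@fnal.gov"' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c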
How many CPUs are free/available for local batch jobs?
- Be sure to get the updated T1/T3 threshold number (2071 in the example)
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Unclaimed" && (Memory/(1000*cpus))>=2' -pool cmst1mgr2.fnal.gov -af:h cpus memory | sort | uniq -c
      1 1 12614
      1 1 14206
    386 1 2048
      2 1 4665
      1 1 4717
      2 1 4769
      1 1 9409
      1 1 9721
      1 1 9773
      1 2 11509
      1 2 11665
      1 2 11717
      1 2 5417
      4 3 8709
      2 3 8761
     40 8 24109
      1 cpus memory
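- As with the occupancy query, a short sketch that reduces this to a single total of free cores (same assumptions: bash, and the 2071 threshold still being current):
condor_status -const 'FERMIHTC_NODE_NUMBER < 2071 && State=?="Unclaimed" && (Memory/(1000*cpus))>=2' -pool cmst1mgr2.fnal.gov -af cpus | awk '{free += $1} END {print free " CPUs free for local batch jobs"}'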
What's the average wait time for a job?
- Go to https://landscape.fnal.gov/lpc (authenticate with your CMS grid certificate) to see the average wait time and much more
Monitoring CRAB at T3_US_FNALLPC
- See the T3_US_FNALLPC glidein monitoring page, which may run about 20 minutes behind the actual jobs (use CERN SSO)
- Other CRAB monitors are documented on the SWGuideCRAB twiki
- T3_US_FNALLPC site administrator view of everything running from the global pool: CRAB + CMS Connect
Monitoring CMS Connect at T3_US_FNALLPC
- All of CMS Connect - grafana plot, use CERN SSO