U.S. CMS

User Software and Computing

Computing Environment Setup: Batch System Advanced Topics

This page describes advanced topics for users of the condor batch system at the cmslpc user access facility. It presumes the user already knows the basics of configuring condor batch jobs (or CRAB), documented on the main batch system web page.

Note that this condor matrix monitor for the cmslpc cluster lets you see in real time how the slots in use are partitioned for more memory or CPU usage. You can click on users, machines, memory, or CPU.

How to submit multicore jobs on the cmslpc nodes

Any additional job requirements will restrict which nodes your job can run on. Consider carefully any requirements/requests in a condor JDL given to you by other LPC users. For a complete and long list of possible requirement settings, see the condor user's manual.
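As a sketch (the executable and file names here are placeholders, not from this page), a multicore job is requested with the request_cpus setting in the JDL:

    # run_multicore.sh is a hypothetical wrapper script; substitute your own
    universe     = vanilla
    executable   = run_multicore.sh
    request_cpus = 4                # ask for 4 cores on a single worker node
    output       = job_$(Cluster)_$(Process).stdout
    error        = job_$(Cluster)_$(Process).stderr
    log          = job_$(Cluster)_$(Process).log
    queue 1

Memory and disk stay at their defaults unless also requested; as with any extra requirement, a larger request_cpus means fewer machines can match your job.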

How do I request more memory for my batch jobs?

Any additional job requirements will restrict which nodes your job can run on. Consider carefully any requirements/requests in a condor JDL given to you by other LPC users.
  • The requirement below requests a slot with at least 2100 megabytes of memory available, slightly more than the default slot size of 2 GB.

    request_memory = 2100

  • Use the memory requirement only if you truly need it, keeping in mind that you may wait a significant amount of time for slots with sufficient memory to become free.
  • A larger memory request will also negatively affect your condor priority, and it combines what would otherwise be several default job slots into one (see dynamic (partitionable) slots below).
For a complete and long list of possible requirement settings, see the condor user's manual.

How do I request more local disk space for my batch jobs?

Any additional job requirements will restrict which nodes your job can run on. Consider carefully any requirements/requests in a condor JDL given to you by other LPC users.
  • A setting documented here in the past was request_disk = 1000000 (interpreted in KiB), which restricts the job to nodes with 1.024 GB of disk or more available. As of March 3, 2017, adding that 1 GB disk requirement removed 1767 CPUs from consideration for running your job. Increasing these numbers will further restrict the number of machines you can run on (completely, if set too high).
  • In the default partitioning, users have 40 GB of disk per default (1 core, 2 GB memory) slot.
  • Additionally, due to slot partitioning (see dynamic (partitionable) slots below), requesting a large amount of disk will reduce the resources available for other users.
  • Be sure to request only the disk space you need.

For a complete and long list of possible requirement settings, see the condor user's manual.
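A bare number given to request_disk is read as KiB; the 1.024 GB figure quoted above follows from the unit conversion, which can be checked with a short Python sketch (the function name is ours, purely for illustration):

```python
def request_disk_kib_to_gb(kib: int) -> float:
    """Convert a condor request_disk value in KiB to decimal gigabytes."""
    bytes_total = kib * 1024            # 1 KiB = 1024 bytes
    return bytes_total / 1_000_000_000  # decimal GB (1 GB = 10^9 bytes)

# request_disk = 1000000 (KiB) corresponds to 1.024 GB of disk:
print(request_disk_kib_to_gb(1_000_000))  # prints: 1.024
```

The mixed binary/decimal units are why the request reads "1000000" but the restriction is described as "1 GB or more".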

An explanation of the cmslpc dynamic (partitionable) condor slots


For a longer explanation and discussion of cmslpc dynamic (partitionable) slots, see the August 4, 2017 LPC Computing Discussion slides and minutes.
  • cmslpc worker nodes have anywhere from 8 to 24 "CPUs" (actually cores) available. Each core has ~2 GB of memory and 40 GB of disk space for a job.
  • The job slots are created when a condor job asks for them.
  • The default job slot is 1 CPU, 2 GB memory, and 40 GB disk space.
  • A job slot can take more CPU, and/or memory, and/or disk space.
  • Examples: to see how a worker node is partitioned, consider an 8-core machine with 16 GB RAM. We ignore disk space and presume it is the default.
    1. Default job of 1 CPU, 2 GB memory; running job in red, unclaimed (not-yet-existing) slots in green: 1 red box with 1 CPU and 2 GB; 7 separate green boxes with 1 CPU and 2 GB per box, each box the same size
    2. request_cpus = 2, default 2 GB memory; running job in red, unclaimed slots in green: 1 red box with 2 CPU and 2 GB; 1 green box with 1 CPU and 4 GB and 5 green boxes with 1 CPU and 2 GB per box, each green box the same size, the red box twice the width of a single green box
    3. request_cpus = 4, default 2 GB memory; running job in red, unclaimed slots in green: 1 red box with 4 CPU and 2 GB; 3 green boxes with 1 CPU and 4 GB per box and 1 green box with 1 CPU and 2 GB, each green box the same size, the red box four times the width of a single green box
  • condor_status will not tell you how many default slots are available, as slots only exist once they are made. Instead, go to the CMS LPC System Status Landscape monitoring; the red line in "Slots" shows the maximum number of default slots available.
  • Note that when you run condor_status, the "slot1@cmswn" entry for each machine will never be claimed, as that is the partitionable slot that creates the others, such as "slot1_1", "slot1_21", etc.
Note that this condor matrix monitor for the cmslpc cluster lets you see in real time how the slots in use are partitioned. You can click on users, machines, memory, or CPU.
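The carving-up described above can be illustrated with a toy model. This is not condor itself; the class and method names are ours, and only CPU and memory are modeled (disk is ignored, as in the examples):

```python
# Toy model of a dynamic (partitionable) condor slot being carved into
# dynamic slots as jobs claim resources. Illustrative only.

DEFAULT_CPUS, DEFAULT_MEM_GB = 1, 2  # the default job slot size

class Node:
    def __init__(self, cpus, mem_gb):
        self.free_cpus = cpus
        self.free_mem_gb = mem_gb
        self.claimed = []  # one (cpus, mem_gb) entry per running job

    def claim(self, request_cpus=DEFAULT_CPUS, request_memory_gb=DEFAULT_MEM_GB):
        """Carve a dynamic slot out of the partitionable slot, if it fits."""
        if request_cpus > self.free_cpus or request_memory_gb > self.free_mem_gb:
            return False  # job stays idle: not enough resources left
        self.free_cpus -= request_cpus
        self.free_mem_gb -= request_memory_gb
        self.claimed.append((request_cpus, request_memory_gb))
        return True

    def remaining_default_slots(self):
        """How many more default (1 CPU, 2 GB) jobs could still start."""
        return min(self.free_cpus, self.free_mem_gb // DEFAULT_MEM_GB)

# The 8-core, 16 GB machine from example 2: a request_cpus = 2 job
# leaves 6 CPUs and 14 GB, i.e. room for 6 more default jobs (CPU-limited).
node = Node(cpus=8, mem_gb=16)
node.claim(request_cpus=2)
print(node.free_cpus, node.free_mem_gb, node.remaining_default_slots())
# prints: 6 14 6
```

The model shows why a fat request shrinks what is left for everyone else: resources subtracted from the partitionable slot are simply gone until the job exits.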

How is condor configured on the cmslpc nodes? (Useful for troubleshooting)

Please see also the main condor troubleshooting instructions.

To find out how a certain variable is configured in condor by default on the cmslpc nodes, use the condor_config_val command.

  • condor_config_val MAX_JOBS_SUBMITTED: useful to find out the value of MAX_JOBS_SUBMITTED. Occasionally, if a sysadmin needs to reboot an interactive node (one running a condor scheduler, schedd), they will set MAX_JOBS_SUBMITTED to 0 and let the currently running condor jobs finish; users are then unable to submit new jobs. Typically there is a warning about this in the message of the day that users see when they log in.
  • condor_config_val -dump: this dumps all the condor configuration variables that are set by default. To understand more about these variables, consult the Condor user's manual.

Other condor topics

Condor user's manual

For more information, visit the Condor user's manual. Find the version of condor running on the lpc with condor_q -version.

About compilation of code for cmslpc condor

One important note: compilation of code is not supported on the remote worker nodes. This includes ROOT's ACLiC (i.e., the plus sign at the end of root -b -q foo.C+). You can compile on the LPC interactive nodes and transfer the executables and shared libraries for use on the worker. For root -b -q foo.C+, you will need to transfer the freshly compiled foo* files with the condor job.
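As a sketch of the transfer step, the JDL would list the compiled products among the input files. The exact file names depend on your ROOT version; the ones below follow ROOT 6's ACLiC naming convention for a macro named foo.C, so adjust them to what ACLiC actually produced:

    # After running `root -b -q foo.C+` once on an interactive node, ACLiC
    # typically leaves foo_C.so (plus, in ROOT 6, foo_C_ACLiC_dict_rdict.pcm
    # and foo_C.d). Ship them so the worker never has to compile:
    transfer_input_files    = foo.C, foo_C.so, foo_C_ACLiC_dict_rdict.pcm, foo_C.d
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT

With the compiled library present in the job sandbox, `root -b -q foo.C+` on the worker loads the existing foo_C.so instead of attempting to recompile.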


Webmaster | Last modified: Tuesday, 04-Dec-2018 14:44:20 CST