
Jan. 22, 2019: This page is being completely rewritten; many examples below are not 100% working after the new condor refactor.

NFS disks and condor worker nodes

NFS disk migration background


As of October 1, 2017, the ~nobackup NFS disks (mounted as /uscms_data) are no longer mounted on the worker nodes in the cmslpc condor batch system. Users can still interact with files on the NFS disks as usual when logged in to the interactive nodes.

You may need to change your condor batch jobs so that they do not read from the /uscms_data disks while running. Code snippets are given below for condor.jdl (the condor jdl file that is submitted) and for condor.sh (the condor script that is run within the job). Note that you may need to consult the main condor batch system web page for complete working examples and for information on troubleshooting the condor batch system.
  • "The right way" examples give code in green and are denoted with "right way".
  • "The wrong way" examples give code in red and are denoted with "wrong way".

  • The home directory /uscms/home/username (a soft link to /uscms/homes/u/username) is already NOT mounted on the condor worker nodes. You can use that fact to test your workflow: move the code the job needs into your home area, and if the condor job still works, you have configured it correctly (see the sketch after this list). Keep in mind that the home directory has a default quota of 2GB.
  • cmslpc NFS disks are not accessible by xrootd; only the cmslpc EOS filesystem is accessible through xrootd.
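    For example, a minimal sketch of such a test (the directory and file names are hypothetical; substitute your own):

      [username@cmslpc25 ~]$ mkdir -p ~/condor_test
      [username@cmslpc25 ~]$ cp /uscms_data/d1/username/good1.C /uscms_data/d1/username/condor.jdl /uscms_data/d1/username/condor.sh ~/condor_test/
      [username@cmslpc25 ~]$ cd ~/condor_test
      [username@cmslpc25 condor_test]$ condor_submit condor.jdl   # submit 1-2 test jobs, then check the .out/.err/.log files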

Note: It is important for users to check that condor scripts obtained from previous analyzers, or written to run at other sites, are validated and modified appropriately. What worked in the past or at other sites may no longer work, due to changes made to ensure a healthier condor worker node infrastructure for the cmslpc. Test changes to code interactively, and test condor jobs with 1-2 jobs before submitting large workflows.

CMSSW environment: bare CMSSW made within job

Here we have a bare CMSSW environment, that is, one that doesn't have any additional files in $CMSSW_BASE/src or other subdirectories and is checked out fresh for the job. This is done within the condor.sh (or condor.csh) script.
  • The right way to get a bare CMSSW in your condor.sh (bash script):
    
    #!/bin/bash
    source /cvmfs/cms.cern.ch/cmsset_default.sh
    export SCRAM_ARCH=slc6_amd64_gcc530
    eval `scramv1 project CMSSW CMSSW_8_0_25`
    cd CMSSW_8_0_25/src/
    eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
    echo "CMSSW: "$CMSSW_BASE
    
    
  • The right way to get a bare CMSSW in your condor.csh (tcsh script):
    
    #!/bin/tcsh
    source /cvmfs/cms.cern.ch/cmsset_default.csh
    setenv SCRAM_ARCH slc6_amd64_gcc530
    eval `scramv1 project CMSSW CMSSW_8_0_25`
    cd CMSSW_8_0_25/src/
    eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
    echo "CMSSW: "$CMSSW_BASE
    
    
  • The wrong way to get a bare CMSSW in your condor.sh - the problem is the cd /uscms_data:
    
    #!/bin/bash
    source /cvmfs/cms.cern.ch/cmsset_default.sh
    export SCRAM_ARCH=slc6_amd64_gcc530
    cd /uscms_data/d1/username/CMSSW_8_0_25/src
    eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
    
    

User input overview: custom executables, files, etc.

For just about anything a user might wish to access within a job (custom CMSSW, custom executable, input files, etc.), there are two basic techniques to transport those files to the job on the worker node:
  1. Preferred: Transfer the input file(s) to EOS, and xrdcp the file(s) to the worker node at the start of each job. This is preferred because EOS has faster networking than NFS. The example here is a simple root macro (good1.C), a bash script snippet (good1EOS.sh), and an input file (input_file.root).
    • First, be sure you have your EOS area on cmslpc configured properly and enough quota by following the checks at this link.
    • The right, preferred way to transfer files in condor.jdl:
      universe = vanilla
      Executable = good1EOS.sh
      Output = good1EOS.out
      Error = good1EOS.err
      Log = good1EOS.log
      transfer_input_files = good1.C
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT
      x509userproxy = $ENV(X509_USER_PROXY)
      
      
      You will need to authenticate your grid certificate before submitting the condor job for the x509userproxy line to work.
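      For example, the standard CMS grid proxy command (adjust the validity window to your needs):

      voms-proxy-init --valid 192:00 --voms cms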
    • The right, preferred associated good1EOS.sh (bash script snippet). The inputs are transferred to a temporary area on the worker node's local disk, pointed to by the variable _CONDOR_SCRATCH_DIR:
      #!/bin/bash
      # cms software setup not included here for brevity, use setup above for bare CMSSW
      cd ${_CONDOR_SCRATCH_DIR}
      xrdcp root://cmseos.fnal.gov//store/user/username/input_file.root .
      root -b -q good1.C
      xrdcp output_file.root root://cmseos.fnal.gov//store/user/username/output_file.root
      ### the output file is removed before the job ends because otherwise
      ### condor will transfer it to your submit directory at the end of the job
      rm output_file.root
      
      
    • The right, preferred associated good1.C (root macro snippet); the files are opened locally on the worker node:
      {
      TFile f("input_file.root");
      auto output_ntuples = process(f); // do some calculation (user-defined function)
      TFile g("output_file.root", "RECREATE");
      output_ntuples->Write();
      g.Close();
      }
      
      
    • Note the following details if you use xrdcp from EOS:
      • EOS has better networking throughput for file transfer compared to NFS disk
      • EOS works best for files of 1-5GB in size. If you have a lot of small files, you may wish to:
        • [username@cmslpc25 ~]$ tar -zcvf MyDirectory.tgz MyDirectory
        • [username@cmslpc25 ~]$ xrdcp MyDirectory.tgz root://cmseos.fnal.gov//store/user/username/MyDirectory.tgz
        • In MyShellScript.sh: tar -xf MyDirectory.tgz before running
        • In MyShellScript.sh: be sure to cd MyDirectory or move files as appropriate in the local worker node _CONDOR_SCRATCH_DIR
        • In MyShellScript.sh: rm MyDirectory.tgz as files will get transferred automatically back if you have should_transfer_files = YES, and when_to_transfer_output = ON_EXIT
        • Note that condor will NOT automatically transfer back any files inside a subdirectory (or the directory itself). This is why the above example uses xrdcp to copy the output to EOS at the end of the job.

  2. Use transfer_input_files in the condor.jdl to transfer files from the NFS disk to the worker node when the job starts. The example here is a simple root macro (good1.C), a bash script snippet (good1.sh), and an input file (input_file.root).
    • The right, but not preferred way to transfer files in condor.jdl:
      universe = vanilla
      Executable = good1.sh
      Output = good1.out
      Error = good1.err
      Log = good1.log
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT
      transfer_input_files = /uscms_data/d1/username/input_file.root, /uscms_data/d1/username/good1.C
      
      
    • The right, but not preferred associated good1.sh (bash script snippet). The inputs are transferred to a temporary area on the worker node's local disk, pointed to by the variable _CONDOR_SCRATCH_DIR:
      #!/bin/bash
      # cms software setup not included here for brevity, use setup above for bare CMSSW
      cd ${_CONDOR_SCRATCH_DIR}
      root -b -q good1.C
      
      
    • The right, but not preferred associated good1.C (root macro snippet); the files are opened locally on the worker node:
      {
      TFile f("input_file.root");
      auto output_ntuples = process(f); // do some calculation (user-defined function)
      TFile g("output_file.root", "RECREATE");
      output_ntuples->Write();
      g.Close();
      }
      
      
    • Note the following details if you choose to use transfer_input_files instead of EOS:
      • Each individual condor job will transfer files from the NFS disk to the worker node before the job starts (indicated by < in condor_q). This can take significant time and is much slower than transferring the files from EOS at the start of the job inside the job's shell script.
      • The space your input files take up will automatically increase the disk space requested by the condor job, which may reduce the number of job slots available to run in. Each job slot on the worker node has by default 40GB for input, output, and temporary files per 1 CPU/core job. Click here to learn more about condor partitionable slots at cmslpc and how to request more or less disk space (a hedged request_disk sketch is shown after this list).
      • The following examples use the EOS method for file transfer.
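      • A minimal, hedged sketch of setting the disk request explicitly in the condor.jdl (request_disk is in KiB by default; the value below is only an example, roughly 20 GB, and the right amount depends on your own inputs and outputs):
        request_disk = 20000000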

CMSSW environment: user modified

For a user-modified CMSSW environment, you will need to tar the working area, transfer it to EOS, transfer it from EOS to the worker node, untar it, set up the environment, and proceed.
  • The right way to tar a user modified CMSSW and transfer to EOS on the command line:
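    A minimal sketch, run from the directory containing the release (the tarball name CMSSW8025.tgz matches the cmsRun.csh example below; substitute your own username, and consider tar's --exclude options to keep the tarball small):

    [username@cmslpc25 ~]$ tar -zcvf CMSSW8025.tgz CMSSW_8_0_25
    [username@cmslpc25 ~]$ xrdcp CMSSW8025.tgz root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz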
  • Any large input root files should be transferred to your personal EOS area and referred to following the EOS instructions for referencing files inside a script:
    xrdcp Filename1.root root://cmseos.fnal.gov//store/user/username/Filename1.root
    xrdcp Filename2.root root://cmseos.fnal.gov//store/user/username/Filename2.root
    
  • The example executable file cmsRun.csh (tcsh script) will take arguments ${1} (the name of the python configuration file ExampleConfig.py) and ${2}, some other variable you are passing to the configuration file, such as the number of events. As always, be sure to test these scripts and python configuration files interactively with a single test before submitting many condor jobs.
  • The right way to use a user modified CMSSW in your cmsRun.csh, copied from EOS:
    #!/bin/tcsh
    echo "Starting job on " `date` #Date/time of start of job
    echo "Running on: `uname -a`" #Condor job is running on this node
    echo "System software: `cat /etc/redhat-release`" #Operating System on that node
    source /cvmfs/cms.cern.ch/cmsset_default.csh  ## if a bash script, use .sh instead of .csh
    ### copy the input root files only if you require local reading
    xrdcp root://cmseos.fnal.gov//store/user/username/Filename1.root .
    xrdcp root://cmseos.fnal.gov//store/user/username/Filename2.root .
    xrdcp -s root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz .
    tar -xf CMSSW8025.tgz
    rm CMSSW8025.tgz
    setenv SCRAM_ARCH slc6_amd64_gcc530
    cd CMSSW_8_0_25/src/
    scramv1 b ProjectRename
    eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
    echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
    cmsRun ${1} ${2}
    xrdcp nameOfOutputFile.root root://cmseos.fnal.gov//store/user/username/nameOfOutputFile.root
    ### remove the input and output files if you don't want them automatically transferred when the job ends
    rm nameOfOutputFile.root
    rm Filename1.root
    rm Filename2.root
    cd ${_CONDOR_SCRATCH_DIR}
    rm -rf CMSSW_8_0_25
    
  • Be sure to make your cmsRun.csh executable: chmod +x cmsRun.csh
  • The right way to use a user modified CMSSW condor.jdl will look something like this:

    universe = vanilla
    Executable = cmsRun.csh
    Should_Transfer_Files = YES
    WhenToTransferOutput = ON_EXIT
    Transfer_Input_Files = cmsRun.csh, ExampleConfig.py
    Output = sleep_$(Cluster)_$(Process).stdout
    Error = sleep_$(Cluster)_$(Process).stderr
    Log = sleep_$(Cluster)_$(Process).log
    x509userproxy = $ENV(X509_USER_PROXY)
    Arguments = ExampleConfig.py 100
    Queue 5
    
  • The right way to use a user modified CMSSW ExampleConfig.py will look something like this (partial); you could read the files directly from EOS or transfer them (a hedged sketch of how the second script argument could be consumed follows these examples):

    1. Directly from EOS (you can specify root://cmsxrootd.fnal.gov//store if you wish):
      process.source = cms.Source ("PoolSource",
          fileNames=cms.untracked.vstring(
              '/store/user/username/Filename1.root',
              '/store/user/username/Filename2.root'
          )
      )
      
    2. Or local files copied to the _CONDOR_SCRATCH_DIR:
      process.source = cms.Source ("PoolSource",
          fileNames=cms.untracked.vstring(
              'file:Filename1.root',
              'file:Filename2.root'
          )
      )
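
  • The examples above do not show how the second argument (the 100 in Arguments = ExampleConfig.py 100) reaches the configuration. One hedged way to consume it inside ExampleConfig.py is to read it from sys.argv; the argument offset seen by the configuration can differ between CMSSW releases, so test interactively with cmsRun ExampleConfig.py 100 before submitting:

    import sys
    import FWCore.ParameterSet.Config as cms

    process = cms.Process("ANA")

    # Take the last command-line argument as the number of events to process;
    # fall back to all events (-1) if no numeric argument was passed.
    nEvents = int(sys.argv[-1]) if sys.argv[-1].lstrip("-").isdigit() else -1
    process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(nEvents))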
      

Other examples of code that will not work in condor batch system

  • In your cmsRun.py, do not read files in with any of the following; instead, use the examples above:
  • file:/uscms_data/d1/username/infile.root
    file:/uscmst1b_scratch/lpc1/3DayLifetime/username/infile.root
    file:/uscms/home/username/infile.root
  • Note that the last one does not work either: the home directories were not mounted on the condor worker nodes at any point during 2017, so the home area can be used as a TEST location for your scripts.

EOS usage

This migration does not affect EOS, but while you are checking your condor scripts and workflow, you can:
  1. Check that your complete workflow doesn't do anything that typically stresses the EOS filesystem
  2. Move your input files to EOS: you can copy them in bulk using one of these scripts written by LPC users
  3. Modify, if you wish, the output of your condor job to transfer all files to EOS automatically at the end (shown below)
    • A good example cmsRun.sh that sets up a bare CMSSW, loops over all root files at the end of the job, and transfers them to EOS. The files are removed from the job's local working area after they are transferred, so that the condor jdl option Should_Transfer_Files = YES doesn't also transfer those already-copied root files back to the submit directory. Thanks to Kevin Pedro for the code.
      #!/bin/bash
      echo "Starting job on " `date` #Date/time of start of job
      echo "Running on: `uname -a`" #Condor job is running on this node
      echo "System software: `cat /etc/redhat-release`" #Operating System on that node
      source /cvmfs/cms.cern.ch/cmsset_default.sh  ## if a tcsh script, use .csh instead of .sh
      export SCRAM_ARCH=slc6_amd64_gcc530
      eval `scramv1 project CMSSW CMSSW_8_0_25`
      cd CMSSW_8_0_25/src/
      eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
      echo "CMSSW: "$CMSSW_BASE
      echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
      cmsRun ${1} ${2}
      ### Now that the cmsRun is over, there is one or more root files created
      echo "List all root files = "
      ls *.root
      echo "List all files"
      ls 
      echo "*******************************************"
      OUTDIR=root://cmseos.fnal.gov//store/user/username/MyCondorOutputArea/
      echo "xrdcp output for condor"
      for FILE in *.root
      do
        echo "xrdcp -f ${FILE} ${OUTDIR}/${FILE}"
        xrdcp -f ${FILE} ${OUTDIR}/${FILE} 2>&1
        XRDEXIT=$?
        if [[ $XRDEXIT -ne 0 ]]; then
          rm *.root
          echo "exit code $XRDEXIT, failure in xrdcp"
          exit $XRDEXIT
        fi
        rm ${FILE}
      done
      
      

Get help to modify your workflow

Contact LPC Computing support or email the lpc-howto community support mailing list to get help modifying your workflow.