Jan. 22, 2019: This page is being completely rewritten; many examples below are not 100% working in the new condor refactor.
NFS disks and condor worker nodes
NFS disk migration background
After October 1, 2017, the ~nobackup NFS disks (mounted as /uscms_data) will no longer be mounted on the worker nodes in the cmslpc condor batch system.
Users will be able to interact with files on NFS disks as usual on interactive nodes when logging in.
Condor batch jobs may need to be changed so that they do not read from the /uscms_data disks while running. Code snippets are given below for condor.jdl, the condor jdl file that is submitted, as well as condor.sh, the condor script that is run within the job.
Note that you may need to consult the main condor batch system web page
for complete working examples and information on condor batch system troubleshooting.
- "The right way" examples gives
code in green
, and is denoted with "right way".
- "The wrong way" examples gives
code in red
, and is denoted with "wrong way".
- The home directory /uscms/home/username (which is a soft link to /uscms/homes/u/username) is already NOT mounted on the condor worker nodes. You can use that fact to test your workflow: move the code to be accessed into your home area, and if the condor job still works, you have configured it correctly. Keep in mind that the home directory has a default quota of 2GB.
- cmslpc NFS disks are not accessible by xrootd; only the cmslpc EOS filesystem is accessible through xrootd.
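For example, files in your EOS area can be listed and copied through xrootd from anywhere, while files on /uscms_data have no xrootd path. A small sketch (username is a placeholder):
[username@cmslpc25 ~]$
xrdfs root://cmseos.fnal.gov/ ls /store/user/username
[username@cmslpc25 ~]$
xrdcp root://cmseos.fnal.gov//store/user/username/input_file.root .
Files on /uscms_data must instead be sent to the job with transfer_input_files or staged on EOS first, as described below.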
Note: It is important for users to validate and appropriately modify any condor scripts obtained from previous analyzers or written to run at other sites. What worked in the past or at other sites may no longer work, due to changes made to ensure a healthier condor worker node infrastructure for the cmslpc. Test code changes interactively, and test with 1-2 condor jobs before submitting large workflows.
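For example, a quick sanity check before scaling up (a sketch; condor.jdl and condor.sh stand for the submit file and job script described below):
[username@cmslpc25 ~]$
grep -n "/uscms_data" condor.jdl condor.sh
[username@cmslpc25 ~]$
condor_submit condor.jdl
[username@cmslpc25 ~]$
condor_q
If grep finds no hard-coded NFS paths and the small test job finishes successfully, the workflow is likely ready to scale up.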
CMSSW environment: bare CMSSW made within job
Here we have a bare CMSSW environment, that is, one that doesn't have any additional files in $CMSSW_BASE/src or other subdirectories, and that is checked out fresh for the job. This is done within the condor.sh script.
- The right way to get a bare CMSSW in your
condor.sh
(bash script):
#!/bin/bash
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc6_amd64_gcc530
eval `scramv1 project CMSSW CMSSW_8_0_25`
cd CMSSW_8_0_25/src/
eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
echo "CMSSW: "$CMSSW_BASE
- The right way to get a bare CMSSW in your
condor.csh
(tcsh script):
#!/bin/tcsh
source /cvmfs/cms.cern.ch/cmsset_default.csh
setenv SCRAM_ARCH slc6_amd64_gcc530
eval `scramv1 project CMSSW CMSSW_8_0_25`
cd CMSSW_8_0_25/src/
eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
echo "CMSSW: "$CMSSW_BASE
- The wrong way to get a bare CMSSW in your
condor.sh
- the problem is the
cd /uscms_data
:
#!/bin/bash
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc6_amd64_gcc530
cd /uscms_data/d1/username/CMSSW_8_0_25/src
eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
User input overview: custom executables, files, etc.
For just about anything a user might wish to access within a job (custom CMSSW, custom executable, input files, etc.),
there are two basic techniques to transport those files to the job on the worker node:
- Preferred: Transfer the input file(s) to EOS, and xrdcp the file(s) to the worker node at the start of each job. This is preferred because EOS has faster networking than NFS. The example here is a simple root macro (good1EOS.C), a bash script snippet (good1EOS.sh), and an input file (input_fileEOS.root).
- First, be sure you have your EOS area on cmslpc configured properly and enough quota by following the checks at this link.
- The right, preferred way to transfer files in condor.jdl:
universe = vanilla
Executable = good1EOS.sh
Output = good1EOS.out
Error = good1EOS.err
Log = good1EOS.log
transfer_input_files = good1.C
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
x509userproxy = $ENV(X509_USER_PROXY)
You will need to authenticate your grid certificate before submitting the condor job for the x509userproxy line to work.
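Authentication is typically done on the cmslpc interactive node with voms-proxy-init before running condor_submit; a minimal sketch (the 192-hour validity shown here is just an example):
[username@cmslpc25 ~]$
voms-proxy-init --valid 192:00 -voms cms
[username@cmslpc25 ~]$
voms-proxy-info --all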
- The right, preferred associated
good1.sh
(bash script snippet). The inputs are transferred to a temporary area on the worker node's
local disk, pointed to by the variable _CONDOR_SCRATCH_DIR
:
#!/bin/bash
# cms software setup not included here for brevity, use setup above for bare CMSSW
cd ${_CONDOR_SCRATCH_DIR}
xrdcp root://cmseos.fnal.gov//store/user/username/input_file.root .
root -b -q good1.C
xrdcp output_file.root root://cmseos.fnal.gov//store/user/username/output_file.root
### the output file is removed before the job ends because otherwise
### condor will transfer it to your submit directory at the end of the job
rm output_file.root
- The right, preferred associated good1.C (root macro snippet); files are opened locally on the worker node:
{
TFile f("input_file.root");
auto output_ntuples = process(f); // do some calculation; process() stands in for the user's own code
TFile g("output_file.root", "RECREATE");
g.cd();
output_ntuples->Write();
}
- Note the following details if you use xrdcp from EOS:
- EOS has better networking throughput for file transfer compared to NFS disk
- EOS works best for files of 1-5 GB in size. If you have a lot of small files, you may wish to tar them up first:
[username@cmslpc25 ~]$
tar -zcvf MyDirectory.tgz MyDirectory
[username@cmslpc25 ~]$
xrdcp MyDirectory.tgz root://cmseos.fnal.gov//store/user/username/MyDirectory.tgz
- In MyShellScript.sh: run tar -xf MyDirectory.tgz before running your workflow
- In MyShellScript.sh: be sure to cd MyDirectory or move files as appropriate in the local worker node _CONDOR_SCRATCH_DIR
- In MyShellScript.sh: rm MyDirectory.tgz, as files will get transferred back automatically if you have should_transfer_files = YES and when_to_transfer_output = ON_EXIT
- A minimal sketch of MyShellScript.sh combining these steps is given after this list
- Note that condor will NOT automatically transfer back directories or any files inside subdirectories. This is why the above example does xrdcp to EOS at the end of the job
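Putting the steps above together, a minimal sketch of what MyShellScript.sh might look like (assuming the tarball was copied to /store/user/username on EOS as shown above; adjust the names to your own):
#!/bin/bash
cd ${_CONDOR_SCRATCH_DIR}
# fetch and unpack the tarball made on the interactive node
xrdcp root://cmseos.fnal.gov//store/user/username/MyDirectory.tgz .
tar -xf MyDirectory.tgz
rm MyDirectory.tgz   # so it is not transferred back with the output
cd MyDirectory
# ... run your workflow here, then xrdcp any output to EOS and rm the local copies ...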
- Use transfer_input_files in the condor.jdl to transfer files from NFS disk to the worker node when the job starts. The example here is a simple root macro (good1.C), a bash script snippet (good1.sh), and an input file (input_file.root).
- The right, but not preferred, way to transfer files in condor.jdl:
universe = vanilla
Executable = good1.sh
Output = good1.out
Error = good1.err
Log = good1.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /uscms_data/d1/username/input_file.root, /uscms_data/d1/username/good1.C
- The right, but not preferred associated
good1.sh
(bash script snippet). The inputs are transferred to a temporary area on the worker node's
local disk, pointed to by the variable _CONDOR_SCRATCH_DIR
:
#!/bin/bash
# cms software setup not included here for brevity, use setup above for bare CMSSW
cd ${_CONDOR_SCRATCH_DIR}
root -b -q good1.C
- The right, but not preferred associated good1.C (root macro snippet); files are opened locally on the worker node:
{
TFile f("input_file.root");
auto output_ntuples = process(f); // do some calculation; process() stands in for the user's own code
TFile g("output_file.root", "RECREATE");
g.cd();
output_ntuples->Write();
}
- Note the following details if you choose to use transfer_input_files instead of EOS:
- Each individual condor job will transfer files from NFS disk to the worker node before starting the job (indicated by < in condor_q). This can take significant time and is much slower than having the files transferred from EOS at the start of the job inside the job's shell script.
- The amount of space your input files take up will automatically be added to the disk space requested by the condor job, which may reduce the number of job slots available to run on. By default, each job slot on the worker node has 40GB for input, output, and temporary files per 1 CPU/core job. Click here to learn more about condor partitionable slots at cmslpc; more or less disk space can be requested.
- The following examples will use the EOS method for file transfer.
CMSSW environment: user modified
For a user-modified CMSSW environment, you will need to tar the working area,
transfer it to EOS, transfer from EOS to the worker node, untar, setup the
environment, and proceed.
- The right way to tar a user modified CMSSW and transfer to EOS on the command line:
[username@cmslpc25 ~]$
tar -zcvf CMSSW8025.tgz CMSSW_8_0_25
- Note that you can exclude large files from your tar, for instance with the following argument:
--exclude="Filename*.root"
- Note that you can exclude CMSSW caches, for instance with a command like this one:
tar --exclude-caches-all --exclude-vcs -zcf CMSSW_8_0_25.tar.gz -C CMSSW_8_0_25/.. CMSSW_8_0_25 --exclude=src --exclude=tmp
- When using --exclude-caches-all, you should mark the directories you want to exclude with a CACHEDIR.TAG file; see these links for more information and examples (a short sketch of creating the tag file is given after the xrdcp command below).
[username@cmslpc25 ~]$
xrdcp CMSSW8025.tgz root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz
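As noted above, when using --exclude-caches-all each directory to be skipped needs a CACHEDIR.TAG file. A minimal sketch (MyBigOutputs is a hypothetical directory name; the signature line is the standard cache-directory tag that tar checks for):
[username@cmslpc25 ~]$
printf 'Signature: 8a477f597d28d172789f06886806bc55\n' > CMSSW_8_0_25/MyBigOutputs/CACHEDIR.TAG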
- Any large input root files should be transferred to your personal EOS area, and referred to
following the EOS instructions for file reference inside a script:
xrdcp Filename1.root root://cmseos.fnal.gov//store/user/username/Filename1.root
xrdcp Filename2.root root://cmseos.fnal.gov//store/user/username/Filename2.root
- The example executable file cmsRun.csh (tcsh script) will take arguments ${1} (the name of the python configuration file, ExampleConfig.py) and ${2}, some other variable you are passing to the configuration file, such as the number of events. As always, be sure to test these scripts and python configuration files interactively with a single test before submitting many condor jobs.
- The right way to use a user modified CMSSW in your
cmsRun.csh
, copied from EOS:
#!/bin/tcsh
echo "Starting job on " `date` #Date/time of start of job
echo "Running on: `uname -a`" #Condor job is running on this node
echo "System software: `cat /etc/redhat-release`" #Operating System on that node
source /cvmfs/cms.cern.ch/cmsset_default.csh ## if a bash script, use .sh instead of .csh
### copy the input root files only if you require local reading
xrdcp root://cmseos.fnal.gov//store/user/username/Filename1.root .
xrdcp root://cmseos.fnal.gov//store/user/username/Filename2.root .
xrdcp -s root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz .
tar -xf CMSSW8025.tgz
rm CMSSW8025.tgz
setenv SCRAM_ARCH slc6_amd64_gcc530
cd CMSSW_8_0_25/src/
scramv1 b ProjectRename
eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
cmsRun ${1} ${2}
xrdcp nameOfOutputFile.root root://cmseos.fnal.gov//store/user/username/nameOfOutputFile.root
### remove the input and output files if you don't want them automatically transferred when the job ends
rm nameOfOutputFile.root
rm Filename1.root
rm Filename2.root
cd ${_CONDOR_SCRATCH_DIR}
rm -rf CMSSW_8_0_25
- Be sure to make the cmsRun.csh executable: chmod +x cmsRun.csh
- The condor.jdl will look something like this:
universe = vanilla
Executable = cmsRun.csh
Should_Transfer_Files = YES
WhenToTransferOutput = ON_EXIT
Transfer_Input_Files = cmsRun.csh, ExampleConfig.py
Output = sleep_$(Cluster)_$(Process).stdout
Error = sleep_$(Cluster)_$(Process).stderr
Log = sleep_$(Cluster)_$(Process).log
x509userproxy = $ENV(X509_USER_PROXY)
Arguments = ExampleConfig.py 100
Queue 5
- The ExampleConfig.py will look something like this (partial); you could read the files directly from EOS or transfer them:
- Directly from EOS (you can specify root://cmsxrootd.fnal.gov//store if you wish):
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring(
        '/store/user/username/Filename1.root',
        '/store/user/username/Filename2.root'
    )
)
- Or local files copied to the _CONDOR_SCRATCH_DIR:
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring(
        'file:Filename1.root',
        'file:Filename2.root'
    )
)
Other examples of code that will not work in condor batch system
- In your cmsRun.py, do not read files in with any of these; instead, use the examples above:
file:/uscms_data/d1/username/infile.root
file:/uscmst1b_scratch/lpc1/3DayLifetime/username/infile.root
file:/uscms/home/username/infile.root
EOS usage
While this migration does not affect EOS, as you check your condor scripts and workflow you can also check that:
- In your complete workflow, you don't do anything that typically stresses the EOS filesystem
- Move your input files to EOS: you can copy them in bulk using one of these scripts written by LPC users
- You may wish to modify the output of your condor job to transfer all files to EOS automatically at the end (shown below)
- A good example cmsRun.sh sets up a bare CMSSW, loops over all root files at the end of the job, and transfers them to EOS. The files are then removed from the local worker node working area after transferring, so that the condor jdl option Should_Transfer_Files = YES doesn't pick up any of those already-transferred root files. Thanks to Kevin Pedro for the code.
#!/bin/bash
echo "Starting job on " `date` #Date/time of start of job
echo "Running on: `uname -a`" #Condor job is running on this node
echo "System software: `cat /etc/redhat-release`" #Operating System on that node
source /cvmfs/cms.cern.ch/cmsset_default.sh ## if a tcsh script, use .csh instead of .sh
export SCRAM_ARCH=slc6_amd64_gcc530
eval `scramv1 project CMSSW CMSSW_8_0_25`
cd CMSSW_8_0_25/src/
eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
echo "CMSSW: "$CMSSW_BASE
echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
cmsRun ${1} ${2}
### Now that the cmsRun is over, there is one or more root files created
echo "List all root files = "
ls *.root
echo "List all files"
ls
echo "*******************************************"
OUTDIR=root://cmseos.fnal.gov//store/user/username/MyCondorOutputArea/
echo "xrdcp output for condor"
for FILE in *.root
do
  echo "xrdcp -f ${FILE} ${OUTDIR}/${FILE}"
  xrdcp -f ${FILE} ${OUTDIR}/${FILE} 2>&1
  XRDEXIT=$?
  if [[ $XRDEXIT -ne 0 ]]; then
    rm *.root
    echo "exit code $XRDEXIT, failure in xrdcp"
    exit $XRDEXIT
  fi
  rm ${FILE}
done