
NFS disks and condor worker nodes

NFS disk migration background


After October 1, 2017, the ~nobackup NFS disks (mounted as /uscms_data) will no longer be mounted on the worker nodes in the cmslpc condor batch system. Users can still interact with files on the NFS disks as usual when logged in to the interactive nodes.

You may need to change your condor batch jobs so that they do not read from the /uscms_data disks while running. Code snippets are given below for condor.jdl (the condor jdl file that is submitted) and condor.sh (the condor script that runs within the job). Note that you may need to consult the main condor batch system web page for complete working examples and information on troubleshooting the condor batch system.
  • "The right way" examples give code in green and are denoted with "right way".
  • "The wrong way" examples give code in red and are denoted with "wrong way".

  • The home directory /uscms/home/username (which is a soft link to /uscms/homes/u/username) is already NOT mounted on the condor worker nodes. You can use this fact to test your workflow: move your code so it is accessed from the home directory, and if the condor job still works, you have configured it correctly. Keep in mind that the home directory has a default quota of 2 GB.
  • cmslpc NFS disks are not accessible via xrootd; only the cmslpc EOS filesystem is accessible through xrootd.

Note: It is important to validate and modify condor scripts obtained from previous analyzers or written to run at other sites. What worked in the past or at other sites may no longer work, due to changes made to keep the condor worker node infrastructure at the cmslpc healthy. Test code changes interactively, and test condor workflows with 1-2 jobs, before submitting large numbers of jobs.

CMSSW environment: bare CMSSW made within job

Here we have a bare CMSSW environment, that is, one that doesn't have any additional files in $CMSSW_BASE/src or other subdirectories, and is checked out fresh for the job. This is done within condor.sh.
  • The right way to get a bare CMSSW in your condor.sh (bash script):
    
    #!/bin/bash
    source /cvmfs/cms.cern.ch/cmsset_default.sh
    export SCRAM_ARCH=slc6_amd64_gcc530
    eval `scramv1 project CMSSW CMSSW_8_0_25`
    cd CMSSW_8_0_25/src/
    eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
    echo "CMSSW: "$CMSSW_BASE
    
    
  • The right way to get a bare CMSSW in your condor.csh (tcsh script):
    
    #!/bin/tcsh
    source /cvmfs/cms.cern.ch/cmsset_default.csh
    setenv SCRAM_ARCH slc6_amd64_gcc530
    eval `scramv1 project CMSSW CMSSW_8_0_25`
    cd CMSSW_8_0_25/src/
    eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
    echo "CMSSW: "$CMSSW_BASE
    
    
  • The wrong way to get a bare CMSSW in your condor.sh - the problem is the cd /uscms_data:
    
    #!/bin/bash
    source /cvmfs/cms.cern.ch/cmsset_default.sh
    export SCRAM_ARCH=slc6_amd64_gcc530
    cd /uscms_data/d1/username/CMSSW_8_0_25/src
    eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
    
    

User input overview: custom executables, files, etc.

For just about anything a user might wish to access within a job (custom CMSSW, custom executable, input files, etc.), there are two basic techniques to transport those files to the job on the worker node:
  1. Preferred: Transfer the input file(s) to EOS, and xrdcp the file(s) to the worker node at the start of each job. This is preferred because EOS has faster networking than NFS. The example here is a simple root macro (good1.C), a bash script snippet (good1EOS.sh), and an input file (input_file.root).
    • First, be sure you have your EOS area on cmslpc configured properly and enough quota by following the checks at this link.
    • The right, preferred way to transfer a file in condor.jdl:
      universe = vanilla
      Executable = good1EOS.sh
      Output = good1EOS.out
      Error = good1EOS.err
      Log = good1EOS.log
      transfer_input_files = good1.C
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT
      x509userproxy = $ENV(X509_USER_PROXY)
      
      
      You will need to authenticate your grid certificate before submitting the condor job for the x509userproxy line to work.
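      To illustrate, a grid proxy is typically created and checked with the standard grid middleware tools before running condor_submit. A sketch (the 192-hour lifetime shown is an assumption; use the lifetime recommended by your experiment's documentation):

      ```shell
      # Create a CMS VO grid proxy before submitting condor jobs
      voms-proxy-init --valid 192:00 -voms cms
      # Inspect the proxy, including its remaining lifetime
      voms-proxy-info --all
      ```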
    • The right, preferred associated good1EOS.sh (bash script snippet). The inputs are transferred to a temporary area on the worker node's local disk, pointed to by the variable _CONDOR_SCRATCH_DIR:
      #!/bin/bash
      # cms software setup not included here for brevity, use setup above for bare CMSSW
      cd ${_CONDOR_SCRATCH_DIR}
      xrdcp root://cmseos.fnal.gov//store/user/username/input_file.root .
      root -b -q good1.C
      xrdcp output_file.root root://cmseos.fnal.gov//store/user/username/output_file.root
      ### the output file is removed before the job ends because otherwise
      ### condor will transfer it to your submit directory at the end of the job
      rm output_file.root
      
      
    • The right, preferred associated good1.C (root macro snippet); files are opened locally on the worker node:
       {
       TFile f("input_file.root");
       output_ntuples = process(f); // do some calculation (process is user-defined)
       TFile g("output_file.root", "RECREATE"); // ROOT's open option is "RECREATE", not "w"
       g.cd();
       output_ntuples->Write();
       }
      
      
    • Note the following details if you use xrdcp from EOS:
      • EOS has better networking throughput for file transfer than NFS disk.
      • EOS works best for files 1-5 GB in size. If you have many small files, you may wish to:
        • [username@cmslpc25 ~]$ tar -zcvf MyDirectory.tgz MyDirectory
        • [username@cmslpc25 ~]$ xrdcp MyDirectory.tgz root://cmseos.fnal.gov//store/user/username/MyDirectory.tgz
        • In MyShellScript.sh: tar -xf MyDirectory.tgz before running
        • In MyShellScript.sh: be sure to cd MyDirectory or move files as appropriate within the worker node's local _CONDOR_SCRATCH_DIR
        • In MyShellScript.sh: rm MyDirectory.tgz, as files will otherwise be transferred back automatically if you have should_transfer_files = YES and when_to_transfer_output = ON_EXIT
        • Note that condor will NOT automatically transfer back any directory, nor files inside a subdirectory. This is why the example above does xrdcp to EOS at the end of the job
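    The tar/untar pattern described in the bullets above can be exercised end-to-end locally. In this hedged sketch, mktemp directories stand in for the NFS area and _CONDOR_SCRATCH_DIR, and cp stands in for xrdcp, so it runs without EOS access:

    ```shell
    #!/bin/bash
    set -e
    SUBMIT=$(mktemp -d)
    SCRATCH=$(mktemp -d)   # plays the role of _CONDOR_SCRATCH_DIR on the worker
    cd "$SUBMIT"
    mkdir MyDirectory
    echo "payload" > MyDirectory/data.txt
    tar -zcvf MyDirectory.tgz MyDirectory        # pack on the interactive node
    cp MyDirectory.tgz "$SCRATCH"/               # stands in for xrdcp via EOS
    cd "$SCRATCH"
    tar -xf MyDirectory.tgz                      # unpack inside the job
    rm MyDirectory.tgz                           # avoid transferring it back
    cd MyDirectory                               # as the bullet above advises
    cat data.txt
    ```
    
    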

  2. Use transfer_input_files in the condor.jdl to transfer files from the NFS disk to the worker node when the job starts. The example here is a simple root macro (good1.C), a bash script snippet (good1.sh), and an input file (input_file.root).
    • The right, but not preferred way to transfer a file in condor.jdl:
      universe = vanilla
      Executable = good1.sh
      Output = good1.out
      Error = good1.err
      Log = good1.log
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT
      transfer_input_files = /uscms_data/d1/username/input_file.root, /uscms_data/d1/username/good1.C
      
      
    • The right, but not preferred associated good1.sh (bash script snippet). The inputs are transferred to a temporary area on the worker node's local disk, pointed to by the variable _CONDOR_SCRATCH_DIR:
      #!/bin/bash
      # cms software setup not included here for brevity, use setup above for bare CMSSW
      cd ${_CONDOR_SCRATCH_DIR}
      root -b -q good1.C
      
      
    • The right, but not preferred associated good1.C (root macro snippet); files are opened locally on the worker node:
       {
       TFile f("input_file.root");
       output_ntuples = process(f); // do some calculation (process is user-defined)
       TFile g("output_file.root", "RECREATE"); // ROOT's open option is "RECREATE", not "w"
       g.cd();
       output_ntuples->Write();
       }
      
      
    • Note the following details if you choose to use transfer_input_files instead of EOS:
      • Each individual condor job will transfer files from the NFS disk to the worker node before starting (indicated by < in condor_q). This can take significant time and is much slower than transferring the files from EOS at the start of the job inside the job's shell script.
      • The amount of space your input files take up automatically increases the disk space requested by the condor job, which may reduce the number of job slots available to run on. Each 1 CPU/core job slot on the worker node has 40 GB by default for input, output, and temporary files. See the web page on condor partitionable slots at cmslpc to learn how to request more or less disk space.
      • The following examples will use the EOS method for file transfer.
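    If your inputs and outputs need more (or less) scratch space than the default slot provides, the request can be set explicitly in the jdl. A hedged sketch (HTCondor's request_disk is specified in KiB by default; the value below is only an illustration to adjust for your own files):

    ```
    request_disk = 20971520
    ```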

CMSSW environment: user modified

For a user-modified CMSSW environment, you will need to tar the working area, transfer it to EOS, transfer from EOS to the worker node, untar, setup the environment, and proceed.
  • The right way to tar a user modified CMSSW and transfer it to EOS on the command line:
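    A hedged sketch of this packing step (a scratch directory stands in for /uscms_data and cp stands in for the xrdcp transfer, so the demo runs without EOS access; on cmslpc you would tar your real work area and xrdcp the tarball as shown in the comment):

    ```shell
    #!/bin/bash
    set -e
    WORK=$(mktemp -d)
    EOSAREA=$(mktemp -d)   # plays the role of /store/user/username on EOS
    cd "$WORK"
    # Stand-in for a real user-modified release area
    mkdir -p CMSSW_8_0_25/src
    echo "// user code" > CMSSW_8_0_25/src/MyAnalyzer.cc
    # Pack the whole release area; the tarball name matches the job script below
    tar --exclude-vcs -zcf CMSSW8025.tgz CMSSW_8_0_25
    # On cmslpc the real transfer would be:
    #   xrdcp CMSSW8025.tgz root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz
    cp CMSSW8025.tgz "$EOSAREA"/
    ```
    
    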
  • Any large input root files should be transferred to your personal EOS area, and referred to following the EOS instructions for file reference inside a script:
    xrdcp Filename1.root root://cmseos.fnal.gov//store/user/username/Filename1.root
    xrdcp Filename2.root root://cmseos.fnal.gov//store/user/username/Filename2.root
    
  • The example executable file cmsRun.csh (tcsh script) takes arguments ${1} (the name of the python configuration file, ExampleConfig.py) and ${2}, some other variable you are passing to the configuration file, such as the number of events. As always, be sure to test these scripts and python configuration files interactively with a single test before submitting many condor jobs.
  • The right way to use a user modified CMSSW in your cmsRun.csh, copied from EOS:
    #!/bin/tcsh
    echo "Starting job on " `date` #Date/time of start of job
    echo "Running on: `uname -a`" #Condor job is running on this node
    echo "System software: `cat /etc/redhat-release`" #Operating System on that node
    source /cvmfs/cms.cern.ch/cmsset_default.csh  ## if a bash script, use .sh instead of .csh
    ### copy the input root files only if you require local reading
    xrdcp root://cmseos.fnal.gov//store/user/username/Filename1.root .
    xrdcp root://cmseos.fnal.gov//store/user/username/Filename2.root .
    xrdcp -s root://cmseos.fnal.gov//store/user/username/CMSSW8025.tgz .
    tar -xf CMSSW8025.tgz
    rm CMSSW8025.tgz
    setenv SCRAM_ARCH slc6_amd64_gcc530
    cd CMSSW_8_0_25/src/
    scramv1 b ProjectRename
    eval `scramv1 runtime -csh` # cmsenv is an alias not on the workers
    echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
    cmsRun ${1} ${2}
    xrdcp nameOfOutputFile.root root://cmseos.fnal.gov//store/user/username/nameOfOutputFile.root
    ### remove the input and output files if you don't want them automatically transferred when the job ends
    rm nameOfOutputFile.root
    rm Filename1.root
    rm Filename2.root
    cd ${_CONDOR_SCRATCH_DIR}
    rm -rf CMSSW_8_0_25
    
  • Be sure to make your cmsRun.csh executable: chmod +x cmsRun.csh
  • The right way to use a user modified CMSSW condor.jdl will look something like this:

    universe = vanilla
    Executable = cmsRun.csh
    Should_Transfer_Files = YES
    WhenToTransferOutput = ON_EXIT
    Transfer_Input_Files = cmsRun.csh, ExampleConfig.py
    Output = sleep_$(Cluster)_$(Process).stdout
    Error = sleep_$(Cluster)_$(Process).stderr
    Log = sleep_$(Cluster)_$(Process).log
    x509userproxy = $ENV(X509_USER_PROXY)
    Arguments = ExampleConfig.py 100
    Queue 5
    
  • The right way to use a user modified CMSSW ExampleConfig.py will look something like this (partial); you could read the files directly from EOS or transfer them:

    1. Directly from EOS (you can specify root://cmsxrootd.fnal.gov//store if you wish):
      process.source = cms.Source ("PoolSource",
          fileNames=cms.untracked.vstring(
              '/store/user/username/Filename1.root',
              '/store/user/username/Filename2.root'
          )
      )
      
    2. Or local files copied to the _CONDOR_SCRATCH_DIR:
      process.source = cms.Source ("PoolSource",
          fileNames=cms.untracked.vstring(
              'file:Filename1.root',
              'file:Filename2.root'
          )
      )
      

Other examples of code that will not work in the condor batch system

  • In your cmsRun.py, do not read files in with any of the following; instead use the examples above:
  • file:/uscms_data/d1/username/infile.root
    file:/uscmst1b_scratch/lpc1/3DayLifetime/username/infile.root
    file:/uscms/home/username/infile.root
  • Note that the last one already does not work: the home directories have not been mounted on the condor worker nodes at any time in 2017, so the home directory can be used as a TEST location for your scripts.

EOS usage

While this migration does not affect EOS, as you check your condor scripts and workflow you can also check that:
  1. In your complete workflow, you don't do anything that typically stresses the EOS filesystem
  2. Move your input files to EOS: you can copy them in bulk using one of these scripts written by LPC users
  3. You may wish to modify the output of your condor job to transfer all files to EOS automatically at the end (shown below)
    • A good example cmsRun.sh that sets up a bare CMSSW and, at the end of the job, loops over all root files and transfers them to EOS. The files are removed from the local job node working area after transfer, so that the condor jdl option Should_Transfer_Files = YES doesn't pick up the already-transferred root files. Thanks to Kevin Pedro for the code.
      #!/bin/bash
      echo "Starting job on " `date` #Date/time of start of job
      echo "Running on: `uname -a`" #Condor job is running on this node
      echo "System software: `cat /etc/redhat-release`" #Operating System on that node
      source /cvmfs/cms.cern.ch/cmsset_default.sh  ## if a tcsh script, use .csh instead of .sh
      export SCRAM_ARCH=slc6_amd64_gcc530
      eval `scramv1 project CMSSW CMSSW_8_0_25`
      cd CMSSW_8_0_25/src/
      eval `scramv1 runtime -sh` # cmsenv is an alias not on the workers
      echo "CMSSW: "$CMSSW_BASE
      echo "Arguments passed to this script are: for 1: $1, and for 2: $2"
      cmsRun ${1} ${2}
      ### Now that the cmsRun is over, there is one or more root files created
      echo "List all root files = "
      ls *.root
      echo "List all files"
      ls 
      echo "*******************************************"
      OUTDIR=root://cmseos.fnal.gov//store/user/username/MyCondorOutputArea/
      echo "xrdcp output for condor"
      for FILE in *.root
      do
        echo "xrdcp -f ${FILE} ${OUTDIR}/${FILE}"
        xrdcp -f ${FILE} ${OUTDIR}/${FILE} 2>&1
        XRDEXIT=$?
        if [[ $XRDEXIT -ne 0 ]]; then
          rm *.root
          echo "exit code $XRDEXIT, failure in xrdcp"
          exit $XRDEXIT
        fi
        rm ${FILE}
      done
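      The exit-code handling in the loop above can be exercised locally. In this hedged stand-in, mktemp directories replace the scratch and EOS areas and cp replaces xrdcp, so the pattern can be tested without EOS access:

      ```shell
      #!/bin/bash
      SCRATCH=$(mktemp -d)
      OUTDIR=$(mktemp -d)   # plays root://cmseos.fnal.gov//store/user/username/...
      cd "$SCRATCH"
      touch histos.root trees.root        # stand-ins for cmsRun output files
      for FILE in *.root
      do
        echo "cp ${FILE} ${OUTDIR}/${FILE}"
        cp "${FILE}" "${OUTDIR}/${FILE}"
        CPEXIT=$?
        if [[ $CPEXIT -ne 0 ]]; then
          rm *.root
          echo "exit code $CPEXIT, failure in copy"
          exit $CPEXIT
        fi
        rm "${FILE}"    # remove so condor does not also transfer it back
      done
      ```
      
      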
      
      

Get help to modify your workflow

Contact LPC Computing support or email the lpc-howto community support mailing list to get help modifying your workflow.
Webmaster | Last modified: Friday, 29-Sep-2017 09:36:35 CDT