User Software and Computing
Getting Started: Guideline for users
LPC Computing Analysis Facility (CAF) at Fermilab
The LPC CAF provides user access to NFS and EOS user disk space. Data and MC files are accessible through T3_US_FNALLPC (EOS) and T1_US_FNAL (dCache) via xrootd. At this scale, users need to follow some rules so that the facility can provide the highest possible performance to all processes. Please help keep FNAL stable so that everybody can produce physics results as quickly and reliably as possible.
1. Interactive nodes
The basic purpose of the interactive nodes is to provide a platform for developing and debugging analysis code. Once your code is ready, running it at scale must be done on the condor batch nodes or on CMS grid resources using CRAB or CMS Connect. Jobs run interactively usually degrade the performance of the interactive nodes and affect many users at once, so please avoid this practice. Taking up multiple cores with interactive jobs slows down both you and others, and processes run this way may be removed by the administrators.
- In general, do not use more than 4 CPUs in parallel, and certainly not for a long period of time
- In the past, a script limited interactive jobs to one hour. That limit is no longer enforced, but if users abuse the system it may need to be reinstated
- Forking jobs to run outside of your interactive shell is never a good idea on the interactive nodes: it affects everyone trying to work on these multi-user systems. Please don't do this.
- Non-CMSSW software such as Madgraph, and scripts from other users run without checking them, may default to using all available CPUs. This slows down you and other users and may cause the sysadmins to kill your jobs. Be sure you understand a program, especially its parallelization options, before running it.
- How to find out whether your script or program has unknown parallelization was reviewed at the LPC Computing Discussion of March 16, 2018; see the slides and minutes. Note that they also cover how to configure Madgraph, which by default takes all the CPUs on a node.
- Do NOT write many files to /tmp; this includes using scripts optimized for lxplus. If you need temporary storage space, use the 3DayLifetime NFS disk area. Filling up /tmp will result in the loss of condor jobs, cvmfs, and interactive usage of the node, and the sysadmins will remove the offending files.
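To help with the guidelines above, a minimal sketch of capping and inspecting parallelism from the shell. The environment variables are the common threading knobs (`OMP_NUM_THREADS` for OpenMP, `OPENBLAS_NUM_THREADS` for BLAS-backed tools such as numpy); whether a given program honors them is an assumption you should verify for your own software, and the PID inspected here is the shell itself, purely as a demonstration:

```shell
# Cap common threading knobs before launching analysis code on an
# interactive node (assumption: the program honors these variables,
# as most OpenMP/BLAS/numpy-based tools do).
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4

# Check how many threads a running process is actually using, via the
# Linux /proc filesystem. Here $$ (this shell) is used as an example;
# substitute the PID of your own job.
nthreads() { awk '/^Threads:/ {print $2}' "/proc/$1/status"; }
echo "threads of PID $$: $(nthreads $$)"
```

Running `nthreads` on your analysis process after it starts is a quick way to catch a tool that has silently grabbed every core.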
2. Batch jobs:
- Use CRAB instead of writing your own data-handling system: CRAB has already solved the file-handling and data-access problems you would otherwise encounter, and it has access to more CPUs than the FNAL batch system alone. The SwGuideCrab twiki has an excellent set of documentation.
- Guidelines on how to structure a condor batch job at the CMS LPC are here, including how to use EOS instead of NFS for file transfer.
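As an illustration of the shape such a job takes, a minimal condor submit description (all file names and the queue count are placeholders, not LPC defaults; see the linked LPC batch documentation for the authoritative template):

```
universe                = vanilla
executable              = run_analysis.sh
output                  = job_$(Cluster)_$(Process).out
error                   = job_$(Cluster)_$(Process).err
log                     = job_$(Cluster).log
# Ship only small inputs with the job; large output should be copied
# to EOS with xrdcp from inside run_analysis.sh, not written to NFS.
transfer_input_files    = analysis_config.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 10
```

The key point relative to the guideline above is the comment: condor's sandbox transfer handles the small inputs, while bulk output goes to EOS via xrootd rather than to an NFS area.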
3. File access in general:
- Please use the CMSSW software releases installed centrally on CVMFS and don't install CMSSW yourself. The software setup documentation includes directions for finding many other types of software you may wish to use on CVMFS, including LCG software, ROOT, and GPU software.
- Follow the guidelines here for what not to do on the EOS Storage Element filesystem, to help ensure the FUSE mount stays healthy.
- Do not put more than 1000 files in a directory (you will notice CRAB3 automatically splits output following this guideline)
- In general, EOS performs best with individual files of 1-5 GB
- If in doubt about the kind of work you're trying to do outside the established workflows (for example CRAB) at FNAL, please contact LPC computing support to ask for guidance and help.
- The FNAL facility team is very motivated and works very hard to keep the facility running smoothly and without problems.
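As a quick self-check against the 1000-files-per-directory guideline above, a small shell helper (the helper names are illustrative, not LPC-provided tooling; here it is run on the current directory as an example):

```shell
# Count the files directly inside a directory (not recursing into
# subdirectories) and warn when it exceeds the 1000-file guideline.
count_files() { find "$1" -maxdepth 1 -type f | wc -l; }

check_dir() {
  n=$(count_files "$1")
  if [ "$n" -gt 1000 ]; then
    echo "WARNING: $1 holds $n files; split it into subdirectories"
  else
    echo "$1: $n files (within the guideline)"
  fi
}

# Example: check the current working directory.
check_dir "$PWD"
```

Running this over your EOS output areas before pointing batch jobs at them can catch an oversized directory before it becomes a performance problem.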