ORFEO-DOC

WARNING: this documentation page is outdated and deprecated. Please refer to https://orfeo-doc.areasciencepark.it/

Welcome to ORFEO documentation (still preliminary)

This is preliminary and still incomplete documentation for the ORFEO cluster at AreaSciencePark. Its current location is temporary, and it will be moved elsewhere later.

how to login/access

The login node (ct1-005.area.trieste.it) can be accessed via ssh, the Secure Shell protocol. Username/password authentication is not enabled: access is passwordless only, by means of public/private key pairs.

Make sure you have already sent your public SSH key to `support@areasciencepark.it` to have your account activated.
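As an illustration, the login setup can be captured in an SSH client configuration entry. The sketch below is only an example under stated assumptions: the helper name `add_orfeo_host`, the alias `orfeo`, the username `jdoe` and the key file `~/.ssh/id_ed25519` are placeholders, not values defined by this documentation. A key pair can be created beforehand with `ssh-keygen -t ed25519`; the resulting `.pub` file is the public key to send to support.

```shell
# add_orfeo_host USER KEYFILE [CONFIG]
# Append a host entry for the login node to an SSH client configuration
# file (default: ~/.ssh/config). Run this on your own machine.
add_orfeo_host() {
    user="$1"
    key="$2"
    config="${3:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$config")"
    cat >> "$config" <<EOF
Host orfeo
    HostName ct1-005.area.trieste.it
    User $user
    IdentityFile $key
    IdentitiesOnly yes
EOF
    chmod 600 "$config"
}

# Example call (placeholder user name and key path):
#   add_orfeo_host jdoe ~/.ssh/id_ed25519
```

With such an entry in place, `ssh orfeo` should be enough to reach the login node.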

how to ask for help

Requests should be sent to `support@areasciencepark.it`, always including the word “GENOMICA” in the subject line.

Please describe your request in as much detail as possible.

HPC partition

The environment is composed of:

  • a login node, where users log in and submit computational jobs to the PBSPRO batch server;
  • a set of computing nodes:
    • 2 fat nodes, each with 2 Intel(R) Xeon(R) Gold 6154 CPUs @ 3.00GHz (36 cores per node) and 1536GB RAM
    • 10 thin nodes, each with 2 Intel(R) Xeon(R) Gold 6126 CPUs @ 2.60GHz (24 cores per node) and 768GB RAM
    • 4 gpu nodes, each with 2 Intel(R) Xeon(R) Gold 6226 CPUs @ 2.70GHz (24 cores per node), 256GB RAM and 2 Nvidia V100 GPUs with 32GB of memory each

storage services/resources

Four storage areas are at the user's disposal:

  • the user’s home
    • once logged in, each user lands in their home directory, `/u/[name_of_group]/[name_of_user]`
    • e.g. the home of a user in group area is /u/area/[name_of_user]
    • it is physically located on ceph large FS, and exported via infiniband to all the computational nodes
    • it is backed up upon request (not by default)
    • quotas are enforced, with a default limit of 2TB per user
    • soft links to the other areas are available there:
[cozzini@pbs-centos7 ~]$ ls -lrt
total 2
lrwxrwxrwx 1 cozzini area 18 Apr  7 15:23 fast -> /fast/area/cozzini
lrwxrwxrwx 1 cozzini area 21 Apr  7 15:23 storage -> /storage/area/cozzini
lrwxrwxrwx 1 cozzini area 21 Apr 16 09:23 scratch -> /scratch/area/cozzini
  • the scratch area
    • it is a large area intended for data that still need to be processed
    • it is not backed up, and no quotas are enforced for the moment
    • it is also physically located on ceph large FS, and exported via infiniband to all the computational nodes
[cozzini@pbs-centos7 ~]$ df -h /scratch
Filesystem                                                                 Size  Used Avail Use% Mounted on
10.128.6.211:6789,10.128.6.213:6789,10.128.6.212:6789,10.128.6.214:6789:/  407T  4.3T  402T   2% /large
  • the fast filesystem
    • it is a fast space available to each user, on all the computing nodes
    • /fast/[name_of_group]/[name_of_user]
    • e.g. the fast area of user cozzini is /fast/area/cozzini
    • it is intended as a fast scratch area for data-intensive applications
[cozzini@pbs-centos7 ~]$ df -h /fast
Filesystem                                                                 Size  Used Avail Use% Mounted on
10.128.6.211:6789,10.128.6.212:6789,10.128.6.213:6789,10.128.6.214:6789:/   96T  5.5T   90T   6% /fast
  • the storage FS
    • it is intended for long-term storage of final processed datasets
    • at the moment only 40TB of space is mounted
    • it is NFS-mounted via a 50Gbit ethernet link
[cozzini@pbs-centos7 ~]$ df -h /storage
Filesystem             Size  Used Avail Use% Mounted on
10.128.2.231:/storage   37T  2.9T   34T   8% /storage
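
The intended division of labour between the areas (process data on the fast area, keep final results on the storage area) can be sketched with two small shell helpers. This is only an illustrative sketch: the function names `stage_in` and `stage_out` and the run directory names are hypothetical, and plain `cp` is used for simplicity.

```shell
# stage_in SRC_DIR WORK_DIR: copy input data into a working directory
# (for example from ~/storage to ~/fast) before processing it.
stage_in() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# stage_out RESULT_DIR ARCHIVE_DIR: copy final results to long-term
# storage and remove the working copy only if the copy succeeded.
stage_out() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/ && rm -rf "$1"
}

# Example calls (placeholder directory names):
#   stage_in  ~/storage/run001/input  ~/fast/run001/input
#   ... run the analysis, writing into ~/fast/run001/results ...
#   stage_out ~/fast/run001/results   ~/storage/run001/results
```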

PBS-pro resource manager

PBSPro has been installed as queue manager and scheduler. Several queues have been configured; the `qstat -q` command lists them:

[cozzini@pbs-centos7 ~]$ qstat -q

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
blade              --      --    96:00:00   48     0     0   --   E R
fat                --      --    300:00:0   --     1     0   --   E R
thin               --      --    96:00:00   --    18     7   --   E R
gpu                --      --    300:00:0   --     0     0   --   E R
                                               ----- -----
                                                  19     7

Here are a few basic commands to submit jobs and check on them:

  • `qsub -q blade -l nodes=1:ppn=4 -I`
    • to submit a job in Interactive mode, on 4 cores
[cozzini@pbs-centos7 ~]$ qsub -q blade -l nodes=1:ppn=4 -I
qsub: waiting for job 1072.192.168.10.5 to start
qsub: job 1072.192.168.10.5 ready

[cozzini@ct1pf-fnode002 ~]$
  • `qsub -q blade -l nodes=4:ppn=24 job_test.sh`
    • to submit a job described in a shell script, on 96 cores
  • `qsub -q blade -l nodes=thin1:ppn=24+thin2:ppn=12,walltime=2:00:00 job_test.sh`
    • to submit a job described in a shell script, on 24 cores of node thin1 + 12 cores of thin2, with a walltime of 2 hours (default is 1 hour)
  • `tracejob [PBS_JOBID]`
    • to trace a job’s life
  • `qdel [PBS_JOBID]`
    • to delete a job
  • `qstat -a`
    • to view the current job queue

Please note that by default a job runs for at most one hour and is then removed. To request a longer time, set the walltime keyword:

$ qsub -q blade -l nodes=1:ppn=4 -l walltime=3:00:00  -I
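
The `job_test.sh` file used in the commands above is an ordinary shell script; resource requests can also be written inside it as `#PBS` directives instead of on the `qsub` command line. The following is a minimal sketch, not a site-provided template: the job name, the choice of the thin queue and the requested resources are illustrative assumptions, and the fallbacks only exist so that the script can also be tried outside PBS.

```shell
#!/bin/bash
#PBS -N job_test
#PBS -q thin
#PBS -l nodes=1:ppn=4
#PBS -l walltime=02:00:00

# PBS starts a job in the home directory; move to the directory
# from which qsub was invoked (or stay put when run outside PBS).
cd "${PBS_O_WORKDIR:-$PWD}"

# $PBS_NODEFILE holds one line per allocated core.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NPROCS=$(wc -l < "$PBS_NODEFILE")
    NPROCS=$((NPROCS))   # drop any padding added by wc
else
    NPROCS=1             # fallback when run outside PBS
fi

echo "Job ${PBS_JOBID:-none} running on $(uname -n) with $NPROCS core(s)"
```

Submitted with `qsub job_test.sh`, the directives take the place of the `-q` and `-l` options shown above.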

The processors allocated to a job are listed, one line per core, in the file pointed to by `$PBS_NODEFILE`:

[cozzini@pbs-centos7 ~]$  qsub -q thin -l nodes=4:ppn=3 -l walltime=24:00:00 -I
qsub: waiting for job 1075.192.168.10.5 to start
qsub: job 1075.192.168.10.5 ready

[cozzini@ct1pt-tnode001 ~]$ cat $PBS_NODEFILE
ct1pt-tnode001
ct1pt-tnode001
ct1pt-tnode001
ct1pt-tnode002
ct1pt-tnode002
ct1pt-tnode002
ct1pt-tnode004
ct1pt-tnode004
ct1pt-tnode004
ct1pt-tnode005
ct1pt-tnode005
ct1pt-tnode005
[cozzini@ct1pt-tnode001 ~]$
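
Since the node file repeats each host name once per allocated core, a per-node summary can be obtained with standard tools. A minimal sketch (the function name `cores_per_node` is made up for this example):

```shell
# cores_per_node FILE: print "<cores> <node>" for every node listed in FILE
cores_per_node() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Inside a job:
#   cores_per_node "$PBS_NODEFILE"    # cores allocated on each node
#   wc -l < "$PBS_NODEFILE"           # total number of allocated cores
```

For the session above this would report 3 cores on each of the four nodes, 12 in total.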

More information about the PBS Pro queue system can be found in the PBS Pro user manuals.

Scientific Software

Software is made available to users through Environment Modules, which modify a user’s environment dynamically via modulefiles.

Useful commands:

  • `module avail`
    • to list all the available software
[cozzini@ct1pt-tnode001 ~]$ module avail

-------------------------------------------------------------------- /opt/area/shared/modules/mpi --------------------------------------------------------------------
openmpi/4.0.3/gnu/4.8.5 (D)    openmpi/4.0.3/gnu/9.3.0

--------------------------------------------------------------- /opt/area/shared/modules/applications ----------------------------------------------------------------
python/3.7.7/gnu/4.8.5    python/3.8.2/gnu/4.8.5   conda/4.9.2     java/1.8.0

 ----------------------------------------------------------------- /opt/area/shared/modules/utilities -----------------------------------------------------------------
 hwloc/2.2.0    numactl/2.0.13

 ----------------------------------------------------------------- /opt/area/shared/modules/compilers -----------------------------------------------------------------
gnu/9.3.0

Where:
D:  Default Module
  • `module load` [name_of_software]
    • to load an available software package
  • `module whatis` [name_of_software]
    • to get information about a software package
  • `module list`
    • to list all the loaded modules
  • `module unload` [name_of_software]
    • to unload a software package
  • `module purge`
    • to unload all the previously loaded software

Please note that module files for libraries and applications are named according to the following convention:

openmpi/4.0.3/gnu/9.3.0
^^^^^^^ ^^^^^ ^^^ ^^^^^
|       |     |   |
|       |     |   |
|       |     |   +-> version of the compiler used to compile that software
|       |     |
|       |     +-> compiler used to compile that software
|       |
|       +-> Software version
|
+-> Software name
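
To make the convention concrete, a module name can be split at the slashes with the shell's own `read` builtin. This is just an illustrative sketch; `parse_module_name` is a hypothetical helper, not a command provided on the cluster.

```shell
# parse_module_name NAME: split a module name into its components
parse_module_name() {
    IFS=/ read -r name version compiler compiler_version <<EOF
$1
EOF
    echo "software=$name version=$version" \
         "compiler=${compiler:-none} compiler_version=${compiler_version:-none}"
}

parse_module_name openmpi/4.0.3/gnu/9.3.0
# software=openmpi version=4.0.3 compiler=gnu compiler_version=9.3.0
```

Names that carry no compiler part report `none` for the missing fields.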

Software for Genomics

A number of software packages for primary and secondary analysis of Next Generation Sequencing (NGS) data are available. Their module files are named according to the following convention:

fastqc/0.11.9
^^^^^^ ^^^^^^
|      |
|      +-> Software version
|
+-> Software name

NOTE: `module load [module name]` has the same effect as `module load [module name/default_version]`, where the default version of the module is flagged with (D) when multiple versions are available.

Part of the software has been installed as standard packages and is defined in the applications path, namely:

  • bwa/0.7.17
  • fastqc/0.11.9
  • samtools/1.11
  • htslib/1.11
  • bcftools/1.11
  • trimmomatic/0.39
  • picard/2.24.0

To use any of the above software it is sufficient to load the corresponding module. This makes all the executables and libraries of the software readily available in the current shell, and multiple modules can be loaded simultaneously. For instance:

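# load the modules (several can be loaded at the same time)
module load fastqc
module load picard

# quality control on 2 threads; the output directory must already exist
mkdir -p dir_fastqc
fastqc -t 2 forward_reads.fastq.gz reverse_reads.fastq.gz -o dir_fastqc
# convert the paired FASTQ files into an unmapped BAM file
java -jar $picard FastqToSam F1=dir_fastq/forward_reads.fastq F2=dir_fastq/reverse_reads.fastq \
 O=read_pairs_unmapped.bam SM=sample001 RG=rg0013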

Other software has been installed by means of conda environments:

  • cutadapt/2.10
  • trim_galore/0.6.6
  • gatk/4.1.9.0
  • multiqc/1.9

Loading one of these modules makes the commands of its conda environment available in the current shell.

IMPORTANT NOTE: it is NOT possible to load distinct conda environments simultaneously in the same shell.

Example:

module load cutadapt
module load gatk

# only module gatk is loaded (cutadapt automatically removed at gatk loading)
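
One way to use two such tools from the same script is to load each module in its own subshell, so that the environments never coexist. The helper below is a hedged sketch: `run_with_module` is a hypothetical name, and it assumes that the `module` function is defined in the calling shell and inherited by subshells, as is usual in interactive sessions.

```shell
# run_with_module MODULE COMMAND [ARGS...]
# The parentheses make the whole function body run in a subshell, so the
# module is loaded there and the calling shell's environment is untouched.
run_with_module() (
    mod="$1"
    shift
    module load "$mod" || exit 1
    "$@"
)

# Example calls (the commands are placeholders for a real analysis step):
#   run_with_module cutadapt cutadapt --version
#   run_with_module gatk gatk --version
```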