Ion Cluster

The Ion cluster consists of a head node (ion.cc.gatech.edu), 8 compute nodes, and 4 Nvidia S1070 Tesla units. The nodes are interconnected via a DDR InfiniBand fabric. Jobs on Ion are managed by the Torque resource manager.

Tech Specs

Compute Nodes

Each node (i1-i8) has the following hardware configuration:

  • 2 x Intel Xeon X5550 quad-core Nehalem processors @ 2.66GHz (Hyperthreading enabled)
  • 24GB Memory
  • 2 x Nvidia Tesla T10 boards (half of an S1070)
  • 250GB 7200RPM SATA hard drive
  • Mellanox ConnectX DDR InfiniBand HCA

Head Node

  • 2 x Intel Xeon X5550 quad-core Nehalem processors @ 2.66GHz (Hyperthreading enabled)
  • 12GB Memory
  • 2 x 250GB 7200RPM SATA hard drive (RAID 1)
  • Note: The head node contains no GPUs.

Software Available

  • CUDA Toolkit 2.3 (/usr/local/cuda)
  • Nvidia driver version 190.29 with CUDA and OpenCL support
  • Nvidia OpenCL Visual Profiler (/opt/openclprof1.0)
  • PETSc 3.0.0-p9 (/opt/petsc-3.0.0-p9)
  • Matlab 2009a (/opt/matlab2009a)
  • Intel C/C++/Fortran Compiler 11.1 (/opt/intel/Compiler/11.1/059)
  • FFTW 3.2.2 (/opt/fftw-3.2.2)
  • OpenMPI 1.3.3 (/opt/openmpi-1.3.3)
  • MPICH2 1.2.1 (/opt/mpich2-1.2.1)
  • CULA versions 1.3a and 2.0-preview (/opt/cula-<version>/, documentation in the doc/ directory for each version)
  • HPC Toolkit version 5.0.1, built with OpenMPI 1.4.3 (gcc 4.4) installed in /opt/hpctoolkit-5.0.1-r3440-gcc44 (http://hpctoolkit.org/)
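
For reference, code can be compiled directly with the toolchains at the paths listed above (per the usage policies below, compile on the head node). A minimal sketch, assuming hypothetical source files my_cuda_prog.cu and my_mpi_prog.c:

[user@ion ~]$ /usr/local/cuda/bin/nvcc -o my_cuda_prog my_cuda_prog.cu
[user@ion ~]$ /opt/openmpi-1.3.3/bin/mpicc -o my_mpi_prog my_mpi_prog.c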

Usage

Policies

  1. Please do not run memory- or CPU-intensive tasks on the head node (ion.cc), as this can negatively affect the performance and availability of the cluster for other users.
  2. It is recommended that you compile your code on ion.cc before submitting a job to one or more compute nodes.
  3. When running an interactive session, please log out as soon as your job is finished. If you are running a long job and will need to leave it unattended, please run it as a batch job with an appropriate walltime limit.

Job Submission

The Ion cluster runs the Torque Resource Manager with the Maui Cluster Scheduler. Users submit jobs to the compute nodes through the qsub command. Users cannot log into compute nodes directly without submitting a job.

Torque jobs come in two flavors: interactive and batch. Interactive jobs give you an interactive shell on the first node in your allocation. Batch jobs allow you to submit a shell script (optionally with qsub options given in special comments) to run your code on the compute nodes. Batch jobs are useful for long-running programs.

Batch Jobs

To run the executable ~/myprog on one cluster node, you could create a file called myjob.sh with the following content:

#PBS -l nodes=1
#PBS -l walltime=00:05:00
#PBS -N MyJobName
./myprog arg1 arg2 arg3

The #PBS lines provide instructions to the qsub command on how to run your job; alternatively, these options could be given directly on the command line. In this case, the job runs on a single node with a maximum wall time of 5 minutes (the job will be forcibly killed if it has not terminated on its own by then). The job's name, up to 15 non-whitespace characters provided with the -N option, is displayed in the output of the qstat command.
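
For example, assuming the script above is saved as myjob.sh in your home directory, it can be submitted as-is, or the same options can be passed to qsub explicitly:

[user@ion ~]$ qsub myjob.sh
[user@ion ~]$ qsub -l nodes=1 -l walltime=00:05:00 -N MyJobName myjob.sh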

Torque automatically redirects stdout and stderr to files, and once your job completes, it copies them to your home directory. If you provided a name for your job, these files will be named "jobname.ojobid" and "jobname.ejobid," respectively. For example, if you submitted "myjob.sh" as written above, and it was assigned the Torque job id 1135, the output and error files would be named "MyJobName.o1135" and "MyJobName.e1135," respectively.

If you were to run "myjob.sh" but without the line that provides the job name, the output and error files would be named "myjob.sh.o1135" and "myjob.sh.e1135," respectively. Again, this assumes that 1135 is the Torque job ID.

Interactive Jobs

If you need an interactive shell on a compute node to run your program, use qsub with the "-I" flag:

[user@ion ~]$ qsub -I -l nodes=1

This will give you an interactive shell on one of the nodes you reserved (1 node by default).
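
Additional resources are requested the same way as for batch jobs. For example (the values here are only illustrative), to get an interactive shell with two nodes for one hour:

[user@ion ~]$ qsub -I -l nodes=2 -l walltime=01:00:00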

MPI Jobs

The following directions assume you are using OpenMPI. The procedure for MPICH2 or Intel's MPI implementation should be similar.

You can run MPI jobs as standard Torque batch jobs. For example, to run the MPI program "myMPIprog" with 16 MPI processes per node on 3 nodes, submit the following script:

#PBS -q q
#PBS -l nodes=3
#PBS -l walltime=00:05:00
#PBS -N MyJobName
export OMPI_MCA_mpi_yield_when_idle=0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.3.3/lib
/opt/openmpi-1.3.3/bin/mpirun --hostfile $PBS_NODEFILE -np 48 ./myMPIprog

Torque automatically creates a file containing the list of nodes allocated to a job and stores the path to that file in the $PBS_NODEFILE environment variable. As with any other job, MPI jobs can be executed interactively by submitting a standard interactive job and running mpirun.
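
Rather than hard-coding the process count, you can derive it from $PBS_NODEFILE. A minimal sketch, assuming the node file lists each node once (as with the nodes=3 request above) and 16 processes per node:

#Compute the total process count from the Torque node file
NODES=`cat $PBS_NODEFILE | wc -l`
NPROCS=$((NODES * 16))
/opt/openmpi-1.3.3/bin/mpirun --hostfile $PBS_NODEFILE -np $NPROCS ./myMPIprog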

The next example runs the MPI program "myMPIprog" with 1 MPI process per node on 3 nodes:

#PBS -q q
#PBS -l nodes=3
#PBS -l walltime=00:05:00
#PBS -N MyJobName2
export OMPI_MCA_mpi_yield_when_idle=0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi-1.3.3/lib
/opt/openmpi-1.3.3/bin/mpirun --hostfile $PBS_NODEFILE -np 3 ./myMPIprog


Note: mpirun is unaware that each node has multiple processors, so running more than one process per node forces the MPI processes to run in "degraded" mode (see http://www.open-mpi.org/faq/?category=running#oversubscribing). To force them to run in "aggressive" mode, which is generally the correct behavior unless you are running more than 8 MPI processes per node, set the environment variable OMPI_MCA_mpi_yield_when_idle=0.
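
The same MCA parameter can also be passed directly on the mpirun command line instead of through the environment:

/opt/openmpi-1.3.3/bin/mpirun --mca mpi_yield_when_idle 0 --hostfile $PBS_NODEFILE -np 48 ./myMPIprog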


Using MPICH2

When running an MPI job using the MPICH2 implementation, it is necessary to start an mpd (multi-process daemon) on each of the nodes allocated to your job. To do this, you can modify the above sample script as follows:

#PBS -q q
#PBS -l nodes=3
#PBS -l walltime=00:05:00
#PBS -N MyJobName2
export LD_LIBRARY_PATH=/opt/mpich2-1.2.1/lib:$LD_LIBRARY_PATH
export PATH=/opt/mpich2-1.2.1/bin:$PATH
#Start the MPDs
mpdboot -n `cat $PBS_NODEFILE|wc -l` -f $PBS_NODEFILE -r ssh -m /opt/mpich2-1.2.1/bin/mpd &
#Wait for them to initialize
sleep 10
cd /path/of/your/program/
mpirun -np 3 ./myMPIprog arg0 arg1
#clean up MPDs
mpdallexit

Note that it is important that you *always* run your MPI executable using the mpirun/mpiexec that matches the MPI compiler wrappers used to build the program. Running, for example, a program compiled with Open MPI's mpicc command with MPICH2's mpirun command will not work.
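
If you are unsure which MPI library a dynamically linked executable was built against, you can inspect its shared-library dependencies; the path of the MPI library in the output shows whether it expects the Open MPI or MPICH2 installation:

[user@ion ~]$ ldd ./myMPIprog | grep -i libmpi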

More recent versions of MPICH2 use the Hydra process manager instead of mpd, which does not require the manual bootstrapping step. A newer version will be available shortly.

Matlab Jobs

You can also run Matlab jobs on a compute node. An example script that runs a Matlab m-file:

#PBS -l nodes=1
#PBS -l walltime=00:10:00
#PBS -N MyJobName
matlab -nodisplay -r matlabScript

The m-file matlabScript should be in Matlab's search path. It should also end with an exit (or quit) command; otherwise the Matlab session stays open after the script finishes and the job will run until it hits its walltime limit.
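
The -r option accepts an arbitrary Matlab command string, so you can also call a function with arguments and exit explicitly when it returns (myFunction and its arguments are placeholders):

#PBS -l nodes=1
#PBS -l walltime=00:10:00
#PBS -N MyMatlabJob
matlab -nodisplay -r "myFunction(1, 2); exit"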

Jacket 1.6

The Ion cluster now has Jacket (GPU acceleration for Matlab) installed. To use it, you should add the following to your PBS job script:

export LD_LIBRARY_PATH=/usr/local/jacket/engine/lib64:$LD_LIBRARY_PATH

Note that this library path must come before the CUDA library path in your LD_LIBRARY_PATH environment variable. For that reason, I would not recommend adding this to your shell profile unless you use the Ion cluster solely for Matlab jobs. You will also need to add the following line to your Matlab script:

addpath /usr/local/jacket/engine
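
Putting these pieces together, a complete Jacket job might look like the following sketch, where jacketScript is a placeholder m-file that performs the addpath above before calling Jacket functions:

#PBS -l nodes=1
#PBS -l walltime=00:10:00
#PBS -N MyJacketJob
#Jacket's libraries must precede the CUDA libraries
export LD_LIBRARY_PATH=/usr/local/jacket/engine/lib64:$LD_LIBRARY_PATH
matlab -nodisplay -r jacketScript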

Further documentation on Jacket is available on the AccelerEyes wiki.

Job Control

Checking the Status of Jobs and Queues

Another very useful Torque command is qstat, which provides information about jobs and queues. Below is a short list of common uses of qstat:

  • 'qstat': Show all running, queued, and held jobs.
  • 'qstat -a': Same as above, but provides more information about each job.
  • 'qstat -f': Same as above, but provides extremely verbose information about each job.
  • 'qstat jobid': same as 'qstat,' but show only the job with the specified ID.
  • 'qstat -f jobid': same as 'qstat -f,' but show detailed information only for the job with the specified ID.
  • 'qstat -q': List the Torque queues, showing the resource limits for each queue.
  • 'qstat -Q': List the Torque queues, showing job statistics for each queue.
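
To limit the listing to your own jobs, you can pass a user filter, for example:

[user@ion ~]$ qstat -u $USER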

Deleting Jobs

To delete a queued or running job, use the qdel command followed by the job ID. For example, to cancel job number 1573, type:

[user@ion ~]$ qdel 1573

Additional Information on Torque

For details on more advanced usage of the Torque Resource Manager, consult the Torque Commands Overview.

File System Layout

Each compute node has the following local world-writable file systems:

  • /tmp
    • 15GB
    • Cleaned periodically and after every reboot
  • /scratch
    • 100GB
    • Not backed up
    • Not automatically cleaned. Delete your files manually when you are finished with them; you can use 'qsub -I -l host=i<n>' (where n = 1...8) to get an interactive shell on the specific node where your files are stored. A sample script that stages data through /scratch follows this list.
    • Files in /scratch may be deleted at any time if /scratch gets too full.
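
For example, a batch job can stage its working data through the node-local /scratch file system and copy the results back to your home directory when it finishes (all file and directory names below are placeholders):

#PBS -l nodes=1
#PBS -l walltime=01:00:00
#PBS -N ScratchExample
#Stage input data to node-local scratch space
mkdir -p /scratch/$USER/myrun
cp ~/input.dat /scratch/$USER/myrun/
cd /scratch/$USER/myrun
#Run the program (assumed to write results.dat)
~/myprog input.dat
#Copy results home and clean up scratch
cp results.dat ~/
rm -rf /scratch/$USER/myrun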

Customization

Sample .bash_profile

Below is a sample .bash_profile that you can place into your home directory to configure your environment for icc, ifort, cuda, and OpenMPI:

. /opt/intel/Compiler/11.1/059/bin/iccvars.sh intel64
. /opt/intel/Compiler/11.1/059/bin/ifortvars.sh intel64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/opt/openmpi-1.3.3/lib
export PATH=${PATH}:/usr/local/cuda/bin:/opt/openmpi-1.3.3/bin
export MANPATH=${MANPATH}:/usr/local/cuda/man:/opt/openmpi-1.3.3/share/man
export INCLUDE=${INCLUDE}:/usr/local/cuda/include:/opt/openmpi-1.3.3/include