User Guide

From Cellbuzz


The Georgia Tech - IBM BladeCenter QS20 Cell Broadband Engine cluster

Through the support of the Sony-Toshiba-IBM (STI) Center of Competence for the Cell Broadband Engine Processor at Georgia Tech, directed by Prof. David A. Bader, a cluster of Cell/B.E. QS20 dual-Cell blades is available for community use and development of Cell/B.E. applications.


The Georgia Tech Cell/B.E. cluster contains a publicly accessible front-end and three BladeCenters housing 14 IBM BladeCenter QS20 and 6 IBM BladeCenter QS22 dual-Cell blades, named cell01 through cell20. The IBM BladeCenter QS20 is a Cell BE-based blade system designed for businesses that can benefit from high-performance computing power and the unique capabilities of the Cell BE processor to run graphics-intensive applications. It is especially suitable for computationally intense, high-performance workloads across a number of industries, including digital media, medical imaging, aerospace, defense, and communications.

An IBM BladeCenter QS20 blade features:

  • Two 3.2 GHz Cell BE processors
  • 1 GB XDRAM (512 MB per processor)
  • 410 GFLOPS peak performance
  • Blade-mounted 40 GB IDE hard disk drive
  • Two 1 Gb Ethernet (GbE) controllers that provide connectivity to the BladeCenter chassis midplane and BladeCenter GbE switches
  • BladeCenter interface that offers Blade Power System and Sense Logic Control
  • Double-wide blade (uses two BladeCenter slots)
  • InfiniBand (IB) option, supporting up to two Mellanox IB 4x Host Channel Adapters
  • Peak performance of 2.8 TFLOPS in a standard single-chassis configuration, and over 17 TFLOPS may be possible in a standard 42U rack

Account Request Form

Please request an account from our online Account Request Form. This form requires the applicant to accept the Georgia Tech computer user responsibilities.

Accessing the Cluster

The front-end system to the cluster is cell-user, which you can connect to using a standard ssh client. A tutorial on using ssh is available. The cell-user node contains the Cell SDK 3.1 build environment, compilers, editors, and tools.

Changing your Password

Strong passwords must be used on the Georgia Tech Cell Cluster. To change your password, log on to cell-user and type

[username@cell-user ~]$ passwd
Changing password for user username.
Enter login(LDAP) password: (Type your current password.)
New UNIX password:  (Type your new password.)
Retype new UNIX password: (Type your new password a second time.)
LDAP password information changed for username
passwd: all authentication tokens updated successfully.

Changing your Shell

The default shell is /bin/tcsh. To change your default shell, run the /usr/local/bin/ldapchsh script:

[chadh@cell-user ~]$ /usr/local/bin/ldapchsh /bin/bash
Changing shell to /bin/bash
Enter LDAP Password: <enter your password here>
modifying entry "uid=chadh,ou=People,dc=cell,dc=buzz"

[chadh@cell-user ~]$

Then log out and back in for the settings to take effect.

Compiling Programs

The complete Cell SDK is installed on cell-user, and the preferred method of compiling programs is to cross-compile them there. You can find ppu-gcc, spu-gcc, etc., installed in /opt/cell/toolchain/bin. IBM's XL C/C++ compiler is also installed on cell-user, and symlinks to its binaries (ppuxlc, spuxlc, etc.) are installed in /usr/bin.

Alternatively, you may compile programs natively on the cell nodes themselves. On the cell blades, ppu-gcc and spu-gcc are installed in /usr/bin, but the IBM XL compilers are not installed.
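As an illustration, a minimal cross-compile of a combined PPU/SPU program on cell-user might look like the following sketch. The source file names (hello_spu.c, hello_ppu.c) and the embedded symbol name are hypothetical; the toolchain paths are those given above.

```shell
# Hypothetical build recipe for a two-part Cell program on cell-user.

# Compile the SPU side with the SPU cross-compiler
/opt/cell/toolchain/bin/spu-gcc -O2 -o hello_spu hello_spu.c

# Embed the SPU executable into a PPU-linkable object
# (arguments: symbol name, SPU executable, output object)
/opt/cell/toolchain/bin/ppu-embedspu hello_spu hello_spu hello_spu-embed.o

# Compile the PPU side and link the embedded SPU code against libspe2
/opt/cell/toolchain/bin/ppu-gcc -O2 -o hello hello_ppu.c hello_spu-embed.o -lspe2
```

The resulting binary can then be copied to, or run from a shared directory on, the cell blades.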

Running Jobs

On the Cell cluster, we use the Torque batch queueing system with the Maui cluster scheduler for job management. The user submits jobs to Torque specifying the number of nodes to use, the amount of memory, and the length of time needed (and, possibly, other resources). Torque runs the job when the resources are available, and delivers the output back to the submitter.

Interactive Jobs

Users can launch an interactive session with dedicated access to Cell cluster nodes by running the command:

 [username@cell-user ~]$ qsub -I

By default, the user is given access to one BladeCenter node. To request more than one interactive node, the user can run the command:

 [username@cell-user ~]$ qsub -I -l nodes=num

where num is the number of nodes requested. The names of the nodes allocated to the user are listed in the file pointed to by the environment variable $PBS_NODEFILE.
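The node file contains one line per allocated slot, so a node name may appear more than once. The following sketch shows how to list and count the distinct blades allocated to a job; since the real $PBS_NODEFILE exists only inside a Torque job, it is simulated here with hypothetical node names.

```shell
# Simulate $PBS_NODEFILE (the real file exists only inside a running job;
# the node names below are hypothetical)
PBS_NODEFILE=$(mktemp)
printf 'cell03\ncell03\ncell07\ncell12\n' > "$PBS_NODEFILE"

# List the distinct blades allocated to the job
sort -u "$PBS_NODEFILE"

# Count them
sort -u "$PBS_NODEFILE" | wc -l

rm -f "$PBS_NODEFILE"
```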

Please exit from an interactive shell when the job completes to allow other users to gain access to the resource.

Batch Jobs

The easiest way to run a Torque batch job is to create a shell script. Torque directives are specified as comments in the script; in particular, all lines beginning with #PBS are Torque directives. After the Torque directives comes the body of the script: the commands that are executed when the script runs. Here is a sample Torque script:

  # Request one node for the job
  #PBS -l nodes=1
  # Request 10 minutes of wall-clock time for the job
  #PBS -l walltime=0:10:00
  # Run the job
  <your program>

To submit the above script to PBS, use the qsub command with the name of the script file:

  [username@cell-user ~]$ qsub <script name>

Torque responds with a job identifier (PBS_JOBID), which indicates that the script has been accepted.

Useful Torque commands:

1. Display the status of Torque batch jobs

  [username@cell-user ~]$ qstat -a

2. Show all running jobs on system

  [username@cell-user ~]$ qstat -r

3. Show detailed information of the specified job

  [username@cell-user ~]$ qstat -f PBS_JOBID

4. Show the status of all the nodes

  [username@cell-user ~]$ pbsnodes -a

5. Delete (cancel) a queued job

  [username@cell-user ~]$ qdel PBS_JOBID

For more information, please refer to the man pages on cell-user, or the Torque User's Manual or Torque Commands Overview.

MPI Jobs

To run an interactive OpenMPI job on more than one node, first request an interactive session on those nodes:

 [username@cell-user ~]$ qsub -I -l nodes=num

Then, execute mpirun, passing it the file named by the $PBS_NODEFILE environment variable as its host file. For example, to run an MPI program with 4 processes, do the following:

 [username@cell01 ~]$ mpirun --hostfile $PBS_NODEFILE -n 4 <mpi program>

If submitting a batch job, simply add the mpirun command to your script.
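For instance, a batch script for the 4-process run above might look like the following sketch; the program name ./my_mpi_app and the resource values are illustrative.

```shell
# Request four nodes and 30 minutes of wall-clock time (values are illustrative)
#PBS -l nodes=4
#PBS -l walltime=0:30:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Launch 4 MPI processes across the allocated nodes
mpirun --hostfile $PBS_NODEFILE -n 4 ./my_mpi_app
```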

Torque Queues for Cell SDK 3.0, SDK 3.1, and QS22 Blades

The CellBuzz cluster uses multiple job queues to allow users to select Cell blades running different versions of the Cell SDK. The default queue (sdk3.1) targets the QS20 blades running Cell SDK 3.1. The queue sdk3.0 targets QS20 blades running Cell SDK 3.0. To take advantage of the QS22 blades running SDK 3.1, use the qs22 queue. To select a non-default queue, add the -q argument to Torque commands. For example, to request an interactive session on a Cell blade running Cell SDK 3.0:

 [username@cell-user ~]$ qsub -I -q sdk3.0

Torque Queue for 32GB QS22 blades

Two QS22 blades have the maximum 32GB of memory installed, allowing the Cell processors to use their full memory bandwidth. If your job is memory-intensive, running it on these blades is recommended. To do so, submit the job to the qs22.32GB queue by passing the -q qs22.32GB argument to qsub.
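For example, an interactive session on one of the 32 GB QS22 blades can be requested as follows:

```shell
[username@cell-user ~]$ qsub -I -q qs22.32GB
```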

Acknowledging the use of Georgia Tech's CellBuzz Cell/B.E. BladeCenter Cluster

  • Please acknowledge the use of the Georgia Tech CellBuzz Cell/B.E. BladeCenter Cluster in any research report, journal, or publication that has benefited from access to these resources. The recognition of the Cell/B.E. resources is important for acquiring funding for the next generation of cyberinfrastructure. Our suggested acknowledgment is:
    • The authors acknowledge Georgia Institute of Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation, for the use of Cell Broadband Engine resources that have contributed to this research.

Contacting User Support

Users can email user support; a ticket will be created in the ticketing system.
