Torque and Maui: Submitting an MPI Job

From Debian Clusters

If you've installed and configured Torque and Maui, and you've run through the basic job submission sanity check, you're ready to break - I mean, test - your system with something a little more difficult. The basic jobs in the sanity check only used bash commands: commands built into Linux and the bash shell. While they're great for checking that the queuer and scheduler are up and running with each other, they're a little simplistic for getting real research done.

Nowadays, MPI is very commonly used for parallel programming. In fact, many parallel software packages like Gromacs or mpiBLAST are powered by MPI.

That's why the next step in testing out the queue is to make sure it works well with MPI. This will require that MPICH is installed on the head node and all of the worker nodes, and that MPICH has been set up for Torque by way of mpiexec or that you have a global MPD ring started.

Creating a Basic MPI Program

The first step is to write and compile a program that uses MPI. A "hello world" type program is ideal for this. I'll be borrowing one from the Bootable Cluster CD project. Follow the instructions on the Creating and Compiling an MPI Program page to create and compile one.

Creating a Basic Qsub Script

Next, we'll need to create a qsub script. Qsub won't accept a binary file directly, so you can't simply run qsub hello.out. If you do try that, you'll get an error message:

kwanous@gyrfalcon:~$ qsub hello.out
qsub:  file must be an ascii script

Most programs will need to be submitted via a qsub script. In fact, that's what allows Maui to do the scheduling that optimizes when jobs should run. There's more about qsub scripts at the Torque Qsub Scripts page, but for now, let's go with this:

#PBS -N mpi_hello
#PBS -l nodes=8
cd $PBS_O_WORKDIR

/shared/bin/mpiexec /shared/home/kwanous/mpi/hello.out

This script tells Torque to call this job "mpi_hello" and that it will need eight processors. Notice that the full path to the command to run (mpiexec) must be given, as well as the full path to the executable (hello.out).
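
If you end up writing several of these scripts, the pieces described above (job name, processor count, and the two full paths) can be filled in by a small shell function. This is only a sketch: make_mpi_script is a hypothetical helper name of my own, not something Torque provides, and it simply prints the same lines as the script above.

```shell
# Hypothetical helper (not part of Torque): print a qsub script for an MPI job.
# $1 = job name, $2 = processor count,
# $3 = full path to mpiexec, $4 = full path to the MPI executable.
make_mpi_script() {
    # Full paths are required for both commands, so refuse relative ones.
    case "$3" in
        /*) ;;
        *) echo "mpiexec path must be absolute: $3" >&2; return 1 ;;
    esac
    case "$4" in
        /*) ;;
        *) echo "executable path must be absolute: $4" >&2; return 1 ;;
    esac

    # Emit the same directives and commands as the script shown above.
    printf '#PBS -N %s\n' "$1"
    printf '#PBS -l nodes=%s\n' "$2"
    printf 'cd $PBS_O_WORKDIR\n\n'
    printf '%s %s\n' "$3" "$4"
}
```

Calling make_mpi_script mpi_hello 8 /shared/bin/mpiexec /shared/home/kwanous/mpi/hello.out and redirecting the output to a file (say mpi_hello.sh, a name I'm assuming here) gives you a script you can hand to qsub.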

Submit the job via

qsub <your script name>
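
When qsub accepts a job, it prints the job's identifier (a number followed by the server name), and Torque uses that number in the names of the job's output files. A minimal sketch of pulling the number out follows; job_number is a hypothetical helper name, and the script name in the comments is assumed.

```shell
# Hypothetical helper: strip the server name from a Torque job identifier,
# leaving only the leading job number (everything before the first dot).
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Usage on the cluster, assuming the script was saved as mpi_hello.sh:
#   jobid=$(qsub mpi_hello.sh)    # qsub prints the job identifier on stdout
#   qstat "$jobid"                # check the job's state in the queue
#   echo "output will be in mpi_hello.o$(job_number "$jobid")"
```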

You can watch the job the same way as in the first sanity check. When it finishes, you should have two new files in the directory you submitted the job from: mpi_hello.eXX and mpi_hello.oXX, where XX is the job number. If mpi_hello.eXX is not empty, use its contents to diagnose the problem. If everything ran successfully, you should see something like this in mpi_hello.oXX:

Hello MPI from the server process!Hello MPI!
 mesg from 1 of 8 on peregrine
Hello MPI!
 mesg from 2 of 8 on peregrine
Hello MPI!
 mesg from 3 of 8 on peregrine
Hello MPI!
 mesg from 4 of 8 on owl
Hello MPI!
 mesg from 5 of 8 on owl
Hello MPI!
 mesg from 6 of 8 on owl
Hello MPI!
 mesg from 7 of 8 on owl

The above shows that my MPI setup is working correctly: eight processors were allocated across two nodes, peregrine and owl, which have four processors each. Peregrine only has three hellos because the server process ran on one of its four processors.
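
Reading the two files by eye gets tedious with larger jobs, so the same checks can be scripted. The following is a rough sketch, not a definitive tool: check_mpi_job is a hypothetical helper name, and it assumes the output lines look like the ones from this particular hello program (a "mesg from" line ending in the host name).

```shell
# Hypothetical helper: summarize a finished job from its two output files.
# $1 = the .o (standard output) file, $2 = the .e (standard error) file.
check_mpi_job() {
    # [ -s file ] is true when the file exists and is not empty.
    if [ -s "$2" ]; then
        echo "errors reported in $2" >&2
        return 1
    fi

    # Count the greetings from the worker processes.
    echo "worker ranks: $(grep -c 'mesg from' "$1")"

    # Tally greetings per host; the host name is the last field on each line.
    grep 'mesg from' "$1" |
        awk '{ count[$NF]++ } END { for (h in count) print h, count[h] }' |
        sort
}
```

Run against the output shown above, this would report seven worker ranks, four on owl and three on peregrine.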
