Exercise 2: Deploying OpenHPC @ AWS

Deploying our Elastic OpenHPC system at AWS with CloudFormation (30 mins)

In Exercise 1, we built an AMI to use for our login and compute nodes. We will now use this along with an OpenHPC-provided controller node AMI to deploy our elastic cluster.

First, we are going to generate a new SSH key to use for our cluster access.

Generating cluster SSH key

  • Services > EC2 > Key Pairs > Create key pair
  • Name = cluster-sc20 (leave other settings as default)
  • Create key pair

Your new private key should automatically be downloaded by your Web browser.
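
If you prefer the command line, the same key pair can be created with the AWS CLI. This is a minimal sketch, assuming your CLI credentials and default region are already configured; it writes the private key directly to cluster-sc20.pem instead of downloading it through the browser.

$ aws ec2 create-key-pair --key-name cluster-sc20 \
    --query 'KeyMaterial' --output text > cluster-sc20.pem
$ chmod 400 cluster-sc20.pem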

Preparing for CloudFormation Deployment

Now, we are going to deploy the CloudFormation template that sets up our cluster. Before deploying, we need to update the template to reference the AMI we just built. The text editor vi is available by default; other text editors can be installed if preferred:

$ sudo dnf -y install emacs nano vim

Edit the centos8-slurm-x86_64.yml file in ~/SC20/cfn-templates/ and replace both instances of EX1-AMI with the AMI ID you generated with Packer in Exercise 1.

Note: AMI IDs are available via the EC2 dashboard: Console > Services > EC2 > Images/AMIs
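
They can also be listed from the command line; the query below is an illustrative sketch and assumes the AMIs live in the account and region you are currently working in.

$ aws ec2 describe-images --owners self \
    --query 'Images[].[ImageId,Name,CreationDate]' --output table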

$ cd ~/SC20/cfn-templates
$ vim centos8-slurm-x86_64.yml
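
If you would rather script the edit, a sed one-liner can fill in the placeholder. This assumes both EX1-AMI entries should receive the same AMI ID (the image built in Exercise 1); ami-0123456789abcdef0 is just a stand-in for your real value.

$ sed -i 's/EX1-AMI/ami-0123456789abcdef0/g' centos8-slurm-x86_64.yml
$ grep ami- centos8-slurm-x86_64.yml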

Once you populate the AMI entries in the CloudFormation template, you are ready to deploy.

Deploying the cluster with CloudFormation

$ aws cloudformation deploy --template-file centos8-slurm-x86_64.yml --capabilities CAPABILITY_IAM --stack-name sc20-1 --region us-east-1

You can monitor the status of the deployment with the CloudFormation dashboard.

Console > Services > CloudFormation > Click the Stack name > Events

Note: If you need to rerun the aws cloudformation deploy command, you'll need to either delete your stack first or increment the stack name (e.g. --stack-name sc20-2).
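
The stack status can also be polled from the command line. A minimal sketch, assuming the stack name and region used above:

$ aws cloudformation describe-stacks --stack-name sc20-1 --region us-east-1 \
    --query 'Stacks[0].StackStatus' --output text
$ aws cloudformation describe-stack-events --stack-name sc20-1 --region us-east-1 \
    --query 'StackEvents[0:5].[ResourceStatus,LogicalResourceId]' --output table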

If everything worked correctly, you’ll now be able to SSH into your login node using your “cluster-sc20” private key and the “centos” user account. You can identify the controller and login instances (and their DNS names or IP addresses) by accessing the EC2 page of your AWS console.

After the CloudFormation deployment command returns successfully, allow a few minutes for the Slurm configuration to complete before submitting jobs.

Accessing our cluster login node

First, we need to get the hostname of our login node from the EC2 console:

  • Console > Services > EC2 > Instances (running)
  • Right click Name=SlurmManagement > Connect > SSH client
  • Save the hostname to your clipboard (example: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com)
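
If you would rather pull the hostname from the CLI, something like the following should work; it assumes the instance carries the Name=SlurmManagement tag shown in the console, is running in us-east-1, and has a public DNS name.

$ aws ec2 describe-instances --region us-east-1 \
    --filters "Name=tag:Name,Values=SlurmManagement" "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].PublicDnsName' --output text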

Now, using the downloaded SSH private key and the login node hostname, we can access the login node:

$ cp ~/Downloads/cluster-sc20.pem .
$ chmod 400 cluster-sc20.pem
$ ssh -i "cluster-sc20.pem" centos@ec2-xx-xxx-x-xxx.us-xxxx-x.compute.amazonaws.com
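
Once logged in, it is worth confirming that Slurm is responding before submitting any work. This quick check is not part of the exercise itself, just a sanity test:

$ sinfo     # partitions and node states should be listed
$ squeue    # the queue should be empty at this point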

Testing our cluster

And finally, we can submit a test job:

$ cp /opt/ohpc/pub/examples/mpi/hello.c .
$ mpicc hello.c
$ cp /opt/ohpc/pub/examples/slurm/job.mpi .
$ sbatch job.mpi

and monitor the job with watch

$ watch -n 5 squeue
The job.mpi batch script contains the following:
#!/bin/bash

#SBATCH -J test               # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -N 2                  # Total number of nodes requested
#SBATCH -n 16                 # Total number of mpi tasks requested
#SBATCH -t 01:30:00           # Run time (hh:mm:ss) - 1.5 hours

# Launch MPI-based executable

prun ./a.out

Once the job is done, we can check the output to make sure everything worked correctly.

[centos@ip-192-168-0-200 ~]$ cat job.2.out 
[prun] Master compute host = ip-192-168-1-101
[prun] Resource manager = slurm
[prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./a.out (family=mpich)

 Hello, world (16 procs total)
    --> Process #   8 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #   9 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #  10 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #  11 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #  12 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #   0 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #  13 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #   1 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #  14 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #   4 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #  15 of  16 is alive. -> ip-192-168-1-102.us-east-1.compute.internal
    --> Process #   5 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #   6 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #   7 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #   2 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
    --> Process #   3 of  16 is alive. -> ip-192-168-1-101.us-east-1.compute.internal
[centos@ip-192-168-0-200 ~]$ 

If you get messages about “waiting for resources”, just be patient. :)
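
Because this is an elastic cluster, compute instances are typically launched on demand when a job is queued, so a short wait on the first submission is normal. If you want to see why a job is still pending, the commands below can help; the squeue format string is just an illustrative sketch, not part of the exercise.

$ squeue -o "%.8i %.9P %.10j %.2t %.10M %.20R"   # %R shows the pending reason / node list
$ sinfo -N -l                                    # per-node state listing as nodes come online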

That's it for Exercise 2. You can use this cluster for Exercise 3, which covers working with the OpenHPC software stack. If you are attending this tutorial live, you can use your provided "standalone cluster" account instead if needed.