Virtual Cluster in Red Cloud

This documentation describes how to deploy and use a virtual HPC cluster in Red Cloud. The virtual cluster uses the Slurm scheduler to dynamically launch Red Cloud instances to run batch or interactive jobs submitted to the queue, and deletes the instances when the jobs complete.

Deploy Virtual Cluster

Build It Yourself

Red Cloud subscribers can roll their own virtual clusters using the Ansible playbooks and instructions in the Slurm Cluster in Openstack GitHub repo. The playbooks deploy a single-user virtual cluster in Red Cloud.

Need Help or Additional Features

If you need help deploying the cluster or need additional features (e.g. multi-user login, custom software installation, file system export via SMB or Globus, etc.), email CAC Help with your requirements. For custom work on your virtual cluster, consulting rates apply in addition to your Red Cloud subscription.

Log into the Head Node

ssh to the IP address or hostname of the virtual cluster head node. If you rolled your own cluster using the Slurm Cluster in Openstack repo, ssh as the user image_init_user (default: cloud-user) with the ssh private key specified by ssh_private_keyfile, both defined in vars/main.yml.

If CAC deployed the virtual cluster with multi-user capability, ssh to the IP address or hostname of the head node using your CAC user name and password.
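
As a concrete sketch (the key path, IP address, host name, and user name below are placeholders to replace with your own values; cloud-user is the default image_init_user):

      # Self-built cluster: log in as image_init_user with the matching private key.
      ssh -i ~/.ssh/mykey cloud-user@<head-node-ip>

      # CAC-deployed multi-user cluster: log in with your CAC user name
      # (you will be prompted for your CAC password).
      ssh <cac-username>@<head-node-hostname>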

Launch Compute Nodes via Slurm

Users can submit batch jobs or request an interactive session on a compute node using Slurm. See the CAC Slurm documentation for more information on using the Slurm scheduler.

  • Batch jobs: use the sbatch command to submit batch jobs (see the example script after this list).
  • Interactive sessions:
    • srun --pty /bin/bash: start an interactive session on a shared compute node.
    • srun --exclusive --pty /bin/bash: start an interactive session on a dedicated (exclusive) compute node.
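
For reference, a minimal batch script might look like the following sketch. The job name, resource requests, and output file are illustrative choices, not values required by the virtual cluster; submit the script with sbatch.

      #!/bin/bash
      #SBATCH --job-name=hello           # job name shown by squeue
      #SBATCH --ntasks=1                 # a single task on one compute node
      #SBATCH --time=00:10:00            # wall-time limit (hh:mm:ss)
      #SBATCH --output=hello_%j.out      # stdout/stderr file; %j expands to the job ID

      # These commands run on the compute node instance that Slurm launches for the job.
      hostname
      echo "Hello from the Red Cloud virtual cluster"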

When a user first requests a compute node with the srun or sbatch command, the scheduler launches a new instance to run that compute node. It takes a couple of minutes for the cloud instance to launch. Once the compute node is up and running, an interactive user gets a shell prompt on the compute node, and a batch job begins executing there (the sbatch command itself returns to the head node prompt as soon as the job is submitted). Aside from the roughly 2-minute instance launch time, this is very much like how a typical HPC scheduler behaves.

After the user logs off a compute node (interactive use) or a batch job completes, the compute node instance stays idle for (by default) 5 minutes. If another job (interactive or batch) is submitted during that window, the idle compute node is scheduled to run it immediately, without the 2-minute launch delay, because the instance is already running. After idling for 5 minutes, the scheduler terminates the cloud instance, and subsequent job submissions again incur the 2-minute delay to launch a new instance.

Compute Node Status

Users can see which nodes are in use or idling with the sinfo command. For example:

  • No nodes are idling: the idle~ state denotes that no cloud instance is running for that node.

      [shl1@aclab-cluster-headnode ~]$ sinfo
      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
      cloud*       up   infinite     10  idle~ aclab-cluster-c-[0-9]
    
  • aclab-cluster-c-0 and aclab-cluster-c-1 are busy running jobs, as denoted by the alloc (allocated) state.

      [shl1@aclab-cluster-headnode ~]$ sinfo
      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
      cloud*       up   infinite      8  idle~ aclab-cluster-c-[2-9] 
      cloud*       up   infinite      2  alloc aclab-cluster-c-[0-1]
    
  • aclab-cluster-c-1 is running a job (alloc) and aclab-cluster-c-0 has an idling instance (idle).

      [shl1@aclab-cluster-headnode ~]$ sinfo
      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
      cloud*       up   infinite      8  idle~ aclab-cluster-c-[2-9] 
      cloud*       up   infinite      1  alloc aclab-cluster-c-1 
      cloud*       up   infinite      1   idle aclab-cluster-c-0
    
  • aclab-cluster-c-0 and aclab-cluster-c-1 are idling instances (idle).

      [shl1@aclab-cluster-headnode ~]$ sinfo
      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
      cloud*       up   infinite      8  idle~ aclab-cluster-c-[2-9] 
      cloud*       up   infinite      2   idle aclab-cluster-c-[0-1]
    

This 5-minute idle time is configurable. Your project is charged for an idling instance, so this parameter is a tradeoff between cost and convenience.
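
To check what idle timeout a running cluster is actually using, you can inspect Slurm's power-saving settings on the head node. In a standard Slurm elastic-computing setup the idle time corresponds to the SuspendTime parameter in slurm.conf; how the deployment playbooks name or expose this setting may differ, so treat the command below as a sketch.

      # Print the power-saving settings Slurm is currently running with (times are in seconds).
      scontrol show config | grep -i suspend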