Multi-Instance Tasks and Batch Shipyard
The focus of this article is to explain how multi-instance tasks work in the context of Azure Batch, Docker, and Batch Shipyard.
Overview
Multi-instance tasks are a special type of task in Azure Batch that are primarily targeted for an execution that requires multiple compute nodes in order to run. The canonical use case for multi-instance tasks are MPI jobs. MPI jobs are typically run on a cluster of nodes where each node participates in the execution by performing computation on a part of the problem and coordinates with other nodes to reach a solution.
Batch Shipyard helps users execute Dockerized MPI workloads by performing the necessary steps to stage the Docker container for the MPI job.
In-depth Concepts
MPI Runtime
Most popular MPI runtimes can operate with or without an integrated
distributed resource manager (launcher). In the case of Azure Batch on Linux,
the launcher is via remote shell. As the use of rsh
is generally deprecated,
launchers typically default to ssh
. The master node (i.e., the node from
where mpirun
or mpiexec
was invoked), will remote shell to all of the
other compute node hosts in the compute pool. In order for the master node to
know which nodes to connect to, runtimes typically require a host or node
list.
Once all of the nodes have been contacted and initialization of the MPI runtime is complete across all of the nodes, then the MPI application can execute.
Dockerized MPI Applications
Docker images for MPI applications are nearly the same as other non-MPI
applications. Outside of installing the necessary software required for
MPI to run, the difference is that images that use MPI must also install
SSH client/server software and enable the SSH server as the CMD
with a
port exposed to the host. Remember, the container will be running isolated
from the host, so if you attempt to connect to the SSH server running on the
host, the launcher will attempt to initialize on the host with an MPI runtime
that doesn't exist.
Mental Model
With the basics reviewed above, we can construct a mental model of the layout of how a Dockerized MPI program will execute.
The typical Docker package, distribute, deploy model usually ends up with an
image being run with docker run
. This is fine for a large majority of use
cases. However, for multi-instance tasks and MPI jobs in particular, a simple
docker run
is insufficient because most (all?) MPI runtimes do not know how
to connect and initialize MPI running within a container on other compute
nodes.
+---------------+ | MPI Program | | MPI Runtime | << Docker Container | SSH Server | +---------------+ +---------------+------------+ | Docker Daemon | SSH Server | +---------------+------------+ +===============+============+ | Operating System | +============================+ | Host or Virtual Machine | +============================+
The above layout shows the software stack from the host up through to the Docker container. The MPI application running inside the container will need to connect to other identical containers on other compute nodes.
Dockerized multi-instance tasks will use a combination of docker run
and
docker exec
. docker run
will create a running instance of the Docker image
and detach it with the SSH server running. Then the mpirun
or mpiexec
command will be executed inside the running container using docker exec
.
Azure Batch Compute Nodes and SSH
By default Azure Batch compute nodes have an SSH server running on them so users can connect remotely to their compute nodes. Internally, the system default SSH server is running on port 22. (However, this port is not mapped through the load balancer as an instance endpoint on port 22). This can lead to conflicts as described in the next section.
Dockerized MPI Applications and Azure Batch Compute Nodes
Because the internally mapped Docker container IP address is dynamic and
unknown to the Azure Batch service at job run time, the Docker image must
be run using the host networking stack so the host IP address is visible
to the running Docker container. This allows Docker MPI applications to
use AZ_BATCH_HOST_LIST
which the Azure Batch service populates with all
of the IP addresses of other compute nodes involved in the multi-instance
task.
Because of this host networking requirement, there will be a conflict
on the default SSH server port of 22. Thus, in the Dockerfile, the
SSH server in the CMD
command should be bound to an alternate port such
as port 23, along with the corresponding EXPOSE
directive.
SSH User and Options
Because Docker requires running containers as root, all Batch Shipyard jobs are invoked with elevated permissions. A side effect is that this makes setting up the SSH passwordless authentication a bit easier.
Docker images should contain the proper directives to generate an SSH
RSA public/private key pair (or alternatively COPY
a pair) and an
authorized_keys
file corresponding to the public key for the root user
within the Docker container. Additionally, ssh clients need to be transparently
directed to connect to the alternate port and ignore input prompts since these
programs will be run in non-interactive mode. If you cannot override your MPI
runtime remote shell options, you can use an SSH config file stored in the root
user's .ssh
directory alongside the keys:
Host 10.* Port 23 StrictHostKeyChecking no UserKnownHostsFile /dev/null
Note the host wildcard, if the virtual network subnet for the Azure Batch
compute nodes does not start with 10.* then this must be modified. In this
example port 23 is forced as the default for any ssh client connections to
destination hosts of 10.* which would match the EXPOSE
port in the Docker
image.
Multi-Instance Task Coordination Command
In an Azure Batch multi-instance task, a coordination command is executed on
all compute nodes of the task before the application command is run. As
described in the mental model above, a combination of docker run
and
docker exec
is used to execute an MPI job on Azure Batch with Docker.
The coordination command is the docker run
part of the process which creates
a running instance of the Docker image. If the CMD
directive in the Docker
image is not set (i.e., SSH server to execute), then an actual coordination
command should be supplied to Batch Shipyard.
Multi-Instance Task Application Command
The application command is the docker exec
portion of the Docker MPI
job execution with Batch Shipyard. This is typically a call to mpirun
or mpiexec
or a wrapper script that launches either mpirun
or mpiexec
.
Cleanup
As the Docker image is run in detached mode with docker run
, the container
will still be running after the application command completes. Currently,
there is no "clean" way to perform cleanup from the Azure Batch API.
However, by using the job auto-complete and job release facilities provided
by Azure Batch, Batch Shipyard can automatically stop and remove the Docker
container. By default, this behavior is not enabled automatically, however,
by specifying the auto_complete
property for your job to true
, your
multi-instance task will automatically be cleaned up for you, but limits
the number of multi-instance tasks per job to 1.
If you require or prefer more than one multi-instance task per job, you can
keep the auto_complete
setting to false
job specification of each job.
To manually cleanup after multi-instance tasks, there are helper methods in
the Batch Shipyard toolkit. These methods will aid in cleaning up compute nodes
involved in multi-instance tasks if they are needed to be reused for
additional jobs. Please refer to jobs cmi
and jobs cmi --delete
actions
in the Batch Shipyard Usage doc.
Automation!
Nearly all of the Docker runtime complexities are taken care of by the Batch Shipyard tooling. The user just needs to ensure that their MPI Docker images are either constructed with the aforementioned accommodations and/or are able to provide sufficient commands to the coordination/application commands to work with the Azure Batch compute node environment.
More Information
For more general information about MPI and Azure Batch, please visit this page.
Example recipes and samples
Please visit the recipes directory for multi-instance task samples.