Custom Images with Batch Shipyard
The focus of this article is to explain how to provision an ARM Image resource and then deploy it with Batch Shipyard as the VM image to use for your compute node hosts.
Background: Azure Resources and Azure Batch Custom Images
Azure Batch allows provisioning compute nodes with custom images with both Batch Service and User Subscription Batch accounts. This allows users to customize the compute node with software, settings, etc. that fit their use case. With containerization, this requirement is weakened but some users may still want to customize the host compute node environment with particular versions of software such as the Docker Host engine or pre-install and embed certain software.
Azure Batch only supports creating compute nodes from custom images through ARM Image resources. You can create ARM Images using existing page blob VHDs or exporting managed disks. You must create the Image in the same subscription and region as your Batch account.
Azure Active Directory Authentication Required
Azure Active Directory authentication is required for the batch
account
regardless of the account mode. This means that the
credentials configuration file
must include an aad
section with the appropriate options, including the
authentication method of your choosing.
Your service principal requires at least Contributor
role permission or a
custom role with the actions:
Microsoft.Compute/disks/beginGetAccess/action
Microsoft.Compute/images/read
Creating Images for use with Azure Batch and Batch Shipyard
It is recommended to use Packer to create a custom image that is an ARM Image resource compatible with Azure Batch and Batch Shipyard. You will need to create a preparation script and install all of the required software as outlined in the following sections of this guide.
Note: Currently creating an ARM Image directly with Packer can only be used with User Subscription Batch accounts. For standard Batch Service pool allocation mode Batch accounts, Packer will need to create a VHD first, then you will need to import the VHD to an ARM Image. Please follow the appropriate path that matches your Batch account pool allocation mode.
Creating an ARM Image Directly with Packer
The contrib
area of the repository contain example packer
scripts to create an custom
image directly as an ARM Image from an existing Marketplace platform image.
In the packer JSON file (e.g., build.json
), ensure that you have defined
the properties:
managed_image_name
which is the name of the ARM Imagemanaged_image_name_resource_group
which places the ARM Image in the specified resource group
The resource group should be in the same region as your Batch account.
After you have created your ARM image with Packer, skip to
Step 3 below to retrieve the ARM Image Id that is required for
populating the arm_image_id
property in the pool configuration file.
Creating a VHD with Packer
The contrib
area of the repository contain example packer
scripts to create a custom
image as a page blob VHD from an existing Marketplace platform image.
In the packer JSON file (e.g., build-vhd.json
), ensure that you have
defined the properties:
resource_group_name
which places the VHD in the specified resource groupstorage_account
defines which storage account to usecapture_container_name
defines the container to place the VHD intocapture_name_prefix
defines the prefix for the VHD page blob
The resource group should be in the same region as your Batch account.
After you have created your VHD with Packer, then follow the steps below to create an ARM Image from this VHD.
Importing a VHD to an ARM Image via Azure Portal
The following will step you through creating an ARM Image from an existing page blob VHD source through the Azure Portal.
Step 1: Navigate to the ARM Image Blade
On the left, you should see an Images
option. If you do not, hit
More services >
on the bottom left and type Images
in the search box.
Step 2: Create an Image
Select + Add
to bring up the create image blade:
In the Create image blade, assign a Name
to your image. Ensure that
the Subscription
and Location
associated with this image is in the same
subscription and location as your Batch account. Select the proper OS type
and then click Browse
to select the source Storage blob. If you do not
see the Storage Account with your source VHD, you must copy it into an ARM
Storage Account within the same Location
that you have selected.
You can use blobxfer or
AzCopy
to copy your page blob VHDs if they are in a different region than your
Batch account. Select the proper Account type
if you are using HDD or
SSD backed VMs and then hit Create
to finish the process.
Step 3: Retrieve the ARM Image Resource Id
After the Image has been created, navigate to the Images
blade, select
your image and then click on Overview
. In the Overview blade, at the
bottom, you should see a RESOURCE ID
. Click on the copy button to copy
the ARM Image Resource Id into your clipboard. This is the value you should
use for the arm_image_id
in the pool configuration file.
Provisioning a Custom Image
You will need to ensure that your custom image is sufficiently prepared before using it as a source VHD for Batch Shipyard. The following sub-section will detail the reasons and requisites.
Inbound Traffic and Ports
Azure Batch requires communication with each compute node for basic functionality such as task scheduling and health monitoring. If you have a software firewall enabled on your custom image, please ensure that inbound TCP traffic is allowed on ports 29876 and 29877. Port 22 for TCP traffic should also be allowed for SSH. Note that Azure Batch will apply the necessary inbound security rules to ports 29876 and 29877 through a Network Security Group on either the virtual network or each network interface of the compute nodes to block traffic that does not originate from the Azure Batch service. Port 22, however, will be allowed from any source address in the Network Security Group. You can optionally reduce the allowable inbound address space for SSH on your software firewall rules or through the Azure Batch created Network Security Group applied to compute nodes.
Outbound Traffic and Ports
Azure Batch compute nodes must be able to communicate with Azure Storage servers. Please ensure that oubound TCP traffic is allowed on port 443 for HTTPS connections.
Ephemeral (Temporary) Disk
Azure VMs have ephemeral temporary local disks attached to them which are
not persisted back to Azure Storage. Azure Batch utilizes this space for some
system data and also to store task data for execution. It is important
not to change this location and leave the default as-is (i.e., do not
change the value of ResourceDisk.MountPoint
in waagent.conf
).
Batch Shipyard Node Preparation and Custom Images
For non-custom images (i.e., platform images or Marketplace images), Batch Shipyard takes care of preparing the compute node with the necessary software in order for tasks to run with Batch Shipyard.
Because custom images can muddy the assumptions with what is available or not in the operating system, Batch Shipyard requires that the user prepare the custom image with the necessary software and only attempts to modify items that are needed for functionality. Software that is required is checked during compute node preparation.
Base Required Software
Docker Host Engine
The Docker host engine must be installed and must
be invocable as root with default path and permissions. The Docker socket
(/var/run/docker.sock
) must be available (it is available by default).
Important Note: If you have modified the Docker Root directory to mount on the node local temporary disk, then you must disable the service to run on boot due to potential races with the disk not being set up before the service starts. Batch Shipyard will take care of properly starting the service on boot.
SSH Server
An SSH server should be installed and operational on port 22. You can limit inbound connections through the Batch service deployed NSG on the virtual network or network interface (and/or through the software firewall on the host).
GPU-enabled Compute Nodes
In order to utilize the GPUs available on compute nodes that have them (e.g., N-series VMs), the NVIDIA driver must be installed and loaded upon boot.
Additionally, nvidia-docker2 must be installed.
Infiniband/RDMA-enabled Compute Nodes
The host VM Infiniband/RDMA stack must be enabled with the proper drivers and the required user-land software for Infiniband installed. It is best to base a custom image off of the existing Azure platform images that support Infiniband/RDMA.
Storage Cluster Auto-Linking and Mounting
If mounting a storage cluster, the required NFSv4 or GlusterFS client tooling must be installed and invocable such that the auto-link mount functionality is operable. Both clients need not be installed unless you are mounting both types of storage clusters.
GlusterFS On Compute
If a GlusterFS on compute shared data volume is required, then GlusterFS server and client tooling must be installed and invocable so the shared data volume can be created amongst the compute nodes.
Azure Blob Shared Data Volume
If mounting an Azure Blob Storage container, you will need to install the blobfuse software.
Installed/Configured Software
Batch Shipyard may install and/or configure a minimal amount of software to enusre that components and directives work as intended.
Encryption Certificates and Credential Decryption
If employing credential encryption, Batch Shipyard will exercise the necessary logic to decrypt any encrypted field if credential encryption is enabled. Properties in the global configuration should be enabled as per requirements as if deploying a non-Custom Image-based compute node.
Batch Shipyard Docker Images
Batch Shipyard Docker images required for functionality on the compute node will be automatically installed.
Singularity Container Runtime
Batch Shipyard will install and configure the Singularity Continer Runtime on Ubuntu and CentOS/RHEL hosts.
Allocating a Pool with a Custom Image
When allocating a compute pool with a custom image, you must ensure the following:
- The ARM Image is in the same subscription and region as your Batch account.
- You are specifying the proper
aad
settings in your credentials configuration file forbatch
(or "globally" in the credentials file). -
Your pool specification has the proper
vm_configuration
settings forcustom_image
.- The
arm_image_id
points to a valid ARM Image resource node_agent
is populated with the correct node agent sku id which corresponds to the distribution used in the custom image. For instance, if your custom image is based on Ubuntu 16.04, you would usebatch.node.ubuntu 16.04
as thenode_agent
value. You can view a complete list of supported node agent sku ids with thepool listskus
command.
- The
ARM Image Retention Requirements
Ensure that the ARM image exists for the lifetimes of any pool referencing the custom image. Failure to do so can result in pool allocation failures and/or resize failures.