Slurm on Batch with Batch Shipyard
The focus of this article is to explain the Slurm on Batch functionality in Batch Shipyard and how to effectively deploy your workload for traditional lift-and-shift scheduling while leveraging some Platform-as-a-Service capabilities of Azure Batch.
The Slurm workload manager is an open-source job scheduler that is widely used among many institutional and supercomputing sites. Azure Batch provides an abstraction for managing lower-layer VM complexities and automated recovery through Batch pools. Batch Shipyard provides an integration between Slurm and Batch pools where the Slurm cluster controller and login nodes are provisioned and connected to compute nodes in Batch pools in an on-demand fashion.
Some users or organizations may prefer Slurm-native tooling and execution workflows, which are not currently possible with Azure Batch, whether due to workflow familiarity or existing investments in the ecosystem. Additionally, Azure Batch may not provide some of the rich job scheduling and accounting functionality available in Slurm that may be required for some organizational workflows. Moreover, requirements such as standing up a separate VM for CycleCloud or managing the underlying Slurm compute node infrastructure may not be amenable for some users or organizations.
Slurm on Batch with Batch Shipyard aims to combine the advantages of both worlds: the Slurm scheduler with the platform benefits of Azure Batch compute node orchestration and management. Key features include:
- Simple and automated Slurm cluster deployment and management including support for on-demand suspend and restart of cluster resources
- Automatic HA support of Slurm controllers and the ability to create multiple login nodes
- Ability to specify arbitrary elastic partitions which may be comprised of a heterogeneous mixture of Batch pools
- Automatic linking of shared file systems (RemoteFS clusters) between all Slurm resources
- Support for concurrent dedicated and low priority compute nodes within partitions
- Automatic feature tagging of nodes, including VM size and capabilities
- Automatic generic resource configuration for GPU VMs (see the submission sketch following this list)
- Automatic on-demand resizing of compute node resources including user-specified idle reclaim timeouts and node reclaim exclusion filters
- Support for custom preparation scripts on all Slurm resources
- Goal-seeking engine to recover from compute node allocation failures
- Default cluster user SSH is linked to login nodes and compute nodes for easy logins and file access across non-controller resources
- Supports most Batch Shipyard configuration options on the pool, including distributed scratch, container runtime installations, monitoring integration, shared file system mounting, automatic GPU setup, etc.
- Supports joining pre-existing partitions and nodes which may be on-premises with elastic on-demand nodes
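For instance, once nodes carry automatic feature tags and GPU generic resources, standard Slurm submission flags can target them directly. The feature value and script names below are hypothetical.

```shell
# constrain a job to nodes carrying an automatically tagged feature (e.g., VM size)
sbatch --constraint=<vm_size_feature> job.sh

# request one GPU via the automatically configured generic resource
sbatch --gres=gpu:1 gpu_job.sh
```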
Slurm Dynamic Node Allocation and Deallocation
A Slurm cluster on Batch with Batch Shipyard utilizes the Slurm Elastic Computing (Cloud Bursting) functionality, which is built on Slurm's Power Save capabilities. In a nutshell, Slurm resumes nodes when they are needed to process jobs and suspends nodes once they are no longer required (i.e., relinquishes them back to the cloud).
When Slurm decides that new nodes should be provisioned, its `ResumeProgram` triggers the Batch Shipyard Slurm Helper, which allocates compute nodes in the appropriate Batch pool targeting a specific Azure region. Batch Shipyard handles the complexity of node name assignment, host DNS registration, and ensuring the controller updates the node information with the appropriate IP address. When Slurm decides that nodes should be removed via its `SuspendProgram`, the Batch Shipyard Slurm Helper deallocates these nodes in their respective pools and releases the node names back for availability.
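Conceptually, this wiring lives in Slurm's power save settings in `slurm.conf`. The excerpt below is only an illustrative sketch: Batch Shipyard generates the actual configuration, and the program paths, node names, and values shown here are hypothetical stand-ins mapped to the Batch Shipyard options discussed later (`idle_reclaim_time`, `reclaim_exclude_num_nodes`).

```
# illustrative slurm.conf power save wiring (hypothetical paths/values)
ResumeProgram=/opt/slurm/resume.sh     # invokes the Batch Shipyard Slurm Helper to allocate nodes
SuspendProgram=/opt/slurm/suspend.sh   # invokes the helper to deallocate nodes
SuspendTime=900                        # idle seconds before suspend (cf. idle_reclaim_time)
ResumeTimeout=1200                     # max seconds for a resumed node to become ready
SuspendExcNodes=slurm-p1-n[0-1]        # nodes excluded from reclaim (cf. reclaim_exclude_num_nodes)
```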
Batch Pools as Slurm Compute Nodes
A Batch Shipyard provisioned Slurm cluster is built on top of different resources in Azure. To more readily explain the concepts that form a Batch Shipyard Slurm cluster, let's start with a high-level conceptual layout of all of the components and possible interactions.
(Conceptual diagram: within a single Virtual Network (10.0.0.0/8), the Batch Shipyard RemoteFS VMs reside in Subnet A (10.0.1.0/24); the Slurm controller nodes, running `slurmctld` and the Batch Shipyard Slurm Helper, in Subnet B (10.0.2.0/24); the login nodes with the Slurm client tools in Subnet C (10.0.3.0/24), which cluster users reach over SSH; and the Azure Batch compute nodes running `slurmd` in Subnet D (10.1.0.0/16). The Slurm Helper communicates with Azure Resource Manager, Azure Batch, and Azure Storage using MSI.)
The base layer for all of the resources within a Slurm cluster on Batch is an Azure Virtual Network. This virtual network can be shared amongst other network-level resources such as network interfaces. The virtual network can be "partitioned" into sub-address spaces through the use of subnets. In the example above, we have four subnets: Subnet A (10.0.1.0/24) hosts the Batch Shipyard RemoteFS shared file system, Subnet B (10.0.2.0/24) contains the Slurm controller nodes, Subnet C (10.0.3.0/24) contains the login nodes, and Subnet D (10.1.0.0/16) contains a pool or a collection of pools of Azure Batch compute nodes that serve as dynamically allocated Slurm compute nodes.
One (or more) RemoteFS shared file systems can be used as a common file system between login nodes and the Slurm compute nodes (provisioned as Batch compute nodes). One of these file systems is also designated to store `slurmctld` state for HA/failover for standby Slurm controller nodes. Cluster users log in to the Slurm cluster via the login nodes, where the shared file system is mounted and the Slurm client tools are installed; jobs submitted there are dispatched to the Slurm controller.

The Slurm configuration and munge keys are propagated to the provisioned compute nodes in Batch pools, along with mounting the appropriate RemoteFS shared file systems. Once these nodes are provisioned and idle, the node information is updated on the controller nodes to be available for Slurm job scheduling. When Slurm signals that nodes are no longer needed, the Batch Shipyard Slurm Helper translates the Slurm node names back to Batch compute node IDs and deprovisions them appropriately.
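From a user's perspective on a login node, this lifecycle is visible through standard Slurm tooling. In the hypothetical session below, `partition_1` matches the partition name used in the configuration example later in this article; powered-down elastic nodes appear with an `idle~` state suffix.

```shell
# list nodes in the elastic partition; "idle~" denotes powered-down cloud nodes
sinfo --partition=partition_1

# submitting a job triggers the resume path, which allocates Batch compute nodes
sbatch --partition=partition_1 --ntasks=1 --wrap="hostname"
```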
Some Slurm logs, notably the Batch Shipyard Slurm Helper logs, Slurm power save logs, and `slurmd` logs, are stored on an Azure File Share for easy viewing to debug issues. `slurmctld` logs are stored locally with each controller node.
The following is a brief walkthrough of configuring a Slurm on Batch cluster with Batch Shipyard.
Azure Active Directory Authentication Required
Azure Active Directory authentication is required to create a Slurm cluster. When executing either the `slurm cluster create` or `slurm cluster orchestrate` command, your service principal must be at least `Owner`, or a custom role that does not prohibit the following action along with the ability to create/read/write resources for the subscription:

`Microsoft.Authorization/*/Write`

This action is required to enable Azure Managed Service Identity on the Batch Shipyard Slurm Helper, which runs on the controller nodes.
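For reference, a minimal `management` section of the Batch Shipyard credentials file using a service principal might look like the following sketch. All identifiers are placeholders; consult the Batch Shipyard credentials configuration documentation for the authoritative schema.

```yaml
credentials:
  management:
    subscription_id: <subscription id>
    aad:
      directory_id: <AAD tenant (directory) id>
      application_id: <service principal application id>
      auth_key: <service principal key>  # certificate-based auth is an alternative
```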
The configuration for a Slurm cluster with Batch Shipyard is generally composed of two major parts: the Slurm configuration and the normal global and pool configurations.
Slurm Cluster Configuration
The Slurm cluster configuration is defined by a Slurm configuration file. Please refer to the full Slurm cluster configuration documentation for more detailed explanations of each option and for those not shown below.
Conceptually, this file consists of five major parts:
```yaml
slurm:
  # 1. define general settings
  storage_account_settings: mystorageaccount
  location: <Azure region, e.g., eastus>
  resource_group: my-slurm-rg
  cluster_id: slurm
  # 2. define controller node settings
  controller:
    ssh: # SSH access/user to the controller nodes, independent of other resources
    public_ip: # ...
    virtual_network: # Virtual Network should be the same for all resources, with a differing subnet
    network_security: # Optional, but recommended network security rules
    vm_size: # appropriate VM size
    vm_count: # Number greater than 1 will create an HA Slurm cluster
  # 3. define login node settings
  login:
    ssh: # The cluster user SSH and username settings
    public_ip: # ...
    virtual_network: # Virtual Network should be the same for all resources, with a differing subnet
    network_security: # Optional, but recommended network security rules
    vm_size: # appropriate VM size
    vm_count: # Number greater than 1 will create multiple login nodes
  # 4. define shared file systems
  shared_data_volumes:
    nfs_server: # Batch Shipyard RemoteFS storage cluster id
      mount_path: # The mount path across all Slurm resources
      store_slurmctld_state: # at least one shared data volume must set this to true
  # 5. define Slurm options
  slurm_options:
    idle_reclaim_time: # amount of idle time before Slurm issues suspend on nodes
    elastic_partitions: # define Slurm elastic cloud bursting partitions
      partition_1: # name of partition
        batch_pools:
          mypool1: # pool id, must be pre-allocated with zero nodes
            account_service_url: https://... # currently this must be the same as the Batch account specified in config.yaml
            compute_node_type: # dedicated or low_priority nodes
            max_compute_nodes: # maximum number of VMs to allocate
            weight: # Slurm weight
            features: # arbitrary constraint sequence
            reclaim_exclude_num_nodes: # number of nodes to exclude from idle reclaim.
              # Once allocated, these number of nodes are not reclaimed.
          # can define multiple pools
        max_runtime_limit: # maximum runtime for jobs in this partition
        default: # is the default partition, one partition must have this set to true
      unmanaged_partitions: # for pre-existing partitions (or on-prem partitions)
```
Global configuration should contain the appropriate RemoteFS shared file system/data volumes that are to be used across all Slurm resources under `shared_data_volumes`. More than one RemoteFS shared data volume can be specified. Optionally, if your workload will be container driven, you can specify image pre-loads here as per normal convention under `global_resources`.
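As an illustration, a hypothetical `config.yaml` fragment that pre-loads a container image and references a RemoteFS shared data volume might look like the sketch below; the storage account link, image name, and mount path are placeholders.

```yaml
batch_shipyard:
  storage_account_settings: mystorageaccount
global_resources:
  docker_images:
    - alpine:latest  # hypothetical image to pre-load on all compute nodes
  volumes:
    shared_data_volumes:
      nfs_server:  # should match the shared data volume id in the Slurm configuration
        volume_driver: storage_cluster
        container_path: /shared  # hypothetical path exposed inside containers
```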
Pool configuration should be used to create all necessary pools used for Slurm elastic partitions beforehand. This file is not explicitly used by `slurm cluster create`, but it is used by `slurm cluster orchestrate` when orchestrating a Slurm cluster with one pool. If not utilizing the orchestrate command, then it is required to create pools individually before issuing `slurm cluster create`. Most pool properties apply with no modifications for Slurm clusters. By default, all Slurm nodes have Docker installed. Do not use `native` container support pools for Slurm compute nodes.
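Under those constraints, a hypothetical multi-pool deployment flow with the Batch Shipyard CLI might look like the following; the configuration directory layout is an assumption, and each pool must be allocated with zero target nodes.

```shell
# allocate a Batch pool referenced by elastic_partitions (zero target nodes);
# repeat with each pool's configuration for multi-pool partitions
shipyard pool add --configdir config

# then provision the Slurm controller and login nodes and link the pools
shipyard slurm cluster create --configdir config
```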
Limitations
This is a non-exhaustive list of potential limitations when using the Slurm on Batch feature in Batch Shipyard.
- All pools must reside under the Batch account linked to the global configuration. This limitation will be lifted at a later date.
- Shared file system (shared data volume) support is currently limited to supported RemoteFS provisioned storage clusters: NFS and GlusterFS.
- LDAP for centralized user control is not implemented, but can be customized via the `additional_prep_script` option on the `login` section of the Slurm configuration file and using `additional_node_prep` for compute nodes.
- PAM-based auth restrictions for preventing users from logging into compute nodes without a running job is not yet implemented.
- An action aggregator in the Batch Shipyard Slurm Helper that would improve resize operation performance is not yet implemented.
- Network Security Groups (NSGs) should permit communication between Slurm resources for all required communication channels and ports (see the sketch following this list).
- Ensure that you have sufficient core and pool quota for your Batch account. Please note that all quotas (except for the number of Batch accounts per region per subscription) apply to each individual Batch account separately. User subscription based Batch accounts share the underlying subscription regional core quotas.
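As an illustration of the NSG guidance above, a hypothetical `network_security` fragment for the controller section is sketched below. The rule name, address prefixes, and port range are assumptions (6817-6818 are the default `slurmctld`/`slurmd` ports); see the Slurm cluster configuration documentation for the authoritative schema.

```yaml
controller:
  network_security:
    ssh:
      - <trusted admin CIDR>  # restrict SSH to known sources
    custom_inbound_rules:
      slurm_traffic:  # hypothetical rule name
        destination_port_range: 6817-6818  # default slurmctld/slurmd ports
        protocol: tcp
        source_address_prefix:
          - 10.0.0.0/8  # allow Slurm traffic from within the virtual network
```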
Please see the sample Slurm recipe for a working example.