# Remote Filesystems with Batch Shipyard
The focus of this article is to explain how to provision a standalone file server or storage cluster for use as a shared file system.
**Note:** Creating a standalone remote filesystem with Batch Shipyard is independent of Azure Batch and all Batch-related functionality in Batch Shipyard. You may create a filesystem in Azure with Batch Shipyard and manage it with the tooling in Batch Shipyard or in the Azure Portal without having to create an Azure Batch account. However, if you do want to use this filesystem with the Azure Batch service, many convenience features are present to make using such filesystems relatively painless with your jobs and tasks.
## Overview
The ability to have a shared file system that all compute nodes can access is vital to many HPC and batch processing workloads. Azure Batch provides simple mechanisms for scaling a workload, but most non-trivial compute tasks require access to shared data, whether that is something as simple as configuration files or a more complicated scenario such as shared model files or shared output for validation.
Some scenarios are natively handled by Azure Batch through resource files or through Batch Shipyard with data ingress. However, there are many scenarios where this is insufficient and only a real shared file system will suffice.
Batch Shipyard includes support for automatically provisioning an entire file server with attached disks or a GlusterFS storage cluster for both scale up and scale out scenarios.
## Major Features
- Support for multiple file server types: NFS or GlusterFS
- Support for SMB/CIFS on top of NFS or GlusterFS mountpoints to enable file sharing to Windows clients
- Automatic provisioning of all required resources for the storage cluster including managed disks, virtual networks, subnets, network interfaces, IP addresses and DNS labels, network security groups, availability sets, virtual machines and extensions
- Suite of commandline tooling for cluster management including zero-downtime disk array expansion and storage cluster resize (scale out), status queries tailored to file server types and hassle-free SSH for administration
- Support for cluster suspension (deallocation) and restart
- Support for defining and managing multiple clusters simultaneously
- Support for btrfs, XFS, ext4, ext3 and ext2 filesystems
- Automatic disk array construction via RAID-0 through btrfs or Linux software RAID (mdadm)
- Consistent private IP address allocation per virtual machine and virtual machine to disk mapping
- Automatic network security rule configuration based on file server type, if requested
- Automatic placement in an availability set for GlusterFS virtual machines
- Support for accelerated networking
- Automatic boot diagnostics enablement and support for serial console access
- Automatic SSH keypair provisioning and setup for all file servers in storage cluster
- Configuration-driven data ingress support via scp and rsync+ssh, including concurrent multi-node parallel transfers with GlusterFS storage clusters
## Azure Batch Integration Features
- Automatic linking between Azure Batch pools (compute nodes) and Batch Shipyard provisioned remote filesystems
- Support for mounting multiple disparate Batch Shipyard provisioned remote filesystems concurrently to the same pool and compute nodes
- Automatic failover for HA GlusterFS volume file lookups (compute node client mount) through remote filesystem deployment walk to find disparate upgrade and fault domains of the GlusterFS servers
- Automatic volume mounting of remote filesystems into a Docker container executed through Batch Shipyard
## Mental Model
A Batch Shipyard provisioned remote filesystem is built on top of different resources in Azure. These resources are from networking, storage and compute. To more readily explain the concepts that form a Batch Shipyard standalone storage cluster, let's start with a high-level conceptual layout of all of the components and possible interacting actors.
*(Conceptual diagram: an Azure Virtual Network `10.0.0.0/8` containing `Subnet A 10.0.0.0/24` with `Virtual Machine A` and `Virtual Machine B` acting as GlusterFS Servers 0 and 1, each holding its brick data on a RAID-0 array of attached data disks, each with a private IP (`10.0.0.4`, `10.0.0.5`) and a public IP (`1.2.3.4`, `1.2.3.5`) that an external client may mount through if allowed; `Subnet B 10.1.0.0/16` with an Azure Batch compute node (`10.1.0.4`) mounting the volume; and `Subnet C 10.2.1.0/24` with another Azure virtual machine (`10.2.1.4`).)*
The base layer for all of the resources within a standalone provisioned filesystem is an Azure Virtual Network. This virtual network can be shared amongst other network-level resources such as network interfaces. The virtual network can be "partitioned" into sub-address spaces through the use of subnets. In the example above, we have three subnets where `Subnet A 10.0.0.0/24` hosts the GlusterFS infrastructure, `Subnet B 10.1.0.0/16` contains a pool of Azure Batch compute nodes, and `Subnet C 10.2.1.0/24` contains other Azure virtual machines. No resource in `Subnet B` or `Subnet C` is required for the Batch Shipyard provisioned filesystem to work; they simply illustrate that other resources can access the filesystem within the same virtual network if configured to do so.
If your configuration is NFS instead, then the above illustration would be simplified to a single virtual machine (`Virtual Machine A` in `Subnet A 10.0.0.0/24`) only. However, other non-GlusterFS specific concepts still apply regarding other Azure resources.
The storage cluster depicted is a 2-node GlusterFS distributed file system with attached disks. Each node has a number of managed disks attached to it arranged in a RAID-0 disk array. The array (and ultimately the filesystem sitting on top of the disk array) holds the GlusterFS brick for the virtual machine. Because the managed disks are backed by Azure Storage with locally redundant storage (LRS), there is no practical need for mirroring or striping at this level.
For each Azure Virtual Machine hosting a brick of the GlusterFS server, two IP addresses are provisioned, in addition to a fully qualified domain name that resolves to the public IP address. The public IP address allows for external clients to SSH into the virtual machine for diagnostic, maintenance, debugging, data transfer and other tasks. The SSH inbound network security rule can be tightened according to your requirements in the configuration file. Additionally, inbound network rules can be applied to allow the filesystem to be mounted externally as well. The private IP address is an address that is only internally routable on the virtual network. Resources that are on the virtual network will/should use these IP addresses for access to the filesystem.
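As a sketch, the inbound rules might look like the following in the storage cluster's `network_security` configuration; the CIDR values here are hypothetical and should be replaced with your own networks:

```yaml
# hypothetical network_security settings for a storage cluster
# (addresses are illustrative only; adjust to your environment)
network_security:
  ssh:
    - 203.0.113.0/24   # restrict inbound SSH to an admin network
  nfs:
    - 203.0.113.7      # allow one external client to mount NFS
```

Omitting the external mount rule entirely keeps filesystem traffic private to the virtual network.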
And finally, when provisioning GlusterFS servers, rather than NFS servers, Batch Shipyard automatically places the virtual machines in an availability set along with maximally spreading virtual machines across update and fault domains. Single instance NFS servers will not be placed in an availability set; however, if using a premium storage virtual machine size along with all premium disks, then you may qualify for a single instance SLA.
## Configuration
In order to create storage clusters, there are a few configuration changes that must be made to enable this feature.
### Azure Active Directory Authentication Required
Azure Active Directory authentication is required to create storage clusters. Additionally, if leveraging integration features with Batch pools, then the virtual network shared between the storage cluster and the Batch pool must be the same.
Your service principal requires at least the `Contributor` role permission in order to create the resources required for the storage cluster.
### Credentials Configuration
The following is an example for Azure Active Directory authentication in the credentials configuration.
```yaml
credentials:
  # management settings required with aad auth
  management:
    aad:
      # valid aad settings (or at the global level)
    subscription_id: # subscription id required
    # ... other required settings
```
### RemoteFS Configuration
Please see this page for a full explanation of each remote filesystem and storage cluster configuration option.
The following will step through and explain the major configuration portions. The RemoteFS configuration file has four top-level properties:
```yaml
remote_fs:
  resource_group: # resource group for all resources, can be overridden
  location: # Azure region for all storage cluster resources
  managed_disks: # disk settings
  storage_clusters: # storage cluster settings
```
It is important to specify a location that is appropriate for your storage cluster; if joining to a Batch pool, it must be within the same region as the pool.
### Managed Disks Configuration
The `managed_disks` section describes disks to be created for use with storage clusters.
```yaml
managed_disks:
  resource_group: # optional resource group just for the disks
  # premium disks have provisioned IOPS and can provide higher throughput
  # and lower latency with consistency. If selecting premium disks,
  # you must use a premium storage compatible vm_size.
  premium: true
  disk_size_gb: # size of the disk, please see Azure Managed Disks docs
  disk_names:
    - # list of disk names
```
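For example, a minimal sketch that provisions two premium disks; the resource group, disk names, and size below are illustrative only:

```yaml
remote_fs:
  resource_group: my-remotefs-rg   # hypothetical resource group
  location: eastus
  managed_disks:
    premium: true
    disk_size_gb: 1023
    disk_names:
      - p30-disk0a
      - p30-disk1a
```

These disk names are then referenced from the `vm_disk_map` of the storage cluster configuration.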
### Storage Cluster Configuration
The `storage_clusters` section describes one or more storage clusters to create and manage.
```yaml
storage_clusters:
  # unique name of the storage cluster, this is the "storage cluster id"
  mystoragecluster:
    resource_group: # optional resource group just for the storage cluster
    hostname_prefix: # hostname prefix and prefix for all resources created
    ssh: # ssh settings
    public_ip:
      enabled: # true or false for enabling public ip. If public ip is not
               # enabled, then it is only accessible via the private network.
      static: # true or false if public ip should be static
    virtual_network:
      # virtual network settings. If joining to a Batch pool, ensure that
      # the virtual network resides in the same region and subscription
      # as the Batch account. It is recommended that the storage cluster
      # is in a different subnet than that of the Batch pool.
    network_security:
      # network security rules, only "ssh" is required. All other settings
      # are for external access and not needed for joining with Batch pools
      # as traffic remains private/internal only for that scenario.
    file_server:
      type: # nfs or glusterfs
      mountpoint: # the mountpoint on the storage cluster nodes
      mount_options:
        - # fstab mount options in list format
      server_options:
        glusterfs:
          # this section is only needed for "glusterfs" type
          transport: tcp # tcp is only supported for now
          volume_name: # name of the gluster volume
          volume_type: # type of volume to create. This must be compatible
                       # with the number of bricks.
          # other key:value pair tuning options can be specified here
        nfs:
          # this section is only needed for "nfs" type
          # key:value (where value is a list) mapping of /etc/exports options
      samba:
        # optional section, if samba server setup is required
    vm_count: # 1 for nfs, 2+ for glusterfs
    vm_size: # Azure VM size to use. This must be a premium storage
             # compatible size if using premium managed disks.
    fault_domains: # optional tuning for the number of fault domains
    accelerated_networking: # true to enable accelerated networking
    vm_disk_map:
      # cardinal mapping of VMs to their disk arrays, e.g.:
      '0': # note that this key must be a string
        disk_array:
          - # list of disks in this disk array
        filesystem: # filesystem to use, see documentation on available kinds
        raid_level: # this should be set to 0 if disk_array has more than 1
                    # disk. If disk_array has only 1 disk, then this property
                    # should be omitted.
    prometheus: # optional monitoring settings
```
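To make the schema concrete, the following is a sketch of a single-node NFS server with a two-disk RAID-0 btrfs array; all names, addresses, and the VM size below are illustrative assumptions, not prescriptive values:

```yaml
storage_clusters:
  mystoragecluster:
    hostname_prefix: mystoragecluster
    ssh:
      username: shipyardadmin        # hypothetical admin username
    public_ip:
      enabled: true
      static: false
    virtual_network:
      name: myvnet
      address_space: 10.0.0.0/8
      subnet:
        name: my-server-subnet
        address_prefix: 10.0.0.0/24
    network_security:
      ssh:
        - '*'                        # tighten this in production
    file_server:
      type: nfs
      mountpoint: /data
    vm_count: 1
    vm_size: STANDARD_DS4_V2         # premium storage compatible size
    vm_disk_map:
      '0':
        disk_array:
          - p30-disk0a
          - p30-disk1a
        filesystem: btrfs
        raid_level: 0
```

A GlusterFS variant would set `vm_count` to 2 or more, `type` to `glusterfs`, and add the `server_options` `glusterfs` section.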
## Batch Pool Integration
If you wish to use your storage cluster in conjunction with a Batch pool, then you will need to modify the credentials, global, pool, and jobs configuration files.
### Credentials Configuration
Azure Active Directory authentication for Batch is required for joining a storage cluster with a Batch pool.
```yaml
credentials:
  # batch aad settings required for joining storage clusters with batch pools
  batch:
    aad:
      # valid aad settings (or at the global level)
    account_service_url: # valid batch service url
    resource_group: # batch account resource group
  management:
    aad:
      # valid aad settings (or at the global level)
    subscription_id: # subscription id required
    # ... other required settings
```
### Global Configuration
You must specify the storage cluster under `global_resources` so that bound Batch pools will provision the correct software to mount the storage cluster.
```yaml
# ... other global configuration settings
global_resources:
  # ... other global resources settings
  volumes:
    shared_data_volumes:
      mystoragecluster:
        # this name must match exactly with the storage cluster
        # id from the RemoteFS configuration that you intend
        # to link
        volume_driver: storage_cluster
        container_path: # the path to mount this storage cluster in
                        # containers when jobs/tasks execute
        mount_options: # optional fstab mount options
        bind_options: # optional bind options to the container,
                      # default is "rw"
```
### Pool Configuration
The pool configuration file must specify a valid virtual network. Because of this requirement, you must use Azure Active Directory authentication for Batch.
```yaml
pool_specification:
  # ... other pool settings
  virtual_network:
    # virtual network settings must have the same virtual network as the
    # RemoteFS configuration. However, it is strongly recommended to have
    # the Batch pool compute nodes reside in a different subnet.
```
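For illustration, a sketch of a pool that shares a virtual network with a storage cluster but resides in its own subnet; the names and address prefixes below are hypothetical:

```yaml
pool_specification:
  # ... other pool settings
  virtual_network:
    name: myvnet                     # same vnet as the storage cluster
    address_space: 10.0.0.0/8
    subnet:
      name: my-batch-pool-subnet     # separate subnet for compute nodes
      address_prefix: 10.1.0.0/16
```

Size the pool subnet's address prefix to accommodate the maximum number of compute nodes you intend to allocate.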
### Jobs Configuration
The jobs configuration must refer to the shared data volume such that it understands to mount the volume into the container for the task or all tasks under a job.
```yaml
job_specifications:
  - id: # job id
    shared_data_volumes:
      # this name must match exactly with the global_resources
      # shared_data_volumes name. If specified at the job level, then all
      # tasks under the job will mount this volume.
      - mystoragecluster
    # ... other job settings
    tasks:
      - shared_data_volumes:
          - # storage cluster can be specified for fine grained control at
            # a per task level
        # ... other task settings
```
## Usage Documentation
The workflow for creating a storage cluster is first creating the managed disks, then the storage cluster itself. Below is an example command usage.
```shell
# create managed disks
shipyard fs disks add
# create storage cluster
shipyard fs cluster add <storage-cluster-id>
```
If there were provisioning errors during `fs cluster add` but the provisioning had not yet reached the VM creation phase, you can remove the orphaned resources with:

```shell
# clean up a failed provisioning that did not reach VM creation
shipyard fs cluster del <storage-cluster-id> --generate-from-prefix
```

If any VMs were created and the provisioning failed after that, you can delete normally (without `--generate-from-prefix`).
Once there is no need for the storage cluster, you can either suspend it or delete it. Note that suspending a GlusterFS storage cluster is considered experimental.
```shell
# suspend a storage cluster
shipyard fs cluster suspend <storage-cluster-id>
# restart a suspended storage cluster
shipyard fs cluster start <storage-cluster-id>
# delete a storage cluster
shipyard fs cluster del <storage-cluster-id>
```
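A few other `fs cluster` subcommands are useful for day-to-day management; the following is a sketch, assuming the `status` and `ssh` subcommands available in your Batch Shipyard version:

```shell
# query the provisioning and filesystem status of the cluster
shipyard fs cluster status <storage-cluster-id>
# ssh into a storage cluster node for administration
shipyard fs cluster ssh <storage-cluster-id>
```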
Please see this page for detailed documentation on `fs` command usage.
## Usage with Batch Pools
If joining to a Batch pool, the storage cluster must be created first.
After which, commands such as `pool add` and `jobs add` should work normally, with the storage cluster mounted into containers if the configuration is correct.
## Sample Recipes
Sample recipes for RemoteFS storage clusters of NFS and GlusterFS types can be found in the recipes area.