Enhanced support for autogenerated task id schemas
(#324)
SR-IOV IB/RDMA support for Ubuntu-based images
Support for CentOS/CentOS-HPC 7.7
Support Hyper-V Gen2 boot
Ubuntu 16.04 SR-IOV IB/RDMA Packer script
Changed
Breaking Change: the singularity_images property in the global
configuration has been modified to accommodate encrypted container images.
Please see the global configuration doc for more information.
Breaking Change: non-native pools using STANDARD_NC24rs_v3 will now
default to using SR-IOV IB/RDMA settings; native pools using this VM size
will rely on the Azure Batch runtime to bind the correct container settings.
Please see the
Azure Batch Guidance issue
for more information.
Improve Azure blob/fileshare mount logic and add retries
Set proper FI_PROVIDER env var for intel-ofi MPI runtime selection
Updated Docker CE to 19.03.5
Updated Singularity to 3.5.0
Updated NC/ND driver to 418.87.01, NV driver to 430.46
Updated LIS to 4.3.4
Updated blobxfer to 1.9.4
Updated Python to 3.7.5 for pre-built binaries
Updated dependencies to latest, where applicable
Update OSUMicroBenchmarks recipe to MVAPICH-2.3.2
Fixed
Fix task output_data to correctly honor virtual directories in remote
paths for native pools
(#313)
Fix blobfuse not properly remounting on reboot
(#320)
Removed
Support for CentOS 7.5, CentOS-HPC 7.1/7.3, and WindowsServerSemiAnnual
Datacenter-Core-1709-with-Containers-smalldisk/Datacenter-Core-1803-with-Containers-smalldisk
blobxfer program output for handling input_data and output_data on
non-native pools is now captured to a separate file named
blobxfer-download.log and blobxfer-upload.log, respectively. This
prevents pollution of the stdout/stderr streams by the data transfer
phases.
Updated Docker CE to 19.03.2
Updated NC/ND driver to 418.87.00
Updated blobxfer to 1.9.2
Updated dependencies
Fixed
Fix prefix filter not being applied on task factory remote_path
(#303)
Fix non-string pickling in recurring job definitions
(#306)
Fix potential null values on node error collections and node agent info
on preempted nodes
(#307,
#309)
Fix task termination for infinite retry tasks and in non-native mode over
SSH (#308)
Fix non-native data transfer sequence coupling
(#310)
Prevent job submission on pools without task runner
(#312)
Fix task output_data with include filters for native pools
(#313)
Fix downloading of cascade logs on start task failure
Update documentation regarding AAD and subscription id requirements
along with better error messages
(#305)
Update documentation regarding Windows vs Linux environment
variables
(#311)
Revamped Singularity support, including support for Singularity 3,
SIF images, and pull support from ACR registries for SIF images via ORAS.
Please see the global and jobs configuration docs for more information.
(#146)
New MPI interface in jobs configuration for seamless multi-instance task
executions with automatic configuration for SR-IOV RDMA VM sizes with support
for popular MPI runtimes including OpenMPI, MPICH, Intel MPI, and MVAPICH
(#287)
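As a rough illustration, a multi-instance task using the new MPI interface might be specified as follows. This is a sketch only: the image name and the member property names (e.g., runtime, processes_per_node) are assumptions, so consult the jobs configuration doc for the authoritative schema.

```yaml
# Hypothetical sketch of the new MPI interface in a jobs configuration;
# property names below are illustrative, not the authoritative schema.
job_specifications:
- id: mpi-job
  tasks:
  - docker_image: myregistry/mympiapp:latest   # placeholder image
    multi_instance:
      num_instances: pool_current_dedicated
      mpi:
        runtime: openmpi         # e.g., openmpi, mpich, intelmpi, mvapich
        processes_per_node: 1
    command: /opt/app/run-solver  # placeholder command
```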
Support for bring your own public IP addresses on Batch pools.
Please see the pool configuration doc and the
Virtual Networks and Public IPs guide
for more information.
Support for Shared Image Gallery for custom images
Support for CentOS HPC 7.6 native conversion
Additional Slurm configuration options
New recipes: mpiBench across various configurations,
OpenFOAM-Infiniband-OpenMPI, OSUMicroBenchmarks-Infiniband-MVAPICH
Changed
Breaking Change: jobs cannot be submitted against pre-3.8.0 pools.
Pools must be re-created with 3.8.0 or later.
Breaking Change: the singularity_images property in the global
configuration has been modified to accommodate Singularity 3 support.
Please see the global configuration doc for more information.
(#146)
Breaking Change: the gpu property in the jobs configuration has
been changed to gpus to accommodate the new native GPU execution
support in Docker 19.03. Please see the jobs configuration doc for
more information.
(#293)
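A minimal sketch of the renamed property in a jobs configuration follows; the value shown is an assumption based on Docker 19.03's native GPU support, so see the jobs configuration doc for the accepted values.

```yaml
# Before (deprecated):
#   gpu: true
# After (sketch; accepted values are documented in the jobs doc):
job_specifications:
- id: gpu-job
  tasks:
  - docker_image: nvidia/cuda:10.0-base
    gpus: all          # illustrative value, not confirmed by this changelog
    command: nvidia-smi
```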
pool images commands now support Singularity
Non-native task execution is now proxied via script
(#235)
Batch Shipyard images have been migrated to the Microsoft Container Registry
(#278)
Updated Docker CE to 19.03.1
Updated blobxfer to 1.9.0
Updated LIS to 4.3.3
Updated NC/ND driver to 418.67, NV driver to 430.30
Updated Batch Insights to 1.3.0
Updated dependencies to latest, where applicable
Updated Python to 3.7.4 for pre-built binaries
Updated Docker images to use Alpine 3.10
Various recipe updates to showcase the new MPI schema, HPLinpack and HPCG
updates to SR-IOV RDMA VM sizes
Fixed
Cargo Batch service client update missed
(#274, #296)
Premium File Shares were not enumerating correctly with AAD
(#294)
Per-job autoscratch setup failing for more than 2 nodes
Breaking Change: the additional_node_prep_commands property has
been migrated under the new additional_node_prep property as
commands (#252)
Performance improvements to speed up job submission with large task
factories or large numbers of tasks. Verbosity of task generation progress
has been increased and can be modified with -v.
force_enable_task_dependencies property in jobs configuration to turn
on task dependencies on a job even when no task dependencies are present
initially. This is useful when tasks are added at a later time that may
have dependencies. Please consult the jobs documentation for more information.
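A minimal sketch of this option in a jobs configuration (the job id is a placeholder):

```yaml
job_specifications:
- id: long-running-job
  # enable task dependencies even though no tasks currently declare them,
  # so tasks added later can depend on earlier ones
  force_enable_task_dependencies: true
```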
Windows Server 2019 support
Genomics and Bioinformatics recipes: BLAST and RNASeq
Kata containers support: run containers on Linux compute nodes with a higher
level of isolation through lightweight VMs. Please see the pool doc for more
information.
Per-job distributed scratch space support: create on-demand scratch
space shared between tasks of a job which can be particularly useful for MPI
and multi-instance tasks without having to manage a GlusterFS-on-compute
shared data volume. Please see both the pool doc and jobs doc for more
information.
Add restrict_default_bind_mounts option to jobs specifications. This
will restrict automatic host directory bindings to the container filesystem
only to $AZ_BATCH_TASK_DIR. This is particularly useful in combination with
container runtimes enforcing VM-level isolation such as Kata containers.
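A minimal sketch of this option in a jobs configuration (the job id is a placeholder):

```yaml
job_specifications:
- id: isolated-job
  # restrict automatic host bind mounts so that only $AZ_BATCH_TASK_DIR
  # is mapped into the container filesystem
  restrict_default_bind_mounts: true
```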
Allow installation and selection of multiple container runtimes along with
a default container runtime for Docker invocations. Please see the pool doc
for more information under container_runtimes.
Support for Standard SSD and Ultra SSD managed disks for RemoteFS clusters.
In conjunction with this change, Availability Zone support has been added
for managed disks and storage cluster VMs. Please see the relevant
documentation for more information.
Changed
Breaking Change: the premium property under remote_fs:managed_disks
has been replaced with sku. Please see the RemoteFS configuration doc for
more information.
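A sketch of the replacement, assuming a premium disk; the sku value shown is illustrative only, so check the RemoteFS configuration doc for the valid sku names.

```yaml
remote_fs:
  managed_disks:
    # replaces the former "premium: true/false" boolean with an explicit sku
    sku: premium_ssd   # illustrative value; see the RemoteFS doc for valid skus
```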
Breaking Change: the Singularity container runtime is no longer installed
by default, please see the pool doc to configure pools to install Singularity
as needed under container_runtimes:install.
Renamed MADL recipe to HPMLA
Updated NC/ND driver to 410.72 with CUDA 10 support
Updated blobxfer to 1.5.4
Updated LIS, Prometheus, and Grafana
Updated other dependencies to latest
Updated binary builds and Windows Docker images to Python 3.7.1
CentOS 7.5 and Microsoft Windows Server semi-annual
datacenter-core-1803-with-containers-smalldisk host support. Please see
the platform image support doc for more information.
fallback_registry to improve robustness during provisioning when Docker
Hub has an outage or is degraded
(#215, #217)
misc mirror-images command to help mirror Batch Shipyard system
images to the designated fallback registry
Support for XFS filesystem in storage clusters (#218)
Experimental support for disk array RAID expansion for mdadm-based devices
via fs cluster expand
Option to auto-upload Batch compute node service logs on unusable (#216)
Microsoft Azure Distributed Linear Learner recipe (#195)
Changed
pool nodes list can now filter nodes with start task failed and/or
unusable states
diag logs upload command can generate a read only SAS for the target
container via --generate-sas
storage clear and storage del now allow multiple --poolid arguments
along with --diagnostics-logs to clear/delete diagnostics logs containers
storage sas create now allows container and file share level SAS creation
along with --list permission now available as an option
Pools failing to allocate with unusable or start task failed nodes will now
dump a listing of problematic nodes detailing the error
Updated RemoteFS storage clusters using GlusterFS and Ubuntu/Debian-based
GlusterFS-on-compute to 4.1
Breaking Change: You can no longer specify both an account_key
and aad within the batch section of the credentials config. The prior
behavior was that account_key would take precedence over aad. Now
these options are mutually exclusive. This will now break configurations
that specified aad at the global level while having a shared account_key
at the batch level. (#197)
Breaking Change: install.sh now installs into a virtual env by
default. Use the -u switch to retain the old (non-recommended)
default behavior. (#200)
GlusterFS for RemoteFS and gluster on compute updated to 4.0.
Update NC driver to 396.26 supporting CUDA 9.2
blobxfer updated to 1.2.1
Fixed
Errant credentials check for configuration from commandline which affected
config load from KeyVault
Blobxfer extra options regression
Cache container/file share creations for data egress (#211)
Output to JSON for a subset of commands via --raw command line switch.
JSON output is directed to stdout. Please see the usage doc for which commands
are supported and important information regarding output stability. (#177)
Allow AAD on the storage section in the credentials configuration.
Please see the credential doc for more information. (#179)
Boot diagnostics are now enabled for all VMs provisioned for RemoteFS
clusters. This also enables serial console support in the portal. (#193)
product_iterables task factory support. Please see the task factory
doc for more information. (#187)
default_working_dir option at the job and task level to set the
default working directory when a container executes as a task. Please
see the jobs doc for more information. (#190)
--no-generate-tunnel-script option to pool nodes grls
Changed
Greatly improve throughput speed of many commands that internally iterated
sequences of actions (#188)
RemoteFS clusters provisioned using Ubuntu 18.04-LTS
(#161, #185)
Update Nvidia NC driver to 390.46 supporting CUDA 9.1
Support for adding network access rules to the remote access port (SSH or
RDP). Please see the pool configuration guide for more details.
Support for adding certificate references to a pool. Please see the
pool configuration guide for more details. Also please see below for
improvements to the cert command.
Support for
NCv3 VM sizes.
Note that ND/NCv2/NCv3 all require separate quota approval; please raise a
ticket through the Azure Portal.
Support for uploading Batch compute node service logs to the specified
Azure storage account used by Batch Shipyard. Please see the
diag logs upload command in the usage docs.
Support for fine-tuning /etc/exports when creating NFS file servers via
server_options and nfs. Please see the remote FS configuration doc
for more information.
Support for job-level default task exit condition options. These options
can be overridden on a per-task basis. Please see the job configuration doc
for more information.
Changed
Improve cert commands
Support adding arbitrary cer, pem and pfx certificates to a Batch
account via command line options
Support deleting arbitrary certificates by thumbprint, including
multiple at once; also ask for confirmation before deleting
Support creating pem/pfx pairs with cert create without having
to define an encryption section in the global configuration for
use in scenarios outside of credential encryption
depends_on and depends_on_range now apply to tasks generated by
task_factory (#173). Please see the job configuration doc for more
information.
pool nodes del and pool nodes reboot now accept multiple --nodeid
arguments to specify deleting and rebooting multiple nodes at the same time,
respectively
pool nodes prune, pool nodes reboot, pool nodes zap will now ask
for confirmation first. -y flag can be specified to suppress confirmation.
Added Batch Shipyard version to user agent for all ARM clients
Improved node prep scripts with more timestamp detail, Docker and
Nvidia details
CUDA 9.1 support on ND/NCv2/NCv3 with Tesla Driver 390.30
Docker CE updated to 18.03.0
Singularity updated to 2.4.4
Dependencies updated
Fixed
Previous environment variable expansion fix applied to multi-instance tasks
jobs tasks list command with undefined job action but with dependency
actions
job_action for task default exit condition was being overwritten
incorrectly in certain scenarios
Support for specifying default task exit conditions (i.e., non-zero exit
codes). Please see the jobs configuration doc for more information.
New commands (please see usage doc for more information):
pool nodes prune will remove all unused Docker-related data on all
nodes in the pool (requires an SSH user)
pool nodes ps performs a docker ps -a on all nodes in the pool
(requires an SSH user)
pool nodes zap will kill (and optionally remove) all running
Docker containers on nodes in a pool (requires an SSH user). Note that
jobs tasks term is the preferred command to control individual
(or grouped) task termination or jobs term --termtasks to terminate
at the job level.
storage sas create command added as a utility helper function to
create SAS tokens for given storage accounts in credentials
Custom Linux Mount support for shared_data_volumes. Please see the
global configuration doc for more information.
New commands (please see usage doc for more information):
account command added with the following sub-commands (requires
AAD auth):
info provides information about a Batch account (including account
level quotas)
list provides information about all (or a resource group subset)
of accounts within the subscription specified in credentials
quota provides service level quota information for the
subscription for a given location
pool rdp sub-command added, please see usage doc for more information.
Requires Batch Shipyard executing on Windows with target Windows
containers pools.
pool images update command now supports updating Docker images
in native container support pools via SSH
Ability to specify an AAD authority URL via the aad:authority_url
credential configuration, --aad-authority-url command line option or
SHIPYARD_AAD_AUTHORITY_URL environment variable. Please see relevant
documentation for credentials and usage.
Support for CentOS 7.4 and Debian 9 compute node hosts. CentOS 7.4
on GPU nodes is currently unsupported; CentOS 7.3 will continue to work on
N-series.
Support for publisher MicrosoftWindowsServer, offer
WindowsServerSemiAnnual, and sku
Datacenter-Core-1709-with-Containers-smalldisk
--delete-resource-group option added to fs disks del command
CentOS-HPC 7.1, CentOS 7.3 GPU, and CentOS 7.4 packer scripts added to
contrib area
Add documentation for which platform_images are supported
Changed
Breaking Change: additional_node_prep_commands is now a dictionary
of pre and post properties which are executed either before or after the
Batch Shipyard startup task. Please see the pool configuration doc for more
information.
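A minimal sketch of the new dictionary form (the script names are placeholders):

```yaml
pool_specification:
  additional_node_prep_commands:
    pre:
    - ./install-deps.sh    # executed before the Batch Shipyard startup task
    post:
    - ./verify-node.sh     # executed after the Batch Shipyard startup task
```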
Allow provisioning of OpenLogic CentOS-HPC 7.1
Default management endpoint for public Azure cloud updated
Improve some error messages/handling
Update dependencies to latest
Linux pre-built binary is no longer gzipped
Update packer scripts in contrib area
Fixed
AAD auth for ARM endpoints in non-public Azure cloud regions
Custom image + native mode deployment for Linux pools
Potential command launch problems in native mode
Minor schema validation updates
AAD check logic for different points in pool allocation
--ssh parameter for pool images update was not correctly set as a flag
--jobs was not properly being merged with --configdir (#163)
Fix regression in pool images update that would not login to
registries in multi-instance mode
Fix pool images commands to more reliably work with SSH
Fix output_data with windows containers pools (#165)
Support for mounting multiple Azure File shares as shared_data_volumes
to a pool (#123)
bind_options support for data_volumes and shared_data_volumes
More packer samples for custom images
Singularity HPLinpack recipe
Changed
Breaking Change: global_resources:docker_volumes is now named
global_resources:volumes. Although backward compatibility is maintained
for this property, it is recommended to migrate as volumes are now shared
between Docker and Singularity containers.
Azure Files (with volume_driver of azurefile) specified under
shared_data_volumes are now mounted directly to the host (#123)
The internal root mount point for all shared_data_volumes is now under
$AZ_BATCH_NODE_ROOT_DIR/mounts to reduce clutter/confusion under the
old root mount point of $AZ_BATCH_NODE_SHARED_DIR. The container mount
points (i.e., container_path) are unaffected.
Canonical UbuntuServer 16.04-LTS is no longer pinned to a specific
release. Please avoid using version 16.04.201709190.
Update to blobxfer 1.0.0rc3
Updated custom image guide
Fixed
Multi-instance Docker-based application command was not being launched
under a user identity if specified
Allow min node allocation with bias_last_sample without required
sample percentage (#138)
Support for deploying compute nodes to an ARM Virtual Network with Batch
Service Batch accounts (#126)
Support for deploying custom image compute nodes from an ARM Image resource
(#126)
Support for multiple public and private container registries (#127)
YAML configuration support. JSON formatted configuration files will continue
to be supported, however, note the breaking change with the corresponding
environment variable names for specifying individual config files from the
commandline. (#122)
Option to automatically attempt recovery of unusable nodes during
pool allocation or resize. See the attempt_recovery_on_unusable option in
the pool configuration doc.
Virtual Network guide
Changed
Breaking Change: Docker image tag for the CLI has been renamed to
alfpark/batch-shipyard:<version>-cli where <version> is the release
version or latest for whatever is in master. (#130)
Breaking Change: Fully qualified Docker image names are now required
under both the global config global_resources.docker_images and jobs
task array docker_image (or image). The docker_registry property
in the global config file is no longer valid. (#106)
Breaking Change: Docker private registries backed to Azure Storage blobs
are no longer supported. This is not to be confused with the Classic Azure
Container Registries which are still supported. (#44)
Breaking Change: docker_registry property in the global config is
no longer required. An additional_registries option is available for any
additional registries that are not present from the docker_images
array in global_resources but require a valid login. (#106)
Breaking Change: Data ingress/egress from/to Azure Storage along with
task_factory:file has changed to accommodate blobxfer 1.0.0 commandline
and options. There are new expanded options available, including multiple
include and exclude along with remote_path explicit specifications
(instead of general container or file_share). Please see the appropriate
global config, pool or job configuration docs for more information. (#47)
Breaking Change: image_uris in the vm_configuration:custom_image
property of the pool configuration has been replaced with arm_image_id
which is a reference to an ARM Image resource. Please see the custom image
guide for more information. (#126)
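A minimal sketch of the replacement; the resource id below is a placeholder pattern for an ARM Image resource, not a real id.

```yaml
vm_configuration:
  custom_image:
    # replaces the former image_uris list with a single ARM Image reference
    arm_image_id: /subscriptions/{subscription-id}/resourceGroups/{rg}/providers/Microsoft.Compute/images/{image-name}
```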
Breaking Change: environment variables SHIPYARD_CREDENTIALS_JSON,
SHIPYARD_CONFIG_JSON, SHIPYARD_POOL_JSON, SHIPYARD_JOBS_JSON, and
SHIPYARD_FS_JSON have been renamed to SHIPYARD_CREDENTIALS_CONF,
SHIPYARD_CONFIG_CONF, SHIPYARD_POOL_CONF, SHIPYARD_JOBS_CONF, and
SHIPYARD_FS_CONF respectively. (#122)
--configdir or SHIPYARD_CONFIGDIR now defaults to the current working
directory (i.e., .) if no other conf file options are specified.
aad can be specified at a "global" level in the credentials configuration
file, which is then applied to batch, keyvault and/or management
section. Please see the credentials configuration guide for more information.
docker_image is now preferred over the deprecated image property in
the task array in the jobs configuration file
gpu and infiniband under the jobs configuration are now optional. GPU
and/or RDMA capable compute nodes will be autodetected and the proper
devices and other settings will automatically be applied to tasks running
on these compute nodes. You can force disable GPU and/or RDMA by setting
gpu and infiniband properties to false. (#124)
Optional version support for platform_image. This property can be
used to set a host OS version to prevent possible issues that occur with
latest image versions.
--all-starting option for pool delnode which will delete all nodes
in starting state
Changed
Prevent invalid configuration of HPC offers with non-RDMA VM sizes
Expanded network tuning exemptions for new Dv3 and Ev3 sizes
Temporarily override Canonical UbuntuServer 16.04-LTS latest version to
a prior version due to recent linux-azure kernel issues
Fixed
NV driver updates
Various OS updates and Docker issues
CentOS 7.3 to 7.4 Nvidia driver breakage
Regression in pool ssh on Windows
Exception in unusable nodes with pool stats on allocation
Handle package manager db locks during conflicts for local package installs
Recurring job support (job schedules). Please see jobs configuration doc
for more information.
custom task factory. See the task factory guide for more information.
--all-jobschedules option for jobs term and jobs del
--jobscheduleid option for jobs disable, jobs enable and
jobs migrate
--all option for jobs listtasks
Changed
autogenerated_task_id_prefix configuration setting is now named
autogenerated_task_id and is a complex property. It has member properties
named prefix and zfill_width to control how autogenerated task ids are
named.
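For illustration, the renamed property might be configured as in the following sketch; the placement within the jobs configuration and the example values are assumptions.

```yaml
# Before: autogenerated_task_id_prefix: task-
# After (sketch):
autogenerated_task_id:
  prefix: task-
  zfill_width: 5    # zero-pads the numeric portion, e.g. task-00001
```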
jobs list will now output job schedules in addition to jobs
--all parameter for jobs term and jobs del renamed to --all-jobs
list subcommands now output in a more human readable format
random and file task factories. See the task factory guide for more
information.
Summary statistics: pool stats and jobs stats. See the usage doc for
more information.
Delete unusable nodes from pool with --all-unusable option for
pool delnode
CentOS-HPC 7.3 support
CNTK-GPU-Infiniband-IntelMPI recipe
Changed
remove_container_after_exit now defaults to true
input_data:azure_storage files with an include filter that does not
include wildcards (i.e., targets a single file) will now be placed at
the destination directly as specified.
Nvidia Tesla driver updated to 384.59
TensorFlow recipes updated for 1.2.1. TensorFlow-Distributed launcher.sh
script is now generalized to take a script as the first parameter and
relocated to /shipyard/launcher.sh.
CNTK recipes updated for 2.1. run_cntk.sh script now takes in CNTK
Python scripts for execution.
Fixed
Task termination with force failing due to new task generators
Custom image support, please see the pool configuration doc and custom
image guide for more information. (#94)
contrib area with packer scripts
Changed
Breaking Change: publisher, offer, sku is now part of a complex
property named vm_configuration:platform_image. This change is to
accommodate custom images. The old configuration schema is now deprecated and
will be removed in a future release.
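A minimal sketch of the new complex property, using an image that appears elsewhere in this changelog:

```yaml
vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 16.04-LTS
```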
Updated NVIDIA Tesla driver to 375.66
Fixed
Improved pool resize/allocation logic to fail early when the low priority
core quota is reached with no dedicated nodes
pool listimages command which will list all common Docker images on
all nodes and provide warning for mismatched images amongst compute nodes.
This functionality requires a provisioned SSH user and private key.
max_wall_time option for both jobs and tasks. Please consult the
documentation for the difference when specifying this option at either the
job or task level.
--poll-until-tasks-complete option for jobs listtasks to block the CLI
from exiting until all tasks under jobs for which the command is run have
completed
--tty option for pool ssh and fs cluster ssh to enable allocation
of a pseudo-tty for the SSH session
Changed
remove_container_after_exit, retention_time, shm_size, infiniband,
gpu can now be specified at the job-level and overridden at the task-level
in the jobs configuration
data_volumes and shared_data_volumes can now be specified at the
job-level and any volumes specified at the task level will be merged with
the job-level volumes to be exposed for the container
Fixed
Add missing deprecation path for pool_specification_vm_count for
multi-instance tasks. Please upgrade your jobs configuration to explicitly
use either pool_specification_vm_count_dedicated or
pool_specification_vm_count_low_priority.
Speed up task collection additions by caching last task id
Issues with pool resize and wait logic with low priority
resize_timeout can now be specified on the pool specification
--clear-tables option to storage del command which will delete
blob containers and queues but only clear table entries
--ssh option to pool udi command which will force the update Docker
images command to update over SSH instead of through a Batch job. This is
useful if you want to perform an out-of-band update of Docker image(s), e.g.,
your pool is currently busy processing tasks and would not be able to
accommodate another task.
Changed
Breaking Change: vm_count in the pool specification is now a
complex property consisting of the properties dedicated and low_priority
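A minimal sketch of the new complex property (the counts are placeholders):

```yaml
pool_specification:
  vm_count:
    dedicated: 4
    low_priority: 8
```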
Updated all dependencies to latest
Fixed
Improve node startup time for GPU NC-series by removing extraneous
dependencies
fs cluster ssh storage cluster id and command argument ordering was
inverted. This has been corrected to be as intended where the command
is the last argument, e.g., fs cluster ssh mynfs -- df -h
misc tensorboard command added which automatically instantiates a
Tensorboard instance on the compute node which is running or has run a
task that has generated TensorFlow summary operation compatible logs. An
SSH tunnel is then created so you can view Tensorboard locally on the
machine running Batch Shipyard. This requires a valid SSH user that has been
provisioned via Batch Shipyard with private keys available. This command
will work on Windows if ssh.exe is available in %PATH% or the current
working directory. Please see the usage guide for more information about
this command.
Pool-level resource_files support
Changed
Added optional COMMAND argument to pool ssh and fs cluster ssh
commands. If COMMAND is specified, the command is run non-interactively
with SSH on the target node.
Added some additional sanity checks in the node prep script
Updated TensorFlow-CPU and TensorFlow-GPU recipes to 1.1.0. Removed
specialized Docker build for TensorFlow-GPU. Added jobs-tb.json files
to TensorFlow-CPU and TensorFlow-GPU recipes as Tensorboard samples.
Optimize some Batch calls
Fixed
Site extension issues
SSH user add exception on Windows
jobs del --termtasks will now disable the job prior to running task
termination to prevent active tasks in the job from running while tasks are
being terminated
jobs listtasks and data listfiles will now accept a --jobid that
does not have to be in jobs.json
Data ingress on pool create issue with single node
Richer SSH options with new ssh_public_key_data and ssh_private_key
properties in ssh configuration blocks (for both pool.json and
fs.json).
ssh_public_key_data allows direct embedding of SSH public keys in
OpenSSH format into the config files.
ssh_private_key specifies where the private key is located with
respect to pre-created public keys (either ssh_public_key or
ssh_public_key_data). This allows transparent pool ssh or
fs cluster ssh commands with pre-created keys.
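A minimal sketch of an ssh block using these new properties (the username, key data, and path are placeholders):

```yaml
ssh:
  username: shipyard
  # embed the public key directly in OpenSSH format...
  ssh_public_key_data: ssh-rsa AAAAB3Nza... user@machine
  # ...and reference the matching pre-created private key for
  # transparent pool ssh / fs cluster ssh
  ssh_private_key: /path/to/id_rsa
```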
RemoteFS-GlusterFS+BatchPool recipe
Changed
Docker installations are now pinned to a specific Docker version which
should reduce sudden breaking changes introduced upstream by Docker and/or
the distribution
Fault domains for multi-vm storage clusters are now set to 2 by default but
can be configured using the fault_domains property. This was lowered from
the prior default of 3 due to managed disks and availability set restrictions
as some regions do not support 3 fault domains with this combination.
Updated NC-series Tesla driver to 375.51
Fixed
Broken Docker installations due to gpgkey changes
Possible race condition between disk setup and glusterfs volume create
Forbid SSH username to be the same as the Samba username
Allow smbd.service to auto-restart with delay
Data ingress to glusterfs on compute with no remotefs settings
Created Azure App Service Site Extension.
You can now one-click install Batch Shipyard as a site extension (after you
have Python installed) and use Batch Shipyard from an Azure Function trigger.
Samba support on storage cluster servers
Add sample RemoteFS recipes for NFS and GlusterFS
install.cmd installer for Windows. install_conda_windows.cmd has been
replaced by install.cmd, please see the install doc for more information.
Changed
Breaking Change: multi_instance_auto_complete under
job_specifications is now named auto_complete. This property will apply
to all types of jobs and not just multi-instance tasks. The default is now
false (instead of true for the old multi_instance_auto_complete).
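A minimal sketch of the renamed property (the job id is a placeholder):

```yaml
job_specifications:
- id: myjob
  # formerly multi_instance_auto_complete; now applies to all job types
  # and defaults to false
  auto_complete: true
```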
Breaking Change: static_public_ip has been replaced with a public_ip
complex property. This is to accommodate for situations where public IP for
RemoteFS is disabled. Please see the Remote FS configuration doc for more
info.
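For illustration, the replacement might look like the following sketch; the placement under remote_fs:storage_cluster and the member names (enabled, static) are assumptions, so see the Remote FS configuration doc for the authoritative schema.

```yaml
remote_fs:
  storage_cluster:
    public_ip:
      enabled: true   # set false to provision without a public IP
      static: false   # member names are assumptions; see the RemoteFS doc
```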
install.sh now handles Anaconda Python environments
--cardinal 0 is now implicit if no --hostname or --nodeid is specified
for fs cluster ssh or pool ssh commands, respectively
Allow docker_images in global_resources to be empty. Note that it is
always recommended to pre-load images on to pools for consistent scheduling
latencies from pool idle.
Fixed
Removed requirement of a batch credential section for pure fs operations
Multi-instance auto complete setting not being properly read
install.sh virtual environment issues
Fix pool ingress data calls with remotefs (#62)
Move additional node prep commands to last set of commands to execute in
start task (#63)
glusterfs_on_compute shared data volume issues
future and pathlib compat issues
Python2 unicode/str issues with management libraries
Added virtual environment install option for install.sh which is now
the recommended way to install Batch Shipyard. Please see the install
guide for more information. (#55)
Changed
Force SSD optimizations for btrfs with premium storage
Fixed
Incorrect FS server options parsing at script time
KeyVault client not initialized in fs contexts (#57)
Check pool current node count prior to executing pool udi task (#58)
Initialization with KeyVault uri on commandline (#59)
Azure Active Directory authentication support for Batch accounts
Support for specifying a virtual network to use with a compute pool
allow_run_on_missing_image option to jobs that allows tasks to execute
under jobs with Docker images that have not been pre-loaded via the
global_resources:docker_images setting in config.json. Note that, if
possible, you should attempt to specify all Docker images that you intend
to run in the global_resources:docker_images property in the global
configuration to minimize scheduling to task execution latency.
Support for running containers as a different user identity (uid/gid)
Support for Canonical/UbuntuServer/16.04-LTS. 16.04-LTS should be used over
the old 16.04.0-LTS sku due to
issue #31; the old sku is no
longer receiving updates.
Changed
Breaking Change: the glusterfs volume_driver for shared_data_volumes
should now be named glusterfs_on_compute. This is to distinguish co-located
GlusterFS on compute nodes from a standalone GlusterFS
storage_cluster remote mounted distributed file system.
Logging now has less verbose details (call origin) by default. Prior
behavior can be restored with the -v option.
Pool existence is now checked prior to job submission; jobs can no longer
proceed to add without an active pool.
Batch account (name) is now an optional property in the credentials config
Configuration doc broken up into multiple pages
Update all recipes using Canonical/UbuntuServer/16.04.0-LTS to use
Canonical/UbuntuServer/16.04-LTS instead
Configuration is no longer shown with -v. Use --show-config to dump
the complete configuration being used for the command.
Precompile Python files during build for Docker images
All dependencies updated to latest versions
Update Batch API call compatibility for azure-batch 2.0.0
Fixed
Logging time format and incorrect Zulu time designation.
Added
scp and multinode_scp data movement capability is now supported on
Windows, provided that ssh.exe and scp.exe can be found in %PATH% or the
current working directory. rsync methods are not supported on Windows.
Credential encryption is now supported on Windows, provided that openssl.exe can
be found in %PATH% or the current working directory.
pool rebootnode command added which allows single node reboot control.
Additionally, the option --all-start-task-failed will reboot all nodes in
the specified pool with the start task failed state.
jobs del and jobs term now provide a --termtasks option that runs the
jobs termtasks logic before the delete or terminate action on the job.
This option requires a valid SSH user on the remote nodes as specified in
the ssh configuration property in pool.json. This option is normally not
needed if all tasks within the jobs have completed.
Changed
The Docker image used for blobxfer is now tied to the specific Batch
Shipyard release
Default SSH user expiry time if not specified is now 30 days
All recipes now have the default config.json storage account set to the
link as named in the provided credentials.json file. Now, only the credentials
file needs to be modified to run a recipe.
Added
Support for max task retries (#23). See configuration doc for more
information.
Support for task data retention time (#30). See configuration doc for
more information.
Changed
Breaking Change: environment_variables_secret_id was erroneously
named and has been renamed to environment_variables_keyvault_secret_id to
follow the other properties with similar behavior.
Include Python 3.6 Travis CI target
Fixed
Automatically assigned task ids are now in the format dockertask-NNNNN
and will increment properly past 99999, but will not be zero-padded beyond that (#27)
Defect in list tasks for tasks that have not run (#28)
Docker temporary directory not being set properly
SLES-HPC will now install all Intel MPI related rpms
Defect in task file mover for unencrypted credentials (#29)
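The autogenerated task id scheme fixed above can be sketched as follows. autogen_task_id is a hypothetical helper name for illustration, not Batch Shipyard's actual function:

```python
def autogen_task_id(n, prefix="dockertask"):
    # zero-pad to five digits (dockertask-NNNNN); past 99999 the id
    # keeps incrementing but is no longer padded
    return "{}-{:05d}".format(prefix, n)

print(autogen_task_id(42))      # dockertask-00042
print(autogen_task_id(123456))  # dockertask-123456
```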
Added
Support for
Task Dependency Id Ranges
with the depends_on_range property under each task json property in tasks
in the jobs configuration file. Please see the configuration doc for more
information.
Support for environment_variables_secret_id in job and task definitions.
Specifying these properties will fetch manually added secrets (in the form of
a string representation of a json key-value dictionary) from the specified
KeyVault using AAD credentials. Please see the configuration doc for more
information.
Allow --configdir, --credentials, --config, --jobs, --pool config
options to be specified as environment variables. Please see the usage doc
for more information.
Added subcommand listskus to the pool command to list available
VM configurations (publisher, offer, sku) for the Batch account
Changed
Nodeprep now references cascade and tfm docker images by version instead
of latest to prevent breaking changes affecting older versions. Docker builds
of cascade and tfm based on latest commits are now disabled.
Fixed
Cascade docker image run not propagating exit code
Added
Support for any Internet accessible container registry, including
Azure Container Registry.
Please see the configuration doc for information on how to integrate with
a private container registry.
Changed
GPU driver for STANDARD_NC instances defined in the
gpu:nvidia_driver:source property is no longer required. If omitted,
an NVIDIA driver will be downloaded automatically with an NVIDIA License
agreement prompt. For STANDARD_NV instances, a driver URL is still required.
Docker container name auto-tagging now prepends the job id in order to
prevent conflicts in the case of unnamed simultaneous tasks from multiple jobs
Update CNTK docker images to 2.0beta4 and optimize GPU images for use
with NVIDIA K80/M60
Update Caffe docker image, default to using OpenBLAS over ATLAS, and
optimize GPU images for use with NVIDIA K80/M60
Update MXNet GPU docker image optimized for use with NVIDIA K80/M60
Update TensorFlow docker images to 0.11.0 and optimize GPU images for use
with NVIDIA K80/M60
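The container name auto-tagging change above (prepending the job id) can be sketched roughly as below; the helper name and exact separator are assumptions for illustration, not the actual implementation:

```python
def auto_container_name(job_id, task_id):
    # prepend the job id so identically named (or unnamed, auto-assigned)
    # simultaneous tasks from different jobs do not collide on a node
    return "{}-{}".format(job_id, task_id)

print(auto_container_name("jobA", "dockertask-00001"))
# jobA-dockertask-00001
```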
Fixed
Cascade thread exceptions will terminate with non-zero exit code
Some improvements with node prep and reboots
Task termination will only issue docker rm if the container exists
Added
install_conda_windows.cmd helper script for installing Batch Shipyard
under Anaconda for Windows
Added relative_destination_path json property for files ingress into
node destinations. This allows arbitrary specification of where ingressed
files should be placed relative to the destination path.
Added ability to ingress directly into the host without the requirement
of GlusterFS for pools with one compute node. A GlusterFS shared volume is
required for pools with more than one compute node for direct to pool data
ingress.
New commands and options:
pool udi: Update docker images on all compute nodes in a pool. --image
and --digest options can restrict the scope of the update.
data stream: --disk will stream the file as binary to disk instead
of as text to the local console
data listfiles: --jobid and --taskid allows scoping of the list
files action
jobs listtasks: --jobid allows scoping of list tasks to a specific job
jobs add: --tail allows tailing the specified file for the last job
and task added
Keras+Theano-CPU and Keras+Theano-GPU recipes
Keras+Theano-CPU added as an option in the quickstart guide
Changed
Breaking Change: Properties of docker_registry have changed
significantly to support eventual integration with the Azure Container
Registry service. Credentials for docker logins have moved to the credentials
json file. Please see the configuration doc for more information.
files data ingress no longer creates a directory where files to
be uploaded exist. For example if uploading from a path /a/b/c, the
directory c is no longer created at the destination. Instead all files
found in /a/b/c will be immediately placed directly at the destination
path with sub-directories preserved. This behavior can be modified with
the relative_destination_path property.
CUDA_CACHE_* variables are now set for GPU jobs such that compiled targets
pass-through to the host. This allows subsequent container invocations within
the same node the ability to reuse cached PTX JIT targets.
batch_shipyard:storage_entity_prefix is now optional and defaults to
shipyard if not specified.
Major internal configuration/settings refactor
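The ingress destination behavior described above (no trailing source directory name at the destination, sub-directories preserved, optional relative_destination_path nesting) can be sketched as follows; dest_path is a hypothetical helper for illustration only:

```python
from pathlib import PurePosixPath

def dest_path(local_file, source_root, destination,
              relative_destination_path=""):
    # files under source_root land directly at the destination path
    # (the source directory name itself is not recreated), with
    # sub-directories preserved; relative_destination_path optionally
    # nests them further under the destination
    rel = PurePosixPath(local_file).relative_to(source_root)
    return str(PurePosixPath(destination) / relative_destination_path / rel)

print(dest_path("/a/b/c/sub/file.txt", "/a/b/c", "/mnt/share"))
# /mnt/share/sub/file.txt
```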
Fixed
Pool resize down with wait
More Python2/3 compatibility issues
Ensure pools that deploy GlusterFS volumes have more than 1 node
Added
shipyard execution helper script created via install.sh
generated_sas_expiry_days json property to config json for the ability to
override the default number of days generated SAS keys are valid for.
New options on commands/subcommands:
jobs add: --recreate recreate any jobs which have completed and use
the same id
jobs termtasks: --force force docker kill to tasks even if they are
in completed state
pool resize: --wait wait for completion of resize
HPCG-Infiniband-IntelMPI and HPLinpack-Infiniband-IntelMPI recipes
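A minimal sketch of how generated_sas_expiry_days could translate into an expiry timestamp; this is illustrative only and not Batch Shipyard's actual SAS generation code:

```python
import datetime

def sas_expiry(generated_sas_expiry_days=30):
    # generated SAS keys are valid for a default number of days; the
    # generated_sas_expiry_days config property overrides the default
    return (datetime.datetime.now(datetime.timezone.utc)
            + datetime.timedelta(days=generated_sas_expiry_days))

print(sas_expiry(7).isoformat())
```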
Changed
Default SAS expiry time used for resource files and data movement changed
from 7 to 30 days.
Pools failing to start will now automatically retrieve stdout.txt and
stderr.txt to the current working directory under
poolid/<node ids>/std{out,err}.txt. These files can be inspected
locally and submitted as context for GitHub issues if pertinent.
Pool resizing will now attempt to add an SSH user on the new nodes if
an SSH public key is referenced or found in the invocation directory
Added
Comprehensive data movement support. Please see the data movement guide
and configuration doc for more information.
Ingress from local machine with files in global configuration
To GlusterFS shared volume
To Azure Blob Storage
To Azure File Storage
Ingress from Azure Blob Storage, Azure File Storage, or another Azure
Batch Task with input_data in pool and jobs configuration
Pool-level: to compute nodes
Job-level: to compute nodes prior to running the specified job
Task-level: to compute nodes prior to running a task of a job
Egress to local machine as actions
Single file from compute node
Entire task-level directories from compute node
Entire node-level directories from compute node
Egress to Azure Blob or File Storage with output_data in jobs
configuration
Task-level: to Azure Blob or File Storage on successful completion of a
task
Credential encryption support. Please see the credential encryption guide
and configuration doc for more information.
Experimental support for OpenSSH with HPN patches on Ubuntu
Support pool resize up with GlusterFS
Support GlusterFS volume options
Configurable path to place files generated by pool add or pool asu
commands
MXNet-CPU and Torch-CPU as options in the quickstart guide
Update CNTK recipes for 1.7.2 and switch multinode/multigpu samples to
MNIST
MXNet-CPU and MXNet-GPU recipes
Changed
Breaking Change: All new CLI experience with proper multilevel commands.
Please see usage doc for more information.
Added new commands: cert, data
Added many new convenience subcommands
--filespec is now delimited by , instead of :
Breaking Change: ssh_docker_tunnel in the pool_specification has
been replaced by the ssh property. generate_tunnel_script has been renamed
to generate_docker_tunnel_script. Please see the configuration doc for
more information.
The name property of a task json object in the jobs specification is no
longer required for multi-instance tasks. If not specified, name defaults
to id for all task types.
data stream no longer has an arbitrary max streaming time; the action will
stream the file indefinitely until the task completes
Validate container with storage_entity_prefix for length issues
pool del action now cleans up and deletes some storage containers
immediately afterwards (with confirmation prompts)
/opt/intel is no longer automatically mounted for infiniband-enabled
containers on SUSE SLES-HPC hosts. Please see the configuration doc
on how to manually map this directory if required. OpenLogic CentOS-HPC
hosts remain unchanged.
Modularized code base
Fixed
GlusterFS mount ownership/permissions fixed such that SSH users can
read/write
Azure File shared volume setup when invoked from Windows
Python2 compatibility issues with file encoding
Allow shipyard.py to be invoked outside of the root of the GitHub cloned
base directory
Added
New recipes: Caffe-GPU, CNTK-CPU-OpenMPI, CNTK-GPU-OpenMPI,
FFmpeg-GPU, NAMD-Infiniband-IntelMPI, NAMD-TCP, TensorFlow-GPU
Changed
Multi-instance tasks now automatically complete their job by default. This
removes the need to run the cleanmijobs action in the shipyard tool.
Please refer to the
multi-instance documentation
for more information and limitations.
Dumb back-off policy for DHT router convergence
Optimized Docker image storage location for Azure VMs
Prompts added for destructive operations in the shipyard tool
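The "dumb" back-off mentioned above might look like the following capped exponential sketch; this is purely illustrative and not the actual DHT convergence policy:

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0):
    # deliberately simple ("dumb") capped exponential back-off with a
    # little jitter, sketching retries while DHT routers converge
    delays = []
    for i in range(attempts):
        d = min(cap, base * (2 ** i))
        delays.append(d + random.uniform(0.0, 0.1 * d))
    return delays

print(backoff_delays(4))
```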
Fixed
Incorrect file location of node prep finished
Blocking wait for global resource on pool can now be disabled
Incorrect process call to query for docker image size when peer-to-peer
transfer is disabled
Use azure-storage 0.33.0 to fix Edm.Int64 overflow issue