Batch Shipyard Usage

This page contains in-depth details on how to use the Batch Shipyard tool. Please see the Container Image CLI section for information regarding how to use the Docker or Singularity image if not invoking the Python script or pre-built binary directly.

Batch Shipyard Invocation

If you installed Batch Shipyard using the install.sh script, then you can invoke as:

# Change directory to batch-shipyard installed directory
./shipyard

You can also invoke shipyard from any directory if given the full path to the script.

If you are on Windows and installed using the install.cmd script, then you can invoke as:

shipyard.cmd

If you installed manually (i.e., took the non-recommended installation path and did not use the installer scripts), then you will need to invoke the Python interpreter and pass the script as an argument. For example:

python3 shipyard.py

The -h or --help option will list the available options, which are explained below.

Note about interoperability with Azure Tooling and Azure Batch APIs

Nearly all REST calls or commands issued through the normal Azure Batch APIs and tooling, such as the Azure Portal or Azure CLI, will work against Batch Shipyard-created resources. However, there are some notable exceptions:

  1. All pools must be created with Batch Shipyard if you intend to use any Batch Shipyard functionality.
  2. Please note all of the current limitations for other actions.
  3. Batch Shipyard pools that are deleted outside of Batch Shipyard will not have their associated metadata (in Azure Storage) cleaned up. Please use the pool del command instead. You can use the storage command to clean up orphaned data if you accidentally deleted Batch Shipyard pools outside of Batch Shipyard.
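For example, the cleanup path described in item 3 might look like the following sketch, assuming a configuration directory named config:

```shell
# delete the pool through Batch Shipyard so its metadata in
# Azure Storage is also cleaned up
shipyard pool del --configdir config

# if a pool was accidentally deleted outside of Batch Shipyard,
# clean up the orphaned metadata with the storage command
shipyard storage clear --configdir config
```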

Commands and Sub-commands

shipyard (and shipyard.py) is invoked with a command and sub-command as positional arguments, i.e.:

shipyard <command> <subcommand> <options>

For instance:

shipyard pool add --configdir config
# or equivalent in Linux for this particular command
SHIPYARD_CONFIGDIR=config shipyard pool add

This would create a pool on the Batch account as specified in the config files found in the config directory. Please note that <options> must be specified after the command and sub-command.

You can issue the -h or --help option at every level to view all available options for that level and additional help text. For example:

shipyard -h
shipyard pool -h
shipyard pool add -h

Shared Options

There are a set of shared options which are used between most sub-commands. These options must be specified after the command and sub-command. These are:

  -y, --yes                       Assume yes for all confirmation prompts
  --raw                           Output data as returned by the service for
                                  supported operations as raw json
  --show-config                   Show configuration
  -v, --verbose                   Verbose output
  --configdir TEXT                Configuration directory where all
                                  configuration files can be found. Each
                                  config file must be named exactly the same
                                  as the regular switch option, e.g.,
                                  pool.yaml for --pool. Individually specified
                                  config options take precedence over this
                                  option. This defaults to "." if no other
                                  configuration option is specified.
  --credentials TEXT              Credentials config file
  --config TEXT                   Global config file
  --fs TEXT                       RemoteFS config file
  --pool TEXT                     Pool config file
  --jobs TEXT                     Jobs config file
  --monitor TEXT                  Resource monitoring config file
  --subscription-id TEXT          Azure Subscription ID
  --keyvault-uri TEXT             Azure KeyVault URI
  --keyvault-credentials-secret-id TEXT
                                  Azure KeyVault credentials secret id
  --aad-endpoint TEXT             Azure Active Directory endpoint
  --aad-directory-id TEXT         Azure Active Directory directory (tenant) id
  --aad-application-id TEXT       Azure Active Directory application (client)
                                  id
  --aad-auth-key TEXT             Azure Active Directory authentication key
  --aad-authority-url TEXT        Azure Active Directory authority URL
  --aad-user TEXT                 Azure Active Directory user
  --aad-password TEXT             Azure Active Directory password
  --aad-cert-private-key TEXT     Azure Active Directory private key for X.509
                                  certificate
  --aad-cert-thumbprint TEXT      Azure Active Directory certificate SHA1
                                  thumbprint
  • -y or --yes is to assume yes for all confirmation prompts
  • --raw will output JSON to stdout for the command result. Only a subset of commands support this option. Note that many of the supported commands return the raw JSON body from the Batch API server, thus the output may change or break if the underlying service version changes. It is important to pin the Batch Shipyard release to a specific version if using this feature and to perform upgrade testing/validation for your scenario and workflow between releases. The following commands support this option:
    • account info
    • account quota
    • cert list
    • fed create
    • fed destroy
    • fed list
    • fed jobs add
    • fed jobs del
    • fed jobs list
    • fed jobs term
    • fed jobs zap
    • fed pool add
    • fed pool remove
    • fed proxy status
    • jobs list
    • jobs tasks count
    • jobs tasks list
    • monitor status
    • pool autoscale evaluate
    • pool autoscale lastexec
    • pool images list
    • pool images update
    • pool list
    • pool listskus
    • pool nodes count
    • pool nodes grls
    • pool nodes list
    • pool nodes ps
    • pool nodes prune
    • pool nodes zap
  • --show-config will output the merged configuration prior to execution
  • -v or --verbose is for verbose output
  • --configdir path can be used instead of the individual config switches below if all configuration files are in one directory and named after their switch. For example, if you have a directory named config and under that directory you have the files credentials.yaml, config.yaml, pool.yaml and jobs.yaml, then you can use this argument instead of the following individual conf options. If neither this parameter nor any of the individual conf options is specified, this parameter defaults to the current working directory (i.e., .).
    • --credentials path/to/credentials.yaml is required for all actions except for a select few keyvault commands.
    • --config path/to/config.yaml is required for all actions.
    • --pool path/to/pool.yaml is required for most actions.
    • --jobs path/to/jobs.yaml is required for job-related actions.
    • --fs path/to/fs.yaml is required for fs-related actions and some pool actions.
    • --monitor path/to/monitor.yaml is required for resource monitoring actions.
  • --subscription-id is the Azure Subscription Id associated with the Batch account or Remote file system resources. This is only required for creating pools with a virtual network specification or with fs commands.
  • --keyvault-uri is required for all keyvault commands.
  • --keyvault-credentials-secret-id is required if utilizing a credentials config stored in Azure KeyVault
  • --aad-endpoint is the Active Directory endpoint for the resource. Note that this can cause conflicts for actions that require multiple endpoints for different resources. It is better to specify endpoints explicitly in the credential file.
  • --aad-directory-id is the Active Directory Directory Id (or Tenant Id)
  • --aad-application-id is the Active Directory Application Id (or Client Id)
  • --aad-auth-key is the authentication key for the application (or client)
  • --aad-authority-url is the Azure Active Directory Authority URL
  • --aad-user is the Azure Active Directory user
  • --aad-password is the Azure Active Directory password for the user
  • --aad-cert-private-key is the Azure Active Directory Service Principal RSA private key corresponding to the X.509 certificate for certificate-based auth
  • --aad-cert-thumbprint is the X.509 certificate thumbprint for Azure Active Directory certificate-based auth

Note that only one of Active Directory Service Principal or User/Password can be specified at once, i.e., --aad-auth-key, --aad-password, and --aad-cert-private-key are mutually exclusive.
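As an illustration of combining the shared options above, the following hypothetical invocations use a config directory named config:

```shell
# emit the result of pool list as raw JSON; pin your Batch Shipyard
# version if you parse this output, as the service schema may change
shipyard pool list --configdir config --raw

# display the merged configuration prior to execution
shipyard jobs add --configdir config --show-config
```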

Note that the following options can be specified as environment variables instead:

  • SHIPYARD_CONFIGDIR in lieu of --configdir
  • SHIPYARD_CREDENTIALS_CONF in lieu of --credentials
  • SHIPYARD_CONFIG_CONF in lieu of --config
  • SHIPYARD_POOL_CONF in lieu of --pool
  • SHIPYARD_JOBS_CONF in lieu of --jobs
  • SHIPYARD_FS_CONF in lieu of --fs
  • SHIPYARD_MONITOR_CONF in lieu of --monitor
  • SHIPYARD_SUBSCRIPTION_ID in lieu of --subscription-id
  • SHIPYARD_KEYVAULT_URI in lieu of --keyvault-uri
  • SHIPYARD_KEYVAULT_CREDENTIALS_SECRET_ID in lieu of --keyvault-credentials-secret-id
  • SHIPYARD_AAD_ENDPOINT in lieu of --aad-endpoint
  • SHIPYARD_AAD_DIRECTORY_ID in lieu of --aad-directory-id
  • SHIPYARD_AAD_APPLICATION_ID in lieu of --aad-application-id
  • SHIPYARD_AAD_AUTH_KEY in lieu of --aad-auth-key
  • SHIPYARD_AAD_AUTHORITY_URL in lieu of --aad-authority-url
  • SHIPYARD_AAD_USER in lieu of --aad-user
  • SHIPYARD_AAD_PASSWORD in lieu of --aad-password
  • SHIPYARD_AAD_CERT_PRIVATE_KEY in lieu of --aad-cert-private-key
  • SHIPYARD_AAD_CERT_THUMBPRINT in lieu of --aad-cert-thumbprint
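For example, the switches could be supplied via the environment instead (values are placeholders):

```shell
# equivalent to passing --configdir and the --aad-* switches
export SHIPYARD_CONFIGDIR=config
export SHIPYARD_AAD_DIRECTORY_ID=<tenant-id>
export SHIPYARD_AAD_APPLICATION_ID=<client-id>
export SHIPYARD_AAD_AUTH_KEY=<auth-key>
shipyard pool add
```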

Commands

shipyard has the following top-level commands:

  account   Batch account actions
  cert      Certificate actions
  data      Data actions
  diag      Diagnostics actions
  fed       Federation actions
  fs        Filesystem in Azure actions
  jobs      Jobs actions
  keyvault  KeyVault actions
  misc      Miscellaneous actions
  monitor   Monitoring actions
  pool      Pool actions
  slurm     Slurm on Batch actions
  storage   Storage actions
  • account commands deal with Batch accounts
  • cert commands deal with certificates to be used with Azure Batch
  • data commands deal with data ingress and egress from Azure
  • diag commands deal with diagnostics for Azure Batch
  • fed commands deal with Batch Shipyard federations
  • fs commands deal with Batch Shipyard provisioned remote filesystems in Azure
  • jobs commands deal with Azure Batch jobs and tasks
  • keyvault commands deal with Azure KeyVault secrets for use with Batch Shipyard
  • misc commands are miscellaneous commands that don't fall into other categories
  • monitor commands deal with Batch Shipyard resource monitoring
  • pool commands deal with Azure Batch pools
  • slurm commands deal with Slurm on Batch
  • storage commands deal with Batch Shipyard metadata on Azure Storage

account Command

The account command has the following sub-commands:

  info   Retrieve Batch account information and quotas
  list   Retrieve a list of Batch accounts and...
  quota  Retrieve Batch account quota at the...
  • info provides information about the specified batch account provided in credentials
    • --name is the name of the Batch account to query instead of the one specified in credentials
    • --resource-group is the name of the resource group to use associated with the Batch account instead of the one specified in credentials
  • list provides information about all (or a subset) of accounts within the subscription in credentials
    • --resource-group is the name of the resource group to scope the query to
  • quota provides service level quota information for the subscription for a given location. Requires a valid location argument, e.g., westus.
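Hypothetical invocations of the account sub-commands (resource group name and location are placeholders):

```shell
# show information and quotas for the Batch account in credentials
shipyard account info --configdir config

# list Batch accounts in the subscription, scoped to a resource group
shipyard account list --configdir config --resource-group myrg

# show subscription-level Batch quotas for a given location
shipyard account quota --configdir config westus
```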

cert Command

The cert command has the following sub-commands:

  add     Add a certificate to a Batch account
  create  Create a certificate to use with a Batch...
  del     Deletes certificate from a Batch account
  list    List all certificates in a Batch account
  • add will add a certificate to the Batch account
    • --file is the certificate file to add. The operation to transform the cert so it is acceptable for the Batch Service is determined by the file extension. Only .cer, .pem and .pfx files are supported. If this option is omitted, the encryption:pfx specified in the global configuration is used.
    • --pem-no-certs will convert and add the PEM file as a CER in the Batch service without any certificates.
    • --pem-public-key will convert and add the PEM file as a CER in the Batch service with only the public key.
    • --pfx-password is the PFX password to use
  • create will create a certificate locally for use with the Batch account.
    • --file-prefix is the PEM and PFX file name prefix to use. If this option is omitted, the global configuration encryption:pfx section options are used.
    • --pfx-password is the PFX passphrase to set. If this option is omitted, the global configuration encryption:pfx section options are used. If neither are specified, the passphrase is prompted.
  • del will delete certificates from the Batch account
    • --sha1 specifies the thumbprint to delete. If this option is omitted, then the certificate referenced in the global configuration setting encryption:pfx will be deleted.
  • list will list certificates in the Batch account

Note that in order to use certificates created by cert create for credential encryption, you must edit your config.yaml to incorporate the generated certificate and then invoke the cert add command. Please see the credential encryption guide for more information.
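The certificate workflow described above might look like the following sketch:

```shell
# generate a certificate locally (PEM and PFX files)
shipyard cert create --configdir config

# after editing config.yaml to reference the generated certificate,
# add it to the Batch account
shipyard cert add --configdir config

# list certificates in the Batch account to verify
shipyard cert list --configdir config
```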

data Command

The data command has the following sub-commands:

  files    Compute node file actions
  ingress  Ingress data into Azure

The data files sub-command has the following sub-sub-commands:

  list    List files for tasks in jobs
  node    Retrieve file(s) from a compute node
  stream  Stream a file as text to the local console or...
  task    Retrieve file(s) from a job/task
  • files list will list files for all tasks in jobs
    • --jobid force scope to just this job id
    • --taskid force scope to just this task id
  • files node will retrieve a file with node id and filename semantics
    • --all --filespec <nodeid>,<include pattern> can be given to download all files from the compute node with the optional include pattern
    • --filespec <nodeid>,<filename> can be given to download one specific file from compute node
  • files stream will stream a file as text (UTF-8 decoded) to the local console or binary if streamed to disk
    • --disk will write the streamed data as binary to disk instead of output to local console
    • --filespec <jobid>,<taskid>,<filename> can be given to stream a specific file. If <taskid> is set to @FIRSTRUNNING, then the first running task within the job of <jobid> will be used to locate the <filename>.
  • files task will retrieve a file with job, task, filename semantics
    • --all --filespec <jobid>,<taskid>,<include pattern> can be given to download all files for the job and task with an optional include pattern
    • --filespec <jobid>,<taskid>,<filename> can be given to download one specific file from the job and task. If <taskid> is set to @FIRSTRUNNING, then the first running task within the job of <jobid> will be used to locate the <filename>.
  • ingress will ingress data as specified in configuration files
    • --to-fs <STORAGE_CLUSTER_ID> transfers data as specified in configuration files to the specified remote file system storage cluster instead of Azure Storage
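Hypothetical data files invocations using the filespec semantics described above (job, task, and file names are placeholders):

```shell
# stream a task's stdout to the console, using @FIRSTRUNNING to
# select the first running task in the job
shipyard data files stream --configdir config \
    --filespec myjob,@FIRSTRUNNING,stdout.txt

# download all files for a specific job and task matching a pattern
shipyard data files task --configdir config --all \
    --filespec 'myjob,mytask,wd/*'
```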

diag Command

The diag command has the following sub-commands:

  logs  Diagnostic log actions

The diag logs sub-command has the following sub-sub-commands:

  upload  Upload Batch Service Logs from compute node
  • logs upload will upload the Batch compute node service logs to a specified Azure storage container.
    • --cardinal is the zero-based cardinal number of the compute node in the pool to upload from
    • --nodeid is the node id to upload from
    • --wait will wait until the operation completes
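For example, uploading service logs from the first compute node in the pool (a sketch; the storage destination is taken from configuration):

```shell
shipyard diag logs upload --configdir config --cardinal 0 --wait
```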

fed Command

The fed command has the following sub-commands:

  create   Create a federation
  destroy  Destroy a federation
  jobs     Federation jobs actions
  list     List all federations
  pool     Federation pool actions
  proxy    Federation proxy actions

The fed jobs sub-command has the following sub-sub-commands:

  add   Add jobs to a federation
  del   Delete a job or job schedule in a federation
  list  List jobs or job schedules in a federation
  term  Terminate a job or job schedule in a...
  zap   Zap a queued unique id from a federation

The fed pool sub-command has the following sub-sub-commands:

  add     Add a pool to a federation
  remove  Remove a pool from a federation

The fed proxy sub-command has the following sub-sub-commands:

  create   Create a federation proxy
  destroy  Destroy a federation proxy
  ssh      Interactively login via SSH to federation...
  start    Starts a previously suspended federation...
  status   Query status of a federation proxy
  suspend  Suspend a federation proxy
  • create will create a federation
    • FEDERATION_ID is the federation id name
    • --force force creates the federation even if a federation with a same id exists.
    • --no-unique-job-ids creates a federation without unique job id enforcement.
  • destroy will destroy a previously created federation
    • FEDERATION_ID is the federation id name
  • jobs add submits jobs/task groups or job schedules to a federation
    • FEDERATION_ID is the federation id name
  • jobs del submits an action to delete jobs or job schedules from a federation
    • FEDERATION_ID is the federation id name
    • --all-jobs deletes all jobs in the federation
    • --all-jobschedules deletes all job schedules in the federation
    • --job-id deletes a specific job id. This can be specified multiple times.
    • --job-schedule-id deletes a specific job schedule id. This can be specified multiple times.
  • jobs list lists jobs or locates a job or job schedule
    • FEDERATION_ID is the federation id name
    • --blocked will list blocked actions
    • --job-id locates a specific job id
    • --job-schedule-id locates a specific job schedule id
    • --queued will list queued actions
  • jobs term submits an action to terminate jobs or job schedules from a federation
    • FEDERATION_ID is the federation id name
    • --all-jobs terminates all jobs in the federation
    • --all-jobschedules terminates all job schedules in the federation
    • --force forces submission of a termination action for a job even if it doesn't exist
    • --job-id terminates a specific job id. This can be specified multiple times.
    • --job-schedule-id terminates a specific job schedule id. This can be specified multiple times.
  • jobs zap removes a unique id action from a federation
    • FEDERATION_ID is the federation id name
    • --unique-id is the unique id associated with the action to zap
  • list will list federations
    • --federation-id will limit the list to the specified federation id
  • pool add will add a pool to a federation
    • FEDERATION_ID is the federation id name
    • --batch-service-url is the batch service url of the pool id to add instead of read from the credentials configuration
    • --pool-id is the pool id to add instead of the pool id read from the pool configuration
  • pool remove will remove a pool from a federation
    • FEDERATION_ID is the federation id name
    • --all remove all pools from the federation
    • --batch-service-url is the batch service url of the pool id to remove instead of read from the credentials configuration
    • --pool-id is the pool id to remove instead of the pool id read from the pool configuration
  • proxy create will create the federation proxy
  • proxy destroy will destroy the federation proxy
    • --delete-resource-group will delete the entire resource group that contains the federation proxy. Please take care when using this option as all resources in the resource group are deleted, which may include resources that are not related to Batch Shipyard.
    • --delete-virtual-network will delete the virtual network and all of its subnets
    • --generate-from-prefix will attempt to generate all resource names from the naming conventions used during creation. This is helpful when there was an issue with federation proxy creation/deletion and the original virtual machine resources cannot be enumerated. Note that OS disks cannot be deleted with this option. Please use an alternate means (i.e., the Azure Portal) to delete disks.
    • --no-wait does not wait for deletion completion. It is not recommended to use this parameter.
  • proxy ssh will interactively log into the federation proxy via SSH
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., fed proxy ssh -- sudo docker ps -a.
    • --tty allocates a pseudo-terminal
  • proxy start will start a previously suspended federation proxy
    • --no-wait does not wait for the restart to complete. It is not recommended to use this parameter.
  • proxy status will query status of a federation proxy
  • proxy suspend suspends a federation proxy
    • --no-wait does not wait for the suspension to complete. It is not recommended to use this parameter.
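A hypothetical end-to-end federation workflow tying the above together (the federation id myfed is a placeholder, and the proxy is assumed to be created before the federation):

```shell
# create the federation proxy, then the federation itself
shipyard fed proxy create --configdir config
shipyard fed create --configdir config myfed

# add the pool from the pool configuration to the federation
shipyard fed pool add --configdir config myfed

# submit the jobs defined in jobs.yaml to the federation and list them
shipyard fed jobs add --configdir config myfed
shipyard fed jobs list --configdir config myfed
```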

fs Command

The fs command has the following sub-commands which work on two different parts of a remote filesystem:

  cluster  Filesystem storage cluster in Azure actions
  disks    Managed disk actions

fs cluster Command

fs cluster command has the following sub-commands:

  add          Create a filesystem storage cluster in Azure
  del          Delete a filesystem storage cluster in Azure
  expand       Expand a filesystem storage cluster in Azure
  orchestrate  Orchestrate a filesystem storage cluster in Azure with the...
  resize       Resize a filesystem storage cluster in Azure.
  ssh          Interactively login via SSH to a filesystem storage cluster...
  start        Starts a previously suspended filesystem storage cluster in...
  status       Query status of a filesystem storage cluster in Azure
  suspend      Suspend a filesystem storage cluster in Azure

As the fs.yaml configuration file can contain multiple storage cluster definitions, all fs cluster commands require the argument STORAGE_CLUSTER_ID after any option below is specified targeting the storage cluster to perform actions against.

  • add will create a remote fs cluster as defined in the fs config file
  • del will delete a remote fs cluster as defined in the fs config file
    • --delete-resource-group will delete the entire resource group that contains the server. Please take care when using this option as all resources in the resource group are deleted, which may include resources that are not related to Batch Shipyard.
    • --delete-data-disks will delete attached data disks
    • --delete-virtual-network will delete the virtual network and all of its subnets
    • --generate-from-prefix will attempt to generate all resource names using conventions used. This is helpful when there was an issue with cluster creation/deletion and the original virtual machine(s) resources cannot be enumerated. Note that OS disks and data disks cannot be deleted with this option. Please use fs disks del to delete disks that may have been used in the storage cluster.
    • --no-wait does not wait for deletion completion. It is not recommended to use this parameter.
  • expand expands the number of disks used by the underlying filesystems on the file server.
    • --no-rebalance skips rebalancing of data and metadata among the disks after the disks are added to the array
  • orchestrate will create the remote disks and the remote fs cluster as defined in the fs config file
  • resize resizes the storage cluster with additional virtual machines as specified in the configuration. This is an experimental feature.
  • ssh will interactively log into a virtual machine in the storage cluster. If neither --cardinal nor --hostname is specified, --cardinal 0 is assumed.
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., fs cluster ssh mycluster -- df -h.
    • --cardinal is the zero-based cardinal number of the virtual machine in the storage cluster to connect to.
    • --hostname is the hostname of the virtual machine in the storage cluster to connect to
    • --tty allocates a pseudo-terminal
  • start will start a previously suspended storage cluster
    • --no-wait does not wait for the restart to complete. It is not recommended to use this parameter.
  • status displays the status of the storage cluster
    • --detail reports in-depth details about each virtual machine in the storage cluster
    • --hosts will output the public IP to hosts mapping for mounting a glusterfs based remote filesystem locally. glusterfs must be allowed in the network security rules for this to work properly.
  • suspend suspends a storage cluster
    • --no-wait does not wait for the suspension to complete. It is not recommended to use this parameter.
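Hypothetical fs cluster invocations (the storage cluster id mystoragecluster is a placeholder):

```shell
# create the managed disks and storage cluster in one step
shipyard fs cluster orchestrate --configdir config mystoragecluster

# check detailed status, then SSH into the first virtual machine
shipyard fs cluster status --configdir config mystoragecluster --detail
shipyard fs cluster ssh --configdir config mystoragecluster -- df -h
```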

fs disks Command

fs disks command has the following sub-commands:

  add   Create managed disks in Azure
  del   Delete managed disks in Azure
  list  List managed disks in resource group
  • add creates managed disks as specified in the fs config file
  • del deletes managed disks as specified in the fs config file
    • --all deletes all managed disks found in a specified resource group
    • --delete-resource-group deletes the specified resource group
    • --name deletes a specific named disk in a resource group
    • --no-wait does not wait for disk deletion to complete. It is not recommended to use this parameter.
    • --resource-group deletes one or more managed disks in this resource group
  • list lists managed disks found in a resource group
    • --resource-group lists disks in this resource group only
    • --restrict-scope lists disks only if found in the fs config file
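Hypothetical fs disks invocations (the resource group name is a placeholder):

```shell
# create the managed disks defined in the fs config file
shipyard fs disks add --configdir config

# list disks in a resource group, restricted to those in the fs config
shipyard fs disks list --configdir config --resource-group myrg \
    --restrict-scope
```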

jobs Command

The jobs command has the following sub-commands:

  add      Add jobs
  cmi      Cleanup non-native multi-instance jobs
  del      Delete jobs and job schedules
  disable  Disable jobs and job schedules
  enable   Enable jobs and job schedules
  list     List jobs
  migrate  Migrate jobs or job schedules to another pool
  stats    Get statistics about jobs
  tasks    Tasks actions
  term     Terminate jobs and job schedules

The jobs tasks sub-command has the following sub-sub-commands:

  count  Get task counts for a job
  del    Delete specified tasks in jobs
  list   List tasks within jobs
  term   Terminate specified tasks in jobs
  • add will add all jobs and tasks defined in the jobs configuration file to the Batch pool
    • --recreate will recreate any completed jobs with the same id
    • --tail will tail the specified file of the last job and task added with this command invocation
  • cmi will cleanup any stale non-native multi-instance tasks and jobs. Note that this sub-command is typically not required if auto_complete is set to true in the job specification for the job.
    • --delete will delete any stale cleanup jobs
  • del will delete jobs and job schedules specified in the jobs configuration file. If an autopool is specified for all jobs and a jobid option is not specified, the storage associated with the autopool will be cleaned up.
    • --all-jobs will delete all jobs found in the Batch account
    • --all-jobschedules will delete all job schedules found in the Batch account
    • --jobid force deletion scope to just this job id
    • --jobscheduleid force deletion scope to just this job schedule id
    • --termtasks will manually terminate tasks prior to deletion. Termination of running tasks requires a valid SSH user if the tasks are running on a non-native container support pool.
    • --wait will wait for deletion to complete
  • disable will disable jobs or job schedules
    • --jobid force disable scope to just this job id
    • --jobscheduleid force disable scope to just this job schedule id
    • --requeue requeue running tasks
    • --terminate terminate running tasks
    • --wait wait for running tasks to complete
  • enable will enable jobs or job schedules
    • --jobid force enable scope to just this job id
    • --jobscheduleid force enable scope to just this job schedule id
  • list will list all jobs in the Batch account
  • migrate will migrate jobs or job schedules to another pool. Ensure that the new target pool has the Docker images required to run the job.
    • --jobid force migration scope to just this job id
    • --jobscheduleid force migration scope to just this job schedule id
    • --poolid force migration to this specified pool id
    • --requeue requeue running tasks
    • --terminate terminate running tasks
    • --wait wait for running tasks to complete
  • stats will generate a statistics summary of a job or jobs
    • --jobid will query the specified job instead of all jobs
  • tasks count will count the task states within a job
    • --jobid will query the specified job instead of all jobs
  • tasks del will delete tasks within jobs specified in the jobs configuration file. Active or running tasks will be terminated first on non-native container support pools.
    • --jobid force deletion scope to just this job id
    • --taskid force deletion scope to just this task id
    • --wait will wait for deletion to complete
  • tasks list will list tasks from jobs specified in the jobs configuration file
    • --all list all tasks in all jobs in the account
    • --jobid force scope to just this job id
    • --poll-until-tasks-complete will poll until all tasks have completed
  • tasks term will terminate tasks within jobs specified in the jobs configuration file. Termination of running tasks requires a valid SSH user if tasks are running on a non-native container support pool.
    • --force force send docker kill signal regardless of task state
    • --jobid force termination scope to just this job id
    • --taskid force termination scope to just this task id
    • --wait will wait for termination to complete
  • term will terminate jobs and job schedules found in the jobs configuration file. If an autopool is specified for all jobs and a jobid option is not specified, the storage associated with the autopool will be cleaned up.
    • --all-jobs will terminate all jobs found in the Batch account
    • --all-jobschedules will terminate all job schedules found in the Batch account
    • --jobid force termination scope to just this job id
    • --jobscheduleid force termination scope to just this job schedule id
    • --termtasks will manually terminate tasks prior to termination. Termination of running tasks requires a valid SSH user if tasks are running on a non-native container support pool.
    • --wait will wait for termination to complete
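Hypothetical jobs invocations illustrating the options above (file names are placeholders):

```shell
# add all jobs and tasks from jobs.yaml, tailing the last task's stdout
shipyard jobs add --configdir config --tail stdout.txt

# list tasks and poll until all tasks have completed
shipyard jobs tasks list --configdir config --poll-until-tasks-complete

# delete the jobs, terminating any running tasks first
shipyard jobs del --configdir config --termtasks --wait
```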

keyvault Command

The keyvault command has the following sub-commands:

  add   Add a credentials config file as a secret to...
  del   Delete a secret from Azure KeyVault
  list  List secret ids and metadata in an Azure...

The following subcommands require --keyvault-* and --aad-* options in order to work. Alternatively, you can specify these in the credentials.yaml file, but these options are mutually exclusive of other properties. Please refer to the Azure KeyVault and Batch Shipyard guide for more information.

  • add will add the specified credentials config file as a secret to an Azure KeyVault. A valid credentials config file must be specified as an option.
    • NAME argument is required which is the name of the secret associated with the credentials config to store in the KeyVault
  • del will delete a secret from the Azure KeyVault
    • NAME argument is required which is the name of the secret to delete from the KeyVault
  • list will list all secret ids and metadata in an Azure KeyVault

misc Command

The misc command has the following sub-commands:

  mirror-images  Mirror Batch Shipyard system images to the...
  tensorboard    Create a tunnel to a Tensorboard instance for...

  • mirror-images will mirror Batch Shipyard Docker images to the designated fallback_registry specified in the global configuration for the version of Batch Shipyard that is executed in the command invocation.
  • tensorboard will create a tunnel to the compute node that is running or has run the specified task
    • --jobid specifies the job id to use. If this is not specified, the first and only jobspec is used from jobs.yaml.
    • --taskid specifies the task id to use. If this is not specified, the last run or running task for the job is used.
    • --logdir specifies the TensorFlow logs directory generated by summary operations
    • --image specifies an alternate TensorFlow image to use for Tensorboard. The tensorboard.py file must be in the same location in the Docker image as in stock TensorFlow images. If not specified, Batch Shipyard will attempt to find a suitable TensorFlow image from Docker images in the global resource list or will acquire one on demand for this command.

monitor Command

The monitor command has the following sub-commands:

  add      Add a resource to monitor
  create   Create a monitoring resource
  destroy  Destroy a monitoring resource
  list     List all monitored resources
  remove   Remove a resource from monitoring
  ssh      Interactively login via SSH to monitoring...
  start    Starts a previously suspended monitoring...
  status   Query status of a monitoring resource
  suspend  Suspend a monitoring resource

  • add will add a resource to monitor to an existing monitoring VM
    • --poolid will add the specified Batch pool to monitor
    • --remote-fs will add the specified RemoteFS cluster to monitor
  • create will create a monitoring resource VM
  • destroy will destroy a monitoring resource VM
    • --delete-resource-group will delete the entire resource group that contains the monitoring resource. Please take care when using this option as every resource in the resource group is deleted, including any resources that are not related to Batch Shipyard.
    • --delete-virtual-network will delete the virtual network and all of its subnets
    • --generate-from-prefix will attempt to generate all resource names using the standard naming conventions. This is helpful when there was an issue with monitoring creation/deletion and the original virtual machine resources cannot be enumerated. Note that OS disks cannot be deleted with this option. Please use an alternate means (e.g., the Azure Portal) to delete disks that may have been used by the monitoring VM.
    • --no-wait does not wait for deletion completion. It is not recommended to use this parameter.
  • list will list all monitored resources
  • remove will remove a monitored resource from an existing monitoring VM
    • --all will remove all resources that are currently monitored
    • --poolid will remove the specified Batch pool from monitoring
    • --remote-fs will remove the specified RemoteFS cluster from monitoring
  • ssh will interactively log into the monitoring resource via SSH.
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., monitor ssh -- sudo docker ps -a.
    • --tty allocates a pseudo-terminal
  • start will start a previously suspended monitoring VM
    • --no-wait does not wait for the restart to complete. It is not recommended to use this parameter.
  • status will query status of a monitoring VM
  • suspend suspends a monitoring VM
    • --no-wait does not wait for the suspension to complete. It is not recommended to use this parameter.
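A typical flow is to create the monitoring VM once and then attach resources to it, as in the following sketch (the pool id is hypothetical):

```shell
# provision the monitoring VM, then monitor a Batch pool named "mypool"
shipyard monitor create --configdir .
shipyard monitor add --configdir . --poolid mypool
```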

pool Command

The pool command has the following sub-commands:

  add        Add a pool to the Batch account
  autoscale  Autoscale actions
  del        Delete a pool from the Batch account
  exists     Check if a pool exists
  images     Container images actions
  list       List all pools in the Batch account
  listskus   List VM configurations available to the Batch account
  nodes      Compute node actions
  rdp        Interactively login via RDP to a node in a pool
  resize     Resize a pool
  ssh        Interactively login via SSH to a node in a pool
  stats      Get statistics about a pool
  user       Remote user actions

The pool autoscale sub-command has the following sub-sub-commands:

  disable   Disable autoscale on a pool
  enable    Enable autoscale on a pool
  evaluate  Evaluate autoscale formula
  lastexec  Get the result of the last execution of the...

The pool images sub-command has the following sub-sub-commands:

  list    List container images in a pool
  update  Update container images in a pool

The pool nodes sub-command has the following sub-sub-commands:

  count   Get node counts in pool
  del     Delete a node or nodes from a pool
  grls    Get remote login settings for all nodes in...
  list    List nodes in pool
  prune   Prune container/image data on nodes in pool
  ps      List running containers on nodes in pool
  reboot  Reboot a node or nodes in a pool
  zap     Zap all container processes on nodes in pool

The pool user sub-command has the following sub-sub-commands:

  add  Add a remote user to all nodes in pool
  del  Delete a remote user from all nodes in pool

  • add will add the pool defined in the pool configuration file to the Batch account
    • --recreate will delete and recreate the pool if there already exists a pool with the same id. Note that you should only use this option if you are certain that it will not cause side-effects.
  • autoscale disable will disable autoscale on the pool
  • autoscale enable will enable autoscale on the pool
  • autoscale evaluate will evaluate the autoscale formula in the pool configuration file
  • autoscale lastexec will query the last execution information for autoscale
  • del will delete the pool defined in the pool configuration file from the Batch account along with associated metadata in Azure Storage used by Batch Shipyard. It is recommended to use this command instead of deleting a pool directly from the Azure Portal, Batch Labs, or other tools as this action can conveniently remove all associated Batch Shipyard metadata on Azure Storage.
    • --poolid will delete the specified pool instead of the pool from the pool configuration file
    • --wait will wait for deletion to complete
  • exists will check if a pool exists under the Batch account
    • --pool-id will query the specified pool instead of the pool from the pool configuration file
  • images list will query the nodes in the pool for Docker images. Common and mismatched images will be listed. Requires a provisioned SSH user and private key.
  • images update will update container images on all compute nodes of the pool. This command may require a valid SSH user.
    • --docker-image will restrict the update to just the Docker image or image:tag
    • --docker-image-digest will restrict the update to just the Docker image or image:tag and a specific digest
    • --singularity-image will restrict the update to just the Singularity image or image:tag
    • --ssh will force the update to occur over an SSH side channel rather than a Batch job.
  • list will list all pools in the Batch account
  • nodes count will count the node states within a pool
    • --poolid will query the specified pool instead of the pool from the pool configuration file
  • nodes del will delete the specified node from the pool
    • --all-start-task-failed will delete all nodes in the start task failed state
    • --all-starting will delete all nodes in the starting state
    • --all-unusable will delete all nodes in the unusable state
    • --nodeid is the node id to delete
  • nodes grls will retrieve all of the remote login settings for every node in the specified pool
    • --no-generate-tunnel-script will disable generating an SSH tunnel script even if enabled in the pool configuration
  • nodes list will list all nodes in the specified pool
    • --start-task-failed will list nodes in start task failed state
    • --unusable will list nodes in unusable state
  • nodes prune will prune unused Docker data. This command requires a provisioned SSH user.
    • --volumes will also include volumes
  • nodes ps will list all Docker containers and their status. This command requires a provisioned SSH user.
  • nodes reboot will reboot a specified node in the pool
    • --all-start-task-failed will reboot all nodes in the start task failed state
    • --nodeid is the node id to reboot
  • nodes zap will send a kill signal to all running Docker containers. This command requires a provisioned SSH user.
    • --no-remove will not remove exited containers
    • --stop will execute docker stop instead
  • rdp will interactively log into a compute node via RDP. If neither --cardinal nor --nodeid is specified, --cardinal 0 is assumed. This command requires running Batch Shipyard on Windows against pools with Windows containers.
    • --cardinal is the zero-based cardinal number of the compute node in the pool to connect to as listed by grls
    • --no-auto will prevent automatic login via temporary credential saving if an RDP password is supplied via the pool configuration file
    • --nodeid is the node id to connect to in the pool
  • resize will resize the pool to the vm_count specified in the pool configuration file
    • --wait will wait for resize to complete
  • ssh will interactively log into a compute node via SSH. If neither --cardinal nor --nodeid is specified, --cardinal 0 is assumed.
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., pool ssh -- sudo docker ps -a.
    • --cardinal is the zero-based cardinal number of the compute node in the pool to connect to as listed by grls
    • --nodeid is the node id to connect to in the pool
    • --tty allocates a pseudo-terminal
  • stats will generate a statistics summary of the pool
    • --poolid will query the specified pool instead of the pool from the pool configuration file
  • user add will add an SSH or RDP user defined in the pool configuration file to all nodes in the specified pool
  • user del will delete the SSH or RDP user defined in the pool configuration file from all nodes in the specified pool
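The COMMAND pass-through used by ssh (and the other interactive login commands) follows the POSIX convention of terminating option parsing at --. The toy parser below is a minimal sketch of that behavior (it is a hypothetical stand-in, not the shipyard CLI's actual parser):

```shell
# run_remote is a hypothetical stand-in for "pool ssh": it consumes its own
# options until it sees "--", then treats everything after as the COMMAND
run_remote() {
    tty=0
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --tty) tty=1; shift ;;
            --) shift; break ;;   # stop option parsing; the rest is COMMAND
            *) break ;;
        esac
    done
    echo "tty=$tty command=$*"
}

run_remote --tty -- sudo docker ps -a
# without the "--", "sudo docker ps -a" could be misparsed as further options
```

This is why `pool ssh -- sudo docker ps -a` passes the docker invocation through untouched.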

slurm Command

The slurm command has the following sub-commands:

  cluster  Slurm cluster actions
  ssh      Slurm SSH actions

The slurm cluster sub-command has the following sub-sub-commands:

  create       Create a Slurm cluster with controllers and login nodes
  destroy      Destroy a Slurm controller
  orchestrate  Orchestrate a Slurm cluster with shared file system and
               Batch...
  start        Starts a previously suspended Slurm cluster
  status       Query status of a Slurm controllers and login nodes
  suspend      Suspend a Slurm cluster controller and/or login nodes

The slurm ssh sub-command has the following sub-sub-commands:

  controller  Interactively login via SSH to a Slurm controller virtual...
  login       Interactively login via SSH to a Slurm login/gateway virtual...
  node        Interactively login via SSH to a Slurm compute node virtual...

  • cluster create will create the Slurm controller and login portions of the cluster
  • cluster destroy will destroy the Slurm controller and login portions of the cluster
    • --delete-resource-group will delete the entire resource group that contains the Slurm resources. Please take care when using this option as every resource in the resource group is deleted, including any resources that are not related to Batch Shipyard.
    • --delete-virtual-network will delete the virtual network and all of its subnets
    • --generate-from-prefix will attempt to generate all resource names using the standard naming conventions. This is helpful when there was an issue with creation/deletion and the original virtual machine resources cannot be enumerated. Note that OS disks cannot be deleted with this option. Please use an alternate means (e.g., the Azure Portal) to delete disks that may have been used by the Slurm resource VMs.
    • --no-wait does not wait for deletion completion. It is not recommended to use this parameter.
  • cluster orchestrate will orchestrate the entire Slurm cluster with a single Batch pool
    • --storage-cluster-id will orchestrate the specified RemoteFS shared file system
  • cluster start will start a previously suspended Slurm cluster
    • --no-controller-nodes does not start controller nodes
    • --no-login-nodes does not start login nodes
    • --no-wait does not wait for the restart to complete. It is not recommended to use this parameter.
  • cluster status queries the status of the Slurm controller and login nodes
  • cluster suspend suspends the Slurm cluster
    • --no-controller-nodes does not suspend controller nodes
    • --no-login-nodes does not suspend login nodes
    • --no-wait does not wait for the suspension to complete. It is not recommended to use this parameter.
  • ssh controller will SSH into the Slurm controller nodes if permitted with the controller SSH user
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., slurm ssh controller -- sudo docker ps -a.
    • --offset is the cardinal offset of the controller node
    • --tty allocates a pseudo-terminal
  • ssh login will SSH into the Slurm login nodes with the cluster user identity
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., slurm ssh login -- sudo docker ps -a.
    • --offset is the cardinal offset of the login node
    • --tty allocates a pseudo-terminal
  • ssh node will SSH into a Batch compute node with the cluster user identity
    • COMMAND is an optional argument to specify the command to run. If your command has switches, preface COMMAND with double dash as per POSIX convention, e.g., slurm ssh node -- sudo docker ps -a.
    • --node-name is the required Slurm node name
    • --tty allocates a pseudo-terminal
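For example, running a quick command on a Slurm compute node might look like the following sketch (the node name is hypothetical):

```shell
# run "hostname" on a specific Slurm compute node and exit
shipyard slurm ssh node --configdir . --node-name mypool-node-000 -- hostname
```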

storage Command

The storage command has the following sub-commands:

  clear  Clear Azure Storage containers used by Batch...
  del    Delete Azure Storage containers used by Batch...
  sas    SAS token actions

The storage sas sub-command has the following sub-sub-commands:

  create  Create a container- or object-level SAS key

  • clear will clear the Azure Storage containers used by Batch Shipyard for metadata purposes
    • --poolid will target a specific pool id rather than from configuration
  • del will delete the Azure Storage containers used by Batch Shipyard for metadata purposes
    • --clear-tables will clear tables instead of deleting them
    • --poolid will target a specific pool id
  • sas create will create a SAS key for containers, file shares, individual blobs or file objects.
    • STORAGE_ACCOUNT is the storage account link to target. This link must be specified as a credential.
    • PATH is the Azure storage path including the container or file share name
    • --create adds a create permission (only applicable to objects)
    • --delete adds a delete permission
    • --list adds a list permission (only applicable to container/file share)
    • --file creates a file SAS rather than a blob SAS
    • --read adds a read permission
    • --write adds a write permission
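Note the positional argument order: STORAGE_ACCOUNT first, then PATH. The following sketch uses hypothetical account, container, and blob names:

```shell
# container-level SAS with read+list on container "mycontainer" in the
# storage account linked in credentials as "mystorage"
shipyard storage sas create --read --list mystorage mycontainer

# object-level blob SAS with read on a single blob within that container
shipyard storage sas create --read mystorage mycontainer/output/stdout.txt
```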

Example Invocations

shipyard pool add --credentials credentials.yaml --config config.yaml --pool pool.yaml

# ... or if all config files are in the current working directory and named as above ...
# (note this is not strictly necessary as Batch Shipyard will search the
# current working directory if the options above are not explicitly specified)

shipyard pool add --configdir .

# ... or use environment variables instead

SHIPYARD_CONFIGDIR=. shipyard pool add

The above invocation will add the pool specified to the Batch account. Notice that the options and shared options are given after the command and sub-command and not before.

shipyard jobs add --configdir .

# ... or use environment variables instead

SHIPYARD_CONFIGDIR=. shipyard jobs add

The above invocation will add the jobs specified in the jobs.yaml file to the designated pool.

shipyard data files stream --configdir . --filespec job1,task-00000,stdout.txt

# ... or use environment variables instead

SHIPYARD_CONFIGDIR=. shipyard data files stream --filespec job1,task-00000,stdout.txt

The above invocation will stream the stdout.txt file from task task-00000 of job job1 from a live compute node. Because all portions of the --filespec option are specified, the tool will not prompt for any input.
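The --filespec argument is a comma-delimited triple of job id, task id, and file name, which splits as the following sketch shows:

```shell
# split a jobid,taskid,filename triple into its parts, as --filespec expects
filespec="job1,task-00000,stdout.txt"
IFS=, read -r jobid taskid filename <<EOF
$filespec
EOF
echo "job=$jobid task=$taskid file=$filename"
```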

Batch Shipyard Container Image CLI Invocation

If using either the Docker image alfpark/batch-shipyard:latest-cli or the Singularity image shub://alfpark/batch-shipyard-singularity:cli, then you would invoke Batch Shipyard as:

# if using Docker
docker run --rm -it alfpark/batch-shipyard:latest-cli \
    <command> <subcommand> <options...>

# if using Singularity
singularity run shub://alfpark/batch-shipyard-singularity:cli \
    <command> <subcommand> <options...>

where <command> <subcommand> is the command and subcommand as described above and <options...> are any additional options to pass to the <subcommand>.

Invariably, you will need to pass config files to the tool which reside on the host and not in the container by default. Please use the -v volume mount option with docker run or -B bind option with singularity run to mount host directories inside the container. For example, if your Batch Shipyard configs are stored in the host path /home/user/batch-shipyard-configs you could modify the invocations as:

# if using Docker
docker run --rm -it \
    -v /home/user/batch-shipyard-configs:/configs \
    -w /configs \
    alfpark/batch-shipyard:latest-cli \
    <command> <subcommand> <options...>

# if using Singularity
singularity run \
    -B /home/user/batch-shipyard-configs:/configs \
    --pwd /configs \
    shub://alfpark/batch-shipyard-singularity:cli \
    <command> <subcommand> <options...>

Notice that we specified the working directory as -w for Docker or --pwd for Singularity to match the /configs container path.

Additionally, if you wish to ingress data from locally accessible file systems using Batch Shipyard, then you will need to map additional volume mounts as appropriate from the host to the container.

Batch Shipyard may generate files with some actions, such as adding an SSH user or creating a pool with an SSH user. In this case, you will need to create a volume mount with the -v (or -B) option and also ensure that the pool specification ssh object has a generated_file_export_path property set to the volume mount path. This will ensure that generated files are written to the host and persisted after the Docker container exits. Otherwise, the generated files will only reside within the Docker container and will not be available for use on the host (e.g., SSHing into a compute node with the generated RSA private key or using the generated SSH Docker tunnel script).
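For example, when mounting host configs at /configs as above, the relevant portion of the pool configuration might look like the following sketch (the username is hypothetical; generated_file_export_path must match the container-side mount path):

```yaml
pool_specification:
  # ... other pool settings ...
  ssh:
    username: shipyarduser
    # write generated SSH keys/scripts to the bind-mounted host directory
    generated_file_export_path: /configs
```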

Remote Filesystem Support

For more information regarding remote filesystems and Batch Shipyard, please see this page.

Data Movement

For more information regarding data movement with respect to Batch Shipyard, please see this page.

Multi-Instance Tasks

For more information regarding Multi-Instance Tasks and/or MPI jobs using Batch Shipyard, please see this page.

Current Limitations

Please see this page for current limitations.

Explore Recipes and Samples

Visit the recipes directory for different sample Docker workloads using Azure Batch and Batch Shipyard.

Need Help?

Open an issue on the GitHub project page.