# Troubleshooting Guide
This guide is to help in the event you encounter issues while using Batch Shipyard. You can also visit the FAQ for other questions that do not fall into the categories below.
## Table of Contents
- Installation Issues
- Azure Batch Service Issues
- Compute Pool and Node Issues
- Job/Task Execution Issues
- Container Issues
## Installation Issues
### Anaconda Python Environments
Anaconda Python environments are structured differently than standard CPython environments. Anaconda has isolated environments which are conceptually equivalent to virtual environments, and it uses a packaging mechanism separate from the packages traditionally found on PyPI. As such, special attention should be given when installing Batch Shipyard into an Anaconda environment. If you do use Anaconda, it is highly recommended to use an environment that provides Python 3.5 or higher. In general, CPython is recommended over Anaconda, especially when installing into a virtual environment (the recommended installation method), as command startup lag can be much shorter.
However, if you do plan to use Anaconda, then it is recommended to use the `install.sh` script to install into an Anaconda environment on Linux and the `install.cmd` command file to install into an Anaconda environment on Windows.
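On Linux, the flow might look like the following sketch; the environment name `shipyard` and the Python version are illustrative assumptions, and the installer is run from your cloned Batch Shipyard directory:

```shell
# create and activate an illustrative Anaconda environment with a
# Python version of 3.5 or higher
conda create -n shipyard python=3.6
source activate shipyard

# run the Batch Shipyard installer from the cloned repository root
./install.sh
```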
## Azure Batch Service Issues
### Check Azure Batch Service Status
If you suspect possible Azure Batch service issues, you can check the status of Azure services at the Azure status page: https://status.azure.com/
## Compute Pool and Node Issues
### Resize error is encountered with a pool
Resize errors with a pool can happen any time a pool is growing or shrinking. Remember that when you issue `pool add`, a pool starts with zero compute nodes and then grows to the target number of nodes specified. You can query the pools in your account with the `pool list` command and any resize errors will be displayed. You can also query this information using the Azure Portal or Batch Labs. If it appears that the resize error was transient, you can try to issue `pool resize` to begin the pool grow or shrink process again, or alternatively you can opt to recreate the pool.
There are three common reasons for resize errors:
- Insufficient core quota: Non-UserSubscription Batch accounts by default have 20 cores associated with them. These core quota are managed independently of any core quota on the associated subscription. UserSubscription Batch Accounts have core quota that is associated with the subscription. Please follow this guide for submitting a support request to increase your core quota.
- Operation(s) took longer than expected: Resizing the pool to a different target VM count may take longer than the specified timeout. In these cases, re-issue the resize command (see the sketch after this list).
- Not enough IPs in the virtual network subnet: When creating a pool with a UserSubscription Batch account with a virtual network, you must ensure that there is a sufficient number of available IPs in your subnet. Batch Shipyard will attempt to validate this on your behalf if you specify the subnet's address range in the configuration. You can attempt to change the address range of the subnet independently (if pre-created) and issue the resize command again if you encounter this issue.
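A typical recovery flow for a transient resize error is a minimal two-command sketch, assuming the `shipyard` CLI can locate your configuration files:

```shell
# list the pools in the account; any resize errors are displayed
shipyard pool list

# if the error appears transient, re-issue the resize to retry the
# grow or shrink operation
shipyard pool resize
```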
### Compute node appears to be stuck in `starting` state
If you are using pools with `native` container support, compute nodes that appear to be "stuck" in the `starting` state may not really be stuck. During this phase, all Docker images specified in `global_resources` are preloaded onto the compute nodes. Thus, it may take a while for your compute nodes to transition to `idle` from this state.
If you are not using pools with `native` container support, then there may be an issue allocating the node from the Azure Cloud. Azure Batch automatically tries to recover from this state, but may not be able to on occasion. In these circumstances, you can delete the affected nodes with `pool nodes del --all-starting` and then `pool resize` to scale the pool back to your desired amount.
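Using the commands referenced above, the recovery sketch looks like this (again assuming the CLI can locate your configuration files):

```shell
# delete only the nodes that are stuck in starting state
shipyard pool nodes del --all-starting

# scale the pool back up to the desired number of nodes
shipyard pool resize
```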
### Compute node appears to be stuck in `waiting_for_start_task` state
Compute nodes that appear to be "stuck" in `waiting_for_start_task` may not really be stuck. If you are not using pools with `native` container support, you are specifying that nodes should block for all Docker images to be present on the node before allowing scheduling, and your Docker images are large, then it may take a while for your compute nodes to transition from `waiting_for_start_task` to `idle`. Additionally, if your Docker images are sourced from Docker Hub, then Docker Hub may throttle or outright reject requests from your pool as compute nodes attempt to retrieve your images. It is recommended to isolate yourself from potential Docker Hub issues by provisioning your own Azure Container Registry within the region of your Batch account to reduce latency and improve bandwidth. A Premium Azure Container Registry may be an appropriate option for very large pools.
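Sourcing images from your own registry is a `config.yaml` change; in this sketch, the registry and image names are hypothetical placeholders:

```yaml
global_resources:
  docker_images:
    # a hypothetical image hosted in your own Azure Container Registry
    # located in the same region as your Batch account
    - myregistry.azurecr.io/myimage:1.0
```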
If you are certain the above is not the cause for this behavior, then it may indicate a regression in the Batch Shipyard code, a new Docker release that is causing interaction issues (e.g., with nvidia-docker) or some other problem that was not caught during testing. You can retrieve the compute node start task stdout and stderr files to diagnose further and report an issue on GitHub if it appears to be a defect.
### Compute node enters `start_task_failed` state
For pools that are allocated without `native` container support, Batch Shipyard installs the Docker Host Engine and other requisite software when the compute node starts. Even with pools with `native` container support, some additional software is installed along with integrity checks of the compute node. There is a possibility for the start task to fail due to transient network faults when issuing system software updates or other issues. You can turn on automatic rebooting in the `pool.yaml` config file, where Batch Shipyard can attempt to mitigate the issue on your behalf. Alternatively, you can issue the command `pool nodes reboot --all-start-task-failed`, which will attempt to reboot the nodes that have entered this state.
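The manual mitigation is a single command, as referenced above:

```shell
# attempt to reboot every node that entered start_task_failed state
shipyard pool nodes reboot --all-start-task-failed
```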
If the compute node fails to start properly, Batch Shipyard will automatically download the compute node's `stdout.txt`, `stderr.txt` and `wd/cascade*.log` files for the start task into the directory where you ran `shipyard`. The files will be placed in `<pool name>/<node id>/startup/`. You can examine these files to see what the possible culprit for the issue is. If it appears to be transient, you can try to create the pool again. If it appears to be a Batch Shipyard issue, please report the issue on GitHub.
Additionally, if you have specified an SSH or RDP user for your pool and there is a start task failure, you can still issue the command `pool user add` to add the pool remote user and then `pool ssh` to SSH into the node to debug further, or manually RDP on Windows.
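A minimal debugging sketch with the two commands referenced above, assuming an SSH user is defined in your `pool.yaml`:

```shell
# add the remote user defined in pool.yaml to the pool's nodes
shipyard pool user add

# open an interactive SSH session on a node for further debugging
shipyard pool ssh
```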
Please note that the start task requires downloading some files that are uploaded to your Azure Storage account with the command `pool add`. These files have SAS tokens which allow the Batch compute node to authenticate with the Azure Storage service to download the files. These SAS tokens are bound to the storage account key with which they were generated. If you change or regenerate the storage account key that these SAS tokens were originally generated with, then the compute nodes will fail to start as the SAS tokens bound to these files will no longer be valid. You will need to recreate your pool in these situations.
### Compute node enters `unusable` state
If compute nodes enter the `unusable` state, then this indicates that there was an issue allocating the node from the Azure Cloud or that the Azure Batch service can no longer communicate with the compute node. Azure Batch automatically tries to recover from such situations, but may not be able to on occasion. In these circumstances, you can delete the affected nodes with `pool nodes del --all-unusable` and then resize back up with `pool resize` or recreate the pool.
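The recovery sketch mirrors the one for nodes stuck in `starting` state:

```shell
# delete all nodes in unusable state
shipyard pool nodes del --all-unusable

# resize the pool back up to the desired node count
shipyard pool resize
```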
Another potential cause of nodes entering this state is pools which are part of a virtual network. Improper NSG rules can prevent communication between the compute nodes and the Batch service, which will result in `unusable` nodes.
### Pool creation fails due to `Could not find an Azure Batch Node Agent Sku`
If you are using a `platform_image`, you may encounter an error such as:

```
RuntimeError: Could not find an Azure Batch Node Agent Sku for this offer=abc publisher=def sku=xyz. You can list the valid and available Marketplace images with the command: account images
```

This problem can happen if you are specifying a `sku` that is not listed by the `account images` command. You will need to update your `sku` field to one that is listed.
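For example, first list the valid images, then adjust the `platform_image` in your pool configuration accordingly. The publisher/offer/sku values below are placeholders, and the exact nesting of the `platform_image` block may vary with your Batch Shipyard version:

```shell
# list the valid and available Marketplace images for the account
shipyard account images
```

```yaml
# pool.yaml fragment (illustrative values only)
vm_configuration:
  platform_image:
    publisher: Canonical
    offer: UbuntuServer
    sku: 18.04-LTS
```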
## Job/Task Execution Issues
### Task is submitted but doesn't run
There are various reasons why this can happen:
- There are insufficient compute nodes to service the job
- The task is multi-instance and there are not enough compute nodes to run the job as specified
- The `jobs.yaml` file was submitted with the wrong `pool.yaml` file, causing a mismatch in the target pool for the jobs (see the sketch after this list)
- The `jobs.yaml` file was submitted with the wrong `config.yaml` file, causing an infinite wait on a Docker image to be present that may not exist
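One way to guard against the config mismatch cases is to keep each workload's configuration files together and reference them consistently; this sketch assumes the `--configdir` option, which points the CLI at a directory containing the YAML config files:

```shell
# keep credentials.yaml, config.yaml, pool.yaml and jobs.yaml for a
# workload in one directory and always pass that same directory
shipyard pool add --configdir config/myworkload
shipyard jobs add --configdir config/myworkload
```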
### Task runs and completes but fails with a non-zero exit code or scheduling error
In Azure Batch, whether a task completes is independent of whether that task is considered a success or a failure. The `jobs tasks list` command will list the status of all tasks for the jobs specified, including exit codes and any scheduling errors. You can use this information in combination with the task's stdout and stderr files to determine what went wrong if your task has completed but has failed.
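For example, assuming the CLI can locate your configuration files:

```shell
# list the status of all tasks for the jobs in jobs.yaml, including
# exit codes and any scheduling errors
shipyard jobs tasks list
```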
## Container Issues
### `pool images update` command doesn't run
The `pool images update` command runs as a normal job if your pool is comprised entirely of dedicated compute nodes. Thus, your compute nodes must be able to accommodate this update job and task. If your pool only has one node in it, the update will run as a single task under a job. If the node in this pool is busy and `max_tasks_per_node` in your `pool.yaml` is either unspecified or set to 1, then it will be blocked behind the running task.
For pools with more than one node, the update images command will run as a multi-instance task to guarantee that all nodes in the pool have updated the specified container image to latest or the given hash. The multi-instance task will be run on the current number of nodes in the pool at the time the `pool images update` command is issued. If the pool is resized down and the number of nodes decreases before the task can be scheduled, then the update container images job will not be able to execute and will stay active until the number of compute nodes reaches the prior number. Additionally, if `max_tasks_per_node` is set to 1 or unspecified in `pool.yaml` and any task is running on any node, the update container images job will be blocked until that task completes.
You can work around this behavior by providing the `--ssh` option to the `pool images update` command. This will use an SSH side-channel to upgrade the container images on the pool. Please note that this requires a provisioned SSH user and `ssh` or `ssh.exe` available.
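The workaround is a single flag on the command referenced above:

```shell
# update container images over the SSH side-channel instead of a
# Batch job; requires a provisioned SSH user and an ssh client
shipyard pool images update --ssh
```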
`pool images update` will always use the SSH side-channel method for pools containing a positive number of low priority nodes or for pools which are `native` mode enabled.