Batch Shipyard Task Factory and Merge Tasks
The focus of this article is to describe the task factory and merge task concepts. Task factories can be utilized to generate arbitrary task arrays, and are particularly useful in creating parameter (parametric) sweeps, replicated/repeated tasks, or assigning generated parameters for tasks. Merge tasks can be used to automatically create dependent final tasks. A merge task is useful in cases where a final task is required after a set of tasks are run, and should only be run after those tasks complete.
Quick Navigation
Task Factory
The normal configuration structure for a job in Batch Shipyard is through the
definition of a tasks
array which contain individual task specifications.
Sometimes it is necessary to create a set of tasks where the base task
specification is the same (e.g., the run options, input, etc.) but the
arguments and options for the command
must vary between tasks. This can
become tedious and error-prone to perform by hand or requires auxillary
code to generate the jobs configuration.
A task factory is simply a task generator for a job. With this functionality,
you can direct Batch Shipyard to generate a set of tasks given a
task_factory
property. If applicable, parameters specified in the
task_factory
are then applied to the command
resulting in a transformed
task.
Note that you can attach only one task_factory
specification to one
task specification within the tasks
array. However, you can have multiple
task specifications in the tasks
array thus allowing for multiple and
potentially different types of task factories per job.
Now we'll dive into each type of task factory available in Batch Shipyard.
Task Factory Quick Navigation
- Parametric Sweep: Product
- Parametric Sweep: Product Iterables
- Parametric Sweep: Combinations
- Parametric Sweep: Permutations
- Parametric Sweep: Zip
- Random
- Repeat
- File
- Custom
Parametric (Parameter) Sweep
A parametric_sweep
will generate parameters to apply to the command
according to the type of sweep.
Product
A product
parametric_sweep
can perform nested or unnested parameter
generation. For example, if you need to generate a range of integers from
0 to 9 with a step size of 1 (thus 10 integers total), you would specify this
as:
task_factory: parametric_sweep: product: - start: 0 step: 1 stop: 10 command: /bin/bash -c "sleep {0}"
As shown above, the associated command
requires either {}
or {0}
Python-style string formatting to specify where to substitute the generated
argument value within the command
string.
This task_factory
example specified above would create 10 tasks:
Task 0: /bin/bash -c "sleep 0" Task 1: /bin/bash -c "sleep 1" Task 2: /bin/bash -c "sleep 2" ... Task 9: /bin/bash -c "sleep 9"
As mentioned above, product
can generate nested parameter sets. To do this
one would create two or more start
, stop
, step
objects in the
product
array. For example:
task_factory: parametric_sweep: product: - start: 0 step: 1 stop: 3 - start: 100 step: -1 stop: 97 command: /bin/bash -c "sleep {0}; sleep {1}"
would generate 9 tasks (i.e., 3 * 3
sets of parameters):
Task 0: /bin/bash -c "sleep 0; sleep 100" Task 1: /bin/bash -c "sleep 0; sleep 99" Task 2: /bin/bash -c "sleep 0; sleep 98" Task 3: /bin/bash -c "sleep 1; sleep 100" Task 4: /bin/bash -c "sleep 1; sleep 99" Task 5: /bin/bash -c "sleep 1; sleep 98" Task 6: /bin/bash -c "sleep 2; sleep 100" Task 7: /bin/bash -c "sleep 2; sleep 99" Task 8: /bin/bash -c "sleep 2; sleep 98"
You can nest an arbitrary number of parameter sets within the product
array.
Product Iterables
A product_iterables
parametric_sweep
can perform nested or unnested
parameter generation similar to product
but can operate on arbitrary
list or string iterables. For example, if you need to generate a sweep of
input of the strings abc
, def
, ghi
with an inner nest of 1
, 2
, 3
,
you would specify this as:
task_factory: parametric_sweep: product_iterables: - - abc - def - ghi - - '1' - '2' - '3' command: /bin/bash -c "echo {0}; sleep {1}"
As shown above, the associated command
requires either {}
or {0}
Python-style string formatting to specify where to substitute the generated
argument value within the command
string.
This task_factory
example specified above would create 9
(i.e., 3 * 3 sets of parameters) tasks:
Task 0: /bin/bash -c "echo abc; sleep 1" Task 1: /bin/bash -c "echo abc; sleep 2" Task 2: /bin/bash -c "echo abc; sleep 3" Task 3: /bin/bash -c "echo def; sleep 1" Task 4: /bin/bash -c "echo def; sleep 2" Task 5: /bin/bash -c "echo def; sleep 3" Task 6: /bin/bash -c "echo ghi; sleep 1" Task 7: /bin/bash -c "echo ghi; sleep 2" Task 8: /bin/bash -c "echo ghi; sleep 3"
You can nest an arbitrary number of parameter sets within the
product_iterables
array, even mixing strings and lists of strings.
Combinations
The combinations
parametric_sweep
generates length
subsequences of
parameters from the iterable
. Combinations are emitted in lexicographic
sort order. Combinations with replacement can be specified by setting the
replacement
option to true
. For example:
task_factory: parametric_sweep: combinations: iterable: - abc - '012' - def length: 2 replacement: false command: /bin/bash -c "echo {0}; echo {1}"
would generate 3 tasks:
Task 0: /bin/bash -c "echo abc; echo 012" Task 1: /bin/bash -c "echo abc; echo def" Task 2: /bin/bash -c "echo 012; echo def"
Permutations
The permutations
parametric_sweep
generates length
permutations of
parameters from the iterable
. Permutations are emitted in lexicographic
sort order. For example:
task_factory: parametric_sweep: permutations: iterable: - abc - '012' - def length: 2 command: /bin/bash -c "echo {0}; echo {1}"
would generate 6 tasks:
Task 0: /bin/bash -c "echo abc; echo 012" Task 1: /bin/bash -c "echo abc; echo def" Task 2: /bin/bash -c "echo 012; echo abc" Task 3: /bin/bash -c "echo 012; echo def" Task 4: /bin/bash -c "echo def; echo abc" Task 5: /bin/bash -c "echo def; echo 012"
Zip
The zip
parametric_sweep
generates parameters where the i-th parameter
contains the i-th element from each iterable. For example:
task_factory: parametric_sweep: zip: - abc - '012' - def command: /bin/bash -c "echo {0}; echo {1}; echo {2}"
would generate 3 tasks:
Task 0: /bin/bash -c "echo a; echo 0; echo d" Task 1: /bin/bash -c "echo b; echo 1; echo e" Task 2: /bin/bash -c "echo c; echo 2; echo f"
zip
supports mixing strings and list of strings for each iterable.
Random
A random
task factory will generate random values for the command
up to
N times as specified by the generate
property. The random
task factory
can generate both integral and floating point (real) values.
For example:
task_factory: random: generate: 3 integer: start: 0 step: 1 stop: 10 command: /bin/bash -c "sleep {}"
will generate 3 tasks with random integral sleep times ranging from 0 to 9.
To generate floating point values, you can use the distribution
functionality as required by your scenario. For example:
task_factory: random: distribution: uniform: a: 0.0 b: 1.0 generate: 3 command: /bin/bash -c "sleep {}"
will generate 3 tasks with random floating point values pulled from a uniform distribution between 0.0 and 1.0.
The following distributions are available:
uniform
triangular
beta
exponential
gamma
gauss
lognormal
pareto
weibull
For more information, please see the distribution property explanations.
Repeat
A repeat
task factory simply replicates the command
N number of times.
For example:
task_factory: repeat: 3 command: /bin/bash -c "sleep 1"
would create three tasks with identical commands of /bin/bash -c "sleep 1"
.
File
A file
task factory will generate tasks by enumerating a target storage
container or file share for entities and then applying any specified keyword
arguments to the command
.
For example, let's assume that we want to generate a task for every blob
found in the container mycontainer
in the storage account link named
mystorageaccount
. The task factory for this could be:
task_factory: file: azure_storage: storage_account_settings: mystorageaccount container: mycontainer task_filepath: file_path command: /bin/bash -c "echo url={url} full_path={file_path_with_container} file_path={file_path} file_name={file_name} file_name_no_extension={file_name_no_extension}"
As you can see from the command
above, there are keyword formatters
available:
url
is the full URL of the blob resource including the SAS. This is not available for files on file shares.file_path_with_container
is the path of the blob or file (with all virtual directories) prepended with the container or file share namefile_path
is the path of the blob or file (with all virtual directories)file_name
is the blob or file name without the virtual directoriesfile_name_no_extension
is just the blob or file name without the virtual directories and file extension
Let's assume that mycontainer
contains the following blobs:
test0.bin test1.bin archived\old0.bin archived\old1.bin
This would generate 4 tasks:
Task 0: /bin/bash -c "echo url=<full blob url with sas> full_path=mycontainer/test0.bin file_path=test0.bin file_name=test0.bin file_name_no_extension=test0" Task 1: /bin/bash -c "echo url=<full blob url with sas> full_path=mycontainer/test1.bin file_path=test1.bin file_name=test1.bin file_name_no_extension=test1" Task 2: /bin/bash -c "echo url=<full blob url with sas> full_path=mycontainer/archived/old0.bin file_path=archived/old0.bin file_name=old0.bin file_name_no_extension=old0" Task 3: /bin/bash -c "echo url=<full blob url with sas> full_path=mycontainer/archived/old1.bin file_path=archived/old1.bin file_name=old1.bin file_name_no_extension=old1"
Each task would automatically download each blob "assigned" to it
automatically in the task's working directory as specified by the
task_filepath
property. This property has the following valid values,
similar to the keyword arguments above:
file_path_with_container
is the path of the blob or file (with all virtual directories) prepended with the container or file share namefile_path
is the path of the blob or file (with all virtual directories)file_name
is the blob or file name without the virtual directoriesfile_name_no_extension
is just the blob or file name without the virtual directories and file extension
For the example above, the files would be downloaded to the compute node
as follows, given the task_filepath
being set to file_path
:
Task 0: wd/test0.bin Task 1: wd/test1.bin Task 2: wd/archived/old0.bin Task 3: wd/archived/old1.bin
Please note that a point in time listing of the blob container or file share
is performed when the jobs add
is called. Any modification of the
container or file share during jobs add
will result in non-deterministic
behavior or even potentially unstable execution of the submission process.
Custom
A custom
task factory will generate tasks by calling a custom Python-based
generator function named generate
supplied by the user. This is accomplished
by importing a user-defined Python module which has a defined generate
generator function.
For example, suppose we create a directory named foo
in our Batch Shipyard
installation directory and have our custom generator as follows:
batch-shipyard |-- foo |-- __init__.py +-- generator.py
Inside the foo
directory we have a bare __init__.py
file and a file named
generator.py
which will contain our logic to generate parameters. Note that
the custom task factory does not have to reside within your Batch Shipyard
installation directory, but must be resolvable by importlib.import_module()
.
Inside generator.py
resides our logic to generate parameters for the
task factory. The one required function to implement must be named
generate
. For example, let's suppose we want to generate a range of
parameters for each argument given:
# in file generator.py def generate(*args, **kwargs): for arg in args: for x in range(0, arg): yield (x,)
The generate
function acceps two variadic parameters: *args
and **kwargs
which correspond to the configuration input_args
and input_kwargs
,
respectively. If using input_kwargs
, then the dictionary specified
must have only string-based keys. You can use any combination of input_args
and input_kwargs
or no input if not required. In this example, for each
positional argument (i.e., *args
), we are creating a range from 0
to that
argument value and yield
ing the result as a iterable (tuple). Yielding
the result as an iterable is mandatory as the return value is unpacked and
applied to the command
. This allows for multiple parameters to be generated
and applied for each generated task. An example corresponding configuration
may be similar to the following:
task_factory: custom: input_args: - '1' - '2' - '3' module: foo.generator command: /bin/bash -c "sleep {}"
which would result in 6 tasks:
Task 0: /bin/bash -c "sleep 0" Task 1: /bin/bash -c "sleep 0" Task 2: /bin/bash -c "sleep 1" Task 3: /bin/bash -c "sleep 0" Task 4: /bin/bash -c "sleep 1" Task 5: /bin/bash -c "sleep 2"
Of course, this example is contrived and custom task factory logic will invariably be more complex. Your generator function can be dependent upon any Python package that is needed to accomodate complex task factory parameter generation scenarios. Please note that if you have installed your Batch Shipyard environment into a virtual environment and your dependencies are non-local (i.e., not in the Batch Shipyard directory), then you need to ensure that your dependencies are properly installed in the correct environment.
Merge Task
A merge task can be thought of as a final task that is run after all tasks
specified within the tasks
array complete successfully. The tasks for which
a merge task is dependent upon can be manually generated or a task factory.
For instance:
tasks: - task_factory: repeat: 3 command: /bin/bash -c "sleep 1" merge_task: command: /bin/bash -c "echo merge"
would create four total tasks. Batch Shipyard will automatically create
the dependencies between the merge_task
which echoes "merge" with the
three "sleep 1" tasks. Thus, the merge_task
will run after all three
sleep tasks complete successfully.
Configuration guide
Please see the jobs configuration guide for more information on configuration for jobs and tasks.