Scattering Inputs
Part 1: Getting started
If you need to run a tool or workflow on an array of inputs, or on multiple arrays of inputs, scatter is the way to accomplish this. We will use the following tool, wc.cwl, as an example of what we are looping over:
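A minimal sketch of what wc.cwl could look like, based on the names used in the rest of this page (inputs named lines and file, an output named output, a result file called output.txt). The input bindings, positions, and the use of stdout capture are assumptions, not the original listing:

```yaml
#!/usr/bin/env cwl-runner
# Hypothetical reconstruction of wc.cwl; only the input/output names
# and the output.txt file name are taken from this page.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
# Capture the command's standard output in a file (assumed).
stdout: output.txt
inputs:
  lines:
    # Optional flag: when true, "-l" is added to the command line.
    type: boolean?
    inputBinding:
      position: 1
      prefix: "-l"
  file:
    # The single file to count.
    type: File
    inputBinding:
      position: 2
outputs:
  output:
    # Shorthand for "the file that stdout was redirected to".
    type: stdout
```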
This runs the Linux command wc (word count) on an input file, with the option to use the -l (lines) flag. Let’s assume we want to run this tool on an array of files. (The wc command itself accepts a list of files, but let’s ignore that for this example.)
The way to run a tool on an array of inputs is to do it at the workflow level:
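A minimal sketch of such a workflow, pieced together from the fragments discussed below and the full listing in part 2. Treat the exact layout as an assumption rather than the original file:

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

# Required for any step that uses "scatter".
requirements:
  - class: ScatterFeatureRequirement

inputs:
  lines: boolean?        # one flag, shared by every run of the tool
  file_array: File[]     # the array we scatter over

outputs:
  output_array:
    type: File[]         # one output file per scattered run
    outputSource:
      - wc/output

steps:
  wc:
    run: wc.cwl
    in:
      lines: lines
      file: file_array   # array connected to a single-File tool input
    scatter: file        # name of the TOOL input to iterate over
    out:
      - output
```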
Let’s go through the relevant parts.
This is necessary for using the scatter functionality:
requirements:
  - class: ScatterFeatureRequirement
We want to run the tool on a list of input files. This is indicated by placing square brackets after the type:
inputs:
  lines: boolean?
  file_array: File[]
We will get back an array of files. Note that a scatter step always results in an array of whatever type the tool you are scattering produces. For example, if the tool produces a File, the scattered version will produce an array of files. If the tool produces an array, the scattered version produces an array of arrays. This is true whether the output of the step is the final workflow output, as in the above example, or is being fed into another step.
outputs:
  output_array:
    type: File[]
    outputSource:
      - wc/output
Finally, we need to specify where and what we are scattering.
In this example we want to run the wc.cwl tool over multiple files. The tool only takes in one file, so we have to make the workflow run the tool multiple times. The tool has the file input named ‘file’, whereas the workflow has the array input named ‘file_array’. If we gave the tool the array input here, normally this would cause an error since a file array is not the same as a file:
in:
  lines: lines
  file: file_array
However, by adding the scatter definition, we tell the workflow to iterate over the array of files, running the tool once for each item in the array:
scatter: file
Note that the item we scatter over is the tool input name, NOT the workflow input name.
Part 2: dotproduct
This is a continuation from part 1. We will also be using the wc.cwl tool from that example.
In part 1 we covered how to do a simple scatter on an array of files. We’ll now extend that to any number of arrays. When you want to scatter over multiple arrays, you need to tell CWL how to combine them. For this example we will use the scatter method called “dotproduct”.
You can use the dotproduct as long as the arrays are the same length. The length of the arrays determines how many times your tool is run, and thus the length of the output array. For example, if you have two arrays of three items each and both are scattered, the tool is run three times: the first run takes the first item from each array as parameters, the second run takes the second item from each array, and so on. Let’s see an example:
#!/usr/bin/env cwl-runner
#
# Authors: Andrew Lamb
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
inputs:
  line_array: boolean[]
  file_array: File[]
outputs:
  output_array:
    type: File[]
    outputSource:
      - wc/output
steps:
  wc:
    run: wc.cwl
    in:
      lines: line_array
      file: file_array
    scatter:
      - lines
      - file
    scatterMethod: dotproduct
    out:
      - output
This is very similar to the first example; let’s look at what has changed.
We are still iterating over an array of input files, but here we also want to control whether the lines flag is used for each file, so we now provide an array of booleans:
inputs:
  line_array: boolean[]
  file_array: File[]
We now need to scatter two array inputs:
scatter:
  - lines
  - file
Finally, since we are scattering over more than one array, we need to provide the method:
scatterMethod: dotproduct
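To make the pairing concrete, here is a hypothetical input file for this workflow. The file names test_file1 to test_file3 are borrowed from part 3, and the boolean values are made up for illustration. With dotproduct, the step should run three times, pairing items by position:

```yaml
# Hypothetical inputs; for dotproduct both arrays must have the same length.
line_array:
  - true    # run 1: lines flag on,  paired with test_file1
  - false   # run 2: lines flag off, paired with test_file2
  - true    # run 3: lines flag on,  paired with test_file3
file_array:
  - class: File
    path: test_file1
  - class: File
    path: test_file2
  - class: File
    path: test_file3
```

The resulting output_array would then hold three files, in the same order as the inputs.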
Part 3: flat_crossproduct
This is a continuation from parts 1 and 2. We will also be using the wc.cwl tool from part 1.
In part 1 we covered how to do a simple scatter on an array of files. In part 2 we extended that to any number of arrays using the dotproduct. We will now look at scattering over multiple arrays using the flat crossproduct. Where the dotproduct requires that your arrays be the same length, the flat crossproduct can scatter over arrays of different lengths. In addition, where the length of the dotproduct output equals the length of the input arrays, the length of the flat crossproduct output is: len(array1) * len(array2) * … * len(array_n).
Another way of describing this is that the tool (wc.cwl) is run on every combination of inputs from each array. For example, if you have an array of 3 files and an array of 2 flags, you will have 6 outputs: each file is processed once per flag. The example workflow is exactly the same as the one in part 2 except:
scatterMethod: flat_crossproduct
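Spelled out, the step from part 2 would then read roughly as below. The comments on loop order are an addition here: per the CWL specification, combinations follow the order of the scatter list, so the first listed input varies slowest.

```yaml
steps:
  wc:
    run: wc.cwl
    in:
      lines: line_array
      file: file_array
    scatter:
      - lines   # outer loop: changes slowest
      - file    # inner loop: changes fastest
    # Run the tool for every (lines, file) combination and
    # return the results as one flat array.
    scatterMethod: flat_crossproduct
    out:
      - output
```

With two flags and three files, this should yield six runs: all three files with the first flag value, then all three files with the second.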
And the input yaml:
line_array:
  - true
  - false
file_array:
  - class: File
    path: test_file1
  - class: File
    path: test_file2
  - class: File
    path: test_file3
And finally, the output of “cwltool wc_workflow3.cwl wc_workflow.yaml”:
{
"output_array": [
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$a912a8cf6107efe1bff86c42b7899e0a090d383c"
},
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$ad06722d0c3641f8baf46242fcea51b77ee558e9"
},
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$35470ddb936f3d1d3a5b907ff73c61d8df35d968"
},
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 74,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$16fb2f95337e0b7c2b0e5076dc09b6509a762482"
},
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 77,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$c5c3a3c1ff8ef9d4573f8238cb67c355225775d7"
},
{
"path": "/home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 77,
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"class": "File",
"checksum": "sha1$b0fb51fac542b2b9f64d1408acabcfb61b8a4055"
}
]
}
Part 4: nested_crossproduct
This is very similar to flat_crossproduct. The difference is that instead of one long flat array, you will receive a nested array as output:
#!/usr/bin/env cwl-runner
#
# Authors: Andrew Lamb
cwlVersion: v1.0
class: Workflow
requirements:
  - class: ScatterFeatureRequirement
inputs:
  line_array: boolean[]
  file_array: File[]
outputs:
  output_array:
    type:
      type: array
      items:
        type: array
        items: File
    outputSource:
      - wc/output
steps:
  wc:
    run: wc.cwl
    in:
      lines: line_array
      file: file_array
    scatter:
      - lines
      - file
    scatterMethod: nested_crossproduct
    out:
      - output
The output will look like:
{
"output_array": [
[
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"checksum": "sha1$e211886d70dfff0eb61fc917d75f184ce8b609b7",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
},
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"checksum": "sha1$5a30593e67cc7d8e446b0ea1559da74fb35be45a",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
},
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 70,
"checksum": "sha1$0220442cc49f0a4b3f82821725b40449c4e150f6",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
}
],
[
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 74,
"checksum": "sha1$ff65542777206d16635fa2c1a3e0e6376ea02a29",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
},
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 77,
"checksum": "sha1$c5f042720e1f9e6cf75de5659ef01f547cd1d38f",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
},
{
"location": "file:///home/aelamb/cwl_stuff/output.txt",
"basename": "output.txt",
"size": 77,
"checksum": "sha1$e125e09c3b8a7d398014e791698dda762afb0bea",
"class": "File",
"path": "/home/aelamb/cwl_stuff/output.txt"
}
]
]
}