Hi everyone and welcome back to the Snakemake series. In this video, I will introduce the concept of wildcards and show how we can use them to build robust pipelines. We will start with the rule copyab, which makes a copy of file a and names it file b.
Timeline:
00:00 Intro
00:18 A simple rule
00:40 Two simple rules
01:00 Rule all
01:37 Wildcards
02:45 Graphics
03:15 Tandem rules
03:45 Ambiguities
04:07 Restrictions
04:28 Outro
rule copyab:
    input: "a.txt"
    output: "b.txt"
    shell: "cp {input} {output}"
When we launch the pipeline, the rule copyab is run and file b is generated. Now let's create a second rule, copyac, that makes a copy of file a and names it file c.
rule copyab:
    input: "a.txt"
    output: "b.txt"
    shell: "cp {input} {output}"

rule copyac:
    input: "a.txt"
    output: "c.txt"
    shell: "cp {input} {output}"
We launch the pipeline, and the only file that is generated is file b, because by default Snakemake only builds the output of the first rule in the file. So we need to put a rule all on top that requests both files b and c. If you are not sure how this works, check the rule all video.
rule all:
    input: "b.txt", "c.txt"

rule copyab:
    input: "a.txt"
    output: "b.txt"
    shell: "cp {input} {output}"

rule copyac:
    input: "a.txt"
    output: "c.txt"
    shell: "cp {input} {output}"
OK, now we launch the pipeline, all rules are run and we get both the b and c files. So far so good, but looking closer at these rules, they are essentially the same: the only difference is the name of the output file. To avoid repetition, we can use a wildcard for the variable part and write one generic rule. To do this, we write the variable part of the output between curly brackets. We will also change the name of the rule from copyab to copya to make it more generic.
rule all:
    input: "b.txt", "c.txt"

rule copya:
    input: "a.txt"
    output: "{x}.txt"
    shell: "cp {input} {output}"

# rule copyac:
#     input: "a.txt"
#     output: "c.txt"
#     shell: "cp {input} {output}"
When we launch the pipeline, all rules are run, files b and c are created, and we can see that the variable part of the output has been assigned to a wildcard. The first time the copya rule is run, the wildcard x is assigned the value b, and the second time it is run, the wildcard x is assigned the value c.
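To make the wildcard assignment a bit more concrete, here is a minimal Python sketch of the idea. It is not Snakemake's own code; the function name match_output is made up for this illustration, and the only thing taken from Snakemake is that an unconstrained wildcard behaves like the regular expression .+.

```python
import re

def match_output(pattern, filename):
    """Return the wildcard values if filename fits the pattern, else None."""
    # Splitting "{x}.txt" gives ['', 'x', '.txt']:
    # even positions are literal text, odd positions are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None

# The two files requested in rule all, matched against the output of copya.
for target in ["b.txt", "c.txt"]:
    print(target, match_output("{x}.txt", target))
```

Each requested file is compared with the output pattern, and whatever fills the place of {x} becomes the value of the wildcard for that run of the rule.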
We can get a graphical visualization of the pipeline by running the following command.
snakemake --dag | dot -Tpdf > x.pdf
This creates a pdf with the graph of the pipeline. In this case, we see that the rule copya is run twice, once with b as the value of the wildcard x and once with c. We can also see that both runs are needed to satisfy the files required in rule all.
Now let's try to run two rules in tandem: one that makes a copy of file a and names it b, and another that makes a copy of file b and names it c. We can specify the output of the first rule with a wildcard.
rule all:
    input: "b.txt", "c.txt"

rule copyab:
    input: "a.txt"
    output: "{x}.txt"
    shell: "cp {input} {output}"

rule copybc:
    input: "b.txt"
    output: "c.txt"
    shell: "cp {input} {output}"
When we run the pipeline, Snakemake figures out that it needs to replace the wildcard with b, all rules are run, and both files b and c are generated.
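How Snakemake gets from the requested files back to the wildcard value can be sketched in a few lines of Python. This is a simplified model, not Snakemake's real resolution logic: the names RULES, match and plan are made up for this illustration, and the sketch assumes that a rule with a fixed output is preferred over a rule with a wildcard output when both could make the same file.

```python
import re

# Simplified stand-ins for the two rules above: (name, input file, output pattern).
RULES = [
    ("copyab", "a.txt", "{x}.txt"),
    ("copybc", "b.txt", "c.txt"),
]

def match(pattern, filename):
    """Return the wildcard values if filename fits the pattern, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None

def plan(target, existing):
    """Work backwards from target and return the jobs in execution order."""
    if target in existing:
        return []  # the file is already there, nothing to run
    candidates = []
    for name, infile, pattern in RULES:
        wildcards = match(pattern, target)
        if wildcards is not None:
            candidates.append((name, infile, wildcards))
    if not candidates:
        raise ValueError(f"no rule produces {target}")
    # Assumption of this sketch: fewer wildcards wins, so a fixed output comes first.
    candidates.sort(key=lambda candidate: len(candidate[2]))
    name, infile, wildcards = candidates[0]
    # First make the input of the chosen rule, then run the rule itself.
    return plan(infile, existing) + [(name, wildcards, target)]

for job in plan("c.txt", existing={"a.txt"}):
    print(job)
```

Asking for c.txt leads to copybc, which needs b.txt, and b.txt can only come from copyab with x set to b. That is the chain the pipeline runs, starting from a.txt.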
However, things can get tricky quite fast. For example, let's also write the output of the second rule with a wildcard.
rule all:
    input: "b.txt", "c.txt"

rule copyab:
    input: "a.txt"
    output: "{x}.txt"
    shell: "cp {input} {output}"

rule copybc:
    input: "b.txt"
    output: "{x}.txt"
    shell: "cp {input} {output}"
When we launch the pipeline, we get an error message claiming that there is some ambiguity.
AmbiguousRuleException:
Rules copybc and copyab are ambiguous for the file c.txt.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
copybc: x=c
copyab: x=c
Expected input files:
copybc: b.txt
copyab: a.txt
Expected output files:
copybc: c.txt
copyab: c.txt
Indeed, file c could be made by either the first or the second rule, and Snakemake will not make a decision on this. One way to deal with this is by putting a restriction on the wildcard, which is done using pattern matching: a regular expression is written after a comma inside the curly brackets. In this case, we are writing that the wildcard x cannot match a c.
rule all:
    input: "b.txt", "c.txt"

rule copyab:
    input: "a.txt"
    output: "{x,[^c]}.txt"
    shell: "cp {input} {output}"

rule copybc:
    input: "b.txt"
    output: "{y}.txt"
    shell: "cp {input} {output}"
When we launch the pipeline, all rules are run, and both b and c files are generated. As you can imagine, we are just scratching the surface of what wildcards can do...
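The effect of the restriction [^c] can be checked with plain Python regular expressions. This is only a sketch of the pattern matching, not of Snakemake itself; the two compiled patterns are hand-written equivalents of "{x}.txt" and "{x,[^c]}.txt", and the helper wildcard_value is made up for this illustration.

```python
import re

# "{x}.txt": an unconstrained wildcard behaves like .+
unconstrained = re.compile(r"(?P<x>.+)\.txt")
# "{x,[^c]}.txt": the wildcard must be exactly one character that is not c
constrained = re.compile(r"(?P<x>[^c])\.txt")

def wildcard_value(regex, filename):
    """Return the value of x if the whole filename fits, else None."""
    found = regex.fullmatch(filename)
    return found.group("x") if found else None

for name in ["b.txt", "c.txt", "bb.txt"]:
    print(name, wildcard_value(unconstrained, name), wildcard_value(constrained, name))
```

With the restriction in place, copyab can still make b.txt but no longer c.txt, so only copybc is left for file c. Note that [^c] stands for a single character, so a longer name such as bb.txt would not fit the restricted pattern either.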