1  First Look

In this section we’ll look at making and running a first pipeline and understanding why snakemake is better for you than bash scripts.

1.1 Making a snakefile

snakemake is intended to replace the mess of bash scripts you use to run your workflows. So let’s look at converting a simple bash script to snakemake

minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | \
samtools view -S -h -b -q 25 -f 3 > aln.bam
samtools sort aln.bam -o aln.sorted.bam

The two commands convert into the following snakemake rules

rule sort:
  input: 'aln.bam'
  output: 'aln.sorted.bam'
  shell: "samtools sort aln.bam -o aln.sorted.bam"

rule align_and_bam:
  input:
    fq1="ecoli_left_R1.fq",
    fq2="ecoli_right_R2.fq",
    ref="ecoli_genome.fa"
  output: "aln.bam"
  shell: "minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -b -q 25 -f 3 > aln.bam"

rule final_alignments:
    input: 'aln.sorted.bam'

Here are the key points

  1. A snakemake file is composed of rules
  2. Each rule has (at least) one input file, (at least) one output file and a command for making the output file from the input
  3. Rules are therefore linked into a chain or tree by the files that go in and come out
  4. There is an extra final rule that specifies the final result of the pipeline. This rule has no output or command, only inputs

That is the basis of it!

1.2 Running the pipeline

We run the rules by putting them in a file. Usually this is suffixed with .snakefile to give something like my_pipeline.snakefile. When we come to run the pipeline snakemake needs us to tell it the name of the rule to run (recall our base rule is final_alignments ), and the snakefile and the cores the pipeline is allowed to use with the -c flag.

1.2.1 The dry-run

Usually we don’t want to run the pipeline without doing some sort of checking first. This is the purpose of the dry run feature. This allows us to see the jobs that the file specifies, without actually doing them, the snakemake -n flag creates a dry-run. Put together that makes something like

snakemake final_alignments --snakefile my_pipeline.snakefile -c1 -n
Default file and rules

Actually, snakemake looks for the default file Snakefile and the main rule all so if you use them you can have a command like snakemake -c1 -n which is less verbose but also less explicit.

We get a lot of output. It can be broken down into a few bits,

  1. The Summary
  2. The Jobs
  3. The to-be-done list

1.2.1.1 The Summary

At the top of the file we are given a summary of the number of times each rule will be run and the resources specified (here just the defaults)

Building DAG of jobs...
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
align_and_bam           1              1              1
final_alignments        1              1              1
sort                    1              1              1
total                   3              1              1

1.2.1.2 The Jobs

Next we get a much more granular view, each job is presented with the expected input and output and a reason why it needs running. Usually this will be either missing output files IE the output hasn’t been created so the job still needs to run or Input files updated by another job meaning that an input file is newer than an output file (or when it is created it will be) so this file needs updating.

[Wed Oct 26 16:51:18 2022]
rule align_and_bam:
    input: ecoli_left_R1.fq, ecoli_right_R2.fq, ecoli_genome.fa
    output: aln.bam
    jobid: 2
    reason: Missing output files: aln.bam
    resources: tmpdir=/var/folders/22/kjdvv_k14cj1m6hq5hl527qw0006zc/T


[Wed Oct 26 16:51:18 2022]
rule sort:
    input: aln.bam
    output: aln.sorted.bam
    jobid: 1
    reason: Missing output files: aln.sorted.bam; Input files updated by another job: aln.bam
    resources: tmpdir=/var/folders/22/kjdvv_k14cj1m6hq5hl527qw0006zc/T


[Wed Oct 26 16:51:18 2022]
localrule final_alignments:
    input: aln.sorted.bam
    jobid: 0
    reason: Input files updated by another job: aln.sorted.bam
    resources: tmpdir=/var/folders/22/kjdvv_k14cj1m6hq5hl527qw0006zc/T

1.2.1.3 The to-be-done list

This is a version of the summary outlining the bits of the pipeline that need to complete for everything to be in order.

Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
align_and_bam           1              1              1
final_alignments        1              1              1
sort                    1              1              1
total                   3              1              1

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        final_alignments, sort
    missing output files:
        align_and_bam, sort

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Using a log

As your list of jobs grows and the output from snakemake becomes large, its best to use a log file. Do that with -o somename.log e.g snakemake --snakefile final_alignments my_pipeline.snakefile -n -o my_pipeline.log

1.2.2 The Run Proper

Everything looks good in the dry-run so let’s go ahead and run. Although we haven’t made explicit point of it, this run will happen in the current directory with all files expected to be in and going to the current directory. That looks like this at the moment.

$ ls -l
total 85136
-rw-r--r--@ 1 macleand  2006   5205449  2 Jul  2019 ecoli_genome.fa
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_left_R1.fq
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_right_R2.fq
-rw-r--r--  1 macleand  2006       420 26 Oct 16:50 my_pipeline.snakefile

Run the pipeline with

snakemake final_alignments --snakefile my_pipeline.snakefile -c1

We get a lot of output to the screen (or the log if we specified that). Hopefully at the end we see

Finished job 0.
3 of 3 steps (100%) done

an indication that it has completed everything (if not we’re into some debugging - more on that later). And the working directory looks like this now

-rw-r--r--  1 macleand  2006   9397482 28 Oct 10:13 aln.bam
-rw-r--r--  1 macleand  2006   7929255 28 Oct 10:13 aln.sorted.bam
-rw-r--r--@ 1 macleand  2006   5205449  2 Jul  2019 ecoli_genome.fa
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_left_R1.fq
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_right_R2.fq
-rw-r--r--  1 macleand  2006       420 26 Oct 16:50 my_pipeline.snakefile

All the files we expected to be created have been and are sitting nicely in the directory. Hurray!

1.3 The first awesome thing about snakemake

So far this has all been very much like a bash script. The snakemake file seems to be just an elaborate reproduction. Now lets have a look at a killer feature that makes snakemake very much more useful than bash scripts - its ability to work out whether all parts of the pipeline are up to date and whether anything needs redoing.

Let’s look at the dry-run output from the pipeline we just ran.

$ snakemake final_alignments --snakefile my_pipeline.snakefile -c1 -n
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

Well, that’s reassuring. Nothing need be done. What if a component file changed. Lets look at what happens if an input file is updated. Using touch to update the timestamp on the reference file

$ sample_data touch ecoli_genome.fa
$ sample_data ls -l
total 119480
-rw-r--r--  1 macleand  2006   9397482 28 Oct 10:13 aln.bam
-rw-r--r--  1 macleand  2006   7929255 28 Oct 10:13 aln.sorted.bam
-rw-r--r--@ 1 macleand  2006   5205449 28 Oct 10:32 ecoli_genome.fa
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_left_R1.fq
-rw-r--r--@ 1 macleand  2006  19186649 28 Nov  2019 ecoli_right_R2.fq
-rw-r--r--  1 macleand  2006       420 26 Oct 16:50 my_pipeline.snakefile

One of the source files is now newer than the outputs. What does snakemake now think needs to be done

$ snakemake final_alignments --snakefile my_pipeline.snakefile -c1 -n

Building DAG of jobs...
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
align_and_bam           1              1              1
final_alignments        1              1              1
sort                    1              1              1
total                   3              1              1

it thinks that the whole pipeline needs to be redone! This is the first awesome thing about snakemake - if one of the upstream files is updated (input or output files, it doesn’t matter), the relevant parts of the pipeline will be slated to run again (which in this small pipeline will be everything). snakemake will work out which bits need doing again automatically from the rule descriptions. In large pipelines this is a major time saver and increases reproducibility massively. Bash scripts must be manually managed which leads to more manual errors.

Note

I find it really hard to overstate how useful this ability to pick-up-from-where-it-left-off is. It saves an immense amount of checking and redoing and re-issuing of the same commands when something went wrong - especially something catastrophic or hard to detect at the end. Or when you unexpectedly get a new sample and need to add it in, or when your boss wants to change one tiny thing. snakemake insane reproducibility is a huge win for research which is naturally iterative. Its true that there’s a slight learning curve and requires more investment in time at the start of the project, but that is won back in spades later.

1.4 The second awesome thing about snakemake

The next life improving thing about snakemake is how it handles files. Up to now we’ve hardcoded the file names into the rules. Thats not scalable. snakemake provides a clever pattern match facility between the input and output files to match them up between the rules, it cross-references these with actual filenames and fills in the patterns. It provides access to them using special objects called wildcards, input and output. All we need to do is specify the patterns in our rules and the inputs to the master rule (ie final_alignments). In essence, we start by describing the files we want to get out of the pipeline and snakemake works back from there according to our pattern.

This is massively easier to see in the rules themselves. Here’s our original set of rules.

rule sort:
  input: 'aln.bam'
  output: 'aln.sorted.bam'
  shell: "samtools sort aln.bam -o aln.sorted.bam"

rule align_and_bam:
  input:
    fq1="ecoli_left_R1.fq",
    fq2="ecoli_right_R2.fq",
    ref="ecoli_genome.fa"
  output: "aln.bam"
  shell: "minimap2 -ax sr ecoli_genome.fa ecoli_left_R1.fq ecoli_right_R2.fq | samtools view -S -h -b -q 25 -f 3 > aln.bam"

rule final_alignments:
  input: 'aln.sorted.bam'

And here’s our new set of rules

rule final_alignments:
  input: 
    ecoli='ecoli.sorted.bam',
    pputida='pputida.sorted.bam'
    
rule sort:
  input: "{sample}_aln.bam"
  output: "{sample}.sorted.bam"
  shell: "samtools sort {input} -o {output}"

rule align_and_bam:
  input:
    fq1="{sample}_left_R1.fq",
    fq2="{sample}_right_R2.fq",
    ref="{sample}_genome.fa"
  output: "{sample}_aln.bam"
  shell: "minimap2 -ax sr {input.ref} {input.fq1} {input.fq2} | samtools view -S -h -b -q 25 -f 3 > {output}"

Things to note:

  1. The final rule is at the top - it doesn’t actually matter to snakemake what order the rules are in. In many cases for us it’s easier to have the last rule at the top. Note the common name scheme between the input file names.
  2. The {sample} is the wildcard. The {} is a replacement operator, the value of the wildcard will be put in there at runtime.
  3. The special {input} and {output} objects can have more than one attribute, so we can have more than one input or output file into a rule.
  4. We can thread the replacements into the shell commands to make them generic across samples too.

So we can hopefully see how the rules link up to each other.

What does the dry-run say in the following folder with multiple samples in there?

$ ls -l
total 170264
-rw-r--r--@ 1 macleand  2006   5205449 28 Oct 11:04 ecoli_genome.fa
-rw-r--r--@ 1 macleand  2006  19186649 28 Oct 11:04 ecoli_left_R1.fq
-rw-r--r--@ 1 macleand  2006  19186649 28 Oct 11:04 ecoli_right_R2.fq
-rw-r--r--  1 macleand  2006       479 28 Oct 11:28 multi.snakefile
-rw-r--r--@ 1 macleand  2006   5205449 28 Oct 11:04 pputida_genome.fa
-rw-r--r--@ 1 macleand  2006  19186649 28 Oct 11:04 pputida_left_R1.fq
-rw-r--r--@ 1 macleand  2006  19186649 28 Oct 11:04 pputida_right_R2.fq



$ snakemake final_alignments --snakefile multi.snakefile -c1 -n
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
align_and_bam           2              1              1
final_alignments        1              1              1
sort                    2              1              1
total                   5              1              1

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        final_alignments, sort
    missing output files:
        align_and_bam, sort

Looks great! snakemake has found all our new samples and increased the number of jobs needed accordingly. The figures below show a graphical version of the pipelines. See how snakemake has made it very easy to jump up in scale.

1.5 Summary

We’ve seen how the fundamentals of snakemake allow us to build efficient and reproducible pipelines. In the next section we’ll look at features of snakemake that help to power larger and more complex pipelines.