ccaudek/snakemake_playground

Snakemake playground

This is a simple example of a Snakemake workflow with external scripts in R. The text in this README has been copied from various sources (especially from the official Snakemake documentation).

Authors

Corrado Caudek (ORCID: 0000-0002-1404-0420)

Introduction

A Snakemake workflow is defined by specifying rules in a Snakefile (or in external .smk files). Rules decompose the workflow into small steps by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake.

Install Snakemake

Install Snakemake using Mamba. For installation details, see this link.

Snakefiles

Snakemake workflows ("Snakefiles") are Python code extended with rule definitions, so all Python syntax rules apply.

input section

  • Inputs are one or more file names, in quotes, comma-separated
  • Inputs are optional
  • Inputs can have "symbolic" names, as in this rule:
rule make_report:
    input:
        data=config["raw_data"],
        subset_data="../results/data/processed/penguin_subset.rds",
        table_data=rules.save_table.output.tab1
    output:
        "../results/reports/report.html"
    params:
        pdf_fig1=config["playground_dir"] + "results/plots/figure_1.pdf"
    log:
        mylog="../results/logs/make_report.log"
    script:
        "scripts/reports/report.Rmd"

output section

  • Outputs are one or more file names, in quotes, comma-separated
  • Outputs can have "symbolic names"
  • Outputs are optional

shell directive

The shell directive is followed by a Python string containing the shell command to execute.

  • This is where you encode the actual work of the workflow
  • By default, commands run in /bin/bash in strict mode (set -euo pipefail)
  • Multi-line shell statements: use triple quotes
  • Environment modules can be loaded in the command; they affect only the current rule
rule link:
  input: "hello_world.o"
  output: "hello_world"
  shell:
    """
    module load gcc/6.1.0
    gcc -o {output} {input}
    """

run directive

  • Instead of bash, the action can be written in Python
  • Put this in the "run:" section of the rule
  • Note there are no quotes around the Python code
rule usercount:
  input: "userfile.txt"
  output: "users.count"
  run:
    users = set()
    with open(input[0]) as infile:
      ...
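
The body of the rule above is elided in the original. As a purely illustrative sketch, the kind of Python that could sit in a run: block can be written and tested as an ordinary function. The helper name count_users and the file format (one user name per line) are assumptions, not taken from the source.

```python
import tempfile
from pathlib import Path


def count_users(in_path, out_path):
    """Count distinct, non-empty lines of in_path and write the count to out_path."""
    users = set()
    with open(in_path) as infile:
        for line in infile:
            name = line.strip()
            if name:
                users.add(name)
    with open(out_path, "w") as outfile:
        outfile.write(f"{len(users)}\n")
    return len(users)


# Demo on a temporary file with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    userfile = Path(tmp) / "userfile.txt"
    userfile.write_text("alice\nbob\nalice\n\n")
    n_users = count_users(userfile, Path(tmp) / "users.count")
    written = (Path(tmp) / "users.count").read_text()
```

Inside a run: block, the equivalent call would be count_users(input[0], output[0]).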

Path specifications

In the present example, I use the following rule in the Snakefile:

rule make_report:
    input:
        data=config["raw_data"],
        subset_data="../results/data/processed/penguin_subset.rds",
        table_data=rules.save_table.output.tab1
    output:
        "../results/reports/report.html"
    params:
        pdf_fig1=os.path.join(path_wd.parent, "results/plots/figure_1.pdf")
    log:
        mylog="../results/logs/make_report.log"
    script:
        "scripts/reports/report.Rmd"

In the params directive, I need to specify the path to the figure_1.pdf file. This file is created by the rule save_figures and is located in the results/plots directory. For some reason, the relative specification "../results/plots/figure_1.pdf" does not work with knitr::include_graphics(). One possible solution is to specify the absolute path, but hard-coding absolute paths is not recommended. Therefore, I build the path with os.path.join(path_wd.parent, "results/plots/figure_1.pdf"), which on this computer produces the string /Users/corrado/Documents/snakemake_workflows/snakemake_playground/playground/results/plots/figure_1.pdf.

The relative part, results/plots/figure_1.pdf, is fixed, so hard-coding it is not a problem. The first part of the path varies with the location of the project folder, so it is generated dynamically by Python. This requires importing the os module, whose os.path.join() function joins two path components, and importing Path from pathlib. In the Snakefile I save path_wd = Path.cwd(), which (on this computer) is /Users/corrado/Documents/snakemake_workflows/snakemake_playground/playground/workflow. The results directory is in playground, one level up, so I use path_wd.parent. Joining the two components gives the desired result, and the workflow no longer depends on the machine it runs on.
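
The path construction described above can be tried outside Snakemake. This is a minimal sketch: the helper name figure_path and the example location /home/user/playground/workflow are made up for illustration, and in the Snakefile the argument would be Path.cwd().

```python
import os
from pathlib import Path


def figure_path(workflow_dir):
    """Return <parent of workflow_dir>/results/plots/figure_1.pdf as a string."""
    return os.path.join(Path(workflow_dir).parent, "results/plots/figure_1.pdf")


# In the Snakefile this would be figure_path(Path.cwd());
# a made-up location is used here so the result is predictable.
example = figure_path("/home/user/playground/workflow")
```

Only the fixed relative part is hard-coded; everything before it follows the location of the workflow directory.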

script directive

A rule can also point to an external script instead of a shell command or inline Python code. For this purpose, Snakemake offers the script: directive. This mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.

rule NAME:
    input:
        "path/to/other/inputfile",
        myfile="path/to/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "scripts/script.R"

Although there are other strategies to invoke separate scripts from your workflow (for example, invoking them via shell commands), the benefit of this is obvious: the script logic is separated from the workflow logic (and can even be shared between workflows), but boilerplate code like the parsing of command line arguments is unnecessary. It is best practice to use the script directive whenever an inline code block would have more than a few lines of code.

The actual R code is kept in the script scripts/script.R. Script paths are always relative to the referring Snakefile. In the script, all properties of the rule, such as input, output, and wildcards, are available as attributes of a global snakemake object.

With the standardized directory structure:

  • if a rule is written in the Snakefile file, the path for accessing the R script is "scripts/script.R".
  • if a rule is moved into a .smk file in the rules folder, the path for accessing the R script is "../scripts/script.R".

In R scripts, an S4 object named snakemake is available and allows access to input and output files and other parameters. The syntax follows that of S4 classes with attributes that are R lists. For example we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, for example snakemake@input[["myfile"]]. An equivalent syntax is snakemake@input$myfile.

A script written in R would look like this:

do_something <- function(
    data_path, out_path, threads, myparam
    ) {
    # R code
}

do_something(
    snakemake@input[[1]],
    snakemake@output[[1]],
    snakemake@threads,
    snakemake@config[["myparam"]]
)

To debug R scripts, you can save the workspace with save.image(), and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable.

It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule. A convenience method, snakemake@source(), acts as a wrapper for the normal R source() function, and can be used to source files relative to the original script directory.
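
The same practice carries over to Python scripts, where Snakemake also provides a global snakemake object. The sketch below mirrors the R example. Because that object only exists when Snakemake runs the script, a SimpleNamespace stand-in with made-up values is used here so the snippet runs on its own.

```python
from types import SimpleNamespace


def do_something(data_path, out_path, threads, myparam):
    """Placeholder for the real work; returns a summary of its arguments."""
    return f"{data_path} -> {out_path} (threads={threads}, myparam={myparam})"


# Stand-in for the object Snakemake injects into a script; all values are made up.
snakemake = SimpleNamespace(
    input=["path/to/inputfile"],
    output=["path/to/outputfile"],
    threads=2,
    config={"myparam": 0.5},
)

summary = do_something(
    snakemake.input[0],  # Python counts from 0, unlike R
    snakemake.output[0],
    snakemake.threads,
    snakemake.config["myparam"],
)
```

Keeping the work in do_something means the function can be imported and called with ordinary arguments outside of Snakemake.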

R Markdown

An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:

rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/report.html",
    script:
        "path/to/report.Rmd"

In the R Markdown file you can insert output from an R command, and access variables stored in the S4 object named snakemake:

---
title: "Test Report"
author:
    - "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
   rmd: "report.Rmd"
output:
  html_document:
    highlight: tango
    number_sections: no
    theme: default
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: yes
---

## R Markdown

This is an R Markdown document.

Test include from snakemake `r snakemake@input`.

## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>

In an R script, a named input is read from the S4 snakemake object like this:

# load data
print("Loading my_file object")
load(snakemake@input$my_file)

Wildcards

Snakemake allows rules to be generalized by using named wildcards. In Snakemake the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, ..., and you want to produce files 1.bam, 2.bam, 3.bam, .... You should specify these as target files, using the ids 1,2,3,.... You could end up with at least two rules like this (or any number of intermediate steps):

IDS = "1 2 3 ...".split() # the list of desired ids

# a pseudo-rule that collects the target files
rule all:
    input:  expand("otherdir/{id}.bam", id=IDS)

# a general rule using wildcards that does the work
rule:
    input:  "thedir/{id}.fastq"
    output: "otherdir/{id}.bam"
    shell:  "..."

Snakemake will then go down the line and determine which files it needs from your initial directory.
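
To make the matching step concrete, here is a simplified illustration in plain Python. It is not Snakemake's actual implementation: the function names pattern_to_regex and resolve_input are hypothetical, and each wildcard is assumed to match any non-empty string.

```python
import re


def pattern_to_regex(pattern):
    """Compile 'otherdir/{id}.bam' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # literal, name, literal, name, ...
    regex = ""
    for i, part in enumerate(parts):
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex)


def resolve_input(target, output_pattern, input_pattern):
    """Return the input file needed for target, or None if the rule does not apply."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None
    return input_pattern.format(**match.groupdict())


IDS = ["1", "2", "3"]
targets = [f"otherdir/{i}.bam" for i in IDS]  # the list expand() yields for these ids
needed = [resolve_input(t, "otherdir/{id}.bam", "thedir/{id}.fastq") for t in targets]
```

Each target is matched against the output pattern, and the extracted wildcard value is substituted into the input pattern, which is how the file names link the two rules.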

Function glob_wildcards

In order to infer the IDs from present files, Snakemake provides the glob_wildcards function, e.g.

IDS, = glob_wildcards("thedir/{id}.fastq")

The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, that is the list of values for the wildcard {id}.
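
The idea can be approximated without Snakemake. This is a rough stand-in for the single-wildcard case only, not the real glob_wildcards: the helper name infer_ids is hypothetical, it looks at one directory level only, and the files are created in a temporary directory for the demo.

```python
import tempfile
from pathlib import Path


def infer_ids(directory, suffix=".fastq"):
    """Return the sorted name stems of all files in directory ending in suffix."""
    return sorted(p.name[: -len(suffix)] for p in Path(directory).glob(f"*{suffix}"))


# Demo in a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.fastq", "2.fastq", "notes.txt"):
        (Path(tmp) / name).touch()
    ids = infer_ids(tmp)
```

Files that do not fit the pattern, such as notes.txt here, contribute no id.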

Configuration

Snakemake allows you to use configuration files for making your workflows more flexible and also for abstracting away direct dependencies. A configuration is provided as a JSON or YAML file and can be loaded with the configfile directive. The config file can be used to define a dictionary of configuration parameters and their values. In the present example, the file config.yaml provides the specification:

raw_data: scripts/data/penguins.csv

The Snakefile includes:

configfile: "../config/config.yaml"

In the workflow, the configuration is accessible via the global variable config. For example, the eda.smk rule has:

input:
    penguins_data=config["raw_data"]
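
Since the configuration may also be given as JSON, the lookup can be illustrated with the standard library alone. In this sketch the JSON text is a hypothetical equivalent of the YAML line above, and the variable names are chosen for illustration.

```python
import json

# Contents of a hypothetical config.json equivalent to the YAML line above.
config_text = '{"raw_data": "scripts/data/penguins.csv"}'

config = json.loads(config_text)  # a plain dict, used like Snakemake's config
raw_data = config["raw_data"]
```

Rules then refer to the key rather than to the file name, so the data location can change without editing the rules.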

Working Directory

All paths in the Snakefile are interpreted relative to the directory Snakemake is executed in. This behaviour can be overridden by specifying a workdir in the Snakefile:

workdir: "path/to/workdir"

Usually, it is preferable to set the working directory only via the command line, because the directive above limits the portability of Snakemake workflows.

Workflow report

It is possible to automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.

Create the file <PROJECT-NAME/workflow/report/workflow.rst> with a brief description of the project. In the Snakefile, add the directive

report: "report/workflow.rst"

To create the report, run

snakemake --cores 4 --report report.html

With the present directory structure, the output will be saved in the workflow folder.

ImageMagick must be installed in order to embed images and PDFs in the report. On macOS with Homebrew:

brew install imagemagick

For the purpose of the report, it is better to save the images in .png format.

Protected and Temporary Files

A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected:

rule NAME:
    input:
        "path/to/inputfile"
    output:
        protected("path/to/outputfile")
    shell:
        "somecommand {input} {output}"

A protected file will be write-protected after the rule that produces it is completed.
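
The effect of write protection can be reproduced by hand. This sketch only approximates what happens to a protected output; the helper name write_protect is hypothetical, and the file is a made-up one in a temporary directory.

```python
import os
import stat
import tempfile


def write_protect(path):
    """Clear the write permission bits for user, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "outputfile")
    with open(out, "w") as fh:
        fh.write("expensive result\n")
    write_protect(out)
    # Check the owner's write bit after protection.
    still_writable = bool(os.stat(out).st_mode & stat.S_IWUSR)
```

After the call, an ordinary attempt to overwrite the file fails with a permission error for non-privileged users.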

Integrated Package Management

The Conda package manager is used to obtain and deploy the defined software packages in the specified versions. Packages will be installed into your working directory. Provided that Conda is available on your system, enable the Conda integration by adding the --use-conda flag to your workflow execution command, e.g. snakemake --cores 8 --use-conda.

# To activate this environment, use
conda info --envs
conda activate snakemake
snakemake --help

# To deactivate an active environment, use
conda deactivate

A better explanation is provided here.

Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores.

Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
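
The timestamp part of this rule can be sketched as follows. This is a simplification under stated assumptions: the function name needs_rerun is hypothetical, a missing output is treated as requiring a run, and the case where an input will be updated by another job is not modelled.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """True if an output is missing or any input is newer than any output."""
    if not all(os.path.exists(p) for p in outputs):
        return True
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return newest_input > oldest_output


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    for path in (src, dst):
        open(path, "w").close()
    os.utime(src, (1_000, 1_000))  # input older than output
    os.utime(dst, (2_000, 2_000))
    up_to_date = not needs_rerun([src], [dst])
    os.utime(src, (3_000, 3_000))  # input now newer than output
    stale = needs_rerun([src], [dst])
```

Touching an input file is therefore enough to make the jobs that depend on it run again.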

Best practices

  • It is a good idea to stick to a standardized folder structure that is expected by users of Snakemake. It is available here. Configuration of a workflow should be handled via config files. Use such configuration for metadata and experiment information, not for runtime-specific configuration like threads, resources and output folders. For those, just rely on Snakemake's CLI arguments like --directory.

  • Try to keep filenames short, but informative. Avoid mixing too many special characters (e.g. decide whether to use _ as a separator and do that consistently throughout the workflow).
