This is a simple example of a Snakemake workflow with external scripts in R. The text in this README has been copied from various sources (especially from the official Snakemake documentation).
Corrado Caudek 0000-0002-1404-0420
A Snakemake workflow is defined by specifying rules in a Snakefile (or in external .smk files). Rules decompose the workflow into small steps by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names. The workflow is determined automatically from top (the files you want) to bottom (the files you have), by applying very general rules with wildcards you give to Snakemake.
Install Snakemake using Mamba. For installation details, see the installation instructions in the official Snakemake documentation.
Snakemake workflows ("snakefiles") are Python code (all the Python syntax rules apply).
- Inputs are one or more file names, in quotes, comma-separated
- Inputs are optional
- Inputs can have "symbolic" names
rule make_report:
    input:
        data=config["raw_data"],
        subset_data="../results/data/processed/penguin_subset.rds",
        table_data=rules.save_table.output.tab1
    output:
        "../results/reports/report.html"
    params:
        pdf_fig1=config["playground_dir"] + "results/plots/figure_1.pdf"
    log:
        mylog="../results/logs/make_report.log"
    script:
        "scripts/reports/report.Rmd"
- Outputs are one or more file names, in quotes, comma-separated
- Outputs can have "symbolic names"
- Outputs are optional
The shell directive is followed by a Python string containing the shell command to execute.
- This is where you encode the actual work of the workflow
- By default, commands run in /bin/bash in strict mode (set -euo pipefail)
- Multi-line shell statements: use triple-quotes
- The command can load environment modules; this only affects the current rule
rule link:
    input: "hello_world.o"
    output: "hello_world"
    shell:
        """
        module load gcc/6.1.0
        gcc -o {output} {input}
        """
- Instead of bash, the action can be written in Python
- Put this in the "run:" section of the rule
- Note there are no quotes around the Python code
rule usercount:
    input: "userfile.txt"
    output: "users.count"
    run:
        users = set()
        with open(input[0]) as infile:
            ...
In the present example, I use the following rule in the Snakefile:
rule make_report:
    input:
        data=config["raw_data"],
        subset_data="../results/data/processed/penguin_subset.rds",
        table_data=rules.save_table.output.tab1
    output:
        "../results/reports/report.html"
    params:
        pdf_fig1=os.path.join(path_wd.parent, "results/plots/figure_1.pdf")
    log:
        mylog="../results/logs/make_report.log"
    script:
        "scripts/reports/report.Rmd"
In the params directive, I need to specify the path to the figure_1.pdf file. This file is created by the rule save_figures and is located in the results/plots directory. For some reason, the relative specification "../results/plots/figure_1.pdf" does not work with knitr::include_graphics(). One possible solution is to specify the absolute path, but hard-coding absolute paths is not recommended. Therefore, I build the path with the Python expression os.path.join(path_wd.parent, "results/plots/figure_1.pdf"), which produces the string /Users/corrado/Documents/snakemake_workflows/snakemake_playground/playground/results/plots/figure_1.pdf on this computer.
The relative part results/plots/figure_1.pdf is fixed, so hard-coding it is not a problem. The first part of the path, instead, depends on where the project folder is placed, so it is generated dynamically by Python. This requires importing the os module, whose os.path.join() function joins two path components. The first component is generated in the Snakefile, which requires importing Path from pathlib. I save path_wd = Path.cwd(), which (on this computer) is /Users/corrado/Documents/snakemake_workflows/snakemake_playground/playground/workflow. The results directory is in playground, so I need to move up one level; to do so, I use path_wd.parent. Joining the two components gives the desired result. In this manner, the workflow does not depend on the machine it runs on.
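The path construction described above can be sketched in plain Python. This is only an illustration: it uses a hypothetical project location (/home/user/playground) instead of Path.cwd(), so that the result does not depend on where it is run.

```python
import os
from pathlib import PurePosixPath

# Hypothetical stand-in for Path.cwd() when Snakemake is started
# from the workflow directory of the project.
path_wd = PurePosixPath("/home/user/playground/workflow")

# The results directory sits one level up, next to workflow/.
pdf_fig1 = os.path.join(path_wd.parent, "results/plots/figure_1.pdf")

print(pdf_fig1)
```

In the Snakefile itself, path_wd would be Path.cwd(), so the prefix follows the project wherever it is placed.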
A rule can also point to an external script instead of a shell command or inline Python code. For this purpose, Snakemake offers the script directive. This mechanism also allows you to integrate R and R Markdown scripts with Snakemake, e.g.
rule NAME:
    input:
        # positional entries must come before named ones
        "path/to/other/inputfile",
        myfile="path/to/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "scripts/script.R"
Although there are other strategies to invoke separate scripts from your workflow (for example, invoking them via shell commands), the benefit of the script directive is clear: the script logic is separated from the workflow logic (and can even be shared between workflows), and boilerplate code like the parsing of command line arguments is unnecessary. It is best practice to use the script directive whenever an inline code block would have more than a few lines of code.
The actual R code to generate the plot is hidden in the script scripts/script.R. Script paths are always relative to the referring Snakefile. In the script, all properties of the rule like input, output, wildcards, etc. are available as attributes of a global snakemake object.
With the standardized directory structure:
- if a rule is written in the Snakefile, the path for accessing the R script is "scripts/script.R";
- if a rule is moved into a .smk file in the rules folder, the path for accessing the R script is "../scripts/script.R".
In R scripts, an S4 object named snakemake is available and allows access to input and output files and other parameters. The syntax follows that of S4 classes with attributes that are R lists. For example, we can access the first input file with snakemake@input[[1]] (note that the first file does not have index 0 here, because R starts counting from 1). Named input and output files can be accessed in the same way, by just providing the name instead of an index, for example snakemake@input[["myfile"]]. An equivalent syntax is snakemake@input$myfile.
A script written in R would look like this:
do_something <- function(data_path, out_path, threads, myparam) {
    # R code
}

do_something(
    snakemake@input[[1]],
    snakemake@output[[1]],
    snakemake@threads,
    snakemake@config[["myparam"]]
)
To debug R scripts, you can save the workspace with save.image(), and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the snakemake variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code is to be invoked outside of Snakemake or from a different rule. A convenience method, snakemake@source(), acts as a wrapper for the normal R source() function, and can be used to source files relative to the original script directory.
An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:
rule NAME:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/report.html",
    script:
        "path/to/report.Rmd"
In the R Markdown file you can insert output from an R command, and access variables stored in the S4 object named snakemake:
---
title: "Test Report"
author:
    - "Your Name"
date: "`r format(Sys.time(), '%d %B, %Y')`"
params:
    rmd: "report.Rmd"
output:
    html_document:
        highlight: tango
        number_sections: no
        theme: default
        toc: yes
        toc_depth: 3
        toc_float:
            collapsed: no
            smooth_scroll: yes
---
## R Markdown
This is an R Markdown document.
Test include from snakemake `r snakemake@input`.
## Source
<a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>
- With a named input (here my_file), the R syntax for the S4 object is:
# load data
print("Loading my_file object")
load(snakemake@input$my_file)
Snakemake allows you to generalize rules by using named wildcards. In Snakemake the workflow is determined from the top, i.e. from the target files. Imagine you have a directory with files 1.fastq, 2.fastq, 3.fastq, and so on, and you want to produce the files 1.bam, 2.bam, 3.bam, and so on. You should specify these as target files, using the ids 1, 2, 3, and so on. You could end up with at least two rules like this (or any number of intermediate steps):
IDS = "1 2 3 ...".split()  # the list of desired ids

# a pseudo-rule that collects the target files
rule all:
    input: expand("otherdir/{id}.bam", id=IDS)

# a general rule using wildcards that does the work
rule:
    input: "thedir/{id}.fastq"
    output: "otherdir/{id}.bam"
    shell: "..."
Snakemake will then go down the line and determine which files it needs from your initial directory.
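To make the role of expand() in the rule all more concrete, the following plain-Python sketch builds the same kind of target list for a single wildcard. It does not call Snakemake; the list IDS is a made-up example.

```python
# Plain-Python illustration of the list that expand() yields for one wildcard.
# IDS is a hypothetical list of sample ids.
IDS = ["1", "2", "3"]

pattern = "otherdir/{id}.bam"
targets = [pattern.format(id=sample_id) for sample_id in IDS]

print(targets)
```

Each entry of targets is a file that the wildcard rule can produce, with {id} bound to the corresponding value.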
In order to infer the IDs from present files, Snakemake provides the glob_wildcards function, e.g.
IDS, = glob_wildcards("thedir/{id}.fastq")
The function matches the given pattern against the files present in the filesystem and thereby infers the values for all wildcards in the pattern. A named tuple that contains a list of values for each wildcard is returned. Here, this named tuple has only one item, that is the list of values for the wildcard {id}.
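The idea behind this inference can be sketched without Snakemake. The helper infer_ids below is hypothetical and much simpler than glob_wildcards (one wildcard, one directory, a fixed suffix); it only shows how ids can be recovered from the file names that are present.

```python
import re
import tempfile
from pathlib import Path


def infer_ids(directory, suffix=".fastq"):
    """Return the sorted {id} values for files named <id><suffix> in directory."""
    pattern = re.compile(r"(?P<id>.+)" + re.escape(suffix) + r"$")
    ids = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if path.is_file() and match:
            ids.append(match.group("id"))
    return sorted(ids)


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["1.fastq", "2.fastq", "3.fastq", "notes.txt"]:
        (Path(tmp) / name).touch()
    IDS = infer_ids(tmp)

print(IDS)
```

Files that do not match the pattern (here notes.txt) are ignored, so only the fastq files contribute ids.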
Snakemake allows you to use configuration files for making your workflows more flexible and also for abstracting away direct dependencies. A configuration is provided as a JSON or YAML file and can be loaded with the configfile directive. The config file can be used to define a dictionary of configuration parameters and their values. In the present example, the file config.yaml provides the specification:
raw_data: scripts/data/penguins.csv
The Snakefile includes:
configfile: "../config/config.yaml"
In the workflow, the configuration is accessible via the global variable config. For example, a rule in eda.smk has:
input:
    penguins_data=config["raw_data"]
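Since the configuration is a dictionary of parameters and values, it is used like any Python dictionary. The sketch below imitates this with a JSON string instead of the YAML file of this project; the key n_bootstrap and its default are made up for illustration.

```python
import json

# Hypothetical configuration content; Snakemake would read it from the
# file named in the configfile directive.
config_text = '{"raw_data": "scripts/data/penguins.csv"}'
config = json.loads(config_text)

# Required entry: a missing key raises KeyError, which exposes typos early.
raw_data = config["raw_data"]

# Optional entry: fall back to a default when the key is absent.
n_bootstrap = config.get("n_bootstrap", 1000)

print(raw_data, n_bootstrap)
```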
All paths in the Snakefile are interpreted relative to the directory Snakemake is executed in. This behaviour can be overridden by specifying a workdir in the Snakefile:
workdir: "path/to/workdir"
Usually, it is preferred to set the working directory only via the command line, because the above directive limits the portability of Snakemake workflows.
It is possible to automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.
Create the file <PROJECT-NAME>/workflow/report/workflow.rst with a brief description of the project. In the Snakefile, add the directive
report: "report/workflow.rst"
To create the report, run
snakemake --cores 4 --report report.html
With the present directory structure, the output will be saved in the workflow folder.
It is necessary to install imagemagick in order to have embedded images and PDFs in the report:
brew install imagemagick
For the purpose of the report, it is better to save the images in .png format.
A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as protected:
rule NAME:
    input:
        "path/to/inputfile"
    output:
        protected("path/to/outputfile")
    shell:
        "somecommand {input} {output}"
A protected file will be write-protected after the rule that produces it is completed.
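What write protection means at the filesystem level can be sketched as follows. This is not Snakemake's own code; the helper write_protect is hypothetical and simply clears the write permission bits of a file.

```python
import os
import stat
import tempfile


def write_protect(path):
    """Remove all write permission bits from path."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "outputfile")
    with open(out, "w") as handle:
        handle.write("expensive result\n")
    write_protect(out)
    # Check whether the owner still has write permission.
    still_writable = bool(os.stat(out).st_mode & stat.S_IWUSR)

print(still_writable)
```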
The Conda package manager is used to obtain and deploy the defined software packages in the specified versions. Packages will be installed into your working directory. Given that conda is available on your system, to use the Conda integration, add the --use-conda flag to your workflow execution command, e.g. snakemake --cores 8 --use-conda.
# To activate this environment, use
conda info --envs
conda activate snakemake
snakemake --help
# To deactivate an active environment, use
conda deactivate
A better explanation is provided here.
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using $N cores.
Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
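The timestamp part of this rule can be illustrated with a small Python sketch. The function needs_rerun is hypothetical and covers only the comparison of modification times (plus the assumption that a missing output always triggers a run); it ignores updates caused by other jobs.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """True if an output is missing or any input is newer than any output."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    for path in (src, dst):
        open(path, "w").close()

    # Output newer than input: nothing to do.
    os.utime(src, (1000, 1000))
    os.utime(dst, (2000, 2000))
    up_to_date = not needs_rerun([src], [dst])

    # Input modified after the output was written: the job is stale.
    os.utime(src, (3000, 3000))
    stale = needs_rerun([src], [dst])

print(up_to_date, stale)
```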
- It is a good idea to stick to a standardized folder structure that is expected by users of Snakemake. It is available here. Configuration of a workflow should be handled via config files. Use such configuration for metadata and experiment information, not for runtime-specific configuration like threads, resources and output folders. For those, just rely on Snakemake's CLI arguments like --directory.
- Try to keep filenames short, but informative. Avoid mixing too many special characters (e.g. decide whether to use _ as a separator and do that consistently throughout the workflow).