Snakemake

安装

推荐使用conda创建python3环境安装

conda install -c bioconda snakemake

命令与规则

组成规则

rule test:
    input:
        "test.py" 
    output: 
        "out.py"
    shell:
        "cat {input} > {output}"

snakemake由不同的rule组成，每一个rule执行一个任务，通过不同的rule串联完成流程，snakemake还支持断点重启。

`rule`

每个rule定义流程中的每一步，相当于一个脚本。

`rule all`

一个特殊的rule，只有输入文件，为最后的要输出的结果文件，如果一个snakemake中存在多个rule需要加上这个rule否则只会输出第一个rule的结果

`params`

指定运行程序的参数

如

rule test:
    input:
        "test.py" 
    output: 
        "out.py"
    params:
    	cat="-n"
    shell:
        "cat {params.cat} {input} > {output}"

`threads`

指定任务的线程

`temp`

有时我们只需要最终结果文件，或者对某些中间文件并不关心，可以使用temp

删除指定的中间文件

rule test:
    input:
        "test.py" 
    output: 
        temp("out.py")
    shell:
        "cat {input} > {output}"
        
rule test2:
	input:
        "out.py" 
    output: 
        "out.txt"
    shell:
        "cat {input} > {output}"

`inclue`

大型的流程可以将不同的部分，分成不同的模块，在最后一个总的snakefile中导入其他snakefile

include: “path/to/other.snakefile

configuration

适合多样本，样本比较多的时候，生成yaml文件，将所需的样本名或者其他信息全部写入，在运行时只要导入文件即可

configfile: "samples.yaml"

rule bwa:
    input:
        fa = "fastq/genome.fa",
        fastq = expand("fastq/{sample}.fastq", sample=config["samples"])
    output: 
        temp("bam/test.bam")
    params:
        samtools="view -Sb"
    shell: 
        "bwa mem {input.fa} {input.fastq} | samtools {params.samtools} -> {output}"

YAML格式

http://www.ruanyifeng.com/blog/2016/07/yaml.html

执行

默认在当前目录下直接使用

snakemake

运行当前目录下的snakefile

-s  指定Snakefile，
-n  不真正执行，
-p   输出要执行的shell命令
-r  输出每条rule执行的原因,默认FALSE
-j  指定运行的核数，若不指定，则使用最大的核数
-f 重新运行第一条rule或指定的rule
-F 重新运行所有的rule，不管是否已经有输出结果

sankemake -np

很有用，通过假运行，可以检查自己的文件是否正确

可视化

snakemake –dag dot -Tpdf > dag.pdf

即可输出流程图，描述了每个rule的前后关系

流程的自动部署

在其他环境下同样使用相同的流程

全局环境

导出conda环境

conda支持到处目前环境下所有的依赖信息，导出为yaml格式

conda env export -n 项目名 -f environment.yaml

重新创建环境

通过导出的文件，快速复现一个环境

conda env create -f environment.yaml

局部环境

当不同工具依赖不同环境的时候，snakemake提供

–use-conda

解析rule中的conda规则

configfile: "samples.yaml"

rule bwa:
    input:
        fa = "fastq/genome.fa",
        fastq = expand("fastq/{sample}.fastq", sample=config["samples"])
    output: 
        "bam/test.bam"
    conda:
        "envs/test.yaml"
    params:
        samtools="view -Sb"
    run: 
        "bwa mem {input.fa} {input.fastq} | samtools {params.samtools} -> {output}"

使用特定的conda环境文件来执行rule

集群投递

snakemake --cluster "qsub -V -cwd -q 投递队列" -j 10
# -c CMD: 集群运行指令
# qusb -cwd -q， 在当前目录下运行(-cwd), 投递到指定的队列(-q)
# --j N: 在每个集群中最多并行N核