Snakemake use all samples as one input with porechop - python

I'm trying to use porechop on several data with a Snakemake workflow.
In my Snakefile, there are three rules, a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well, I have all three out for my three fastq. But for porechop, instead of running the command three times, it runs the command once with the -i flag for all three files at the same time:
Error in rule porechop:
jobid: 2
output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz
conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb
shell:
porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
However, when I use it with a single sample, the program works.
Here my code:
import glob
import os
###Global Variables###
FORMATS=["zip", "html"]
DIR_FASTQ="/ngs/prod/nanocea_project/test/reads"
###FASTQ Files###
def list_samples(DIR_FASTQ):
SAMPLES=[]
for file in glob.glob(DIR_FASTQ+"/*.fastq.gz"):
base=os.path.basename(file)
sample=(base.replace('.fastq.gz', ''))
SAMPLES.append(sample)
return(SAMPLES)
SAMPLES=list_samples(DIR_FASTQ)
###Rules###
rule all:
input:
expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS),
expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
rule fastqc:
input:
expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
output:
expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS)
threads:
16
conda:
"envs/fastqc.yaml"
shell:
"fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}"
rule porechop:
input:
expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
output:
expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
threads:
40
conda:
"envs/porechop.yaml"
shell:
"porechop -i {input} -o {output} -t {threads} --discard_middle"
Do you have any idea what's wrong?
Thanks !

This question comes up often... If you use expand() in input: or output: then you are feeding the rule with a list of all the files. That is the same as writing:
input:
['sample1.fastq', 'sample2.fastq', ..., 'sampleN.fastq'],
output:
['sample1.pore.fastq', 'sample2.pore.fastq', ..., 'sampleN.pore.fastq'],
To run the rule on each input/output just remove the expand:
rule porechop:
input:
DIR_FASTQ+"/{sample}.fastq.gz"
output:
"/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",

Related

Apply snakemake rule on all generated files

I want to run a simple script "script.py", which will run some caculayions and periodically spit out a step_000n.txt file with n being dependent on the total file execution time. I would then like snakemake to run another rule on all generated files. What would be the proper Snakefile input?
ie
1. run scipt.py
2. get step_000{1,2,3,4 ..}.txt (n being variable and not determined)
3. apply `process.py -in step_000{n}.txt -out step_000{n}.png` on all step_000{1,2,3,4 ..}.txt
My obviously wrong attempt is below
rule all:
input: expand("{step}.png", step=list(map(lambda x: x.split(".")[0], glob.glob("model0*.txt"))))
rule txt:
input: "{step}.txt"
output: "{step}.png"
shell:
"process.py -in {input} -out {output}"
rule first:
output: "{step}.txt"
script: "script.py"
I could not figure out how to define output target here.
I would write all the step_000n.txt files to a dedicated directory and then process all the files in that directory. Something like:
rule all:
input:
'processed.txt',
rule split:
output:
directory('processed_dir'),
shell:
r"""
# Write out step_001.txt, step_002.txt, ..., step_000n.txt
# in output directory `processed_dir`
mkdir {output}
script.py ...
"""
rule process:
input:
indir= 'processed_dir',
output:
out= 'processed.txt',
shell:
r"""
process.py -n {input.indir}/step_*.txt -out {output.out}
"""

Snakemake runs rule too many times using config.yaml

I'm trying to create this snakemake workflow which would evaluate raw reads quality using FastQc and create a raport using MultiQC. I use 4 input files and get expected results, however I just noticed that each rule gets run 4 times and takes all 4 inputs each time and I'm not sure how to fix that. Could anyone help me figure out how to:
Run the rule 4 times but use only one input from config.yaml at a time?
Run the rule 1 time but use all 4 inputs?
I'm trying to follow the snakemake tutorial but no luck so far.
Snakefile:
configfile: "config.yaml"
rule all:
input:
expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
input:
expand("data/samples/{sample}.fastq", sample=config["samples"])
output:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip"
shell:
"fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
input:
expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
output:
"outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
shell:
"multiqc ./outputs/fastqc_1/ -n {output}"
config.yaml file:
samples:
Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
I run the snakemake using command:
snakemake -s Snakefile --core 1
Each rule is run 4 times:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
----------- ------- ------------- -------------
all 1 1 1
raw_fastqc 4 1 1
raw_multiqc 4 1 1
total 9 1 1
But each time all 4 inputs are used:
[Sun May 15 23:06:22 2022]
rule raw_fastqc:
input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
jobid: 3
wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
resources: tmpdir=/tmp
Your problem is using expand() in the input of each rule. Because expand fills in wildcard values, you only need to do that in the all rule since wildcard values are passed on to upstream rules.
Snakefile:
configfile: "config.yaml"
rule all:
input:
expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
input:
"data/samples/{sample}.fastq"
output:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip"
shell:
"fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
input:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip",
output:
"outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
shell:
"multiqc ./outputs/fastqc_1/ -n {output}"

Snakemake - How to use every line of input file as wildcard

I am pretty new to using Snakemake and I have looked around on SO to see if there is a solution for the below - I am almost very close to a solution, but not there yet.
I have a single column file containing a list of SRA ids and I want to use snakemake to define my rules such that every SRA id from that file becomes a parameter on command line.
#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647
Snakefile below:
from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil
configfile: "config.json"
data_dir=os.getcwd()
units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())
#print(samples)
rule all:
input:
expand("out/{sample}.fastq.gz",sample=samples)
rule clean:
shell: "rm -rf .snakemake/"
include: 'rules/download_sample.smk'
download_sample.smk
rule download_sample:
"""
Download RNA-Seq data from SRA.
"""
input: "{sample}"
output: expand("out/{sample}.fastq.gz", sample=samples)
params:
outdir = "out",
threads = 16
priority:85
shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
I have tried many different variants of the above code, but somewhere I am getting it wrong.
What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to get executed
parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip
This is the error I get
snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'
Thanks in advance
It seems to me that what you need is to access the sample wildcard using the wildcards object:
rule all:
input: expand("out/{sample}_fastq.gz", sample=samples)
rule download_sample:
output:
"out/{sample}_fastq.gz"
params:
outdir = "out",
threads = 16
priority:85
shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip "
The first solution could be to use the run: section of the rule instead of the shell:. This allows you to employ python code:
rule download_sample:
# ...
run:
for input_file in input:
shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")
This straightforward solution however is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words to produce one out/{sample}_fastq.gz file you need a single {sample}. The best solution would be to reduce your rule to the one that makes a single file:
rule download_sample:
input: "{sample}"
output: "out/{sample}_fastq.gz"
params:
outdir = "out",
threads = 16
priority:85
shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
The rule all: now requires all targets; the rule download_sample downloads a single sample, the Snakemake workflow does the rest: it constructs a graph of dependences and creates one instance of the rule download_sample per sample. Moreover, if you wish it can run these rules in parallel.

Snakemake don't execute bash command

I'm trying to run a little pipeline in Snakemake for a software to filter good reads in files from a RNA-seq.
This is my code:
SAMPLES = ['ZN21_S1', 'ZN22_S2','ZN27_S3', 'ZN28_S4', 'ZN29_S5' ,'ZN30_S6']
rule all:
input:
expand("SVA-{sample}_L001_R{read_no}.fastq.gz", sample=SAMPLES, read_no=['1', '2'])
rule fastp:
input:
reads1="SVA-{sample}_L001_R1.fastq.gz",
reads2="SVA-{sample}_L001_R2.fastq.gz"
output:
reads1out="out/SVA-{sample}_L001_R1.fastq.gz.good",
reads2out="out/SVA-{sample}_L001_R2.fastq.gz.good"
shell:
"fastp -i {input.reads1} -I {input.reads2} -o {output.reads1out} -O {output.reads2out}"
All samples (in symbolic link) are in the same folder and I only got the message "Nothing to be done".
What am I not seeing?
In your example, target files in rule all are supposed to match with rule fastp's output files, instead of its input files in your current setup. As per your code, target files in rule all already exist and hence the message Nothing to be done when executing it.
rule all:
input:
expand("out/SVA-{sample}_L001_R{read_no}.fastq.gz.good", sample=SAMPLES, read_no=['1', '2'])

How to perform simple string operations in snakemake output

I am creating my first snakemake file, and I got to the point where I need to perform a simple string operation on the value of my output, so that my shell command works as expected:
rule sketch:
input:
'out/genomes.txt'
output:
'out/genomes.msh'
shell:
'mash sketch -l {input} -k 31 -s 100000 -o {output}'
I need to apply the split function to {output} so that only the name of the file up to the extension is used. I couldn't find anything in the docs or in related questions.
You could use the params field:
rule sketch:
input:
'out/genomes.txt'
output:
'out/genomes.msh'
params:
dir = 'out/genomes'
shell:
'mash sketch -l {input} -k 31 -s 100000 -o {params.dir}'
Alternative solution using wildcards:
rule all:
input: 'out/genomes.msh'
rule sketch:
input:
'{file}.txt'
output:
'{file}.msh'
shell:
'mash sketch -l {input} -k 31 -s 100000 -o {wildcards.file}'
Untested, but I think this should work.
The advantage over the params solution is that it generalizes better.
Best is to use params:
rule sketch:
input:
'out/genomes.txt'
output:
'out/genomes.msh'
params:
prefix=lambda wildcards, output: os.path.splitext(output[0])[0]
shell:
'mash sketch -l {input} -k 31 -s 100000 -o {params.prefix}'
It is always preferable to use params instead of using the run directive, because the run directive cannot be combined with conda environments.
Avoid duplicating text. Don't use params unless you convert your input/outputs to wildcards + extentions. Otherwise you're left with a rule that is hard to maintain.
input:
"{pathDIR}/{genome}.txt"
output:
"{pathDIR}/{genome}.msh"
params:
dir: '{pathDIR}/{genome}'
Otherwise, use Python's slice notation.
I couldn't seem to get slice notation to work in the params using the output wildcard. Here it is in the run directive.
from subprocess import call
rule sketch:
input:
'out/genomes.txt'
output:
'out/genomes.msh'
run:
callString="mash sketch -l " + str(input) + " -k 31 -s 100000 -o " + str(output)[:-4]
print(callString)
call(callString, shell=True)
Python underlies Snakemake. I prefer the "run" directive over the "shell" directive because I find it really unlocks a lot of that beautiful Python functionality. The accessing of params and various things are slightly different that with the "shell" directive.
E.g.
callString=config["mpileup_samtoolsProg"] + ' view -bh -F ' + str(config["bitFlag"]) + ' ' + str(input.inputBAM) + ' ' + wildcards.chrB2M[1:]
A bit of a snippet of J.K. using the run directive.
All of the rules in my modules pretty much use the run directive
You could remove the extension within the shell command
rule sketch:
input:
'out/genomes.txt'
output:
'out/genomes.msh'
shell:
'mash sketch -l {input} -k 31 -s 100000 -o $(echo "{output}" | sed -e "s/.msh//")'

Categories