Snakemake runs rule too many times using config.yaml - python

I'm trying to create a Snakemake workflow that evaluates raw read quality using FastQC and creates a report using MultiQC. I use 4 input files and get the expected results, but I just noticed that each rule gets run 4 times and takes all 4 inputs each time, and I'm not sure how to fix that. Could anyone help me figure out how to either:
1. Run the rule 4 times, using only one input from config.yaml at a time?
2. Run the rule 1 time, using all 4 inputs?
I'm trying to follow the Snakemake tutorial, but no luck so far.
Snakefile:
configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])

rule raw_fastqc:
    input:
        expand("data/samples/{sample}.fastq", sample=config["samples"])
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
        expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"
config.yaml file:
samples:
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
I run Snakemake using the command:
snakemake -s Snakefile --core 1
Each rule is run 4 times:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
all                1              1              1
raw_fastqc         4              1              1
raw_multiqc        4              1              1
total              9              1              1
But each time all 4 inputs are used:
[Sun May 15 23:06:22 2022]
rule raw_fastqc:
input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
jobid: 3
wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
resources: tmpdir=/tmp

Your problem is using expand() in the input of each rule. expand() fills in the wildcard values and produces a plain list of files, so you only need it in rule all: the concrete file names requested there are matched against the output patterns of the other rules, and the wildcard values inferred from that match are filled into each rule's input.
Snakefile:
configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])

rule raw_fastqc:
    input:
        "data/samples/{sample}.fastq"
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip",
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"

Related

Snakemake not recognizing multiple files as input

I'm having some trouble running Snakemake. I want to perform quality control of some RNA-Seq bulk samples using FastQC. I've written the code so that all files following the pattern {sample}_{replicate}.fastq.gz should be used as input, where {sample} is the sample id (i.e. SRR6974023) and {replicate} is 1 or 2. My little script follows:
configfile: "config.yaml"

rule all:
    input:
        expand("raw_qc/{sample}_{replicate}_fastqc.{extension}", sample=config["samples"], replicate=[1, 2], extension=["zip", "html"])

rule fastqc:
    input:
        rawread=expand("raw_data/{sample}_{replicate}.fastq.gz", sample=config["samples"], replicate=[1, 2])
    output:
        compress=expand("raw_qc/{sample}_{replicate}_fastqc.zip", sample=config["samples"], replicate=[1, 2]),
        net=expand("raw_qc/{sample}_{replicate}_fastqc.html", sample=config["samples"], replicate=[1, 2])
    threads:
        8
    params:
        path="raw_qc/"
    shell:
        "fastqc -t {threads} {input.rawread} -o {params.path}"
Just in case, the config.yaml is:
samples:
    SRR6974023
    SRR6974024
The raw_data directory with my files looks like this:
SRR6974023_1.fastq.gz SRR6974023_2.fastq.gz SRR6974024_1.fastq.gz SRR6974024_2.fastq.gz
Finally, when I run the script, I always see the same error:
Building DAG of jobs...
MissingInputException in line 8 of /home/user/path/Snakefile:
Missing input files for rule fastqc:
raw_data/SRR6974023 SRR6974024_2.fastq.gz
raw_data/SRR6974023 SRR6974024_1.fastq.gz
It correctly sees only the last files, in this case SRR6974024_1.fastq.gz and SRR6974024_2.fastq.gz. The other one is only seen as SRR6974023. How can I solve this issue? I'd appreciate some help. Thank you all!
The YAML is not configured correctly. Each entry needs a leading - so that samples becomes a list:
samples:
    - SRR6974023
    - SRR6974024
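That also explains the odd raw_data/SRR6974023 SRR6974024_2.fastq.gz path in the error: without the dashes, YAML folds the two indented lines into a single space-joined string. A small sketch (assuming PyYAML is available) showing the difference:

import yaml

# Without dashes, the two indented lines are folded into one plain scalar string.
broken = yaml.safe_load("samples:\n    SRR6974023\n    SRR6974024\n")
print(broken)   # {'samples': 'SRR6974023 SRR6974024'}

# With dashes, samples is a proper list that expand() can iterate over.
fixed = yaml.safe_load("samples:\n    - SRR6974023\n    - SRR6974024\n")
print(fixed)    # {'samples': ['SRR6974023', 'SRR6974024']}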

Snakemake use all samples as one input with porechop

I'm trying to use porechop on several data with a Snakemake workflow.
In my Snakefile there are three rules: a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well; I get all three outputs for my three fastq files. But for porechop, instead of running the command three times, it runs the command once, passing all three files to the -i flag at the same time:
Error in rule porechop:
jobid: 2
output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz
conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb
shell:
porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
However, when I use it with a single sample, the program works.
Here is my code:
import glob
import os

### Global Variables ###
FORMATS = ["zip", "html"]
DIR_FASTQ = "/ngs/prod/nanocea_project/test/reads"

### FASTQ Files ###
def list_samples(DIR_FASTQ):
    SAMPLES = []
    for file in glob.glob(DIR_FASTQ + "/*.fastq.gz"):
        base = os.path.basename(file)
        sample = base.replace('.fastq.gz', '')
        SAMPLES.append(sample)
    return SAMPLES

SAMPLES = list_samples(DIR_FASTQ)

### Rules ###
rule all:
    input:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS),
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)

rule fastqc:
    input:
        expand(DIR_FASTQ + "/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS)
    threads:
        16
    conda:
        "envs/fastqc.yaml"
    shell:
        "fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}"

rule porechop:
    input:
        expand(DIR_FASTQ + "/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
    threads:
        40
    conda:
        "envs/porechop.yaml"
    shell:
        "porechop -i {input} -o {output} -t {threads} --discard_middle"
Do you have any idea what's wrong?
Thanks!
This question comes up often... If you use expand() in input: or output:, then you are feeding the rule a list of all the files. That is the same as writing:
input:
    ['sample1.fastq', 'sample2.fastq', ..., 'sampleN.fastq'],
output:
    ['sample1.pore.fastq', 'sample2.pore.fastq', ..., 'sampleN.pore.fastq'],
To run the rule once per input/output pair, just remove the expand():
rule porechop:
    input:
        DIR_FASTQ + "/{sample}.fastq.gz"
    output:
        "/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
    count    jobs
    1        pear
    1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        """
        set+u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the yaml file looks like this if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
    - Unmap_41
separator: _
I know the error can occur when you use certain flags in Snakemake, but I don't believe I am using those flags. The command being submitted to run the Snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear I think you want to use the shell directive instead of run. With run: you execute Python code, and in this case that code does nothing: the triple-quoted string is simply evaluated as an expression, so you get no error and no file produced.
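A minimal sketch of the corrected rule using shell:, keeping the environment activation from the original; note that pear's -o argument is a prefix rather than a file name, so the exact output path may still need adjusting:

rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        """
        set +u; source activate antismash; set -u
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """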

Snakemake - How to use every line of input file as wildcard

I am pretty new to using Snakemake and I have looked around on SO to see if there is a solution for the below. I am very close to a solution, but not there yet.
I have a single column file containing a list of SRA ids and I want to use snakemake to define my rules such that every SRA id from that file becomes a parameter on command line.
#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647
Snakefile below:
from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil

configfile: "config.json"

data_dir = os.getcwd()
units_table = pd.read_table("Samples.txt")
samples = list(units_table.Samples.unique())
#print(samples)

rule all:
    input:
        expand("out/{sample}.fastq.gz", sample=samples)

rule clean:
    shell: "rm -rf .snakemake/"

include: 'rules/download_sample.smk'
download_sample.smk
rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip"
I have tried many different variants of the above code, but somewhere I am getting it wrong.
What I want: for every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to be executed:
parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip
This is the error I get
snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'
Thanks in advance
It seems to me that what you need is to access the sample wildcard using the wildcards object:
rule all:
    input: expand("out/{sample}_fastq.gz", sample=samples)

rule download_sample:
    output:
        "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip"
The first solution could be to use the run: section of the rule instead of the shell:. This allows you to employ python code:
rule download_sample:
    # ...
    run:
        for input_file in input:
            shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")
This straightforward solution, however, is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words, to produce one out/{sample}_fastq.gz file you need a single {sample}. The best solution would be to reduce your rule to one that makes a single file:
rule download_sample:
    input: "{sample}"
    output: "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip"
The rule all: now requires all targets; the rule download_sample downloads a single sample, and the Snakemake workflow does the rest: it constructs a graph of dependencies and creates one instance of rule download_sample per sample. Moreover, if you wish, it can run these rules in parallel.

Snakemake wildcard in output only

I have a script which takes a large input file and breaks it down into a number of chunks, 1 to n, using an unpredictable algorithm.
A following script will then process each of these chunks iteratively.
How can I create a Snakemake rule which essentially states that output files 1 to n will exist, and that the following script should be run once for each of the 1 to n input files?
Thanks!
There is the dynamic keyword. It can be used like this:
rule all:
    input:
        dynamic('{id}.png')

rule draw:
    input:
        '{id}.txt'
    output:
        '{id}.png'
    shell:
        'cp {input} {output}'

rule cluster:
    input:
        'input.csv'
    output:
        dynamic('{id}.txt')
    shell:
        'touch 1.txt 2.txt'
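Note that dynamic() has been deprecated in recent Snakemake releases in favour of checkpoints. A rough sketch of the same idea with a checkpoint, where the clusters output directory and the aggregate function are illustrative assumptions, would be:

import os

checkpoint cluster:
    input:
        'input.csv'
    output:
        directory('clusters')
    shell:
        'mkdir -p {output} && touch {output}/1.txt {output}/2.txt'

rule draw:
    input:
        'clusters/{id}.txt'
    output:
        '{id}.png'
    shell:
        'cp {input} {output}'

def aggregate(wildcards):
    # Re-evaluate the available chunks once the checkpoint has finished.
    chunk_dir = checkpoints.cluster.get(**wildcards).output[0]
    ids = glob_wildcards(os.path.join(chunk_dir, '{id}.txt')).id
    return expand('{id}.png', id=ids)

rule all:
    input:
        aggregate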
Have you tried setting a wildcard? For example, if you are iterating a rule over files 1 to 22, you can define the values at the top of your Snakefile:
num = range(1, 23)
Then use num in your expand() file names, or reference it inside a rule as {wildcards.num}.
