Snakemake - How to use every line of input file as wildcard - python

I am pretty new to Snakemake and have looked around on SO for a solution to the problem below. I am very close to a solution, but not there yet.
I have a single-column file containing a list of SRA IDs, and I want to define my Snakemake rules so that every SRA ID from that file becomes a parameter on the command line.
#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647
Snakefile below:
from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil
configfile: "config.json"
data_dir=os.getcwd()
units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())
#print(samples)
rule all:
    input:
        expand("out/{sample}.fastq.gz",sample=samples)
rule clean:
    shell: "rm -rf .snakemake/"
include: 'rules/download_sample.smk'
download_sample.smk
rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
I have tried many different variants of the above code, but somewhere I am getting it wrong.
What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to get executed
parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip
This is the error I get
snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'
Thanks in advance

It seems to me that what you need is to access the sample wildcard using the wildcards object:
rule all:
    input: expand("out/{sample}.fastq.gz", sample=samples)
rule download_sample:
    output:
        "out/{sample}.fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip "

The first solution could be to use the run: section of the rule instead of the shell:. This allows you to employ python code:
rule download_sample:
    # ...
    run:
        for input_file in input:
            shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")
This straightforward solution, however, is not idiomatic. From what I can see, you have a one-to-one relationship between samples and output files: to produce one out/{sample}.fastq.gz file you need a single {sample}. The best solution is to reduce your rule to one that makes a single file. Since an SRA ID is not a file on disk, the rule takes the ID from the wildcard rather than from input:
rule download_sample:
    output: "out/{sample}.fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip "
The rule all now requests all targets and the rule download_sample downloads a single sample. Snakemake does the rest: it builds a dependency graph and creates one instance of download_sample per sample. Moreover, it can run these jobs in parallel if you wish.
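To make the one-job-per-sample idea concrete, here is a minimal plain-Python sketch (no Snakemake involved) of what the workflow effectively computes. It assumes the Samples.txt layout from the question; SAMPLES_TXT and read_samples are illustrative names, not part of Snakemake.

```python
import csv
import io

# Stand-in for Samples.txt from the question: a header line, then one ID per line.
SAMPLES_TXT = """Samples
SRR5597645
SRR5597646
SRR5597647
"""

def read_samples(handle):
    """Return the unique IDs from the 'Samples' column, in file order."""
    seen = []
    for row in csv.DictReader(handle):
        sample = row["Samples"].strip()
        if sample and sample not in seen:
            seen.append(sample)
    return seen

samples = read_samples(io.StringIO(SAMPLES_TXT))

# The targets that expand("out/{sample}.fastq.gz", sample=samples) requests in rule all.
targets = [f"out/{sample}.fastq.gz" for sample in samples]

# One command per job: the shell template filled in with a single wildcard value.
commands = [
    f"parallel-fastq-dump --sra-id {sample} --threads 16 --outdir out --gzip"
    for sample in samples
]

for command in commands:
    print(command)
```

Each printed line corresponds to one download_sample job, which is why the rule itself only needs to describe a single sample.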

Related

Snakemake runs rule too many times using config.yaml

I'm trying to create a snakemake workflow that evaluates raw read quality using FastQC and creates a report using MultiQC. I use 4 input files and get the expected results; however, I just noticed that each rule gets run 4 times and takes all 4 inputs each time, and I'm not sure how to fix that. Could anyone help me figure out how to:
Run the rule 4 times but use only one input from config.yaml at a time?
Run the rule 1 time but use all 4 inputs?
I'm trying to follow the snakemake tutorial but no luck so far.
Snakefile:
configfile: "config.yaml"
rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
    input:
        expand("data/samples/{sample}.fastq", sample=config["samples"])
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
    input:
        expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
        expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"
config.yaml file:
samples:
  Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
  Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
  KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
  KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
I run the snakemake using command:
snakemake -s Snakefile --core 1
Each rule is run 4 times:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
all                1              1              1
raw_fastqc         4              1              1
raw_multiqc        4              1              1
total              9              1              1
But each time all 4 inputs are used:
[Sun May 15 23:06:22 2022]
rule raw_fastqc:
input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
jobid: 3
wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
resources: tmpdir=/tmp
Your problem is using expand() in the input of each rule. expand() fills in every wildcard value at once, so you only need it in the all rule; from there, the wildcard values of each requested file are passed on to the upstream rules.
Snakefile:
configfile: "config.yaml"
rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
    input:
        "data/samples/{sample}.fastq"
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
    input:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip",
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"

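How wildcard values travel from a requested file back to a rule's input can be sketched in plain Python. This is only a rough model of the idea, not Snakemake's actual implementation; match_output is a hypothetical helper, and it assumes each wildcard name appears once in the pattern.

```python
import re

def match_output(pattern, target):
    """Return the wildcard values if `target` matches `pattern`, else None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)       # literal text of the pattern
        else:
            regex += f"(?P<{part}>.+)"     # wildcard becomes a named group
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

output_pattern = "outputs/fastqc_1/{sample}_fastqc.html"
input_pattern = "data/samples/{sample}.fastq"

# A file requested by a downstream rule (sample name taken from config.yaml).
target = "outputs/fastqc_1/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001_fastqc.html"

wildcards = match_output(output_pattern, target)
job_input = input_pattern.format(**wildcards)

print(wildcards["sample"])
print(job_input)
```

The job for that target gets exactly one input file, which is the behaviour the rules without expand() rely on.
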
Snakemake use all samples as one input with porechop

I'm trying to use porechop on several data with a Snakemake workflow.
In my Snakefile there are three rules: a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well, and I get all three outputs for my three fastq files. But for porechop, instead of running the command three times, it runs the command once with the -i flag for all three files at the same time:
Error in rule porechop:
jobid: 2
output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz
conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb
shell:
porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
However, when I use it with a single sample, the program works.
Here is my code:
import glob
import os
###Global Variables###
FORMATS=["zip", "html"]
DIR_FASTQ="/ngs/prod/nanocea_project/test/reads"
###FASTQ Files###
def list_samples(DIR_FASTQ):
    SAMPLES=[]
    for file in glob.glob(DIR_FASTQ+"/*.fastq.gz"):
        base=os.path.basename(file)
        sample=(base.replace('.fastq.gz', ''))
        SAMPLES.append(sample)
    return(SAMPLES)
SAMPLES=list_samples(DIR_FASTQ)
###Rules###
rule all:
    input:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS),
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
rule fastqc:
    input:
        expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS)
    threads:
        16
    conda:
        "envs/fastqc.yaml"
    shell:
        "fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}"
rule porechop:
    input:
        expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
    threads:
        40
    conda:
        "envs/porechop.yaml"
    shell:
        "porechop -i {input} -o {output} -t {threads} --discard_middle"
Do you have any idea what's wrong?
Thanks !
This question comes up often... If you use expand() in input: or output: then you are feeding the rule with a list of all the files. That is the same as writing:
    input:
        ['sample1.fastq', 'sample2.fastq', ..., 'sampleN.fastq'],
    output:
        ['sample1.pore.fastq', 'sample2.pore.fastq', ..., 'sampleN.pore.fastq'],
To run the rule on each input/output just remove the expand:
rule porechop:
    input:
        DIR_FASTQ+"/{sample}.fastq.gz"
    output:
        "/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",

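The difference between the two forms can be seen without Snakemake. The sketch below uses expand_like, a rough stand-in written for this illustration (it is not Snakemake's expand()), together with the sample names from the error message above.

```python
from itertools import product

def expand_like(pattern, **values):
    """Fill `pattern` with every combination of the given values (rough stand-in for expand())."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]

SAMPLES = ["25022021_2", "02062021_1", "02062021_2"]
DIR_FASTQ = "/ngs/prod/nanocea_project/test/reads"

# With expand() in input:, the rule receives the whole list, so {input}
# renders as all three paths separated by spaces in a single command.
all_inputs = expand_like(DIR_FASTQ + "/{sample}.fastq.gz", sample=SAMPLES)
command_with_expand = "porechop -i " + " ".join(all_inputs)

# Without expand(), the pattern is filled in once per job, one sample at a time.
per_sample_commands = [
    "porechop -i " + (DIR_FASTQ + "/{sample}.fastq.gz").format(sample=sample)
    for sample in SAMPLES
]

print(command_with_expand)
for command in per_sample_commands:
    print(command)
```

The first printed line reproduces the shape of the failing command; the remaining lines are the three separate invocations the question was after.
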
Snakemake input fastq files from each sample directory issue for metagenomics analysis

I am working on a new snakemake metagenomics pipeline to trim fastq files, and run them through kraken. Each sample has a directory containing the forward and reverse reads.
Sample_1/r1_paired.fq.gz
Sample_1/r2_paired.fq.gz
Sample_2/r1_paired.fq.gz
Sample_2/r2_paired.fq.gz
I am providing a sample sheet that users can upload, that contains the sample names and the read names. I used pandas to parse the sample sheet and provide the names required for the snakefile. Here is my snakefile.
#Extract sample names from CSV
import pandas as pd
import os
df = pd.read_csv("sample_table_test.csv")
print(df)
samples = df.library.to_list()
print("Samples being processed:", samples)
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
print(R1,R2)
rule all:
    input:
        expand("{sample}.bracken", sample=samples),
#Trimmomatic to trim paired end reads
rule trim_reads:
    input:
        "{sample}/{R1}",
        "{sample}/{R2}",
    output:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    conda:
        "env.yaml",
    shell:
        "trimmomatic PE -threads 8 {input} {input} {output} {output} SLIDINGWINDOW:4:30 LEADING:2 TRAILING:2 MINLEN:50"
#Kraken2 to bin reads and assign taxonomy
rule kraken2:
    input:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
    conda:
        "env.yaml",
    shell:
        "kraken2 --gzip-compressed --paired --classified-out {output} {input} {input} --db database/minikraken2_v1_8GB/ --report {sample}_report.txt --threads 1"
#Bracken estimates abundance of a species within a sample
rule bracken:
    input:
        "{sample}_report.txt",
    output:
        "{sample}.bracken",
    conda:
        "env.yaml",
    shell:
        "bracken -d database/minikraken2_v1_8GB/ -i {input} -o {output} -r 150"
I am receiving the below error and have been struggling to find a better way to write my snakefile to avoid this issue. Any assistance here would be greatly appreciated.
WildcardError in line 19 of /Metagenomics/Metagenomics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
'R1'
Thank you!
The problem is in your rule kraken2:
rule kraken2:
    input:
        "{sample}/{R1}_1_trim_paired.fq.gz",
        "{sample}/{R2}_2_trim_paired.fq.gz",
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
All wildcards in a rule must be determined from its output section. The logic of each rule is that it offers certain files as possible outputs. In your case the rule offers the files "{sample}_report.txt" and "{sample}_kraken_cseqs#.fq", where {sample} is one degree of freedom that is substituted with a concrete value to resolve the pattern into a filename. Snakemake can then determine the inputs for this rule, but only if it has all the information. The value of {sample} is defined by the output, but what are the values of {R1} and {R2}?
You have several options. The first is to define these values somewhere in the output: section, which does not seem to fit your case. The second option is to define these values globally (as you are probably trying to do):
R1 = df.r1_file.to_list()
R2 = df.r2_file.to_list()
In this case {R1} and {R2} shall not be wildcards but the parameters of the expand function:
rule kraken2:
    input:
        expand("{{sample}}/{R1}_1_trim_paired.fq.gz", R1=R1),
        expand("{{sample}}/{R2}_2_trim_paired.fq.gz", R2=R2)
    output:
        "{sample}_report.txt",
        "{sample}_kraken_cseqs#.fq",
Or even better:
expand("{{sample}}/{R}_1_trim_paired.fq.gz", R=R1+R2)
Note that the wildcard {sample} now has to be in double braces to be distinguished from the parameters of the expand function.
There are other options, such as resolving the value of {R1} from the values of other wildcards with an input function (lambda wildcards: ...), but I guess that is not what you need.
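For completeness, the input-function option could look roughly like the sketch below. It assumes the sample sheet has the columns library, r1_file and r2_file used in the question; the sheet contents, reads_by_sample and trim_inputs are made up for illustration, and SimpleNamespace merely stands in for the wildcards object Snakemake would pass.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical sample sheet contents; only the column names come from the question.
SAMPLE_SHEET = """library,r1_file,r2_file
Sample_1,r1_paired.fq.gz,r2_paired.fq.gz
Sample_2,r1_paired.fq.gz,r2_paired.fq.gz
"""

# Look-up table: sample name -> (forward read file, reverse read file).
reads_by_sample = {
    row["library"]: (row["r1_file"], row["r2_file"])
    for row in csv.DictReader(io.StringIO(SAMPLE_SHEET))
}

def trim_inputs(wildcards):
    """Input function: map the {sample} wildcard to that sample's two read files."""
    r1, r2 = reads_by_sample[wildcards.sample]
    return [f"{wildcards.sample}/{r1}", f"{wildcards.sample}/{r2}"]

# Stand-in call; in a Snakefile the function is referenced as `input: trim_inputs`.
print(trim_inputs(SimpleNamespace(sample="Sample_1")))
```

With this approach {sample} is the only wildcard the rule needs, so it can be resolved entirely from the output pattern.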

snakemake temporary directories

snakemake deletes all output files that are marked temporary but does not do anything to the files if the output is a directory as shown below:
rule all:
    input:
        'final.txt',
checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'
def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp
rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'
When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.
$ wc -l final.txt
61177 final.txt
$ wc -l bigfile.txt
61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
part_00 part_01 part_02 part_03 part_04
part_05 part_06 part_07 part_08 part_09
part_10 part_11 part_12
What I would like to see:
1. copy_files directory should also be deleted (but apparently since snakemake cannot figure out whether there are any other files unrelated to snakemake in that directory it will not delete directories by default)
2. contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
I cannot reproduce it:
rule all:
    input:
        "a.txt"
rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"
rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"
What version of snakemake are you using? Is output_dir perhaps listed under rule all for you? Snakemake assumes that the output you want is the input of your first rule (probably rule all), so it won't delete those files. Removing output_dir from rule all would solve this issue.
However, I am just guessing, since you didn't provide a minimal reproducible example.
Edit:
Hmm... That should work! Here are two non-ideal solutions I could come up with:
We can fool snakemake into re-evaluating the DAG and then deleting the folder like this; however, I am not sure whether the files get deleted early enough for you (the files might be very large).
rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'
Or just delete each file after copying it; however, you will end up with an empty folder at the end:
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'
You can of course combine both solutions and have the best of both worlds; however, it is unfortunately not very pretty to look at :(
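If the leftover empty folder from the second workaround is a concern, a small helper could remove it once nothing is left inside. This is only a sketch under assumptions: remove_dir_if_empty is a made-up name, and where to call it (for example from an onsuccess handler) is a suggestion rather than something tested against this workflow.

```python
import os
import tempfile

def remove_dir_if_empty(path):
    """Remove `path` only if it is an existing, empty directory; return True if removed."""
    if os.path.isdir(path) and not os.listdir(path):
        os.rmdir(path)
        return True
    return False

# Demonstration in a throwaway directory so no real workflow files are touched.
with tempfile.TemporaryDirectory() as workdir:
    split_dir = os.path.join(workdir, "split_files")
    os.mkdir(split_dir)
    part = os.path.join(split_dir, "part_00")
    open(part, "w").close()

    removed_while_full = remove_dir_if_empty(split_dir)   # a part file is still present
    os.remove(part)                                       # mimics the "rm {input}" step
    removed_when_empty = remove_dir_if_empty(split_dir)   # now the folder can go

print(removed_while_full, removed_when_empty)
```

Refusing to delete a non-empty directory keeps the helper from removing files that are unrelated to the workflow.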

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 pear
1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]
wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|' .join(config["samples"])
rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])
# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    run:
        """
        set+u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """
# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"
# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move
# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """
# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the yaml file looks like this, if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear I think you want to use the shell directive instead of run. With run you execute Python code, which in this case does nothing: the block only evaluates a string, so you get no error and no file is produced.
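The effect is easy to reproduce in plain Python. In the sketch below both function names are made up, and the subprocess call is only a stand-in command that creates the file (it is not pear); the point is that a bare string statement is evaluated and discarded, while an actual command invocation produces the output.

```python
import os
import subprocess
import sys
import tempfile

def block_with_bare_string(output_path):
    # Like the run: block in the question: the string is built, then thrown away.
    f"""
    pear -f forward.fastq -r reverse.fastq -o {output_path} -t 21
    """

def block_with_command(output_path):
    # Stand-in for a real command: a child Python process that creates the file.
    subprocess.run(
        [sys.executable, "-c", "import sys; open(sys.argv[1], 'w').close()", output_path],
        check=True,
    )

with tempfile.TemporaryDirectory() as tmp:
    bare_output = os.path.join(tmp, "bare.fastq")
    command_output = os.path.join(tmp, "command.fastq")
    block_with_bare_string(bare_output)
    block_with_command(command_output)
    bare_created = os.path.exists(bare_output)
    command_created = os.path.exists(command_output)

print(bare_created, command_created)
```

The first function finishes without any error and without creating a file, which matches the MissingOutputException Snakemake reports after the latency wait.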
