Apply snakemake rule on all generated files - python

I want to run a simple script, "script.py", which runs some calculations and periodically writes out a step_000n.txt file, with n depending on the total execution time. I would then like snakemake to run another rule on all generated files. What would be the proper Snakefile input?
i.e.
1. run script.py
2. get step_000{1,2,3,4 ..}.txt (n being variable and not determined)
3. apply `process.py -in step_000{n}.txt -out step_000{n}.png` on all step_000{1,2,3,4 ..}.txt
My obviously wrong attempt is below:
rule all:
    input: expand("{step}.png", step=list(map(lambda x: x.split(".")[0], glob.glob("model0*.txt"))))
rule txt:
    input: "{step}.txt"
    output: "{step}.png"
    shell:
        "process.py -in {input} -out {output}"
rule first:
    output: "{step}.txt"
    script: "script.py"
I could not figure out how to define output target here.

I would write all the step_000n.txt files to a dedicated directory and then process all the files in that directory. Something like:
rule all:
    input:
        'processed.txt',
rule split:
    output:
        directory('processed_dir'),
    shell:
        r"""
        # Write out step_001.txt, step_002.txt, ..., step_000n.txt
        # in output directory `processed_dir`
        mkdir {output}
        script.py ...
        """
rule process:
    input:
        indir= 'processed_dir',
    output:
        out= 'processed.txt',
    shell:
        r"""
        process.py -n {input.indir}/step_*.txt -out {output.out}
        """

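If one output per step file is wanted rather than a single `processed.txt`, the list of targets can only be computed once the directory has been filled. A minimal Python sketch of that mapping follows; the helper name `png_targets` and the throwaway directory are assumptions for illustration, not part of Snakemake or of the answer above.

```python
from pathlib import Path
import tempfile


def png_targets(indir):
    """Map every step_*.txt found in indir to a .png path with the same stem."""
    return sorted(str(p.with_suffix(".png")) for p in Path(indir).glob("step_*.txt"))


# Demo on a temporary directory standing in for `processed_dir`.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2, 3):
        (Path(tmp) / f"step_{n:04d}.txt").touch()
    demo_names = [Path(t).name for t in png_targets(tmp)]

print(demo_names)
```

In a Snakefile, a function like this would have to be evaluated after the rule that writes the directory has finished, for example from an input function tied to a checkpoint.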

Snakemake use all samples as one input with porechop

I'm trying to use porechop on several samples with a Snakemake workflow.
My Snakefile has three rules: a fastqc rule and a porechop rule, in addition to the all rule. The fastqc rule works very well, and I get all three outputs for my three fastq files. But for porechop, instead of running the command three times, it runs the command once with the -i flag for all three files at the same time:
Error in rule porechop:
jobid: 2
output: /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz, /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz
conda-env: /ngs/prod/nanocea_project/test/.snakemake/conda/a72fb141b37718b7c37d9f32d597faeb
shell:
porechop -i /ngs/prod/nanocea_project/test/reads/25022021_2.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_1.fastq.gz /ngs/prod/nanocea_project/test/reads/02062021_2.fastq.gz -o /ngs/prod/nanocea_project/test/prod/porechop/25022021_2_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_1_pore.fastq.gz /ngs/prod/nanocea_project/test/prod/porechop/02062021_2_pore.fastq.gz -t 40 --discard_middle
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
However, when I use it with a single sample, the program works.
Here is my code:
import glob
import os
###Global Variables###
FORMATS=["zip", "html"]
DIR_FASTQ="/ngs/prod/nanocea_project/test/reads"
###FASTQ Files###
def list_samples(DIR_FASTQ):
    SAMPLES=[]
    for file in glob.glob(DIR_FASTQ+"/*.fastq.gz"):
        base=os.path.basename(file)
        sample=(base.replace('.fastq.gz', ''))
        SAMPLES.append(sample)
    return(SAMPLES)
SAMPLES=list_samples(DIR_FASTQ)
###Rules###
rule all:
    input:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS),
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
rule fastqc:
    input:
        expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/stats/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=FORMATS)
    threads:
        16
    conda:
        "envs/fastqc.yaml"
    shell:
        "fastqc {input} -o /ngs/prod/nanocea_project/test/stats/fastqc/ -t {threads}"
rule porechop:
    input:
        expand(DIR_FASTQ+"/{sample}.fastq.gz", sample=SAMPLES)
    output:
        expand("/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz", sample=SAMPLES)
    threads:
        40
    conda:
        "envs/porechop.yaml"
    shell:
        "porechop -i {input} -o {output} -t {threads} --discard_middle"
Do you have any idea what's wrong?
Thanks !
This question comes up often... If you use expand() in input: or output: then you are feeding the rule with a list of all the files. That is the same as writing:
    input:
        ['sample1.fastq', 'sample2.fastq', ..., 'sampleN.fastq'],
    output:
        ['sample1.pore.fastq', 'sample2.pore.fastq', ..., 'sampleN.pore.fastq'],
To run the rule on each input/output just remove the expand:
rule porechop:
    input:
        DIR_FASTQ+"/{sample}.fastq.gz"
    output:
        "/ngs/prod/nanocea_project/test/prod/porechop/{sample}_pore.fastq.gz",

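To make the difference concrete, here is a rough Python sketch of what the porechop answer above describes. `expand_like` is a hypothetical, simplified stand-in for snakemake's `expand()` (it is not the real implementation), and the `reads/` prefix is a shortened path used only for illustration.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Simplified stand-in for expand(): fill the pattern with every
    combination of the given wildcard values."""
    keys = list(wildcards)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(wildcards[k] for k in keys))
    ]


SAMPLES = ["25022021_2", "02062021_1", "02062021_2"]

# With expand(): one list holding every file, so a single job receives all of them.
all_inputs = expand_like("reads/{sample}.fastq.gz", sample=SAMPLES)
print(all_inputs)

# Without expand(): the pattern is filled once per requested output, one job each.
per_job_inputs = ["reads/{sample}.fastq.gz".format(sample=s) for s in SAMPLES]
for path in per_job_inputs:
    print([path])
```

The `expand()` calls still belong in rule all, where listing every final file is exactly what is wanted.
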
snakemake temporary directories

Snakemake deletes all output files that are marked temporary, but does nothing when the temporary output is a directory, as shown below:
rule all:
    input:
        'final.txt',
checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'
def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp
rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'
When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.
$ wc -l final.txt
61177 final.txt
$ wc -l bigfile.txt
61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
part_00 part_01 part_02 part_03 part_04
part_05 part_06 part_07 part_08 part_09
part_10 part_11 part_12
What I would like to see:
1. The copy_files directory should also be deleted (but apparently since snakemake cannot figure out whether there are any other files unrelated to snakemake in that directory it will not delete directories by default).
2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
I cannot reproduce it:
rule all:
    input:
        "a.txt"
rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"
rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"
What version of snakemake are you using? Is output_dir perhaps listed under rule all? Snakemake assumes that the output you want is the input of your first rule (probably rule all), so it won't delete those files. Removing output_dir from rule all would solve this issue.
However, I am just guessing, since you didn't provide a minimal reproducible example.
Edit:
Hmm... That should work! Here are two non-ideal solutions I could come up with:
We can fool snakemake into re-evaluating the DAG and then deleting the folder like this; however, I am not sure whether the files get deleted early enough for you (they might be very large).
rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'
Or just delete each file after copying it; however, you will end up with an empty folder in the end:
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'
You can of course combine both solutions and have the best of both worlds; unfortunately, it is not very pretty to look at :(

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 pear
1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]
wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|' .join(config["samples"])
rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])
# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    run:
        """
        set+u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """
# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"
# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move
# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """
# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the yaml file looks like this, if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear, I think you want to use the shell directive instead of run. With run you execute Python code, which in this case does nothing because you simply "execute" a string, so you get no error and no file is produced.
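The effect can be reproduced in plain Python. The sketch below is an illustration under assumptions: `run_style` and `shell_style` are hypothetical names, and `touch` stands in for the real pear command. A function body that consists only of a string literal runs without error and does nothing, while passing the same text to a subprocess actually creates the file.

```python
import os
import shlex
import subprocess
import tempfile


def run_style():
    # Equivalent of a `run:` block that only contains a string literal:
    # Python evaluates the string and discards it, no command is executed.
    """
    touch merged.fastq
    """


def shell_style(target):
    # Equivalent of a `shell:` block: the text is handed to a shell.
    subprocess.run(f"touch {shlex.quote(target)}", shell=True, check=True)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "merged.fastq")

    returned = run_style()
    created_by_run = os.path.exists(target)

    shell_style(target)
    created_by_shell = os.path.exists(target)

print(returned, created_by_run, created_by_shell)
```

This matches the symptom in the log: the job finishes without an error of its own, and snakemake then reports the missing output file.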

Snakemake don't execute bash command

I'm trying to run a little pipeline in Snakemake for a software to filter good reads in files from a RNA-seq.
This is my code:
SAMPLES = ['ZN21_S1', 'ZN22_S2','ZN27_S3', 'ZN28_S4', 'ZN29_S5' ,'ZN30_S6']
rule all:
    input:
        expand("SVA-{sample}_L001_R{read_no}.fastq.gz", sample=SAMPLES, read_no=['1', '2'])
rule fastp:
    input:
        reads1="SVA-{sample}_L001_R1.fastq.gz",
        reads2="SVA-{sample}_L001_R2.fastq.gz"
    output:
        reads1out="out/SVA-{sample}_L001_R1.fastq.gz.good",
        reads2out="out/SVA-{sample}_L001_R2.fastq.gz.good"
    shell:
        "fastp -i {input.reads1} -I {input.reads2} -o {output.reads1out} -O {output.reads2out}"
All samples (as symbolic links) are in the same folder, and I only get the message "Nothing to be done".
What am I not seeing?
In your example, target files in rule all are supposed to match with rule fastp's output files, instead of its input files in your current setup. As per your code, target files in rule all already exist and hence the message Nothing to be done when executing it.
rule all:
    input:
        expand("out/SVA-{sample}_L001_R{read_no}.fastq.gz.good", sample=SAMPLES, read_no=['1', '2'])
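A small Python sketch of why the original target list gave "Nothing to be done". It is a deliberately simplified model, assuming a target needs work only when it is not on disk (the real scheduler also considers timestamps and other triggers); `missing_targets` is a hypothetical helper, not a Snakemake function.

```python
SAMPLES = ['ZN21_S1', 'ZN22_S2', 'ZN27_S3', 'ZN28_S4', 'ZN29_S5', 'ZN30_S6']
READS = ['1', '2']


def missing_targets(targets, existing):
    """Simplified model: a target needs work only if it is not on disk yet."""
    return [t for t in targets if t not in existing]


# Targets as written in the question: these are the raw reads themselves.
raw_reads = [f"SVA-{s}_L001_R{r}.fastq.gz" for s in SAMPLES for r in READS]
# Targets as written in the answer: the files that rule fastp produces.
filtered = [f"out/{name}.good" for name in raw_reads]

# Only the raw reads exist before the first run.
on_disk = set(raw_reads)

print(len(missing_targets(raw_reads, on_disk)))  # 0 targets to build
print(len(missing_targets(filtered, on_disk)))   # 12 targets to build
```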

How to perform simple string operations in snakemake output

I am creating my first snakemake file, and I got to the point where I need to perform a simple string operation on the value of my output, so that my shell command works as expected:
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {output}'
I need to apply the split function to {output} so that only the name of the file up to the extension is used. I couldn't find anything in the docs or in related questions.
You could use the params field:
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    params:
        dir = 'out/genomes'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {params.dir}'
Alternative solution using wildcards:
rule all:
    input: 'out/genomes.msh'
rule sketch:
    input:
        '{file}.txt'
    output:
        '{file}.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {wildcards.file}'
Untested, but I think this should work.
The advantage over the params solution is that it generalizes better.
Best is to use params:
import os
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    params:
        prefix=lambda wildcards, output: os.path.splitext(output[0])[0]
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {params.prefix}'
It is always preferable to use params instead of using the run directive, because the run directive cannot be combined with conda environments.
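For reference, a short Python sketch of what `os.path.splitext` returns, compared with cutting a fixed number of characters off the end. The helper names and the extra example paths are assumptions chosen for illustration.

```python
import os.path


def prefix_splitext(path):
    """Drop the last extension, whatever its length."""
    return os.path.splitext(path)[0]


def prefix_slice(path, n=4):
    """Drop the last n characters; only correct for extensions of that exact length."""
    return path[:-n]


for path in ("out/genomes.msh", "out/genomes.mash", "out/v1.2/genomes.msh"):
    print(path, "->", prefix_splitext(path), "|", prefix_slice(path))
```

Both give `out/genomes` for the file in the question; they diverge as soon as the extension is not exactly four characters long.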
Avoid duplicating text. Don't use params unless you convert your inputs/outputs to wildcards + extensions. Otherwise you're left with a rule that is hard to maintain.
    input:
        "{pathDIR}/{genome}.txt"
    output:
        "{pathDIR}/{genome}.msh"
    params:
        dir='{pathDIR}/{genome}'
Otherwise, use Python's slice notation.
I couldn't seem to get slice notation to work in the params using the output wildcard. Here it is in the run directive.
from subprocess import call
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    run:
        callString="mash sketch -l " + str(input) + " -k 31 -s 100000 -o " + str(output)[:-4]
        print(callString)
        call(callString, shell=True)
Python underlies Snakemake. I prefer the "run" directive over the "shell" directive because I find it really unlocks a lot of that beautiful Python functionality. Accessing params and other objects works slightly differently than with the "shell" directive.
E.g.
callString=config["mpileup_samtoolsProg"] + ' view -bh -F ' + str(config["bitFlag"]) + ' ' + str(input.inputBAM) + ' ' + wildcards.chrB2M[1:]
A bit of a snippet of J.K. using the run directive.
All of the rules in my modules pretty much use the run directive
You could remove the extension within the shell command
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o $(echo "{output}" | sed -e "s/.msh//")'
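One caveat with the pattern `s/.msh//`: the dot is unescaped and the pattern is not anchored, so it removes the first occurrence of "any character followed by msh" rather than the trailing extension. The Python sketch below mirrors that behaviour with `re.sub`; the function names and the second example path are hypothetical.

```python
import re


def strip_like_sed(path):
    # Mirrors sed "s/.msh//": unescaped dot, no anchor, first match only.
    return re.sub(r".msh", "", path, count=1)


def strip_anchored(path):
    # Escaped dot and end-of-string anchor: only a trailing ".msh" is removed.
    return re.sub(r"\.msh$", "", path)


for path in ("out/genomes.msh", "out/xmsh_genomes.msh"):
    print(path, "->", strip_like_sed(path), "|", strip_anchored(path))
```

For `out/genomes.msh` both variants give `out/genomes`, so the command in the answer works for the file in the question; the difference only shows up for names that contain "msh" earlier in the path.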
