Snakemake temporary directories

Snakemake deletes output files that are marked as temporary, but it does not do anything to the files when the output is a directory, as shown below:
rule all:
    input:
        'final.txt',

checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'

def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp

rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'
When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.
$ wc -l final.txt
61177 final.txt
$ wc -l bigfile.txt
61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
part_00 part_01 part_02 part_03 part_04
part_05 part_06 part_07 part_08 part_09
part_10 part_11 part_12
What I would like to see:
1. The copy_files directory should also be deleted (though apparently Snakemake will not delete directories by default, since it cannot figure out whether a directory contains other files unrelated to Snakemake).
2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.

I cannot recreate it:
rule all:
    input:
        "a.txt"

rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"

rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"
What version of Snakemake are you using? Is output_dir perhaps listed under rule all for you? Snakemake assumes that the output you want is the input of your first rule (probably rule all), so it won't delete those files; removing output_dir from under rule all would solve this issue.
However, I am just guessing, since you didn't provide a minimal reproducible example.
Edit:
Hmm... That should work! Here are two non-ideal solutions I could come up with:
We can trick Snakemake into re-evaluating the DAG and then deleting the folder, like this; however, I am not sure whether the files get deleted early enough for you (the files might be very large).
rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'
Or just delete each file after copying it; however, you will end up with an empty folder at the end:
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'
You can of course combine both solutions and have the best of both worlds, although unfortunately it is not very pretty to look at :(
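Put together, the combined version would look something like this (just the two snippets above merged):

rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'

rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'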


Apply snakemake rule on all generated files

I want to run a simple script, script.py, which will run some calculations and periodically spit out a step_000n.txt file, with n depending on the total execution time. I would then like Snakemake to run another rule on all generated files. What would be the proper Snakefile input?
i.e.
1. run script.py
2. get step_000{1,2,3,4 ..}.txt (n being variable and not determined in advance)
3. apply `process.py -in step_000{n}.txt -out step_000{n}.png` to all step_000{1,2,3,4 ..}.txt files
My obviously wrong attempt is below
rule all:
    input: expand("{step}.png", step=list(map(lambda x: x.split(".")[0], glob.glob("model0*.txt"))))

rule txt:
    input: "{step}.txt"
    output: "{step}.png"
    shell:
        "process.py -in {input} -out {output}"

rule first:
    output: "{step}.txt"
    script: "script.py"
I could not figure out how to define the output target here.
I would write all the step_000n.txt files to a dedicated directory and then process all the files in that directory. Something like:
rule all:
    input:
        'processed.txt',

rule split:
    output:
        directory('processed_dir'),
    shell:
        r"""
        # Write out step_001.txt, step_002.txt, ..., step_000n.txt
        # in output directory `processed_dir`
        mkdir {output}
        script.py ...
        """

rule process:
    input:
        indir= 'processed_dir',
    output:
        out= 'processed.txt',
    shell:
        r"""
        process.py -in {input.indir}/step_*.txt -out {output.out}
        """

How to define parameters for a snakemake rule with expand input

I have input files in this format:
dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam
I would like to run a command on each file and merge the results into a CSV file.
The command returns an integer for a given filename.
$ samtools view -c -F1 dataset1/file1.bam
200
I would like to run the command and merge the output for each file into the following csv file:
file1,200,100
file2,400,300
I can do this without using expand in the input, using the append operator >>, but to avoid the possible file corruption that >> can lead to, I would like to use >.
I tried something like this, which did not work because of the wildcards.param2 part:
rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        RCT=$(samtools view -c -F1 {input.in1})
        RCD=$(samtools view -c -F1 {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """
I am aware that the input is no longer a single file but a list of files created by expand. Therefore I defined a function to work on a list input, but it is still not quite right:
def get_read_count(infiles):
    return [os.popen("samtools view -c -F1 " + infile).read() for infile in infiles]

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
    shell:
        """
        printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
        """
What is the best practice for using the wildcards inside the input file IDs when the input file list is defined with expand?
Edit:
I can get the expected result with expand if I use an external bash script, such as script.sh
for INF in "${@}"; do
    IN1=${INF}
    IN2=${IN1%.bam}_rmd.bam
    LIB=$(basename ${IN1%.*} | cut -d_ -f1)
    RCT=$(samtools view -c -F1 ${IN1})
    RCD=$(samtools view -c -F1 ${IN2})
    printf "${LIB},${RCT},${RCD}\n"
done
with
params: script="script.sh"
shell:
    """
    bash {params.script} {input} > {output}
    """
but I am interested in learning whether there is an easier way to get the same output using only Snakemake.
Edit 2:
I can also do it in the shell directive instead of a separate script,
rule collect_rc_results:
    input:
        in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
        in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        for INF in {input}; do
            IN1=${{INF}}
            IN2=${{IN1%.bam}}_rmd.bam
            LIB=$(basename ${{IN1%.*}} | cut -d_ -f1)
            RCT=$(samtools view -c -F1 ${{IN1}})
            RCD=$(samtools view -c -F1 ${{IN2}})
            printf "${{LIB}},${{RCT}},${{RCD}}\n"
        done > {output}
        """
and thus get the expected files. However, I would be interested to hear if anyone has a more elegant or "best practice" solution.
I don't think there is anything really wrong with your current solution, but I would be more inclined to use a run directive with shell functions to perform your loop.
@bli's suggestion to use temp files is also good, especially if the intermediate step (samtools in this case) is long-running; you can get tremendous wall-clock gains from parallelizing those computations. The downside is that you will be creating lots of tiny files.
I noticed that your inputs are fully qualified through expand, but based on your example I think you want to leave param1 as a wildcard. Assuming PARS2 is a list, it should be safe to zip in1, in2 and PARS2 together. Here's my take (written but not tested):
rule collect_rc_results:
    input:
        in1=expand("{param1}/{param2}.bam", param2=PARS2, allow_missing=True),
        in2=expand("{param1}/{param2}_rmd.bam", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    run:
        with open(output[0], 'w') as outfile:
            for infile1, infile2, parameter in zip(input.in1, input.in2, PARS2):
                # I don't usually use shell(); read=True should return stdout, strip to be safe
                RCT = shell(f'samtools view -c -F1 {infile1}', read=True).strip()
                RCD = shell(f'samtools view -c -F1 {infile2}', read=True).strip()
                outfile.write(f'{parameter},{RCT},{RCD}\n')
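For completeness, the temp-file route could look roughly like the sketch below (untested; the .counts naming and the count_one rule are made up). Each pair of samtools calls becomes its own small temporary file, so the counting jobs can run in parallel, and the merge rule simply concatenates them:

rule count_one:
    input:
        bam='{param1}/{param2}.bam',
        rmd='{param1}/{param2}_rmd.bam'
    output:
        # temp() so the tiny per-sample files are cleaned up after merging
        temp('{param1}/{param2}.counts')
    shell:
        """
        RCT=$(samtools view -c -F1 {input.bam})
        RCD=$(samtools view -c -F1 {input.rmd})
        echo "{wildcards.param2},${{RCT}},${{RCD}}" > {output}
        """

rule collect_rc_results:
    input:
        expand('{param1}/{param2}.counts', param2=PARS2, allow_missing=True)
    output:
        '{param1}_merged.csv'
    shell:
        'cat {input} > {output}'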

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 pear
1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #"/home/lamma/env-export/antismash.yaml"
    run:
        """
        set +u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it outside
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set +u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set +u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the YAML file looks like this, in case it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
  - Unmap_41
separator: _
I know the error can occur when you use certain flags in Snakemake, but I don't believe I am using those flags. The command being submitted to run the Snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear I think you want to use the shell directive instead of run. With run you execute Python code, which in this case does nothing, as you simply "execute" a string, so you get no error and no file is produced.
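A corrected version of that rule might look roughly like this (same command, just moved under shell):

rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        """
        set +u; source activate antismash; set -u
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """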

Snakemake - How to use every line of input file as wildcard

I am pretty new to using Snakemake, and I have looked around on SO to see if there is a solution for the problem below. I am very close to a solution, but not quite there yet.
I have a single-column file containing a list of SRA IDs, and I want to use Snakemake to define my rules such that every SRA ID from that file becomes a parameter on the command line.
#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647
Snakefile below:
from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil

configfile: "config.json"
data_dir = os.getcwd()
units_table = pd.read_table("Samples.txt")
samples = list(units_table.Samples.unique())
#print(samples)

rule all:
    input:
        expand("out/{sample}.fastq.gz", sample=samples)

rule clean:
    shell: "rm -rf .snakemake/"

include: 'rules/download_sample.smk'
download_sample.smk
rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
I have tried many different variants of the above code, but somewhere I am getting it wrong.
What I want: for every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to be executed:
parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip
This is the error I get
snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'
Thanks in advance
It seems to me that what you need is to access the sample wildcard using the wildcards object:
rule all:
    input: expand("out/{sample}_fastq.gz", sample=samples)

rule download_sample:
    output:
        "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip "
The first solution could be to use the run: section of the rule instead of shell:. This allows you to employ Python code:
rule download_sample:
    # ...
    run:
        for input_file in input:
            shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")
This straightforward solution, however, is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words, to produce one out/{sample}_fastq.gz file you need a single {sample}. The best solution would be to reduce your rule to one that makes a single file:
rule download_sample:
    input: "{sample}"
    output: "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
The rule all now requires all targets; the rule download_sample downloads a single sample, and the Snakemake workflow does the rest: it constructs a graph of dependencies and creates one instance of download_sample per sample. Moreover, if you wish, it can run these rules in parallel.
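To check and run this, a dry run followed by a parallel run would look something like the following (standard Snakemake flags; adjust the core count to your machine):

snakemake -np            # dry run: should list one download_sample job per SRA id
snakemake --cores 16 -p  # real run; independent download_sample jobs run in parallel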

Snakemake doesn't execute bash command

I'm trying to run a little pipeline in Snakemake around a piece of software that filters good reads in files from an RNA-seq run.
This is my code:
SAMPLES = ['ZN21_S1', 'ZN22_S2', 'ZN27_S3', 'ZN28_S4', 'ZN29_S5', 'ZN30_S6']

rule all:
    input:
        expand("SVA-{sample}_L001_R{read_no}.fastq.gz", sample=SAMPLES, read_no=['1', '2'])

rule fastp:
    input:
        reads1="SVA-{sample}_L001_R1.fastq.gz",
        reads2="SVA-{sample}_L001_R2.fastq.gz"
    output:
        reads1out="out/SVA-{sample}_L001_R1.fastq.gz.good",
        reads2out="out/SVA-{sample}_L001_R2.fastq.gz.good"
    shell:
        "fastp -i {input.reads1} -I {input.reads2} -o {output.reads1out} -O {output.reads2out}"
All samples (as symbolic links) are in the same folder, and I only get the message "Nothing to be done".
What am I not seeing?
In your example, the target files in rule all are supposed to match rule fastp's output files, not its input files as in your current setup. As your code stands, the target files in rule all already exist, hence the message "Nothing to be done" when executing it.
rule all:
    input:
        expand("out/SVA-{sample}_L001_R{read_no}.fastq.gz.good", sample=SAMPLES, read_no=['1', '2'])
