snakemake expansion no input files - python

I have several configuration lists and I need every combination of their values to run a Python script:
versions = ['lg', 'sm']
start_time = ['0', '1']
end_time = ['2']
What I want is snakemake to do this for me:
python my_script.py -v lg -s 0 -e 2 > lg_0_2.out
python my_script.py -v lg -s 1 -e 2 > lg_1_2.out
python my_script.py -v sm -s 0 -e 2 > sm_0_2.out
python my_script.py -v sm -s 1 -e 2 > sm_1_2.out
but I can't seem to figure out how to do this in snakemake. Any ideas?

Snakemake has an expand() function that builds the outer product of its arguments, which is the operation you are describing. Typically, you list the expanded output file names as the input of the first rule (the default target, rule all below), and then provide a rule (myrule below) that parses such a file name into wildcards and uses them to build the command that generates the file. In code, it would go something like:
Snakefile
versions = ['lg', 'sm']
start_time = ['0', '1']
end_time = ['2']
rule all:
    input:
        expand("{version}_{start}_{end}.out",
               version=versions, start=start_time, end=end_time)
rule myrule:
    output: "{version,[^_]+}_{start,[0-9]+}_{end,[0-9]+}.out"
    shell:
        """
        python my_script.py -v {wildcards.version} -s {wildcards.start} -e {wildcards.end} > {output}
        """
Running snakemake in the directory where this Snakefile resides would then generate the desired files.
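To make the "outer product" idea concrete, here is a minimal plain-Python sketch of what expand() does for this pattern. The helper name expand_sketch is hypothetical and this is not Snakemake's actual implementation. It only illustrates how the four target names arise from the three lists.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    # product() yields one tuple per combination, varying the last list fastest.
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


targets = expand_sketch(
    "{version}_{start}_{end}.out",
    version=["lg", "sm"],
    start=["0", "1"],
    end=["2"],
)
print(targets)
# ['lg_0_2.out', 'lg_1_2.out', 'sm_0_2.out', 'sm_1_2.out']
```

Each name in that list is then matched against the output pattern of myrule, which recovers the wildcard values for the shell command.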


snakemake temporary directories

Snakemake deletes output files that are marked temp(), but it does not seem to clean up anything when the temp() output is a directory, as shown below:
rule all:
    input:
        'final.txt',
checkpoint split_big_file:
    input: 'bigfile.txt'
    output: temp(directory('split_files'))
    shell: 'mkdir -p {output} ; split -l 5000 -d -e bigfile.txt {output}/part_'
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}'
def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.split_big_file.get(**wildcards).output[0]
    print(checkpoint_output)
    agg_inp = expand('copy_files/part_{num}.txt', num=glob_wildcards('split_files/part_{num}').num)
    print(agg_inp)
    return agg_inp
rule merge_small_files:
    input: aggregate_input
    output: 'final.txt'
    shell: 'cat {input} > {output}'
When I run the code shown above with a bigfile.txt that has several thousand lines, everything runs fine but the split_files directory is not empty.
$ wc -l final.txt
61177 final.txt
$ wc -l bigfile.txt
61177 bigfile.txt
$ ls copy_files/
$ ls split_files/
part_00 part_01 part_02 part_03 part_04
part_05 part_06 part_07 part_08 part_09
part_10 part_11 part_12
What I would like to see:
1. The copy_files directory should also be deleted (but apparently, since Snakemake cannot tell whether that directory contains files unrelated to the workflow, it does not delete directories by default).
2. The contents of the split_files directory (and preferably the directory itself; see point 1 above) should be deleted.
I cannot reproduce it:
rule all:
    input:
        "a.txt"
rule first:
    output:
        temp(directory("dir1"))
    shell:
        "mkdir {output}; touch {output}/a.txt; sleep 5"
rule second:
    input:
        "dir1"
    output:
        "a.txt"
    shell:
        "touch {output}"
What version of Snakemake are you using? Is the output directory perhaps listed under rule all? Snakemake assumes that the output you want is the input of your first rule (probably rule all), so it won't delete those files. Removing the directory from rule all would solve the issue.
However, I am just guessing, since you didn't provide a minimal reproducible example.
Edit:
Hmm... That should work! Here are two non-ideal solutions I could come up with:
We can trick Snakemake into re-evaluating the DAG and then deleting the folder, like this. However, I am not sure whether the files get deleted early enough for you (they might be very large).
rule merge_small_files:
    input: aggregate=aggregate_input, dummy='split_files'
    output: 'final.txt'
    shell: 'cat {input.aggregate} > {output}'
Or just delete each file after copying it; however, you will end up with an empty folder at the end:
rule copy_small_files:
    input: 'split_files/part_{num}'
    output: temp('copy_files/part_{num}.txt')
    shell: 'cp -f {input} {output}; rm {input}'
You can of course combine both solutions and have the best of both worlds; unfortunately, it is not very pretty to look at :(

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 pear
1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]
wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])
rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])
# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    run:
        """
        set+u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """
# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"
# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move
# annotating the metagenome with prodigal#. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """
# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
        #"/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the YAML file looks like this, if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear I think you want to use the shell directive instead of run. With run, the body is executed as Python code, and here that body is just a string literal. Evaluating a string does nothing, so you get no error and no output file is produced.
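The point about a bare string can be checked in plain Python. The sketch below is only an illustration: the file names in the string are hypothetical stand-ins, and it does not reproduce how Snakemake wraps a run block internally. It shows that such a body parses as a single string expression and that a function with only that body runs no command.

```python
import ast

# Hypothetical stand-in for the body of the `run:` block in rule pear.
run_body = '''
"""
pear -f reads_1.fastq -r reads_2.fastq -o merged -t 21
"""
'''

tree = ast.parse(run_body)
statement = tree.body[0]

# The whole body is one expression statement that holds a string constant.
is_bare_string = (
    len(tree.body) == 1
    and isinstance(statement, ast.Expr)
    and isinstance(statement.value, ast.Constant)
    and isinstance(statement.value.value, str)
)


def run_block():
    """
    pear -f reads_1.fastq -r reads_2.fastq -o merged -t 21
    """


# Calling the function starts no process; it simply returns None.
result = run_block()

print(is_bare_string)  # True
print(result)          # None
```

Switching the directive to shell makes Snakemake hand the same string to the shell, which is what actually launches pear.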

How to perform simple string operations in snakemake output

I am creating my first snakemake file, and I got to the point where I need to perform a simple string operation on the value of my output, so that my shell command works as expected:
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {output}'
I need to apply the split function to {output} so that only the name of the file up to the extension is used. I couldn't find anything in the docs or in related questions.
You could use the params field:
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    params:
        dir = 'out/genomes'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {params.dir}'
Alternative solution using wildcards:
rule all:
    input: 'out/genomes.msh'
rule sketch:
    input:
        '{file}.txt'
    output:
        '{file}.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {wildcards.file}'
Untested, but I think this should work.
The advantage over the params solution is that it generalizes better.
Best is to use params:
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    params:
        prefix=lambda wildcards, output: os.path.splitext(output[0])[0]
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o {params.prefix}'
It is always preferable to use params instead of using the run directive, because the run directive cannot be combined with conda environments.
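As a quick check of what that lambda computes, here is the same expression in plain Python. The helper name output_prefix and the second file name are hypothetical, used only for illustration.

```python
import os.path


def output_prefix(output_path):
    """Return the path with its last extension removed, as in the params lambda."""
    return os.path.splitext(output_path)[0]


print(output_prefix("out/genomes.msh"))     # out/genomes
print(output_prefix("out/genomes.v2.msh"))  # out/genomes.v2

# Slicing gives the same result here, but only because ".msh" is 4 characters.
print("out/genomes.msh"[:-4])               # out/genomes
```

splitext only strips the final extension, so it keeps working if the extension later changes length, whereas a hard-coded slice would not.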
Avoid duplicating text. Don't use params unless you convert your inputs/outputs to wildcards + extensions. Otherwise you're left with a rule that is hard to maintain.
    input:
        "{pathDIR}/{genome}.txt"
    output:
        "{pathDIR}/{genome}.msh"
    params:
        dir = '{pathDIR}/{genome}'
Otherwise, use Python's slice notation.
I couldn't seem to get slice notation to work in the params using the output wildcard. Here it is in the run directive.
from subprocess import call
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    run:
        callString="mash sketch -l " + str(input) + " -k 31 -s 100000 -o " + str(output)[:-4]
        print(callString)
        call(callString, shell=True)
Python underlies Snakemake. I prefer the "run" directive over the "shell" directive because I find it really unlocks a lot of that beautiful Python functionality. Accessing params and other objects works slightly differently than with the "shell" directive.
E.g.
callString=config["mpileup_samtoolsProg"] + ' view -bh -F ' + str(config["bitFlag"]) + ' ' + str(input.inputBAM) + ' ' + wildcards.chrB2M[1:]
A bit of a snippet of J.K. using the run directive.
All of the rules in my modules pretty much use the run directive
You could remove the extension within the shell command
rule sketch:
    input:
        'out/genomes.txt'
    output:
        'out/genomes.msh'
    shell:
        'mash sketch -l {input} -k 31 -s 100000 -o $(echo "{output}" | sed -e "s/.msh//")'
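One caveat with the sed pattern s/.msh// is that the dot matches any character and the pattern is not anchored to the end of the name. The sketch below reproduces that behaviour with Python's re module as an approximation of sed. The helper names and the second file name are hypothetical.

```python
import re


def strip_like_sed(path):
    """Approximate sed 's/.msh//': unescaped dot, first match only."""
    return re.sub(r".msh", "", path, count=1)


def strip_anchored(path):
    """Remove a literal '.msh' only at the end of the path."""
    return re.sub(r"\.msh$", "", path)


print(strip_like_sed("out/genomes.msh"))       # out/genomes
print(strip_like_sed("out/xmsh_genomes.msh"))  # out/_genomes.msh
print(strip_anchored("out/xmsh_genomes.msh"))  # out/xmsh_genomes
```

For the file name in the question both variants agree. The difference only shows up when "msh" also occurs earlier in the path.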

Inserting python code in a bash script

I've got the following bash script:
#!/bin/bash
while read line
do
    ORD=`echo $line | cut -c 7-21`
    if [[ -r ../FASTA_SEC/${ORD}.fa ]]
    then
        WCR=`fgrep -o N ../FASTA_SEC/$ORD.fa | wc -l`
        WCT=`wc -m < ../FASTA_SEC/$ORD.fa`
        PER1=`echo print $WCR/$WCT.*100 | python`
        WCTRIN=`fgrep -o N ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa" | wc -l`
        WCTRI=`wc -m < ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa"`
        PER2=`echo print $WCTRIN/$WCTRI.*100 | python`
        PER3=`echo print $PER1-$PER2 | python`
        echo $ORD $PER1 $PER2 $PER3 >> Log.txt
        if [ $PER2 -ge 30 -a $PER3 -lt 10 ]
        then
            mv ../FASTA_SEC/$ORD.fa ./TRASH/$ORD.fa
            mv ../FASTA_SEC_EDITED/$ORD"_Trimmed.fa" ./TRASH/$ORD"_Trimmed.fa"
        fi
    fi
done < ../READ/Data.txt
The $PER variables are floating-point numbers, as you might have noticed, so I cannot use them normally in the nested if conditional. I'd like to evaluate this condition in Python, but I have no clue how to do it within a bash script, and I don't know how to pass the values of $PER2 and $PER3 into Python. Could I write Python code directly in the same bash script, invoking Python somehow?
Thank you for your help, first time facing this.
You can use python -c CMD to execute a piece of python code from the command line. If you want bash to interpolate your environment variables, you should use double quotes around CMD.
You can return a result through the exit status by calling sys.exit, but keep in mind that bash treats an exit status of 0 as true (success) and a non-zero status as false, which is the reverse of Python's truth values.
So your code would be:
if python -c "import sys; sys.exit(not ($PER2 >= 30 and $PER3 < 10))"
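The inverted truth values are easy to get wrong, so here is a small sketch that runs the same one-liner through subprocess and inspects the exit status. The helper name exit_status and the sample numbers are hypothetical. The condition mirrors the script's -ge 30 and -lt 10 test.

```python
import subprocess
import sys


def exit_status(per2, per3):
    """Run the one-liner in a child interpreter and return its exit status."""
    code = f"import sys; sys.exit(not ({per2} >= 30 and {per3} < 10))"
    return subprocess.run([sys.executable, "-c", code]).returncode


# Condition holds: sys.exit(False) exits with status 0, which bash treats as true.
print(exit_status(35.5, 4.2))  # 0

# Condition fails: sys.exit(True) exits with status 1, which bash treats as false.
print(exit_status(12.0, 4.2))  # 1
```

This is why the condition is wrapped in not(...) before it is passed to sys.exit.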
It is possible to feed Python code to the standard input of python executable with the help of here document syntax:
variable=$(date)
python2.7 <<SCRIPT
print "The current date: %s" % "${variable}"
SCRIPT
In order to avoid parameter substitution (interpretation within the block), quote the first limit string: <<'SCRIPT'.
If you want to assign the output to a variable, use command substitution:
output=$(python2.7 <<SCRIPT
print "The current date: %s" % "${variable}"
SCRIPT
)
Note, it is not recommended to use back quotes for command substitution, as it is impossible to nest them, and the form $(...) is more readable.
Maybe this helps?
$ X=4; Y=7; Z=$(python -c "print($X * $Y)")
$ echo $Z
28
python -c "str" runs the string "str" as Python code.
But then why not rewrite it all in Python? Shell commands can be run conveniently with subprocess, which is part of Python's standard library, or with sh (a third-party package that has to be installed).
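As a rough idea of what the core of such a rewrite could look like, here is a sketch of the percentage and decision logic in plain Python. The function names and sample sequences are hypothetical. Unlike the original script, which counts every character of the file with wc -m, this sketch works on a string that is assumed to already hold the sequence text.

```python
def percent_n(sequence_text):
    """Percentage of characters in the text that are 'N'."""
    if not sequence_text:
        return 0.0
    return 100 * sequence_text.count("N") / len(sequence_text)


def should_trash(per2, per3):
    """Same test as the nested if: PER2 >= 30 and PER3 < 10, with floats allowed."""
    return per2 >= 30 and per3 < 10


raw = "ACGTNNNNNN"      # hypothetical untrimmed sequence
trimmed = "ACGNNNNN"    # hypothetical trimmed sequence

per1 = percent_n(raw)       # 60.0
per2 = percent_n(trimmed)   # 62.5
per3 = per1 - per2          # -2.5

print(per1, per2, per3)
print(should_trash(per2, per3))  # True
```

Reading the files and moving them to TRASH could then be done with the standard library as well, so no floating-point value ever has to pass through bash.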

Zip function for shell scripts

I'm trying to write a shell script that will make several targets into several different paths. I'll pass in a space-separated list of paths and a space-separated list of targets, and the script will make DESTDIR=$path $target for each pair of paths and targets. In Python, my script would look something like this:
for path, target in zip(paths, targets):
    exec_shell_command('make DESTDIR=' + path + ' ' + target)
However, this is my current shell script:
#! /bin/bash
packages=$1
targets=$2
target=
set_target_number () {
    number=$1
    counter=0
    for temp_target in $targets; do
        if [[ $counter -eq $number ]]; then
            target=$temp_target
        fi
        counter=`expr $counter + 1`
    done
}
package_num=0
for package in $packages; do
    package_fs="debian/tmp/$package"
    set_target_number $package_num
    echo "mkdir -p $package_fs"
    echo "make DESTDIR=$package_fs $target"
    package_num=`expr $package_num + 1`
done
Is there a Unix tool equivalent to Python's zip function or an easier way to retrieve an element from a space-separated list by its index? Thanks.
Use an array:
#!/bin/bash
packages=($1)
targets=($2)
if (("${#packages[@]}" != "${#targets[@]}"))
then
    echo 'Number of packages and number of targets differ' >&2
    exit 1
fi
for index in "${!packages[@]}"
do
    package="${packages[$index]}"
    target="${targets[$index]}"
    package_fs="debian/tmp/$package"
    mkdir -p "$package_fs"
    make "DESTDIR=$package_fs" "$target"
done
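For comparison with the Python snippet in the question, the same pairing logic can be sketched with zip. The function name build_make_commands and the package and target names are hypothetical. The sketch only builds the commands and does not run make.

```python
import shlex


def build_make_commands(packages, targets):
    """Pair space-separated packages and targets, like the array loop above."""
    package_list = packages.split()
    target_list = targets.split()
    if len(package_list) != len(target_list):
        raise ValueError("Number of packages and number of targets differ")
    commands = []
    for package, target in zip(package_list, target_list):
        package_fs = f"debian/tmp/{package}"
        commands.append(["make", f"DESTDIR={package_fs}", target])
    return commands


for command in build_make_commands("foo bar", "install install-dev"):
    print(shlex.join(command))
# make DESTDIR=debian/tmp/foo install
# make DESTDIR=debian/tmp/bar install-dev
```

The explicit length check matters in both versions, since zip would otherwise silently drop the unmatched entries.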
Here is a solution, where paths and targets are files with one entry per line:
paste -d ' ' paths targets | sed 's/^/make DESTDIR=/' | sh
paste is the shell equivalent of zip. sed prepends the make command to each line (using a regex), and the result is passed to sh to be executed.
There's no built-in way to do that in bash. You'll need to create two arrays from the input and then iterate with a counter, using the values from each.
