Missing input files from a directory with Snakemake - python

I'm trying to write a script for a pipeline, but I'm having trouble declaring the input of a rule from a directory.
Here are the relevant parts of my code:
rule taco:
    input:
        all_gtf = GTF_DIR + "path_samplesGTF.txt"
    output:
        taco_out = TACO_DIR
    shell:
        "taco_run -v -p 20 -o {output.taco_out} "
        "--filter-min-expr 1 --gtf-expr-attr RPKM {input.all_gtf}"

rule feelnc_filter:
    input:
        assembly = TACO_DIR + "assembly.gtf",
        annotation = GTF
    output:
        candidate_lncrna = FEELNC_FILTER + "candidate_lncrna.gtf"
    shell:
        "./FEELnc_filter.pl -i {input.assembly} -a {input.annotation} > {output.candidate_lncrna}"
This is my error:
MissingInputException in line 97 of /workdir/Snakefile:
Missing input files for rule feelnc_filter:
/workdir/pipeline-v01/TACO/assembly.gtf
Thank you!

Your pasted snippet is definitely shorter than 97 lines, so the line number in the exception refers to your full Snakefile rather than to the code shown here. A MissingInputException means that Snakemake expects an input file for a rule (here /workdir/pipeline-v01/TACO/assembly.gtf for rule feelnc_filter) that does not exist and that no rule in the workflow declares as an output, so Snakemake does not know how to create it.
Now we have the second problem: your workflow runs your own Perl script and an unknown taco_run executable, and I have no clue what these programs do. My guess is that taco_run writes its results into the directory you pass as -o {output.taco_out}, but the assembly.gtf file that the next rule needs is never declared as the output of any rule.
I advise you to run Snakemake with the --printshellcmds flag. This shows the exact commands being run, so you can try those commands separately and check that they really create the expected outputs.
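If taco_run indeed writes assembly.gtf inside the directory passed to -o, one possible fix is to declare that file as the output of rule taco, so Snakemake knows which rule produces feelnc_filter's input. A rough sketch only, under that assumption, reusing the GTF_DIR and TACO_DIR variables from the question:

    rule taco:
        input:
            all_gtf = GTF_DIR + "path_samplesGTF.txt"
        output:
            # Declare the actual file the downstream rule needs, not just the directory.
            assembly = TACO_DIR + "assembly.gtf"
        params:
            # taco_run is assumed to create and populate this directory itself.
            outdir = TACO_DIR
        shell:
            "taco_run -v -p 20 -o {params.outdir} "
            "--filter-min-expr 1 --gtf-expr-attr RPKM {input.all_gtf}"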

Related

How do I use this python function within the params section of my Snakemake rule?

I'm trying to figure out how to extract the read-group lane information from a fastq file, and then use this string within my GATK AddOrReplaceReadGroups Snakemake rule (below). I've written a short Python function (above the rule) to do this operation, but can't quite figure out how to actually use it in the rule so that I can access the string output (e.g., "ABCHGSX35:2") of the function on a given fastq file in my GATK shell command. Basically, I want to feed {params.PathToReadR1} into the extract_lane_info function, but can't figure out how to integrate them correctly.
Open to any solutions to the problem posed, or if there's an entirely different and more efficient way to achieve the same result (getting the read-group lane info as a param value), that'd be great, too. Thanks for any help.
import subprocess

def extract_lane_info(file_path):
    # Read the first header line of the gzipped fastq and take fields 3 and 4
    elements = subprocess.run(["zcat", file_path], stdout=subprocess.PIPE).stdout.split(b"\n")[0].split(b":")[2:4]
    # Extract the lane information (flowcell ID and lane) from those fields
    read_group = elements[0].decode().strip("'")
    string_after = elements[1].decode().strip("'")
    return read_group + ":" + string_after
rule AddOrReplaceReadGroups:
    input:
        "results/{species}/bwa/merged/{read}_pese.bam",
    output:
        "results/{species}/GATK/AddOrReplace/{read}.pese.bwa_mem.addRG.bam",
    threads: config["trimmmomatic"]["cpu"]
    log:
        "results/{species}/GATK/AddOrReplace/log/{read}_AddOrReplace.stderrs"
    message:
        "Running AddOrReplaceReadGroups on {wildcards.read}"
    conda:
        config["CondaEnvs"]
    params:
        ReadGroupID = lambda wildcards: final_Prefix_ReadGroup[wildcards.read],
        PathToReadR1 = lambda wildcards: final_Prefix_RawReadPathR1[wildcards.read],
        LIBRARY = lambda wildcards: final_Prefix_ReadGroupLibrary[wildcards.read],
    shell:
        "gatk AddOrReplaceReadGroups -I {input} -O {output} -ID {params.ReadGroupID}.1 -LB {params.LIBRARY} -PL illumina -PU {input.lane_info}:{params.ReadGroupID} -SM {params.ReadGroupID} --SORT_ORDER 'coordinate' --CREATE_INDEX true 2>> {log}"
Basically, {params.PathToReadR1} would be "path/to/file.fastq.gz", and I want this file to be fed into the extract_lane_info function, with the output of the function then used in the -PU section of the shell command (e.g., {params.lane_info}). I keep getting all kinds of errors as I've messed around with it, and I'm unsure how to move forward.
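A minimal sketch of the mechanism this seems to be reaching for: a params entry in Snakemake can itself be a function of the wildcards, so extract_lane_info can be called there and its result referenced in the shell command. The lane_info name below is made up for illustration; final_Prefix_ReadGroup, final_Prefix_ReadGroupLibrary and final_Prefix_RawReadPathR1 are the lookup dictionaries from the question:

    rule AddOrReplaceReadGroups:
        input:
            "results/{species}/bwa/merged/{read}_pese.bam",
        output:
            "results/{species}/GATK/AddOrReplace/{read}.pese.bwa_mem.addRG.bam",
        log:
            "results/{species}/GATK/AddOrReplace/log/{read}_AddOrReplace.stderrs"
        params:
            ReadGroupID = lambda wildcards: final_Prefix_ReadGroup[wildcards.read],
            LIBRARY = lambda wildcards: final_Prefix_ReadGroupLibrary[wildcards.read],
            # Call the helper on this sample's raw-read path; the lambda is
            # evaluated per job, so each read gets its own "flowcell:lane" string.
            lane_info = lambda wildcards: extract_lane_info(final_Prefix_RawReadPathR1[wildcards.read]),
        shell:
            "gatk AddOrReplaceReadGroups -I {input} -O {output} "
            "-ID {params.ReadGroupID}.1 -LB {params.LIBRARY} -PL illumina "
            "-PU {params.lane_info}:{params.ReadGroupID} -SM {params.ReadGroupID} "
            "--SORT_ORDER 'coordinate' --CREATE_INDEX true 2>> {log}"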

MD task with `snakemake`

I want to create a very simple pipeline for molecular dynamics simulation. The program (Amber) just wants 3 files as input and produces a lot of files, some of which I will need later. So my pipeline is extremely simple:
1. Check that *.in, *.prmtop and *.rst are in the folder (I guarantee there is only one file for each of these extensions) and warn if they are not present
2. Run the shell command (built from the names of all input files)
3. Check that *.out, mden, mdinfo and *.nc were produced
That's all. It's the standard way of working with this program: one folder, one task, short and simple file names based on file purpose, not on content.
I wrote a simple pipeline:
rule all:
    input: '{inp}.out'

rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'
And it doesn't work.
- All inputs in rule all must be explicit. (Why? Why can't it be a regex or a wildcard expression? If I see *.out in the folder and the exit status of the shell script was 0, that's it, the work is done.)
- I must use every wildcard from the input in the output, but I only want to use some of them in the shell command or in other rules.
- I must not expect files like mden with potentially "non-unique" names, because they could be changed by another task; but I know there will be only one task, and this is simply how my MD program works (yes, I know about Amber's -e and -inf keys, but that over-complicates a simple task).
So I would like to decide whether it is worth using snakemake for this or not. It's a very simple task, but I have already spent several hours on it; I see a lot of documentation and a lot of examples that I can't apply to my case. snakemake looks like exactly what I need, but I can't express this simple task in general terms with the framework. I don't want to specify explicit filenames, because I would lose flexibility: I want to run hundreds of simple tasks automatically, where only the input files differ. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how? Thank you!
Hopefully this will point you in the right direction.
If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.
If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.
import glob
import os

input_folder = config['amber_folder']

# We don't know the name of the input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]

name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'
out = os.path.join(output_folder, name + '.out')

rule all:
    input:
        out

rule amber:
    input:
        inp = inp,
        top = glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst = glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out = out,
        nc = os.path.join(output_folder, name + '.nc'),
        mden = os.path.join(output_folder, 'mden'),
        mdinfo = os.path.join(output_folder, 'mdinfo'),
    shell:
        r"""
        pmemd.cuda \
            -O \
            -i {input.inp} \
            -o {output.out} \
            -p {input.top} \
            -c {input.rst} \
            -r {input.rst} \
            -x {output.nc} \
            -ref {input.rst}
        """
Execute with:
snakemake -j 1 -C amber_folder='your-input-folder'
If you have many input folders, you could write a for-loop to execute the command above, but it is probably better to pass the list of inputs to snakemake and let it handle them.
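For the for-loop option, here is a rough, untested sketch of driving one Snakemake run per input folder from a small Python script; the folder names are placeholders:

    import subprocess

    # Hypothetical list of Amber input folders; in practice this could come
    # from glob.glob() or a configuration file.
    amber_folders = ["run_A", "run_B", "run_C"]

    for folder in amber_folders:
        # One independent Snakemake invocation per folder, mirroring the
        # command line shown above.
        subprocess.run(
            ["snakemake", "-j", "1", "-C", f"amber_folder={folder}"],
            check=True,  # abort the loop if a run fails
        )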

Snakemake "Missing files after X seconds" error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 pear
1
[Wed Dec 4 17:32:54 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 0
wildcards: sample=Unmap_41, extension=fastq
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile is the following:
workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        """
        set+u; source activate antismash; set -u ;
        pear -f {input.forward} -r {input.reverse} -o {output} -t 21
        """

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it outside
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta
        """

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    shell:
        """
        set+u; source activate antismash; set -u ;
        antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}
        """
I am running the pipeline on only one file at the moment, but the yaml file looks like this if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
  - Unmap_41
separator: _
I know the error can occur when you use certain flags in snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml -F --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
In rule pear I think you want to use the shell directive instead of run. With run you execute Python code, which in this case does nothing: you simply "execute" a string literal, so you get no error and no file is produced.
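A minimal sketch of what rule pear could look like with the shell directive, keeping the inputs, output and pear command from the question:

    rule pear:
        input:
            forward = f"{{sample}}{config['separator']}1.{{extension}}",
            reverse = f"{{sample}}{config['separator']}2.{{extension}}"
        output:
            "merged_reads/{sample}.{extension}"
        shell:
            """
            set +u; source activate antismash; set -u
            pear -f {input.forward} -r {input.reverse} -o {output} -t 21
            """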

calling variables for a rule individually and adding an independent environment for a specific rule

I need to run a snakemake rule on the cluster, and for some rules I need particular tools and libraries to be loaded, while these tools are specific to those rules and not needed by the others. In this case, how can I specify them within the snakemake rule? For example, for rule score I need to module load r/3.5.1 and export R_lib=/user/tools/software; currently I run these lines separately on the command line before running snakemake, but it would be great if there were a way to do this within the rule as an environment setup.
A second question: I have the following rule,
rule score:
    input:
        count = os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.tsv'),
        libsize = os.path.join(config['general']['paths']['outdir'], 'count_expression', '{sample}.size_tsv')
    params:
        result_dir = os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype = config['general']['paths']['cancertype'],
        sample_id = expand('{sample}', sample=samples['sample'].unique())
    output:
        files = os.path.join(config['general']['paths']['outdir'], 'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
    shell:
        'mkdir -p {params.result_dir};Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {params.sample_id} {input.count} {input.libsize}'
The actual shell command produced by the above snippet is:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
Whereas the expected behaviour is:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC MRT5T /cluster/projects/test/results/exp/MRT5T.tsv /cluster/projects/test/results/Exp/MRT5T.size.tsv
and, for the second sample:
mkdir -p /cluster/user/snakemake_test/results_april30/score;Rscript /cluster/home/user/Projects/R_scripts/scoretool.R /cluster/user/snakemake_test/results_april30/score DMC GNMS4 /cluster/projects/test/results/exp/GNMS4.tsv /cluster/projects/test/results/Exp/GNMS4.size.tsv
I need the values in sample_id (['GNMS4', 'MRT5T']) to be used separately, one per shell command, not both together in one command line.
Regarding your first question: you can put whatever module load or export commands you like in the shell section of a rule.
Regarding your second question, you should probably not use expand in the params section of your rule. In expand('{sample}', sample=samples['sample'].unique()) you are not actually using the value of the sample wildcard; you are generating a list of all unique values in samples['sample']. You probably just need to use wildcards.sample in the definition of your shell command instead of a params element.
If you want to run several instances of the score rule based on the possible values of sample, you need to "drive" this using another rule that wants the output of score as its input.
Note that to improve readability, you can use Python's multi-line strings (triple-quoted).
To sum up, you might try something like this:
rule all:
    input:
        expand(
            os.path.join(
                config['general']['paths']['outdir'],
                'score',
                '{sample}_bg_scores.tsv',
                '{sample}_tp_scores.tsv'),
            sample=samples['sample'].unique())

rule score:
    input:
        count = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.tsv'),
        libsize = os.path.join(
            config['general']['paths']['outdir'],
            'count_expression', '{sample}.size_tsv')
    params:
        result_dir = os.path.join(config['general']['paths']['outdir'], 'score'),
        cancertype = config['general']['paths']['cancertype'],
    output:
        files = os.path.join(
            config['general']['paths']['outdir'],
            'score', '{sample}_bg_scores.tsv', '{sample}_tp_scores.tsv')
    shell:
        """
        module load r/3.5.1
        export R_lib=/user/tools/software
        mkdir -p {params.result_dir}
        Rscript {config[general][paths][tool]} {params.result_dir} {params.cancertype} {wildcards.sample} {input.count} {input.libsize}
        """
onstart would work, I think. Note that dry runs don't trigger this handler, which is acceptable in your scenario.
onstart:
    shell("load tools")
A simple bash for loop should solve the problem. However, if you want each sample to be run as a separate rule, you would have to use the sample name as part of the output filename.
shell:
    '''
    for sample in {params.sample_id}
    do
        your command $sample
    done
    '''

Snakemake doesn't execute bash command

I'm trying to run a little pipeline in Snakemake for a program that filters the good reads from RNA-seq files.
This is my code:
SAMPLES = ['ZN21_S1', 'ZN22_S2', 'ZN27_S3', 'ZN28_S4', 'ZN29_S5', 'ZN30_S6']

rule all:
    input:
        expand("SVA-{sample}_L001_R{read_no}.fastq.gz", sample=SAMPLES, read_no=['1', '2'])

rule fastp:
    input:
        reads1 = "SVA-{sample}_L001_R1.fastq.gz",
        reads2 = "SVA-{sample}_L001_R2.fastq.gz"
    output:
        reads1out = "out/SVA-{sample}_L001_R1.fastq.gz.good",
        reads2out = "out/SVA-{sample}_L001_R2.fastq.gz.good"
    shell:
        "fastp -i {input.reads1} -I {input.reads2} -o {output.reads1out} -O {output.reads2out}"
All samples (as symbolic links) are in the same folder, and I only get the message "Nothing to be done".
What am I not seeing?
In your example, the target files in rule all are supposed to match rule fastp's output files, not its input files as in your current setup. As your code stands, the target files in rule all already exist, hence the message Nothing to be done when executing it.
rule all:
    input:
        expand("out/SVA-{sample}_L001_R{read_no}.fastq.gz.good", sample=SAMPLES, read_no=['1', '2'])
