Why is one Snakemake rule always skipped or ignored? - python

I am trying to fix a Snakefile. There are two rules (see the code below); each works on its own, but when both are kept only the rule prernaseqc runs.
It seems that Snakemake completely ignores the other rule.
I tried touching the file files_to_rnaseqc.txt, etc., but it does not help. Why?
Any ideas would be appreciated.
import os
from os import listdir
from os.path import join
from snakemake.utils import R

configfile: "run.json"

workpath = config['project']['workpath']
samples = sorted(list(config['project']['units'].keys()))

star_dir = "STAR_files"
bams_dir = "bams"
log_dir = "logfiles"
rseqc_dir = "RSeQC"
kraken_dir = "kraken"
preseq_dir = "preseq"
pfamily = 'rnaseq'
rule prernaseqc:
    input:
        expand(join(workpath,bams_dir,"{name}.star_rg_added.sorted.dmark.bam"), name=samples)
    output:
        out1=join(workpath,bams_dir,"files_to_rnaseqc.txt")
    priority: 2
    params:
        rname='pl:prernaseqc',batch='--mem=4g --time=04:00:00'
    run:
        with open(output.out1, "w") as out:
            out.write("Sample ID\tBam file\tNotes\n")
            for f in input:
                out.write("%s\t" % f)
                out.write("%s\t" % f)
                out.write("%s\n" % f)
            out.close()
rule rnaseqc:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt")
    output:
        join(workpath,"STAR_QC")
    priority: 2
    params:
        rname='pl:rnaseqc',
        batch='--mem=24g --time=48:00:00',
        bwaver=config['bin'][pfamily]['tool_versions']['BWAVER'],
        rrnalist=config['references'][pfamily]['RRNALIST'],
        rnaseqcver=config['bin'][pfamily]['RNASEQCJAR'],
        rseqcver=config['bin'][pfamily]['tool_versions']['RSEQCVER'],
        gtffile=config['references'][pfamily]['GTFFILE'],
        genomefile=config['references'][pfamily]['GENOMEFILE']
    shell: """
        module load {params.bwaver}
        module load {params.rseqcver}
        var="{params.rrnalist}"
        if [ $var == "-" ]; then
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -o {output}
        else
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -rRNA {params.rrnalist} -o {output}
        fi
        """

Snakemake, by design, uses the output files listed in the first rule of the file as the target files (i.e. the files that need to be created). Hence, in your case, whichever rule happens to be first gets executed, while the other remains unexecuted.
You need to specify a target rule that lists all desired output files and place it first in the Snakefile. It is customary to name it rule all.
rule all:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt"),
        join(workpath,"STAR_QC")
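With rule all placed at the top, a quick way to confirm that both downstream rules are now scheduled is a dry run; a hypothetical invocation, assuming the Snakefile and run.json sit in the working directory:

snakemake -n          # dry run: prernaseqc and rnaseqc should both appear in the job list
snakemake --cores 4   # run the workflow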

Related

Assertion error or snakemake: error: argument --snakefile/-s: expected one argument when working with Snakemake pipeline

I am trying to run a Snakemake pipeline, but it either runs into an assertion error or snakemake is unable to find the snakefile for some reason. I am running the pipeline through a Snakefile, a config.yml file, and a bash script.
Code:
Config.yml file:
REPO_DIR="/path/to/pipeline"
REF_FASTA ="$REPO_DIR/data/genome/sacCer3.fasta"
FASTQ_DIR="$REPO_DIR/pipelinetest/fastq"
OUTPUT_DIR="$REPO_DIR/pipelineoutput"
ANC_DIR="$REPO_DIR/pipelineanc"
LOG_FILE="$OUTPUT_DIR/00_logs/pipeline.log"
SNAKE_FILE="$REPO_DIR/workflow/Snakefile.py"
CONFIG_FILE="$REPO_DIR/config/config.yml"
cd $REPO_DIR
Bash script:
#!/bin/bash
# activate conda env
source activate pipeline_env
# run the pipeline
snakemake --cores --snakefile snakefile=$SNAKE_FILE --configfile snakefile config_file=$CONFIG_FILE \
--config output_dir=$OUTPUT_DIR fastq_dir=$FASTQ_DIR anc_dir=$ANC_DIR ref_fasta=$REF_FASTA\
--use-conda --conda-prefix="$HOME/.snakemake/conda"
echo -e "\nDONE!\n"
Part of snakefile:
import os
import json
from datetime import datetime

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# discover input files using path from run config
SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))

# read output dir path from run config
OUTPUT_DIR = config['output_dir']

# Project name and date for bam header
SEQID='pipeline_align'

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule
rule all:
    input:
        f'{OUTPUT_DIR}/DONE.txt'

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #

#
# export the current run configuration in JSON format
#
rule export_run_config:
    output:
        path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
    run:
        with open(output.path, 'w') as outfile:
            json.dump(dict(config), outfile, indent=4)

#
# make a list of discovered samples
#
rule list_samples:
    output:
        f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
    shell:
        "echo -e '{}' > {{output}}".format('\n'.join(SAMPLES))

#
# copy the supplied reference genome fasta to the pipeline output directory for reference
#
rule copy_fasta:
    input:
        config['ref_fasta']
    output:
        f"{OUTPUT_DIR}/01_ref_files/{os.path.basename(config['ref_fasta'])}"
    shell:
        "cp {input} {output}"

rule index_fasta:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.fai"
    conda:
        'envs/main.yml'
    shell:
        "samtools faidx {input}"

rule create_ref_dict:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}".rstrip('fasta') + 'dict'
    conda:
        'envs/main.yml'
    shell:
        "picard CreateSequenceDictionary -R {input}"

#
# create a BWA index from the copied fasta reference genome
#
rule create_bwa_index:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.amb",
        f"{rules.copy_fasta.output}.ann",
        f"{rules.copy_fasta.output}.bwt",
        f"{rules.copy_fasta.output}.pac",
        f"{rules.copy_fasta.output}.sa",
    conda:
        'envs/main.yml'
    shell:
        "bwa index {input}"
And then I start putting the ancestor and sample files through the pipeline. However, the problem occurs before the Snakefile is even executed. I've tried reinstalling the git repo, but to no avail. I've also tried echoing the file path in the bash script, but the Snakefile still could not be found. I am submitting the bash script to a cluster, and it fails almost immediately.
How do I make sure the Snakefile is recognized through the bash script?
Error:
assert v is not None
AssertionError
or
snakemake: error: argument --snakefile/-s: expected one argument
I am using mamba as a package manager.
The shell call should be changed to this:
snakemake --cores --snakefile "$SNAKE_FILE" --configfile "$CONFIG_FILE" ...
Where ... is the rest of the command. The main problem is that --snakefile (or -s) expects a plain string path to the Snakefile, without any further keywords such as snakefile=. Similarly, --configfile expects a path, not further keywords.
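In full, the corrected call would look roughly like this (a sketch built from the variables defined in the asker's config script; those variables must be exported or sourced into the shell that runs the command):

snakemake --cores \
    --snakefile "$SNAKE_FILE" \
    --configfile "$CONFIG_FILE" \
    --config output_dir=$OUTPUT_DIR fastq_dir=$FASTQ_DIR anc_dir=$ANC_DIR ref_fasta=$REF_FASTA \
    --use-conda --conda-prefix="$HOME/.snakemake/conda"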

Snakemake: Constrain output to list of files

I have a dictionary of {"file" -> "url"} mappings and would like to download these files using Snakemake.
The corresponding download rule looks like this:
rule download_file:
    output:
        file="{file}"
    params:
        url=lambda wildcards: files_to_url_dict[wildcards.file],
    shell: """wget "{params.url}" -O "{output.file}" """
However, the rule's output now matches any possible file.
In other words, I cannot use any other rules any more.
How do I constrain the output to the file names in files_to_url_dict.keys()?
The best way is to find a pattern in the filenames of the keys in your dictionary, and to employ this information:
rule download_file:
    output:
        file="prefix{wildcard1}affix{wildcard2}suffix"
    params:
        url=lambda wildcards: files_to_url_dict[f"prefix{wildcards.wildcard1}affix{wildcards.wildcard2}suffix"],
    shell: """wget "{params.url}" -O "{output.file}" """
This is the recommended way: passing around magic dictionaries is not an idiomatic solution in Snakemake, whose power lies in deriving information from the directory structure based on known patterns.
If you don't know these prefix/affix/suffix, but know some restrictions that may distinguish the files to download from other files, you may use wildcard_constraints:
rule download_file:
    output:
        file="{file}"
    wildcard_constraints:
        file=<provide your regex here>
    params:
        url=lambda wildcards: files_to_url_dict[wildcards.file],
    shell: """wget "{params.url}" -O "{output.file}" """
See the Snakemake documentation for more about wildcard constraints.
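If the only thing that distinguishes the downloadable files is that their names are exactly the keys of files_to_url_dict, one possible constraint (a sketch, not the only option) is a regex alternation over the escaped keys:

import re

rule download_file:
    output:
        file="{file}"
    wildcard_constraints:
        # {file} may only match one of the dictionary keys
        file='|'.join(re.escape(k) for k in files_to_url_dict)
    params:
        url=lambda wildcards: files_to_url_dict[wildcards.file],
    shell: """wget "{params.url}" -O "{output.file}" """

The same escaped-alternation trick appears in the wildcard_constraints answer further down this page.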
I think your problem is really that your download_file rule can match any output. To fix this, you need to make this rule's output less general. One strategy is to put the output of download_file in its own directory:
rule download_file:
    ...
    output:
        file="download_file/{file}"
    ...
Other rules can then use the downloaded file as input by referring to the input file string:
rule gzip:
    input:
        file="download_file/{file}"
    output:
        gzipped="gzipped/{file}"
    shell:
        "gzip -c {input.file} > {output.gzipped}"
Or the output of download_file directly:
rule gzip:
    input:
        file=rules.download_file.output.file
    ...
Here's a short working example:
import shlex

files_to_url_dict = {
    "sample1": "https://storage.googleapis.com/gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam",
    "sample2": "https://storage.googleapis.com/gatk-test-data/wgs_ubam/NA12878_20k/NA12878_B.bam",
}

rule collect_reads:
    input:
        sam = ["uncompressed/{file}.sam".format(file=file) for file in files_to_url_dict.keys()]
    output:
        concat_reads = "collected/all_reads.txt"
    params:
        shlex_input = lambda wldc, input: ' '.join(map(shlex.quote, input.sam))  # Handle samples/files with spaces
    shell:
        """
        cat {params.shlex_input} > '{output.concat_reads}'
        """

rule download_file:
    output:
        file = "download_file/{file}"
    params:
        url = lambda wildcards: files_to_url_dict[wildcards.file],
    shell:
        """
        wget '{params.url}' -O '{output.file}'
        """

rule bam_to_sam:
    input:
        file = rules.download_file.output.file,
    output:
        sam = temp("uncompressed/{file}.sam")
    shell:
        """
        samtools view '{input.file}' > '{output.sam}'
        """
However, the output rule does match any possible file now. In other words, I cannot use any other rules any more.
How do I constrain the output to the file names in files_to_url_dict.keys()?
I don't understand what you mean by that. What is it in the example below that doesn't work for you?
files2url = {'sample1':'http://foo.txt',
             'sample2':'http://bar.txt'}

rule all:
    input:
        expand('{sample}.txt', sample= files2url.keys()),

rule download_file:
    output:
        sample= '{sample}.tmp',
    params:
        url= lambda wc: files2url[wc.sample],
    shell:
        r"""
        wget "{params.url}" -O "{output.sample}"
        """

rule copy_file:
    input:
        '{sample}.tmp',
    output:
        '{sample}.txt',
    shell:
        r"""
        cp {input} {output}
        """

snakemake checkpoint calling variable not defined

I have the below snakefile with checkpoints. I am trying to run this for 2 samples (defined as RUNS). However, every time I try I'm getting an additional variable included. Any thoughts on how to resolve this? Thank you.
import os
from tempfile import TemporaryDirectory

configfile: "config/CONFIG.yaml"

DATA_DIR = config["data_dir"]
RESULTS_DIR = config["results_dir"]
DB_DIR=config["db_dir"]
RUNS=["S1_select", "S3_select"]
BARCODES=config["no_barcode"]

rule all:
    input: expand(os.path.join(RESULTS_DIR, "basecalled/{run}/{barcode}.fastq.gz"), run=RUNS, barcode=BARCODES)

checkpoint guppy_gpu_basecall:
    input: os.path.join(DATA_DIR, "multifast5/{run}")
    output: directory(os.path.join(RESULTS_DIR, "basecalled/{run}")) # folder with many files
    log: os.path.join(RESULTS_DIR, "basecalled/{run}/basecalling")
    threads: config["guppy_gpu"]["cpu_threads"]
    shell:
        """
        run_guppy
        """

rule intermediate_basecalling:
    input: os.path.join(RESULTS_DIR, "basecalled/{run}/{i}.fastq.gz")
    output: os.path.join(RESULTS_DIR, "basecalled/{run}/no_nobarcode/{i}.fastq.gz")
    log: os.path.join(RESULTS_DIR, "basecalled/{run}/no_barcode_{i}")
    shell:
        """
        (date &&\
        ln -s {input} {output} &&\
        date) 2> >(tee {log}.stderr) > >(tee {log}.stdout)
        """

def aggregate_dummy_basecalling(wildcards):
    checkpoint_output = checkpoints.guppy_gpu_basecall.get(**wildcards).output[0]
    return expand(os.path.join(RESULTS_DIR, "basecalled/{run}/no_nobarcode/{id}.fastq.gz"),
                  run=wildcards.run,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq.gz")).i)

rule merge_individual_fastq_per_barcode:
    input: aggregate_dummy_basecalling
    output: os.path.join(RESULTS_DIR, "basecalled/{run}/{barcode}/{barcode}.fastq.gz")
    shell:
        """
        date
        cat $(find $(dirname {output}) -name "*.fastq.gz" | sort) > {output}
        touch {output}
        date
        """
I'm getting the following error:
Missing input files for rule guppy_gpu_basecall:
data/multifast5/S1_select/no_barcode.fastq.gz
Thank you for your pointers!

Snakemake ambiguity

I have an ambiguity error and I can't figure out why it happens or how to solve it.
Defining the wildcards:
rule all:
    input:
        xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
        runfolder_xls = expand("{runfolder}.xlsx", runfolder = config["runfolder"])
Actual rules:
rule sample_report:
    input:
        vcf = "vcfs/{sample}.annotated.vcf",
        cov = "stats/{sample}.coverage.gz",
        mod_bed = "tmp/mod_ref_{sample}.bed",
        nirvana_g2t = "/mnt/storage/data/NGS/nirvana_genes2transcripts"
    output:
        "reports/{sample}.xlsx"
    params:
        get_nb_samples()
    log:
        "logs/{sample}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_sample_report.py -v {input.vcf} -c {input.cov} -r {input.mod_bed} -n {input.nirvana_g2t} -r {rule};
        exitcode=$? ;
        if [[ {params} > 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.sample}
        elif [[ {params} == 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r sample_mode -n {wildcards.sample}
        else
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e 1 -r {rule} -n {wildcards.sample}
        fi
        """
rule runfolder_report:
    input:
        sample_sheet = "SampleSheet.csv"
    output:
        "{runfolder}.xlsx"
    log:
        "logs/{runfolder}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
        """
Config file:
runfolder: "CP0340"
samples: ['C014044p', 'C130157', 'C014040p', 'C014054b-1', 'C051198-A', 'C014042p', 'C052007W-C', 'C051198-B', 'C014038p', 'C052004-B', 'C051198-C', 'C052004-C', 'C052003-B', 'C052003-A', 'C052004-A', 'C052002-C', 'C052005-C', 'C052002-A', 'C130157N', 'C052006-B', 'C014063pW', 'C014054b-2', 'C052002-B', 'C052006-C', 'C052007W-B', 'C052003-C', 'C014064bW', 'C052005-B', 'C052006-A', 'C052005-A']
Error:
$ snakemake -n -s ../niles/Snakefile --configfile logs/CP0340_config.yaml
Building DAG of jobs...
AmbiguousRuleException:
Rules runfolder_report and sample_report are ambiguous for the file reports/C014044p.xlsx.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
runfolder_report: runfolder=reports/C014044p
sample_report: sample=C014044p
Expected input files:
runfolder_report: SampleSheet.csv
sample_report: vcfs/C014044p.annotated.vcf stats/C014044p.coverage.gz tmp/mod_ref_C014044p.bed /mnt/storage/data/NGS/nirvana_genes2transcripts
Expected output files:
runfolder_report: reports/C014044p.xlsx
sample_report: reports/C014044p.xlsx
If I understand Snakemake correctly, the wildcards in the rules are defined by my all rule, so I don't understand why the runfolder_report rule tries to produce reports/C014044p.xlsx as an output, or why the output has a sample name instead of the runfolder name (as defined in the config file).
As the error message suggests, you could assign a distinct prefix to the output of each rule. So your original code will work if you replace {runfolder}.xlsx with, e.g., "runfolder/{runfolder}.xlsx" in rule all and in runfolder_report. Alternatively, constrain the wildcards (my preferred solution) by adding before rule all something like:
import re

wildcard_constraints:
    sample= '|'.join([re.escape(x) for x in config["samples"]]),
    runfolder= re.escape(config["runfolder"]),
The reason for this is that snakemake matches input and output strings using regular expressions (the fine details of how it's done, I must admit, escape me...)
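A rough illustration of that matching in plain Python (not Snakemake's exact internals; by default every wildcard matches the regex .+) shows why the generic {runfolder}.xlsx pattern also swallows the reports/ prefix:

import re

# Approximate regexes for the two output patterns:
sample_pattern    = re.compile(r"reports/(?P<sample>.+)\.xlsx")
runfolder_pattern = re.compile(r"(?P<runfolder>.+)\.xlsx")

target = "reports/C014044p.xlsx"
print(sample_pattern.fullmatch(target).group("sample"))        # -> C014044p
print(runfolder_pattern.fullmatch(target).group("runfolder"))  # -> reports/C014044p

Constraining the wildcards (or adding distinct reports/ vs runfolder/ prefixes) removes exactly this overlap.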
Ok here is my solution:
rule runfolder_report:
    input:
        "SampleSheet.csv"
    output:
        expand("{runfolder}.xlsx", runfolder = config["runfolder"])
    params:
        config["runfolder"]
    log:
        expand("logs/{runfolder}.log", runfolder = config["runfolder"])
    shell: """
        set +e ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {params} -s {input} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {params}
        """
However, I still don't understand why it raised errors, as I know it worked previously.

`run` block in snakemake resets config value?

For example, I have a Python script, snakemake.py:
from snakemake import snakemake

cfg = {'a': 'aaaa', 'b': 'bbbb', 'c': 'cccc'}

snakemake(
    'Snakefile',
    targets=['all'],
    printshellcmds=True,
    forceall=True,
    config=cfg,
    # configfile=config,
    keep_target_files=True,
    keep_logger=False)
Snakefile looks like this:
print(config)
print('------------------------------------------------------------------------------------------')

rule a:
    output:
        'a.out'
    shell:
        "echo %s ; "
        "touch {output[0]}" % config['a']

rule b:
    output:
        'b.out'
    shell:
        "echo %s ; touch {output[0]}" % config['b']

rule c:
    output:
        'c.out'
    run:
        print(config['c'])
        import os
        os.system('touch ' + output[0])

rule all:
    input:
        'a.out', 'b.out', 'c.out'
When I run python snakemake.py, I get an error:
{'a': 'aaaa', 'c': 'cccc', 'b': 'bbbb'}
------------------------------------------------------------------------------------------
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 a
1 all
1 b
1 c
4
rule c:
output: c.out
jobid: 1
{}
------------------------------------------------------------------------------------------
KeyError in line 8 of /Users/zech/Desktop/snakemake/Snakefile:
'a'
File "/Users/zech/Desktop/snakemake/Snakefile", line 8, in <module>
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
When I remove c.out from rule all, it runs perfectly fine. It looks like every run block in the rules resets the config passed to the snakemake function to empty. Isn't that weird behavior? Is there any workaround?
I am using snakemake version 3.11.2 (installed from bioconda channel of anaconda) on latest OSX.
NOTE: It runs fine when I run the snakemake command line snakemake -p --keep-target-files all --config a="aaaa" b="bbb" c="cccc". So this looks like a problem with the API.
What is your current version of Snakemake?
I use snakemake/3.11.2 and I have no problem with your script.
But you have to know that the params section of each rule is the right way to pass configuration parameters into a rule.
In your script it should look like this:
rule a:
    params:
        name = config['a']
    output:
        'a.out'
    shell:
        "echo {params.name} ; "
        "touch {output}"

rule b:
    params:
        name = config['b']
    output:
        'b.out'
    shell:
        "echo {params.name} ;"
        "touch {output}"

rule c:
    params:
        name = config['c']
    output:
        'c.out'
    run:
        print(params.name)
        import os
        os.system('touch ' + output[0])
