I'm trying to write a log directive in my Snakefile, but I don't know what I'm missing. This is my code:
include: 'config.py'

rule all:
    input:
        expand(WORK_DIR + "/trimmed/TFB{sample}_R{read_no}.fastq.gz.good",
               sample=SAMPLE_TFB, read_no=['1', '2'])

rule fastp:
    input:
        R1 = SAMPLES_DIR + "/TFB{sample}_R1.fastq.gz",
        R2 = SAMPLES_DIR + "/TFB{sample}_R2.fastq.gz"
    output:
        R1out = WORK_DIR + "/trimmed/TFB{sample}_R1.fastq.gz.good",
        R2out = WORK_DIR + "/trimmed/TFB{sample}_R2.fastq.gz.good"
    log:
        log = WORK_DIR + "/logs/fastp/{sample}.html"
    shell:
        "fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
        -h {log.log}"
This is the error I get when executing snakemake:
SyntaxError in line 16 of /work/users/leboralli/trofoZikaLincRNAs/scripts/Snakefile:
Colon expected after keyword log. (Snakefile, line 16)
I tried many options and nothing worked.
The fastp software has a parameter for logging, -h, which writes an .html report. Without the log directive my code works fine.
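For reference, this is how the same rule looks with a plain, unnamed log entry (just a sketch for comparison; whether the named form log = ... is accepted may depend on the Snakemake version installed, so this is not a confirmed fix):

rule fastp:
    input:
        R1 = SAMPLES_DIR + "/TFB{sample}_R1.fastq.gz",
        R2 = SAMPLES_DIR + "/TFB{sample}_R2.fastq.gz"
    output:
        R1out = WORK_DIR + "/trimmed/TFB{sample}_R1.fastq.gz.good",
        R2out = WORK_DIR + "/trimmed/TFB{sample}_R2.fastq.gz.good"
    log:
        WORK_DIR + "/logs/fastp/{sample}.html"
    shell:
        "fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} -h {log}"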
Thanks in advance.
I am trying to run a Snakemake pipeline, but it either runs into an assertion error or snakemake is unable to find the snakefile for some reason. I am running the pipeline through a Snakefile, a config.yml file, and a bash script.
Code:
Config.yml file:
REPO_DIR="/path/to/pipeline"
REF_FASTA ="$REPO_DIR/data/genome/sacCer3.fasta"
FASTQ_DIR="$REPO_DIR/pipelinetest/fastq"
OUTPUT_DIR="$REPO_DIR/pipelineoutput"
ANC_DIR="$REPO_DIR/pipelineanc"
LOG_FILE="$OUTPUT_DIR/00_logs/pipeline.log"
SNAKE_FILE="$REPO_DIR/workflow/Snakefile.py"
CONFIG_FILE="$REPO_DIR/config/config.yml"
cd $REPO_DIR
Bash script:
#!/bin/bash
# activate conda env
source activate pipeline_env
# run the pipeline
snakemake --cores --snakefile snakefile=$SNAKE_FILE --configfile snakefile config_file=$CONFIG_FILE \
--config output_dir=$OUTPUT_DIR fastq_dir=$FASTQ_DIR anc_dir=$ANC_DIR ref_fasta=$REF_FASTA\
--use-conda --conda-prefix="$HOME/.snakemake/conda"
echo -e "\nDONE!\n"
Part of snakefile:
import os
import json
from datetime import datetime

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# discover input files using path from run config
SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))

# read output dir path from run config
OUTPUT_DIR = config['output_dir']

# Project name and date for bam header
SEQID = 'pipeline_align'

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule
rule all:
    input:
        f'{OUTPUT_DIR}/DONE.txt'

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #

#
# export the current run configuration in JSON format
#
rule export_run_config:
    output:
        path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
    run:
        with open(output.path, 'w') as outfile:
            json.dump(dict(config), outfile, indent=4)

#
# make a list of discovered samples
#
rule list_samples:
    output:
        f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
    shell:
        "echo -e '{}' > {{output}}".format('\n'.join(SAMPLES))

#
# copy the supplied reference genome fasta to the pipeline output directory for reference
#
rule copy_fasta:
    input:
        config['ref_fasta']
    output:
        f"{OUTPUT_DIR}/01_ref_files/{os.path.basename(config['ref_fasta'])}"
    shell:
        "cp {input} {output}"

rule index_fasta:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.fai"
    conda:
        'envs/main.yml'
    shell:
        "samtools faidx {input}"

rule create_ref_dict:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}".rstrip('fasta') + 'dict'
    conda:
        'envs/main.yml'
    shell:
        "picard CreateSequenceDictionary -R {input}"

#
# create a BWA index from the copied fasta reference genome
#
rule create_bwa_index:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.amb",
        f"{rules.copy_fasta.output}.ann",
        f"{rules.copy_fasta.output}.bwt",
        f"{rules.copy_fasta.output}.pac",
        f"{rules.copy_fasta.output}.sa",
    conda:
        'envs/main.yml'
    shell:
        "bwa index {input}"
And then I start putting the ancestor and sample files through the pipeline. However, the problem occurs before the Snakefile is ever executed. I've tried reinstalling the git repo, but to no avail. I've also tried echoing the file path in the bash script, but the Snakefile still could not be found. I am submitting the bash script to a cluster, and it fails almost immediately.
How do I make sure the Snakefile is recognized through the bash script?
Error:
assert v is not None
AssertionError
or
snakemake: error: argument --snakefile/-s: expected one argument
I am using mamba as a package manager.
The shell call should be changed to this:
snakemake --cores --snakefile "$SNAKE_FILE" --configfile "$CONFIG_FILE" ...
Where ... is the rest of the command. The main problem is that --snakefile (or -s) expects a string path to the Snakefile without any further keywords. Similarly, --configfile does not expect further keywords.
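Putting it together, the call in the bash script would look roughly like this (a sketch, assuming SNAKE_FILE, CONFIG_FILE and the other variables are defined in the same script so the shell can actually expand them):

#!/bin/bash
# activate conda env
source activate pipeline_env

# --snakefile and --configfile each take a plain path;
# per-run values go after --config as key=value pairs
snakemake --cores \
    --snakefile "$SNAKE_FILE" \
    --configfile "$CONFIG_FILE" \
    --config output_dir="$OUTPUT_DIR" fastq_dir="$FASTQ_DIR" anc_dir="$ANC_DIR" ref_fasta="$REF_FASTA" \
    --use-conda --conda-prefix="$HOME/.snakemake/conda"

echo -e "\nDONE!\n"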
I'm trying to automate, with Snakemake, a shell command that uses the chopchop software:
./chopchop.py -G hg38 -o temp -Target chr16:46390060-46390782
In this command the 'chr16:46390060-46390782' input will change.
All the different inputs are in a file that I'll have to parse to get the appropriate input format:
cat test.bed
chr16 46390060 46390782
chr21 33931554 33931728
I have a simple Snakemake rule that runs the shell command:
rule run_chopchop:
    input:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
    output:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
    shell: '''
        set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
        ./gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {input} > {output}
        '''
How can I use the content of the file as input in Snakemake and reformat each line to get the appropriate chr:start-end format? I really have no idea of the syntax. If someone can help me.
Thanks
You can always replace the shell: section with the run: one. In this case you need to call the shell() function each time you need to run the script:
rule run_chopchop:
    input: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
    output: "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
    run:
        # for each line in the file
        input = ...
        output = ...
        shell(f'''
            set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u
            ./gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp -Target {input} > {output}
            ''')
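A minimal sketch of what that run: block could look like in full, assuming one chopchop call per BED line with all results appended to the single declared output (the chr:start-end formatting and the output handling are my assumptions, and I have written the chopchop path as /gpfs/... rather than ./gpfs/...):

rule run_chopchop:
    input:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/enhancer_dataset/jurkat/test.bed"
    output:
        "/gpfs/home/user/crispr_project/CRISPRi_Enh_TALL/chopchop_output/guide_chopchop.txt"
    run:
        # build one 'chr:start-end' target per non-empty line of the BED file
        with open(input[0]) as bed:
            regions = ["{}:{}-{}".format(*line.split()[:3]) for line in bed if line.strip()]
        # start with an empty output file, then run chopchop once per region
        open(output[0], "w").close()
        for region in regions:
            shell(
                "set +u; source /gpfs/home/user/Apps/anaconda3/bin/activate chopchop; set -u; "
                "/gpfs/home/user/git/chopchop/chopchop.py -G hg38 -o temp "
                "-Target {region} >> {output}"
            )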
I have an ambiguity error and I can't figure out why and how to solve it.
Defining the wildcards:
rule all:
    input:
        xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
        runfolder_xls = expand("{runfolder}.xlsx", runfolder = config["runfolder"])
Actual rules:
rule sample_report:
    input:
        vcf = "vcfs/{sample}.annotated.vcf",
        cov = "stats/{sample}.coverage.gz",
        mod_bed = "tmp/mod_ref_{sample}.bed",
        nirvana_g2t = "/mnt/storage/data/NGS/nirvana_genes2transcripts"
    output:
        "reports/{sample}.xlsx"
    params:
        get_nb_samples()
    log:
        "logs/{sample}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_sample_report.py -v {input.vcf} -c {input.cov} -r {input.mod_bed} -n {input.nirvana_g2t} -r {rule};
        exitcode=$? ;
        if [[ {params} > 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.sample}
        elif [[ {params} == 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r sample_mode -n {wildcards.sample}
        else
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e 1 -r {rule} -n {wildcards.sample}
        fi
        """
rule runfolder_report:
    input:
        sample_sheet = "SampleSheet.csv"
    output:
        "{runfolder}.xlsx"
    log:
        "logs/{runfolder}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
        """
Config file:
runfolder: "CP0340"
samples: ['C014044p', 'C130157', 'C014040p', 'C014054b-1', 'C051198-A', 'C014042p', 'C052007W-C', 'C051198-B', 'C014038p', 'C052004-B', 'C051198-C', 'C052004-C', 'C052003-B', 'C052003-A', 'C052004-A', 'C052002-C', 'C052005-C', 'C052002-A', 'C130157N', 'C052006-B', 'C014063pW', 'C014054b-2', 'C052002-B', 'C052006-C', 'C052007W-B', 'C052003-C', 'C014064bW', 'C052005-B', 'C052006-A', 'C052005-A']
Error:
$ snakemake -n -s ../niles/Snakefile --configfile logs/CP0340_config.yaml
Building DAG of jobs...
AmbiguousRuleException:
Rules runfolder_report and sample_report are ambiguous for the file reports/C014044p.xlsx.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
runfolder_report: runfolder=reports/C014044p
sample_report: sample=C014044p
Expected input files:
runfolder_report: SampleSheet.csv
sample_report: vcfs/C014044p.annotated.vcf stats/C014044p.coverage.gz tmp/mod_ref_C014044p.bed /mnt/storage/data/NGS/nirvana_genes2transcripts
Expected output files:
runfolder_report: reports/C014044p.xlsx
sample_report: reports/C014044p.xlsx
If I understand Snakemake correctly, the wildcards in the rules are defined by my all rule, so I don't understand why the runfolder_report rule tries to produce reports/C014044p.xlsx as an output, nor how that output ends up with a sample name instead of the runfolder name (as defined in the config file).
As the error message suggests, you could assign a distinct prefix to the output of each rule. So your original code will work if you replace {runfolder}.xlsx with, e.g., "runfolder/{runfolder}.xlsx" in rule all and in runfolder_report. Alternatively, constrain the wildcards (my preferred solution) by adding before rule all something like:
import re

wildcard_constraints:
    sample = '|'.join([re.escape(x) for x in config["samples"]]),
    runfolder = re.escape(config["runfolder"]),
The reason for this is that snakemake matches input and output strings using regular expressions (the fine details of how it's done, I must admit, escape me...)
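For completeness, the unique-prefix alternative mentioned above would look roughly like this (a sketch; the runfolder/ subdirectory name is just an example):

rule all:
    input:
        xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
        # the distinct prefix means this pattern can no longer match reports/<sample>.xlsx
        runfolder_xls = expand("runfolder/{runfolder}.xlsx", runfolder = config["runfolder"])

rule runfolder_report:
    input:
        sample_sheet = "SampleSheet.csv"
    output:
        "runfolder/{runfolder}.xlsx"
    log:
        "logs/{runfolder}.log"
    # shell section unchanged from the original rule
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
        """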
Ok here is my solution:
rule runfolder_report:
    input:
        "SampleSheet.csv"
    output:
        expand("{runfolder}.xlsx", runfolder = config["runfolder"])
    params:
        config["runfolder"]
    log:
        expand("logs/{runfolder}.log", runfolder = config["runfolder"])
    shell: """
        set +e ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {params} -s {input} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {params}
        """
However, I still don't understand why it raised errors, since I know that it worked previously.
I am trying to fix a Snakefile. There are two rules (see the code below); each one works when it is the only rule, but only the rule prernaseqc runs when both are kept.
It seems that Snakemake totally ignores the other.
I tried to touch the file files_to_rnaseqc.txt etc., but it does not help. Why?
Any ideas would be appreciated.
import os
configfile: "run.json"

workpath = config['project']['workpath']
samples = sorted(list(config['project']['units'].keys()))

from snakemake.utils import R
from os.path import join
configfile: "run.json"
from os import listdir

star_dir = "STAR_files"
bams_dir = "bams"
log_dir = "logfiles"
rseqc_dir = "RSeQC"
kraken_dir = "kraken"
preseq_dir = "preseq"
pfamily = 'rnaseq'

rule prernaseqc:
    input:
        expand(join(workpath,bams_dir,"{name}.star_rg_added.sorted.dmark.bam"), name=samples)
    output:
        out1=join(workpath,bams_dir,"files_to_rnaseqc.txt")
    priority: 2
    params:
        rname='pl:prernaseqc',batch='--mem=4g --time=04:00:00'
    run:
        with open(output.out1, "w") as out:
            out.write("Sample ID\tBam file\tNotes\n")
            for f in input:
                out.write("%s\t" % f)
                out.write("%s\t" % f)
                out.write("%s\n" % f)
            out.close()

rule rnaseqc:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt")
    output:
        join(workpath,"STAR_QC")
    priority: 2
    params:
        rname='pl:rnaseqc',
        batch='--mem=24g --time=48:00:00',
        bwaver=config['bin'][pfamily]['tool_versions']['BWAVER'],
        rrnalist=config['references'][pfamily]['RRNALIST'],
        rnaseqcver=config['bin'][pfamily]['RNASEQCJAR'],
        rseqcver=config['bin'][pfamily]['tool_versions']['RSEQCVER'],
        gtffile=config['references'][pfamily]['GTFFILE'],
        genomefile=config['references'][pfamily]['GENOMEFILE']
    shell: """
        module load {params.bwaver}
        module load {params.rseqcver}
        var="{params.rrnalist}"
        if [ $var == "-" ]; then
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -o {output}
        else
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -rRNA {params.rrnalist} -o {output}
        fi
        """
Snakemake, by design, uses the output files listed in the first rule of the file as target files (i.e., files that need to be created). Hence, in your case, whichever rule happens to come first gets executed, while the other remains unexecuted.
You need to specify a target rule that lists all output files. It is customary to name it rule all.
rule all:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt"),
        join(workpath,"STAR_QC")
I am new to Snakemake; I am trying to run this code but I get an error.
I have my input directories structured like this:
Library:
    - MMETSP1:
        SRR1_1.fastq.gz
        SRR1_2.fastq.gz
    - MMETSP2:
        SRR2_1.fastq.gz
        SRR2_2.fastq.gz
So what I want to do is run the rule twice, once for each directory. For that I have used the expand function in rule all, and Snakemake counts two jobs. That is fine for me. But my problem is that I cannot retrieve the fastq files inside my directories. For that I have used a wildcard pattern when executing the command, but it is not working.
Can someone help me, please?
Thank you in advance!
#!/usr/bin/python
import os
import glob
import sys

SALMON_BY_LIBRARY_DIR = OUT_DIR + "salmon_by_library_out"
salmon = config["software"]["salmon"]

(LIBRARY, FASTQ, SENS) = glob_wildcards(LIBRARY_DIR + "{mmetsp}/{reads}_{type}.fastq.gz")

rule all:
    input:
        salmon_by_library_out = expand(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}", zip, mmetsp=LIBRARY),

rule salmon_by_library:
    input:
        transcript = TRINITY_DIR + "/Trinity.fasta",
        fastq = LIBRARY_DIR + "{mmetsp}",
    output:
        salmon_out = directory(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}"),
    log:
        OUT_DIR + "{mmetsp}/salmon.log"
    threads:
        config["threads"]["salmon"]
    params:
        trimmomatic_dir = directory(TRIMMOMATIC_DIR)
    run:
        shell(""" mkdir -p {output.salmon_out}/index """)
        shell("""
            {salmon}
            index \
                -t {input.transcript} \
                -i {output.salmon_out}/index \
                --type quasi \
                -k 31 \
                -p {threads} > {log} &&
            {salmon}
            quant \
                -i {output.salmon_out}/index \
                -l A \
                -1 {input.fastq}/*_1.fastq.gz \
                -2 {input.fastq}/*_2.fastq.gz \
                -o {output.salmon_out} \
                -p {threads} > {log}
            """)
You need to use an input function to retrieve the path to the fastq files from the config. Did you do the official tutorial (http://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)? It covers exactly this use case. For real-world best practices I suggest also having a look at https://github.com/snakemake-workflows/docs.
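A minimal sketch of the input-function idea, reusing the layout from the question (the function name get_fastqs and the r1/r2 keys are illustrative, and the salmon commands are abbreviated compared to the original rule):

# hypothetical input function: maps the {mmetsp} wildcard to its paired FASTQ files
def get_fastqs(wildcards):
    return {
        "r1": sorted(glob.glob(LIBRARY_DIR + wildcards.mmetsp + "/*_1.fastq.gz")),
        "r2": sorted(glob.glob(LIBRARY_DIR + wildcards.mmetsp + "/*_2.fastq.gz")),
    }

rule salmon_by_library:
    input:
        unpack(get_fastqs),
        transcript = TRINITY_DIR + "/Trinity.fasta",
    output:
        salmon_out = directory(SALMON_BY_LIBRARY_DIR + "/{mmetsp}"),
    threads:
        config["threads"]["salmon"]
    run:
        shell("mkdir -p {output.salmon_out}/index")
        shell("{salmon} index -t {input.transcript} -i {output.salmon_out}/index -k 31 -p {threads}")
        shell("{salmon} quant -i {output.salmon_out}/index -l A "
              "-1 {input.r1} -2 {input.r2} -o {output.salmon_out} -p {threads}")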