Snakemake: using regex in rule run

I am new to Snakemake. I am trying to run the code below, but I get an error.
My input directories are structured like this:
Library/
    MMETSP1/
        SRR1_1.fastq.gz
        SRR1_2.fastq.gz
    MMETSP2/
        SRR2_1.fastq.gz
        SRR2_2.fastq.gz
I want to run the rule once for each directory, i.e. twice in total. For that I used the expand function in rule all, and Snakemake counts two jobs, which is what I expect. My problem is retrieving the FASTQ files inside each directory. For that I used a pattern (*_1.fastq.gz) in the shell command, but it is not working.
Can someone help me, please?
Thank you in advance!
#!/usr/bin/python
import os
import glob
import sys
SALMON_BY_LIBRARY_DIR = OUT_DIR + "salmon_by_library_out"
salmon = config["software"]["salmon"]
(LIBRARY, FASTQ, SENS) = glob_wildcards(LIBRARY_DIR + "{mmetsp}/{reads}_{type}.fastq.gz")
rule all:
    input:
        salmon_by_library_out = expand(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}", zip, mmetsp=LIBRARY),
rule salmon_by_library:
    input:
        transcript = TRINITY_DIR + "/Trinity.fasta",
        fastq = LIBRARY_DIR + "{mmetsp}",
    output:
        salmon_out = directory(SALMON_BY_LIBRARY_DIR + "/" + "{mmetsp}"),
    log:
        OUT_DIR + "{mmetsp}/salmon.log"
    threads:
        config["threads"]["salmon"]
    params:
        trimmomatic_dir = directory(TRIMMOMATIC_DIR)
    run:
        shell(""" mkdir -p {output.salmon_out}/index """)
        shell("""
            {salmon}
            index \
            -t {input.transcript} \
            -i {output.salmon_out}/index \
            --type quasi \
            -k 31 \
            -p {threads} > {log} &&
            {salmon}
            quant \
            -i {output.salmon_out}/index \
            -l A \
            -1 {input.fastq}/*_1.fastq.gz \
            -2 {input.fastq}/*_2.fastq.gz \
            -o {output.salmon_out} \
            -p {threads} > {log}
        """)

You need to use an input function to retrieve the paths to the FASTQ files from the config. Have you done the official tutorial (http://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html)? It covers exactly this use case. For real-world best practices, I also suggest having a look at https://github.com/snakemake-workflows/docs.
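A minimal sketch of what such an input function could look like. The config layout (a "libraries" mapping from library name to its two read files) and the name fastq_for_library are assumptions made for illustration and do not come from the question. The Snakefile fragment appears only as a comment, since it is not plain Python.

```python
from types import SimpleNamespace

# Hypothetical config; in a Snakefile this dict would be loaded via `configfile:`.
config = {
    "libraries": {
        "MMETSP1": {
            "r1": "Library/MMETSP1/SRR1_1.fastq.gz",
            "r2": "Library/MMETSP1/SRR1_2.fastq.gz",
        },
        "MMETSP2": {
            "r1": "Library/MMETSP2/SRR2_1.fastq.gz",
            "r2": "Library/MMETSP2/SRR2_2.fastq.gz",
        },
    }
}


def fastq_for_library(wildcards):
    """Input function: return the paired FASTQ paths for the {mmetsp} wildcard."""
    return config["libraries"][wildcards.mmetsp]


# Intended use inside the Snakefile (comment only, not executed here):
# rule salmon_by_library:
#     input:
#         unpack(fastq_for_library),
#         transcript = TRINITY_DIR + "/Trinity.fasta",
#     ...
#     shell: "... -1 {input.r1} -2 {input.r2} ..."

# Stand-in for the wildcards object that Snakemake passes to an input function.
example = SimpleNamespace(mmetsp="MMETSP1")
print(fastq_for_library(example)["r1"])
```

With explicit r1 and r2 inputs, the shell command no longer needs a glob such as *_1.fastq.gz, and Snakemake can check that both files exist before the job starts.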

Related

Snakemake ambiguity

I have an ambiguity error and I can't figure out why and how to solve it.
Defining the wildcards:
rule all:
    input:
        xls = expand("reports/{sample}.xlsx", sample = config["samples"]),
        runfolder_xls = expand("{runfolder}.xlsx", runfolder = config["runfolder"])
Actual rules:
rule sample_report:
    input:
        vcf = "vcfs/{sample}.annotated.vcf",
        cov = "stats/{sample}.coverage.gz",
        mod_bed = "tmp/mod_ref_{sample}.bed",
        nirvana_g2t = "/mnt/storage/data/NGS/nirvana_genes2transcripts"
    output:
        "reports/{sample}.xlsx"
    params:
        get_nb_samples()
    log:
        "logs/{sample}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_sample_report.py -v {input.vcf} -c {input.cov} -r {input.mod_bed} -n {input.nirvana_g2t} -r {rule};
        exitcode=$? ;
        if [[ {params} > 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.sample}
        elif [[ {params} == 1 ]]
        then
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r sample_mode -n {wildcards.sample}
        else
            python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e 1 -r {rule} -n {wildcards.sample}
        fi
    """
rule runfolder_report:
    input:
        sample_sheet = "SampleSheet.csv"
    output:
        "{runfolder}.xlsx"
    log:
        "logs/{runfolder}.log"
    shell: """
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {wildcards.runfolder} -s {input.sample_sheet} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {wildcards.runfolder}
    """
Config file:
runfolder: "CP0340"
samples: ['C014044p', 'C130157', 'C014040p', 'C014054b-1', 'C051198-A', 'C014042p', 'C052007W-C', 'C051198-B', 'C014038p', 'C052004-B', 'C051198-C', 'C052004-C', 'C052003-B', 'C052003-A', 'C052004-A', 'C052002-C', 'C052005-C', 'C052002-A', 'C130157N', 'C052006-B', 'C014063pW', 'C014054b-2', 'C052002-B', 'C052006-C', 'C052007W-B', 'C052003-C', 'C014064bW', 'C052005-B', 'C052006-A', 'C052005-A']
Error:
$ snakemake -n -s ../niles/Snakefile --configfile logs/CP0340_config.yaml
Building DAG of jobs...
AmbiguousRuleException:
Rules runfolder_report and sample_report are ambiguous for the file reports/C014044p.xlsx.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
    runfolder_report: runfolder=reports/C014044p
    sample_report: sample=C014044p
Expected input files:
    runfolder_report: SampleSheet.csv
    sample_report: vcfs/C014044p.annotated.vcf stats/C014044p.coverage.gz tmp/mod_ref_C014044p.bed /mnt/storage/data/NGS/nirvana_genes2transcripts
Expected output files:
    runfolder_report: reports/C014044p.xlsx
    sample_report: reports/C014044p.xlsx
If I understand Snakemake correctly, the wildcards in the rules are defined in my all rule. So I don't understand why the runfolder_report rule tries to produce reports/C014044p.xlsx as an output, or why that output has a sample name instead of the runfolder name defined in the config file.

As the error message suggests, you could give the output of each rule a distinct prefix. Your original code will work if you replace "{runfolder}.xlsx" with, e.g., "runfolder/{runfolder}.xlsx" in rule all and in runfolder_report. Alternatively, constrain the wildcards (my preferred solution) by adding something like this before rule all:
import re
wildcard_constraints:
    sample = '|'.join([re.escape(x) for x in config["samples"]]),
    runfolder = re.escape(config["runfolder"]),
The reason for this is that Snakemake matches input and output strings using regular expressions (the fine details of how it's done, I must admit, escape me...)
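To make the regular-expression point concrete, here is a simplified imitation of the matching, not Snakemake's actual code. It relies on the fact that an unconstrained wildcard behaves like the regex .+ and can therefore span a "/". The helper name pattern_to_regex is hypothetical.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Simplified imitation: each {name} becomes a named group, '.+' by default."""
    constraints = constraints or {}
    # re.split with a capture group yields: literal, name, literal, name, ...
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            regex += re.escape(piece)  # literal text of the output pattern
        else:
            regex += "(?P<{}>{})".format(piece, constraints.get(piece, ".+"))
    return re.compile(regex)


target = "reports/C014044p.xlsx"

# Unconstrained: {runfolder} swallows "reports/C014044p", so runfolder_report
# also claims the file that sample_report produces.
loose = pattern_to_regex("{runfolder}.xlsx").fullmatch(target)
print(loose.group("runfolder"))

# Constrained to the configured run folder: the pattern no longer matches.
strict = pattern_to_regex("{runfolder}.xlsx", {"runfolder": re.escape("CP0340")})
print(strict.fullmatch(target))
```

This mirrors the "runfolder=reports/C014044p" line in the error message: the output pattern of runfolder_report matches any path ending in .xlsx until the wildcard is constrained.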

Ok here is my solution:
rule runfolder_report:
    input:
        "SampleSheet.csv"
    output:
        expand("{runfolder}.xlsx", runfolder = config["runfolder"])
    params:
        config["runfolder"]
    log:
        expand("logs/{runfolder}.log", runfolder = config["runfolder"])
    shell: """
        set +e ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_create_runfolder_report.py -run {params} -s {input} -r {rule} ;
        exitcode=$? ;
        python /mnt/storage/home/kimy/projects/automate_CP/niles/NILES_check_exitcode.py -e $exitcode -r {rule} -n {params}
    """
However, I still don't understand why it raised errors, since I know that it worked previously.

Why one snakemake rule is always skipped or ignored

I am trying to fix a Snakefile. There are two rules (see the code below). Each one works when it is the only rule in the file, but when both are kept, only the rule prernaseqc runs.
It seems that Snakemake ignores the other one entirely.
I tried touching files_to_rnaseqc.txt and similar, but it does not help. Why?
Any ideas would be appreciated.
import os
configfile: "run.json"
workpath = config['project']['workpath']
samples=sorted(list(config['project']['units'].keys()))
from snakemake.utils import R
from os.path import join
configfile: "run.json"
from os import listdir
star_dir="STAR_files"
bams_dir="bams"
log_dir="logfiles"
rseqc_dir="RSeQC"
kraken_dir="kraken"
preseq_dir="preseq"
pfamily = 'rnaseq'
rule prernaseqc:
    input:
        expand(join(workpath,bams_dir,"{name}.star_rg_added.sorted.dmark.bam"), name=samples)
    output:
        out1=join(workpath,bams_dir,"files_to_rnaseqc.txt")
    priority: 2
    params:
        rname='pl:prernaseqc',batch='--mem=4g --time=04:00:00'
    run:
        with open(output.out1, "w") as out:
            out.write("Sample ID\tBam file\tNotes\n")
            for f in input:
                out.write("%s\t" % f)
                out.write("%s\t" % f)
                out.write("%s\n" % f)
            out.close()
rule rnaseqc:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt")
    output:
        join(workpath,"STAR_QC")
    priority: 2
    params:
        rname='pl:rnaseqc',
        batch='--mem=24g --time=48:00:00',
        bwaver=config['bin'][pfamily]['tool_versions']['BWAVER'],
        rrnalist=config['references'][pfamily]['RRNALIST'],
        rnaseqcver=config['bin'][pfamily]['RNASEQCJAR'],
        rseqcver=config['bin'][pfamily]['tool_versions']['RSEQCVER'],
        gtffile=config['references'][pfamily]['GTFFILE'],
        genomefile=config['references'][pfamily]['GENOMEFILE']
    shell: """
        module load {params.bwaver}
        module load {params.rseqcver}
        var="{params.rrnalist}"
        if [ $var == "-" ]; then
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -o {output}
        else
            java -Xmx48g -jar {params.rnaseqcver} -n 1000 -s {input} -t {params.gtffile} -r {params.genomefile} -rRNA {params.rrnalist} -o {output}
        fi
    """

By design, Snakemake uses the first rule in the file as the default target, so only that rule's output files (and whatever is needed to build them) are created. Hence, in your case, whichever rule happens to come first gets executed, while the other does not.
You need to specify a target rule that lists all output files as its input. It is customary to name it rule all and to place it first.
rule all:
    input:
        join(workpath,bams_dir,"files_to_rnaseqc.txt"),
        join(workpath,"STAR_QC")

Snakemake problems with Log

I'm trying to write a Log param in my Snakefile, but I don't know what I'm missing. This is my code:
include:
    'config.py'
rule all:
    input:
        expand(WORK_DIR +"/trimmed/TFB{sample}_R{read_no}.fastq.gz.good",
               sample=SAMPLE_TFB ,read_no=['1', '2'])
rule fastp:
    input:
        R1= SAMPLES_DIR + "/TFB{sample}_R1.fastq.gz",
        R2= SAMPLES_DIR + "/TFB{sample}_R2.fastq.gz"
    output:
        R1out= WORK_DIR + "/trimmed/TFB{sample}_R1.fastq.gz.good",
        R2out= WORK_DIR + "/trimmed/TFB{sample}_R2.fastq.gz.good"
    log:
        log = WORK_DIR + "/logs/fastp/{sample}.html"
    shell:
        "fastp -i {input.R1} -I {input.R2} -o {output.R1out} -O {output.R2out} \
        -h {log.log}"
This is the error I get when I run snakemake:
SyntaxError in line 16 of /work/users/leboralli/trofoZikaLincRNAs/scripts/Snakefile:
Colon expected after keyword log. (Snakefile, line 16)
I tried many options and nothing worked.
The software, fastp, has a parameter for its report: -h, which writes an .html file. Without the log directive my code works fine.
Thanks in advance.

How to extract a part of a string using regex, when starting and ending portion of the string is given

I'm writing a Fabric script to change the Node.js version. To do that I need to remove
node-v0.10.32-linux-x64
and replace it with
node-v6.9.1-linux-x64
and vice versa.
Below is the line. I need a regular expression that matches the version part of it (node-v0.10.32-linux-x64):
/home/portweb/software/nodejs/node-v0.10.32-linux-x64/bin
Below is the code that changes the nodejs version.
#task
def changeVersion(appname='nodejs', rootdir='/home/portweb/software',homedir='/home/portweb',tarfile='node-v6.9.1-linux-x64.tar.gz',nodeversion='node-v'):
    """install nodejs"""
    base_dir = rootdir +'/'+appname
    run ('if [ ! -d ' + base_dir + ' ] ; then mkdir -p ' + base_dir + '; fi')
    put('../package/'+tarfile, base_dir + '/', use_sudo=False)
    with cd (base_dir):
        run('tar -zxf '+ tarfile)
    run ('sed -i \'s/regex/'+nodeversion+'/g\' /home/portweb/.bashrc')
    print "****Nodejs Version Changed"
Here nodeversion is either node-v0.10.32-linux-x64 or node-v6.9.1-linux-x64.

Here is the regex for sed:
"sed -i 's,/home/portweb/software/nodejs/\(.*\)/bin,/home/portweb/software/nodejs/'" + nodeversion + "'/bin,g'"
Note that you can use separators other than / for legibility; here we used a comma.
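The same substitution can be sketched in plain Python with the re module, which makes it easy to check the pattern before running it on a server. The function name swap_node_version and the sample PATH line are hypothetical. Unlike the sed expression above, this sketch uses [^/]+ rather than .* so that the match cannot run past the next "/".

```python
import re

# Capture the fixed prefix and suffix; whatever sits between them is replaced.
NODE_DIR = re.compile(r"(/home/portweb/software/nodejs/)[^/]+(/bin)")


def swap_node_version(line, nodeversion):
    """Replace the directory between .../nodejs/ and /bin with nodeversion."""
    return NODE_DIR.sub(lambda m: m.group(1) + nodeversion + m.group(2), line)


# Hypothetical line of the kind one might find in .bashrc.
line = "export PATH=/home/portweb/software/nodejs/node-v0.10.32-linux-x64/bin:$PATH"
print(swap_node_version(line, "node-v6.9.1-linux-x64"))
```

Because the pattern does not mention a specific version, the same call switches in either direction.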

You can use Fabric's sed function to replace the string you want in a file (here prueba.txt):
from fabric.contrib.files import sed
def sed_node():
    sed('/home/user/test/prueba.txt', 'node-v0.10.32-linux-x64', 'node-v6.9.1-linux-x64')
And run:
$ fab sed_node

Python script truncating directory names

Suppose one has the following script:
import csv
import re
import sys
#this script is designed to take a list of probe files and
#generate a corresponding swarm file for blasting them against
#some user defined database.
def __main__():
    #args
    infile = sys.argv[1] #list of sequences to run
    outfile = sys.argv[2] #location of resulting swarm file
    outdir = sys.argv[3] #location to store results from blast run
    db = sys.argv[4] #database to query against
    with open(infile) as fi:
        data = [x[0].strip('\n') for x in list(csv.reader(fi))]
    cmd = open(outfile, 'w')
    blast = 'module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn '
    db = ' -d ' + db
    def f(x):
        input = ' -i ' + str(x)
        out = re.search('(?<=./)([^\/]*)(?=\.)', x).group(0)
        out = ' -o ' + outdir + out + '.out'
        cmd.write(blast + db + input + out + '\n')
    map(f, data)
__main__()
If I run it with:
python blast-probes.py /data/cornishjp/array-annotations/agilent_4x44k_human/probe-seq-fasta-list.csv ./human.cmd ./ x
An example from human.cmd would be:
module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn -d x -i /data/cornishjp/array-annotations/agilent_4x44k_human/probe-seqs/A_33_P3344603.fas -o ./A_33_P3344603.out
If I run it with:
python blast-probes.py /data/cornishjp/array-annotations/agilent_4x44k_mouse/probe-seq-fasta-list.csv ./mouse.cmd ./ x
An example from mouse.cmd would be:
module load blast/2.2.26; blastall -v 5 -b 5 -a 4 -p blastn -d x -i /data/cornishp/array-annotations/agilent_4x44k_mouse/probe-seqs/A_51_P100327.fas -o ./A_51_P100327.out
The difference is when the ending of agilent_4x44k_ is human, the directory is written to file correctly with cornishjp. When the ending is mouse, the directory is written incorrectly as cornishp, the j is left out. I've swapped everything around (saving human as mouse.cmd, and so on) and I cannot for the life of me figure it out.
The only thing that comes to mind is that when I type the arguments for the Python script, I use tab to autocomplete (Linux). Could this be the problem? The input file is being read correctly, since otherwise the script would fail.
