Snakemake: Does not want to execute my rule? - python

I'm a beginner with coding and Snakemake and I'm really struggling to understand my problem. Running the snakefile below will produce no error. But it does not execute the Bowtie rule. After using --dryrun it will show:
Building DAG of jobs... Nothing to be done.
My guess would be that I mixed something up with the wildcards and Snakemake thinks the file already exist so it does not execute the rule at all. The rule does work when I hard code it. I tried to change the wildcards but can't get it to run.
#Snakefile
configfile: "../../config/config.yaml"
INPUTDIR = str(config["paths"]["input"])
OUTPUTDIR = str(config["paths"]["output"])
FILE_FORMAT = str(config["paths"]["file_format"])
DATABASEPATH = str(config["database"]["Bowtie_Database"])
(SAMPLES, NUMBERS) = glob_wildcards(INPUTDIR + "/{sample}_{number, [1,2]}." + FILE_FORMAT)
DATABASE, = glob_wildcards(DATABASEPATH + "/{bowtie_ref}")
#Outputfiles
rule all:
input:
#FastQC raw
expand(OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.html", sample=SAMPLES,number=NUMBERS),
expand(OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.zip", sample=SAMPLES,number=NUMBERS),
#Bowtie output
expand(OUTPUTDIR + "/Bowtie/{bowtie_ref}_{sample}.sam", bowtie_ref=DATABASE,sample=SAMPLES),
#macs2 output
#"/home/henri/MPI/Pipeline/Mus/results/Macs2/eg2.bed"
#######
# Q C #
#######
#Quality Control for raw data with FastQC
rule qc_raw_fastqc:
params:
threads = config["threads"]
conda:
"envs/fastqc.yml"
input:
INPUTDIR + "/{sample}_{number}." + FILE_FORMAT
output:
html = OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.html",
zip = OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.zip"
message:
"Doing quality control for raw reads with FastQC"
shell:
"fastqc -o {config[paths][output]}/FastQC/raw {input}"
################
## B O W T I E #
################
#mapping on ref. genome with Bowtie2
#rule Bowtie:
params:
threads = config["threads"]
conda:
"envs/fastqc.yml"
input:
expand(INPUTDIR + "/Bowtie_Database/{{bowtie_ref}}{ending}", ending=[".1.bt2",".2.bt2",".3.bt2",".4.bt2",".rev.1.bt2",".rev.2.bt2"]),
R1 = INPUTDIR + "{sample}.fastq",
R2 = INPUTDIR + "{sample}.fastq"
output:
OUTPUTDIR + "/Bowtie/{bowtie_ref}_{sample}.sam"
message:
"Alignment with Bowtie2 this will take a while"
shell:
"bowtie2 -x {INPUTDIR}/{wildcards.bowtie_ref} -1 {input.R1} -2 {input.R2} -S {output}"
Any help or Ideas would be really appreciated, thank you!

#DmitryKuzminov is probably right - the output files exist and they are newer than the input.
You can force the re-execution of rule Bowtie (and everything that depends on its output) with:
snakemake --forcerun Bowtie ...

Related

Directory files as input in Snakemake

I plan to implement a pipeline where I search for specific transcripts at three different genomes to align the best hits and estimate some statistics by each.
To automatize the task in Snakemake, I run blat given the genomes sequence. Nonetheless, at some point, the pipeline needs to use the output transcripts from blat as the subsequent inputs. The problem is that I don't know which transcripts will be output by the checkpoint get_transcripts (see below). Does someone know how to read the directory containing the transcripts and use them as parallel inputs for the next steps? I tried to implement a function to read the path, but then the rule macse_align (see bellow) gets as input a list of files, and Snakemake does not iterate by transcripts name but instead tries to execute the rule using all the input at once.
I have checked similar post, but the solution usually use the list of file inside a directory as an input list for the next rule (e.g: use directories or all files in directories as input in snakemake)
Here is my code
import os
import glob
configfile: 'config.yaml'
ROOT = os.path.abspath('genomes') + '/'
for d in ['pslx','info','genes','annotations']:
os.makedirs(ROOT + d,exist_ok=True)
rule all:
input:
outgroups = expand(ROOT+config['FOCAL'] + '_{outgroup}.fa',outgroup=config['OUTGROUPS']),
"genomes/genes/",
[f+'/'+f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]
rule parse_cds:
input:
ROOT + config['PREFIX'] + 'cds.all.fa.gz'
output:
ROOT + config['FOCAL'] + '_cds.fa'
shell:
"""bioawk -c fastx '{{print ">"$name"\\n"$seq}}' {input} > {output}"""
rule pblat:
input:
cds = ROOT + config['FOCAL'] + '_cds.fa',
genome = ROOT + "{outgroup}.fa.gz"
output:
ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
threads:
config['THREADS']
shell:
"""
pblat {input.genome} {input.cds} -t=dna -q=dna -minIdentity=60 -fine -threads={threads} -out=pslx {output}
"""
rule pslx_reps:
input:
ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
output:
pslx=ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.pslx",
psr=ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.psr"
shell:
"pslReps -nohead {input} {output.pslx} {output.psr}"
rule pslx_info:
input:
ROOT + "pslx/" + "human_{outgroup}_reps.pslx"
output:
fa = ROOT + "human_{outgroup}.fa",
info = ROOT + "info/" + "human_{outgroup}.info"
shell:
"perl scripts/pslx_to_fasta.pl --pslx {input} --fasta {output.fa} --info {output.info}"
checkpoint get_transcripts:
input:
outgroups = expand(ROOT+config['FOCAL'] + '_{outgroup}.fa',outgroup=config['OUTGROUPS']),
infos = expand(ROOT+'info/{outgroup}_back.info',outgroup=config['OUTGROUPS']),
focal = ROOT + config['FOCAL'] + '_cds.fa',
output:
directory(ROOT+'genes/')
params:
species_dict=config["OUTGROUPS"],
distance=config['DISTANCE'],
gtf = ROOT + config['GTF']
script:
'scripts/test.py'
# input function for rule macse_align, return paths to all files produced by the checkpoint 'somestep'
def list_transcripts(wildcards):
checkpoint_output = checkpoints.get_transcripts.get(**wildcards).output[0]
in_dir = glob.glob(checkpoint_output+'/*/*')
return [i + "/" + i.split('/')[-1] + '.fa' for i in in_dir]
rule macse_align:
input:
list_transcripts
output:
[f+'/'+f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]
shell:
"""
java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq {input} -out_NT {output}
"""

Weird NameError with python function in Snakemake script

This is an extension of a question I asked yesterday. I have looked all over StackOverflow and have not found an instance of this specific NameError:
Building DAG of jobs...
Updating job done.
InputFunctionException in line 148 of /home/nasiegel/2022-h1n1/Snakefile:
Error:
NameError: free variable 'combinator' referenced before assignment in enclosing scope
Wildcards:
Traceback:
File "/home/nasiegel/2022-h1n1/Snakefile", line 131, in aggregate_decompress_h1n1
I assumed it was an issue having to do with the symbolic file paths in my function:
def aggregate_decompress_h1n1(wildcards):
checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
filenames = expand(
SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
OUTPUTDIR + "{basenames}_quant/quant.sf",
basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
return filenames
However, hardcoding the paths does not resolve the issue. I've attached the full Snakefile below any advice would be appreciated.
Original file
# Snakemake file - input raw reads to generate quant files for analysis in R
configfile: "config.yaml"
import io
import os
import pandas as pd
import pathlib
from snakemake.exceptions import print_exception, WorkflowError
#----SET VARIABLES----#
PROJ = config["proj_name"]
INPUTDIR = config["raw-data"]
SCRATCH = config["scratch"]
REFERENCE = config["ref"]
OUTPUTDIR = config["outputDIR"]
# Adapters
SE_ADAPTER = config['seq']['SE']
SE_SEQUENCE = config['seq']['trueseq-se']
# Organsim
TRANSCRIPTOME = config['transcriptome']['rhesus']
SPECIES = config['species']['rhesus']
SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames
rule all:
input:
"final.txt",
# dowload referemce files
REFERENCE + SE_ADAPTER,
REFERENCE + SPECIES,
# multiqc
SCRATCH + "fastqc/raw_multiqc.html",
SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
SCRATCH + "fastqc/trimmed_multiqc.html",
SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
rule download_trimmomatic_adapter_file:
output: REFERENCE + SE_ADAPTER
shell: "curl -L -o {output} {SE_SEQUENCE}"
rule download_transcriptome:
output: REFERENCE + SPECIES
shell: "curl -L -o {output} {TRANSCRIPTOME}"
rule download_data:
output: "high_quality_files.tgz"
shell: "curl -L -o {output} https://osf.io/pcxfg/download"
checkpoint decompress_h1n1:
output: directory(INPUTDIR)
input: "high_quality_files.tgz"
shell: "tar xzvf {input}"
rule fastqc:
input: INPUTDIR + "{basenames}_R1.fastq.gz"
output:
raw_html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
raw_zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
conda: "env/rnaseq.yml"
wrapper: "0.80.3/bio/fastqc"
rule multiqc:
input:
raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)
output:
raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
trim_multi_html = SCRATCH + "fastqc/trimmed_multiqc.html",
trim_multi_stats = SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
conda: "env/rnaseq.yml"
shell:
"""
multiqc -n multiqc.html {input.raw_qc} #run multiqc
mv multiqc.html {output.raw_multi_html} #rename html
mv multiqc_data/multiqc_general_stats.txt {output.raw_multi_stats} #move and rename stats
rm -rf multiqc_data #clean-up
#repeat for trimmed data
multiqc -n multiqc.html {input.trim_qc} #run multiqc
mv multiqc.html {output.trim_multi_html} #rename html
mv multiqc_data/multiqc_general_stats.txt {output.trim_multi_stats} #move and rename stats
rm -rf multiqc_data #clean-up
"""
rule trimmmomatic_se:
input:
reads= INPUTDIR + "{basenames}_R1.fastq.gz",
adapters= REFERENCE + SE_ADAPTER,
output:
reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
unpaired = SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz"
conda: "env/rnaseq.yml"
log: SCRATCH + "logs/fastqc/{basenames}_R1_trim_unpaired.log"
shell:
"""
trimmomatic SE {input.reads} \
{output.reads} {output.unpaired} \
ILLUMINACLIP:{input.adapters}:2:0:15 LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 MINLEN:25
"""
rule fastqc_trim:
input: SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz"
output:
html = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
zip = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"
log: SCRATCH + "logs/fastqc/{basenames}_R1_trimmed.log"
conda: "env/rnaseq.yml"
wrapper: "0.35.2/bio/fastqc"
rule salmon_quant:
input:
reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
index_dir = OUTPUTDIR + "quant/sc_ensembl_index"
output: OUTPUTDIR + "{basenames}_quant/quant.sf"
params: OUTPUTDIR + "{basenames}_quant"
log: SCRATCH + "logs/salmon/{basenames}_quant.log"
conda: "env/rnaseq.yml"
shell:
"""
salmon quant -i {input.index_dir} --libType A -r {input.reads} -o {params} --seqBias --gcBias --validateMappings
"""
def aggregate_decompress_h1n1(wildcards):
checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
filenames = expand(
SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
OUTPUTDIR + "{basenames}_quant/quant.sf",
basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
return filenames
rule salmon_index:
input: REFERENCE + SPECIES
output: directory(OUTPUTDIR + "quant/sc_ensembl_index")
conda: "env/rnaseq.yml"
shell: "salmon index --index {output} --transcripts {input} # --type quasi"
rule done:
input: aggregate_decompress_h1n1
output: "final.txt"
shell: "touch {output}"
I think it's due to you used expand function in a wrong way, expand only accepts two positional arguments, where the first one is pattern and the second one is function (optional). If you want to supply multiple patterns you should wrap these patterns in list.
After some studying on source code of snakemake, it turns out expand function doesn't check if user provides < 3 positional arguments, there is a variable combinator in if-else that would only be created when there are 1 or 2 positional arguments, the massive amount of positional arguments you provide skip this part and lead to the error when it tries to use combinator later.
Source code: https://snakemake.readthedocs.io/en/v6.5.4/_modules/snakemake/io.html

Comparing two files and removing all whitespaces

Is there a more elegant way of comparing these two files?
Right now I am getting the following error message: syntax error near unexpected token (... diff <( tr -d ' '.
result = Popen("diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
+ file2 + ") | wc =l", shell=True, stdout=PIPE).stdout.read()
Python seems to read "\n" as a literal character.
The constructs you are using are interpreted by bash and do not form a standalone statement that you can pass to system() or exec().
<( ${CMD} )
< ${FILE}
${CMD1} | ${CMD2}
As such, you will need to wire-up the redirection and pipelines yourself, or call on bash to interpret the line for you (as #wizzwizz4 suggests).
A better solution would be to use something like difflib that will perform this internally to your process rather than calling on system() / fork() / exec().
Using difflib.unified_diff will give you a similar result:
import difflib
def read_file_no_blanks(filename):
with open(filename, 'r') as f:
lines = f.readlines()
for line in lines:
if line == '\n':
continue
yield line
def count_differences(diff_lines):
diff_count = 0
for line in diff_lines:
if line[0] not in [ '-', '+' ]:
continue
if line[0:3] in [ '---', '+++' ]:
continue
diff_count += 1
return diff_count
a_lines = list(read_file_no_blanks('a'))
b_lines = list(read_file_no_blanks('b'))
diff_lines = difflib.unified_diff(a_lines, b_lines)
diff_count = count_differences(diff_lines)
print('differences: %d' % ( diff_count ))
This will fail when you fix the syntax error because you are attempting to use bash syntax in what is implemented as a C system call.
If you wish to do this in this way, either write a shell script or use the following:
result = Popen(['bash', '-c',
"diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
+ file2 + ") | wc =l"], shell=True, stdout=PIPE).stdout.read()
This is not an elegant solution, however, since it is relying on the GNU coreutils and bash. A more elegant solution would be pure Python. You could do this with the difflib module and the re module.

Can't get output from Tesseract command run through os.system

I've created a function which loops over images and gets the orientation from the image with the tesseract library. The code looks like this:
def fix_incorrect_orientation(pathName):
for filename in os.listdir(pathName):
tesseractResult = str(os.system('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
print('tesseractResult: ' + tesseractResult)
regexObj = re.search('([Orientation:]+[\s][0-9]{1})',tesseractResult)
if regexObj:
orientation = regexObj.groups(0)[0]
print('orientation123: ' + str(orientation))
else:
print('Not getting in the Regex.')
The result from the variable tesseractResult is always 0 though. But in the terminal I will get the following result from the command:
Orientation: 3
Orientation in degrees: 90
Orientation confidence: 19.60
Script: 1
Script confidence: 21.33
I've tried catching the output from the os.system in multiple ways, such as with Popen and subprocess but without any succes. It seems that I can't catch the output from the tesseract library.
So, how exactly should I do this?
Thanks,
Yenthe
Literally 10 minutes after asking the question I found a way.. First import commands:
import commands
And then the following code will do the trick:
def fix_incorrect_orientation(pathName):
for filename in os.listdir(pathName):
tesseractResult = str(commands.getstatusoutput('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
print('tesseractResult: ' + tesseractResult)
regexObj = re.search('([Orientation:]+[\s][0-9]{1})',tesseractResult)
if regexObj:
orientation = regexObj.groups(0)[0]
print('orientation123: ' + str(orientation))
else:
print('Not getting in the Regex.')
This will pass the command around with the commands library and the output is caught thanks to getstatusoutput from the commands library.

Trouble calling EMBOSS program from python

I am having trouble calling an EMBOSS program (which runs via command line) called sixpack through Python.
I run Python via Windows 7, Python version 3.23, Biopython version 1.59, EMBOSS version 6.4.0.4. Sixpack is used to translate a DNA sequence in all six reading frames and creates two files as output; a sequence file identifying ORFs, and a file containing the protein sequences.
There are three required arguments which I can successfully call from command line: (-sequence [input file], -outseq [output sequence file], -outfile [protein sequence file]). I have been using the subprocess module in place of os.system as I have read that it is more powerful and versatile.
The following is my python code, which runs without error but does not produce the desired output files.
from Bio import SeqIO
import re
import os
import subprocess
infile = input('Full path to EXISTING .fasta file would you like to open: ')
outdir = input('NEW Directory to write outfiles to: ')
os.mkdir(outdir)
for record in SeqIO.parse(infile, "fasta"):
print("Translating (6-Frame): " + record.id)
ident=re.sub("\|", "-", record.id)
print (infile)
print ("Old record ID: " + record.id)
print ("New record ID: " + ident)
subprocess.call (['C:\memboss\sixpack.exe', '-sequence ' + infile, '-outseq ' + outdir + ident + '.sixpack', '-outfile ' + outdir + ident + '.format'])
print ("Translation of: " + infile + "\nWritten to: " + outdir + ident)
Found the answer.. I was using the wrong syntax to call subprocess. This is the correct syntax:
subprocess.call (['C:\memboss\sixpack.exe', '-sequence', infile, '-outseq', outdir + ident + '.sixpack', '-outfile', outdir + ident + '.format'])

Categories