Directory files as input in Snakemake - python

I plan to implement a pipeline where I search for specific transcripts in three different genomes, align the best hits, and estimate some statistics for each.
To automate the task in Snakemake, I run blat on the genome sequences. However, at some point the pipeline needs to use the output transcripts from blat as the inputs of subsequent steps. The problem is that I don't know which transcripts will be output by the checkpoint get_transcripts (see below). Does someone know how to read the directory containing the transcripts and use them as parallel inputs for the next steps? I tried to implement a function to read the path, but then the rule macse_align (see below) gets a list of files as its input, and Snakemake does not iterate over the transcript names but instead tries to execute the rule on all the inputs at once.
I have checked similar posts, but the solutions usually use the list of files inside a directory as a single input list for the next rule (e.g.: use directories or all files in directories as input in snakemake).
Here is my code:
import os
import glob

configfile: 'config.yaml'

ROOT = os.path.abspath('genomes') + '/'

for d in ['pslx', 'info', 'genes', 'annotations']:
    os.makedirs(ROOT + d, exist_ok=True)

rule all:
    input:
        outgroups = expand(ROOT + config['FOCAL'] + '_{outgroup}.fa', outgroup=config['OUTGROUPS']),
        "genomes/genes/",
        [f + '/' + f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]

rule parse_cds:
    input:
        ROOT + config['PREFIX'] + 'cds.all.fa.gz'
    output:
        ROOT + config['FOCAL'] + '_cds.fa'
    shell:
        """bioawk -c fastx '{{print ">"$name"\\n"$seq}}' {input} > {output}"""

rule pblat:
    input:
        cds = ROOT + config['FOCAL'] + '_cds.fa',
        genome = ROOT + "{outgroup}.fa.gz"
    output:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    threads:
        config['THREADS']
    shell:
        """
        pblat {input.genome} {input.cds} -t=dna -q=dna -minIdentity=60 -fine -threads={threads} -out=pslx {output}
        """

rule pslx_reps:
    input:
        ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}.pslx"
    output:
        pslx = ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.pslx",
        psr = ROOT + 'pslx/' + config['FOCAL'] + "_{outgroup}_reps.psr"
    shell:
        "pslReps -nohead {input} {output.pslx} {output.psr}"

rule pslx_info:
    input:
        ROOT + "pslx/" + "human_{outgroup}_reps.pslx"
    output:
        fa = ROOT + "human_{outgroup}.fa",
        info = ROOT + "info/" + "human_{outgroup}.info"
    shell:
        "perl scripts/pslx_to_fasta.pl --pslx {input} --fasta {output.fa} --info {output.info}"

checkpoint get_transcripts:
    input:
        outgroups = expand(ROOT + config['FOCAL'] + '_{outgroup}.fa', outgroup=config['OUTGROUPS']),
        infos = expand(ROOT + 'info/{outgroup}_back.info', outgroup=config['OUTGROUPS']),
        focal = ROOT + config['FOCAL'] + '_cds.fa'
    output:
        directory(ROOT + 'genes/')
    params:
        species_dict = config["OUTGROUPS"],
        distance = config['DISTANCE'],
        gtf = ROOT + config['GTF']
    script:
        'scripts/test.py'

# Input function for rule macse_align: returns paths to all files produced by the checkpoint get_transcripts.
def list_transcripts(wildcards):
    checkpoint_output = checkpoints.get_transcripts.get(**wildcards).output[0]
    in_dir = glob.glob(checkpoint_output + '/*/*')
    return [i + "/" + i.split('/')[-1] + '.fa' for i in in_dir]

rule macse_align:
    input:
        list_transcripts
    output:
        [f + '/' + f.split('/')[-1] + '_NT.fa' for f in glob.glob('genomes/genes/*')]
    shell:
        """
        java -Xmx8G -jar scripts/macse_v2.06.jar -prog alignSequences -seq {input} -out_NT {output}
        """

Related

Weird NameError with python function in Snakemake script

This is an extension of a question I asked yesterday. I have looked all over StackOverflow and have not found an instance of this specific NameError:
Building DAG of jobs...
Updating job done.
InputFunctionException in line 148 of /home/nasiegel/2022-h1n1/Snakefile:
Error:
NameError: free variable 'combinator' referenced before assignment in enclosing scope
Wildcards:
Traceback:
File "/home/nasiegel/2022-h1n1/Snakefile", line 131, in aggregate_decompress_h1n1
I assumed it was an issue having to do with the symbolic file paths in my function:
def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    filenames = expand(
        SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
        SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
        OUTPUTDIR + "{basenames}_quant/quant.sf",
        basenames=glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
    return filenames
However, hardcoding the paths does not resolve the issue. I've attached the full Snakefile below; any advice would be appreciated.
Original file
# Snakemake file - input raw reads to generate quant files for analysis in R
configfile: "config.yaml"

import io
import os
import pandas as pd
import pathlib
from snakemake.exceptions import print_exception, WorkflowError

#----SET VARIABLES----#
PROJ = config["proj_name"]
INPUTDIR = config["raw-data"]
SCRATCH = config["scratch"]
REFERENCE = config["ref"]
OUTPUTDIR = config["outputDIR"]

# Adapters
SE_ADAPTER = config['seq']['SE']
SE_SEQUENCE = config['seq']['trueseq-se']

# Organism
TRANSCRIPTOME = config['transcriptome']['rhesus']
SPECIES = config['species']['rhesus']

SAMPLE_LIST = glob_wildcards(INPUTDIR + "{basenames}_R1.fastq.gz").basenames

rule all:
    input:
        "final.txt",
        # download reference files
        REFERENCE + SE_ADAPTER,
        REFERENCE + SPECIES,
        # multiqc
        SCRATCH + "fastqc/raw_multiqc.html",
        SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        SCRATCH + "fastqc/trimmed_multiqc.html",
        SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"

rule download_trimmomatic_adapter_file:
    output: REFERENCE + SE_ADAPTER
    shell: "curl -L -o {output} {SE_SEQUENCE}"

rule download_transcriptome:
    output: REFERENCE + SPECIES
    shell: "curl -L -o {output} {TRANSCRIPTOME}"

rule download_data:
    output: "high_quality_files.tgz"
    shell: "curl -L -o {output} https://osf.io/pcxfg/download"

checkpoint decompress_h1n1:
    output: directory(INPUTDIR)
    input: "high_quality_files.tgz"
    shell: "tar xzvf {input}"

rule fastqc:
    input: INPUTDIR + "{basenames}_R1.fastq.gz"
    output:
        raw_html = SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        raw_zip = SCRATCH + "fastqc/{basenames}_R1_fastqc.zip"
    conda: "env/rnaseq.yml"
    wrapper: "0.80.3/bio/fastqc"

rule multiqc:
    input:
        raw_qc = expand(SCRATCH + "fastqc/{basenames}_R1_fastqc.zip", basenames=SAMPLE_LIST),
        trim_qc = expand(SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip", basenames=SAMPLE_LIST)
    output:
        raw_multi_html = SCRATCH + "fastqc/raw_multiqc.html",
        raw_multi_stats = SCRATCH + "fastqc/raw_multiqc_general_stats.txt",
        trim_multi_html = SCRATCH + "fastqc/trimmed_multiqc.html",
        trim_multi_stats = SCRATCH + "fastqc/trimmed_multiqc_general_stats.txt"
    conda: "env/rnaseq.yml"
    shell:
        """
        multiqc -n multiqc.html {input.raw_qc} #run multiqc
        mv multiqc.html {output.raw_multi_html} #rename html
        mv multiqc_data/multiqc_general_stats.txt {output.raw_multi_stats} #move and rename stats
        rm -rf multiqc_data #clean-up
        #repeat for trimmed data
        multiqc -n multiqc.html {input.trim_qc} #run multiqc
        mv multiqc.html {output.trim_multi_html} #rename html
        mv multiqc_data/multiqc_general_stats.txt {output.trim_multi_stats} #move and rename stats
        rm -rf multiqc_data #clean-up
        """

rule trimmmomatic_se:
    input:
        reads = INPUTDIR + "{basenames}_R1.fastq.gz",
        adapters = REFERENCE + SE_ADAPTER,
    output:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        unpaired = SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz"
    conda: "env/rnaseq.yml"
    log: SCRATCH + "logs/fastqc/{basenames}_R1_trim_unpaired.log"
    shell:
        """
        trimmomatic SE {input.reads} \
        {output.reads} {output.unpaired} \
        ILLUMINACLIP:{input.adapters}:2:0:15 LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 MINLEN:25
        """

rule fastqc_trim:
    input: SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz"
    output:
        html = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        zip = SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip"
    log: SCRATCH + "logs/fastqc/{basenames}_R1_trimmed.log"
    conda: "env/rnaseq.yml"
    wrapper: "0.35.2/bio/fastqc"

rule salmon_quant:
    input:
        reads = SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        index_dir = OUTPUTDIR + "quant/sc_ensembl_index"
    output: OUTPUTDIR + "{basenames}_quant/quant.sf"
    params: OUTPUTDIR + "{basenames}_quant"
    log: SCRATCH + "logs/salmon/{basenames}_quant.log"
    conda: "env/rnaseq.yml"
    shell:
        """
        salmon quant -i {input.index_dir} --libType A -r {input.reads} -o {params} --seqBias --gcBias --validateMappings
        """

def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    filenames = expand(
        SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
        SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
        SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
        SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
        OUTPUTDIR + "{basenames}_quant/quant.sf",
        basenames=glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames)
    return filenames

rule salmon_index:
    input: REFERENCE + SPECIES
    output: directory(OUTPUTDIR + "quant/sc_ensembl_index")
    conda: "env/rnaseq.yml"
    shell: "salmon index --index {output} --transcripts {input} # --type quasi"

rule done:
    input: aggregate_decompress_h1n1
    output: "final.txt"
    shell: "touch {output}"
I think it's because you used the expand function in the wrong way: expand accepts only two positional arguments, where the first one is the pattern (or a list of patterns) and the second one is an optional combinator function. If you want to supply multiple patterns, you should wrap them in a list.
After some digging into the Snakemake source code, it turns out the expand function doesn't check whether the user provides fewer than 3 positional arguments; there is a variable combinator in an if-else that only gets created when there are 1 or 2 positional arguments. The large number of positional arguments you provide skips this branch and leads to the error when expand tries to use combinator later.
Source code: https://snakemake.readthedocs.io/en/v6.5.4/_modules/snakemake/io.html
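Concretely, wrapping the patterns in a single list should fix the input function (a sketch based on the code above):

def aggregate_decompress_h1n1(wildcards):
    checkpoint_output = checkpoints.decompress_h1n1.get(**wildcards).output[0]
    basenames = glob_wildcards(os.path.join(checkpoint_output, "{basenames}_R1.fastq.gz")).basenames
    # One list is one positional argument, which expand handles correctly.
    return expand(
        [
            SCRATCH + "fastqc/{basenames}_R1_fastqc.html",
            SCRATCH + "fastqc/{basenames}_R1_fastqc.zip",
            SCRATCH + "trimmed/{basenames}_R1_trim.fastq.gz",
            SCRATCH + "trimmed/{basenames}_R1.unpaired.fastq.gz",
            SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.html",
            SCRATCH + "fastqc/{basenames}_R1_trimmed_fastqc.zip",
            OUTPUTDIR + "{basenames}_quant/quant.sf",
        ],
        basenames=basenames)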

Snakemake: Does not want to execute my rule?

I'm a beginner with coding and Snakemake, and I'm really struggling to understand my problem. Running the Snakefile below produces no error, but it does not execute the Bowtie rule. A --dryrun shows:
Building DAG of jobs... Nothing to be done.
My guess would be that I mixed something up with the wildcards and Snakemake thinks the files already exist, so it does not execute the rule at all. The rule does work when I hard-code it. I tried to change the wildcards but can't get it to run.
#Snakefile
configfile: "../../config/config.yaml"

INPUTDIR = str(config["paths"]["input"])
OUTPUTDIR = str(config["paths"]["output"])
FILE_FORMAT = str(config["paths"]["file_format"])
DATABASEPATH = str(config["database"]["Bowtie_Database"])

(SAMPLES, NUMBERS) = glob_wildcards(INPUTDIR + "/{sample}_{number, [1,2]}." + FILE_FORMAT)
DATABASE, = glob_wildcards(DATABASEPATH + "/{bowtie_ref}")

#Outputfiles
rule all:
    input:
        #FastQC raw
        expand(OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.html", sample=SAMPLES, number=NUMBERS),
        expand(OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.zip", sample=SAMPLES, number=NUMBERS),
        #Bowtie output
        expand(OUTPUTDIR + "/Bowtie/{bowtie_ref}_{sample}.sam", bowtie_ref=DATABASE, sample=SAMPLES),
        #macs2 output
        #"/home/henri/MPI/Pipeline/Mus/results/Macs2/eg2.bed"

#######
# Q C #
#######

#Quality Control for raw data with FastQC
rule qc_raw_fastqc:
    params:
        threads = config["threads"]
    conda:
        "envs/fastqc.yml"
    input:
        INPUTDIR + "/{sample}_{number}." + FILE_FORMAT
    output:
        html = OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.html",
        zip = OUTPUTDIR + "/FastQC/raw/{sample}_{number}_fastqc.zip"
    message:
        "Doing quality control for raw reads with FastQC"
    shell:
        "fastqc -o {config[paths][output]}/FastQC/raw {input}"
################
## B O W T I E #
################

#mapping on ref. genome with Bowtie2
rule Bowtie:
    params:
        threads = config["threads"]
    conda:
        "envs/fastqc.yml"
    input:
        expand(INPUTDIR + "/Bowtie_Database/{{bowtie_ref}}{ending}", ending=[".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]),
        R1 = INPUTDIR + "{sample}.fastq",
        R2 = INPUTDIR + "{sample}.fastq"
    output:
        OUTPUTDIR + "/Bowtie/{bowtie_ref}_{sample}.sam"
    message:
        "Alignment with Bowtie2 this will take a while"
    shell:
        "bowtie2 -x {INPUTDIR}/{wildcards.bowtie_ref} -1 {input.R1} -2 {input.R2} -S {output}"
Any help or ideas would be really appreciated, thank you!
@DmitryKuzminov is probably right: the output files exist and they are newer than the input.
You can force the re-execution of rule Bowtie (and everything that depends on its output) with:
snakemake --forcerun Bowtie ...
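If in doubt, a dry run that also prints each job's reason can confirm the diagnosis (the -r/--reason flag applies to Snakemake versions up to 7; newer releases print reasons by default):
snakemake -n -r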

Why does Python program execution slow down when using functions?

So I have a rather general question I was hoping to get some help with. I put together a Python program that runs through and automates workflows at the state level for all the different counties. The entire program was created for research at school, not actual state work. Anyway, I have two designs shown below. The first is an updated version. It takes about 40 minutes to run. The second design shows the original work. Note that it is not a well-structured design. However, it takes about five minutes to run the entire program. Could anybody give any insight into why there are such differences between the two? The updated version is still ideal, as it is much more reusable (it can run and grab any dataset at the url) and easier to understand. Furthermore, 40 minutes to get about a hundred workflows completed is still a plus. Also, this is still a work in progress. A couple of minor issues still need to be addressed in the code, but it is still a pretty cool program.
Updated Design
import os, sys, urllib2, urllib, zipfile, arcpy
from arcpy import env

path = os.getcwd()

def pickData():
    myCount = 1
    path1 = 'path2URL'
    response = urllib2.urlopen(path1)
    print "Enter the name of the files you need"
    numZips = raw_input()
    numZips2 = numZips.split(",")
    myResponse(myCount, path1, response, numZips2)

def myResponse(myCount, path1, response, numZips2):
    myPath = os.getcwd()
    for each in response:
        eachNew = each.split(" ")
        eachCounty = eachNew[9].strip("\n").strip("\r")
        try:
            myCountyDir = os.mkdir(os.path.expanduser(myPath + "\\counties" + "\\" + eachCounty))
        except:
            pass
        myRetrieveDir = myPath + "\\counties" + "\\" + eachCounty
        os.chdir(myRetrieveDir)
        myCount += 1
        response1 = urllib2.urlopen(path1 + eachNew[9])
        for all1 in response1:
            allNew = all1.split(",")
            allFinal = allNew[0].split(" ")
            allFinal1 = allFinal[len(allFinal)-1].strip(" ").strip("\n").strip("\r")
            numZipsIter = 0
            path8 = path1 + eachNew[9][0:len(eachNew[9])-2] + "/" + allFinal1
            downZip = eachNew[9][0:len(eachNew[9])-2] + ".zip"
            while numZipsIter < len(numZips2):
                if (numZips2[numZipsIter][0:3].strip(" ") == "NWI") and ("remap" not in allFinal1):
                    numZips2New = numZips2[numZipsIter].split("_")
                    if (numZips2New[0].strip(" ") in allFinal1 and numZips2New[1] != "remap" and numZips2New[2].strip(" ") in allFinal1) and (allFinal1[-3:] == "ZIP" or allFinal1[-3:] == "zip"):
                        urllib.urlretrieve(path8, allFinal1)
                        zip1 = zipfile.ZipFile(myRetrieveDir + "\\" + allFinal1)
                        zip1.extractall(myRetrieveDir)
                #maybe just have numzips2 (raw input) as the values before the county number
                #numZips2[numZipsIter][0:-7].strip(" ") in allFinal1 or numZips2[numZipsIter][0:-7].strip(" ").lower() in allFinal1) and (allFinal1[-3:]=="ZIP" or allFinal1[-3:]=="zip"
                elif (numZips2[numZipsIter].strip(" ") in allFinal1 or numZips2[numZipsIter].strip(" ").lower() in allFinal1) and (allFinal1[-3:] == "ZIP" or allFinal1[-3:] == "zip"):
                    urllib.urlretrieve(path8, allFinal1)
                    zip1 = zipfile.ZipFile(myRetrieveDir + "\\" + allFinal1)
                    zip1.extractall(myRetrieveDir)
                numZipsIter += 1

pickData()

#client picks shapefiles to add to map
#section for geoprocessing operations
# get the data frames
#add new data frame, title
#check spaces in ftp crawler
os.chdir(path)
env.workspace = path + "\\symbology\\"
zp1 = os.listdir(path + "\\counties\\")

def myGeoprocessing(layer1, layer2):
    #the code in this function is used for geoprocessing operations
    #it returns whatever output is generated from the tools used in the map
    try:
        arcpy.Clip_analysis(path + "\\symbology\\Stream_order.shp", layer1, path + "\\counties\\" + layer2 + "\\Streams.shp")
    except:
        pass
    streams = arcpy.mapping.Layer(path + "\\counties\\" + layer2 + "\\Streams.shp")
    arcpy.ApplySymbologyFromLayer_management(streams, path + '\\symbology\\streams.lyr')
    return streams

def makeMap():
    #original wetlands layers need to be entered as NWI_line or NWI_poly
    print "Enter the layer or layers you wish to include in the map"
    myInput = raw_input()
    counter1 = 1
    for each in zp1:
        print each
        print path
        zp2 = os.listdir(path + "\\counties\\" + each)
        for eachNew in zp2:
            #print eachNew
            if (eachNew[-4:] == ".shp") and ((myInput in eachNew[0:-7] or myInput.lower() in eachNew[0:-7]) or ((eachNew[8:12] == "poly" or eachNew[8:12] == 'line') and eachNew[8:12] in myInput)):
                print eachNew[0:-7]
                theMap = arcpy.mapping.MapDocument(path + '\\map.mxd')
                df1 = arcpy.mapping.ListDataFrames(theMap, "*")[0]
                #this is where we add our layers
                layer1 = arcpy.mapping.Layer(path + "\\counties\\" + each + "\\" + eachNew)
                if eachNew[7:11] == "poly" or eachNew[7:11] == "line":
                    arcpy.ApplySymbologyFromLayer_management(layer1, path + '\\symbology\\' + myInput + '.lyr')
                else:
                    arcpy.ApplySymbologyFromLayer_management(layer1, path + '\\symbology\\' + eachNew[0:-7] + '.lyr')
                # Assign legend variable for map
                legend = arcpy.mapping.ListLayoutElements(theMap, "LEGEND_ELEMENT", "Legend")[0]
                # add wetland layer to map
                legend.autoAdd = True
                try:
                    arcpy.mapping.AddLayer(df1, layer1, "AUTO_ARRANGE")
                    #geoprocessing steps
                    streams = myGeoprocessing(layer1, each)
                    # more geoprocessing options, add the layers to map and assign if they should appear in legend
                    legend.autoAdd = True
                    arcpy.mapping.AddLayer(df1, streams, "TOP")
                    df1.extent = layer1.getExtent(True)
                    arcpy.mapping.ExportToJPEG(theMap, path + "\\counties\\" + each + "\\map.jpg")
                    # Save map document to path
                    theMap.saveACopy(path + "\\counties\\" + each + "\\map.mxd")
                    del theMap
                    print "done with map " + str(counter1)
                except:
                    print "issue with map or already exists"
                counter1 += 1

makeMap()
Original Design
import os, sys, urllib2, urllib, zipfile, arcpy
from arcpy import env

response = urllib2.urlopen('path2URL')
path1 = 'path2URL'
myCount = 1
for each in response:
    eachNew = each.split(" ")
    myCount += 1
    response1 = urllib2.urlopen(path1 + eachNew[9])
    for all1 in response1:
        #print all1
        allNew = all1.split(",")
        allFinal = allNew[0].split(" ")
        allFinal1 = allFinal[len(allFinal)-1].strip(" ")
        if allFinal1[-10:-2] == "poly.ZIP":
            response2 = urllib2.urlopen('path2URL')
            zipcontent = response2.readlines()
            path8 = 'path2URL' + eachNew[9][0:len(eachNew[9])-2] + "/" + allFinal1[0:len(allFinal1)-2]
            downZip = str(eachNew[9][0:len(eachNew[9])-2]) + ".zip"
            urllib.urlretrieve(path8, downZip)

# Set the path to the directory where your zipped folders reside
zipfilepath = 'F:\\Misc\\presentation'
# Set the path to where you want the extracted data to reside
extractiondir = 'F:\\Misc\\presentation\\counties'
# List all data in the main directory
zp1 = os.listdir(zipfilepath)
# Creates a loop which gives us each zipped folder automatically
# Concatenates zipped folder to original directory in variable done
for each in zp1:
    print each[-4:]
    if each[-4:] == ".zip":
        done = zipfilepath + "\\" + each
        zip1 = zipfile.ZipFile(done)
        extractiondir1 = extractiondir + "\\" + each[:-4]
        zip1.extractall(extractiondir1)

path = os.getcwd()
counter1 = 1
# get the data frames
# Create new layer for all files to be added to map document
env.workspace = "E:\\Misc\\presentation\\symbology\\"
zp1 = os.listdir(path + "\\counties\\")
for each in zp1:
    zp2 = os.listdir(path + "\\counties\\" + each)
    for eachNew in zp2:
        if eachNew[-4:] == ".shp":
            wetlandMap = arcpy.mapping.MapDocument('E:\\Misc\\presentation\\wetland.mxd')
            df1 = arcpy.mapping.ListDataFrames(wetlandMap, "*")[0]
            #print eachNew[-4:]
            wetland = arcpy.mapping.Layer(path + "\\counties\\" + each + "\\" + eachNew)
            #arcpy.Clip_analysis(path + "\\symbology\\Stream_order.shp", wetland, path + "\\counties\\" + each + "\\Streams.shp")
            streams = arcpy.mapping.Layer(path + "\\symbology\\Stream_order.shp")
            arcpy.ApplySymbologyFromLayer_management(wetland, path + '\\symbology\\wetland.lyr')
            arcpy.ApplySymbologyFromLayer_management(streams, path + '\\symbology\\streams.lyr')
            # Assign legend variable for map
            legend = arcpy.mapping.ListLayoutElements(wetlandMap, "LEGEND_ELEMENT", "Legend")[0]
            # add the layers to map and assign if they should appear in legend
            legend.autoAdd = True
            arcpy.mapping.AddLayer(df1, streams, "TOP")
            legend.autoAdd = True
            arcpy.mapping.AddLayer(df1, wetland, "AUTO_ARRANGE")
            df1.extent = wetland.getExtent(True)
            # Export the map to a JPEG
            arcpy.mapping.ExportToJPEG(wetlandMap, path + "\\counties\\" + each + "\\wetland.jpg")
            # Save map document to path
            wetlandMap.saveACopy(path + "\\counties\\" + each + "\\wetland.mxd")
            del wetlandMap
            print "done with map " + str(counter1)
            counter1 += 1
Have a look at this guide:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Let me quote:
Function call overhead in Python is relatively high, especially compared with the execution speed of a builtin function. This strongly suggests that where appropriate, functions should handle data aggregates.
So effectively this suggests not factoring something out into a function that is going to be called hundreds of thousands of times.
In Python, functions won't be inlined, and calling them is not cheap. If in doubt, use a profiler to find out how many times each function is called and how long it takes on average. Then optimize.
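As a rough illustration of that overhead, here is a small timeit micro-benchmark (Python 3 syntax; the absolute numbers are machine-dependent):

import timeit

def add(a, b):
    return a + b

# Same arithmetic, inline vs. behind a function call.
inline = timeit.timeit("a + b", setup="a, b = 3, 4", number=10_000_000)
wrapped = timeit.timeit("add(a, b)", setup="a, b = 3, 4", globals=globals(), number=10_000_000)
print("inline: ", inline)
print("wrapped:", wrapped)  # typically several times slower, purely from call overhead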
You might also give PyPy a shot, as they have certain optimizations built in. Reducing the function call overhead in some cases seems to be one of them:
Python equivalence to inline functions or macros
http://pypy.org/performance.html

Can't get output from Tesseract command run through os.system

I've created a function which loops over images and gets the orientation from the image with the tesseract library. The code looks like this:
def fix_incorrect_orientation(pathName):
    for filename in os.listdir(pathName):
        tesseractResult = str(os.system('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
        print('tesseractResult: ' + tesseractResult)
        regexObj = re.search('([Orientation:]+[\s][0-9]{1})', tesseractResult)
        if regexObj:
            orientation = regexObj.groups(0)[0]
            print('orientation123: ' + str(orientation))
        else:
            print('Not getting in the Regex.')
The result from the variable tesseractResult is always 0 though. But in the terminal I will get the following result from the command:
Orientation: 3
Orientation in degrees: 90
Orientation confidence: 19.60
Script: 1
Script confidence: 21.33
I've tried catching the output from os.system in multiple ways, such as with Popen and subprocess, but without any success. It seems that I can't catch the output from the tesseract library.
So, how exactly should I do this?
Thanks,
Yenthe
Literally 10 minutes after asking the question I found a way. First, import commands:
import commands
And then the following code will do the trick:
def fix_incorrect_orientation(pathName):
    for filename in os.listdir(pathName):
        tesseractResult = str(commands.getstatusoutput('tesseract ' + pathName + '/' + filename + ' - -psm 0'))
        print('tesseractResult: ' + tesseractResult)
        regexObj = re.search('([Orientation:]+[\s][0-9]{1})', tesseractResult)
        if regexObj:
            orientation = regexObj.groups(0)[0]
            print('orientation123: ' + str(orientation))
        else:
            print('Not getting in the Regex.')
This passes the command through the commands library, and the output is captured thanks to its getstatusoutput function.
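For context: os.system returns the process's exit status, not its output, which is why tesseractResult was always 0. Note also that the commands module was removed in Python 3. A rough Python 3.7+ equivalent using subprocess might look like this (a sketch; depending on the tesseract version the orientation report may land on stdout or stderr, so both are checked, and newer versions spell the flag --psm):

import os
import re
import subprocess

def fix_incorrect_orientation(pathName):
    for filename in os.listdir(pathName):
        # capture_output=True collects both stdout and stderr (Python 3.7+)
        result = subprocess.run(
            ['tesseract', os.path.join(pathName, filename), '-', '-psm', '0'],
            capture_output=True, text=True)
        tesseractResult = result.stdout + result.stderr
        regexObj = re.search(r'Orientation: ([0-9])', tesseractResult)
        if regexObj:
            print('orientation: ' + regexObj.group(1))
        else:
            print('Not getting in the Regex.')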

Trouble calling EMBOSS program from python

I am having trouble calling an EMBOSS program (which runs via the command line) called sixpack through Python.
I run Python on Windows 7, Python version 3.2.3, Biopython version 1.59, EMBOSS version 6.4.0.4. Sixpack is used to translate a DNA sequence in all six reading frames and creates two files as output: a sequence file identifying ORFs, and a file containing the protein sequences.
There are three required arguments which I can successfully supply on the command line: (-sequence [input file], -outseq [output sequence file], -outfile [protein sequence file]). I have been using the subprocess module in place of os.system, as I have read that it is more powerful and versatile.
The following is my python code, which runs without error but does not produce the desired output files.
from Bio import SeqIO
import re
import os
import subprocess

infile = input('Full path to EXISTING .fasta file would you like to open: ')
outdir = input('NEW Directory to write outfiles to: ')
os.mkdir(outdir)

for record in SeqIO.parse(infile, "fasta"):
    print("Translating (6-Frame): " + record.id)
    ident = re.sub("\|", "-", record.id)
    print(infile)
    print("Old record ID: " + record.id)
    print("New record ID: " + ident)
    subprocess.call(['C:\memboss\sixpack.exe', '-sequence ' + infile, '-outseq ' + outdir + ident + '.sixpack', '-outfile ' + outdir + ident + '.format'])
    print("Translation of: " + infile + "\nWritten to: " + outdir + ident)
Found the answer. I was using the wrong syntax to call subprocess. This is the correct syntax:
subprocess.call (['C:\memboss\sixpack.exe', '-sequence', infile, '-outseq', outdir + ident + '.sixpack', '-outfile', outdir + ident + '.format'])
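The key point is that each flag and its value must be separate list elements, because subprocess does not word-split a single string the way a shell does. For what it's worth, a raw string for the Windows path and os.path.join for the output files are a bit safer (a sketch; the original concatenation only works if outdir ends with a path separator):

subprocess.call([
    r'C:\memboss\sixpack.exe',
    '-sequence', infile,
    '-outseq', os.path.join(outdir, ident + '.sixpack'),
    '-outfile', os.path.join(outdir, ident + '.format'),
])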
