Python function performance - python

I have 130 lines of code in which every part except lines 79 to 89 works fine and runs in about 0.16 seconds; however, after adding a function of about 10 lines (lines 79-89), the program takes 70-75 seconds. In that function, the data file (u.data) has 100,000 lines of numerical data in this format:
196 242 3 881250949
There are 4 numbers grouped in every line. The thing is, when I ran that function in another Python file while testing (before implementing it in the main program), it ran in 0.15 seconds; however, when I implemented it in the main program (same code), the whole program takes almost 70 seconds.
Here is my code:
""" Assignment 5: Movie Reviews
Date: 30.12.2016
"""
import os.path
import time
start_time = time.time()
""" FUNCTIONS """
# Getting film names in film folder
def get_film_name():
name = ''
for word in read_data.split(' '):
if ('(' in word) == False:
name += word + ' '
else:
break
return name.strip(' ')
# Function for removing date for comparison
def throw_date(string):
a_list = string.split()[:-1]
new_string = ''
for i in a_list:
new_string += i + ' '
return new_string.strip(' ')
def film_genre(film_name):
oboist = []
genr_list = ['unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
'Fantasy',
'Movie-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
for item in u_item_list:
if throw_date(str(item[1])) == film_name:
for i in range(4, len(item)):
oboist.append(item[i])
dictionary = dict(zip(genr_list, oboist))
genres = ''
for key, value in dictionary.items():
if value == '1':
genres += key + ' '
return genres.strip(' ')
def film_link(film_name):
link = ''
for item in u_item_list:
if throw_date(str(item[1])) == film_name:
link += item[3]
return link
def film_review(film_name):
review = ''
for r, d, filess in os.walk('film'):
for fs in filess:
fullpat = os.path.join(r, fs)
with open(fullpat, 'r') as a_file:
data = a_file.read()
if str(film_name).lower() in str(data.split('\n', 1)[0]).lower():
for i, line in enumerate(data):
if i > 1:
review += line
a_file.close()
return review
def film_id(film_name):
for film in u_item_list:
if throw_date(film[1]) == film_name:
return film[0]
def total_user_and_rate(film_name):
rate = 0
user = 0
with open('u.data', 'r') as data_file:
rate_data = data_file.read()
for l in rate_data.split('\n'):
if l.split('\t')[1] == film_id(film_name):
user += 1
rate += int(l.split('\t')[2])
data_file.close()
print('Total User:' + str(int(user)) + '\nTotal Rate: ' + str(rate / user))
""" MAIN CODE"""
review_file = open("review.txt", 'w')
film_name_list = []
# Look for txt files and extract the film names
for root, dirs, files in os.walk('film'):
for f in files:
fullpath = os.path.join(root, f)
with open(fullpath, 'r') as file:
read_data = file.read()
film_name_list.append(get_film_name())
file.close()
with open('u.item', 'r') as item_file:
item_data = item_file.read()
item_file.close()
u_item_list = []
for line in item_data.split('\n'):
temp = [word for word in line.split('|')]
u_item_list.append(temp)
film_name_list = [i.lower() for i in film_name_list]
updated_film_list = []
print(u_item_list)
# Operation for review.txt
for film_data_list in u_item_list:
if throw_date(str(film_data_list[1]).lower()) in film_name_list:
strin = film_data_list[0] + " " + film_data_list[1] + " is found in the folder" + '\n'
print(film_data_list[0] + " " + film_data_list[1] + " is found in the folder")
updated_film_list.append(throw_date(str(film_data_list[1])))
review_file.write(strin)
else:
strin = film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[
3] + '\n'
print(film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[3])
review_file.write(strin)
total_user_and_rate('Titanic')
print("time elapsed: {:.2f}s".format(time.time() - start_time))
My question is: what could be the reason for this? Is the function total_user_and_rate(film_name) problematic? Could there be problems in other parts, or is this normal given the size of the file?

I see a couple of unnecessary things.
You call film_id(film_name) inside the loop for every line of the file; you really only need to call it once, before the loop.
You don't need to read the whole file and then split it just to iterate over it; iterate over the lines of the file directly.
You split each line twice; just do it once.
Refactored for these changes:
def total_user_and_rate(film_name):
    rate = 0
    user = 0
    f_id = film_id(film_name)
    with open('u.data', 'r') as data_file:
        for line in data_file:
            line = line.split('\t')
            if line[1] == f_id:
                user += 1
                rate += int(line[2])
    print('Total User:' + str(int(user)) + '\nTotal Rate: ' + str(rate / user))

In your test you were probably testing with a much smaller u.item file. Or doing something else to ensure film_id was much quicker. (By quicker, I mean it probably ran on the nanosecond scale.)
The problem you have is that computers are so fast you didn't realise when you'd actually made a big mistake doing something that runs "slowly" in computer time.
If your if l.split('\t')[1] == film_id(film_name): line takes 1 millisecond, then when processing a 100,000 line u.data file, you could expect your total_user_and_rate function to take 100 seconds.
The problem is that film_id iterates over all your films to find the correct id for every single line in u.data. You'd be lucky if the film_id you're looking for is near the beginning of u_item_list, because then the function would return almost instantly. But as soon as you run your new function for a film near the end of u_item_list, you'll notice performance problems.
wwii has explained how to optimise the total_user_and_rate function. But you could also gain performance improvements by changing u_item_list to use a dictionary. This would improve the performance of functions like film_id from O(n) complexity to O(1). I.e. it would still run on the nanosecond scale no matter how many films are included.
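For example, here is a minimal sketch of that idea (not from the original answer; it assumes u_item_list and throw_date stay exactly as defined in the question, with the film id in column 0 and the title in column 1):
# Build the name -> id mapping once, up front.
film_id_by_name = {}
for film in u_item_list:
    if len(film) > 1:
        film_id_by_name[throw_date(film[1])] = film[0]

def film_id(film_name):
    # O(1) dictionary lookup instead of scanning the whole list on every call.
    return film_id_by_name.get(film_name)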

Related

need help regarding this error: can only concatenate list (not "str") to list

I'm learning python so I am pretty new to it.
I've been working on a class assignment and I've been facing some errors, such as the one in the title.
This is my code:
import random

def getWORDS(filename):
    f = open(filename, 'r')
    templist = []
    for line in f:
        templist.append(line.split("\n"))
    return tuple(templist)

articles = getWORDS("articles.txt")
nouns = getWORDS("nouns.txt")
verbs = getWORDS("verbs.txt")
prepositions = getWORDS("prepositions.txt")

def sentence():
    return nounphrase() + " " + verbphrase()

def nounphrase():
    return random.choice(articles) + " " + random.choice(nouns)

def verbphrase():
    return random.choice(verbs) + " " + nounphrase() + " " + \
           prepositionalphrase()

def prepositionalphrase():
    return random.choice(prepositions) + " " + nounphrase()

def main():
    number = int(input("enter the number of sentences: "))
    for count in range(number):
        print(sentence())

main()
However, whenever I run it I get this error:
TypeError: can only concatenate list (not "str") to list.
Now, I know there are tons of questions like this, but I have tried many times and I am not able to fix it. I'm new to programming and have only been learning the basics since last week.
Thank you
Here I've modified the function slightly - it fetches every word into a tuple. Use with to open the files - it closes the file once the values have been read.
I hope this will work for you!
def getWORDS(filename):
    result = []
    with open(filename) as f:
        file = f.read()
        texts = file.splitlines()
        for line in texts:
            result.append(line)
    return tuple(result)
I think the problem is in this line:
templist.append(line.split("\n"))
split() will return a list, and that list is then appended to templist. If you want to remove the newline character from the end of the line, use rstrip() instead, as it returns a string.
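For example (a small sketch, not part of the original answer), the difference is what later breaks the string concatenation:
line = "dog\n"
print(line.split("\n"))    # ['dog', ''] -- a list; ['dog', ''] + " " raises the TypeError from the title
print(line.rstrip("\n"))   # 'dog'       -- a plain string, so 'dog' + " " works as expected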
When working with a file, you should use the read() method:
file = f.read()
To split the file into lines and add them to a list, first split, then append line by line.
file = f.read()
lines = file.split("\n")
for line in lines:
    templist.append(line)
In your case, you are using the list of lines as-is, so I would write:
file = f.read()
templist = file.split("\n")
Edit 1:
Another useful tool when working with files is f.readline(), which returns the first line the first time you call it, the second line the next time, the third after that, and so on - although the approaches shown above are more efficient here.
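A quick sketch of that behaviour (the file contents here are made up, purely for illustration):
# Suppose nouns.txt contains the three lines: dog, cat, bird
with open("nouns.txt") as f:
    print(f.readline())  # 'dog\n'  -- the first call returns the first line
    print(f.readline())  # 'cat\n'  -- the next call returns the second line
    print(f.readline())  # 'bird\n' -- and so on, until '' is returned at end of file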
Edit 2:
When you are done using the file, call its close() method, or open the file with a with ... as statement, which closes it at the end of the code block.
Code example using with ... as (The best written code in this answer):
def getWORDS(filename):
    with open(filename, 'r') as f:
        file = f.read()
        templist = file.split("\n")
    return tuple(templist)
Code example using close():
def getWORDS(filename):
    f = open(filename, 'r')
    file = f.read()
    templist = file.split("\n")
    f.close()
    return tuple(templist)
This is how I would write the full code.
(fixed file opening and reading + fixed capitalization)
import random

def getWORDS(filename):
    with open(filename, 'r') as f:
        file = f.read()
        templist = file.split("\n")
    return tuple(templist)

articles = getWORDS("articles.txt")
nouns = getWORDS("nouns.txt")
verbs = getWORDS("verbs.txt")
prepositions = getWORDS("prepositions.txt")

def sentence():
    sentence = nounphrase() + " " + verbphrase()
    sentence = sentence.split(" ")
    sentence[0] = sentence[0].capitalize()
    sentence = " ".join(sentence)
    return sentence

def nounphrase():
    return random.choice(articles).lower() + " " + random.choice(nouns).capitalize()

def verbphrase():
    return random.choice(verbs).lower() + " " + nounphrase() + " " + \
           prepositionalphrase()

def prepositionalphrase():
    return random.choice(prepositions).lower() + " " + nounphrase()

def main():
    number = int(input("enter the number of sentences: "))
    for count in range(number):
        print(sentence())

main()

python multiprocessing map_async takes much more time than the sequential version

I want to find the optimal path for the travelling salesman problem. It works fine with the sequential algorithm, but I have a problem with the parallel algorithm: when I run it sequentially everything is OK, but in parallel it takes about 200x more time on a 4-core computer.
Here is my sequential code:
#!/usr/local/bin/python
#Traveling Salesman Solution
#Scott Stevenson, Jack Koppenaal, John Costantino
import time, itertools, math
#Get the city file
def getFile():
cityFile = 'in.txt'
try:
f = open(cityFile,'r')
return f
except Exception as e:
print(cityFile + ' could not be opened: ' + str(e))
def find_max():
file = open('in.txt','r')
max = file.read().split()
return max[1]
def initfiles(path):
file = open(path, 'r')
max = file.read().splitlines()
file.close()
file = open(path, 'r')
num = file.read().split()
file.close()
final = list()
final.append(str(num[0] + ' ' + num[1] + ' OK'))
if (num[2] != 'OK'):
print('File has been edited.')
file_write = open(path, 'wt')
for i in range(1, int(num[0]) + 1):
final.append(str(i) + ' ' + str(max[i]))
for line in range(len(final)):
file_write.write(str(final[line]) + '\n')
file_write.close()
#Get distance between 2 cities
#Cities stored as [ident, x, y]
def getDistance(city1, city2):
return math.sqrt((int(city2[1]) - int(city1[1]))**2 + (int(city2[2])-int(city1[2]))**2)
#Get the cities from a specified city file
def getCities(cityFile):
for line in cityFile:
try:
#Lines split by a space char will look like:
#[ident, x-coord, y-coord]
ident = line.split(' ')[0]
x = line.split(' ')[1]
y = line.split(' ')[2].strip('\n')
#print(x,y,ident)
#If the ident is not an int (not a city) skip it otherwise add it
if ident.isdigit():
#Cities are just lists with values (almost a pseudo-class)
city = [int(ident), int(x), int(y)]
cities.append(city)
except:
#The ident was not an int so we skip (pass) it and move on to the next
pass
#A function for bruteforcing the TSP problem based on a list of cities
def bruteForce(cities):
global maxweight
#Tours are also stored as pseudo-class lists
#tour[0] is the path and tour[1] is the weight
#Make a start tour with a weight of infinity so all other tours will be smaller
tour = [[], float("inf")]
permparm = []
#In order to get all permutations we need an array containing the values 1 through n
#These values are the idents of the cities (their 0th element) so we get and add them
for city in cities:
permparm.append(city[0])
#We now generate all permutations of n length from the array containing 1 - n idents
#and loop through them looking for the smallest distance
for perm in list(itertools.permutations(permparm, len(permparm))):
#Get the total weight of the permutation
dist = getWeight(perm)
#Make a new tour to represent the current permutation
thisTour = [perm, dist]
#If the current tour is shorter than the old tour, point the old tour to the new one
if thisTour[1] < tour[1] and thisTour[1]<=int (maxweight):
tour = thisTour
return tour
#Once we have gone through every permutation we have the shortest tour so return it
return tour
#A function to get the total weight of a path
#This function is messy because of an off-by-1 error introduced by the tour file starting at 1 instead of 0
def getWeight(perm):
#Set the initial distance to 0
dist = 0
#We now need to calculate and add the distance between each city in the path from 0 to n
for index in range(len(perm)):
try:
#Pass the 2 cities to the distance formula and add the value to dist
dist += getDistance(cities[perm[index]-1], cities[perm[index+1]-1])
except:
#We don't need to check bounds because the final pass will throw an out-of-bounds
#exception so we just catch it and skip that calculation
pass
#All TSP solutions are cycles so we now have to add the final city back to the initial city to the total dist
#Python has a nifty convention where list[-1] will return the last element, list[-2] will return second to last, etc.
dist += getDistance(cities[perm[-1]-1], cities[perm[0]-1])
#We now have the total distance so return it
#if (int(dist)<=80.0):
return dist
#A function to write the output of a tour to a file in a specified format
def toFile(tour):
loc = input('Enter the location where you would like to save the tour:\n')
fname = cityFile.name
#Index for the last file separator to get ONLY the file name not its path
sep_index = 0
#Linux/OSX files use /'s to separate dirs so get the position of the last one in the name
if '/' in fname:
sep_index = fname.rindex('/')+1
#Windows files use \'s to separate dirs so get the position of the last one in the name
if '\\' in fname:
sep_index = fname.rindex('\\')+1
#Create the header for the output file
header = ('NAME : ' + str(fname[sep_index:-4]) + '.opt.tour\n'
'COMMENT : Optimal tour for ' + str(fname[sep_index:]) + ' (' + str(tour[1]) + ')\n'
'TYPE : Tour\n'
'DIMENSON : ' + str(len(tour[0])) + '\n'
'TOUR_SECTION\n')
#Create the trailer for the output file
trailer = "-1\nEOF\n"
#Create the output file and write the results to it
try:
f = open(loc,'w')
f.write(header)
for city in tour[0]:
f.write(str(city) + '\n')
f.write(trailer)
f.close()
print ('Successfully saved tour data to: ' + loc)
except Exception as e:
print (loc + ' could not be written to: ' + str(e))
#-------------------The actual script begins here-----------------------
cities = []
cityFile = getFile()
maxweight = find_max()
initfiles('in.txt')
#Start the stopwatch
start = time.time()
getCities(cityFile)
opt_tour = bruteForce(cities)
#Stop the stopwatch
finish = time.time()
print ('The optimum tour is: %s (%f)' % (opt_tour[0], opt_tour[1]))
print ('This solution took %0.3f seconds to calculate.' % (finish-start))
And my parallel code:
#!/usr/local/bin/python
import time, itertools, math
from multiprocessing import Pool,Manager
thisTour = None
tour = None
dist = None
totalct = 0
#Get the city file
def getFile():
cityFile = 'in.txt'
try:
f = open(cityFile,'r')
return f
except Exception as e:
print(cityFile + ' could not be opened: ' + str(e))
def find_max():
file = open('in.txt','r')
max = file.read().split()
return max[1]
def initfiles(path):
global totalct
file = open(path, 'r')
max = file.read().splitlines()
totalct = len(max) - 2
file.close()
file = open(path, 'r')
num = file.read().split()
file.close()
final = list()
#print(totalct)
final.append(str(num[0] + ' ' + num[1] + ' OK'))
if (num[2] != 'OK'):
file_write = open(path, 'wt')
for i in range(1, int(num[0]) + 1):
# print(max[i])
final.append(str(i) + ' ' + str(max[i]))
# print(i)
# print(len(final))
for line in range(len(final)):
# print(final[line])
file_write.write(str(final[line]) + '\n')
print('File has been edited.')
file_write.close()
#Get distance between 2 cities
#Cities stored as [ident, x, y]
def getDistance(city1, city2):
return math.sqrt((int(city2[1]) - int(city1[1]))**2 + (int(city2[2])-int(city1[2]))**2)
#Get the cities from a specified city file
def getCities(cityFile):
for line in cityFile:
try:
#Lines split by a space char will look like:
#[ident, x-coord, y-coord]
ident = line.split(' ')[0]
x = line.split(' ')[1]
y = line.split(' ')[2].strip('\n')
#If the ident is not an int (not a city) skip it otherwise add it
if ident.isdigit():
#Cities are just lists with values (almost a pseudo-class)
city = [int(ident), int(x), int(y)]
cities.append(city)
except:
#The ident was not an int so we skip (pass) it and move on to the next
pass
#A function for bruteforcing the TSP problem based on a list of cities
def bruteForce(perm):
global maxweight
global tour
global thisTour
dist = getWeight(perm)
thisTour = [perm, dist]
#If the current tour is shorter than the old tour, point the old tour to the new one
if thisTour[1] < tour[1] and thisTour[1] <= int(maxweight):
tour = thisTour
return tour
#Once we have gone through every permutation we have the shortest tour so return it
return tour
#A function to get the total weight of a path
#This function is messy because of an off-by-1 error introduced by the tour file starting at 1 instead of 0
def getWeight(perm):
#Set the initial distance to 0
dist = 0
#We now need to calculate and add the distance between each city in the path from 0 to n
for index in range(len(perm)):
try:
#Pass the 2 cities to the distance formula and add the value to dist
dist += getDistance(cities[perm[index]-1], cities[perm[index+1]-1])
except:
#We don't need to check bounds because the final pass will throw an out-of-bounds
#exception so we just catch it and skip that calculation
pass
#All TSP solutions are cycles so we now have to add the final city back to the initial city to the total dist
#Python has a nifty convention where list[-1] will return the last element, list[-2] will return second to last, etc.
dist += getDistance(cities[perm[-1] - 1], cities[perm[0] - 1])
#We now have the total distance so return it
return dist
#A function to write the output of a tour to a file in a specified format
def toFile(tour):
loc = input('Enter the location where you would like to save the tour:\n')
fname = cityFile.name
#Index for the last file separator to get ONLY the file name not its path
sep_index = 0
#Linux/OSX files use /'s to separate dirs so get the position of the last one in the name
if '/' in fname:
sep_index = fname.rindex('/')+1
#Windows files use \'s to separate dirs so get the position of the last one in the name
if '\\' in fname:
sep_index = fname.rindex('\\')+1
#Create the header for the output file
header = ('NAME : ' + str(fname[sep_index:-4]) + '.opt.tour\n'
'COMMENT : Optimal tour for ' + str(fname[sep_index:]) + ' (' + str(tour[1]) + ')\n'
'TYPE : Tour\n'
'DIMENSON : ' + str(len(tour[0])) + '\n'
'TOUR_SECTION\n')
#Create the trailer for the output file
trailer = "-1\nEOF\n"
#Create the output file and write the results to it
try:
f = open(loc,'w')
f.write(header)
for city in tour[0]:
f.write(str(city) + '\n')
f.write(trailer)
f.close()
print ('Successfully saved tour data to: ' + loc)
except Exception as e:
print (loc + ' could not be written to: ' + str(e))
#-------------------The actual script begins here-----------------------
def permutations():
allperm = list(itertools.permutations(act, len(act)))
for i in allperm:
allpermlist.append(list(i))
print('permutations compute has been finished.')
return allpermlist
def allct ():
for i in range(1, totalct):
act.append(i)
def init():
global thisTour
global tour
global dist
if __name__ == "__main__":
cities = []
cityFile = getFile()
maxweight = find_max()
initfiles('in.txt')
manager = Manager()
allpermlist = manager.list()
tour = [[], float("inf")]
allpermlist = []
act = []
allct()
permutations()
#Start the stopwatch
start = time.time()
getCities(cityFile)
with Pool(processes=8, initializer=init) as p:
opt_tour = p.map_async(bruteForce, allpermlist, chunksize=2048)
opt_tour.wait()
print(opt_tour.get())
p.close()
#p.join()
finish = time.time()
print('This solution took %0.3f seconds to calculate.' % (finish-start))
And my in.txt :
9 47 OK
1 13 15
2 4 21
3 7 17
4 8 11
5 10 14
6 2 15
7 14 11
8 15 20
9 13 17

Alternative to Bio.Entrez EFetch for downloading full genome sequences from NCBI

My goal is to download full metazoan genome sequences from NCBI. I have a list of unique ID numbers for the genome sequences I need. I planned to use the Bio.Entrez module EFetch to download the data but learned today via the Nov 2, 2011 release notes (http://1.usa.gov/1TA5osg) that EFetch does not support the 'Genome' database. Can anyone suggest an alternative package/module or some other way around this? Thank you in advance!
Here is a script for you -- though you may need to tinker with it to make it work. Name the script whatever you prefer, but when you call the script do so as follows:
python name_of_script[with .py extension] your_email_address.
You need to add your email to the end of the call else it will not work. If you have a text file of accession numbers (1/line), then choose option 2. If you choose option 1, it will ask you for items like the name of the organism, strain name, and keywords. Use as many keywords as you would like -- just be certain to separate them by commas. If you go with the first option, NCBI will be searched and will return GI numbers [NOTE: NCBI is phasing out the GI numbers in 9.2016 so this script may not work after this point] which will then be used to snag the accession numbers. Once all the accession numbers are present, a folder is created, and a subfolder is created for each accession number (named as the accession number). In each subfolder, the corresponding fasta AND genbank file will be downloaded. These files will carry the accession number as the file name (e.g. accession_number.fa, accession_number.gb). Edit script to your purposes.
ALSO...Please note the warning (ACHTUNG) portion of the script. Sometimes the rules can be bent...but if you are egregious enough, your IP may be blocked from NCBI. You have been warned.
import os
import os.path
import sys
import re #regular expressions
from Bio import Entrez
import datetime
import time
import glob
arguments = sys.argv
Entrez.email = arguments[1] #email
accession_ids = []
print('Select method for obtaining the accession numbers?\n')
action = input('1 -- Input Search Terms\n2 -- Use text file\n')
if action == '1':
print('\nYou will be asked to enter an organism name, a strain name, and keywords.')
print('It is not necessary to provide a value to each item (you may just hit [ENTER]), but you must provide at least one item.\n')
organism = input('Enter the organism you wish to search for (e.g. Escherichia coli [ENTER])\n')
strain = input('Enter the strain you wish to search for. (e.g., HUSEC2011 [ENTER])\n')
keywords = input('Enter the keywords separated by a comma (e.g., complete genome, contigs, partial [ENTER])\n')
search_phrase = ''
if ',' in keywords:
keywords = keywords.split(',')
ncbi_terms = ['organism', 'strain', 'keyword']
ncbi_values = [organism, strain, keywords]
for index, n in enumerate(ncbi_values):
if index == 0 and n != '':
search_phrase = '(' + n + '[' + ncbi_terms[index] + '])'
else:
if n != '' and index != len(ncbi_values)-1:
search_phrase = search_phrase + ' AND (' + n + '[' + ncbi_terms[index] + '])'
if index == len(ncbi_values)-1 and n != '' and type(n) is not list:
search_phrase = search_phrase + ' AND (' + n + '[' + ncbi_terms[index] + '])'
if index == len(ncbi_values)-1 and n != '' and type(n) is list:
for name in n:
name = name.lstrip()
search_phrase = search_phrase + ' AND (' + name + '[' + ncbi_terms[index] + '])'
print('Here is the complete search line that will be used: \n\n', search_phrase)
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=1000, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
#print(result['Count'])
gi_numbers = result['IdList']
fetch_handle = Entrez.efetch(db='nucleotide', id=result['IdList'], rettype='acc', retmode='text')
accession_ids = [id.strip() for id in fetch_handle]
fetch_handle.close()
if action == '2': #use this option if you have a file of accession #s
file_name = input('Enter the name of the file\n')
with open(file_name, 'r') as input_file:
lines = input_file.readlines()
for line in lines:
line = line.replace('\n', '')
accession_ids.append(line)
#--------------------------------------------------------------------------------------------------------------
#----------------------------------- Make directory to store files --------------------------------------------
new_path = 'Genbank_Files/'
if not os.path.exists(new_path):
os.makedirs(new_path)
print('You have ' + str(len(accession_ids)) + ' file(s) to download.') #print(accession_ids)
ending='.gb'
files = []
##CHECK IF FILE HAS BEEN DOWNLOADED
for dirpath, dirnames, filenames in os.walk(new_path):
for filename in [f for f in filenames if f.endswith(ending)]: #for zipped files
files.append(os.path.join(dirpath,filename))
for f in files:
f = f.rsplit('/')[-1]
f = f.replace('.gb', '')
if f in accession_ids:
ind = accession_ids.index(f)
accession_ids.pop(ind)
print('')
print('You have ' + str(len(accession_ids)) + ' file(s) to download.')
#--------------------------------------------------------------------------
###############################################################################
#---ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG----#
###############################################################################
# Call Entrez to download files
# If downloading more than 100 files...
# Run this script only between 9pm-5am Monday - Friday EST
# Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov
# Make no more than 3 requests every 1 second (Biopython takes care of this).
# Use URL parameter email & tool for distributed software
# NCBI's Disclaimer and Copyright notice must be evident to users of your service.
#
# Use this script at your own risk.
# Neither the script author nor author's employers are responsible for consequences arising from improper usage
###############################################################################
# CALL ENTREZ: Call Entrez to download genbank AND fasta (nucleotide) files using accession numbers.
###############################################################################
start_day = datetime.date.today().weekday() # 0 is Monday, 6 is Sunday
start_time = datetime.datetime.now().time()
print(str(start_day), str(start_time))
print('')
if ((start_day < 5 and start_time > datetime.time(hour=21)) or (start_day < 5 and start_time < datetime.time(hour=5)) or start_day > 5 or len(accession_ids) <= 100 ):
print('Calling Entrez...')
for a in accession_ids:
if ((datetime.date.today().weekday() < 5 and datetime.datetime.now().time() > datetime.time(hour=21)) or
(datetime.date.today().weekday() < 5 and datetime.datetime.now().time() < datetime.time(hour=5)) or
(datetime.date.today().weekday() == start_day + 1 and datetime.datetime.now().time() < datetime.time(hour=5)) or
(datetime.date.today().weekday() > 5) or len(accession_ids) <= 100 ):
print('Downloading ' + a)
new_path = 'Genbank_Files/' + a + '/'
if not os.path.exists(new_path):
os.makedirs(new_path)
handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text', seq_start=0)
FILENAME = new_path + a + '.gb'
local_file=open(FILENAME,'w')
local_file.write(handle.read())
handle.close()
local_file.close()
handle=Entrez.efetch(db='nucleotide', id=a, rettype='fasta', retmode='text')
FILENAME = new_path + a + '.fna'
local_file=open(FILENAME,'w')
local_file.write(handle.read())
handle.close()
local_file.close()
else:
print('You have too many files to download at the time. Try again later.')
#-------

Python Nested Loops - continue iterates first loop

Brand new to programming but very enjoyable challenge.
Here's a question which I suspect may be caused by a misunderstanding of python loops.
System info: Using notepad++ and IDLE python 3.4.3 on Win 7 32-bit
My solution is to open database 1, use it to look up the correct master entry in database 2, pull an index number (task_no), then write a third file identical to the first database, this time with the correct index number.
My problem is that it performs the 1st and 2nd loops correctly, then on the 2nd iteration of loop 1 it tries to execute a block from loop 2 while iterating through the rows of loop 1, not the task_rows of loop 2.
Footnote: both files are quite large (several MB), so I'm not sure if storing them in memory is a good idea.
This was a relevant question that I found closest to this problem:
python nested loop using loops and files
What I got out of it was to try moving the file opening inside the 1st loop, but the problem persists. Is it something to do with how I'm using the CSV reader?
I also have the sinking suspicion that the root cause may be in how I'm approaching the problem, so I am open to suggestions for alternative ways to solve it.
Thanks in advance!
The gist:
for row in readerCurrentFile: #LOOP 1
    # iterates through readerCurrentFile to define search variables
    [...]
    for task_row in readerTaskHeader: #LOOP 2
        # searches each row iteratively through readerTaskHeader
        # Match compid
        #if no match, continue <<<- This is where it goes back to 1st loop
        [...]
        # Match task frequency
        #if no match, continue
        [...]
        # once both of the above matches check out, will grab data (task_no from task_row[0])
        task_no = ""
        task_no = task_row[0]
        if task_row:
            break
    [...]
    # writes PM code
    print("Successful write of PM schedule row")
    print(compid + " " + dict_freq_names[str(pmfreqx) + str(pmfreq)] + ": " + pmid + " " + task_no)
The entire code:
import csv
import re
#Writes schedule
csvNewPMSchedule = open('new_pm_schedule.csv', 'a', newline='')
writerNewPMSchedule = csv.writer(csvNewPMSchedule)
# Dictionaries of PM Frequency
def re_compile_dict(d,f):
for k in d:
d[k] = re.compile(d[k], flags=f)
dict_month = {60:'Quin',36:'Trien',24:'Bi-An',12:'Annual(?<!Bi-)(?<!Semi-)',6:'Semi-An',3:'Quart',2:'Bi-Month',1:'Month(?<!Bi-)'}
dict_week = {2:'Bi-Week',1:'Week(?<!Bi-)'}
dict_freq_names = {'60Months':'Quintennial','36Months':'Triennial','24Months':'Bi-Annual','12Months':'Annual','6Months':'Semi-Annual','3Months':'Quarterly','2Months':'Bi-Monthly','1Months':'Monthly','2Weeks':'Bi-Weekly','1Weeks':'Weekly'}
re_compile_dict(dict_month,re.IGNORECASE)
re_compile_dict(dict_week, re.IGNORECASE)
# Unique Task Counter
task_num = 0
total_lines = 0
#Error catcher
error_in_row = []
#Blank out all rows
pmid = 0
compid = 0
comp_desc = 0
pmfreqx = 0
pmfreq = 0
pmfreqtype = 0
# PM Schedule Draft (as provided by eMaint)
currentFile = open('pm_schedule.csv', encoding='windows-1252')
readerCurrentFile = csv.reader(currentFile)
# Loop 1
for row in readerCurrentFile:
if row[0] == "pmid":
continue
#defines row items
pmid = row[0]
compid = row[1]
comp_desc = row[2]
#quantity of pm frequency
pmfreqx_temp = row[3]
#unit of pm frequency, choices are: Months, Weeks
pmfreq = row[4]
#pmfreqtype is currently only static not sure what other options we have
pmfreqtype = row[5]
#pmnextdate is the next scheduled due date from this one. we probably need logic later that closes out any past due date
pmnextdate = row[6]
# Task Number This is what we want to change
# pass
# We want to change this to task header's task_desc
sched_task_desc = row[8]
#last done date
last_pm_date = row[9]
#
#determines frequency search criteria
#
try:
pmfreqx = int(pmfreqx_temp)
except (TypeError, ValueError):
print("Invalid PM frequency data, Skipping row " + pmid)
error_in_row.append(pmid)
continue
#
#defines frequency search variable
#
freq_search_var = ""
if pmfreq == "Weeks":
freq_search_var = dict_week[pmfreqx]
elif pmfreq == "Months":
freq_search_var = dict_month[pmfreqx]
if not freq_search_var:
print("Error in assigning frequency" + compid + " " + str(pmfreqx) + " " + pmfreq)
error_in_row.append(pmid)
continue
#defines Equipment ID Search Variable
print(compid + " frequency found: " + str(pmfreqx) + " " + str(pmfreq))
compid_search_var = re.compile(compid,re.IGNORECASE)
#
# Matching function - search taskHeader for data
#
#PM Task Header Reference
taskHeader = open('taskheader.csv', encoding='windows-1252')
readerTaskHeader = csv.reader(taskHeader)
for task_row in readerTaskHeader:
# task_row[0]: taskHeader pm number
# task_row[1]: "taskHeader task_desc
# task_row[2]: taskHeader_task_notes
#
# search for compid
compid_match = ""
compid_match = compid_search_var.search(task_row[1])
if not compid_match:
print(task_row[1] + " does not match ID for " + compid + ", trying next row.") #debug 2
continue # <<< STOPS ITERATING RIGHT OVER HERE
print("Found compid " + task_row[1]) # debug line
#
freq_match = ""
freq_match = freq_search_var.search(task_row[1])
if not freq_match:
print(task_row[1] + " does not match freq for " + compid + " " + dict_freq_names[str(pmfreqx) + str(pmfreq)] + ", trying next row.") #debug line
continue
print("Frequency Match: " + compid + " " + dict_freq_names[str(pmfreqx) + str(pmfreq)]) # freq debug line
#
task_no = ""
print("Assigning Task Number to " + task_row[0])
task_no = task_row[0]
if task_row:
break
#
#error check
#
if not task_no:
print("ERROR IN SEARCH " + compid + " " + pmid)
error_in_row.append(pmid)
continue
#
# Writes Rows
#
writerNewPMSchedule.writerow([pmid,compid,comp_desc,pmfreqx,pmfreq,pmfreqtype,pmnextdate,task_no,sched_task_desc,last_pm_date])
print("Successful write of PM schedule row")
print(compid + " " + dict_freq_names[str(pmfreqx) + str(pmfreq)] + ": " + pmid + " " + task_no)
print("==============")
# Error reporting lined out for now
# for row in error_in_row:
# writerNewPMSchedule.writerow(["Error in row:",str(error_in_row[row])])
# print("Error in row: " + str(error_in_row[row]))
print("Finished")

Is there some kind of limit to the amount of output Python 3.4 allows using the write() method at one time?

I put trailing print() calls right next to my write() calls at the end of my code to test why my output files were incomplete. The print() output is all the "stuff" I expect, while the write() output is short by a confusing amount (only 150 out of 200 'things'). Reference image of output: IDLE versus the external output file.
FYI: Win 7 64 // Python 3.4.2
My module takes an SRT captions file ('test.srt') and returns a list object I create from it; in particular, one with 220 list entries of the form: [[(index), [time], string]]
times = open('times.txt', 'w')

### A portion of Riobard's SRT Parser: srt.py
import re

def tc2ms(tc):
    ''' convert timecode to millisecond '''
    sign = 1
    if tc[0] in "+-":
        sign = -1 if tc[0] == "-" else 1
        tc = tc[1:]
    TIMECODE_RE = re.compile('(?:(?:(?:(\d?\d):)?(\d?\d):)?(\d?\d))?(?:[,.](\d?\d?\d))?')
    match = TIMECODE_RE.match(tc)
    try:
        assert match is not None
    except AssertionError:
        print(tc)
    hh,mm,ss,ms = map(lambda x: 0 if x==None else int(x), match.groups())
    return ((hh*3600 + mm*60 + ss) * 1000 + ms) * sign

# my code
with open('test.srt') as f:
    file = f.read()

srt = []
for line in file:
    splitter = file.split("\n\n")

# SRT splitter
i = 0
j = len(splitter)
for items in splitter:
    while i <= j - 2:
        split_point_1 = splitter[i].index("\n")
        split_point_2 = splitter[i].index("\n", split_point_1 + 1)
        index = splitter[i][:split_point_1]
        time = [splitter[i][split_point_1:split_point_2]]
        time = time[0][1:]
        string = splitter[i][split_point_2:]
        string = string[1:]
        list = [[(index), [time], string]]
        srt += list
        i += 1

# time info outputter
i = 0
j = 1
for line in srt:
    if i != len(srt) - 1:
        indexer = srt[i][1][0].index(" --> ")
        timein = srt[i][1][0][:indexer]
        timeout = srt[i][1][0][-indexer:]
        line_time = (tc2ms(timeout) - tc2ms(timein))/1000
        space_time = ((tc2ms((srt[j][1][0][:indexer]))) - (tc2ms(srt[i][1][0][-indexer:])))/1000
        out1 = "The space between Line " + str(i) + " and Line " + str(j) + " lasts " + str(space_time) + " seconds." + "\n"
        out2 = "Line " + str(i) + ": " + str(srt[i][2]) + "\n\n"
        times.write(out1)
        times.write(out2)
        print(out1, end="")
        print(out2)
        i += 1
        j += 1
    else:
        indexer = srt[i][1][0].index(" --> ")
        timein = srt[i][1][0][:indexer]
        timeout = srt[i][1][0][-indexer:]
        line_time = (tc2ms(timeout) - tc2ms(timein))/1000
        outend = "Line " + str(i) + ": " + str(srt[i][2]) + "\n<End of File>"
        times.write(outend)
        print(outend)
My two write() method output files, respectively, only print out either ~150 or ~200 items of the 220 things it otherwise correctly prints to the screen.
You want to close your times file when done writing; operating systems use write buffers to speed up file I/O, collecting larger blocks of data to be written to disk in one go; closing the file flushes that buffer:
times.close()
Consider opening the file in a with block:
with open('times.txt', 'w') as times:
    # all code that needs to write to times
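If you want to see the data on disk while the script is still running, an explicit flush also works (a minimal sketch, not from the original answer):
times = open('times.txt', 'w')
times.write("Line 0: example\n")
times.flush()    # push Python's internal buffer out to the OS without closing the file
# ... more writes ...
times.close()    # closing flushes whatever is still buffered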
