I have the following code:
for k in pool:
    x = []
    y = []
    try:
        exec(pool[k])
    except Exception as e:
        ...
    do_something(x)
    do_something_else(y)
where pool[k] is Python code that will eventually append items to x and y (that's why I am using exec instead of eval).
I have already tried running the same code under PyPy, but for this particular block I don't get much improvement; the line with exec is still my bottleneck.
That said, my question is:
Is there a faster alternative to exec?
If not, do you have any workaround to get some speedup in such a case?
--UPDATE--
To clarify, pool contains around one million keys, and each key is associated with a script (around 50 lines of code). The inputs for the scripts are defined before the for loop, and the outputs generated by a script are stored in x and y. So each script has a line stating x.append(something) and y.append(something). The rest of the program evaluates the results and scores each script. Therefore, I need to loop over each script, execute it, and process the results. The scripts are originally stored in different text files; pool is a dictionary obtained by parsing these files.
P.S.
Using the pre-compiled version of the code:
for k in pool.keys():
    pool[k] = compile(pool[k], '<string>', 'exec')
I got a 5x speed increase; not much, but it is something already. I am experimenting with other solutions...
If you really need to execute code in this manner, use compile() to prepare it.
That is, do not pass raw Python source to exec but a compiled code object: run compile() on your scripts beforehand to turn them into Python bytecode objects.
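For instance, a minimal sketch of that approach (compiled and namespace are made-up names here; do_something and do_something_else are the placeholders from the question): compile each script once up front, then execute it against a small explicit namespace so that x and y can be read back afterwards.
compiled = {k: compile(src, '<script %s>' % k, 'exec') for k, src in pool.items()}

for k, code in compiled.items():
    namespace = {'x': [], 'y': []}   # add any shared inputs the scripts expect here as well
    try:
        exec(code, namespace)        # the script's x.append(...) calls land in namespace['x']
    except Exception:
        continue
    do_something(namespace['x'])
    do_something_else(namespace['y'])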
Still, it would make more sense to write a function that performs what you need on the input argument, i.e. pool[k], and returns the results corresponding to x and y.
If you are reading your code from files, you also have I/O slowdowns to cope with, so it would be nice to have these files already compiled to *.pyc.
You may also think about using execfile() in Python 2.
An idea for using functions in a pool:
template = """\
def newfunc ():
%s
return result
"""
pool = [] # For iterating it will be faster if it is a list (just a bit)
# This compiles code as a real function and adds a pointer to a pool
def AddFunc (code):
code = "\n".join([" "+x for x in code.splitlines()])
exec template % code
pool.append(newfunc)
# Usage:
AddFunc("""\
a = 8.34**0.5
b = 8
c = 13
result = []
for x in range(10):
result.append(math.sin(a*b+c)/math.pi+x)""")
for f in pool:
x = f()
I've parallelized my code using imap and the pyfastx library, but the problem is that the sequences get loaded using a list comprehension. When the input file is large this becomes problematic, because all seq values are loaded into memory. Is there a way to do this without completely populating the list that's passed to imap?
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
for (A1, A2, B) in pool.imap(pSeq,
                             [seq for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False)],
                             chunksize=100000):
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
I also tried skipping the list comprehension and using apply_async, since pyfastx supports loading the sequences one at a time, but because each individual task is fairly short and there is no chunksize argument, this ends up taking far longer than not using multiprocessing at all.
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
results = []
for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False):
    results.append(pool.apply_async(pSeq, (seq,)))
pool.close()
pool.join()
for result in results:
    A1, A2, B = result.get()
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
Any suggestions?
I know it's been a while since the original post, but I actually dealt with a similar issue, so I thought this might be helpful for someone at some point.
First of all, the general solution is to pass imap an iterator or generator object instead of a list. In this case, you would modify pSeq to accept a 3-tuple and simply drop the list comprehension.
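To illustrate the general pattern, a lazy generator can stand in for the list comprehension (read_records is a made-up name, and this is only a sketch; whether it helps with pyfastx specifically is exactly the problem discussed below):
def read_records(filename):
    # yields one (name, seq, quality) tuple at a time instead of building a list
    for record in pyfastx.Fastq(filename, build_index=False):
        yield record

# pool.imap(pSeq, read_records(temp2.name), chunksize=100000)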
I am including some code below to demonstrate what I mean, but let me preempt anybody trying this: it doesn't work (at least in my hands). My guess is that this happens because, for some reason, pyfastx.Fastq doesn't return an iterator or generator object (I did verify this tidbit: the returned object doesn't implement next)...
I worked around this by using fastq-and-furious, which is comparably fast and does return a generator (and also has a more flexible Python API). That workaround code is at the bottom, if you want to skip the "solution that should have worked".
At any rate, here is what I would have liked to work:
def pSeq(seq_tuple):
    _, seq, _ = seq_tuple
    ...
    return (A1, A2, B)

...

import multiprocessing as mp

with mp.Pool(5) as pool:
    # this fails (when I ran it on Mac, the program hung and I had to keyboard-interrupt),
    # most likely due to pyfastx.Fastq not returning a generator or iterator
    parser = pyfastx.Fastq(temp2.name, build_index=False)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        # do something
Just to make this answer complete, I am also adding my workaround code, which does work for me. Unfortunately, I couldn't get it to run properly while still using pyfastx:
import fastqandfurious.fastqandfurious as fqf
import fastqandfurious._fastqandfurious as _fqf

# if you don't supply an entry function, fqf returns (name, seq, quality) as byte-strings
def pfx_like_entry(buf, pos, offset=0):
    """
    Return a tuple with identical format to pyfastx, so reads can be
    processed with the same function regardless of which parser we use
    """
    name = buf[pos[0]:pos[1]].decode('ascii')
    seq = buf[pos[2]:pos[3]].decode('ascii')
    quality = buf[pos[4]:pos[5]].decode('ascii')
    return name, seq, quality

# can be replaced with fqf.automagic_open(), gzip.open(), or some other equivalent
with open(temp2.name, mode='rb') as handle, \
        mp.Pool(5) as pool:
    # this does work. You can also use biopython's fastq parsers
    # (or any other parser that returns an iterator/generator)
    parser = fqf.readfastq_iter(fh=handle,
                                fbufsize=20000,
                                entryfunc=pfx_like_entry,
                                _entrypos=_fqf.entrypos)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        # do something
I have the following function (shortened for readability), which I parallelize using Python's (3.5) multiprocessing module:
def evaluate_prediction(enumeration_tuple):
    i = enumeration_tuple[0]
    logits_pred = enumeration_tuple[1]
    print("This prints successfully")
    print("This never gets printed: ")
    print(enumeration_tuple[0])
    filename = sample_names_test[i]
    onehots_pred = logits_to_onehots(logits_pred)
    np.save("/media/nfs/7_raid/ebos/models/fcn/" + channels + "/test/ndarrays/" + filename, onehots_pred)
However, this function hangs whenever I attempt to read its input argument. Execution gets past the logits_pred = enumeration_tuple[1] line, as evidenced by a print statement printing a simple string, but it halts whenever I print(logits_pred). So apparently, whenever I actually need the passed value, the process stops. I do not get an exception or error message. When using either Python's built-in map() function or a for-loop, the function finishes successfully. I should have sufficient memory and computing power available. All processes are writing to different files. enumerate(predictions) yields correct index-value pairs, as expected. I call this function using Pool.map():
pool = multiprocessing.Pool()
file_results = pool.map(evaluate_prediction, enumerate(predictions))
Why is it hanging? And how can I get an exception, so I know what's going wrong?
UPDATE: After outsourcing the mapped function to another module, importing it from there, and adding __init__.py to my directory, I manage to print the first item in the tuple, but not the second.
I had a similar issue before, and a solution that worked for me was to put the function you want to parallelize in a separate module and then import it.
from eval_prediction import evaluate_prediction
pool = multiprocessing.Pool()
file_results = pool.map(evaluate_prediction, enumerate(predictions))
I assume you will save the function definition in a file named eval_prediction.py in the same directory. Make sure you have an __init__.py there as well.
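For reference, a rough sketch of what that module could look like (model_utils is a hypothetical name for wherever logits_to_onehots, sample_names_test and channels live in your project):
# eval_prediction.py: lives next to the main script and __init__.py
import numpy as np
from model_utils import logits_to_onehots, sample_names_test, channels  # hypothetical module

def evaluate_prediction(enumeration_tuple):
    i, logits_pred = enumeration_tuple
    filename = sample_names_test[i]
    onehots_pred = logits_to_onehots(logits_pred)
    np.save("/media/nfs/7_raid/ebos/models/fcn/" + channels + "/test/ndarrays/" + filename,
            onehots_pred)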
I want to utilize all my computer's cores to run the following pseudocode (the actual code is too long):
def function2(modifiedList):
    process modifiedList
    return value

mainList = [a, b, c, ..., z]

def function1(mainList):
    process mainList
    create modifiedList
    result = function2(modifiedList)
    return result

calculator(function1)
I am not sure how to use multiprocessing or multithreading when a function is called from inside another function.
You should look at the multiprocessing module in the standard library (https://docs.python.org/3.6/library/multiprocessing.html).
However, the code you posted would not be easy to parallelize because of its sequential nature.
A much better approach would be to find an easily parallelizable part of your problem and work from there.
As the documentation says, if your list is long enough, and you can process each element independently, you could try to substitute
for i in l:
    result.append(f(i))
with
result = p.map(f, l)
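Putting that together, a minimal self-contained sketch could look like this (process_item is a made-up stand-in for whatever independent per-element work you can factor out of function1/function2):
import multiprocessing

def process_item(item):
    # whatever independent work you can do on a single element of the list
    return item * item

if __name__ == '__main__':
    mainList = list(range(1000))
    p = multiprocessing.Pool()              # one worker process per CPU core by default
    result = p.map(process_item, mainList)  # replaces the sequential for-loop
    p.close()
    p.join()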
I have a "not so large" file (~2.2 GB) which I am trying to read and process...
import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")
print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination, weight)
                #tup2 = (src, weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) + " ==> " + line + "\n"
            error.write(string)
            continue
Am I doing something wrong?
It's been about an hour since the code started reading the file (and it's still reading), and memory usage is already at 20 GB.
Why is it taking so much time and memory?
To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function with a KeyboardInterrupt exception handler which prints out the gc data to a file.
def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))

if __name__ == '__main__':
    main()
Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
There are a few things you can do:
Run your code on a subset of the data and measure the time required, then extrapolate to the full size of your data. That will give you an estimate of how long it will run.
counter = 0
with open("final_edge_list.txt", "r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...
On 1M lines it runs ~8 sec on my machine, so for a 2.2 GB file with about 100M lines it should take ~15 min. Though, once you exceed your available memory, that estimate won't hold anymore.
Your graph seems symmetric
graph[src][destination] = weight
graph[destination][src] = weight
In your graph processing code, exploit the symmetry of the graph and store each edge only once; that cuts memory usage roughly in half.
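For example, a small sketch of that idea (add_edge and get_weight are made-up helper names), storing each undirected edge under a canonical key:
graph = {}

def add_edge(src, destination, weight):
    # canonical (smaller, larger) key, so each undirected edge is stored once
    key = (src, destination) if src <= destination else (destination, src)
    graph[key] = weight

def get_weight(a, b):
    key = (a, b) if a <= b else (b, a)
    return graph.get(key)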
Run profilers on your code using a subset of the data and see what happens there. The simplest would be to run
python -m cProfile --sort cumulative yourprogram.py
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
Python's numeric types use quite a lot of memory compared to other programming languages. For my setting it appears to be 24 bytes for each number:
>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24
Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
To add another thing, some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory will not be returned to the operating system until your process terminates. Also have a look at this question I posted when I first discovered this issue:
Python: garbage collection fails?
Suggestions to work around this include:
use a subprocess to do the memory hungry computation, e.g., based on the multiprocessing module
use a library that implements the functionality in C, e.g., numpy, pandas
use another interpreter, e.g., PyPy
You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do, or even only one of them.
To reduce memory usage further, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed.
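As a rough sketch of that idea (assuming the node IDs have been remapped to 0-based indices; the toy arrays below just mirror the three columns of the input file):
import numpy as np
from scipy.sparse import coo_matrix

# src, dst, weight arrays built while reading the file (node IDs remapped to 0..n-1)
src = np.array([0, 1, 2])
dst = np.array([1, 2, 0])
weight = np.array([0.5, 1.25, 2.0])

n = max(src.max(), dst.max()) + 1
# build the sparse adjacency matrix once, then add its transpose to mirror the symmetry
adj = coo_matrix((weight, (src, dst)), shape=(n, n))
adj = (adj + adj.T).tocsr()   # CSR is compact and supports fast row slicing

print(adj[0, 1])              # weight of the edge 0-1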
What do you plan to do with your nodes list afterwards?
I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze; there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but no longer using any CPU).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry: if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    # imports live inside the function so that every pp worker has them
    import re
    import Bio.SeqIO
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])
res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
If I use a plain map instead of ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
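A minimal sketch of that swap, reusing the names from the question (untested against the actual data, so treat it as an outline):
import multiprocessing

pool = multiprocessing.Pool()   # one worker per core by default; pass a number to override
try:
    results = pool.map(analyse_repeats, zip(todo_list, organism_ids, filenames))
finally:
    pool.close()
    pool.join()

for organism, store_records in results:
    # process the output, same as with ppmap
    ...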
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully it helps you.