Python multiprocessing imap without list comprehension?

I've parallelized my code using imap and the pyfastx library, but the problem is that the sequences get loaded using a list comprehension. When the input file is large this becomes problematic because all seq values are loaded into memory. Is there a way to do this without fully populating the list that's passed to imap?
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
for (A1, A2, B) in pool.imap(
        pSeq,
        [seq for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False)],
        chunksize=100000):
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
I also tried skipping the list comprehension and using the apply_async function since pyfastx supports loading the sequences one at a time but because each individual loop is fairly short and there's no chunksize argument this ends up taking way longer than just not using multiprocessing at all.
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
results = []
for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False):
    results.append(pool.apply_async(pSeq, (seq,)))
pool.close()
pool.join()
for result in results:
    A1, A2, B = result.get()
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
Any suggestions?

I know it's been a while since the original post, but I actually dealt with a similar issue so thought this may be helpful for someone at some point.
First of all, the general solution is to pass imap an iterator or generator object, instead of a list. In this case, you would modify pSeq to accept a tuple of 3 and simply drop the list comprehension.
I am including some code below to demonstrate what I mean, but let me pre-empt anyone trying this - it doesn't work (at least in my hands). My guess is this happens because, for some reason, pyfastx.Fastq doesn't return an iterator or generator object (I did verify this tidbit - the returned object doesn't implement __next__)...
I worked around this by using fastq-and-furious, which is comparably fast and does return a generator (and also has a more flexible python API). That workaround code is at the bottom, if you want to skip the "solution that should have worked".
At any rate, here is what I would have liked to work:
def pSeq(seq_tuple):
    _, seq, _ = seq_tuple
    ...
    return (A1, A2, B)

...

import multiprocessing as mp

with mp.Pool(5) as pool:
    # this fails (when I ran it on a Mac, the program hung and I had to
    # keyboard-interrupt it), most likely because pyfastx.Fastq does not
    # return a generator or iterator
    parser = pyfastx.Fastq(temp2.name, build_index=False)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        ...  # do something with result
Just to make this answer complete, I am also adding my workaround code, which does work for me. Unfortunately, I couldn't get it to run properly while still using pyfastx:
import fastqandfurious.fastqandfurious as fqf
import fastqandfurious._fastqandfurious as _fqf

# if you don't supply an entry function, fqf returns
# (name, seq, quality) as byte-strings
def pfx_like_entry(buf, pos, offset=0):
    """
    Return a tuple with the same format as pyfastx, so reads can be
    processed with the same function regardless of which parser we use
    """
    name = buf[pos[0]:pos[1]].decode('ascii')
    seq = buf[pos[2]:pos[3]].decode('ascii')
    quality = buf[pos[4]:pos[5]].decode('ascii')
    return name, seq, quality

# open() can be replaced with fqf.automagic_open(), gzip.open(), or some other equivalent
with open(temp2.name, mode='rb') as handle, \
        mp.Pool(5) as pool:
    # this does work. You can also use biopython's fastq parser
    # (or any other parser that returns an iterator/generator)
    parser = fqf.readfastq_iter(fh=handle,
                                fbufsize=20000,
                                entryfunc=pfx_like_entry,
                                _entrypos=_fqf.entrypos)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        ...  # do something with result

Related

Python: Why does Pool.map() hang when attempting to use the input arguments to its map function?

I have the following function (shortened for readability), which I parallelize using Python's (3.5) multiprocessing module:
def evaluate_prediction(enumeration_tuple):
    i = enumeration_tuple[0]
    logits_pred = enumeration_tuple[1]
    print("This prints successfully")
    print("This never gets printed: ")
    print(enumeration_tuple[0])
    filename = sample_names_test[i]
    onehots_pred = logits_to_onehots(logits_pred)
    np.save("/media/nfs/7_raid/ebos/models/fcn/" + channels + "/test/ndarrays/" + filename, onehots_pred)
However, this function hangs whenever I attempt to read its input argument. Execution can get past the logits_pred = enumeration_tuple[1] line, as evidenced by a print statement printing a simple string, but it halts whenever I print(logits_pred). So apparently, whenever I actually need the passed value, the process stops. I do not get an exception or error message. When using either Python's built-in map() function or a for-loop, the function finishes successfully. I should have sufficient memory and computing power available. All processes are writing to different files. enumerate(predictions) yields correct index-value pairs, as expected. I call this function using Pool.map():
pool = multiprocessing.Pool()
file_results = pool.map(evaluate_prediction, enumerate(predictions))
Why is it hanging? And how can I get an exception, so I know what's going wrong?
UPDATE: After outsourcing the mapped function to another module, importing it from there, and adding __init__.py to my directory, I manage to print the first item in the tuple, but not the second.
I had a similar issue before, and a solution that worked for me was to put the function you want to parallelize in a separate module and then import it.
from eval_prediction import evaluate_prediction
pool = multiprocessing.Pool()
file_results = pool.map(evaluate_prediction, enumerate(predictions))
I assume you will save the function definition in a file named eval_prediction.py in the same directory. Make sure you have an __init__.py as well.

Python 3 multiprocessing or multithreading for multiple functions?

I want to utilize all my computer's cores to run the following Pseudocode (the actual code is too long):
def function2(modifiedList):
    process modifiedList
    return value

mainList = [a, b, c, ..., z]

def function1(mainList):
    process mainList
    create modifiedList
    result = function2(modifiedList)
    return result

calculator(function1)
Not sure how to do multiprocessing or multithreading when a function is called from inside another function.
You should look at the multiprocessing module in the standard library: https://docs.python.org/3.6/library/multiprocessing.html
However, the code you posted would not be easy to parallelize, because of its sequential nature.
A much better approach would be to find an easily parallelizable part of your problem and work from there.
As the documentation says, if your list is long enough, and you can process each element independently, you could try to substitute
for i in l:
    result.append(f(i))
with
result = p.map(f, l)

Python Multithreading for search

I have a class that opens a text document and searches it line by line for keywords that are input from a GUI I created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries). I was looking at ways to make the search faster and have been playing around with multithreading, but have not had any success yet. Basically, the main program calls the search function, which takes the line and breaks it into individual words. Then a loop checks each word against the keywords from the user. If a keyword is found in a word, it counts as true and a 1 is added to a list. At the end, if there are as many true statements as keywords, the line is added to a set that is returned at the end of main.
What I would like to do is incorporate multithreading so that it runs much faster but still returns results at the end of the main function. Any advice or direction on accomplishing this would be very helpful. I have tried to read a bunch of examples and watched a bunch of YouTube videos, but it didn't seem to transfer over when I tried. Thank you for your help and your time.
import pdb
from threading import Thread

class codeBook:
    def __init__(self):
        pass

    def main(self, search):
        count = 0
        results = set()
        with open('CodeBook.txt') as current_CodeBook:
            lines = current_CodeBook.readlines()
        for line in lines:
            line = line.strip()
            new_search = self.change_search(line, search)
            line = new_search[0]
            search = new_search[1]
            #if search in line:
            if self.speed_search(line, search) == True:
                results.add(line)
            else:
                pass
            count = count + 1
        results = sorted(list(results))
        return results

    def change_search(self, current_line, search):
        current_line = current_line.lower()
        search = search.lower()
        return current_line, search

    def speed_search(self, line, keywords):
        split_line = line.split()
        split_keywords = keywords.split()
        numberOfTrue = list()
        for i in range(0, len(split_keywords)):
            if split_keywords[i] in line:
                numberOfTrue.append(1)
        if len(split_keywords) == len(numberOfTrue):
            return True
        else:
            return False
You can split the file into several parts and create a new thread that reads and processes a specific part. You can keep a data structure global to all threads and add lines that match the search query from all the threads to it. This structure should either be thread-safe or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading whatsoever.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process could add matching lines to the queue). After that, you just need to get all lines from the queue and do what you did with the results in your code.
While threading and/or multiprocessing can be beneficial and speed up execution, I would want to direct your attention to looking into the possibility to optimize your current algorithm, running in a single thread, before doing that.
Looking at your implementation I believe a lot of work is done several times for no reason. To the best of my understanding the following function will perform the same operation as your codeBook.main but with less overhead:
def search_keywords(keyword_string, filename='CodeBook.txt'):
    results = set()
    keywords = set()
    for keyword in keyword_string.lower().split():
        keywords.add(keyword)
    with open(filename) as code_book:
        for line in code_book:
            words = line.strip().lower()
            kws_present = True
            for keyword in keywords:
                kws_present = keyword in words
                if not kws_present:
                    break
            if kws_present:
                results.add(line)
    return sorted(list(results))
Try this function, as is, or slightly modified for your needs, and see if that gives you a sufficient speed-up. Only if that is not enough should you look into more complex solutions, as introducing more threads/processes will invariably increase the complexity of your program.

How to speed up a repeated execution of exec()?

I have the following code:
for k in pool:
    x = []
    y = []
    try:
        exec(pool[k])
    except Exception as e:
        ...
    do_something(x)
    do_something_else(y)
where pool[k] is python code that will eventually append items to x and y (that's why I am using exec instead of eval).
I have already tried executing the same code with pypy, but for this particular block I don't see much improvement; the line with exec is still my bottleneck.
That said, my question is:
Is there a faster alternative to exec?
If not, do you have any workaround to get some speed up in such a case?
--UPDATE--
To clarify, pool contains around one million keys, and each key is associated with a script (around 50 lines of code). The inputs for the scripts are defined before the for loop, and the outputs generated by a script are stored in x and y. So, each script has a line in the code stating x.append(something) and y.append(something). The rest of the program will evaluate the results and score each script. Therefore, I need to loop over each script, execute it and process the results. The scripts are originally stored in different text files. pool is a dictionary obtained by parsing these files.
P.S.
Using the pre-compiled version of the code:
for k in pool.keys():
    pool[k] = compile(pool[k], '<string>', 'exec')
I got a 5x speed increase; not much, but it is something already. I am experimenting with other solutions...
If you really need to execute some code in such manner, use compile() to prepare it.
I.e. do not pass raw Python code into exec but compiled object. Use compile() on your codes before to make them Python byte compiled objects.
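For example (with a toy script standing in for one pool entry), the compiled object can be exec'd repeatedly without re-parsing the source each time:

```python
# toy stand-in for one of the pooled scripts: it appends to x and y
src = "x.append(1)\ny.append(2)"

# parse and byte-compile once...
code_obj = compile(src, '<string>', 'exec')

x, y = [], []
# ...then execute the code object many times with no re-parsing cost
for _ in range(3):
    exec(code_obj)
print(x, y)
```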
But still, it'll make more sense to write a function that will perform what you need on an input argument i.e. pool[k] and return results that are corresponding to x and y.
If you are getting your code out of files, you also have IO slowdowns to cope with. So it would be nice to have these files already compiled to *.pyc.
You may think about using execfile() in Python2.
An idea for using functions in a pool:
template = """\
def newfunc ():
%s
 return result
"""

pool = []  # For iterating it will be faster if it is a list (just a bit)

# This compiles code as a real function and adds a pointer to a pool
def AddFunc(code):
    code = "\n".join([" " + x for x in code.splitlines()])
    exec template % code
    pool.append(newfunc)

# Usage:
AddFunc("""\
a = 8.34**0.5
b = 8
c = 13
result = []
for x in range(10):
    result.append(math.sin(a*b+c)/math.pi+x)""")

for f in pool:
    x = f()

Is there a hidden possible deadlock in ppmap/parallel python?

I am having some trouble with using a parallel version of map (ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry - if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and re-started the program multiple times).
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads the whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing a particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C', 'M', 'F', 'I', 'L', 'V', 'W', 'Y', 'A', 'G', 'T', 'S',
               'Q', 'N', 'E', 'D', 'H', 'R', 'K', 'P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % letter).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])

res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
for res in res_generator:
    # process the output
    ...
If I use simple map instead of the ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully helps you.
