Is there a hidden possible deadlock in ppmap/parallel python? - python

I am having some trouble using a parallel version of map (the ppmap wrapper, implementation by Kirk Strauser).
The function I am trying to run in parallel runs a simple regular expression search on a large number of strings (protein sequences), which are parsed from the filesystem using BioPython's SeqIO. Each function call uses its own file.
If I run the function using a normal map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage and the main program does not even react to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but not using any CPU anymore).
e.g.
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
Furthermore, the workers do not seem to freeze on any particular data entry - if I manually kill the process and re-run the execution, it stops at a different point. (So I have temporarily resorted to keeping a list of finished entries and restarting the program multiple times.)
Is there any way to see where the problem is?
Sample of the code that I am running:
def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id, len(item), uniprot_id, letter))
                else:
                    # letter not present in the string, "zero" repeat
                    store_records.append((organism_id, 0, uniprot_id, letter))
        handle.close()
        return (organism, store_records)
    except IOError as e:
        print e
        return (organism, [])

res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

for res in res_generator:
    # process the output
If I use a plain map instead of ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)

You could try using one of the methods (like map) of the Pool object from the multiprocessing module instead. The advantage is that it's built in and doesn't require external packages. It also works very well.
By default, it uses as many worker processes as your computer has cores, but you can specify a higher number as well.
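For example, a minimal sketch of that approach applied to the question's workload (reusing analyse_repeats and the same zipped data; untested here):

import multiprocessing

if __name__ == '__main__':
    pool = multiprocessing.Pool()   # defaults to one worker per core
    try:
        results = pool.map(analyse_repeats,
                           zip(todo_list, organism_ids, filenames))
        for organism, store_records in results:
            pass  # process the output, as in the original loop
    finally:
        pool.close()
        pool.join()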

May I suggest using dispy (http://dispy.sourceforge.net)? Disclaimer: I am the author. I understand it doesn't address the question directly, but hopefully it helps you.
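For completeness, a rough and untested sketch of how the question's workload might map onto dispy's documented JobCluster/submit pattern:

import dispy

cluster = dispy.JobCluster(analyse_repeats)
jobs = [cluster.submit(item) for item in zip(todo_list, organism_ids, filenames)]
for job in jobs:
    organism, store_records = job()   # job() waits for the job and returns its result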

Related

Python multiprocessing imap without list comprehension?

I've parallelized my code using imap and the pyfastx library but the problem is that the sequences get loaded using a list comprehension. When the input file is large this becomes problematic because all seq values are loaded in memory. Is there a way to do this without completely populating the list that's inputted to imap?
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
for (A1, A2, B) in pool.imap(pSeq,
                             [seq for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False)],
                             chunksize=100000):
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
I also tried skipping the list comprehension and using apply_async, since pyfastx supports loading the sequences one at a time, but because each individual call is fairly short and there is no chunksize argument, this ends up taking far longer than not using multiprocessing at all.
import pyfastx
import multiprocessing

def pSeq(seq):
    ...
    return (A1, A2, B)

pool = multiprocessing.Pool(5)
results = []
for _, seq, _ in pyfastx.Fastq(temp2.name, build_index=False):
    results.append(pool.apply_async(pSeq, (seq,)))  # args must be a tuple
pool.close()
pool.join()
for result in results:
    A1, A2, B = result.get()   # AsyncResult.get() returns the worker's tuple
    if A1 == A2 and A1 != B:
        matchedA[A1][B] += 1
Any suggestions?
I know it's been a while since the original post, but I actually dealt with a similar issue so thought this may be helpful for someone at some point.
First of all, the general solution is to pass imap an iterator or generator object, instead of a list. In this case, you would modify pSeq to accept a tuple of 3 and simply drop the list comprehension.
I am including some code below to demonstrate what I mean, but let me preempt somebody trying this - it doesn't work (at least in my hands). My guess is this happens because, for some reason, pyfastx.Fastq doesn't return an iterator or generator object (I did verify this tidbit - the returned object doesn't implement next)...
I worked around this by using fastq-and-furious, which is comparably fast and does return a generator (and also has a more flexible python API). That workaround code is at the bottom, if you want to skip the "solution that should have worked".
At any rate, here is what I would have liked to work:
def pSeq(seq_tuple):
    _, seq, _ = seq_tuple
    ...
    return (A1, A2, B)

...

import multiprocessing as mp

with mp.Pool(5) as pool:
    # this fails (when I ran it on Mac, the program hung and I had to keyboard interrupt)
    # most likely due to pyfastx.Fastq not returning a generator or iterator
    parser = pyfastx.Fastq(temp2.name, build_index=False)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        pass  # do something with result
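(A hypothetical variant I have not verified against pyfastx: wrap the parser in a plain generator function, so that imap receives a true generator rather than the pyfastx object itself:)

def fastq_tuples(path):
    # plain generator; yields the same (name, seq, quality) tuples the for-loop sees
    for name, seq, qual in pyfastx.Fastq(path, build_index=False):
        yield (name, seq, qual)

with mp.Pool(5) as pool:
    for result in pool.imap(pSeq, fastq_tuples(temp2.name), chunksize=100000):
        pass  # do something with result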
Just to make this answer complete, I am also adding my workaround code, which does work for me. Unfortunately, I couldn't get it to run properly while still using pyfastx:
import fastqandfurious.fastqandfurious as fqf
import fastqandfurious._fastqandfurious as _fqf

# if you don't supply an entry function, fqf returns (name, seq, quality) as byte-strings
def pfx_like_entry(buf, pos, offset=0):
    """
    Return a tuple with identical format to pyfastx, so reads can be
    processed with the same function regardless of which parser we use
    """
    name = buf[pos[0]:pos[1]].decode('ascii')
    seq = buf[pos[2]:pos[3]].decode('ascii')
    quality = buf[pos[4]:pos[5]].decode('ascii')
    return name, seq, quality

# can be replaced with fqf.automagic_open(), gzip.open(), some other equivalent
with open(temp2.name, mode='rb') as handle, \
        mp.Pool(5) as pool:
    # this does work. You can also use biopython's fastq parsers
    # (or any other parser that returns an iterator/generator)
    parser = fqf.readfastq_iter(fh=handle,
                                fbufsize=20000,
                                entryfunc=pfx_like_entry,
                                _entrypos=_fqf.entrypos)
    result_iterator = pool.imap(pSeq, parser, chunksize=100000)
    for result in result_iterator:
        pass  # do something with result

Printing in loop

I have the following code to print to the system default printer:
import subprocess

def printFile(file):
    print("printing file...")
    with open(file, "rb") as source:
        printer = subprocess.Popen('/usr/bin/lpr', stdin=subprocess.PIPE)
        printer.stdin.write(source.read())
This function works quite well if I use it on its own. But if I use it in a loop construct like this:
while True:
    printFile(file)
    (...)
the printing job won't run (although the loop continues without error)...
I tried to build in a time delay, but it didn't help...
[Edit]: further investigation showed that the printing function (when called from the loop) puts the print jobs on hold...?
In modern Python 3, it is advised to use subprocess.run() in most cases instead of using subprocess.Popen directly. And I would leave it to lpr to read the file itself, rather than passing it via standard input:
def printFile(file):
    print("printing file...")
    cp = subprocess.run(['/usr/bin/lpr', file])
    return cp.returncode
Using subprocess.run allows you to ascertain that the lpr process finished correctly. And this way you don't have to read and write the complete file. You can even remove the file once lpr is finished.
Using Popen directly has some disadvantages here:
Using Popen.stdin might produce a deadlock if it overfills the OS pipe buffers (according to the Python docs); the communicate() sketch below sidesteps this.
Since you don't wait() for the Popen process to finish, you don't know if it finished without errors.
Depending on how lpr is set up, it might have rate controls. That is, it might stop printing if it gets a lot of print requests in a short span of time.
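If you do stick with Popen and standard input, a safer pattern (a sketch of the usual workaround, not what this answer recommends) is to let communicate() feed the data and then check the return code:

import subprocess

def printFile(file):
    print("printing file...")
    with open(file, "rb") as source:
        printer = subprocess.Popen(['/usr/bin/lpr'], stdin=subprocess.PIPE)
        printer.communicate(source.read())   # writes, closes stdin, and waits for lpr
    return printer.returncode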
Edit: I just thought of something. Most lpr implementations allow you to print more than one file at a time. So you could also do:
def printFile(files):
    """
    Print file(s).

    Arguments:
        files: string or sequence of strings.
    """
    if isinstance(files, str):
        files = [files]
    # if you want to be super strict...
    if not isinstance(files, (list, tuple)):
        raise ValueError('files must be a sequence type')
    else:
        if not all(isinstance(f, str) for f in files):
            raise ValueError('files must be a sequence of strings')
    cp = subprocess.run(['/usr/bin/lpr'] + files)
    return cp.returncode
That would print a single file or a whole bunch of them in one go...
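For example (the file names are placeholders):

printFile('report.ps')                            # one file
printFile(['page1.ps', 'page2.ps', 'page3.ps'])   # several files with a single lpr invocation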

Python Multithreading for search

I have a class that I have written that opens a text document and searches it line by line for the keywords that are input from a GUI that I have created in a different file. It works great; the only problem is that the text document I am searching is long (over 60,000 entries). I was looking at ways to make the search faster and have been playing around with multithreading, but have not had any success yet. Basically, the main program calls the search function, which takes a line and breaks it into individual words, then loops over them and checks each word against the keywords from the user. If a keyword is in that word, it counts it as true and adds a 1 to a list. At the end, if there are as many true entries as keywords, it adds that line to a set that is returned at the end of main.
What I would like to do is incorporate multithreading into this so that it runs much faster but still returns results at the end of the main function. Any advice or direction on how to accomplish this would be very helpful. I have tried to read a bunch of examples and watched a bunch of YouTube videos, but it didn't seem to transfer over when I tried. Thank you for your help and your time.
import pdb
from threading import Thread

class codeBook:
    def __init__(self):
        pass

    def main(self, search):
        count = 0
        results = set()
        with open('CodeBook.txt') as current_CodeBook:
            lines = current_CodeBook.readlines()
            for line in lines:
                line = line.strip()
                new_search = self.change_search(line, search)
                line = new_search[0]
                search = new_search[1]
                #if search in line:
                if self.speed_search(line, search) == True:
                    results.add(line)
                else:
                    pass
                count = count + 1
        results = sorted(list(results))
        return results

    def change_search(self, current_line, search):
        current_line = current_line.lower()
        search = search.lower()
        return current_line, search

    def search(self, line, keywords):
        split_line = line.split()
        split_keywords = keywords.split()
        numberOfTrue = list()
        for i in range(0, len(split_keywords)):
            if split_keywords[i] in line:
                numberOfTrue.append(1)
        if len(split_keywords) == len(numberOfTrue):
            return True
        else:
            return False
You can split the file into several parts and create a new thread that reads and processes a specific part. You can keep a data structure global to all threads and add lines that match the search query from all the threads to it. This structure should either be thread-safe or you need to use some kind of synchronization (like a lock) to work with it.
Note: the CPython interpreter has a global interpreter lock (GIL), so if you're using it and your application is CPU-heavy (which seems to be the case here), you might not get any benefit from multithreading whatsoever.
You can use the multiprocessing module instead. It comes with means of interprocess communication. A Queue looks like the right structure for your problem (each process could add matching lines to the queue). After that, you just need to get all lines from the queue and do what you did with the results in your code.
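A minimal sketch of that idea, using Pool.map rather than an explicit Queue so that no shared structure needs locking (the chunking scheme and names are illustrative, not from the original post):

import multiprocessing

def scan_chunk(args):
    # Each worker scans its own slice of lines and returns the matches as a list.
    lines, keywords = args
    matches = []
    for line in lines:
        low = line.strip().lower()
        if all(kw in low for kw in keywords):
            matches.append(line.strip())
    return matches

def parallel_search(keyword_string, filename='CodeBook.txt', nprocs=4):
    keywords = keyword_string.lower().split()
    with open(filename) as fh:
        lines = fh.readlines()
    size = (len(lines) + nprocs - 1) // nprocs
    chunks = [(lines[i:i + size], keywords) for i in range(0, len(lines), size)]
    pool = multiprocessing.Pool(nprocs)   # needs an "if __name__ == '__main__'" guard on Windows
    try:
        results = set()
        for matches in pool.map(scan_chunk, chunks):
            results.update(matches)
    finally:
        pool.close()
        pool.join()
    return sorted(results)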
While threading and/or multiprocessing can be beneficial and speed up execution, I would like to direct your attention to the possibility of optimizing your current algorithm, running in a single thread, before doing that.
Looking at your implementation I believe a lot of work is done several times for no reason. To the best of my understanding the following function will perform the same operation as your codeBook.main but with less overhead:
def search_keywords(keyword_string, filename='CodeBook.txt'):
    results = set()
    keywords = set()
    for keyword in keyword_string.lower().split():
        keywords.add(keyword)
    with open(filename) as code_book:
        for line in code_book:
            words = line.strip().lower()
            kws_present = True
            for keyword in keywords:
                kws_present = keyword in words
                if not kws_present:
                    break
            if kws_present:
                results.add(line)
    return sorted(list(results))
Try this function, as is or slightly modified for your needs, and see if that gives you a sufficient speed-up. Only when that is not enough should you look into more complex solutions, since introducing more threads/processes will invariably increase the complexity of your program.

How to speed up a repeated execution of exec()?

I have the following code:
for k in pool:
    x = []
    y = []
    try:
        exec(pool[k])
    except Exception as e:
        ...
    do_something(x)
    do_something_else(y)
where pool[k] is python code that will eventually append items to x and y (that's why I am using exec instead of eval).
I have already tried executing the same code with PyPy, but for this particular block I don't get much improvement; the line with exec is still my bottleneck.
That said, my question is:
Is there a faster alternative to exec?
If not, do you have any workaround to get some speed up in such a case?
--UPDATE--
To clarify, pool contains around one million keys, and each key is associated with a script (around 50 lines of code). The inputs for the scripts are defined before the for loop, and the outputs generated by a script are stored in x and y. So each script has lines such as x.append(something) and y.append(something). The rest of the program evaluates the results and scores each script. Therefore, I need to loop over each script, execute it, and process the results. The scripts are originally stored in different text files; pool is a dictionary obtained by parsing those files.
P.S.
Using the pre-compiled version of the code:
for k in pool.keys():
    pool[k] = compile(pool[k], '<string>', 'exec')
I got a 5x speed increase; not much, but it is something already. I am experimenting with other solutions...
If you really need to execute code in such a manner, use compile() to prepare it.
That is, do not pass raw Python source to exec but a compiled code object: run compile() on each script beforehand so it is already byte-compiled.
Still, it would make more sense to write a function that performs what you need on an input argument (i.e. pool[k]) and returns the results corresponding to x and y.
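For instance, a minimal sketch combining both ideas (pre-compiling every script and giving each run its own explicit namespace), assuming pool maps keys to script source as described in the question and that do_something / do_something_else are the functions from the original loop:

compiled = {}
for k in pool:
    compiled[k] = compile(pool[k], '<script %s>' % k, 'exec')

for k in compiled:
    namespace = {'x': [], 'y': []}   # plus whatever inputs the scripts expect
    try:
        exec(compiled[k], namespace)
    except Exception:
        continue
    do_something(namespace['x'])
    do_something_else(namespace['y'])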
If you are reading your code out of files, you also have I/O slowdowns to cope with, so it would be nice to have these files already compiled to *.pyc.
You may also think about using execfile() in Python 2.
An idea for using functions in a pool:
import math   # the generated functions below use math.sin / math.pi

template = """\
def newfunc ():
%s
 return result
"""

pool = []  # For iterating it will be faster if it is a list (just a bit)

# This compiles code as a real function and adds a pointer to a pool
def AddFunc(code):
    # indent the body by one space so it sits inside newfunc
    code = "\n".join([" " + x for x in code.splitlines()])
    exec template % code   # Python 2 exec statement
    pool.append(newfunc)

# Usage:
AddFunc("""\
a = 8.34**0.5
b = 8
c = 13
result = []
for x in range(10):
    result.append(math.sin(a*b+c)/math.pi+x)""")

for f in pool:
    x = f()

Python Mock Process for Unit Testing

Background:
I am currently writing a process monitoring tool (Windows and Linux) in Python and implementing unit test coverage. The process monitor hooks into the Windows API function EnumProcesses on Windows and monitors the /proc directory on Linux to find current processes. The process names and process IDs are then written to a log which is accessible to the unit tests.
Question:
When I unit test the monitoring behavior I need a process to start and terminate. I would love it if there were a (cross-platform?) way to start and terminate a fake system process that I could uniquely name (and track its creation in a unit test).
Initial ideas:
I could use subprocess.Popen() to open any system process but this runs into some issues. The unit tests could falsely pass if the process I'm using to test is run by the system as well. Also, the unit tests are run from the command line and any Linux process I can think of suspends the terminal (nano, etc.).
I could start a process and track it by its process ID but I'm not exactly sure how to do this without suspending the terminal.
These are just thoughts and observations from initial testing and I would love it if someone could prove me wrong on either of these points.
I am using Python 2.6.6.
Edit:
Get all Linux process IDs:
try:
    processDirectories = os.listdir(self.PROCESS_DIRECTORY)
except IOError:
    return []
return [pid for pid in processDirectories if pid.isdigit()]
Get all Windows process IDs:
import ctypes, ctypes.wintypes

Psapi = ctypes.WinDLL('Psapi.dll')
EnumProcesses = self.Psapi.EnumProcesses
EnumProcesses.restype = ctypes.wintypes.BOOL

count = 50
while True:
    # Build arguments to EnumProcesses
    processIds = (ctypes.wintypes.DWORD * count)()
    size = ctypes.sizeof(processIds)
    bytes_returned = ctypes.wintypes.DWORD()
    # Call enum processes to find all processes
    if self.EnumProcesses(ctypes.byref(processIds), size, ctypes.byref(bytes_returned)):
        if bytes_returned.value < size:
            return processIds
        else:
            # We weren't able to get all the processes so double our size and try again
            count *= 2
    else:
        print "EnumProcesses failed"
        sys.exit()
Windows code is from here
edit: this answer is getting long :), but some of my original answer still applies, so I leave it in :)
Your code is not so different from my original answer. Some of my ideas still apply.
When you are writing Unit Test, you want to only test your logic. When you use code that interacts with the operating system, you usually want to mock that part out. The reason being that you don't have much control over the output of those libraries, as you found out. So it's easier to mock those calls.
In this case, there are two libraries that are interacting with the system: os.listdir and EnumProcesses. Since you didn't write them, we can easily fake them to return what we need, which in this case is a list.
But wait, in your comment you mentioned:
"The issue I'm having with it however is that it really doesn't test
that my code is seeing new processes on the system but rather that the
code is correctly monitoring new items in a list."
The thing is, we don't need to test the code that actually monitors the processes on the system, because it's third-party code. What we need to test is that your code logic handles the returned processes, because that's the code you wrote. The reason we are testing over a list is that that's what your logic is doing: os.listdir and EnumProcesses return a list of pids (numeric strings and integers, respectively) and your code acts on that list.
I'm assuming your code is inside a class (you are using self in your code). I'm also assuming that these calls are isolated inside their own methods (you are using return). So this will be sort of what I suggested originally, except with actual code :) I don't know if they are in the same class or different classes, but it doesn't really matter.
Linux method
Now, testing your Linux process function is not that difficult. You can patch os.listdir to return a list of pids.
def getLinuxProcess(self):
    try:
        processDirectories = os.listdir(self.PROCESS_DIRECTORY)
    except IOError:
        return []
    return [pid for pid in processDirectories if pid.isdigit()]
Now for the test.
import unittest
from fudge import patched_context
import os
import LinuxProcessClass # class that contains getLinuxProcess method

def test_LinuxProcess(self):
    """Test the logic of our getLinuxProcess.

    We patch os.listdir and return our own list, because os.listdir
    returns a list. We do this so that we can control the output
    (we test *our* logic, not a built-in library's functionality).
    """
    # Test we can parse our pids
    fakeProcessIds = ['1', '2', '3']
    with patched_context(os, 'listdir', lambda x: fakeProcessIds):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()
        expected = [1, 2, 3]
        self.assertEqual(result, expected)

    # Test we can handle IOError
    def raiseIOError(x):   # a lambda cannot contain a raise statement
        raise IOError
    with patched_context(os, 'listdir', raiseIOError):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()
        expected = []
        self.assertEqual(result, expected)

    # Test we only get pids
    fakeProcessIds = ['1', '2', '3', 'do', 'not', 'parse']
    .....
Windows method
Testing your Window's method is a little trickier. What I would do is the following:
def prepareWindowsObjects(self):
    """Create and set up objects needed to get the windows processes"""
    ...
    Psapi = ctypes.WinDLL('Psapi.dll')
    EnumProcesses = Psapi.EnumProcesses
    EnumProcesses.restype = ctypes.wintypes.BOOL
    self.EnumProcesses = EnumProcesses
    ...

def getWindowsProcess(self):
    count = 50
    while True:
        .... # Build arguments to EnumProcesses and call enum processes
        if self.EnumProcesses(ctypes.byref(processIds),...
            ..
        else:
            return []
I separated the code into two methods to make it easier to read (I believe you are already doing this). Here is the tricky part: EnumProcesses is using pointers and they are not easy to play with. Another thing is that I don't know how to work with pointers in Python, so I can't tell you an easy way to mock that out =P
What I can tell you is to simply not test it. Your logic there is very minimal. Besides increasing the size of count, everything else in that function is creating the space EnumProcesses pointers will use. Maybe you can add a limit to the count size but other than that, this method is short and sweet. It returns the windows processes and nothing more. Just what I was asking for in my original comment :)
So leave that method alone. Don't test it. Make sure, though, that anything that uses getWindowsProcess and getLinuxProcess gets mocked out as per my original suggestion.
Hopefully this makes more sense :) If it doesn't let me know and maybe we can have a chat session or do a video call or something.
original answer
I'm not exactly sure how to do what you are asking, but whenever I need to test code that depends on some outside force (external libraries, popen or in this case processes) I mock out those parts.
Now, I don't know how your code is structured, but maybe you can do something like this:
def getWindowsProcesses(self, ...):
    '''Call Windows API function EnumProcesses and
    return the list of processes
    '''
    # ... call EnumProcesses ...
    return listOfProcesses

def getLinuxProcesses(self, ...):
    '''Look in /proc dir and return list of processes'''
    # ... look in /proc ...
    return listOfProcesses
These two methods only do one thing, get the list of processes. For Windows, it might just be a call to that API and for Linux just reading the /proc dir. That's all, nothing more. The logic for handling the processes will go somewhere else. This makes these methods extremely easy to mock out since their implementations are just API calls that return a list.
Your code can then easy call them:
def getProcesses(...):
    '''Get the processes running.'''
    isLinux = # ... logic for determining OS ...
    if isLinux:
        processes = getLinuxProcesses(...)
    else:
        processes = getWindowsProcesses(...)
    # ... do something with processes, write to log file, etc ...
In your test, you can then use a mocking library such as Fudge. You mock out these two methods to return what you expect them to return.
This way you'll be testing your logic since you can control what the result will be.
from fudge import patched_context
...

def test_getProcesses(self, ...):
    monitor = MonitorTool(..)
    # Patch the method that gets the processes. Whenever it gets called, return
    # our predetermined list.
    originalProcesses = [....pids...]
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic is right ...

    # Let's "add" some new processes and test that our logic realizes new
    # processes were added.
    newProcesses = [...]
    updatedProcesses = originalProcesses + newProcesses
    with patched_context(monitor, "getLinuxProcesses", lambda x: updatedProcesses):
        monitor.getProcesses()
        # ... assert logic caught new processes ...

    # Let's "kill" our new processes and test that our logic can handle it
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic caught processes were 'killed' ...
Keep in mind that if you test your code this way, you won't get 100% code coverage (since your mocked methods won't be run), but this is fine. You're testing your code and not third party's, which is what matters.
Hopefully this might be able to help you. I know it doesn't answer your question, but maybe you can use this to figure out the best way to test your code.
Your original idea of using subprocess is a good one. Just create your own executable and name it something that identifies it as a testing thing. Maybe make it do something like sleep for a while.
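For instance, a rough sketch of that idea with subprocess (the inline sleep script and duration are placeholders; putting the sleep into a small script file with a distinctive name would give the process a recognizable name in the monitor's log):

import subprocess
import sys

# Start a throwaway child process that just sleeps; its pid should appear
# in the monitor's log while it is alive.
proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])
test_pid = proc.pid

# ... run the monitor here and assert that test_pid was logged ...

proc.terminate()   # clean up the helper process
proc.wait()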
Alternatively, you could actually use the multiprocessing module. I've not used Python on Windows much, but you should be able to get process-identifying data out of the Process object you create:
import multiprocessing
import time

p = multiprocessing.Process(target=time.sleep, args=(30,))
p.start()
pid = p.pid   # Process objects expose the child's pid as an attribute
