I am writing a script that processes some mmaps concurrently with multiprocessing.Process and updates a result list stored in an mmap and locked with a mutex.
My function to write to the result list looks like this:

def update_result(result_mmap, new_value, new_value_index, sema):
    sema.acquire()
    result_mmap.seek(0)
    old_result = result_mmap.readline().split("\t")
    old_result[new_value_index] = new_value
    new_result = "\t".join(map(str, old_result))
    result_mmap.resize(len(new_result))
    result_mmap.seek(0)
    result_mmap.write(new_result)
    sema.release()
This works SOMETIMES, but other times, depending on the order of execution of the processes, it seems that the result_mmap isn't resizing properly. I am not sure where to look from here. I know that a race condition exists, but I don't know why.
Edit: This is the function that calls update_result:
def apply_function(mmapped_files, function, result_mmap, result_index, sema):
    for mf in mmapped_files:
        accumulator = int(mf.readline())
        while True:
            line = mf.readline()
            if line is None or line == '':
                break
            num = int(line)
            accumulator = function(num, accumulator)
        update_result(result_mmap, accumulator, result_index, sema)
Maybe I'm wrong, but are you sure the semaphore really works between the processes (is it a system-level mutex)? If it isn't, it won't help, because processes do not share the same memory space. You might want to consider using the threading library instead, so that all the threads share the same semaphore.
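For illustration only (this sketch and its names are mine, not part of the original answer): a semaphore from the multiprocessing module, created in the parent and passed to each Process, is the kind of system-level primitive that does work across processes, which is exactly the distinction being raised here.

from multiprocessing import Process, Semaphore

def worker(sema, worker_id):
    # "with sema" acquires and releases around the critical section
    with sema:
        print("worker", worker_id, "holds the semaphore")

if __name__ == "__main__":
    sema = Semaphore(1)  # initial count of 1, so it behaves like a mutex
    procs = [Process(target=worker, args=(sema, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()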
Whenever I use my (other) multiprocessing code it works fine, but I do not know how to adapt it to report progress, for example "Completed 5 / 10 files". Basically, I would like to adapt the code below to allow multiprocessing while still returning the count.
So I use:
file_paths = r"path to file with paths"
count = 0
pool = Pool(16)
pool.map(process_control, file_paths)
pool.close()
pool.join()
Within process_control, at the end of the function, I have count += 1 and return count.
I guess the equivalent code would be something like:
def process_control(count, file_path):
    # do stuff
    count += 1
    print("Process {} / {} completed".format(count, len(file_paths)))
    return count

file_paths = r"path to file with paths"
count = 0
for path in file_paths:
    count = process_control(count, path)
Something like that. I hope my explanation is clear.
Each subprocess has its own copy of count, so all it can do is track the work done in that one process; the count won't aggregate across all of the processes. But the parent can do the counting. map waits for all tasks to complete, so that isn't helpful. imap is better because it iterates, but it maintains order, so reporting is still delayed. imap_unordered with chunksize=1 is your best option: each task's return value (even if it is None) is handed back to the parent immediately.
import multiprocessing

def process_control(file_path):
    # do stuff
    pass

file_paths = ["path1", ...]

with multiprocessing.Pool(16) as pool:
    count = 0
    for _ in pool.imap_unordered(process_control, file_paths, chunksize=1):
        count += 1
        print("Process {} / {} completed".format(count, len(file_paths)))
A note on chunksize. There are costs to using a pool - each work item needs to be sent to the subprocess and its value returned. This back-and-forth IPC is relatively expensive, so the pool will "chunk" the work items, meaning that it will send many work items to a given subprocess all in one chunk and the process will only return when the entire chunk of data has been processed through the worker function.
This is great when there are many relatively short work items. But suppose that different work items take different amounts of time to execute. Then there will be a tall-pole subprocess still working on its chunk even though the others have finished.
More important for your case, the results aren't posted back to the parent until the chunk completes so you don't get real-time reporting of completion.
Set chunksize to 1 and the subprocess will return results immediately for more accurate accounting.
For simple cases, the previous answer by @tedelaney is excellent.
For more complicated cases, Value provides shared memory:
from multiprocessing import Value

counter = Value('i', 0)

# increment the value
with counter.get_lock():
    counter.value += 1

# get the value; the synchronized wrapper handles locking for the read
processes_done = counter.value
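Going one step beyond the answer above (this sketch and its names are mine): a Value generally cannot be passed to pool workers as an ordinary task argument, so a common pattern is to hand it to the workers through the pool initializer:

from multiprocessing import Pool, Value

counter = None  # set in each worker process by init_worker

def init_worker(shared_counter):
    global counter
    counter = shared_counter

def work(path):
    # ... hypothetical per-file work would go here ...
    with counter.get_lock():
        counter.value += 1
    return path

if __name__ == "__main__":
    file_paths = ["path1", "path2", "path3"]
    shared = Value('i', 0)
    with Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        for _ in pool.imap_unordered(work, file_paths):
            print("Completed {} / {} files".format(shared.value, len(file_paths)))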
I have a concurrent.futures.ThreadPoolExecutor and a list, and with the following code I add futures to the ThreadPoolExecutor:
for id in id_list:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
And then I wait upon the list:
concurrent.futures.wait(self._futures)
However, self.myfunc does some network I/O, so there will be occasional network exceptions. When an error occurs, self.myfunc submits a new self.myfunc with the same id to the same thread pool and adds a new future to the same list, just as above:
try:
    do_stuff(id)
except:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
    return None
Here comes the problem: I got an error on the line of concurrent.futures.wait(self._futures):
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 277, in wait
    f._waiters.remove(waiter)
ValueError: list.remove(x): x not in list
How should I properly add new Futures to a list while already waiting upon it?
Looking at the implementation of wait(), it certainly doesn't expect that anything outside concurrent.futures will ever mutate the list passed to it. So I don't think you'll ever get that "to work". It's not just that it doesn't expect the list to mutate, it's also that significant processing is done on list entries, and the implementation has no way to know that you've added more entries.
Untested, I'd suggest trying this instead: skip all that, and just keep a running count of threads still active. A straightforward way is to use a Condition guarding a count.
Initialization:
self._count_cond = threading.Condition()
self._thread_count = 0
When my_func is entered (i.e., when a new thread starts):
with self._count_cond:
    self._thread_count += 1
When my_func is done (i.e., when a thread ends), for whatever reason (exceptional or not):
with self._count_cond:
    self._thread_count -= 1
    self._count_cond.notify()  # wake up the waiting logic
And finally the main waiting logic:
with self._count_cond:
    while self._thread_count:
        self._count_cond.wait()
POSSIBLE RACE
It seems possible that the thread count could reach 0 while work for a new thread has been submitted, but before its my_func invocation starts running (and so before _thread_count is incremented to account for the new thread).
So the:
with self._count_cond:
    self._thread_count += 1
part should really be done instead right before each occurrence of
self._thread_pool.submit(self.myfunc, id)
Or write a new method to encapsulate that pattern; e.g., like so:
def start_new_thread(self, id):
    with self._count_cond:
        self._thread_count += 1
    self._thread_pool.submit(self.myfunc, id)
A DIFFERENT APPROACH
Offhand, I expect this could work too (but, again, haven't tested it): keep all your code the same except change how you're waiting:
while self._futures:
    self._futures.pop().result()
So this simply waits for one thread at a time, until none remain.
Note that .pop() and .append() on lists are atomic in CPython, so no need for your own lock. And because your my_func() code appends before the thread it's running in ends, the list won't become empty before all threads really are done.
AND YET ANOTHER APPROACH
Keep the original waiting code, but rework the rest not to create new threads in case of exception. Like rewrite my_func to return True if it quits due to an exception, return False otherwise, and start threads running a wrapper instead:
def my_func_wrapper(self, id):
    keep_going = True
    while keep_going:
        keep_going = self.my_func(id)
This may be especially attractive if you someday decide to use multiple processes instead of multiple threads (creating new processes can be a lot more expensive on some platforms).
AND A WAY USING cf.wait()
Another way is to change just the waiting code:
while self._futures:
    fs = self._futures[:]
    for f in fs:
        self._futures.remove(f)
    concurrent.futures.wait(fs)
Clear? This makes a copy of the list to pass to .wait(), and the copy is never mutated. New threads show up in the original list, and the whole process is repeated until no new threads show up.
Which of these ways makes most sense seems to me to depend mostly on pragmatics, but there's not enough info about all you're doing for me to make a guess about that.
I have a python script where at the top of the file I have:
result_queue = Queue.Queue()
key_list = *a large list of small items* #(actually from bucket.list() via boto)
I have learned that Queues are process safe data structures. I have a method:
def enqueue_tasks(keys):
    for key in keys:
        try:
            result = perform_scan.delay(key)
            result_queue.put(result)
        except:
            print "failed"
The perform_scan.delay() function here actually calls a celery worker, but I don't think that is relevant (it is an asynchronous process call).
I also have:
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Lastly I have a main() function:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print result_queue.qsize()
The result from the print statement is a 0. Yet if I include a print statement of the size of result_queue in enqueue_tasks, while the program is running, I can see that the size is increasing and things are being added to the queue.
Ideas of what is happening?
It looks like there's a simpler solution to this problem.
You're building a list of futures. The whole point of futures is that they're future results. In particular, whatever each function returns, that's the (eventual) value of the future. So, don't do the whole "push results onto a queue" thing at all, just return them from the task function, and pick them up from the futures.
The simplest way to do this is to break that loop up so that each key is a separate task, with a separate future. I don't know whether that's appropriate for your real code, but if it is:
def do_task(key):
    try:
        return perform_scan.delay(key)
    except:
        print "failed"
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_task, key) for key in key_list]
    # If you want to do anything with these results, you probably want
    # a loop around concurrent.futures.as_completed or similar here,
    # rather than waiting for them all to finish, ignoring the results,
    # and printing the number of them.
    concurrent.futures.wait(futures)
    print len(futures)
Of course that doesn't do the grouping. But do you need it?
The most likely reason for the grouping to be necessary is that the tasks are so tiny that the overhead in scheduling them (and pickling the inputs and outputs) swamps the actual work. If that's true, then you can almost certainly wait until a whole batch is done to return any results. Especially given that you're not even looking at the results until they're all done anyway. (This model of "split into groups, process each group, merge back together" is pretty common in cases like numerical work, where each element may be tiny, or elements may not be independent of each other, but there are groups that are big enough or independent from the rest of the work.)
At any rate, that's almost as simple:
def do_tasks(keys):
    results = []
    for key in keys:
        try:
            result = perform_scan.delay(key)
            results.append(result)
        except:
            print "failed"
    return results
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    print sum(len(future.result()) for future in concurrent.futures.as_completed(futures))
Or, if you prefer to first wait and then calculate:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print sum(len(future.result()) for future in futures)
But again, I doubt you need even this.
You need to use a multiprocessing.Queue, not a Queue.Queue. Queue.Queue is thread-safe, not process-safe, so the changes you make to it in one process are not reflected in any others.
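A note beyond that answer (this sketch and its names are mine): a bare multiprocessing.Queue cannot be handed to ProcessPoolExecutor workers as an argument, so one workable variant is a multiprocessing.Manager().Queue(), whose proxy pickles cleanly. A minimal Python 3 sketch:

import concurrent.futures
import multiprocessing

def scan_keys(keys, queue):
    # hypothetical stand-in for the real perform_scan.delay(key) calls
    for key in keys:
        queue.put("scanned:" + str(key))

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    result_queue = manager.Queue()  # process-safe; the proxy can be pickled

    key_list = list(range(100))
    groups = [key_list[i:i + 10] for i in range(0, len(key_list), 10)]

    with concurrent.futures.ProcessPoolExecutor(10) as executor:
        futures = [executor.submit(scan_keys, group, result_queue) for group in groups]
        concurrent.futures.wait(futures)

    print(result_queue.qsize())  # now reflects everything the workers added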
I am using this code:
def startThreads(arrayofkeywords):
    global i
    i = 0
    while len(arrayofkeywords):
        try:
            if i < maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i + 1
                thread = doStuffWith(keyword)
                thread.start()
        except KeyboardInterrupt:
            sys.exit()
    thread.join()
for threading in Python. I have almost everything done, but I don't know how to manage the results of each thread: each thread produces an array of strings as its result. How can I join all those arrays into one safely? If I try writing into a global array, two threads could be writing to it at the same time.
First, you actually need to save all those thread objects to call join() on them. As written, you're saving only the last one of them, and then only if there isn't an exception.
An easy way to do multithreaded programming is to give each thread all the data it needs to run, and then have it not write to anything outside that working set. If all threads follow that guideline, their writes will not interfere with each other. Then, once a thread has finished, have only the main thread aggregate the results into a global array. This is known as "fork/join parallelism."
If you subclass the Thread object, you can give it space to store that return value without interfering with other threads. Then you can do something like this:
class MyThread(threading.Thread):
    def __init__(self, ...):
        self.result = []
        ...

def main():
    # doStuffWith() returns a MyThread instance
    threads = [doStuffWith(k) for k in arrayofkeywords[:maxThreads]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
        ret = t.result
        # process return value here
Edit:
After looking around a bit, it seems like the above method isn't the preferred way to do threads in Python. The above is more of a Java-esque pattern for threads. Instead you could do something like:
def handler(outList):
    ...
    # Modify existing object (important!)
    outList.append(1)
    ...

def doStuffWith(keyword):
    ...
    result = []
    thread = Thread(target=handler, args=(result,))
    return (thread, result)

def main():
    threads = [doStuffWith(k) for k in arrayofkeywords[:maxThreads]]
    for t in threads:
        t[0].start()
    for t in threads:
        t[0].join()
        ret = t[1]
        # process return value here
Use a Queue.Queue instance, which is intrinsically thread-safe. Each thread can .put its results to that global instance when it's done, and the main thread (when it knows all working threads are done, by .joining them for example as in @unholysampler's answer) can loop .getting each result from it, and use each result to .extend the "overall result" list, until the queue is emptied.
Edit: there are other big problems with your code -- if the maximum number of threads is less than the number of keywords, it will never terminate (you're trying to start a thread per keyword -- never less -- but if you've already started the max number you loop forever to no further purpose).
Consider instead using a threading pool, kind of like the one in this recipe, except that in lieu of queueing callables you'll queue the keywords -- since the callable you want to run in the thread is the same in each thread, just varying the argument. Of course that callable will be changed to peel something from the incoming-tasks queue (with .get) and .put the list of results to the outgoing-results queue when done.
To terminate the N threads you could, after all keywords, .put N "sentinels" (e.g. None, assuming no keyword can be None): a thread's callable will exit if the "keyword" it just pulled is None.
More often than not, Queue.Queue offers the best way to organize threading (and multiprocessing!) architectures in Python, be they generic like in the recipe I pointed you to, or more specialized like I'm suggesting for your use case in the last two paragraphs.
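To make that pattern concrete, here is a minimal Python 3 sketch of my own (the names and the per-keyword work are assumptions, not code from the answer): N worker threads pull keywords from an input queue, push result lists to an output queue, and exit when they pull a None sentinel.

import threading
import queue

NUM_WORKERS = 4

def process_keyword(keyword):
    # hypothetical stand-in for the real per-keyword work
    return [keyword.upper(), keyword.lower()]

def worker(tasks, results):
    while True:
        keyword = tasks.get()
        if keyword is None:          # sentinel: no more work
            break
        results.put(process_keyword(keyword))

def run(keywords):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for kw in keywords:
        tasks.put(kw)
    for _ in threads:                # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    overall = []
    while not results.empty():       # safe here: all workers have finished
        overall.extend(results.get())
    return overall

print(run(["alpha", "beta", "gamma"]))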
You need to keep pointers to each thread you make. As is, your code only ensures the last created thread finishes. This does not imply that all the ones you started before it have also finished.
def startThreads(arrayofkeywords):
    global i
    i = 0
    threads = []
    while len(arrayofkeywords):
        try:
            if i < maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i + 1
                thread = doStuffWith(keyword)
                thread.start()
                threads.append(thread)
        except KeyboardInterrupt:
            sys.exit()
    for t in threads:
        t.join()
        # process results stored in each thread
This also solves the problem of write access, because each thread will store its data locally. Then, after all of them are done, you can do the work to combine each thread's local data.
I know that this question is a little bit old, but the best way to do this is not to hurt yourself too much with the approaches proposed by other colleagues :)
Please read the reference documentation on Pool. This way you will fork-join your work:
from multiprocessing import Pool

def doStuffWith(keyword):
    return keyword + ' processed in thread'

def startThreads(arrayofkeywords):
    pool = Pool(processes=maxThreads)
    result = pool.map(doStuffWith, arrayofkeywords)
    print result
Writing into a global array is fine if you use a semaphore to protect the critical section. You 'acquire' the lock when you want to append to the global array, then 'release' when you are done. This way, only one thread is ever appending to the array.
Check out http://docs.python.org/library/threading.html and search for semaphore for more info.
sem = threading.Semaphore()
...
sem.acquire()
# do dangerous stuff
sem.release()
Try some of the semaphore's methods, like acquire and release:
http://docs.python.org/library/threading.html
Is there any way to make a function (the ones I'm thinking of are in the style of the simple ones I've made, which generate the Fibonacci sequence from 0 to a point, or all the primes between two points) run indefinitely, e.g. until I press a certain key or until a time has passed, rather than until a number reaches a certain point?
Also, if it is based on time, is there any way I could just extend the time and have it start going from that point again, rather than having to start again from 0? I am aware there is a time module, I just don't know much about it.
The simplest way is just to write a program with an infinite loop, and then hit control-C to stop it. Without more description it's hard to know if this works for you.
If you do it time-based, you don't need a generator. You can just have it pause for user input, something like a "Continue? [y/n]", read from stdin, and depending on what you get either exit the loop or not.
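For example, a minimal Python 3 sketch of that prompt-driven loop (the batch size and the Fibonacci body are just assumptions for illustration):

a, b = 0, 1
while True:
    for _ in range(10):  # emit a batch of values before asking again
        print(a)
        a, b = b, a + b
    if input("Continue? [y/n] ").strip().lower() != "y":
        break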
If you really want your function to keep running and still accept user (or system) input, you have two solutions:
multi-thread
multi-process
It will depend on how fine-grained the interaction needs to be. If you just want to interrupt the function and don't care about a clean exit, then multi-process is fine.
In both cases, you can rely on some shared resource (a file or shared memory for multi-process, a variable with an associated mutex for multi-thread) and check the state of that resource regularly in your function. If it is set to tell you to quit, just do it.
Example on multi-thread:
from threading import Thread, Lock
from time import sleep

class MyFct(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.mutex = Lock()
        self._quit = False

    def stopped(self):
        self.mutex.acquire()
        val = self._quit
        self.mutex.release()
        return val

    def stop(self):
        self.mutex.acquire()
        self._quit = True
        self.mutex.release()

    def run(self):
        i = 1
        j = 1
        print i
        print j
        while True:
            if self.stopped():
                return
            i, j = j, i + j
            print j

def main_fct():
    t = MyFct()
    t.start()
    sleep(1)
    t.stop()
    t.join()
    print "Exited"

if __name__ == "__main__":
    main_fct()
You could use a generator for this:
def finished():
    "Define your exit condition here"
    return ...

def count(i=0):
    while not finished():
        yield i
        i += 1

for i in count():
    print i
If you want to change the exit condition you could pass a value back into the generator function and use that value to determine when to exit.
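For illustration (a Python 3 sketch of my own, not from the answer above): the caller can pass a value back in with send(), and the generator uses it to decide when to stop. The time-based condition here is just an assumption.

import time

def count(i=0):
    while True:
        keep_going = yield i      # value passed back in via send()
        if not keep_going:
            return                # the caller asked us to stop
        i += 1

gen = count()
deadline = time.time() + 0.01     # hypothetical time-based exit condition
try:
    value = next(gen)             # prime the generator
    while True:
        print(value)
        value = gen.send(time.time() < deadline)
except StopIteration:
    pass                          # the generator exited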
As in almost all languages:
while True:
    # check what you want and eventually break
    print nextValue()
The second part of your question is more interesting:
Also, if it is based on time then is there anyway I could just extend the time and start it going from that point again rather than having to start again from 0
you can use a yield instead of return in the function nextValue()
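A minimal Python 3 sketch of what that could look like (nextValue() is the hypothetical name used above; the generator below is my own illustration): because the generator keeps its own state, you can stop consuming values and resume later without restarting from 0.

def fib_values():
    a, b = 0, 1
    while True:
        yield a                   # pauses here and resumes with its state intact
        a, b = b, a + b

gen = fib_values()
for _ in range(5):
    print(next(gen))              # 0 1 1 2 3
# ...later, next(gen) picks up where it left off instead of restarting at 0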
If you use a child thread to run the function while the main thread waits for character input it should work. Just remember to have something that stops the child thread (in the example below the global runthread)
For example:
import threading, time

runthread = 1

def myfun():
    while runthread:
        print "A"
        time.sleep(.1)

t = threading.Thread(target=myfun)
t.start()
raw_input("")
runthread = 0
t.join()
does just that
If you want to exit based on time, you can use the signal module's alarm(time) function and then catch SIGALRM; here's an example: http://docs.python.org/library/signal.html#example
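A minimal Python 3 sketch of that approach (Unix only, since signal.alarm is not available on Windows; the exception name is mine):

import signal

class TimeUp(Exception):
    pass

def handle_alarm(signum, frame):
    raise TimeUp()

signal.signal(signal.SIGALRM, handle_alarm)
signal.alarm(1)                   # deliver SIGALRM in 1 second

a, b = 0, 1
try:
    while True:
        a, b = b, a + b           # generate Fibonacci numbers until time runs out
except TimeUp:
    print("stopped after 1 second; the last value had", len(str(a)), "digits")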
You can let the user interrupt the program in a sane manner by catching KeyboardInterrupt. Simply catch the KeyboardInterrupt exception from outside your main loop, and do whatever cleanup you want.
If you want to continue later where you left off, you will have to add some sort of persistence. I would pickle a data structure to disk that you could read back in to continue the operations.
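A minimal Python 3 sketch of that idea (the file name and the shape of the saved state are assumptions): catch KeyboardInterrupt, pickle the loop state, and reload it on the next run.

import os
import pickle

STATE_FILE = "fib_state.pickle"   # hypothetical path

if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        a, b = pickle.load(f)     # resume from the saved state
else:
    a, b = 0, 1

try:
    while True:
        a, b = b, a + b
except KeyboardInterrupt:
    with open(STATE_FILE, "wb") as f:
        pickle.dump((a, b), f)    # save state for the next run
    print("state saved; run again to continue where you left off")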
I haven't tried anything like this, but you could look into using something like memoization and caching to disk.
You could do something like this to generate Fibonacci numbers for 1 second and then stop.
import time

fibonacci = [1, 1]
stoptime = time.time() + 1  # set the stop time to 1 second in the future
while time.time() < stoptime:
    fibonacci.append(fibonacci[-1] + fibonacci[-2])

print "Generated %s numbers, the last one was %s." % (len(fibonacci), fibonacci[-1])
I'm not sure how efficient it is to call time.time() in every loop iteration; depending on what you are doing inside the loop, it might end up eating a lot of the performance.