Passing updated args to multiple threads periodically in python - python

I have three base stations, they have to work in parallel, and they will receive a list every 10 seconds which contain information about their cluster, and I want to run this code for about 10 minutes. So, every 10 seconds my three threads have to call the target method with new arguments, and this process should last long for 10 minutes. I don't know how to do this, but I came up with the below idea which seems to be not quite a good one! Thus I appreciate any help.
I have a list named base_centroid_assign that I want to pass each item of it to a distinct thread. The list content will be updated frequently (supposed for instance 10 seconds), I so wish to recall my previous threads and give the update items to them.
In the below code, the list contains three items which have multiple items in them (it's nested). I want to have three threads stop after executing the quite simple target function, and then recall the threads with update item; however, when I run the below code, I ended up with 30 threads! (the run_time variable is 10 and list's length is 3).
How can I implement idea as mentioned above?
run_time = 10
def cluster_status_broadcasting(info_base_cent_avr):
print(threading.current_thread().name)
info_base_cent_avr.sort(key=lambda item: item[2], reverse=True)
start = time.time()
while(run_time > 0):
for item in base_centroid_assign:
t = threading.Thread(target=cluster_status_broadcasting, args=(item,))
t.daemon = True
t.start()
print('Entire job took:', time.time() - start)
run_time -= 1

Welcome to Stackoverflow.
Problems with thread synchronisation can be so tricky to handle that Python already has some very useful libraries specifically to handle such tasks. The primary such library is queue.Queue in Python 3. The idea is to have a queue for each "worker" thread. The main thread collect and put new data onto a queue, and have the subsidiary threads get the data from that queue.
When you call a Queue's get method its normal action is to block the thread until something is available, but presumably you want the threads to continue working on the current inputs until new ones are available, in which case it would make more sense to poll the queue and continue with the current data if there is nothing from the main thread.
I outline such an approach in my answer to this question, though in that case the worker threads are actually sending return values back on another queue.
The structure of your worker threads' run method would then need to be something like the following pseudo-code:
def run(self):
request_data = self.inq.get() # Wait for first item
while True:
process_with(request_data)
try:
request_data = self.inq.get(block=False)
except queue.Empty:
continue
You might like to add logic to terminate the thread cleanly when a sentinel value such as None is received.

Related

Why are multiple multiprocessing imap's blocking?

I want to do multiple transformations on some data. I figured I can use multiple Pool.imap's because each of the transformations is just a simple map. And Pool.imap is lazy, so it only does computation when needed.
But strangely, it looks like multiple consecutive Pool.imap's are blocking. And not lazy. Look at the following code as an example.
import time
from multiprocessing import Pool
def slow(n):
time.sleep(0.01)
return n*n
for i in [10, 100, 1000]:
with Pool() as p:
numbers = range(i)
iter1 = p.imap(slow, numbers)
iter2 = p.imap(slow, iter1)
start = time.perf_counter()
next(iter2)
print(i, time.perf_counter() - start)
# Prints
# 10 0.0327413540071575
# 100 0.27094774100987706
# 1000 2.6275791430089157
As you can see the time to the first element is increasing. I have 4 cores on my machine, so it roughly takes 2.5 seconds to process 1000 items with a 0.01 second delay. Hence, I think two consecutive Pool.imap's are blocking. And that the first Pool.imap finishes the entire workload before the second one starts. That is not lazy.
I've did some additional research. It does not matter if I use a process pool or a thread pool. It happens with Pool.imap and Pool.imap_unordered. The blocking takes longer when I do a third Pool.imap. A single Pool.imap is not blocking. This bug report seems related but different.
TL;DR imap is not a real generator, meaning it does not generate items on-demand (lazy computation aka similar to coroutine), and pools initiate "jobs" in serial.
longer answer: Every type of submission to a Pool be it imap, apply, apply_async etc.. gets written to a queue of "jobs". This queue is read by a thread in the main process (pool._handle_tasks) in order to allow jobs to continue to be initiated while the main process goes off and does other things. This thread contains a very simple double for loop (with a lot of error handling) that basically iterates over each job, then over each task within each job. The inner loop blocks until a worker is available to get each task, meaning tasks (and jobs) are always started in serial in the exact order they were submitted. This does not mean they will finish in perfect serial, which is why map, and imap collect results, and re-order them back to their original order (handled by pool._handle_resluts thread) before passing back to the main thread.
Rough pseudocode of what's going on:
#task_queue buffers task inputs first in - first out
pool.imap(foo, ("bar", "baz", "bat"), chunksize=1)
#put an iterator on the task queue which will yield "chunks" (a chunk is given to a single worker process to compute)
pool.imap(fun, ("one", "two", "three"), chunksize=1)
#put a second iterator to the task queue
#inside the pool._task_handler thread within the main proces
for task in task_queue: #[imap_1, imap_2]
#this is actually a while loop in reality that tries to get new tasks until the pool is close()'d
for chunk in task:
_worker_input_queue.put(chunk) # give the chunk to the next available worker
# This blocks until a worker actually takes the chunk, meaning the loop won't
# continue until all chunks are taken by workers.
def worker_function(_worker_input_queue, _worker_output_queue):
while True:
task = _worker_input_queue.get() #get the next chunk of tasks
#if task == StopSignal: break
result = task.func(task.args)
_worker_output_queue.put(result) #results are collected, and re-ordered
# by another thread in the main process
# as they are completed.

Splitting a list into batches to process elements in each batch multi threaded

I am trying pass each element from my list to a function that is being started on its own thread doing its own work. The problem is if the list has 100+ elements it will start 100 functions() on 100 threads.
For the sake of my computer I want to process the list in batches of 10's with he following steps:
Batch 1 gets queued.
Pass each element from batch1 to the function getting started on its own thread (This way I will only have 10 function threads running at a time)
Once all 10 threads have finished, they get popped off from their queue
Repeat until all batches are done
I was trying to use two lists where first 10 elements gets popped into list2. Process the list2, once the threads are done, pop 10 more elements until list1 reaches length of 0.
I have gotten this far not sure how to proceed.
carsdotcomOptionVal, carsdotcomOptionMakes = getMakes()
second_list = []
threads = []
while len(carsdotcomOptionVal) != 0:
second_list.append(carsdotcomOptionVal.pop(10))
for makesOptions in second_list:
th = threading.Thread(target=getModels, args=[makesOptions])
th.start()
threads.append(th)
for thread in threads:
thread.join()
Lastly the elements from the main list dont have to be even as they can be odd.
You should use a queue.Queue object, which can create a thread-safe list of tasks for other "worker-threads". You can choose how many worker-threads are active, and they would each feed from the list until it's done.
Here's what a sample code looks like with a queue:
import queue
import threading
threads_to_start = 10 # or choose how many you want
my_queue = queue.Queue()
def worker():
while not my_queue.empty():
data = my_queue.get()
do_something_with_data(data)
my_queue.task_done()
for i in range(100):
my_queue.put(i) # replace "i" with whatever data you want for the threads to process
for i in range(threads_to_start):
t = threading.Thread(target=worker, daemon=True) # daemon means that all threads will exit when the main thread exits
t.start()
my_queue.join() # this will block the main thread from exiting until the queue is empty and all data has been processed
Keep in mind this is just a pseudo-code rough start to introduce you to threading and queues, there is more to it than just that, but that example should work in most simple use-cases
This is scalable too - all you have to change if you can support more or less threads is change the number you initially set in threads_to_start

How to add threads depending on a number

In a part of my software code written with python, I have a list of items where it size can vary greatly from 12 to only one item . For each item in this list I'm doing some processing (sending an HTTP request related to the given item, parse results and many other operations . I'd like to speed up my code using threading, I'd like to create 2 threads where each one take a number of items and do the processing async.
Example 1 : Let's say that in my list I have 12 items, each thread would take in this case 6 items and call the processing functions on each item .
Example 2 : Now let's say that my list have 9 items, one thread would take 5 items and the other thread would take the other 4 left items .
Currently I'm not applying any threading and my code base is very large, so here some code that do almost the same thing as my case :
#This procedure need to be used with threading .
itemList = getItems() #This function return an unknown number of items between 1 and 12
if len(itemList) > 0: # Make sure that the list is empty in this case .
for item in itemList:
processItem(item) #This is an imaginary function that do the processing on each item
Below is a basic lite code that explain what I'm doing, I can't figure out how can I make my threads flexible, so each one take a number of items and the other take the rest (as explained in example 1 & 2) .
Thank's for your time
You might rather implement it using shared queues
https://docs.python.org/3/library/queue.html#queue-objects
import queue
import threading
def worker():
while True:
item = q.get()
if item is None:
break
do_work(item)
q.task_done()
q = queue.Queue()
threads = []
for i in range(num_worker_threads):
t = threading.Thread(target=worker)
t.start()
threads.append(t)
for item in source():
q.put(item)
# block until all tasks are done
q.join()
# stop workers
for i in range(num_worker_threads):
q.put(None)
for t in threads:
t.join()
Quoting from
https://docs.python.org/3/library/queue.html#module-queue:
The queue module implements multi-producer, multi-consumer queues. It
is especially useful in threaded programming when information must be
exchanged safely between multiple threads.
The idea is that you have a shared storage and each thread attempts reading items from it one-by-one.
This is much more flexible than distributing the load in advance as you don't know how threads execution will be scheduled by your OS, how much time each iteration would take etc.
Furthermore, you might add items for further processing to this queue dynamically — for example, having a producer thread running in parallel.
Some helpful links:
A brief introduction into concurrent programming in python:
http://www.slideshare.net/dabeaz/an-introduction-to-python-concurrency
More details on producer-consumer pattern with line-by-line explanation:
http://www.informit.com/articles/article.aspx?p=1850445&seqNum=8
You can use the ThreadPoolExecutor class from the concurrent.futures module in Python 3. The module is not present in Python 2, but there are some workarounds (which I will not discuss).
A thread pool executor does basically what #ffeast proposed, but with fewer lines of code for you to write. It manages a pool of threads which will execute all the tasks that you submit to it, presumably in the most efficient manner possible. The results will be returned through Future objects, which represent a "pending" result.
Since you seem to know the list of tasks up front, this is especially convenient for you. While you can not guarantee how the tasks will be split between the threads, the result will probably be at least as good as anything you coded by hand.
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=2) as executor:
for item in getItems():
executor.submit(processItem, item)
If you need more information with the output, like some way of identifying the futures that have completed or getting results out of them, see the example in the Python documentation (on which the code above is heavily based).

Python 3: How to properly add new Futures to a list while already waiting upon it?

I have a concurrent.futures.ThreadPoolExecutor and a list. And with the following code I add futures to the ThreadPoolExecutor:
for id in id_list:
future = self._thread_pool.submit(self.myfunc, id)
self._futures.append(future)
And then I wait upon the list:
concurrent.futures.wait(self._futures)
However, self.myfunc does some network I/O and thus there will be some network exceptions. When errors occur, self.myfunc submits a new self.myfunc with the same id to the same thread pool and add a new future to the same list, just as the above:
try:
do_stuff(id)
except:
future = self._thread_pool.submit(self.myfunc, id)
self._futures.append(future)
return None
Here comes the problem: I got an error on the line of concurrent.futures.wait(self._futures):
File "/usr/lib/python3.4/concurrent/futures/_base.py", line 277, in wait
f._waiters.remove(waiter)
ValueError: list.remove(x): x not in list
How should I properly add new Futures to a list while already waiting upon it?
Looking at the implementation of wait(), it certainly doesn't expect that anything outside concurrent.futures will ever mutate the list passed to it. So I don't think you'll ever get that "to work". It's not just that it doesn't expect the list to mutate, it's also that significant processing is done on list entries, and the implementation has no way to know that you've added more entries.
Untested, I'd suggest trying this instead: skip all that, and just keep a running count of threads still active. A straightforward way is to use a Condition guarding a count.
Initialization:
self._count_cond = threading.Condition()
self._thread_count = 0
When my_func is entered (i.e., when a new thread starts):
with self._count_cond:
self._thread_count += 1
When my_func is done (i.e., when a thread ends), for whatever reason (exceptional or not):
with self._count_cond:
self._thread_count -= 1
self._count_cond.notify() # wake up the waiting logic
And finally the main waiting logic:
with self._count_cond:
while self._thread_count:
self._count_cond.wait()
POSSIBLE RACE
It seems possible that the thread count could reach 0 while work for a new thread has been submitted, but before its my_func invocation starts running (and so before _thread_count is incremented to account for the new thread).
So the:
with self._count_cond:
self._thread_count += 1
part should really be done instead right before each occurrence of
self._thread_pool.submit(self.myfunc, id)
Or write a new method to encapsulate that pattern; e.g., like so:
def start_new_thread(self, id):
with self._count_cond:
self._thread_count += 1
self._thread_pool.submit(self.myfunc, id)
A DIFFERENT APPROACH
Offhand, I expect this could work too (but, again, haven't tested it): keep all your code the same except change how you're waiting:
while self._futures:
self._futures.pop().result()
So this simply waits for one thread at a time, until none remain.
Note that .pop() and .append() on lists are atomic in CPython, so no need for your own lock. And because your my_func() code appends before the thread it's running in ends, the list won't become empty before all threads really are done.
AND YET ANOTHER APPROACH
Keep the original waiting code, but rework the rest not to create new threads in case of exception. Like rewrite my_func to return True if it quits due to an exception, return False otherwise, and start threads running a wrapper instead:
def my_func_wrapper(self, id):
keep_going = True
while keep_going:
keep_going = self.my_func(id)
This may be especially attractive if you someday decide to use multiple processes instead of multiple threads (creating new processes can be a lot more expensive on some platforms).
AND A WAY USING cf.wait()
Another way is to change just the waiting code:
while self._futures:
fs = self._futures[:]
for f in fs:
self._futures.remove(f)
concurrent.futures.wait(fs)
Clear? This makes a copy of the list to pass to .wait(), and the copy is never mutated. New threads show up in the original list, and the whole process is repeated until no new threads show up.
Which of these ways makes most sense seems to me to depend mostly on pragmatics, but there's not enough info about all you're doing for me to make a guess about that.

How to manage python threads results?

I am using this code:
def startThreads(arrayofkeywords):
global i
i = 0
while len(arrayofkeywords):
try:
if i<maxThreads:
keyword = arrayofkeywords.pop(0)
i = i+1
thread = doStuffWith(keyword)
thread.start()
except KeyboardInterrupt:
sys.exit()
thread.join()
for threading in python, I have almost everything done, but I dont know how to manage the results of each thread, on each thread I have an array of strings as result, how can I join all those arrays into one safely? Because, I if I try writing into a global array, two threads could be writing at the same time.
First, you actually need to save all those thread objects to call join() on them. As written, you're saving only the last one of them, and then only if there isn't an exception.
An easy way to do multithreaded programming is to give each thread all the data it needs to run, and then have it not write to anything outside that working set. If all threads follow that guideline, their writes will not interfere with each other. Then, once a thread has finished, have the main thread only aggregate the results into a global array. This is know as "fork/join parallelism."
If you subclass the Thread object, you can give it space to store that return value without interfering with other threads. Then you can do something like this:
class MyThread(threading.Thread):
def __init__(self, ...):
self.result = []
...
def main():
# doStuffWith() returns a MyThread instance
threads = [ doStuffWith(k).start() for k in arrayofkeywords[:maxThreads] ]
for t in threads:
t.join()
ret = t.result
# process return value here
Edit:
After looking around a bit, it seems like the above method isn't the preferred way to do threads in Python. The above is more of a Java-esque pattern for threads. Instead you could do something like:
def handler(outList)
...
# Modify existing object (important!)
outList.append(1)
...
def doStuffWith(keyword):
...
result = []
thread = Thread(target=handler, args=(result,))
return (thread, result)
def main():
threads = [ doStuffWith(k) for k in arrayofkeywords[:maxThreads] ]
for t in threads:
t[0].start()
for t in threads:
t[0].join()
ret = t[1]
# process return value here
Use a Queue.Queue instance, which is intrinsically thread-safe. Each thread can .put its results to that global instance when it's done, and the main thread (when it knows all working threads are done, by .joining them for example as in #unholysampler's answer) can loop .getting each result from it, and use each result to .extend the "overall result" list, until the queue is emptied.
Edit: there are other big problems with your code -- if the maximum number of threads is less than the number of keywords, it will never terminate (you're trying to start a thread per keyword -- never less -- but if you've already started the max numbers you loop forever to no further purpose).
Consider instead using a threading pool, kind of like the one in this recipe, except that in lieu of queueing callables you'll queue the keywords -- since the callable you want to run in the thread is the same in each thread, just varying the argument. Of course that callable will be changed to peel something from the incoming-tasks queue (with .get) and .put the list of results to the outgoing-results queue when done.
To terminate the N threads you could, after all keywords, .put N "sentinels" (e.g. None, assuming no keyword can be None): a thread's callable will exit if the "keyword" it just pulled is None.
More often than not, Queue.Queue offers the best way to organize threading (and multiprocessing!) architectures in Python, be they generic like in the recipe I pointed you to, or more specialized like I'm suggesting for your use case in the last two paragraphs.
You need to keep pointers to each thread you make. As is, your code only ensures the last created thread finishes. This does not imply that all the ones you started before it have also finished.
def startThreads(arrayofkeywords):
global i
i = 0
threads = []
while len(arrayofkeywords):
try:
if i<maxThreads:
keyword = arrayofkeywords.pop(0)
i = i+1
thread = doStuffWith(keyword)
thread.start()
threads.append(thread)
except KeyboardInterrupt:
sys.exit()
for t in threads:
t.join()
//process results stored in each thread
This also solves the problem of write access because each thread will store it's data locally. Then after all of them are done, you can do the work to combine each threads local data.
I know that this question is a little bit old, but the best way to do this is not to harm yourself too much in the way proposed by other colleagues :)
Please read the reference on Pool. This way you will fork-join your work:
def doStuffWith(keyword):
return keyword + ' processed in thread'
def startThreads(arrayofkeywords):
pool = Pool(processes=maxThreads)
result = pool.map(doStuffWith, arrayofkeywords)
print result
Writing into a global array is fine if you use a semaphore to protect the critical section. You 'acquire' the lock when you want to append to the global array, then 'release' when you are done. This way, only one thread is every appending to the array.
Check out http://docs.python.org/library/threading.html and search for semaphore for more info.
sem = threading.Semaphore()
...
sem.acquire()
# do dangerous stuff
sem.release()
try some semaphore's methods, like acquire and release..
http://docs.python.org/library/threading.html

Categories