How to add threads depending on a number - python

In one part of my Python code I have a list of items whose size can vary greatly, from twelve items down to just one. For each item in this list I do some processing (sending an HTTP request related to the item, parsing the results, and many other operations). I'd like to speed up my code using threading: I'd like to create 2 threads, where each one takes a number of items and does the processing asynchronously.
Example 1: Let's say that my list has 12 items; each thread would then take 6 items and call the processing functions on each of them.
Example 2: Now let's say that my list has 9 items; one thread would take 5 items and the other thread would take the 4 remaining items.
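For illustration, the split itself might be sketched like this (just an illustration, using the itemList from the code below):
half = (len(itemList) + 1) // 2  # the first thread takes the extra item when the size is odd
first_half, second_half = itemList[:half], itemList[half:]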
Currently I'm not applying any threading, and my code base is very large, so here is some code that does almost the same thing as my case:
# This procedure needs to be used with threading.
itemList = getItems()  # This function returns an unknown number of items, between 1 and 12.
if len(itemList) > 0:  # Make sure that the list is not empty.
    for item in itemList:
        processItem(item)  # This is an imaginary function that does the processing on each item.
Above is some basic, lightweight code that shows what I'm doing. I can't figure out how to make my threads flexible, so that each one takes a number of items and the other takes the rest (as explained in examples 1 & 2).
Thanks for your time.

You might rather implement it using shared queues:
https://docs.python.org/3/library/queue.html#queue-objects
import queue
import threading

def worker():
    while True:
        item = q.get()
        if item is None:
            break
        do_work(item)  # placeholder for your per-item processing
        q.task_done()

q = queue.Queue()
threads = []
for i in range(num_worker_threads):  # num_worker_threads would be 2 in your case
    t = threading.Thread(target=worker)
    t.start()
    threads.append(t)

for item in source():  # placeholder for your getItems()
    q.put(item)

# block until all tasks are done
q.join()

# stop workers
for i in range(num_worker_threads):
    q.put(None)
for t in threads:
    t.join()
Quoting from
https://docs.python.org/3/library/queue.html#module-queue:
The queue module implements multi-producer, multi-consumer queues. It
is especially useful in threaded programming when information must be
exchanged safely between multiple threads.
The idea is that you have shared storage and each thread reads items from it one by one.
This is much more flexible than distributing the load in advance, since you don't know how thread execution will be scheduled by your OS, how much time each iteration will take, etc.
Furthermore, you might add items for further processing to this queue dynamically — for example, having a producer thread running in parallel.
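For example, a producer thread feeding the same queue might look like this minimal sketch (produce_more_items is a hypothetical generator, not part of the code above):
def producer():
    # Keep feeding the shared queue while the workers are running.
    for item in produce_more_items():  # hypothetical source of new work
        q.put(item)

producer_thread = threading.Thread(target=producer)
producer_thread.start()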
Some helpful links:
A brief introduction to concurrent programming in Python:
http://www.slideshare.net/dabeaz/an-introduction-to-python-concurrency
More details on producer-consumer pattern with line-by-line explanation:
http://www.informit.com/articles/article.aspx?p=1850445&seqNum=8

You can use the ThreadPoolExecutor class from the concurrent.futures module in Python 3. The module is not present in Python 2, but there are some workarounds (which I will not discuss).
A thread pool executor does basically what @ffeast proposed, but with fewer lines of code for you to write. It manages a pool of threads that will execute all the tasks you submit to it, presumably in the most efficient manner possible. The results will be returned through Future objects, which represent a "pending" result.
Since you seem to know the list of tasks up front, this is especially convenient for you. While you cannot guarantee how the tasks will be split between the threads, the result will probably be at least as good as anything you coded by hand.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    for item in getItems():
        executor.submit(processItem, item)
If you need more from the output, such as some way of identifying the futures that have completed or getting results out of them, see the example in the Python documentation (on which the code above is heavily based).
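For instance, here is a sketch based on that documentation example, reusing the hypothetical getItems/processItem from the question:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the item it is processing.
    future_to_item = {executor.submit(processItem, item): item
                      for item in getItems()}
    for future in as_completed(future_to_item):
        item = future_to_item[future]
        try:
            result = future.result()  # re-raises any exception from the task
        except Exception as exc:
            print('%r generated an exception: %s' % (item, exc))
        else:
            print('%r gave %r' % (item, result))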

Related

Is it right to append items to the same list from multiple threads without a lock?

Here's the detailed question:
I want to use multi-threading to do a batch of HTTP requests, then gather all the results into a list and sort all the items.
So I want to first define an empty list origin_list in the main process, then start some threads that just append results to this list, after passing origin_list to every thread.
It seems that I got the expected results in the end, so I think I got the right result list without a thread lock, since the list is a mutable object. Am I right?
My main code is as below:
import requests
from functools import partial
from concurrent.futures import ThreadPoolExecutor

def do_request_work(final_item_list, request_url):
    request_results = requests.get(request_url).text
    # do request work
    final_item_list.append(request_results)

def do_sort_work(final_item_list):
    # do sort work
    return final_item_list

def main():
    f_item_list = []
    request_list = [url1, url2, ...]
    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(
            partial(
                do_request_work,
                f_item_list
            ),
            request_list)
    sorted_list = do_sort_work(f_item_list)
Any commentary is very welcome. Great thanks.
I think that this is a quite questionable solution, even without taking thread safety into account.
First of all, Python has the GIL:
In CPython, the global interpreter lock, or GIL, is a mutex that
protects access to Python objects, preventing multiple threads from
executing Python bytecodes at once.
Thus, I doubt there is much of a performance benefit here. Even noting that
potentially blocking or long-running operations, such as I/O, image
processing, and NumPy number crunching, happen outside the GIL.
all pure-Python work will still be executed by one thread at a time.
From the other perspective, the same lock may help you with thread safety here, so that only one thread will modify final_item_list at a time, but I am not sure.
Anyway, I would use the multiprocessing module here, with its integrated parallel map:
import requests
from multiprocessing import Pool

def do_request_work(request_url):
    request_results = requests.get(request_url).text
    # do request work
    return request_results

if __name__ == '__main__':
    request_list = [url1, url2, ...]
    with Pool(20) as p:
        f_item_list = p.map(do_request_work, request_list)
This will guarantee you parallel, lock-free execution of the requests, since every process receives only its part of the work and just returns the result when ready.
Look at this thread: I'm seeking advise on multi-tasking on Python36 platform, Procedure setup.
Relevant to Python 3.5+:
Running Tasks Concurrently
awaitable asyncio.gather(*aws, loop=None, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
I use this very often; just be aware that it's not thread-safe, so do not change values inside, otherwise you will have to use deepcopy.
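A minimal sketch of that approach using aiohttp (the fetch coroutine and URL list are illustrative; asyncio.run requires Python 3.7+, so on 3.5/3.6 use loop.run_until_complete instead):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all the fetches concurrently and returns results in input order.
        return await asyncio.gather(*(fetch(session, url) for url in urls))

results = asyncio.run(fetch_all(['http://example.com/a', 'http://example.com/b']))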
Other things to look at:
https://github.com/kennethreitz/grequests
https://github.com/jreese/aiomultiprocess
aiohttp

Passing updated args to multiple threads periodically in python

I have three base stations that have to work in parallel, and every 10 seconds they will receive a list containing information about their cluster. I want to run this code for about 10 minutes. So, every 10 seconds my three threads have to call the target method with new arguments, and this process should last for 10 minutes. I don't know how to do this, but I came up with the idea below, which doesn't seem to be a good one! So I'd appreciate any help.
I have a list named base_centroid_assign, and I want to pass each item of it to a distinct thread. The list's content will be updated frequently (say every 10 seconds), so I wish to reuse my previous threads and give the updated items to them.
In the code below, the list contains three items, each of which holds multiple items itself (it's nested). I want to have three threads that stop after executing the quite simple target function, and then to recall the threads with updated items; however, when I run the code below, I end up with 30 threads! (The run_time variable is 10 and the list's length is 3.)
How can I implement idea as mentioned above?
import threading
import time

run_time = 10

def cluster_status_broadcasting(info_base_cent_avr):
    print(threading.current_thread().name)
    info_base_cent_avr.sort(key=lambda item: item[2], reverse=True)

start = time.time()
while run_time > 0:
    for item in base_centroid_assign:  # the nested three-item list described above
        t = threading.Thread(target=cluster_status_broadcasting, args=(item,))
        t.daemon = True
        t.start()
    print('Entire job took:', time.time() - start)
    run_time -= 1
Welcome to Stack Overflow.
Problems with thread synchronisation can be so tricky to handle that Python already has some very useful libraries specifically for such tasks. The primary one is queue.Queue in Python 3. The idea is to have a queue for each "worker" thread. The main thread collects and puts new data onto a queue, and the subsidiary threads get the data from that queue.
When you call a Queue's get method, its normal action is to block the thread until something is available, but presumably you want the threads to continue working on the current inputs until new ones are available. In that case it makes more sense to poll the queue and continue with the current data if there is nothing new from the main thread.
I outline such an approach in my answer to this question, though in that case the worker threads are actually sending return values back on another queue.
The structure of your worker threads' run method would then need to be something like the following pseudo-code:
def run(self):
    request_data = self.inq.get()  # Wait for first item
    while True:
        process_with(request_data)
        try:
            request_data = self.inq.get(block=False)
        except queue.Empty:
            continue
You might like to add logic to terminate the thread cleanly when a sentinel value such as None is received.
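For example, the loop above could exit on a None sentinel like this (same assumptions; process_with is still a placeholder):
def run(self):
    request_data = self.inq.get()  # wait for the first item
    while request_data is not None:  # None acts as the shutdown sentinel
        process_with(request_data)
        try:
            request_data = self.inq.get(block=False)
        except queue.Empty:
            continue  # nothing new yet, keep working with the current data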

Python multi function multithreading with threading.Thread? (variable number of threads)

I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
import numpy as np

functionList = [func1, func2, func3, func4]
outputList = [func1out, func2out, func3out, func4out]
argsList = [func1args, func2args, func3args, func4args]

# number of threads
n = 3

functionSplit = np.array_split(np.array(functionList), n)
outputSplit = np.array_split(np.array(outputList), n)
argSplit = np.array_split(np.array(argsList), n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to outputList and create a master dict of the outputs from each function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out; I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of threading.Thread and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call the functions in a list together with their corresponding arguments!
The reason I'm doing this is to discover the optimum balance between thread count, computational efficiency, and time. Like I said, this will be integrated into a high-frequency trading platform I'm developing, where time is my major constraint!
Any ideas?
You can use the multiprocessing library, like below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        d[somekey] = fnList[i](argList, outList)  # somekey is a placeholder for however you name each output

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns,
                                      args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()
for j in processes:
    j.join()

# use d here
This uses a server process to share the dictionary between the processes. To interact with the server process you need a Manager; then you can create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable, and then process them in chunks in parallel to compute faster. The iterable is broken into separate chunks; these chunks are passed to the function in separate processes, and the results are then put back together.
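A rough sketch of that idea (expensive_fn and the inputs are placeholders, not your actual trading functions):
from multiprocessing import Pool

def expensive_fn(x):
    # placeholder for one CPU-bound computation
    return x * x

if __name__ == '__main__':
    with Pool(processes=3) as pool:
        # chunksize controls how many items each worker process receives at a time
        results = pool.map(expensive_fn, range(14), chunksize=5)
    print(results)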

multithreading check membership in Queue and stop the threads

I want to iterate over a list using 2 threads: one from the front and the other from the back, putting the elements in a Queue on each iteration. But before putting a value in the Queue I need to check whether it already exists there (which happens when the other thread has already put that value in the Queue). When this happens I need to stop the threads and return the list of traversed values for each thread.
This is what I have tried so far :
from Queue import Queue
from threading import Thread, Event

class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs={}, Verbose=None):
        Thread.__init__(self, group, target, name, args, kwargs, Verbose)
        self._return = None

    def run(self):
        if self._Thread__target is not None:
            self._return = self._Thread__target(*self._Thread__args,
                                                **self._Thread__kwargs)

    def join(self):
        Thread.join(self)
        return self._return

main_path = Queue()

def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue

def a(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'a'
        if is_in_queue(i, main_path):
            return l
        main_path.put(i)

def b(main_path, g, l=[]):
    for i in g:
        l.append(i)
        print 'b'
        if is_in_queue(i, main_path):
            return l
        main_path.put(i)

g = ['a','b','c','d','e','f','g','h','i','j','k','l']

t1 = ThreadWithReturnValue(target=a, args=(main_path, g))
t2 = ThreadWithReturnValue(target=b, args=(main_path, g[::-1]))
t2.start()
t1.start()
# Wait for all produced items to be consumed
print main_path.join()
I used ThreadWithReturnValue, which creates a custom thread that returns a value.
And for membership checking I used the following function :
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
Now if I start t1 first and then t2, I get twelve a's, then one b, and then nothing else happens and I need to terminate the Python process manually!
But if I run t2 first and then t1, I get the following result:
b
b
b
b
ab
ab
b
b
b
b
a
a
So my question is: why does Python's threading behave differently in these cases? And how can I terminate the threads and make them communicate with each other?
Before we get into bigger problems, you're not using Queue.join right.
The whole point of this function is that a producer who adds a bunch of items to a queue can wait until the consumer or consumers have finished working on all of those items. This works by having each consumer call task_done after finishing work on each item that it pulled off with get. Once there have been as many task_done calls as put calls, the queue is done. You're not doing a get anywhere, much less a task_done, so there's no way the queue can ever be finished. So that's why you block forever after the two threads finish.
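For reference, here is a minimal sketch of the get/task_done/join contract (Python 2, to match your code):
from Queue import Queue
from threading import Thread

q = Queue()

def consumer():
    while True:
        item = q.get()   # every get() ...
        # ... work on item here ...
        q.task_done()    # ... must be matched by a task_done()

t = Thread(target=consumer)
t.daemon = True
t.start()
for i in range(10):
    q.put(i)
q.join()  # unblocks only once task_done() has been called 10 times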
The first problem here is that your threads are doing almost no work outside of the actual synchronization. If the only thing they do is fight over a queue, only one of them is going to be able to run at a time.
Of course that's common in toy problems, but you have to think through your real problem:
If you're doing a lot of I/O work (listening on sockets, waiting for user input, etc.), threads work great.
If you're doing a lot of CPU work (calculating primes), threads don't work in Python because of the GIL, but processes do.
If you're actually primarily dealing with synchronizing separate tasks, neither one is going to work well (and processes will be worse). It may still be simpler to think in terms of threads, but it'll be the slowest way to do things. You may want to look into coroutines; Greg Ewing has a great demonstration of how to use yield from to use coroutines to build things like schedulers or many-actor simulations.
Next, as I alluded to in your previous question, making threads (or processes) work efficiently with shared state requires holding locks for as short a time as possible.
So, if you have to search a whole queue under a lock, that had better be a constant-time search, not a linear-time search. That's why I suggested using something like an OrderedSet recipe rather than a list, like the one inside the stdlib's Queue.Queue. Then this function:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
… is only blocking the queue for a tiny fraction of a second—just long enough to look up a hash value in a table, instead of long enough to compare every element in the queue against x.
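One way to get that constant-time check is a Queue subclass that mirrors its contents in a set, a sketch in the spirit of the OrderedSet suggestion (Python 2 to match your code; the _init/_put/_get hooks are the documented extension points that Queue calls while holding its lock):
import Queue

class SetQueue(Queue.Queue):
    def _init(self, maxsize):
        Queue.Queue._init(self, maxsize)
        self.membership = set()  # mirror of the queue's contents for O(1) lookups

    def _put(self, item):
        Queue.Queue._put(self, item)
        self.membership.add(item)

    def _get(self):
        item = Queue.Queue._get(self)
        self.membership.discard(item)
        return item

def is_in_queue(x, q):
    with q.mutex:
        return x in q.membership  # hash lookup instead of a linear scan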
Finally, I tried to explain about race conditions on your other question, but let me try again.
You need a lock around every complete "transaction" in your code, not just around the individual operations.
For example, if you do this:
with queue locked:
    see if x is in the queue
if x was not in the queue:
    with queue locked:
        add x to the queue
… then it's always possible that x was not in the queue when you checked, but in the time between when you unlocked it and relocked it, someone added it. This is exactly why it's possible for both threads to stop early.
To fix this, you need to put a lock around the whole thing:
with queue locked:
    if x is not in the queue:
        add x to the queue
Of course this goes directly against what I said before about locking the queue for as short a time as possible. Really, that's what makes multithreading hard in a nutshell. It's easy to write safe code that just locks everything for as long as might conceivably be necessary, but then your code ends up only using a single core, while all the other threads are blocked waiting for the lock. And it's easy to write fast code that just locks everything as briefly as possible, but then it's unsafe and you get garbage values or even crashes all over the place. Figuring out what needs to be a transaction, and how to minimize the work inside those transactions, and how to deal with the multiple locks you'll probably need to make that work without deadlocking them… that's not so easy.
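To make that concrete, here is a minimal sketch of such a transaction using a plain lock and set rather than a Queue (hypothetical names, not your original code):
import threading

lock = threading.Lock()
seen = set()

def try_add(x):
    # The whole check-then-add is one transaction under a single lock.
    with lock:
        if x in seen:
            return False  # another thread beat us to it
        seen.add(x)
        return True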
A couple of things that I think can be improved:
Due to the GIL, you might want to use the multiprocessing (rather than threading) module. In general, CPython threading will not cause CPU intensive work to speed up. (Depending on what exactly is the context of your question, it's also possible that multiprocessing won't, but threading almost certainly won't.)
A function like your is_in_queue would likely lead to high contention.
The time spent holding the lock is linear in the number of items that need to be traversed:

def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
So, instead, you could possibly do the following.
Use multiprocessing with a shared dict:
from multiprocessing import Process, Manager

manager = Manager()
d = manager.dict()

# Fn definitions and such

p1 = Process(target=p1, args=(d,))
p2 = Process(target=p2, args=(d,))
Within each function, check for the item like this:
def p1(d):
    # Stuff
    if 'foo' in d:
        return

Multiple python threads writing to different records in same list simultaneously - is this ok?

I am trying to fix a bug where multiple threads are writing to a list in memory. Right now I have a thread lock, and I occasionally run into problems that are related to the work being done in the threads.
I was hoping to simply make a hash of lists, one for each thread, and remove the thread lock. It seems like each thread could write to its own record without worrying about the others, but perhaps the fact that they all use the same owning hash would itself be a problem.
Does anyone happen to know if this will work or not? If not, could I, for example, dynamically add a list to a package for each thread? Is that essentially the same thing?
I am far from a threading expert so any advice welcome.
Thanks,
import os
import threading
import time

def job(root_folder, my_list):
    for current, dirs, files in os.walk(root_folder):
        my_list.extend(files)
        time.sleep(1)

my_lists = [[], [], []]
my_folders = ["C:\\Windows", "C:\\Users", "C:\\Temp"]
my_threads = []
for folder, a_list in zip(my_folders, my_lists):
    my_threads.append(threading.Thread(target=job, args=(folder, a_list)))

for thread in my_threads:
    thread.start()
for thread in my_threads:
    thread.join()

my_full_list = my_lists[0] + my_lists[1] + my_lists[2]
This way each thread just modifies its own list and at the end combines all the individual lists.
Also, as pointed out, this gives zero performance gain (it is actually probably slower than not threading it). You may get performance gains by using multiprocessing instead.
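A sketch of that multiprocessing variant (same hypothetical folders; each worker process walks one folder and returns its own list):
import os
from multiprocessing import Pool

def job(root_folder):
    found = []
    for current, dirs, files in os.walk(root_folder):
        found.extend(files)
    return found

if __name__ == '__main__':
    my_folders = ["C:\\Windows", "C:\\Users", "C:\\Temp"]
    with Pool(3) as pool:
        results = pool.map(job, my_folders)
    my_full_list = results[0] + results[1] + results[2]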
Don't use a list. Use Queue (Python 2) or queue (Python 3).
There are 3 kinds of queues: FIFO, LIFO, and priority. The last one is for ordered data.
You may put data in on one side (from a thread):
q.put(data)
And get it out on the other side (maybe in a loop, to feed, say, a database):

while not q.empty():
    print q.get()
https://docs.python.org/2/library/queue.html
