So I have a script that uses about 50k threads but only runs 10 at a time. I use the threading library for this, with a BoundedSemaphore to limit the threads to 10 at a time. In some cases there is not enough memory for all threads, but it is important that every thread gets processed, so I would like to repeat the threads that got killed because of insufficient memory.
import some_other_script, threading

class myThread(threading.Thread):
    def __init__(self, item):
        threading.Thread.__init__(self)
        self.item = item

    def run(self):
        threadLimiter.acquire()
        some_other_script.method(self.item)
        somelist.remove(self.item)
        threadLimiter.release()

threadLimiter = threading.BoundedSemaphore(10)
somelist = ['50,000 Items', '...']

for item in somelist:
    myThread(item).start()
As you can see, the only idea I could come up with so far was to delete each item that got processed from the list within its thread, with somelist.remove(self.item). (Each item is unique and only present once within the list.)
My idea was to run a while loop around the for loop to check whether the list still contains items. That did not work, because after the for loop finishes the threads are not finished yet, and so the list isn't empty.
What I want to do is catch the threads that fail because the system runs out of memory, and execute them again (and again if need be).
Thank you very much in advance!
This solves both the too-many-active-threads problem and the problem in your question:
def get_items():
    threads = threading.enumerate()
    items = set()
    for thr in threads:
        if isinstance(thr, myThread):
            items.add(thr.item)
    return items

def manageThreads(howmany):
    while bigset:
        items = get_items()
        items_to_add = bigset.difference(items)
        while len(items) < howmany and items_to_add:
            item = items_to_add.pop()
            items.add(item)  # count it as running so we start at most `howmany`
            processor = myThread(item)
            processor.start()
        with thread_done:
            thread_done.wait()

thread_done = threading.Condition()
bigset = set(["50,000 items", "..."])
manageThreads(10)
The myThread class run method:
def run(self):
    try:
        some_other_script.method(self.item)
        bigset.remove(self.item)
    finally:
        with thread_done:
            thread_done.notify()
threading.enumerate() returns a list of the currently active thread objects. So the manageThreads function initially creates 10 threads, then waits for one to finish, then checks the thread count again, and so on. If a thread runs out of memory or another error occurs during processing, it won't remove the item from bigset, causing the item to be requeued by the manager onto a different thread.
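If you would rather not manage the threads by hand, here is a minimal alternative sketch (not the approach above) using concurrent.futures, assuming some_other_script.method raises an exception such as MemoryError when it fails:

from concurrent.futures import ThreadPoolExecutor, as_completed

import some_other_script

def process_all(items, workers=10):
    pending = set(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {pool.submit(some_other_script.method, it): it
                       for it in pending}
            for fut in as_completed(futures):
                if fut.exception() is None:
                    pending.discard(futures[fut])
                # items whose call raised (e.g. MemoryError) stay in
                # pending and are resubmitted on the next pass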
I am trying to pass each element from my list to a function that is started on its own thread doing its own work. The problem is that if the list has 100+ elements, it will start 100 functions on 100 threads.
For the sake of my computer I want to process the list in batches of 10, with the following steps:
Batch 1 gets queued.
Pass each element from batch1 to the function getting started on its own thread (This way I will only have 10 function threads running at a time)
Once all 10 threads have finished, they get popped off from their queue
Repeat until all batches are done
I was trying to use two lists, where the first 10 elements get popped into list2. Process list2; once the threads are done, pop 10 more elements, until list1 reaches a length of 0.
I have gotten this far but am not sure how to proceed:
carsdotcomOptionVal, carsdotcomOptionMakes = getMakes()
second_list = []
threads = []
while len(carsdotcomOptionVal) != 0:
    second_list.append(carsdotcomOptionVal.pop(10))
    for makesOptions in second_list:
        th = threading.Thread(target=getModels, args=[makesOptions])
        th.start()
        threads.append(th)
    for thread in threads:
        thread.join()
Lastly, the number of elements in the main list doesn't have to be even; it can be odd.
You should use a queue.Queue object, which gives you a thread-safe list of tasks for your "worker threads". You can choose how many worker threads are active, and they will each feed from the queue until it's done.
Here's what a sample code looks like with a queue:
import queue
import threading

threads_to_start = 10  # or choose how many you want

my_queue = queue.Queue()

def worker():
    while not my_queue.empty():
        data = my_queue.get()
        do_something_with_data(data)
        my_queue.task_done()

for i in range(100):
    my_queue.put(i)  # replace "i" with whatever data you want the threads to process

for i in range(threads_to_start):
    t = threading.Thread(target=worker, daemon=True)  # daemon means that all threads will exit when the main thread exits
    t.start()

my_queue.join()  # this will block the main thread from exiting until the queue is empty and all data has been processed
Keep in mind this is just rough pseudo-code to introduce you to threading and queues; there is more to it than that, but the example should work in most simple use cases.
This is scalable too: if you can support more or fewer threads, all you have to change is the number you initially set in threads_to_start.
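One caveat worth adding (my own hedge, not something the sample above guards against): checking empty() and then calling get() is a check-then-act race, so a worker can block forever on get() if another worker takes the last item in between. A non-blocking variant of the worker avoids this:

def worker():
    while True:
        try:
            data = my_queue.get_nowait()  # raises queue.Empty instead of blocking
        except queue.Empty:
            return  # nothing left, let this thread finish
        do_something_with_data(data)
        my_queue.task_done()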
I am learning about Thread in Python and am trying to make a simple program, one that uses threads to grab a number off the Queue and print it.
I have the following code
import threading
from Queue import Queue

test_lock = threading.Lock()
tests = Queue()

def start_thread():
    while not tests.empty():
        with test_lock:
            if tests.empty():
                return
            test = tests.get()
            print("{}".format(test))

for i in range(10):
    tests.put(i)

threads = []
for i in range(5):
    threads.append(threading.Thread(target=start_thread))
    threads[i].daemon = True

for thread in threads:
    thread.start()

tests.join()
When run it just prints the values and never exits.
How do I make the program exit when the Queue is empty?
From the docstring of Queue.join():
    Blocks until all items in the Queue have been gotten and processed.

    The count of unfinished tasks goes up whenever an item is added to the
    queue. The count goes down whenever a consumer thread calls task_done()
    to indicate the item was retrieved and all work on it is complete.

    When the count of unfinished tasks drops to zero, join() unblocks.
So you must call tests.task_done() after processing the item.
Since your threads are daemon threads, and the queue will handle concurrent access correctly, you don't need to check if the queue is empty or use a lock. You can just do:
def start_thread():
    while True:
        test = tests.get()
        print("{}".format(test))
        tests.task_done()
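With task_done() paired with every get(), the unfinished-task count drops to zero once all ten items have been processed, tests.join() returns, and the daemon threads are discarded when the main thread exits.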
I have a queue that always needs to be ready to process items when they are added to it. The function that runs on each item in the queue creates and starts thread to execute the operation in the background so the program can go do other things.
However, the function I am calling on each item in the queue simply starts the thread and then completes execution, regardless of whether or not the thread it started completed. Because of this, the loop will move on to the next item in the queue before the program is done processing the last item.
Here is code to better demonstrate what I am trying to do:
import Queue
import threading

queue = Queue.Queue()

def addTask():
    queue.put(SomeObject())

def worker():
    while True:
        try:
            # If an item is put onto the queue, immediately execute it (unless
            # an item on the queue is still being processed, in which case wait
            # for it to complete before moving on to the next item in the queue)
            item = queue.get()
            runTests(item)
            # I want to wait for 'runTests' to complete before moving past this point
        except Queue.Empty:
            # If the queue is empty, just keep running the loop until something
            # is put on top of it.
            pass

def runTests(args):
    op_thread = SomeThread(args)
    op_thread.start()
    # My problem is that once this last line 'op_thread.start()' starts the
    # thread, the 'runTests' function completes, but the operation executed
    # by the thread is not yet done because it is still running in the
    # background. I do not want 'runTests' to return until the operation in
    # that thread is done executing.
    """op_thread.join()"""
    # I tried putting this line after 'op_thread.start()', but that did not
    # solve anything. I have commented it out because it is not necessary to
    # demonstrate what I am trying to do, but I just wanted to show that I
    # tried it.

t = threading.Thread(target=worker)
t.start()
Some notes:
This is all running in a PyGTK application. Once the 'SomeThread' operation is complete, it sends a callback to the GUI to display the results of the operation.
I do not know how much this affects the issue I am having, but I thought it might be important.
A fundamental issue with Python threads is that you can't just kill them - they have to agree to die.
What you should do is:
Implement the thread as a class.
Add a threading.Event member which the join method clears and the thread's main loop occasionally checks; if it sees the event is cleared, it returns. To support this, override threading.Thread.join to clear the event and then call Thread.join on itself.
To allow (2), make the read from the Queue block with some small timeout. This way your thread's "response time" to the kill request is at most the timeout, while on the other hand no CPU is wasted on busy-waiting.
Here's some code from a socket client thread I have that has the same issue with blocking on a queue:
class SocketClientThread(threading.Thread):
    """ Implements the threading.Thread interface (start, join, etc.) and
        can be controlled via the cmd_q Queue attribute. Replies are placed in
        the reply_q Queue attribute.
    """
    def __init__(self, cmd_q=Queue.Queue(), reply_q=Queue.Queue()):
        super(SocketClientThread, self).__init__()
        self.cmd_q = cmd_q
        self.reply_q = reply_q
        self.alive = threading.Event()
        self.alive.set()
        self.socket = None

        self.handlers = {
            ClientCommand.CONNECT: self._handle_CONNECT,
            ClientCommand.CLOSE: self._handle_CLOSE,
            ClientCommand.SEND: self._handle_SEND,
            ClientCommand.RECEIVE: self._handle_RECEIVE,
        }

    def run(self):
        while self.alive.isSet():
            try:
                # Queue.get with timeout to allow checking self.alive
                cmd = self.cmd_q.get(True, 0.1)
                self.handlers[cmd.type](cmd)
            except Queue.Empty as e:
                continue

    def join(self, timeout=None):
        self.alive.clear()
        threading.Thread.join(self, timeout)
Note self.alive and the loop in run.
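For completeness, here is a hedged usage sketch; the ClientCommand constructor arguments are assumptions, since that class isn't shown above:

client = SocketClientThread()
client.start()
# ClientCommand's constructor is assumed here; the class is not shown above
client.cmd_q.put(ClientCommand(ClientCommand.CONNECT, ('localhost', 5000)))
# ... later, e.g. on shutdown:
client.join()  # clears the alive event; run() notices within ~0.1s and returns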
I am using this code:
def startThreads(arrayofkeywords):
    global i
    i = 0
    while len(arrayofkeywords):
        try:
            if i < maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i + 1
                thread = doStuffWith(keyword)
                thread.start()
        except KeyboardInterrupt:
            sys.exit()
    thread.join()
This is for threading in Python. I have almost everything done, but I don't know how to manage the results of each thread: each thread produces an array of strings as its result. How can I join all those arrays into one safely? If I try writing into a global array, two threads could be writing at the same time.
First, you actually need to save all those thread objects to call join() on them. As written, you're saving only the last one of them, and then only if there isn't an exception.
An easy way to do multithreaded programming is to give each thread all the data it needs to run, and then have it not write to anything outside that working set. If all threads follow that guideline, their writes will not interfere with each other. Then, once a thread has finished, have the main thread be the only one that aggregates the results into a global array. This is known as "fork/join parallelism."
If you subclass the Thread object, you can give it space to store that return value without interfering with other threads. Then you can do something like this:
class MyThread(threading.Thread):
    def __init__(self, ...):
        self.result = []
        ...

def main():
    # doStuffWith() returns a MyThread instance
    threads = [doStuffWith(k) for k in arrayofkeywords[:maxThreads]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
        ret = t.result
        # process return value here
Edit:
After looking around a bit, it seems like the above method isn't the preferred way to do threads in Python. The above is more of a Java-esque pattern for threads. Instead you could do something like:
def handler(outList):
    ...
    # Modify the existing object (important!)
    outList.append(1)
    ...

def doStuffWith(keyword):
    ...
    result = []
    thread = Thread(target=handler, args=(result,))
    return (thread, result)

def main():
    threads = [doStuffWith(k) for k in arrayofkeywords[:maxThreads]]
    for t in threads:
        t[0].start()
    for t in threads:
        t[0].join()
        ret = t[1]
        # process return value here
Use a Queue.Queue instance, which is intrinsically thread-safe. Each thread can .put its results to that global instance when it's done, and the main thread (when it knows all working threads are done, by .joining them for example as in #unholysampler's answer) can loop .getting each result from it, and use each result to .extend the "overall result" list, until the queue is emptied.
Edit: there are other big problems with your code -- if the maximum number of threads is less than the number of keywords, it will never terminate (you're trying to start a thread per keyword -- never fewer -- but if you've already started the max number, you loop forever to no further purpose).
Consider instead using a threading pool, kind of like the one in this recipe, except that in lieu of queueing callables you'll queue the keywords -- since the callable you want to run in the thread is the same in each thread, just varying the argument. Of course that callable will be changed to peel something from the incoming-tasks queue (with .get) and .put the list of results to the outgoing-results queue when done.
To terminate the N threads you could, after all keywords, .put N "sentinels" (e.g. None, assuming no keyword can be None): a thread's callable will exit if the "keyword" it just pulled is None.
More often than not, Queue.Queue offers the best way to organize threading (and multiprocessing!) architectures in Python, be they generic like in the recipe I pointed you to, or more specialized like I'm suggesting for your use case in the last two paragraphs.
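A minimal sketch of that architecture, hedged: process_keyword is a placeholder for the real per-keyword work (assumed to return a list of strings), and the Queue import follows the Python 2 naming used in this question:

import threading
import Queue  # Python 2; on Python 3 this is the queue module

task_q = Queue.Queue()
result_q = Queue.Queue()

def worker():
    while True:
        keyword = task_q.get()
        if keyword is None:  # sentinel: no more work
            return
        result_q.put(process_keyword(keyword))  # placeholder callable

def run(keywords, n_threads=10):
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for kw in keywords:
        task_q.put(kw)
    for _ in threads:  # one sentinel per worker
        task_q.put(None)
    for t in threads:
        t.join()
    results = []
    while not result_q.empty():  # safe: all workers have been joined
        results.extend(result_q.get())
    return results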
You need to keep pointers to each thread you make. As is, your code only ensures the last created thread finishes. This does not imply that all the ones you started before it have also finished.
def startThreads(arrayofkeywords):
    global i
    i = 0
    threads = []
    while len(arrayofkeywords):
        try:
            if i < maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i + 1
                thread = doStuffWith(keyword)
                thread.start()
                threads.append(thread)
        except KeyboardInterrupt:
            sys.exit()
    for t in threads:
        t.join()
        # process results stored in each thread
This also solves the problem of write access, because each thread will store its data locally. Then after all of them are done, you can do the work to combine each thread's local data.
I know that this question is a little bit old, but the best way to do this is to not hurt yourself too much with the machinery proposed by the other answers :)
Please read the reference documentation on Pool. This way you will fork-join your work:
from multiprocessing import Pool  # multiprocessing.dummy offers the same API backed by threads

def doStuffWith(keyword):
    return keyword + ' processed in thread'

def startThreads(arrayofkeywords):
    pool = Pool(processes=maxThreads)
    result = pool.map(doStuffWith, arrayofkeywords)
    print result
Writing into a global array is fine if you use a semaphore to protect the critical section. You 'acquire' the lock when you want to append to the global array, then 'release' it when you are done. This way, only one thread is ever appending to the array at a time.
Check out http://docs.python.org/library/threading.html and search for semaphore for more info.
sem = threading.Semaphore()
...
sem.acquire()
# do dangerous stuff
sem.release()
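As a concrete, hedged illustration of that skeleton (results and worker are invented names, not from the question):

import threading

results = []  # shared, so every append must hold the semaphore
sem = threading.Semaphore()

def worker(chunk):
    local = [s.upper() for s in chunk]  # thread-local work needs no lock
    sem.acquire()
    try:
        results.extend(local)  # critical section: touch the global array
    finally:
        sem.release()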
Try the semaphore's methods, acquire and release:
http://docs.python.org/library/threading.html
I have one thread that writes results into a Queue.
In another thread (GUI), I periodically (in the IDLE event) check if there are results in the queue, like this:
def queue_get_all(q):
    items = []
    while 1:
        try:
            items.append(q.get_nowait())
        except Empty:
            break
    return items
Is this a good way to do it ?
Edit: I'm asking because sometimes the waiting thread gets stuck for a few seconds without taking out new results.
The "stuck" problem turned out to be because I was doing the processing in the idle event handler, without making sure that such events are actually generated by calling wx.WakeUpIdle, as is recommended.
If you're always pulling all available items off the queue, is there any real point in using a queue, rather than just a list with a lock? ie:
from __future__ import with_statement
import threading

class ItemStore(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.items = []

    def add(self, item):
        with self.lock:
            self.items.append(item)

    def getAll(self):
        with self.lock:
            items, self.items = self.items, []
            return items
If you're also pulling them individually, and making use of the blocking behaviour for empty queues, then you should use Queue, but your use case looks much simpler, and might be better served by the above approach.
[Edit2] I'd missed the fact that you're polling the queue from an idle loop, and from your update I see that the problem isn't related to contention, so the approach below isn't really relevant to your problem. I've left it in, in case anyone finds a blocking variant of this useful:
For cases where you do want to block until you get at least one result, you can modify the above code to wait for data to become available through being signalled by the producer thread. Eg.
class ItemStore(object):
    def __init__(self):
        self.cond = threading.Condition()
        self.items = []

    def add(self, item):
        with self.cond:
            self.items.append(item)
            self.cond.notify()  # Wake 1 thread waiting on cond (if any)

    def getAll(self, blocking=False):
        with self.cond:
            # If blocking is true, always return at least 1 item
            while blocking and len(self.items) == 0:
                self.cond.wait()
            items, self.items = self.items, []
            return items
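A short, hedged usage sketch (producer, consumer, and handle are invented names):

store = ItemStore()

def producer():
    for i in range(100):
        store.add(i)  # each add() notifies one waiting consumer

def consumer():
    while True:
        batch = store.getAll(blocking=True)  # sleeps until at least one item exists
        handle(batch)  # handle() is a placeholder for real processing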
I think the easiest way of getting all items out of the queue is the following:
def get_all_queue_result(queue):
    result_list = []
    while not queue.empty():
        result_list.append(queue.get())
    return result_list
I'd be very surprised if the get_nowait() call caused the pause by not returning if the list was empty.
Could it be that you're posting a large number of (maybe big?) items between checks which means the receiving thread has a large amount of data to pull out of the Queue? You could try limiting the number you retrieve in one batch:
def queue_get_all(q):
    items = []
    maxItemsToRetrieve = 10
    for _ in range(maxItemsToRetrieve):
        try:
            items.append(q.get_nowait())
        except Empty:
            break
    return items
This would limit the receiving thread to pulling up to 10 items at a time.
The simplest method is using a list comprehension:
items = [q.get() for _ in range(q.qsize())]
Use of the range function is generally frowned upon, but I haven't found a simpler method yet.
If you're done writing to the queue, qsize should do the trick without needing to check the queue for each iteration.
responseList = []
for _ in range(q.qsize()):
    responseList.append(q.get_nowait())
I see you are using get_nowait(), which, according to the documentation, "return[s] an item if one is immediately available, else raise the Empty exception".
Now, you happen to break out of the loop when an Empty exception is thrown. Thus, if there is no result immediately available in the queue, your function returns an empty items list.
Is there a reason why you are not using the get() method instead? It may be the case that the get_nowait() fails because the queue is servicing a put() request at that same moment.