I am using multiple threads to work through items in a very large list until it is empty.
while item_list:
    my_item = item_list.pop()
I check whether any items are left in the list and, if so, pop one and work on it. Is this process thread-safe?
Is there a chance that when I check there is an item in the list, it will be gone by the time I pop it, raising an error? Are there any other issues?
Yes, a thread switch could happen between the two lines and the list could be empty by the time you pop the item. Use a thread-safe queue.Queue() to store your work items.
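For illustration, a rough sketch of that pattern (process() here is a stand-in for whatever work you do per item):
import queue
import threading

work = queue.Queue()
for item in item_list:
    work.put(item)

def worker():
    while True:
        try:
            # get_nowait() raises queue.Empty instead of blocking,
            # so workers exit cleanly once the work runs out
            my_item = work.get_nowait()
        except queue.Empty:
            return
        process(my_item)  # stand-in for your per-item work

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()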
I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1, 2, 3, 4]), which can be up to a thousand items long.
This function can potentially get called tens of thousands of times at a time, and I want to know if I can do anything to make it run faster.
The graph_hash is a large hash that maps keys to sets of length 1000 or less. I am iterating over a set, mapping the values to names, and returning a list. u.get_name_from_id() queries an sqlite database.
Any thoughts on optimizing any part of this function?
def get_neighbors(pid):
    names = []
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
    return names
Caching and multithreading are only going to get you so far; you should create a new method that retrieves multiple names from the database in a single bulk query (note that sqlite3's executemany can't return rows, so a parameterized IN clause is the usual tool here).
Something like names = u.get_names_from_ids(graph_hash[pid]).
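A rough sketch of such a bulk method, assuming a names table with id and name columns (the real schema isn't shown in the question):
def get_names_from_ids(self, ids):
    ids = list(ids)
    # one parameterized SELECT with an IN clause instead of one query per id
    placeholders = ",".join("?" * len(ids))
    rows = self.conn.execute(
        "SELECT id, name FROM names WHERE id IN (%s)" % placeholders,
        ids,
    ).fetchall()
    found = dict(rows)
    return [found[p] for p in ids if p in found]
Keep in mind that SQLite caps the number of parameters per statement (historically 999), so id sets near the 1000 mark may need to be split into chunks.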
You're hitting the database sequentially here:
for p in graph_hash[pid]:
    names.append(u.get_name_from_id(p))
I would recommend doing it concurrently using threads. Something like this should get you started:
import queue
import threading

def load_stuff(names, p):
    names.put(u.get_name_from_id(p))

def get_neighbors(pid):
    names = queue.Queue()
    # we'll keep track of the threads with this list
    threads = []
    for p in graph_hash[pid]:
        thread = threading.Thread(target=load_stuff, args=(names, p))
        threads.append(thread)
        # start the thread
        thread.start()
    # wait for them all to finish before you return your Queue
    for thread in threads:
        thread.join()
    return names
You can turn the Queue back into a list with [item for item in names.queue] if needed.
The idea is that each database call blocks until it's done, but the database can serve multiple SELECT statements at the same time without locking. So you should use threads or some other concurrency method to avoid waiting unnecessarily.
I would recommend using a deque instead of a list if you are doing thousands of appends; so names should be names = deque() (with from collections import deque).
A list comprehension is a start (similar to @cricket_007's generator suggestion), but you are limited by the function calls:
def get_neighbors(pid):
    return [u.get_name_from_id(p) for p in graph_hash[pid]]
As @salparadise suggested, consider memoization to speed up get_name_from_id().
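As a sketch, memoization can be as simple as wrapping the lookup with functools.lru_cache on Python 3 (wrapped from the outside here, since u's internals aren't shown):
from functools import lru_cache

@lru_cache(maxsize=None)
def get_name_cached(p):
    # repeated ids hit the cache instead of the database
    return u.get_name_from_id(p)

def get_neighbors(pid):
    return [get_name_cached(p) for p in graph_hash[pid]]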
Everything I see is about lists, but this is about events = queue.Queue(), which is a queue with objects that I want to extract. How would I go about getting the last N elements from that queue?
By definition, you can't.
What you can do is use a loop or a comprehension to get the first N elements (you can't get things from the end of a queue):
N = 2
first_N_elements = [my_queue.get() for _ in range(N)]
If you're multi-threading, "the last N elements from that queue" is undefined and the question doesn't make sense.
If there is no multi-threading, it depends on whether you care about the other elements (not the last N).
If you don't:
for i in range(events.qsize() - N):
    events.get()
After that, get the N items.
If you don't want to throw away the other items, you'll just have to move everything to a different data structure (like a list). The whole point of a queue is to get things in a certain order.
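For example, a minimal single-threaded sketch that drains the queue into a list and keeps everything:
items = []
while not events.empty():  # safe only if no other thread is touching the queue
    items.append(events.get())
last_N = items[-N:]  # the last N elements, oldest first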
Though this was years ago, it bears mentioning that there is a method, depending on the queue type you need. If the latest entries to the queue are your concern, use a last-in-first-out queue, the opposite of the standard first-in-first-out queue. (Please give larger examples; there is little info to go on here.) With what is given, here's how you get the last N elements from a LIFO queue:
from queue import LifoQueue, Empty

events = LifoQueue()

# Wait for queue to be loaded with N events
N = 5
while N:
    try:
        event = events.get_nowait()
    except Empty:
        break
    # process event
    N -= 1
You should handle the case where the queue is empty; in that case, exit the loop and do not attempt to process.
I have a concurrent.futures.ThreadPoolExecutor and a list. With the following code I add futures to the ThreadPoolExecutor:
for id in id_list:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
And then I wait upon the list:
concurrent.futures.wait(self._futures)
However, self.myfunc does some network I/O, so there will be occasional network exceptions. When an error occurs, self.myfunc submits a new self.myfunc with the same id to the same thread pool and appends a new future to the same list, just as above:
try:
    do_stuff(id)
except:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
    return None
Here comes the problem: I got an error on the line of concurrent.futures.wait(self._futures):
File "/usr/lib/python3.4/concurrent/futures/_base.py", line 277, in wait
    f._waiters.remove(waiter)
ValueError: list.remove(x): x not in list
How should I properly add new Futures to a list while already waiting upon it?
Looking at the implementation of wait(), it certainly doesn't expect that anything outside concurrent.futures will ever mutate the list passed to it. So I don't think you'll ever get that "to work". It's not just that it doesn't expect the list to mutate, it's also that significant processing is done on list entries, and the implementation has no way to know that you've added more entries.
Untested, I'd suggest trying this instead: skip all that, and just keep a running count of threads still active. A straightforward way is to use a Condition guarding a count.
Initialization:
self._count_cond = threading.Condition()
self._thread_count = 0
When my_func is entered (i.e., when a new thread starts):
with self._count_cond:
    self._thread_count += 1
When my_func is done (i.e., when a thread ends), for whatever reason (exceptional or not):
with self._count_cond:
    self._thread_count -= 1
    self._count_cond.notify()  # wake up the waiting logic
And finally the main waiting logic:
with self._count_cond:
    while self._thread_count:
        self._count_cond.wait()
POSSIBLE RACE
It seems possible that the thread count could reach 0 while work for a new thread has been submitted, but before its my_func invocation starts running (and so before _thread_count is incremented to account for the new thread).
So the:
with self._count_cond:
    self._thread_count += 1
part should really be done instead right before each occurrence of
self._thread_pool.submit(self.myfunc, id)
Or write a new method to encapsulate that pattern; e.g., like so:
def start_new_thread(self, id):
    with self._count_cond:
        self._thread_count += 1
    self._thread_pool.submit(self.myfunc, id)
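Pieced together outside the asker's class, a minimal self-contained sketch of the whole pattern (again untested; the module-level names, the sleep, and the retry trigger are stand-ins, not the asker's code):
import threading
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
count_cond = threading.Condition()
thread_count = 0

def start_new_task(task_id):
    global thread_count
    # increment *before* submitting, to avoid the race described above
    with count_cond:
        thread_count += 1
    pool.submit(my_func, task_id)

def my_func(task_id):
    global thread_count
    try:
        time.sleep(0.1)  # stand-in for the real (network) work
        if task_id == 3:
            start_new_task(task_id + 100)  # simulate a retry spawning new work
    finally:
        with count_cond:
            thread_count -= 1
            count_cond.notify()  # wake up the waiting logic

for i in range(5):
    start_new_task(i)

with count_cond:
    while thread_count:
        count_cond.wait()
print("all work done")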
A DIFFERENT APPROACH
Offhand, I expect this could work too (but, again, haven't tested it): keep all your code the same except change how you're waiting:
while self._futures:
    self._futures.pop().result()
So this simply waits for one thread at a time, until none remain.
Note that .pop() and .append() on lists are atomic in CPython, so no need for your own lock. And because your my_func() code appends before the thread it's running in ends, the list won't become empty before all threads really are done.
AND YET ANOTHER APPROACH
Keep the original waiting code, but rework the rest not to create new threads in case of exception. For example, rewrite my_func to return True if it quits due to an exception and False otherwise, and start threads running a wrapper instead:
def my_func_wrapper(self, id):
    keep_going = True
    while keep_going:
        keep_going = self.my_func(id)
This may be especially attractive if you someday decide to use multiple processes instead of multiple threads (creating new processes can be a lot more expensive on some platforms).
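For completeness, a sketch of my_func rewritten to follow that contract (do_stuff is the call from the question):
def my_func(self, id):
    try:
        do_stuff(id)
    except Exception:
        return True  # quit due to an exception: the wrapper retries
    return False  # success: the wrapper's loop ends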
AND A WAY USING cf.wait()
Another way is to change just the waiting code:
while self._futures:
    fs = self._futures[:]
    for f in fs:
        self._futures.remove(f)
    concurrent.futures.wait(fs)
Clear? This makes a copy of the list to pass to .wait(), and the copy is never mutated. New threads show up in the original list, and the whole process is repeated until no new threads show up.
Which of these ways makes most sense seems to me to depend mostly on pragmatics, but there's not enough info about all you're doing for me to make a guess about that.
I'm very new to multi-threading. I have 2 functions in my Python script. One function, enqueue_tasks, iterates through a large list of small items and performs a task on each one, which involves appending the item to a list (let's call it master_list). I already have this multi-threaded using futures.
executor = concurrent.futures.ThreadPoolExecutor(15) # Arbitrarily 15
futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 50)]
concurrent.futures.wait(futures)
I have another function process_master that iterates through the master_list above and checks the status of each item in the list, then does some operation.
Can I use the same method above to use multi-threading for process_master? Furthermore, can I have it running at the same time as enqueue_tasks? What are the implications of this? process_master is dependent on the list from enqueue_tasks, so will running them at the same time be a problem? Is there a way I can delay the second function a bit? (using time.sleep perhaps)?
No, this isn't safe. If enqueue_tasks and process_master are running at the same time, you could potentially be adding items to master_list inside enqueue_tasks at the same time process_master is iterating over it. Changing the size of an iterable while you iterate over it causes undefined behavior in Python, and should always be avoided. You should use a threading.Lock to protect the code that adds items to master_list, as well as the code that iterates over master_list, to ensure they never run at the same time.
Better yet, use a Queue.Queue (queue.Queue in Python 3.x) instead of a list, which is a thread-safe data structure. Add items to the Queue in enqueue_tasks, and get items from the Queue in process_master. That way process_master can safely run at the same time as enqueue_tasks.
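A minimal sketch of that producer/consumer shape (the bodies of enqueue_tasks and process_master are simplified stand-ins for the asker's functions):
import queue
import threading

master_queue = queue.Queue()
DONE = object()  # sentinel so the consumer knows when to stop

def enqueue_tasks(group):
    for item in group:
        master_queue.put(item)  # thread-safe: no lock needed

def process_master():
    while True:
        item = master_queue.get()
        if item is DONE:
            break
        print("processing", item)  # stand-in for the real status check

consumer = threading.Thread(target=process_master)
consumer.start()

producers = [threading.Thread(target=enqueue_tasks, args=(g,))
             for g in ([1, 2], [3, 4])]
for t in producers:
    t.start()
for t in producers:
    t.join()

master_queue.put(DONE)  # all producers finished; let the consumer drain and exit
consumer.join()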
I am trying to fix a bug where multiple threads are writing to a list in memory. Right now I have a thread lock and am occasionally running into problems that are related to the work being done in the threads.
I was hoping to simply make a hash of lists, one for each thread, and remove the thread lock. It seems like each thread could write to its own record without worrying about the others, but perhaps the fact that they are all using the same owning hash would itself be a problem.
Does anyone happen to know if this will work or not? If not, could I, for example, dynamically add a list to a package for each thread? Is that essentially the same thing?
I am far from a threading expert so any advice welcome.
Thanks,
import os
import threading
import time

def job(root_folder, my_list):
    # os.walk yields (dirpath, dirnames, filenames), in that order
    for current, dirs, files in os.walk(root_folder):
        my_list.extend(files)
        time.sleep(1)

my_lists = [[], [], []]
my_folders = ["C:\\Windows", "C:\\Users", "C:\\Temp"]
my_threads = []
for folder, a_list in zip(my_folders, my_lists):
    my_threads.append(threading.Thread(target=job, args=(folder, a_list)))

for thread in my_threads:
    thread.start()
for thread in my_threads:
    thread.join()

my_full_list = my_lists[0] + my_lists[1] + my_lists[2]
This way each thread just modifies its own list and at the end combines all the individual lists.
Also, as pointed out, this gives zero performance gain (it is actually probably slower than not threading it); you may get performance gains by using multiprocessing instead...
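A minimal sketch of the multiprocessing variant (same folders as above):
import os
from multiprocessing import Pool

def list_files(root_folder):
    found = []
    for current, dirs, files in os.walk(root_folder):
        found.extend(files)
    return found

if __name__ == "__main__":
    my_folders = ["C:\\Windows", "C:\\Users", "C:\\Temp"]
    with Pool(processes=3) as pool:
        # each folder is walked in its own process
        results = pool.map(list_files, my_folders)
    my_full_list = [name for sub in results for name in sub]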
Don't use list. Use Queue (python2) or queue (python3).
There are 3 kinds of queues: FIFO, LIFO, and priority. The last one is for ordered data.
You can put data in on one side (from a thread):
q.put(data)
And get it out on the other side (maybe in a loop, writing to, say, a database):
while not q.empty():
    print(q.get())
https://docs.python.org/2/library/queue.html