Is modifying a pre-allocated Python list thread-safe?

I have a function that dispatches calls to several Redis shards and stores the result in a pre-allocated Python list.
Basically, the code goes as follows:

    def myfunc(calls):
        results = [None] * len(calls)
        for index, (connection, call) in enumerate(calls.iteritems()):
            results[index] = call(connection)
        return results
Obviously, as of now this calls the various Redis shards in sequence. I intend to use a thread pool and to make those calls happen in parallel as they can take quite some time.
My question is: given that the results list is pre-allocated and that each call has a dedicated slot, do I need a lock to store the result from each thread, or is there any guarantee that it will work without one?
I will obviously profile the result in the end, but I wouldn't want to lock if I don't really need to.
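For reference, a minimal sketch of the thread-pool version, assuming Python 3 (so `.items()` rather than `.iteritems()`); `myfunc_parallel`, `run`, and `max_workers` are made-up names, not part of the original code. Each worker writes only to its own pre-assigned index, so no two threads ever touch the same slot:

```python
from concurrent.futures import ThreadPoolExecutor

def myfunc_parallel(calls, max_workers=8):
    items = list(calls.items())   # Python 3 spelling of iteritems()
    results = [None] * len(items)

    def run(index, connection, call):
        # each thread writes only to its own dedicated slot
        results[index] = call(connection)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run, i, conn, call)
                   for i, (conn, call) in enumerate(items)]
    for f in futures:
        f.result()  # surface any exception raised inside a worker
    return results
```

The explicit `f.result()` loop is there so an exception in any call is re-raised in the caller instead of silently leaving a `None` in the list.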


Merge using threads not working in python

I have to merge two lists, and each time I fill the lists in order to merge them, but what is happening is that I did it like this:
    def repeated_fill_buffer(self):
        """
        repeat the operation until reaching the end of file
        """
        # clear buffers from last data
        self.block = [[] for file in self.files]
        filling_buffer_thread = threading.Thread(self.fill_buffer())
        filling_buffer_thread.start()

        # create inverted index thread
        create_inverted_index_thread = threading.Thread(self.create_inverted_index())
        create_inverted_index_thread.start()

        # check if buffers are not empty to merge and start the thread
        if any(self.block):
            self.block = [[] for file in self.files]

        filling_buffer_thread.join()
        create_inverted_index_thread.join()
But what happens is that filling_buffer_thread and create_inverted_index_thread are only called once and never run again; when I debugged the code I saw that filling_buffer_thread had stopped.
I don't know if I have explained my question well, but what I want is to be able to call the same threads multiple times and run them.
If an operation is CPU-bound, then using threads is of no use because of the Python GIL, which prevents multiple bytecode instructions from being executed at a time. Use the multiprocessing module instead, since every process has its own GIL.
All number crunching, or any operation that depends on the CPU for its completion, is CPU-bound. Threads are useful for I/O-bound operations (like database calls and network calls).
To summarize your error: your filling_buffer_thread got blocked due to create_inverted_index_thread.
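Note also that `threading.Thread(self.fill_buffer())` calls `fill_buffer` immediately on the current thread and passes its return value to the `Thread` constructor; the callable itself must be passed via `target=`, and since a `Thread` object can only be `start()`ed once, a fresh one is needed for each round. A minimal sketch (the worker function here is a stand-in, not the asker's real method):

```python
import threading

def fill_buffer(block):
    block.append("data")  # stand-in for the real buffer-filling work

block = []
for _ in range(3):
    # pass the callable via target= -- do NOT call it here
    t = threading.Thread(target=fill_buffer, args=(block,))
    t.start()
    t.join()  # a Thread can only be started once, so make a new one each round
```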

How to make list of twisted deferred operations truly async

I am new to using the Twisted library, and I want to make a list of operations async. Take the following pseudo code as an example:
    @defer.inlineCallbacks
    def getDataAsync(host):
        data = yield AsyncHttpAPI(host)  # some async api which returns a deferred
        return data

    @defer.inlineCallbacks
    def funcPrintData():
        hosts = []  # some list of hosts, say 1000 in number
        for host in hosts:
            data = yield getDataAsync(host)
            # why doesn't the following line get printed as soon as the first
            # result is available? it waits for all getDataAsync calls to be
            # queued before calling the callback and so printing data
            print(data)
Please comment if the question is not clear. Is there a better way of doing this? Should I instead be using a DeferredList?
The line:

    data = yield getDataAsync(host)

means "stop running this function until the getDataAsync(host) operation has completed." If the function stops running, the for loop can't get to any subsequent iterations, so those operations can't even begin until after the first getDataAsync(host) has completed. If you want to run everything concurrently then you need to not stop running the function until all of the operations have started. For example:
    ops = []
    for host in hosts:
        ops.append(getDataAsync(host))
After this runs, all of the operations will have started regardless of whether or not any have finished.
What you do with ops depends on whether you want results in the same order as hosts or if you want them all at once when they're all ready or if you want them one at a time in the order the operations succeed.
DeferredList is for getting them all at once when they're all ready as a list in the same order as the input list (ops):
    datas = yield DeferredList(ops)
If you want to process each result as it becomes available, it's easier to use addCallback:
    ops = []
    for host in hosts:
        ops.append(getDataAsync(host).addCallback(print))
This still doesn't yield, so the whole group of operations is started. However, the callback on each operation runs as soon as that operation has a result. You're still left with a list of Deferred instances in ops, which you can use to wait for all of the results to finish, or to attach overall error handling to (at least one of those is a good idea, otherwise you have dangling operations that you can't easily account for in callers of funcPrintData).

Optimization for Python code

I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1, 2, 3, 4]) which can be up to a thousand elements long.
This function can potentially get called tens of thousands of times at a time and I want to know if I can do anything to make it run faster.
The graph_hash is a large hash that maps keys to sets of length 1000 or less. I am iterating over a set and mapping the values to names and returning a list. The u.get_name_from_id() queries an sqlite database.
Any thoughts to optimize any part of this function?
    def get_neighbors(pid):
        names = []
        for p in graph_hash[pid]:
            names.append(u.get_name_from_id(p))
        return names
Caching and multithreading are only going to get you so far; you should create a new method that retrieves multiple names from the database in bulk, for example with a single parameterized SELECT ... IN (...) query (note that sqlite3's executemany cannot be used with statements that return rows).
Something like names = u.get_names_from_ids(graph_hash[pid]).
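A sketch of such a bulk helper using the sqlite3 module; the table and column names (`people`, `id`, `name`) are assumptions, since the asker's schema isn't shown:

```python
import sqlite3

def get_names_from_ids(conn, ids):
    """Fetch all names in one round trip instead of one query per id."""
    ids = list(ids)
    placeholders = ",".join("?" * len(ids))  # one "?" per id, joined safely
    rows = conn.execute(
        "SELECT id, name FROM people WHERE id IN (%s)" % placeholders, ids
    )
    by_id = dict(rows)
    return [by_id[i] for i in ids]  # preserve the caller's ordering
```

The final list comprehension re-orders the results, since SQL makes no ordering guarantee for an IN query.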
You're hitting the database sequentially here:
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
I would recommend doing it concurrently using threads. Something like this should get you started:
    def load_stuff(queue, p):
        queue.put(u.get_name_from_id(p))

    def get_neighbors(pid):
        names = Queue.Queue()
        # we'll keep track of the threads with this list
        threads = []
        for p in graph_hash[pid]:
            thread = threading.Thread(target=load_stuff, args=(names, p))
            threads.append(thread)
            # start the thread
            thread.start()

        # wait for them to finish before you return your Queue
        for thread in threads:
            thread.join()

        return names
You can turn the Queue back into a list with [item for item in names.queue] if needed.
The idea is that the database calls are blocking until they're done, but you can make multiple SELECT statements on a database without locking. So, you should use threads or some other concurrency method to avoid waiting unnecessarily.
I would recommend using a deque instead of a list if you are doing thousands of appends, so names should be names = deque().
A list comprehension is a start (similar to @cricket_007's generator suggestion), but you are limited by function calls:
    def get_neighbors(pid):
        return [u.get_name_from_id(p) for p in graph_hash[pid]]
As @salparadise suggested, consider memoization to speed up get_name_from_id().
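A sketch of that memoization with functools.lru_cache, assuming get_name_from_id is a pure id-to-name lookup (the stand-in below fakes the database call; the `calls` list only exists to show the cache working):

```python
from functools import lru_cache

calls = []  # records each actual "database" hit, for illustration only

@lru_cache(maxsize=None)
def get_name_from_id(p):
    calls.append(p)          # stand-in for the real sqlite query
    return "name-%d" % p

def get_neighbors(pid, graph_hash):
    return [get_name_from_id(p) for p in graph_hash[pid]]
```

Repeated ids, whether within one call or across tens of thousands of calls, then hit the cache instead of the database.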

Can I have two multithreaded functions running at the same time?

I'm very new to multi-threading. I have 2 functions in my Python script. One function, enqueue_tasks, iterates through a large list of small items and performs a task on each item which involves appending an item to a list (let's call it master_list). This I have already multi-threaded using futures.
    executor = concurrent.futures.ThreadPoolExecutor(15)  # Arbitrarily 15
    futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 50)]
    concurrent.futures.wait(futures)
I have another function process_master that iterates through the master_list above and checks the status of each item in the list, then does some operation.
Can I use the same method above to use multi-threading for process_master? Furthermore, can I have it running at the same time as enqueue_tasks? What are the implications of this? process_master is dependent on the list from enqueue_tasks, so will running them at the same time be a problem? Is there a way I can delay the second function a bit? (using time.sleep perhaps)?
No, this isn't safe. If enqueue_tasks and process_master are running at the same time, you could potentially be adding items to master_list inside enqueue_tasks at the same time process_master is iterating over it. Changing the size of an iterable while you iterate over it causes undefined behavior in Python, and should always be avoided. You should use a threading.Lock to protect the code that adds items to master_list, as well as the code that iterates over master_list, to ensure they never run at the same time.
Better yet, use a Queue.Queue (queue.Queue in Python 3.x) instead of a list, which is a thread-safe data structure. Add items to the Queue in enqueue_tasks, and get items from the Queue in process_master. That way process_master can safely run at the same time as enqueue_tasks.
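A minimal sketch of that producer/consumer arrangement, using the Python 3 queue module; the task items, the sentinel shutdown signal, and the `* 10` "processing" are all placeholders for the asker's real work:

```python
import queue
import threading

master_queue = queue.Queue()
SENTINEL = object()  # unique marker telling the consumer to stop
processed = []

def enqueue_tasks(items):
    for item in items:
        master_queue.put(item)      # Queue.put is thread-safe: no lock needed
    master_queue.put(SENTINEL)      # tell the consumer we're done

def process_master():
    while True:
        item = master_queue.get()   # blocks until an item is available
        if item is SENTINEL:
            break
        processed.append(item * 10) # stand-in for the real status check

producer = threading.Thread(target=enqueue_tasks, args=(range(5),))
consumer = threading.Thread(target=process_master)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Because `get()` blocks, the consumer naturally waits for the producer with no need for time.sleep.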

Does pool.map() from multiprocessing lock process to CPU core automatically?

I've submitted several questions over last few days trying to understand how to use the multiprocessing python library properly.
Current method I'm using is to split a task over a number of processes that is equal to the number of available CPU cores on the machine, as follows:
    import multiprocessing
    from multiprocessing import Pool
    from contextlib import closing

    def myFunction(row):
        ...  # row function

    with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
        pool.map(myFunction, rowList)
However, when the map part is reached, the program actually seems to slow down rather than speed up. One of my functions, for example, moves through only 60 records (the first function) and prints a result at the end of each record. The record printing seems to slow down to a near stop and do very little! I am wondering if the program is loading the next function into memory asynchronously, or whether there's something wrong with my methodology.
So I am wondering: are the child processes automatically 'LOCKED' to each CPU core with pool.map(), or do I need to do something extra?
EDIT:
So the program does not actually stop, it just begins to print the values very slowly.
here is an example of myFunction in very simplified terms (row is from a list object):
    def myFunction(row):
        d = string
        j = 0
        for item in object:
            d += row[j]
            j = j + 1
        d += row[x] + string
        d += row[y] + string
        print row[z]
        return
As I said, the above function is for a very small list, however the function proceeding it deals with a much much larger list.
The problem is that you don't appear to be doing enough work in each call to the worker function. All you seem to be doing is pasting together the list of strings passed as the argument. However, that is pretty much exactly what the multiprocessing module needs to do in the parent process to pass the list of strings to the worker process: it pickles them and writes them to a pipe, which the child process then reads, unpickles, and passes as the argument to myFunction.
Since the parent process has to do at least as much work to pass the argument to the worker process as the worker process then does with it, you gain no benefit from using the multiprocessing module in this case.
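One way to tip that balance back, if the per-row work really is this small, is to hand each worker more rows per dispatch via pool.map's chunksize argument, which batches many pickled arguments into each pipe round trip. A minimal sketch (the row-joining function below is a stand-in for myFunction, and the numbers are arbitrary):

```python
import multiprocessing

def join_row(row):
    # stand-in for myFunction: cheap per row, so dispatch overhead dominates
    return "-".join(row)

def run_demo():
    rows = [["a", "b", "c"]] * 1000
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # chunksize=100 sends 100 rows per pickle/unpickle round trip,
        # amortizing the IPC cost described above
        return pool.map(join_row, rows, chunksize=100)

if __name__ == "__main__":
    out = run_demo()
    print(len(out), out[0])
```

This only helps at the margins; when the work per item is trivially small, a plain in-process loop will still usually win.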
