Can I have two multithreaded functions running at the same time? - python

I'm very new to multi-threading. I have two functions in my Python script. One function, enqueue_tasks, iterates through a large list of small items and performs a task on each one, which involves appending an item to a list (let's call it master_list). I have already multi-threaded this using futures:
executor = concurrent.futures.ThreadPoolExecutor(15) # Arbitrarily 15
futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 50)]
concurrent.futures.wait(futures)
I have another function process_master that iterates through the master_list above and checks the status of each item in the list, then does some operation.
Can I use the same method above to multi-thread process_master? Furthermore, can I have it running at the same time as enqueue_tasks? What are the implications of this? process_master is dependent on the list from enqueue_tasks, so will running them at the same time be a problem? Is there a way I can delay the second function a bit (using time.sleep, perhaps)?

No, this isn't safe. If enqueue_tasks and process_master are running at the same time, you could potentially be adding items to master_list inside enqueue_tasks at the same time process_master is iterating over it. Changing the size of an iterable while you iterate over it causes undefined behavior in Python, and should always be avoided. You should use a threading.Lock to protect the code that adds items to master_list, as well as the code that iterates over master_list, to ensure they never run at the same time.
Better yet, use a Queue.Queue (queue.Queue in Python 3.x) instead of a list, which is a thread-safe data structure. Add items to the Queue in enqueue_tasks and get items from the Queue in process_master. That way process_master can safely run at the same time as enqueue_tasks.
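For illustration, a minimal sketch of that Queue-based approach (do_task and check_status are placeholders for the question's real work; grouper and key_list are taken from the question; a None sentinel tells the consumer that the producers are done):
import concurrent.futures
import queue  # Queue in Python 2.x

master_queue = queue.Queue()

def enqueue_tasks(group):
    # Producer: push results onto the thread-safe queue instead of a list.
    for item in group:
        result = do_task(item)          # placeholder for the real task
        master_queue.put(result)

def process_master():
    # Consumer: block until an item is available, stop on the sentinel.
    while True:
        item = master_queue.get()
        if item is None:
            break
        check_status(item)              # placeholder for the real status check

executor = concurrent.futures.ThreadPoolExecutor(15)
consumer = executor.submit(process_master)
producers = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 50)]
concurrent.futures.wait(producers)
master_queue.put(None)                  # no more work is coming
consumer.result()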

Related

traverse two lists asynchronously?

I have two lists, lzma2_list and rar_list. Both hold a varying number of object names that change daily. There is a directory where these objects live, called "O:". There are two methods that should handle this data:
bkp.zipto_rar(path,object_name)
bkp.zipto_lzma(path,object_name)
How could I process all items from both lists asynchronously, without waiting for one list to finish before starting the other?
speed up compression using list asynchronously and threads
I tried using the answers to this question, but in my case the methods receive two parameters: one fixed, referring to the directory, and another that changes constantly, referring to the items in the list.
Since your functions take parameters, you can use functools.partial to bind the arguments and get a callable that takes none.
Then you can use asyncio.new_event_loop().run_in_executor to process each item in background threads if the functions are IO-bound, or multiprocessing.Pool to process items in background processes if they are CPU-bound.
You can even combine the two approaches and use many threads in each background process, but it's hard to write a useful example without knowing the specifics of your functions and lists. Gathering the results afterwards may also be non-trivial.
import asyncio
import functools

lzma2_list = []
rar_list = []

def process_lzma2_list():
    path = 'CONST'
    for item in lzma2_list:
        func = functools.partial(bkp.zipto_lzma, *(path, item))
        asyncio.new_event_loop().run_in_executor(executor=None, func=func)

def process_rar_list():
    path = 'CONST'
    for item in rar_list:
        func = functools.partial(bkp.zipto_rar, *(path, item))
        asyncio.new_event_loop().run_in_executor(executor=None, func=func)

if __name__ == '__main__':
    # it's ok to run these 2 functions sequentially as they just create tasks,
    # actual processing is done in background
    process_lzma2_list()
    process_rar_list()
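If you don't otherwise need asyncio, a plain concurrent.futures.ThreadPoolExecutor gives the same effect and makes it easy to collect results or exceptions. This is just a sketch, assuming bkp.zipto_lzma and bkp.zipto_rar are I/O-bound and that "O:" is the directory from the question:
import concurrent.futures

path = 'O:'  # the directory mentioned in the question

with concurrent.futures.ThreadPoolExecutor() as executor:
    # Submit every item from both lists; neither list waits for the other.
    futures = [executor.submit(bkp.zipto_lzma, path, item) for item in lzma2_list]
    futures += [executor.submit(bkp.zipto_rar, path, item) for item in rar_list]
    # Collect results (or re-raise exceptions) as each call finishes.
    for future in concurrent.futures.as_completed(futures):
        future.result()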

Python Lists and Thread Safety

I am using multiple threads to work through items in a very large list until it is empty.
while item_list:
    my_item = item_list.pop()
I check whether any items are left in the list and, if so, pop one and work on it. Is this process thread-safe?
Is there a chance that when I check there is an item in the list, but by the time I pop it the item will be gone and an error will be raised? Are there any other issues?
Yes, a thread switch could happen between the two lines and the list could be empty by the time you pop the item. Use a thread-safe queue.Queue() to store your work items instead.
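A minimal sketch of that conversion (do_work stands in for the question's real per-item work, item_list is the question's original list, and the worker count of 4 is arbitrary):
import queue
import threading

work_queue = queue.Queue()
for item in item_list:
    work_queue.put(item)

def worker():
    while True:
        try:
            # get_nowait() either returns an item or raises queue.Empty;
            # there is no separate "check then pop" race window.
            my_item = work_queue.get_nowait()
        except queue.Empty:
            return
        do_work(my_item)    # placeholder for the real per-item work

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()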

Optimization for Python code

I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1, 2, 3, 4]) which can be up to a thousand items long.
This function can potentially get called tens of thousands of times at a time and I want to know if I can do anything to make it run faster.
The graph_hash is a large hash that maps keys to sets of length 1000 or less. I am iterating over a set and mapping the values to names and returning a list. The u.get_name_from_id() queries an sqlite database.
Any thoughts to optimize any part of this function?
def get_neighbors(pid):
    names = []
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
    return names
Caching and multithreading are only going to get you so far; you should create a new method that retrieves multiple names from the database in a single bulk query (for example SELECT ... WHERE id IN (...)) instead of issuing one query per id.
Something like names = u.get_names_from_ids(graph_hash[pid]).
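For illustration, a hypothetical get_names_from_ids built on sqlite3 (the table and column names are assumptions; for very large id sets you would need to chunk the query because SQLite limits the number of bound parameters):
import sqlite3

def get_names_from_ids(conn, ids):
    # One round trip instead of len(ids) separate queries.
    ids = list(ids)
    placeholders = ','.join('?' for _ in ids)
    rows = conn.execute(
        'SELECT id, name FROM names WHERE id IN (%s)' % placeholders, ids
    ).fetchall()
    by_id = dict(rows)
    return [by_id.get(i) for i in ids]   # preserve input order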
You're hitting the database sequentially here:
for p in graph_hash[pid]:
    names.append(u.get_name_from_id(p))
I would recommend doing it concurrently using threads. Something like this should get you started:
import threading
import Queue  # queue in Python 3.x

def load_stuff(queue, p):
    queue.put(u.get_name_from_id(p))

def get_neighbors(pid):
    names = Queue.Queue()
    # we'll keep track of the threads with this list
    threads = []
    for p in graph_hash[pid]:
        thread = threading.Thread(target=load_stuff, args=(names, p))
        threads.append(thread)
        # start the thread
        thread.start()
    # wait for them to finish before you return your Queue
    for thread in threads:
        thread.join()
    return names
You can turn the Queue back into a list with [item for item in names.queue] if needed.
The idea is that the database calls are blocking until they're done, but you can make multiple SELECT statements on a database without locking. So, you should use threads or some other concurrency method to avoid waiting unnecessarily.
I would recommend using a collections.deque instead of a list if you are doing thousands of appends, so names would become names = deque().
A list comprehension is a start (similar to #cricket_007's generator suggestion), but you are limited by function calls:
def get_neighbors(pid):
    return [u.get_name_from_id(p) for p in graph_hash[pid]]
As #salparadise suggested, consider memoization to speed up get_name_from_id().
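A minimal sketch of that memoization using functools.lru_cache (Python 3.2+; on Python 2 a plain dict cache would do), assuming u.get_name_from_id always returns the same name for a given id during a run:
import functools

@functools.lru_cache(maxsize=None)
def cached_name(p):
    # Only the first lookup for a given id hits the database.
    return u.get_name_from_id(p)

def get_neighbors(pid):
    return [cached_name(p) for p in graph_hash[pid]]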

Is modifying a pre-allocated Python list thread-safe?

I have a function that dispatches calls to several Redis shards and stores the result in a pre-allocated Python list.
Basically, the code goes as follow:
def myfunc(calls):
    results = [None] * len(calls)
    for index, (connection, call) in enumerate(calls.iteritems()):
        results[index] = call(connection)
    return results
Obviously, as of now this calls the various Redis shards in sequence. I intend to use a thread pool and to make those calls happen in parallel as they can take quite some time.
My question is: given that the results list is preallocated and that each call has a dedicated slot, do I need to lock it to store the result from each thread, or is there any guarantee that it will work without locking?
I will obviously profile the result in the end but I wouldn't want to lock if I don't really need to.
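For reference, a rough sketch of the thread-pool version described above, mirroring the question's Python 2 code (the choice of concurrent.futures is an assumption; on Python 2 it comes from the futures backport):
import concurrent.futures

def myfunc(calls):
    results = [None] * len(calls)

    def run(index, connection, call):
        # Each worker writes only to its own pre-allocated slot.
        results[index] = call(connection)

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(run, index, connection, call)
            for index, (connection, call) in enumerate(calls.iteritems())
        ]
        concurrent.futures.wait(futures)
    return results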

Does pool.map() from multiprocessing lock process to CPU core automatically?

I've submitted several questions over the last few days trying to understand how to use the multiprocessing Python library properly.
The current method I'm using is to split a task over a number of processes equal to the number of available CPU cores on the machine, as follows:
import multiprocessing
from multiprocessing import Pool
from contextlib import closing

def myFunction(row):
    # row function (simplified body shown in the EDIT below)
    pass

with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
    pool.map(myFunction, rowList)
However, when the map part of the program is reached, things actually seem to slow down rather than speed up. One of my functions, for example, moves through only 60 records (the first function) and prints a result at the end of each record. The printing seems to slow down towards an eventual stop and not do much at all! I am wondering whether the program is loading the next function into memory asynchronously or whether there's something wrong with my methodology.
So I am wondering - are the child processes automatically 'LOCKED' to each CPU core with the pool.map() or do I need to do something extra?
EDIT:
So the program does not actually stop, it just begins to print the values very slowly.
Here is an example of myFunction in very simplified terms (row is from a list object):
def myFunction(row):
    d = string
    j = 0
    for item in object:
        d += row[j]
        j = j + 1
    d += row[x] + string
    d += row[y] + string
    print row[z]
    return
As I said, the above function is for a very small list; however, the function that follows it deals with a much larger list.
The problem is that you don't appear to be doing enough work in each call to the worker function. All you seem to be doing is pasting together the list of strings passed as an argument. However, that is pretty much exactly what the multiprocessing module has to do in the parent process to hand the list of strings to the worker process: it pickles them and writes them to a pipe, which the child process then reads, unpickles, and passes as the argument to myFunction.
Since, in order to pass the argument to the worker process, the parent process has to do at least as much work as the worker process itself, you gain no benefit from using the multiprocessing module in this case.
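A quick way to see this effect for yourself is to time a trivial string-joining worker serially and through a Pool and compare; this is only a sketch with made-up data, not the question's real records:
import time
from contextlib import closing
from multiprocessing import Pool

def join_row(row):
    # Trivial per-item work, roughly what the simplified myFunction does.
    return ''.join(row)

if __name__ == '__main__':
    rows = [['a'] * 100 for _ in range(100000)]

    start = time.time()
    serial = [join_row(row) for row in rows]
    print('serial: %.2fs' % (time.time() - start))

    start = time.time()
    with closing(Pool()) as pool:
        parallel = pool.map(join_row, rows)
    print('pool:   %.2fs' % (time.time() - start))
    # On most machines the Pool version is no faster (often slower), because
    # pickling each row and sending it to a worker costs about as much as
    # joining it in the first place.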
