Sharing dictionary with item-level locking in multiprocessing (without Manager) - python

I need to share a dict object between multiple processes. Each process, at its runtime, will read or modify existing items and/or add new items into the dict. I am using the manager object from multiprocessing similar to this code:
import multiprocessing as mp

def init_d(d, states):
    for s in states:
        if tuple(list(s) + [0]) not in d:
            for i in range(4):
                d[tuple(list(s) + [i])] = 0

if __name__ == '__main__':
    with mp.Manager() as manager:
        master_d = manager.dict()
        states1 = [(1, 2), (3, 4)]
        states2 = [(1, 2), (5, 6)]
        p1 = mp.Process(target=init_d, args=(master_d, states1))
        p2 = mp.Process(target=init_d, args=(master_d, states2))
        p1.start()
        p2.start()
        p1.join()
        p2.join()
        print(master_d)
However, it is extremely slow for my particular usage. Aside from the fact that a Manager is slower than shared memory (according to the Python docs), the much more important issue is that in my application each process reads/modifies/adds only one item of the dict at a time and has nothing to do with the rest. So, instead of locking the entire dict object, which causes the enormous slow-down, I was wondering whether there is a way to lock only the specific item, for example in the __getitem__ and __setitem__ methods.
I have seen several related questions, but no answer. Please do not tell me about thread safety and locking stuff. Also, I am NOT looking for the generic case; rather:
I need an implementation that allows one process to read an item of the dictionary while another process modifies a different item, and possibly yet another process adds a new item, all at the same time.
The dict can be extremely large, and I rely on the O(1) property of hash tables. Hence, I cannot use a list or tuple instead.
Each process only reads/modifies/adds one item. Hence, keeping a local dict and updating the shared one at the end (after a batch of local updates) is not an option.
Any help is deeply appreciated.
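For what it's worth, one pattern that approximates per-item locking is lock striping: keep a fixed pool of Manager locks and pick one per key via its hash, so two processes touching different keys rarely block each other. The sketch below is only an illustration of that pattern (the names N_LOCKS and stripe_for are made up here); every access still goes through the Manager server, so it does not remove the Manager overhead the question complains about, and with string keys the per-process hash randomization would need a stable hash instead of hash().

import multiprocessing as mp

N_LOCKS = 16  # number of lock "stripes"; more stripes means less contention

def stripe_for(key, locks):
    # map a key to one lock from a fixed pool via its hash
    return locks[hash(key) % len(locks)]

def init_d(d, locks, states):
    for s in states:
        for i in range(4):
            key = tuple(s) + (i,)
            with stripe_for(key, locks):   # only this key's stripe is held
                if key not in d:
                    d[key] = 0

if __name__ == '__main__':
    with mp.Manager() as manager:
        master_d = manager.dict()
        locks = [manager.Lock() for _ in range(N_LOCKS)]
        states1 = [(1, 2), (3, 4)]
        states2 = [(1, 2), (5, 6)]
        p1 = mp.Process(target=init_d, args=(master_d, locks, states1))
        p2 = mp.Process(target=init_d, args=(master_d, locks, states2))
        p1.start()
        p2.start()
        p1.join()
        p2.join()
        print(dict(master_d))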

Related

How are locks differentiated in multiprocessing?

Let's say that you have two lists created with manager.list(), and two locks created with manager.Lock(). How do you assign each lock to each list?
I was doing something like
lock1 = manager.Lock()
lock2 = manager.Lock()
list1 = manager.list()
list2 = manager.list()
and when I wanted to write to/read from a list
lock1.acquire()
list1.pop(0)
lock1.release()
lock2.acquire()
list2.pop(0)
lock2.release()
Today I realized that there's nothing that associates lock1 to list1.
Am I misunderstanding these functions?
TL;DR yes, and this might be an XY problem!
If you create a multiprocessing.Manager() and use its methods to create container primitives (.list and .dict), they will already be synchronized, and you don't need to deal with the synchronization primitives yourself:
from multiprocessing import Manager, Process, freeze_support

def my_function(d, lst):
    lst.append([x**2 for x in d.values()])

def main():
    with Manager() as manager:  # context-managed SyncManager
        normal_dict = {'a': 1, 'b': 2}
        managed_synchronized_dict = manager.dict(normal_dict)
        managed_synchronized_list = manager.list()  # used to store results

        p = Process(
            target=my_function,
            args=(managed_synchronized_dict, managed_synchronized_list)
        )
        p.start()
        p.join()

        print(managed_synchronized_list)

if __name__ == '__main__':
    freeze_support()
    main()
% python3 ./test_so_66603485.py
[[1, 4]]
multiprocessing.Array is also synchronized.
BEWARE: proxy objects are not directly comparable to their Python collection equivalents
Note: The proxy types in multiprocessing do nothing to support comparisons by value. So, for instance, we have:
>>> manager.list([1,2,3]) == [1,2,3]
False
One should just use a copy of the referent instead when making comparisons.
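For instance (a quick illustration, not from the docs), comparing against a plain copy of the referent behaves as expected:

>>> list(manager.list([1, 2, 3])) == [1, 2, 3]
True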
Some confusion might come from the section of the multiprocessing docs on Synchronization Primitives, which implies that one should use a Manager to create synchronization primitives, when really the Manager can already do the synchronization for you
Synchronization primitives
Generally synchronization primitives are not as necessary in a multiprocess program as they are in a multithreaded program. See the documentation for threading module.
Note that one can also create synchronization primitives by using a manager object – see Managers.
If you use simply multiprocessing.Manager(), per the docs, it
Returns a started SyncManager object which can be used for sharing objects between processes. The returned manager object corresponds to a spawned child process and has methods which will create shared objects and return corresponding proxies.
From the SyncManager section
Its methods create and return Proxy Objects for a number of commonly used data types to be synchronized across processes. This notably includes shared lists and dictionaries.
This means that you probably have most of what you want already
manager object with methods for building managed types
synchronization via Proxy Objects
Finally, to sum up the thread from the comments, specifically about instances of Lock objects:
there's no inherent way to tell that some named lock is for anything in particular, other than meta-information such as its name, comments, or documentation; conversely, a lock is free to be used for whatever synchronization needs you may have
a useful class/container can be made to manage both the lock and whatever it should be synchronizing -- a normal multiprocessing.Manager (SyncManager)'s .list and .dict do this, and a variety of other useful constructs exist, such as Pipe and Queue (a small sketch of such a container follows the snippet below)
one lock can be used to synchronize any number of actions, but having more, finer-grained locks can be a valuable trade-off, because a single coarse lock may unnecessarily block access to unrelated resources
a variety of synchronization primitives also exist for different purposes
value = my_queue.get()                # already synchronized
if not my_lock1.acquire(timeout=5):   # False if it cannot acquire
    raise CustomException("failed to acquire my_lock1 after 5 seconds")
try:
    with my_lock2:                    # blocks until acquired, released on exit
        some_shared_mutable = some_container_of_mutables[-1]
        some_shared_mutable = some_object_with_mutables.get()
        if foo(value, some_shared_mutable):
            action1(value, some_shared_mutable)
        action2(value, some_other_shared_mutable)
finally:
    my_lock1.release()
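As an illustration of the "container that pairs a lock with the data it guards" point above, here is a minimal sketch (the class name LockedList is made up for this example; a Manager's own .list already synchronizes individual method calls, so an explicit lock like this mainly matters for multi-step operations):

from multiprocessing import Manager

class LockedList:
    """Pairs a managed list with the lock that guards it (illustrative sketch)."""
    def __init__(self, manager):
        self._lock = manager.Lock()
        self._list = manager.list()

    def append(self, item):
        with self._lock:
            self._list.append(item)

    def pop_front(self):
        with self._lock:               # one lock guards the check-then-pop sequence
            return self._list.pop(0) if len(self._list) else None

if __name__ == '__main__':
    with Manager() as manager:
        numbers = LockedList(manager)
        numbers.append(1)
        print(numbers.pop_front())     # 1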

Python: Pass Readonly Shelve to subprocesses

As discussed here: Python: Multiprocessing on Windows -> Shared Readonly Memory I have a heavily parallelized task.
Multiple workers do some stuff and in the end need to access some keys of a dictionary which contains several million key:value combinations. The keys which will be accessed are only known within the worker after some further action, also involving some file processing (the example below is just for demonstration purposes, hence simplified).
Before, my solution was to keep this big dictionary in memory, pass it once into shared memory and access it by the single workers. But it consumes a lot of RAM... So I wanted to use shelve (because the values of that dictionary are again dicts or lists).
So a simplified example of what I tried was:
import multiprocessing
import shelve

def shelveWorker(tupArgs):
    id, DB = tupArgs
    return DB[id]

if __name__ == '__main__':
    DB = shelve.open('file.db', flag='r', protocol=2)
    joblist = []
    for id in range(10000):
        joblist.append((str(id), DB))

    p = multiprocessing.Pool()
    for returnValue in p.imap_unordered(shelveWorker, joblist):
        # do something with returnValue
        pass
    p.close()
    p.join()
Unfortunately I get:
"TypeError: can't pickle DB objects"
But IMHO it does not make any sense to open the shelve itself (DB = shelve.open('file.db', flag='r', protocol=2)) within each worker on its own because of slower runtime (I have several thousand workers).
How to go about it?
Thanks!
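Not from the original thread, but the usual workaround for "can't pickle" errors on a shelve handle is to open the shelve once per worker process (not once per task) via Pool's initializer; with thousands of tasks you still only pay the open cost once per pool process. A minimal sketch, assuming file.db already exists:

import multiprocessing
import shelve

_db = None  # per-worker handle, filled in by the initializer

def init_worker(db_path):
    # runs once in each pool process
    global _db
    _db = shelve.open(db_path, flag='r', protocol=2)

def shelve_worker(key):
    return _db[key]

if __name__ == '__main__':
    keys = [str(i) for i in range(10000)]
    with multiprocessing.Pool(initializer=init_worker, initargs=('file.db',)) as pool:
        for return_value in pool.imap_unordered(shelve_worker, keys):
            pass  # do something with return_value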

Multiprocessing storing read-only string-array for all processes

I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = []  # about a 200 MB array (10M rows)
# a list of unique keys for comparing inside worker processes to a set of flat files
I need more threads, as it is going very slowly doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
    # Do some parallel comparisons, "if not in" type stuff.
    # Generate an array of lines of text like:
    # "this item_x was not detected in CSV list (from current_flatfile)"
    if current_item not in list_of_unique_keys_from_csv_file:
        all_lines_this_worker_generated.append(sometext + current_item)
    return all_lines_this_worker_generated

def main():
    all_results = []
    pool = Pool(processes=6)
    partitioned_flat_files = []  # divide files from glob by 6
    results = pool.starmap(worker_process, partitioned_flat_files,
                           {{{{i wanna pass in my read-only parameter}}}})
    pool.close()
    pool.join()
    all_results.extend(results)
    resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main question: do I need some sort of Pipe or Queue? I can't seem to find good examples of how to pass around a big read-only string array, one copy for each worker process.
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Actually, every process, even a sub-process, has its own resources; that means that no matter how you pass the parameters to it, it keeps a copy of the original instead of sharing it. In this simple case, when you pass the parameters from the main process into the sub-processes, Pool automatically makes a copy of your variables. Because the sub-processes only have copies of the originals, modifications cannot be shared. That doesn't matter here, since your variables are read-only.
But be careful about your code: you need to wrap the parameters into an iterable collection, for example:
from multiprocessing import Pool

def add(a, b):
    return a + b

pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]
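Applied to the shape of the question, the same idea is to put both the file partition and the read-only key list into each tuple of the starmap iterable. A rough sketch with toy data standing in for the real files (this worker_process is a simplified stand-in, not the asker's actual function):

from multiprocessing import Pool

def worker_process(flat_file_group, unique_keys):
    # unique_keys arrives as a read-only copy in each worker; membership tests
    # are much faster against a set than against a 10M-element list
    key_set = set(unique_keys)
    missing = []
    for item in flat_file_group:
        if item not in key_set:
            missing.append("item {} was not detected in CSV list".format(item))
    return missing

if __name__ == '__main__':
    unique_keys = ['a', 'b', 'c']            # stands in for the 10M CSV keys
    partitioned = [['a', 'x'], ['b', 'y']]   # stands in for the grouped flat files
    with Pool(processes=2) as pool:
        results = pool.starmap(worker_process,
                               [(group, unique_keys) for group in partitioned])
    print(results)
    # [['item x was not detected in CSV list'], ['item y was not detected in CSV list']]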

Can't mutate a variable local to a function in another function

I am trying to manipulate the lists inside the dictionary clean_txt in another function, but it's not working and I end up with empty lists inside the dict.
My understanding is that both lists and dicts are mutable objects, so what is the problem here?
def process_questions(i, question_list, questions, question_list_name):
    ''' Transform questions and display progress '''
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(text_to_wordlist(str(question)))

#timeit
def multi(n_cores, tq, qln):
    procs = []
    clean_txt = {}
    for i in range(n_cores):
        clean_txt[i] = []

    for index in range(n_cores):
        tq_indexed = tq[index*len(tq)//n_cores:(index+1)*len(tq)//n_cores]
        proc = Process(target=process_questions, args=(index, clean_txt[index], tq_indexed, qln, ))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

    print('{} records processed from {}'.format(sum([len(x) for x in clean_txt.values()]), qln))
    print('-'*100)
You are using processes, not threads.
When a process is created, the memory of your program is copied and each process works on its own copy, therefore it is NOT shared.
Here's a question that can help you understand: Multiprocessing vs Threading Python
If you want to share memory between processes, you should look into semaphores or use threads instead. There are also other solutions for sharing data, like queues, a database, etc.
You are appending to clean_txt[index] from a different Process. clean_txt[index] belongs to the main Python process that created it. Since a process can't access or modify another process's memory, you can't append to it. (Not really. See edit below.)
You will need to create shared memory.
You can use Manager to create shared memory, something like this
from multiprocessing import Manager
manager = Manager()
...
clean_txt[i] = manager.list()
Now you can append to this list in the other process.
Edit -
My explanation about clean_txt wasn't clear. Thanks to #Maresh.
When a new Process is created, the whole memory is copied. So modifying the list in the new process won't affect the copy in the main process. That's why you need shared memory.
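Putting the answer's suggestion into the question's own code, a minimal runnable sketch could look like the following (text_to_wordlist is replaced by a trivial stand-in, and the timing decorator is omitted):

from multiprocessing import Manager, Process

def process_questions(i, question_list, questions, question_list_name):
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(str(question).lower())   # stand-in for text_to_wordlist

def multi(n_cores, tq, qln):
    procs = []
    clean_txt = {}
    with Manager() as manager:
        for i in range(n_cores):
            clean_txt[i] = manager.list()   # shared list, appendable from child processes
        for index in range(n_cores):
            tq_indexed = tq[index * len(tq) // n_cores:(index + 1) * len(tq) // n_cores]
            proc = Process(target=process_questions,
                           args=(index, clean_txt[index], tq_indexed, qln))
            procs.append(proc)
            proc.start()
        for proc in procs:
            proc.join()
        print('{} records processed from {}'.format(
            sum(len(x) for x in clean_txt.values()), qln))

if __name__ == '__main__':
    multi(2, ['How does this work?', 'Why is it slow?'], 'questions')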

Sharing many queues among processes in Python

I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which is linking a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes; each one seems to have its own copy of the queue and thus no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through
inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
    manager = mp.Manager()
    pool = mp.Pool()
    DICT2 = manager.dict()

    for key in all_keys:
        DICT2[key] = manager.Queue()
        proc_1 = pool.apply_async(targ1, (DICT1[key],))   # DICT1 is defined in the config file
        proc_2 = pool.apply_async(targ2, (DICT2[key], otherargs,))
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)
It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is copied, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline; by giving them only the queues they need, there's no need for them to know about any mappings:
import multiprocessing as mp

def stage1(q_in, q_out):
    q_out.put(q_in.get() + "Stage 1 did some work.\n")
    return

def stage2(q_in, q_out):
    q_out.put(q_in.get() + "Stage 2 did some work.\n")
    return

def main():
    pool = mp.Pool()
    manager = mp.Manager()

    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()

    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))

    # send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")

    # wait for work to complete
    print(q_s2_to_main.get() + "Main finished the job.")

    pool.close()
    pool.join()
    return

if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResults objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.
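To sketch the "keep a normal dictionary in main" part, here is a minimal example (the keys and the doubling worker are made up) where main holds plain dicts of managed queues and each worker only ever receives its own pair:

import multiprocessing as mp

def worker(key, q_in, q_out):
    # each worker only sees its own pair of queues
    q_out.put((key, q_in.get() * 2))

if __name__ == '__main__':
    manager = mp.Manager()
    all_keys = ['a', 'b', 'c']
    # plain dicts that live only in the main process; workers get just the proxies they need
    in_queues = {key: manager.Queue() for key in all_keys}
    out_queues = {key: manager.Queue() for key in all_keys}

    procs = [mp.Process(target=worker, args=(key, in_queues[key], out_queues[key]))
             for key in all_keys]
    for p in procs:
        p.start()

    for i, key in enumerate(all_keys):
        in_queues[key].put(i)

    for key in all_keys:
        print(out_queues[key].get())   # ('a', 0), ('b', 2), ('c', 4)

    for p in procs:
        p.join()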
