I'm using multiprocessing to create a sub-process for my application.
I also share a dictionary between the main process and the sub-process.
Example of my code:
Main process:
from multiprocessing import Process, Manager
manager = Manager()
shared_dict = manager.dict()
p = Process(target=mysubprocess, args=(shared_dict,))
p.start()
p.join()
print(shared_dict)
My sub-process:
def mysubprocess(shared_dict):
    shared_dict['list_item'] = list()
    shared_dict['list_item'].append('test')
    print(shared_dict)
In both cases the printed value is:
{'list_item': []}
What could be the problem?
Thanks
Manager.dict will give you a dict where direct changes will be propagated between the processes, but it doesn't detect if you change objects contained in the dict (like the list stored under "list_item"). See the note at the bottom of the SyncManager documentation:
Note: Modifications to mutable values or items in dict and list proxies will not be propagated through the manager, because the proxy has no way of knowing when its values or items are modified. To modify such an item, you can re-assign the modified object to the container proxy.
So in your example the list gets synced when you set it in the dict, but the append doesn't trigger another sync.
You can get around that by re-assigning the key in the dict:
from multiprocessing import Process, Manager

def mysubprocess(shared_dict):
    item = shared_dict['list_item'] = list()
    item.append('test')
    # Re-assign the key so the proxy sends the modified list to the manager.
    shared_dict['list_item'] = item
    print('subprocess:', shared_dict)

if __name__ == '__main__':
    manager = Manager()
    shared_dict = manager.dict()
    p = Process(target=mysubprocess, args=(shared_dict,))
    p.start()
    p.join()
    print('main process:', shared_dict)
But that might get inefficient if the list is going to grow long - the whole list will be serialised and sent to the manager process for each append. A better way in that case would be to make the shared list directly with SyncManager.list (although you'll still have the same problem if the elements of the list are mutable - you need to reset them in the list to send them between the processes).
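A minimal sketch of that alternative, assuming Python 3 and a list that only holds immutable items such as strings. The child appends to a list proxy created with manager.list(), so each append is a single call on the proxy rather than a re-send of the whole list. The names worker and run_demo are hypothetical.

```python
from multiprocessing import Manager, Process


def worker(shared_list):
    # append is a method call on the proxy, so it reaches the manager.
    shared_list.append('test')


def run_demo():
    with Manager() as manager:
        shared_list = manager.list()
        p = Process(target=worker, args=(shared_list,))
        p.start()
        p.join()
        # Copy into a plain list before the manager shuts down.
        return list(shared_list)


if __name__ == '__main__':
    print('main process:', run_demo())
```

The copy at the end matters because a proxy stops working once its manager has exited.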
Related
Scenario
I have a long list in which each element is a string, and I want to apply the same operation to every element. Given the list's length, I want to use multiprocessing to create several processes, each handling only a portion of the list, in order to save time.
In the code below:
I create a multiprocessing.Manager so that ls can be shared between the 2 child processes.
Then I create a multiprocessing.Pool, hoping for two processes to be created: one runs func() on ls[0:100000] and the other runs func() on ls[100000:200000].
Every result of my_operation() on ls[index] is written in-place.
Example code (using 2 processes):
from multiprocessing import Pool, Manager

# len(ls) = 200000
ls = ['a', 'b', ...]

# Each element of the scopes below is the scope to be processed by the process.
# (I want to use 2 processes in this example)
scopes = [ (0, 100000), (100000, 200000) ]

m = Manager()
ls = m.list(ls) # create a shared list for child processes

def func(range_start, range_end):
    for index in range(range_start, range_end):
        ls[index] = my_operation(ls[index])

def my_operation(str):
    return str

with Pool(2) as pool:
    pool.starmap_async(func, scopes)
    pool.close()
    pool.join()
Questions
Is it a good idea to read and write on the same list when using multiprocessing to deal with different scopes of this list?
How can I improve my code? (Replace Manager with something else? Use less shared state, and if so, how? ...)
Thank you!
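One possible way to use less shared state, sketched here under assumptions: instead of writing into a managed list, each worker receives a plain slice and returns its results, and the parent joins them. The my_operation below is a hypothetical stand-in that uppercases the string, and process_chunk, split_scopes, run, and the tiny input list are illustrative names that are not part of the original code.

```python
from multiprocessing import Pool


def my_operation(s):
    # Hypothetical stand-in for the real per-element work.
    return s.upper()


def process_chunk(chunk):
    # Works on a private copy of the slice; nothing is shared.
    return [my_operation(s) for s in chunk]


def split_scopes(length, parts):
    # Return (start, end) pairs covering range(length) in `parts` pieces.
    bounds = [length * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def run(ls, n_procs=2):
    chunks = [ls[a:b] for a, b in split_scopes(len(ls), n_procs)]
    with Pool(n_procs) as pool:
        results = pool.map(process_chunk, chunks)
    # pool.map preserves input order, so the pieces join back in order.
    return [item for part in results for item in part]


if __name__ == '__main__':
    print(run(['a', 'b', 'c', 'd']))
```

With this layout every element crosses the process boundary twice (once in, once out) rather than once per read and once per write through a proxy.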
I am wondering: if I create a dict through multiprocessing.Manager(), will its values be locked automatically while a process is manipulating it, or should I call lock.acquire/release explicitly?
Previously, I called lock.acquire/release explicitly inside the function, but my code seemed to deadlock. That is strange, since I think there is only one lock in my code, so I am wondering whether manager.dict adds another lock automatically. When I remove the lock.acquire/release calls, the code works fine, but I am not sure whether the contents of the dict are correct.
import multiprocessing as mp
from functools import partial

def function(file_name, d, lock):
    key, value = read_files(file_name)
    #lock.acquire()
    if (key not in d):
        d[key] = []
    tmp = d[key]
    tmp.append(value)
    d[key] = tmp
    #lock.release()

if __name__ == '__main__':
    manager = mp.Manager()
    d = manager.dict()
    lock = manager.Lock()
    partial_function = partial(function, d=d, lock=lock)
    pool = mp.Pool(10)
    pool.map(partial_function, files) #files is a predefined list
    pool.close()
    pool.join()
Some of the related questions are listed below, but they seem to contradict each other.
Why a manager.dict need to be lock before writing inside?
How python manager.dict() locking works:
Yes, "locking" occurs automatically with multiprocessing.Manager().dict(). It is thread-safe in the sense that the internals will only allow one process to access (read or write) the object at any given time.
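That automatic locking covers each individual call on the proxy. A sequence such as "read the list, append, write it back" spans several calls, so another process can interleave between them. A minimal sketch of guarding such a sequence, assuming a lock created with manager.Lock(); add_value and the key/value pairs are hypothetical stand-ins for the function and read_files in the question.

```python
import multiprocessing as mp
from functools import partial


def add_value(pair, d, lock):
    key, value = pair
    # The with-statement releases the lock even if an exception is raised,
    # so a failing worker cannot leave it held.
    with lock:
        tmp = d.get(key, [])
        tmp.append(value)
        d[key] = tmp  # re-assign so the manager sees the change


if __name__ == '__main__':
    pairs = [('a', 1), ('b', 2), ('a', 3)]  # hypothetical input
    with mp.Manager() as manager:
        d = manager.dict()
        lock = manager.Lock()
        with mp.Pool(2) as pool:
            pool.map(partial(add_value, d=d, lock=lock), pairs)
        print(dict(d))
```

Using the lock as a context manager instead of separate acquire and release calls is one way to rule out a lock that is never released after an error.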
I need to share a dict object between multiple processes. Each process, at its runtime, will read or modify existing items and/or add new items into the dict. I am using the manager object from multiprocessing similar to this code:
import multiprocessing as mp

def init_d(d, states):
    for s in states:
        if tuple(list(s) + [0]) not in d:
            for i in range(4):
                d[tuple(list(s) + [i])] = 0

if __name__ == '__main__':
    with mp.Manager() as manager:
        master_d = manager.dict()
        states1 = [(1, 2), (3, 4)]
        states2 = [(1, 2), (5, 6)]
        p1 = mp.Process(target=init_d, args=(master_d, states1))
        p2 = mp.Process(target=init_d, args=(master_d, states2))
        p1.start()
        p2.start()
        p1.join()
        p2.join()
        print(master_d)
However, it is extremely slow for my particular usage. Aside from the fact that Manager is slower than shared memory (according to the Python docs), the much more important issue is that in my application each process reads, modifies, or adds only one item of the dict at a time and has nothing to do with the rest. So, instead of locking the entire dict object, which causes the enormous slow-down, I was wondering if there is a way to lock only the specific item, for example in the __getitem__ and __setitem__ methods.
I have seen several related questions, but no answer. Please do not tell me about thread safety and locking stuff. Also, I am NOT looking for the generic case, rather:
I need an implementation that allows one process to read an item of the dictionary while another process modifies a different item and yet another process possibly adds a new item, all at the same time.
The dict can be extremely large, and I rely on the O(1) lookup of hash tables. Hence, I cannot use a list or tuple instead.
Each process only reads, modifies, or adds one item. Hence, keeping a local dict and updating the shared one at the end (after a batch of local updates) is not an option.
Any help is deeply appreciated.
I am trying to manipulate the lists inside the dictionary clean_txt in another function, but it's not working and I end up with empty lists inside the dict.
My understanding is that both lists and dicts are mutable objects, so what is the problem here?
def process_questions(i, question_list, questions, question_list_name):
    ''' Transform questions and display progress '''
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(text_to_wordlist(str(question)))

#timeit
def multi(n_cores, tq, qln):
    procs = []
    clean_txt = {}
    for i in range(n_cores):
        clean_txt[i] = []
    for index in range(n_cores):
        tq_indexed = tq[index*len(tq)//n_cores:(index+1)*len(tq)//n_cores]
        proc = Process(target=process_questions, args=(index, clean_txt[index], tq_indexed, qln, ))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
    print('{} records processed from {}'.format(sum([len(x) for x in clean_txt.values()]), qln))
    print('-'*100)
You are using processes, not threads.
When a process is created, the memory of your program is copied and each process works on its own copy; therefore it is NOT shared.
Here's a question that can help you understand: Multiprocessing vs Threading Python
If you want to share memory between processes, you should look into multiprocessing's shared objects (such as Manager, Value, or Array) or use threads instead. There are also other ways to share data, such as queues or a database.
You are appending to clean_txt[index] from a different process. clean_txt[index] belongs to the main Python process that created it. Since a process can't access or modify another process's memory, you can't append to it. (Not really. See edit below)
You will need to create shared memory.
You can use Manager to create shared memory, something like this:
from multiprocessing import Manager
manager = Manager()
...
clean_txt[i] = manager.list()
Now you can append to this list in the other process.
Edit -
My explanation about clean_txt wasn't clear. Thanks to @Maresh.
When a new process is created, the whole memory is copied. So modifying the list in the new process won't affect the copy in the main process. So you need shared memory.
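Put together, the approach above might look like the following minimal sketch. It assumes Python 3; text_to_wordlist is replaced by a hypothetical stand-in that lowercases and splits the text, the timing decorator is left out, and the sample questions are made up for illustration.

```python
from multiprocessing import Manager, Process


def text_to_wordlist(text):
    # Hypothetical stand-in for the question's real cleaning function.
    return text.lower().split()


def process_questions(i, question_list, questions, question_list_name):
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        # question_list is a list proxy, so append reaches the manager.
        question_list.append(text_to_wordlist(str(question)))


def multi(n_cores, tq, qln):
    with Manager() as manager:
        clean_txt = {i: manager.list() for i in range(n_cores)}
        procs = []
        for index in range(n_cores):
            part = tq[index * len(tq) // n_cores:(index + 1) * len(tq) // n_cores]
            proc = Process(target=process_questions,
                           args=(index, clean_txt[index], part, qln))
            procs.append(proc)
            proc.start()
        for proc in procs:
            proc.join()
        # Copy the results out before the manager shuts down.
        return {i: list(items) for i, items in clean_txt.items()}


if __name__ == '__main__':
    result = multi(2, ['Hello world', 'Foo bar', 'Baz qux', 'One two'], 'demo')
    print(sum(len(x) for x in result.values()), 'records processed')
```

The outer clean_txt stays an ordinary dict because only the parent indexes it; the children only ever receive the list proxies.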
I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which is linking a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes, each one seems to have their own copy of the queue and thus no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through
inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
    manager = mp.Manager()
    pool = mp.Pool()
    DICT2 = manager.dict()
    for key in all_keys:
        DICT2[key] = manager.Queue()
        proc_1 = pool.apply_async(targ1, (DICT1[key],))  # DICT1 is defined in the config file
        proc_2 = pool.apply_async(targ2, (DICT2[key], otherargs,))
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)
It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is a copy, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline, but by giving them only the queues they need, there's no need for them to know about any mappings:
import multiprocessing as mp

def stage1(q_in, q_out):
    q_out.put(q_in.get()+"Stage 1 did some work.\n")
    return

def stage2(q_in, q_out):
    q_out.put(q_in.get()+"Stage 2 did some work.\n")
    return

def main():
    pool = mp.Pool()
    manager = mp.Manager()

    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()

    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))

    # Send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")

    # Wait for work to complete
    print(q_s2_to_main.get()+"Main finished the job.")

    pool.close()
    pool.join()
    return

if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResult objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.
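As a rough sketch of that idea, the following keeps ordinary dictionaries of managed queues in main, keyed by a hypothetical key, and hands each worker only its own pair of queues. The names worker, all_keys, queues_in, and queues_out are illustrative assumptions and are not taken from the question's program.

```python
import multiprocessing as mp


def worker(key, q_in, q_out):
    # The worker only sees its own queues, never the dictionaries.
    q_out.put('{} handled {}'.format(key, q_in.get()))


def main():
    all_keys = ['a', 'b', 'c']  # hypothetical keys
    with mp.Manager() as manager, mp.Pool(len(all_keys)) as pool:
        # Plain dicts that live only in main; the values are queue proxies.
        queues_in = {key: manager.Queue() for key in all_keys}
        queues_out = {key: manager.Queue() for key in all_keys}
        results = [
            pool.apply_async(worker, (key, queues_in[key], queues_out[key]))
            for key in all_keys
        ]
        for key in all_keys:
            queues_in[key].put('job for ' + key)
        replies = {key: queues_out[key].get() for key in all_keys}
        for result in results:
            result.get()  # re-raise any exception from a worker
    return replies


if __name__ == '__main__':
    print(main())
```

Because the dictionaries are never sent to the workers, they do not need to be managed objects themselves.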