If I create a dict through multiprocessing.Manager(), will its values be locked automatically while a process is manipulating it, or should I call lock.acquire/release explicitly?
Previously, I called lock.acquire/release explicitly inside the function, but my code seems to suffer from a deadlock. That is strange, since I think there is only one lock in my code, so I wonder whether manager.dict applies another lock automatically. When I remove the lock.acquire/release calls, the code runs fine, but I am not sure the contents of the dict are correct.
import multiprocessing as mp
from functools import partial
def function(file_name, d, lock):
    key, value = read_files(file_name)
    #lock.acquire()
    if (key not in d):
        d[key] = []
    tmp = d[key]
    tmp.append(value)
    d[key] = tmp
    #lock.release()
if __name__ == '__main__':
    manager = mp.Manager()
    d = manager.dict()
    lock = manager.Lock()
    partial_function = partial(function, d=d, lock=lock)
    pool = mp.Pool(10)
    pool.map(partial_function, files) #files is a predefined list
    pool.close()
    pool.join()
Some related questions are listed below, but they seem to contradict each other.
Why a manager.dict need to be lock before writing inside?
How python manager.dict() locking works:
Yes, "locking" occurs automatically with multiprocessing.Manager().dict(). It is thread-safe in the sense that the internals will only allow one process to access (read or write) the object at any given time.
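The two statements can be reconciled: each individual proxy call (one lookup, one assignment, one membership test) is handled as a unit by the manager, but a sequence of several calls is not. Below is a minimal sketch of guarding the read-modify-write sequence with the manager's lock through a with block, so the lock is released even if the body raises. The names append_value and worker are hypothetical, and worker only stands in for read_files.

```python
import multiprocessing as mp
from functools import partial


def append_value(d, lock, key, value):
    """Append value to the list stored under key, as one guarded step."""
    with lock:  # released automatically, even if an exception is raised
        items = d.get(key, [])  # for a manager dict this is a local copy
        items.append(value)
        d[key] = items          # re-assign so the manager sees the change


def worker(n, d, lock):
    # Hypothetical stand-in for read_files(): derive a key/value pair from n.
    append_value(d, lock, n % 3, n)


if __name__ == "__main__":
    with mp.Manager() as manager:
        d = manager.dict()
        lock = manager.Lock()
        with mp.Pool(4) as pool:
            pool.map(partial(worker, d=d, lock=lock), range(30))
        # With the lock held around the whole sequence, no append is lost.
        print(sorted(len(v) for v in d.values()))
```

Without the lock, two processes can both fetch the same list, each append to its own copy, and the second assignment silently overwrites the first.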
As discussed here: Python: Multiprocessing on Windows -> Shared Readonly Memory I have a heavily parallelized task.
Multiple workers do some stuff and in the end need to access some keys of a dictionary which contains several millions of key:value combinations. The keys which will be accessed, are only known within the worker after some further action also involving some file-processing (the example below is just for demonstration purposes, hence simplified in that way).
Before, my solution was to keep this big dictionary in memory, pass it once into shared memory and access it by the single workers. But it consumes a lot of RAM... So I wanted to use shelve (because the values of that dictionary are again dicts or lists).
So a simplified example of what I tried was:
import multiprocessing
import shelve
def shelveWorker(tupArgs):
    id, DB = tupArgs
    return DB[id]
if __name__ == '__main__':
    DB = shelve.open('file.db', flag='r', protocol=2)
    joblist = []
    for id in range(10000):
        joblist.append((str(id), DB))
    p = multiprocessing.Pool()
    for returnValue in p.imap_unordered(shelveWorker, joblist):
        # do something with returnValue
        pass
    p.close()
    p.join()
Unfortunately I get:
"TypeError: can't pickle DB objects"
But IMHO it does not make any sense to open the shelve itself (DB = shelve.open('file.db', flag='r', protocol=2)) within each worker on its own because of slower runtime (I have several thousand workers).
How to go about it?
Thanks!
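One possible way around the pickling error, sketched under assumptions: a Pool starts a fixed number of worker processes (by default one per CPU), not one per job, so opening the shelf once per worker process in a Pool initializer costs only a handful of shelve.open calls. The names init_worker, shelve_worker and _db are hypothetical, and the demo builds a small throwaway shelf where the question would use the existing file.db.

```python
import multiprocessing
import os
import shelve
import tempfile

_db = None  # per-process shelf handle, set by init_worker (hypothetical name)


def init_worker(path):
    """Run once in each worker process: open the shelf read-only."""
    global _db
    _db = shelve.open(path, flag="r")


def shelve_worker(key):
    # Only the key is pickled and sent to the worker, never the shelf itself.
    return _db[key]


if __name__ == "__main__":
    # Build a small demo shelf; in the question this file already exists.
    path = os.path.join(tempfile.mkdtemp(), "file.db")
    with shelve.open(path) as db:
        for i in range(100):
            db[str(i)] = {"id": i}

    keys = [str(i) for i in range(100)]
    with multiprocessing.Pool(4, initializer=init_worker,
                              initargs=(path,)) as pool:
        for value in pool.imap_unordered(shelve_worker, keys):
            pass  # do something with value
```

The shelve documentation states that multiple simultaneous read accesses are safe, which is what this relies on. It does not cover concurrent writes.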
I'm using multiprocessing to create a sub-process for my application.
I also share a dictionary between the main process and the sub-process.
Example of my code:
Main process:
from multiprocessing import Process, Manager
manager = Manager()
shared_dict = manager.dict()
p = Process(target=mysubprocess, args=(shared_dict,))
p.start()
p.join()
print shared_dict
My sub-process:
def mysubprocess(shared_dict):
    shared_dict['list_item'] = list()
    shared_dict['list_item'].append('test')
    print shared_dict
In both cases the printed value is:
{'list_item': []}
What could be the problem?
Thanks
Manager.dict will give you a dict where direct changes will be propagated between the processes, but it doesn't detect if you change objects contained in the dict (like the list stored under "list_item"). See the note at the bottom of the SyncManager documentation:
Note: Modifications to mutable values or items in dict and list proxies will not be propagated through the manager, because the proxy has no way of knowing when its values or items are modified. To modify such an item, you can re-assign the modified object to the container proxy.
So in your example the list gets synced when you set it in the dict, but the append doesn't trigger another sync.
You can get around that by re-assigning the key in the dict:
from multiprocessing import Process, Manager
def mysubprocess(shared_dict):
    item = shared_dict['list_item'] = list()
    item.append('test')
    shared_dict['list_item'] = item
    print 'subprocess:', shared_dict
manager = Manager()
shared_dict = manager.dict()
p = Process(target=mysubprocess, args=(shared_dict,))
p.start()
p.join()
print 'main process:', shared_dict
But that might get inefficient if the list is going to grow long - the whole list will be serialised and sent to the manager process for each append. A better way in that case would be to make the shared list directly with SyncManager.list (although you'll still have the same problem if the elements of the list are mutable - you need to reset them in the list to send them between the processes).
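A sketch of the SyncManager.list suggestion, assuming Python 3.6 or later, where the documentation states that shared objects can be nested. Storing a list proxy under the key means append is forwarded to the manager instead of being applied to a temporary local copy. The function name fill is hypothetical.

```python
from multiprocessing import Manager, Process


def fill(shared_dict):
    # shared_dict['list_item'] is a list proxy here, so append() is sent
    # to the manager process rather than applied to a local copy.
    shared_dict['list_item'].append('test')


if __name__ == '__main__':
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_dict['list_item'] = manager.list()  # nested proxy (3.6+)
        p = Process(target=fill, args=(shared_dict,))
        p.start()
        p.join()
        # Convert the proxy to a plain list for display.
        print(list(shared_dict['list_item']))
```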
I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which is linking a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes, each one seems to have their own copy of the queue and thus no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
    manager = mp.Manager()
    pool = mp.Pool()
    DICT2 = manager.dict()
    for key in all_keys:
        DICT2[key] = manager.Queue()
        proc_1 = pool.apply_async(targ1,(DICT1[key],) ) #DICT1 is defined in the config file
        proc_2 = pool.apply_async(targ2,(DICT2[key], otherargs,))
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)
It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is a copy, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline; because each worker is given only the queues it needs, there's no need for any of them to know about the mappings:
import multiprocessing as mp
def stage1(q_in, q_out):
    q_out.put(q_in.get()+"Stage 1 did some work.\n")
    return
def stage2(q_in, q_out):
    q_out.put(q_in.get()+"Stage 2 did some work.\n")
    return
def main():
    pool = mp.Pool()
    manager = mp.Manager()
    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()
    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))
    # Send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")
    # Wait for work to complete
    print(q_s2_to_main.get()+"Main finished the job.")
    pool.close()
    pool.join()
    return
if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResults objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.
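A sketch of the dictionary bookkeeping described above, under stated assumptions: main keeps an ordinary dict that maps each key to a managed queue and hands each worker only the queue for its key. The names all_keys, producer and consumer are hypothetical, and the pool is sized so that a waiting consumer can never occupy the last free worker.

```python
import multiprocessing as mp


def producer(key, q_out):
    q_out.put("data for %s" % key)


def consumer(key, q_in):
    return key, q_in.get()


def main():
    all_keys = ["a", "b", "c"]  # hypothetical keys
    with mp.Manager() as manager, mp.Pool(2 * len(all_keys)) as pool:
        # Plain dict kept in main; only the queue proxies travel to workers.
        queues = {key: manager.Queue() for key in all_keys}
        results = []
        for key in all_keys:
            pool.apply_async(producer, (key, queues[key]))
            results.append(pool.apply_async(consumer, (key, queues[key])))
        # Collect one (key, message) pair per consumer.
        return dict(r.get() for r in results)


if __name__ == "__main__":
    print(main())
```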
I have a system designed to take data via a socket and store it in a dictionary that serves as a database. All my other modules (GUI, analysis, write_to_log_file, etc.) then access the database and do what they need to do with the dictionary, e.g. make widgets or copy the dictionary to a log file. But since all these things happen at different rates, I chose to run each module on its own thread so I can control the frequency.
In the main run function there's something like this:
from threading import Thread
import data_collector
import write_to_log_file
def main():
    db = {}
    receive_data_thread = Thread(target=data_collector.main, args=(db,))
    receive_data_thread.start() # writes to dictionary # 50 Hz
    log_data_thread = Thread(target=write_to_log_file.main, args=(db,))
    log_data_thread.start() # reads dictionary # 1 Hz
But it seems that both modules aren't working on the same dictionary instance because the log_data_thread just prints out the empty dictionary even when the data_collector shows the data it's inserted into the dictionary.
There's only one writer to the dictionary, so I don't have to worry about threads stepping on each other's toes. I just need to figure out a way for all the modules to read the current database as it's being written.
Rather than using a builtin dict, you could look at using a Manager object from the multiprocessing library:
from multiprocessing import Manager
from threading import Thread
from time import sleep
manager = Manager()
d = manager.dict()
def do_this(d):
    d["this"] = "done"
def do_that(d):
    d["that"] ="done"
thread0 = Thread(target=do_this,args=(d,))
thread1 = Thread(target=do_that,args=(d,))
thread0.start()
thread1.start()
thread0.join()
thread1.join()
print d
This gives you a standard-library thread-safe synchronised dictionary which should be easy to swap in to your current implementation without changing the design.
Use a Queue.Queue to pass values from the reader threads to a single writer thread. Pass the Queue instance to each data_collector.main function. They can all call the Queue's put method.
Meanwhile the write_to_log_file.main should also be passed the same Queue instance, and it can call the Queue's get method.
As items are pulled out of the Queue, they can be added to the dict.
See also: Alex Martelli, on why Queue.Queue is the secret sauce of CPython multithreading.
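A minimal sketch of this queue-based design, written for Python 3, where the module is named queue rather than Queue. The function names collect and drain_into and the None sentinel are assumptions for illustration, not part of the original modules.

```python
import queue
import threading


def collect(q, samples):
    """Producer: push (key, value) pairs, then a None sentinel."""
    for item in samples:
        q.put(item)
    q.put(None)


def drain_into(q, db):
    """Consumer: the only thread that ever writes to db."""
    while True:
        item = q.get()  # blocks until something is available
        if item is None:
            break
        key, value = item
        db[key] = value


if __name__ == "__main__":
    q = queue.Queue()
    db = {}
    samples = [("temperature", 1), ("pressure", 2)]
    producer = threading.Thread(target=collect, args=(q, samples))
    consumer = threading.Thread(target=drain_into, args=(q, db))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    print(db)
```

Because only one thread touches the dict, no explicit lock is needed; the queue does the synchronisation.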
This should not be a problem. I also assume you are using the threading module. I would have to know more about what the data_collector and write_to_log_file are doing to figure out why they are not working.
You could technically even have more than one thread writing, and it would not be a problem for single dictionary operations, because the GIL makes each of them atomic; compound read-modify-write sequences would still need a lock. Granted, you will never get more than one CPU's worth of work out of it.
Here is a simple example:
import threading, time
def addItem(d):
    c = 0
    while True:
        d[c]="test-%d"%(c)
        c+=1
        time.sleep(1)
def checkItems(d):
    clen = len(d)
    while True:
        if clen < len(d):
            print "dict changed", d
            clen = len(d)
        time.sleep(.5)
DICT = {}
t1 = threading.Thread(target=addItem, args=(DICT,))
t1.daemon = True
t2 = threading.Thread(target=checkItems, args=(DICT,))
t2.daemon = True
t1.start()
t2.start()
while True:
    time.sleep(1000)
Sorry, I figured out my problem, and I'm dumb. The modules were working on the same dictionary, but my logger wasn't wrapped in a while True loop, so it executed once and the thread terminated, and thus my dictionary was only logged to disk once. So I made write_to_log_file.main(db) write at 1 Hz forever and set log_data_thread.daemon = True so that once the writer thread (which won't be a daemon thread) exits, it'll quit. Thanks for all the input about best practices on this type of system.
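The fix described above can be sketched as a loop that wakes at a fixed period. Using a threading.Event as the stop signal is an assumption added here (the original relied only on the daemon flag), and log_forever, period and sink are hypothetical names.

```python
import threading
import time


def log_forever(db, stop, period=1.0, sink=print):
    """Pass a snapshot of db to sink every `period` seconds until stop is set."""
    while not stop.is_set():
        sink(dict(db))     # copy, so the sink works on a snapshot
        stop.wait(period)  # sleeps, but returns early once stop is set


if __name__ == "__main__":
    db = {}
    stop = threading.Event()
    logger = threading.Thread(target=log_forever, args=(db, stop, 0.1))
    logger.daemon = True   # do not keep the interpreter alive
    logger.start()
    for i in range(3):     # stand-in for the 50 Hz writer
        db[i] = "test-%d" % i
        time.sleep(0.1)
    stop.set()
    logger.join()
```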
I'm trying to start 6 threads each taking an item from the list files, removing it, then printing the value.
from multiprocessing import Pool
files = ['a','b','c','d','e','f']
def convert(file):
    process_file = files.pop()
    print process_file
if __name__ == '__main__':
    pool = Pool(processes=6)
    pool.map(convert,range(6))
The expected output should be:
a
b
c
d
e
f
Instead, the output is:
f
f
f
f
f
f
What's going on? Thanks in advance.
Part of the issue is that you are not dealing with the multiprocess nature of Pool (note that in CPython, multithreading does not speed up CPU-bound code because of the Global Interpreter Lock).
Is there a reason why you need to alter the original list? Your current code does not use the iterable passed in, and instead edits a shared mutable object, which is DANGEROUS in the world of concurrency. A simple solution is as follows:
from multiprocessing import Pool
files = ['a','b','c','d','e','f']
def convert(aFile):
    print aFile
if __name__ == '__main__':
    pool = Pool() #note: by default, Pool uses one worker per CPU
    pool.map(convert,files)
Your question really got me thinking, so I explored a little further to understand why Python behaves this way. Each worker process gets its own copy of files (on platforms that fork, a copy of the parent's memory, which is why the id stays the same in every process), so a change made in one worker is not visible in the parent or in the other workers. This can be seen by altering the number of processes used:
from multiprocessing import Pool
files = ['d','e','f','a','b','c',]
a = sorted(files)
def convert(_):
    print a == files
    files.sort()
    #print id(files) #note this is the same for every process, which is interesting
if __name__ == '__main__':
    pool = Pool(processes=1) #
    pool.map(convert,range(6))
==> all but the first invocation print 'True' as expected.
If you set the number of processes to 2, it is less deterministic, as it depends on which process actually executes its statement(s) first.
One solution is to use multiprocessing.dummy, which uses threads instead of processes.
Simply changing your import to:
from multiprocessing.dummy import Pool
"solves" the problem, but doesn't protect the shared memory against concurrent accesses.
You should still use a threading.Lock or a Queue with put and get.
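A sketch of the thread-based variant with the shared list guarded by a threading.Lock. The names take_one and files_lock are hypothetical, and the order in which threads take items may still vary between runs because thread scheduling is not deterministic.

```python
import threading
from multiprocessing.dummy import Pool  # thread-backed Pool

files = ['a', 'b', 'c', 'd', 'e', 'f']
files_lock = threading.Lock()


def take_one(_):
    # Threads share `files`, so guard the mutation with a lock.
    with files_lock:
        return files.pop(0)


if __name__ == '__main__':
    with Pool(6) as pool:
        taken = pool.map(take_one, range(6))
    # Each item is taken exactly once, whatever the scheduling order.
    print(sorted(taken))
```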