Share data using Manager() in the Python multiprocessing module

I tried to share data using the multiprocessing module (Python 2.7, Linux), and I got different results with two slightly different versions of the code:
import os
import time
from multiprocessing import Process, Manager

def editDict(d):
    d[1] = 10
    d[2] = 20
    d[3] = 30

pnum = 3
m = Manager()
1st version:
mlist = m.list()
for i in xrange(pnum):
    mdict = m.dict()
    mlist.append(mdict)
    p = Process(target=editDict, args=(mdict,))
    p.start()
time.sleep(2)
print 'after process finished', mlist
This generates:
after process finished [{1: 10, 2: 20, 3: 30}, {1: 10, 2: 20, 3: 30}, {1: 10, 2: 20, 3: 30}]
2nd version:
mlist = m.list([m.dict() for i in xrange(pnum)])  # main difference to 1st version
for i in xrange(pnum):
    p = Process(target=editDict, args=(mlist[i],))
    p.start()
time.sleep(2)
print 'after process finished', mlist
This generates:
after process finished [{}, {}, {}]
I do not understand why the outcome is so different.

This is because the second time you access the dictionary through the list index, while the first time you pass the actual dict proxy. As stated in the multiprocessing docs:
Modifications to mutable values or items in dict and list proxies will not be propagated through the manager, because the proxy has no way of knowing when its values or items are modified.
This means that, to keep track of items that are changed within a container (dictionary or list), you must reassign them after each edit. Consider the following change (for explanatory purposes, I'm not claiming this to be clean code):
def editDict(d, l, i):
    d[1] = 10
    d[2] = 20
    d[3] = 30
    l[i] = d

mlist = m.list([m.dict() for i in xrange(pnum)])
for i in xrange(pnum):
    p = Process(target=editDict, args=(mlist[i], mlist, i,))
    p.start()
If you now print mlist, you'll see that it has the same output as your first attempt. The reassignment allows the container proxy to register the updated item again.
Your main issue here is that you have a dict (proxy) inside a list proxy: updates to the contained container aren't noticed by the manager, so the list never shows the changes you expected. Note that the dictionary itself is updated in the second example; you just don't see it, because the manager didn't sync the outer list.
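For completeness, here is a minimal sketch of my own (not from the original post) of the same read-modify-reassign pattern applied directly to the list proxy, without passing the dict separately; editDict2 is just an illustrative name:
def editDict2(mlist, i):
    d = dict(mlist[i])              # plain-dict snapshot of the managed item
    d[1], d[2], d[3] = 10, 20, 30   # edit the local copy
    mlist[i] = d                    # reassign so the list proxy registers the change
Starting the workers with Process(target=editDict2, args=(mlist, i)) and printing mlist afterwards should give the same output as your first version.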

Related

multiprocessing, pass a dictionary & have a dataframe returned

I'm new to multiprocessing in Python. I have a task that takes approximately 10 minutes to run and needs to be run multiple times with different parameters, so multiprocessing seems like a good option to reduce the total run time.
My code is a simple test that is not behaving as I expect; obviously I'm doing something wrong. Nothing gets printed to the console, and a list is returned, but it contains Process objects rather than dataframes.
I need to pass a dictionary to my function, which in turn returns a dataframe. How do I do this?
import time
import pandas as pd
import multiprocessing as mp

def multiproc():
    processes = []
    settings = {1: {'sleep_time': 5, 'id': 1},
                2: {'sleep_time': 1, 'id': 2},
                3: {'sleep_time': 2, 'id': 3},
                4: {'sleep_time': 3, 'id': 4}}
    for key in settings:
        p = mp.Process(target=calc_something, args=(settings[key],))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    return processes

def calc_something(settings: dict) -> pd.DataFrame:
    time_to_sleep = settings['sleep_time']
    time.sleep(time_to_sleep)
    print(str(settings['id']))
    df = some_function_creates_data_frame()
    return df
Despite your indentation errors, I will risk taking a guess at your intentions.
A process pool is indicated when you are submitting multiple tasks and either want to limit the number of processes working on them or need to get return values back from the tasks (there are other ways to return a value from a process, such as a queue, but a process pool makes this easy).
import time
import pandas as pd
import multiprocessing as mp

def calc_something(settings: dict) -> pd.DataFrame:
    time_to_sleep = settings['sleep_time']
    time.sleep(time_to_sleep)
    print(str(settings['id']))
    df = pd.DataFrame({'sleep_time': [time_to_sleep], 'id': [settings['id']]})
    return df

def multiproc():
    settings = {1: {'sleep_time': 5, 'id': 1},
                2: {'sleep_time': 1, 'id': 2},
                3: {'sleep_time': 2, 'id': 3},
                4: {'sleep_time': 3, 'id': 4}}
    with mp.Pool() as pool:
        data_frames = pool.map(calc_something, settings.values())
    return data_frames

if __name__ == '__main__':  # required for Windows
    data_frames = multiproc()
    for data_frame in data_frames:
        print(data_frame)
Prints:
2
3
4
1
sleep_time id
0 5 1
sleep_time id
0 1 2
sleep_time id
0 2 3
sleep_time id
0 3 4
Important Note
When creating processes under Windows, or on any platform that does not use fork to create new processes, the code that creates those processes must be invoked within an if __name__ == '__main__': block, or else you will get into a recursive loop spawning new processes. This may have been part of your problem, but it is hard to tell because, in addition to the indentation problem, you did not post a minimal, reproducible example.
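If you want the returned frames keyed by the same ids you used in settings, a small variation of the above (my own sketch, not part of the original answer; it reuses calc_something from the code above) is to zip the keys back onto the results, since pool.map preserves the order of its input iterable:
import multiprocessing as mp

def multiproc_keyed(settings):
    # same pool pattern as above, but keep the settings keys attached to the results
    with mp.Pool() as pool:
        results = pool.map(calc_something, settings.values())
    # pool.map preserves input order, so keys and results line up
    return dict(zip(settings.keys(), results))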

Python, add key:value to dictionary in parallelised loop

I have written some code to perform some calculations in parallel (joblib) and update a dictionary with the calculation results. The code consists of a main function which calls a generator function and a calculation function to be run in parallel. The calculation results (key:value pairs) are added by each instance of the calculation function to a dictionary created in the main function and marked as global.
Below is a simplified version of my code, illustrating the procedure described above.
When everything runs, the result dictionary (d_result) is empty, but it should have been populated with the results generated by the calculation function. Why is it so?
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result[result_name] = result
    # d_result.setdefault(result_name, []).append(result)  ## same result as above

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    global d_result
    d_result = {}
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                       for r, pair_index in compute_indices(d))
    print(d_result)

process()
I am glad you got your program to work. However I think you have overlooked something important, and you might run into trouble if you use your example as a basis for larger programs.
I scanned the docs for joblib, and discovered that it's built on the Python multiprocessing module. So the multiprocessing programming guidelines apply.
At first I could not figure out why your new program ran successfully and the original one did not. Here is the reason (from the link above): "Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called."
This is because each child process has, at least conceptually, its own copy of the Python interpreter. In each child process, the code that is used by that process must be imported. If that code declares globals, the two processes will have separate copies of those globals, even though it doesn't look that way when you read the code. So when your original program's child process put data into the global d_result, it was actually a different object from the d_result in the parent process. From the docs again: "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
For example, under Windows running the following module would fail with a RuntimeError:
from multiprocessing import Process

def foo():
    print 'hello'

p = Process(target=foo)
p.start()
Instead one should protect the entry point of the program by using if __name__ == '__main__'."
So it is important to add one line of code to your program (the second version), right before the last line:
if __name__ == "__main__":
    process()
Failure to do this can result in some nasty bugs that you don't want to spend time with.
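To illustrate the point about globals with a minimal sketch of my own (not from the original post): the child process works on its own copy of the global, so the parent never sees the update.
from multiprocessing import Process

RESULT = {}                      # module-level global

def worker():
    RESULT['x'] = 1              # updates the child's copy of RESULT only

if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()
    print(RESULT)                # prints {} in the parent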
OK, I've figured it out. Answer and new code below:
The do_calc() function now creates an empty dict, populates it with a single key:value pair and returns the dict.
The parallel call in process() collects whatever do_calc() returns into a list, so what I end up with after the parallelised do_calc() is a list of dicts.
What I really want is a single dict, so using a dict comprehension I convert the list of dicts into a single dict, and voila, she's all good!
This helped: python convert list of single key dictionaries into a single dictionary
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # calculation function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result = {}                    # create empty dict
    d_result[result_name] = result   # add key:value pair to dict
    return d_result                  # return dict

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    # parallelised calc. Each run returns a dict; final output is a list of dicts
    d_result = Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                                  for r, pair_index in compute_indices(d))
    # transform list of dicts into a single dict
    d_result = {k: v for x in d_result for k, v in x.items()}
    print(d_result)

process()
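An alternative sketch (mine, not from the original post): have the worker return a (key, value) tuple instead of a one-item dict, so the list of results can be turned into a dict directly; do_calc_pair is a hypothetical name.
def do_calc_pair(d, r, pair_index):  # returns (name, result) instead of a dict
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    return str(data_1) + " ^ " + str(data_2), data_1 ** data_2

# inside process():
# pairs = Parallel(n_jobs=4)(delayed(do_calc_pair)(d, r, pair_index)
#                            for r, pair_index in compute_indices(d))
# d_result = dict(pairs)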

Python sharing a list between processes using a manager and a dictionary

The Manager().dict() seems to work fine when the value is just a value but not when the value is a list. Is there a way to share a list between processes using a Manager()?
from multiprocessing import Process, Manager

def add(c):
    for i in range(1000):
        c['a'][0] += 1
        c['a'][1] += 2

d = Manager().dict()
d['a'] = [0, 0]
p = Process(target=add, args=(d,))
p.start()
p.join()
print(d)
Output:
{'a': [0, 0]}
Solved, thanks to @Dale Song:
It seems you must do it somewhat indirectly... assign values to a temporary list and then assign the temporary list to the dictionary, weird.
from multiprocessing import Process, Array, Manager

def test(dict, m, val):
    for i in range(1000):
        m[0] += 1
        m[1] += 2
    dict[val] = m

if __name__ == '__main__':
    d = Manager().dict()
    l = [0, 0]
    d['a'] = l
    p = Process(target=test, args=(d, l, 'a'))
    p.start()
    p.join()
    print(d)
Output:
{'a': [1000, 2000]}
Another (much faster) way:
Using shared memory and multiprocessing.Array()
from multiprocessing import Process, Manager, Array

d = {}
d['a'] = Array('i', [0, 0])

def add_array(c, val):
    for i in range(1000):
        c[val][0] += 1
        c[val][1] += 2

p = Process(target=add_array, args=(d, 'a',))
p.start()
p.join()
print([d['a'][0], d['a'][1]])
Output:
[1000, 2000]
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Array
https://docs.python.org/2/library/array.html #typecodes
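If all you need is the list itself (not nested inside a dict), a Manager().list() proxy passed directly to the child also propagates item assignments. This is my own sketch, not part of either answer above:
from multiprocessing import Process, Manager

def add(shared):
    for i in range(1000):
        shared[0] += 1          # each += goes through the list proxy's __setitem__
        shared[1] += 2

if __name__ == '__main__':
    shared = Manager().list([0, 0])
    p = Process(target=add, args=(shared,))
    p.start()
    p.join()
    print(list(shared))         # [1000, 2000]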

How to convert a binary tree into a dictionary of levels

In Python (and any other language) it is quite easy to traverse a binary tree in level order (BFS) using a queue data structure. Given an adjacency list representation in Python and the root of a tree, I can traverse the tree in level order and print each level's elements in order. What I cannot do, however, is go from an adjacency list representation to a level dictionary or something of the like:
so for example I would like to go from
adjecency_list = {'A': {'B','C'}, 'C':{'D'}, 'B': {'E'}}
to
levels = {0: ['A'], 1: ['B','C'], 2: ['D','E']}
So far I have the following:
from Queue import Queue              # Python 2; on Python 3: from queue import Queue
from collections import OrderedDict

q = Queue()
o = OrderedDict()
root = find_root(adjencency_list)            # Separate function, it works fine
height = find_height(root, adjencency_list)  # Again works fine
q.put(root)
# Creating a level ordered adjecency list
# using a queue to keep track of pointers
while not q.empty():
    current = q.get()
    try:
        if current in adjencency_list:
            q.put(list(adjencency_list[current])[0])
            # Creating ad_list in level order
            if current in o:
                o[current].append(list(adjencency_list[current])[0])
            else:
                o[current] = [list(adjencency_list[current])[0]]
        if current in adjencency_list:
            q.put(list(adjencency_list[current])[1])
            # Creating ad_list in level order
            if current in o:
                o[current].append(list(adjencency_list[current])[1])
            else:
                o[current] = [list(adjencency_list[current])[1]]
    except IndexError:
        pass
All it does is place the adjacency list in the correct level order for the tree, and if I printed at the start of the loop it would print a level order traversal. Nonetheless, it does not solve my problem. I am aware an adjacency list is not the best representation for a tree, but I am required to use it for the task I am doing.
A recursive way to create the level dictionary from your adjacency list would be -
def level_dict(adj_list, curr_elems, order=0):
    if not curr_elems:  # This check ensures that for an empty `curr_elems` list we return an empty dictionary
        return {}
    d = {}
    new_elems = []
    for elem in curr_elems:
        d.setdefault(order, []).append(elem)
        new_elems.extend(adj_list.get(elem, []))
    d.update(level_dict(adj_list, new_elems, order + 1))
    return d
The starting input to the method would be the root element in a list, for example ['A'], and the initial level, which would be 0.
At each level, it takes the children of the elements at that level to build the next level's list and, at the same time, builds the level dictionary (in d).
Example/Demo -
>>> adjecency_list = {'A': {'B','C'}, 'C':{'D'}, 'B': {'E'}}
>>> def level_dict(adj_list,curr_elems,order=0):
...     if not curr_elems:
...         return {}
...     d = {}
...     new_elems = []
...     for elem in curr_elems:
...         d.setdefault(order,[]).append(elem)
...         new_elems.extend(adj_list.get(elem,[]))
...     d.update(level_dict(adj_list,new_elems,order+1))
...     return d
...
>>> level_dict(adjecency_list,['A'])
{0: ['A'], 1: ['C', 'B'], 2: ['D', 'E']}
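An equivalent iterative sketch of the same level-by-level idea (my own, not from the answer): expand one whole level at a time instead of recursing.
def level_dict_iter(adj_list, root):
    levels = {}
    current = [root]                 # the nodes making up the current level
    order = 0
    while current:
        levels[order] = current
        # children of every node on this level, in visit order
        current = [child for elem in current for child in adj_list.get(elem, [])]
        order += 1
    return levels
level_dict_iter(adjecency_list, 'A') gives the same levels as the recursive version; as there, the order within a level follows the sets' iteration order.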

Difference between dict.clear() and assigning {} in Python

In python, is there a difference between calling clear() and assigning {} to a dictionary? If yes, what is it?
Example:
d = {"stuff": "things"}
d.clear()  # this way
d = {}     # vs this way
If you have another variable also referring to the same dictionary, there is a big difference:
>>> d = {"stuff": "things"}
>>> d2 = d
>>> d = {}
>>> d2
{'stuff': 'things'}
>>> d = {"stuff": "things"}
>>> d2 = d
>>> d.clear()
>>> d2
{}
This is because assigning d = {} creates a new, empty dictionary and assigns it to the d variable. This leaves d2 pointing at the old dictionary with items still in it. However, d.clear() clears the same dictionary that d and d2 both point at.
d = {} will create a new instance for d but all other references will still point to the old contents.
d.clear() will reset the contents, but all references to the same instance will still be correct.
In addition to the differences mentioned in other answers, there also is a speed difference. d = {} is over twice as fast:
python -m timeit -s "d = {}" "for i in xrange(500000): d.clear()"
10 loops, best of 3: 127 msec per loop
python -m timeit -s "d = {}" "for i in xrange(500000): d = {}"
10 loops, best of 3: 53.6 msec per loop
As an illustration for the things already mentioned before:
>>> a = {1:2}
>>> id(a)
3073677212L
>>> a.clear()
>>> id(a)
3073677212L
>>> a = {}
>>> id(a)
3073675716L
In addition to @odano's answer, it seems using d.clear() is faster if you need to clear the dict many times.
import timeit

p1 = '''
d = {}
for i in xrange(1000):
    d[i] = i * i
for j in xrange(100):
    d = {}
    for i in xrange(1000):
        d[i] = i * i
'''

p2 = '''
d = {}
for i in xrange(1000):
    d[i] = i * i
for j in xrange(100):
    d.clear()
    for i in xrange(1000):
        d[i] = i * i
'''

print timeit.timeit(p1, number=1000)
print timeit.timeit(p2, number=1000)
The result is:
20.0367929935
19.6444659233
Mutating methods are always useful if the original object is not in scope:
def fun(d):
    d.clear()
    d["b"] = 2

d = {"a": 2}
fun(d)
d  # {'b': 2}
Re-assigning the dictionary would create a new object and wouldn't modify the original one.
One thing not mentioned is scoping issues. Not a great example, but here's the case where I ran into the problem:
from functools import wraps

def conf_decorator(dec):
    """Enables behavior like this:

    @threaded
    def f(): ...

    or

    @threaded(thread=KThread)
    def f(): ...

    (assuming threaded is wrapped with this function.)
    Sends any accumulated kwargs to threaded.
    """
    c_kwargs = {}

    @wraps(dec)
    def wrapped(f=None, **kwargs):
        if f:
            r = dec(f, **c_kwargs)
            c_kwargs = {}
            return r
        else:
            c_kwargs.update(kwargs)  # <- UnboundLocalError: local variable 'c_kwargs' referenced before assignment
            return wrapped
    return wrapped
The solution is to replace c_kwargs = {} with c_kwargs.clear()
If someone thinks up a more practical example, feel free to edit this post.
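Here is a smaller sketch of the same scoping pitfall (my own, hypothetical names): rebinding the name inside the nested function makes it local, whereas mutating it in place works (in Python 3 you could also declare it nonlocal).
def make_counter():
    state = {'n': 0}
    def bump():
        # state = {'n': state['n'] + 1}  # would raise UnboundLocalError
        state['n'] += 1                  # mutating the closed-over dict is fine
        return state['n']
    return bump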
In addition, sometimes the dict instance might be a subclass of dict (defaultdict for example). In that case, using clear is preferred, as we don't have to remember the exact type of the dict, and also avoid duplicate code (coupling the clearing line with the initialization line).
from collections import defaultdict

x = defaultdict(list)
x[1].append(2)
...
x.clear()  # instead of the longer x = defaultdict(list)
