Python, add key:value to dictionary in parallelised loop

I have written some code to perform some calculations in parallel (joblib) and update a dictionary with the calculation results. The code consists of a main function which calls a generator function and a calculation function to be run in parallel. The calculation results (key:value pairs) are added by each instance of the calculation function to a dictionary created in the main function and marked as global.
Below is a simplified version of my code, illustrating the procedure described above.
When everything has run, the result dictionary (d_result) is empty, but it should have been populated with the results generated by the calculation function. Why is that?
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result[result_name] = result
    # d_result.setdefault(result_name, []).append(result)  ## same result as above

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    global d_result
    d_result = {}
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                       for r, pair_index in compute_indices(d))
    print(d_result)

process()

I am glad you got your program to work. However, I think you have overlooked something important, and you might run into trouble if you use your example as a basis for larger programs.
I scanned the docs for joblib, and discovered that it's built on the Python multiprocessing module. So the multiprocessing programming guidelines apply.
At first I could not figure out why your new program ran successfully and the original one did not. Here is the reason (from the guidelines linked above): "Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called."

This is because each child process has, at least conceptually, its own copy of the Python interpreter. In each child process, the code that is used by that process must be imported. If that code declares globals, the two processes will have separate copies of those globals, even though it doesn't look that way when you read the code. So when your original program's child process put data into the global d_result, it was actually a different object from the d_result in the parent process.
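To make this concrete, here is a minimal sketch (my example, not from the original post) showing that a global mutated in a child process is invisible to the parent:

from multiprocessing import Process

shared = {}  # a module-level global

def worker():
    # the child process mutates its own copy of the module's globals
    shared['key'] = 'value'

if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()
    print(shared)  # prints {} -- the parent's copy was never touched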
From the docs again: "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process). For example, under Windows running the following module would fail with a RuntimeError:

from multiprocessing import Process

def foo():
    print 'hello'

p = Process(target=foo)
p.start()

Instead one should protect the entry point of the program by using if __name__ == '__main__'."
So it is important to add one line of code to your program (the second version), right before the last line:
if __name__ == "__main__":
    process()
Failure to do this can result in some nasty bugs that you don't want to spend time with.

OK, I've figured it out. Answer and new code below:
The do_calc() function now creates an empty dict, populates it with a single key:value pair and returns the dict.
The parallel call in process() collects whatever do_calc() returns into a list, so what I end up with after the parallelised do_calc() is a list of dicts.
What I really want is a single dict, so using a dict comprehension I convert the list of dicts to a single dict, and voilà, she's all good!
This helped: python convert list of single key dictionaries into a single dictionary
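For illustration (my snippet, not from the linked question), the merging pattern on its own looks like this:

>>> list_of_dicts = [{'a': 1}, {'b': 2}, {'c': 3}]
>>> {k: v for d in list_of_dicts for k, v in d.items()}
{'a': 1, 'b': 2, 'c': 3}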
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # calculation function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result = {}                   # create empty dict
    d_result[result_name] = result  # add key:value pair to dict
    return d_result                 # return dict

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    # parallelised calc. Each run returns a dict; the final output is a list of dicts
    d_result = Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                                  for r, pair_index in compute_indices(d))
    # transform list of dicts into a single dict
    d_result = {k: v for x in d_result for k, v in x.items()}
    print(d_result)

process()

Related

python get pointer to an item in a nested dictionary/list combination based on a list of keys

I have a data structure that looks something like this:

someData = {"apple": {"taste": "not bad", "colors": ["red", "yellow"]},
            "banana": {"taste": "perfection", "shape": "banana shaped"},
            "some list": [6, 5, 3, 2, 4, 6, 7]}
and a list of keys which describes a path to some item in this structure
someList = ["apple","colors",2]
I already have a function getPath(path) (see below) that is supposed to return a pointer to the selected object. It works fine for reading, but I get into trouble when trying to write
print(getPath(someList))
>> yellow
getPath(someList) = "green"
>> SyntaxError: can't assign to function call
a = getPath(someList)
a = "green"
print(getPath(someList))
>> "yellow"
Is there a way to make this work? Maybe like this:
someFunc(someList, "green")
print(getPath(someList))
>> green
This question looks like an existing question about reading values via a key path, except that I want to write something to that item, and not just read it.
My actual data can be seen here (I used json.loads() to parse the data). Note that I plan on adding stuff to this structure. I want a general approach to future proof the project.
My code:
def getPath(path):
    nowSelection = someData
    for i in path:
        nowSelection = nowSelection[i]
    return nowSelection
The result you're getting from getPath() is the immutable value from a dict or list. This value does not even know it's stored in a dict or list, and there's nothing you can do to change it. You have to change the dict/list itself.
Example:
a = {'hello': [0, 1, 2], 'world': 2}
b = a['hello'][1]
b = 99  # a is completely unaffected by this
Compare with:
a = {'hello': [0, 1, 2], 'world': 2}
b = a['hello']  # b is a list, which you can change
b[1] = 99       # now a is {'hello': [0, 99, 2], 'world': 2}
In your case, instead of following the path all the way to the value you want, go all the way except the last step, and then modify the dict/list you get from the penultimate step:
getPath(["apple","colors",2]) = "green" # doesn't work
getPath(["apple","colors"])[2] = "green" # should work
You could cache your getPath using a custom caching function that allows you to manually populate the saved cache.
from functools import wraps

def cached(func):
    func.cache = {}
    @wraps(func)
    def wrapper(*args):
        try:
            return func.cache[args]
        except KeyError:
            func.cache[args] = result = func(*args)
            return result
    return wrapper

@cached
def getPath(l):
    ...

# caution: cache keys must be hashable, so someList has to be a tuple
# (e.g. ("apple", "colors", 1)) rather than a list for this to work
getPath.cache[(someList,)] = 'green'
getPath(someList)  # -> 'green'
You can't literally do what you're trying to do. I think the closest you could get is to pass the new value in, then manually reassign it within the function:
someData = {"apple": {"taste": "not bad", "colors": ["red", "yellow"]},
            "banana": {"taste": "perfection", "shape": "banana shaped"},
            "some list": [6, 5, 3, 2, 4, 6, 7]}

def setPath(path, newElement):
    nowSelection = someData
    for i in path[:-1]:  # walk the path, stopping before the last element
        nowSelection = nowSelection[i]
    nowSelection[path[-1]] = newElement  # then use the last element to do the reassignment

someList = ["apple", "colors", 1]
setPath(someList, "green")
print(someData)
{'apple': {'taste': 'not bad', 'colors': ['red', 'green']}, 'banana': {'taste': 'perfection', 'shape': 'banana shaped'}, 'some list': [6, 5, 3, 2, 4, 6, 7]}
I renamed it to setPath to reflect its purpose better.
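As an aside (my sketch, not from the original answer), the same walk can be written compactly with functools.reduce:

from functools import reduce

def setPath(path, newElement, data=someData):
    # walk to the parent container, then assign through the final key/index
    parent = reduce(lambda node, key: node[key], path[:-1], data)
    parent[path[-1]] = newElement

setPath(["apple", "colors", 1], "green")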

Printing Parallel Function Outputs in True Order with Python

Looking to print everything in order in a parallelized Python script. Note that c3 is printed prior to b2 -- out of order. Is there any way to give the function below a wait feature? If you rerun, sometimes the print order is correct for shorter batches, but I am looking for a reproducible solution to this issue.
from joblib import Parallel, delayed, parallel_backend
import multiprocessing

testFrame = [['a', 1], ['b', 2], ['c', 3]]

def testPrint(letr, numbr):
    print(letr + str(numbr))
    return letr + str(numbr)

with parallel_backend('multiprocessing'):
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(testPrint)(letr=testFrame[i][0],
                                                            numbr=testFrame[i][1])
                                         for i in range(len(testFrame)))

print('##########')
for test in results:
    print(test)
Output:
b2
c3
a1
##########
a1
b2
c3
Seeking:
a1
b2
c3
##########
a1
b2
c3
Once you launch tasks in separate processes you no longer control the order of execution, so you cannot expect the actions of those tasks to execute in any predictable order, especially if the tasks can take varying lengths of time.
If you are parallelizing a task/function over a sequence of arguments and you want to reorder the results to match the order of the original sequence, you can pass sequence information to the task/function; it will be returned with the result and can be used to reconstruct the original order.
If the original function looks like this:
import random
import time

def f(arg):
    l, n = arg
    # do stuff
    time.sleep(random.uniform(.1, 10.))
    result = f'{l}{n}'
    return result
Refactor the function to accept the sequence information and pass it through with the return value.
def f(arg):
    indx, (l, n) = arg
    time.sleep(random.uniform(.1, 10.))
    result = (indx, f'{l}{n}')
    return result
enumerate can be used to add the sequence information to the sequence of data:

originaldata = list(zip('abcdefghijklmnopqrstuvwxyz', range(26)))
dataplus = enumerate(originaldata)

Now the arguments have the form (index, originalarg): (0, ('a', 0)), (1, ('b', 1)), ...
And the returned values from the worker processes look like this (if collected in a list):

[(14, 'o14'), (23, 'x23'), (1, 'b1'), (4, 'e4'), (13, 'n13'), ...]

This is easily sorted on the first item of each result, key=lambda item: item[0], and the values you really want are obtained by picking out the second items after sorting: results = [item[1] for item in results].
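Putting the pieces together, a minimal runnable sketch (my assembly, using multiprocessing.Pool.imap_unordered so the out-of-order collection is actually visible):

import random
import time
from multiprocessing import Pool

def f(arg):
    indx, (l, n) = arg
    time.sleep(random.uniform(.1, 1.))  # simulate varying workloads
    return indx, f'{l}{n}'

if __name__ == '__main__':
    originaldata = list(zip('abcde', range(5)))
    with Pool(4) as pool:
        # results arrive in completion order, not submission order
        results = list(pool.imap_unordered(f, enumerate(originaldata)))
    results.sort(key=lambda item: item[0])  # restore the original order
    print([item[1] for item in results])    # ['a0', 'b1', 'c2', 'd3', 'e4']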

my str(float) gets broken on histogram dictionary conversion, how do I stop this?

When attempting to histogram a list of numbers (in str format), all of my numbers get broken up.
For instance, with

a = ['1', '1.5', '2.5']

after running my histogram function, my dictionary looks like

{'1': 2, '2': 1, '5': 2, '.': 2}
my histogram function is

def histogram(a):
    d = dict()
    for c in a:
        d[c] = d.get(c, 0) + 1
    return d
I'm doing a project for school and have everything coded in, but when I get to the mode portion and use numbers that aren't specifically int, I get the returns shown above.
How can I adjust/change this so it accepts the strings exactly as typed?
Python 2.7 on Windows 7x64
You can convert each string element to a float before passing it to your histogram function.

a = ['1', '1.5', '2.5']
a = [float(i) for i in a]

def histogram(a):
    d = dict()
    for c in a:
        d[c] = d.get(c, 0) + 1
    return d

print histogram(a)
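As an aside (my addition, not part of the original answer), the standard library's collections.Counter performs the same tally:

>>> from collections import Counter
>>> Counter(['1', '1.5', '2.5', '1'])
Counter({'1': 2, '1.5': 1, '2.5': 1})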
There might be an error in your list definition. Running your code I get
{'1': 1, '1.5': 1, '2.5': 1}
If I change the definition of a from

a = ['1', '1.5', '2.5']

to

a = '1' '1.5' '2.5'  # adjacent string literals concatenate to '11.52.5'

I get the output you showed us. So please double check how your list is defined.
You can use something like this:
>>> a = ['1','1.5','2.5']
>>> dict.fromkeys(a, 0)
{'1': 0, '1.5': 0, '2.5': 0}
Now you can iterate over keys to set the corresponding value.
I have used the following dict comprehension to reduce my work.
>>> {key: float(key)+1 for key in a}
{'1': 2.0, '1.5': 2.5, '2.5': 3.5}
enjoy :)
The histogram function does work as it's written. If, however, you inadvertently .join() your list, you will then histogram the resulting string object. For instance, with t = ['1.0', '2.0', '2.5'] and

s = ''.join(t)

s will then be == '1.02.02.5', and histogram(s) will count the individual characters, including the decimal points. My problem was that I had placed a .join() prior to calling histogram.
My apologies to anyone that wasted any real time on this.

Complex matlab-like data structure in python (numpy/scipy)

I have data currently structured as follows in Matlab:

item{i}.attribute1(2,j)

Here item is a cell array with i = 1 .. n cells, each containing a data structure with multiple attributes, each attribute a matrix of size (2, j) where j = 1 .. m. The number of attributes is not fixed.
I have to translate this data structure to python, but I am new to numpy and python lists. What is the best way of structuring this data in python with numpy/scipy?
Thanks.
I've often seen the following conversion approaches:
matlab array -> python numpy array
matlab cell array -> python list
matlab structure -> python dict
So in your case that would correspond to a python list containing dicts, which themselves contain numpy arrays as entries:
item[i]['attribute1'][2,j]
Note
Don't forget the 0-indexing in python!
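A minimal construction along these lines (my sketch, with placeholder sizes) might look like:

import numpy as np

n, m = 3, 4
# a list of dicts, each dict holding one (2, m) numpy array per attribute
item = [{'attribute1': np.zeros((2, m))} for _ in range(n)]

# MATLAB's item{1}.attribute1(2, 3) becomes, 0-indexed:
item[0]['attribute1'][1, 2] = 5.0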
Update: use of classes
Further to the simple conversion given above, you could also define a dummy class, e.g.

class structtype():
    pass
This allows the following type of usage:
>> s1 = structtype()
>> print s1.a
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-40-7734865fddd4> in <module>()
----> 1 print s1.a
AttributeError: structtype instance has no attribute 'a'
>> s1.a=10
>> print s1.a
10
Your example in this case becomes, e.g.
>> item = [ structtype() for i in range(10)]
>> item[9].a = numpy.array([1,2,3])
>> item[9].a[1]
2
A simple version of the answer by @dbouz, using the idea by @jmetz:
class structtype():
    def __init__(self, **kwargs):
        self.Set(**kwargs)
    def Set(self, **kwargs):
        self.__dict__.update(kwargs)
    def SetAttr(self, lab, val):
        self.__dict__[lab] = val
then you can do
myst = structtype(a=1,b=2,c=3)
or
myst = structtype()
myst.Set(a=1,b=2,c=3)
and still do
myst.d = 4 # here, myst.a=1, myst.b=2, myst.c=3, myst.d=4
or even
myst = structtype(a=1,b=2,c=3)
lab = 'a'
myst.SetAttr(lab,10) # a=10,b=2,c=3 ... equivalent to myst.(lab)=10 in MATLAB
and you get exactly what you'd expect in matlab for myst=struct('a',1,'b',2,'c',3).
The equivalent of a cell of structs would be a list of structtype
mystarr = [ structtype(a=1,b=2) for n in range(10) ]
which would give you
mystarr[0].a # == 1
mystarr[0].b # == 2
If you are looking for a good example how to create a structured array in Python like it is done in MATLAB, you might want to have a look at the scipy homepage (basics.rec).
Example
import numpy as np

x = np.zeros(1, dtype=[('Table', np.float64, (2, 2)),
                       ('Number', float),
                       ('String', '|S10')])

# Populate the array
x['Table'] = [1, 2]
x['Number'] = 23.5
x['String'] = 'Stringli'

# See what is written to the array
print(x)
The printed output is then:
[([[1.0, 2.0], [1.0, 2.0]], 23.5, 'Stringli')]
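A quick illustration (my addition, not in the original answer): individual fields are then accessed by name, like columns:

print(x['Number'])    # array of one float: [23.5]
print(x['Table'][0])  # the (2, 2) table of the first record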
Unfortunately, I did not find out how you can define a structured array without knowing the size of the structured array. You can also define the array directly with its contents.
x = np.array(([[1, 2], [1, 2]], 23.5, 'Stringli'),
             dtype=[('Table', np.float64, (2, 2)),
                    ('Number', float),
                    ('String', '|S10')])

# Same result as above but less code (if you know the contents in advance)
print(x)
For some applications a dict or list of dictionaries will suffice. However, if you really want to emulate a MATLAB struct in Python, you have to take advantage of its OOP and form your own struct-like class.
Here is a simple example of a class that allows you to store an arbitrary number of variables as attributes and can also be initialized as empty (Python 3.x only). i is the indexer that shows how many attributes are stored inside the object:
class Struct:
    def __init__(self, *args, prefix='arg'):  # constructor
        self.prefix = prefix
        if len(args) == 0:
            self.i = 0
        else:
            i = 0
            for arg in args:
                i += 1
                arg_str = prefix + str(i)
                # store arguments as attributes
                setattr(self, arg_str, arg)  # self.arg1 = <value>
            self.i = i

    def add(self, arg):
        self.i += 1
        arg_str = self.prefix + str(self.i)
        setattr(self, arg_str, arg)
You can initialise it empty (i=0), or populate it with initial attributes. You can then add attributes at will. Trying the following:
b = Struct(5, -99.99, [1, 5, 15, 20], 'sample', {'key1': 5, 'key2': -100})
b.add(150.0001)
print(b.__dict__)
print(type(b.arg3))
print(b.arg3[0:2])
print(b.arg5['key1'])

c = Struct(prefix='foo')
print(c.i)  # empty Struct
c.add(500)  # add a value as foo1
print(c.__dict__)
will get you these results for object b:
{'prefix': 'arg', 'arg1': 5, 'arg2': -99.99, 'arg3': [1, 5, 15, 20], 'arg4': 'sample', 'arg5': {'key1': 5, 'key2': -100}, 'i': 6, 'arg6': 150.0001}
<class 'list'>
[1, 5]
5
and for object c:
0
{'prefix': 'foo', 'i': 1, 'foo1': 500}
Note that assigning attributes to objects is general: it is not limited to scipy/numpy objects, but applies to all data types and custom objects (arrays, dataframes etc.). Of course this is a toy model; you can further develop it so it can be indexed, pretty-printed, have elements removed, be callable and so on, based on your project's needs. Just define the class at the beginning and then use it for storage and retrieval. That's the beauty of Python: it doesn't have exactly what you seek, especially if you come from MATLAB, but it can do so much more!

share data using Manager() in python multiprocessing module

I tried to share data while using the multiprocessing module (Python 2.7, Linux), and I got different results with two slightly different versions of the code:
import os
import time
from multiprocessing import Process, Manager

def editDict(d):
    d[1] = 10
    d[2] = 20
    d[3] = 30

pnum = 3
m = Manager()
1st version:
mlist = m.list()
for i in xrange(pnum):
    mdict = m.dict()
    mlist.append(mdict)
    p = Process(target=editDict, args=(mdict,))
    p.start()

time.sleep(2)
print 'after process finished', mlist
This generates:
after process finished [{1: 10, 2: 20, 3: 30}, {1: 10, 2: 20, 3: 30}, {1: 10, 2: 20, 3: 30}]
2nd version:
mlist = m.list([m.dict() for i in xrange(pnum)])  # main difference to 1st version
for i in xrange(pnum):
    p = Process(target=editDict, args=(mlist[i],))
    p.start()

time.sleep(2)
print 'after process finished', mlist
This generates:
after process finished [{}, {}, {}]
I do not understand why the outcome is so different.
It is because you access the variable by its list index the second time, while the first time you pass the actual variable. As stated in the multiprocessing docs:
Modifications to mutable values or items in dict and list proxies will not be propagated through the manager, because the proxy has no way of knowing when its values or items are modified.
This means that, to keep track of items that are changed within a container (dictionary or list), you must reassign them after each edit. Consider the following change (for explanatory purposes, I'm not claiming this to be clean code):
def editDict(d, l, i):
    d[1] = 10
    d[2] = 20
    d[3] = 30
    l[i] = d  # reassign the edited dict so the list proxy registers the change

mlist = m.list([m.dict() for i in xrange(pnum)])
for i in xrange(pnum):
    p = Process(target=editDict, args=(mlist[i], mlist, i))
    p.start()
If you now print mlist, you'll see that it has the same output as your first attempt. The reassignment allows the container proxy to keep track of the updated item again.
Your main issue in this case is that you have a dict (proxy) inside a list proxy: updates to the contained container won't be noticed by the manager, and hence won't appear where you expected them. Note that the dictionary itself is updated in the second example; you just don't see it, since the manager didn't sync.
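For completeness, here is the corrected second version assembled into one runnable script (my sketch, keeping the question's Python 2 style and using join() instead of sleep(), which is deterministic):

from multiprocessing import Process, Manager

def editDict(d, l, i):
    d[1] = 10
    d[2] = 20
    d[3] = 30
    l[i] = d  # reassign so the list proxy registers the change

if __name__ == '__main__':
    pnum = 3
    m = Manager()
    mlist = m.list([m.dict() for i in xrange(pnum)])
    procs = [Process(target=editDict, args=(mlist[i], mlist, i))
             for i in xrange(pnum)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for all children, unlike time.sleep(2)
    print 'after process finished', mlist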
