I'm looking to print everything in order from a parallelized Python script. Note that the printed lines come out of order. Is there a way to give the function below some kind of wait feature? On a rerun the print order is sometimes correct for shorter batches, but I'm looking for a reproducible solution to this issue.
from joblib import Parallel, delayed, parallel_backend
import multiprocessing

testFrame = [['a', 1], ['b', 2], ['c', 3]]

def testPrint(letr, numbr):
    print(letr + str(numbr))
    return letr + str(numbr)

with parallel_backend('multiprocessing'):
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(testPrint)(letr=testFrame[i][0],
                                                            numbr=testFrame[i][1])
                                         for i in range(len(testFrame)))

print('##########')
for test in results:
    print(test)
Output:
b2
c3
a1
##########
a1
b2
c3
Seeking:
a1
b2
c3
##########
a1
b2
c3
Once you launch tasks in separate processes you no longer control the order of execution, so you cannot expect the actions of those tasks to execute in any predictable order, especially if the tasks take varying lengths of time.
If you are parallelizing a task/function over a sequence of arguments and you want to reorder the results to match the order of the original sequence, you can pass sequence information to the task/function; it is returned by the task and can be used to reconstruct the original order.
If the original function looks like this:
import time, random

def f(arg):
    l, n = arg
    # do stuff
    time.sleep(random.uniform(.1, 10.))
    result = f'{l}{n}'
    return result
Refactor the function to accept the sequence information and pass it through with the return value.
def f(arg):
    indx, (l, n) = arg
    time.sleep(random.uniform(.1, 10.))
    result = (indx, f'{l}{n}')
    return result
enumerate could be used to add the sequence information to the sequence of data:
originaldata = list(zip('abcdefghijklmnopqrstuvwxyz', range(26)))
dataplus = enumerate(originaldata)
Now the arguments have the form (index, originalarg): (0, ('a', 0)), (1, ('b', 1)), ...
And the returned values from the worker processes look like this (if collected in a list):
[(14, 'o14'), (23, 'x23'), (1, 'b1'), (4, 'e4'), (13, 'n13'),...]
That list is easily sorted on the first item of each result, key=lambda item: item[0], and the values you really want are then obtained by picking out the second items after sorting: results = [item[1] for item in results].
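Putting the pieces together, here is a minimal sketch of the whole pattern (multiprocessing.Pool and the random sleep are only there to simulate out-of-order completion; the same idea works with joblib):

import time, random
from multiprocessing import Pool

def f(arg):
    indx, (l, n) = arg                     # unpack the index attached by enumerate
    time.sleep(random.uniform(.1, 1.))     # simulate work of varying length
    return indx, f'{l}{n}'                 # hand the index back with the result

if __name__ == '__main__':
    originaldata = list(zip('abcdefghijklmnopqrstuvwxyz', range(26)))
    with Pool(4) as pool:
        results = list(pool.imap_unordered(f, enumerate(originaldata)))
    results.sort(key=lambda item: item[0])       # restore the original order
    results = [item[1] for item in results]      # keep only the values
    print(results)                               # ['a0', 'b1', 'c2', ..., 'z25']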
Related
I have written some code to perform some calculations in parallel (joblib) and update a dictionary with the calculation results. The code consists of a main function which calls a generator function and a calculation function to be run in parallel. The calculation results (key:value pairs) are added by each instance of the calculation function to a dictionary created in the main function and marked as global.
Below is a simplified version of my code, illustrating the procedure described above.
When everything runs, the result dictionary (d_result) is empty, but it should have been populated with the results generated by the calculation function. Why is that?
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result[result_name] = result
    # d_result.setdefault(result_name, []).append(result)  ## same result as above

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    global d_result
    d_result = {}

    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}

    Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                       for r, pair_index in compute_indices(d))
    print(d_result)

process()
I am glad you got your program to work. However, I think you have overlooked something important, and you might run into trouble if you use your example as a basis for larger programs.
I scanned the docs for joblib, and discovered that it's built on the Python multiprocessing module. So the multiprocessing programming guidelines apply.
At first I could not figure out why your new program ran successfully and the original one did not. Here is the reason (from the guidelines linked above): "Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called."

This is because each child process has, at least conceptually, its own copy of the Python interpreter. In each child process, the code that is used by that process must be imported. If that code declares globals, the two processes will have separate copies of those globals, even though it doesn't look that way when you read the code. So when your original program's child process put data into the global d_result, it was actually a different object from the d_result in the parent process.

From the docs again: "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
For example, under Windows running the following module would fail with a RuntimeError:

from multiprocessing import Process

def foo():
    print 'hello'

p = Process(target=foo)
p.start()

Instead one should protect the entry point of the program by using if __name__ == '__main__'."
So it is important to add one line of code to your program (the second version), right before the last line:
if __name__ == "__main__":
    process()
Failure to do this can result in some nasty bugs that you don't want to spend time with.
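A minimal sketch of what goes wrong (plain multiprocessing, no joblib; the names are made up for illustration): the child process gets its own copy of the global, so the parent's dict stays empty.

import multiprocessing

d_result = {}                            # global in the parent process

def worker():
    d_result['child'] = 42               # modifies the child's own copy of the global
    print('inside child:', d_result)     # {'child': 42}

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
    print('back in parent:', d_result)   # still {} -- the child's change is not visible here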
OK, I've figured it out. Answer and new code below:
The do_calc() function now generates an empty dict, then populates it with a single key:value pair and returns the dict.
The parallel bit in process() collects whatever do_calc() returns into a list, so what I end up with after the parallelised do_calc() is a list of dicts.
What I really want is a single dict, so using a dict comprehension I convert the list of dicts into one dict, and voilà, she's all good!
This helped: python convert list of single key dictionaries into a single dictionary
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # calculation function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result = {}                    # create empty dict
    d_result[result_name] = result   # add key:value pair to dict
    return d_result                  # return dict

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}

    # parallelised calc. Each run returns a dict, final output is a list of dicts
    d_result = Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                                  for r, pair_index in compute_indices(d))

    # transform list of dicts to a single dict
    d_result = {k: v for x in d_result for k, v in x.items()}
    print(d_result)

process()
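As the previous answer points out, it is still a good idea to guard the entry point so the module can be safely imported by the child interpreters:

if __name__ == "__main__":
    process()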
I'm having trouble merging the values of a dictionary when the dictionary varies in its number of keys.
I found a working example using two lists, like
t1 = [1,2,3]
t2 = ["a","b","c"]
output = list(zip(t1, t2))
which leads to [(1, 'a'), (2, 'b'), (3, 'c')] ... first success.
But I need to zip all the values from a dictionary, which varies in the number of its keys (sometimes there are 2 keys in it, sometimes 4, and so on).
Is there a way to do the zip with a dynamic input, dependent on the count of the keys?
Let's say

t1 = [1, 2, 3]
t2 = ["a", "b", "c"]
generated_rows = OrderedDict()
generated_rows['t1'] = t1
generated_rows['t2'] = t2
output = list(zip(??*))
the expected output would be as above:
[(1, 'a'), (2, 'b'), (3, 'c')]
but the parameters of the zip method should somehow come from the dictionary in a dynamic way. The following varying dicts should all work with the method:
d1 = {'k1':[0,1,2], 'k2':['a','b','c']}
d2 = {'k1':[0,1,2], 'k2':['a','b','c'], 'k3':['x','y','z']}
d3 = ...
solution (thanks to Todd):
d1 = {'k1':[0,1,2], 'k2':['a','b','c']}
o = list(zip(*d1.values()))
If your second piece of code accurately represents what you want to do with N different lists, then the code would probably be:
t1 = [ 1, 2, 3 ]
t2 = [ 'a', 'b', 'c' ]
# And so on
x = []
x.append( t1 )
x.append( t2 )
# And so on
output = zip(*x)
In Python 2 you don't need the extra list() because zip() already returns a list; in Python 3, zip() returns an iterator, so keep the list() call if you need an actual list. The * operator is sometimes referred to as the 'splat' operator, and when used like this it unpacks the list into separate arguments.
A list is used instead of a dictionary because the 'splat' operator doesn't guarantee the order it unpacks things in beyond "whatever order the type in question uses when iterating over it". An ordered dictionary (or a plain dict on Python 3.7+, where insertion order is preserved) will work if the keys are inserted in the correct order.
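Applied to the dictionary from the question, a short sketch (any number of keys works; on Python 3.7+ a plain dict already preserves insertion order, so OrderedDict is optional):

from collections import OrderedDict

t1 = [1, 2, 3]
t2 = ['a', 'b', 'c']

generated_rows = OrderedDict()       # a plain dict behaves the same on Python 3.7+
generated_rows['t1'] = t1
generated_rows['t2'] = t2

# unpack however many value lists the dict happens to contain
output = list(zip(*generated_rows.values()))
print(output)                        # [(1, 'a'), (2, 'b'), (3, 'c')]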
I have two Counter collections, C1 and C2. They have similar data sets but different counts (think of C1 and C2 as the number of apples and oranges a group of people have).
I want to merge these two collections into one dict that looks like
{
Person1: [1, 2],
Person2: [5, 1],
...
}
I haven't decided what data structure to store the merged counts in (perhaps a list) in order to easily write them to a CSV file with the number of apples and oranges as separate columns. There are a lot of tricks in Python collections I am not aware of; I am looking for minimal code size. Thanks.
EDIT: From the answers below, I felt that my question is not as clear as I thought; let me elaborate on exactly what I am looking for:
Let me have two Counter collections c1 and c2:
c1 = [
('orange', 10),
('apple', 20)
]
c2 = [
('orange', 15),
('apple', 30)
]
I want to merge these two collections into a single dict such that it looks like:
merged = {
'orange': [10, 15],
'apple': [20, 30]
}
Or other data structure that can be easily converted and output to csv format.
Using pandas:
import pandas as pd
from collections import Counter
c1 = Counter('jdahfajksdasdhflajkdhflajh')
c2 = Counter('jahdflkjhdazzfldjhfadkhfs')
df = pd.DataFrame({'apples': c1, 'oranges': c2})
df.to_csv('apples_and_oranges.csv')
This also works if the keys of the counters are not all the same; there will be NaNs where a key appears in only one of the counters.
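Continuing that snippet, if the NaNs are unwanted in the CSV, one option (assuming a missing key should simply count as zero) is to fill them before writing:

df = df.fillna(0).astype(int)        # treat keys missing from one counter as a count of 0
df.to_csv('apples_and_oranges.csv')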
You can use defaultdict() from the collections module to store the merged result, then use chain() from the itertools module. What chain does here is make an iterator that returns elements from each of your "counters", which lets you avoid writing a nested loop.
>>> from collections import defaultdict
>>> from itertools import chain
>>> c1 = [
... ('orange', 10),
... ('apple', 20)
... ]
>>> c2 = [
... ('orange', 15),
... ('apple', 30)
... ]
>>> merged = defaultdict(list)
>>> for item in chain(c1, c2):
... merged[item[0]].append(item[1])
...
>>> merged
defaultdict(<class 'list'>, {'apple': [20, 30], 'orange': [10, 15]})
>>>
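Since the goal is a CSV with the two counts as separate columns, a small follow-up sketch using the csv module (the file name and column headers are illustrative):

import csv

with open('merged.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['item', 'count_1', 'count_2'])
    for key, counts in merged.items():
        writer.writerow([key] + counts)   # e.g. ['orange', 10, 15]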
You can use the Counter.update() function if you start from a counter collection as you specified. I added the item banana as well, which is only in one counter collection. Be aware that update() used on a Counter adds the values to the existing count for a key, in contrast to update() used on a dict, where the value is replaced (!) by the update (check the docs: https://docs.python.org/3/library/collections.html#collections.Counter.update).
from collections import Counter
import pandas as pd
c1 = [('orange', 10),('apple', 20)]
c2 = [('orange', 15),('apple', 30),('banana',5)]
c = Counter()
for i in c1: c.update({i[0]:i[1]})
for i in c2: c.update({i[0]:i[1]})
However, if you start from a list of values you can construct a Counter for each list and add the counters
c1 = Counter(['orange'] * 10 + ['apple'] * 20)
c2 = Counter(['orange'] * 15 + ['apple'] * 30 + ['banana']* 5)
c = c1 + c2
Now we can write the counter to a csv file
df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
df.to_csv('counts.csv')
Yet another way is to convert the counter collections to dicts and from there to Counters, since you are looking for a small code size
c1 = Counter(dict([('orange', 10),('apple', 20)]))
c2 = Counter(dict([('orange', 15),('apple', 30),('banana',5)]))
c = c1 + c2
I have a Python script running that starts the same function in multiple processes. The function creates and processes 2 counters (c1 and c2). The results of all the c1 counters from the forked processes should be merged together; the same goes for the results of all the c2 counters returned by the different forks.
My (pseudo)code looks like this:
def countIt(cfg):
    c1 = Counter()
    c2 = Counter()
    # do some things and fill the counters by counting words in a text, e.g.
    # c1 = Counter({'apple': 3, 'banana': 0})
    # c2 = Counter({'blue': 3, 'green': 0})
    return c1, c2

if __name__ == '__main__':
    cP1 = Counter()
    cP2 = Counter()
    cfg = "myConfig"
    p = multiprocessing.Pool(4)  # creating 4 forks
    c1, c2 = p.map(countIt, cfg)[:2]
    # 1.) This will only work with [:2], which seems to be no good idea
    # 2.) at this point c1 and c2 are lists, not counters anymore,
    #     so the following will not work:
    cP1 + c1
    cP2 + c2
Following the example above, I need a result like:
cP1 = Counter({'apple': 25, 'banana': 247, 'orange': 24})
cP2 = Counter({'red': 11, 'blue': 56, 'green': 3})
So my question: how can I count things inside a forked process in order to aggregate each counter (all c1 and all c2) in the parent process?
You need to "unzip" your result by using for example a for-each loop. You will receive a list of tuples where each tuple is a pair of counters: (c1, c2).
With your current solution you actually mix them up. You assigned [(c1a, c2a), (c1b, c2b)] to c1, c2 meaning that c1 contains (c1a, c2a) and c2 contains (c1b, c2b).
Try this:
if __name__ == '__main__':
    from contextlib import closing

    cP1 = Counter()
    cP2 = Counter()
    # I hope you have an actual list of configs here, otherwise map will
    # call `countIt` with the single characters of the string 'myConfig'
    cfg = "myConfig"
    # `contextlib.closing` makes sure the pool is closed after we're done.
    # In python3, Pool is itself a contextmanager and you don't need to
    # surround it with `closing` in order to be able to use it in the `with`
    # construct.
    # This approach, however, is compatible with both python2 and python3.
    with closing(multiprocessing.Pool(4)) as p:
        # Just counting, no need to order the results.
        # This might actually be a bit faster.
        for c1, c2 in p.imap_unordered(countIt, cfg):
            cP1 += c1
            cP2 += c2
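For completeness, a minimal runnable sketch of this pattern; the countIt body and the list of configs are made up for illustration, the real function would do the actual word counting:

import multiprocessing
from collections import Counter
from contextlib import closing

def countIt(cfg):
    # stand-in for the real work: each config supplies some words and some colours
    words, colours = cfg
    return Counter(words), Counter(colours)

if __name__ == '__main__':
    cP1, cP2 = Counter(), Counter()
    configs = [(['apple', 'apple', 'banana'], ['blue', 'green']),
               (['apple', 'orange'], ['blue', 'blue', 'red'])]
    with closing(multiprocessing.Pool(4)) as p:
        for c1, c2 in p.imap_unordered(countIt, configs):
            cP1 += c1
            cP2 += c2
    print(cP1)   # Counter({'apple': 3, 'banana': 1, 'orange': 1})
    print(cP2)   # Counter({'blue': 3, 'green': 1, 'red': 1})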
I've been trying to figure out the best way to write a query to compare the rows in two tables. My goal is to see if the two tuples in result set A are in the larger result set B. I only want to see the tuples that are different in the query results.
'''SELECT table1.field_b, table1.field_c, table1.field_d
FROM table1
ORDER BY field_b'''

results_a = [(101010101, 111111111, 999999999), (121212121, 222222222, 999999999)]

'''SELECT table2.field_a, table2.fieldb, table3.field3
FROM table2
ORDER BY field_a'''

results_b = [(101010101, 111111111, 999999999), (121212121, 333333333, 999999999), (303030303, 444444444, 999999999)]
So what I want to do is take results_a and make sure each tuple has an exact match somewhere in results_b. Since the second field of the second tuple in results_b is different from what is in results_a, I would like to return the second tuple of results_a.
Ultimately I would like to return a structure that also contains the tuple from the other set that did not match, so I could reference both in my program. Ideally, since the second tuple's primary key (field_b in table1) matches the corresponding primary key (field_a in table2) but the rest of the row does not, I would want to display results_c = {(121212121, 222222222, 999999999): (121212121, 333333333, 999999999)}. This is complicated by the fact that the results in the two tables will not be in the same order, so I can't write code that says "compare tuple 2 in results_a to tuple 2 in results_b". It is more like "compare tuple 2 in results_a and see if it matches any record in results_b; if the primary keys match and none of the tuples in results_b completely match, or no partial match is found, return the records that don't match".
I apologize that this is so wordy. I couldn't think of a better way to explain it. Any help would be much appreciated.
Thanks!
UPDATED EFFORT ON PARTIAL MATCHES
a = [(1, 2, 3), (4, 5, 7)]
b = [(1, 2, 3), (4, 5, 6)]
pmatch = dict([])

def partial_match(x, y):
    return sum(ea == eb for (ea, eb) in zip(x, y)) >= 2

for el_a in a:
    pmatch[el_a] = [el_b for el_b in b if partial_match(el_a, el_b)]

print(pmatch)
OUTPUT = {(4, 5, 7): [(4, 5, 6)], (1, 2, 3): [(1, 2, 3)]}. I would have expected it to be just {(4, 5, 7): (4, 5, 6)} because those are the only tuples that are different. Any ideas?
Take results_a and make sure that they have an exact match somewhere in results_b:
for el in results_a:
    if el in results_b:
        ...
Get partial matches:
pmatch = dict([])

def partial_match(a, b):
    # for instance ...
    return sum(ea == eb for (ea, eb) in zip(a, b)) >= 2

for el_a in results_a:
    pmatch[el_a] = [el_b for el_b in results_b if partial_match(el_a, el_b)]
Return the records that don't match:
no_match = [el for el in results_a if el not in results_b]
-- EDIT / Another possible partial_match
def partial_match(x, y):
    nb_matches = sum(ea == eb for (ea, eb) in zip(x, y))
    return 0.6 < float(nb_matches) / len(x) < 1
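To get only the rows that differ, as expected in the update above, one way to combine these pieces (a sketch) is to look for partial matches only among the tuples that have no exact match in the other set:

a = [(1, 2, 3), (4, 5, 7)]
b = [(1, 2, 3), (4, 5, 6)]

def partial_match(x, y):
    return sum(ea == eb for (ea, eb) in zip(x, y)) >= 2

# only consider rows of a that are not exactly present in b
pmatch = {el_a: [el_b for el_b in b if partial_match(el_a, el_b) and el_a != el_b]
          for el_a in a if el_a not in b}
print(pmatch)   # {(4, 5, 7): [(4, 5, 6)]}

The value is kept as a list in case more than one row of b partially matches the same row of a.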