Return Counter object in multiprocessing / map function - Python

I have a Python script running that starts the same function in multiple worker processes. The function creates and processes two counters (c1 and c2). The results of all the c1 counters from the forked processes should be merged together; the same goes for the results of all the c2 counters returned by the different forks.
My (pseudo-)code looks like this:
def countIt(cfg):
    c1 = Counter()
    c2 = Counter()
    # do some things and fill the counters by counting words in a text, e.g.
    # c1 = Counter({'apple': 3, 'banana': 0})
    # c2 = Counter({'blue': 3, 'green': 0})
    return c1, c2

if __name__ == '__main__':
    cP1 = Counter()
    cP2 = Counter()
    cfg = "myConfig"
    p = multiprocessing.Pool(4)  # creating 4 forks
    c1, c2 = p.map(countIt, cfg)[:2]
    # 1.) This only works with [:2], which seems to be a bad idea
    # 2.) at this point c1 and c2 are tuples of Counters, not single Counters,
    #     so the following will not work:
    cP1 + c1
    cP2 + c2
Following the example above, I need a result like:
cP1 = Counter({'apple': 25, 'banana': 247, 'orange': 24})
cP2 = Counter({'red': 11, 'blue': 56, 'green': 3})
So my question: how can I count things inside a forked process in order to aggregate each counter (all c1 and all c2) in the parent process?

You need to "unzip" your result, for example with a for-each loop. You will receive a list of tuples where each tuple is a pair of counters: (c1, c2).
With your current solution you actually mix them up: you assigned [(c1a, c2a), (c1b, c2b)] to c1, c2, meaning that c1 contains (c1a, c2a) and c2 contains (c1b, c2b).
Try this:
if __name__ == '__main__':
    from contextlib import closing
    cP1 = Counter()
    cP2 = Counter()
    # I hope you have an actual list of configs here, otherwise map will
    # call `countIt` with the single characters of the string 'myConfig'
    cfg = "myConfig"
    # `contextlib.closing` makes sure the pool is closed after we're done.
    # In Python 3, Pool is itself a context manager and you don't need to
    # wrap it in `closing` in order to be able to use it in the `with`
    # construct. This approach, however, works in both Python 2 and Python 3.
    with closing(multiprocessing.Pool(4)) as p:
        # Just counting, no need to order the results.
        # This might actually be a bit faster.
        for c1, c2 in p.imap_unordered(countIt, cfg):
            cP1 += c1
            cP2 += c2
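If you prefer to keep p.map and collect everything first, a minimal alternative sketch is to unzip the list of result tuples with zip(*...) and sum each group of Counters. This assumes a hypothetical cfgs that is an actual list of configs (not the single string above) and countIt as defined earlier.

from collections import Counter
from contextlib import closing
import multiprocessing

# assumption: cfgs is a list of config objects, countIt is defined as above
with closing(multiprocessing.Pool(4)) as p:
    results = p.map(countIt, cfgs)   # [(c1a, c2a), (c1b, c2b), ...]
c1_list, c2_list = zip(*results)     # unzip into two tuples of Counters
cP1 = sum(c1_list, Counter())        # Counter() is the start value for sum()
cP2 = sum(c2_list, Counter())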

Related

Printing Parallel Function Outputs in True Order w/ Python

I'm looking to print everything in order for a parallelized Python script. Note that c3 is printed before b2 -- out of order. Is there any way to give the function below a wait feature? If you rerun, the print order is sometimes correct for shorter batches; however, I'm looking for a reproducible solution to this issue.
from joblib import Parallel, delayed, parallel_backend
import multiprocessing

testFrame = [['a', 1], ['b', 2], ['c', 3]]

def testPrint(letr, numbr):
    print(letr + str(numbr))
    return letr + str(numbr)

with parallel_backend('multiprocessing'):
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(testPrint)(letr=testFrame[i][0],
                                                            numbr=testFrame[i][1])
                                         for i in range(len(testFrame)))

print('##########')
for test in results:
    print(test)
Output:
b2
c3
a1
##########
a1
b2
c3
Seeking:
a1
b2
c3
##########
a1
b2
c3
Once you launch tasks in separate processes you no longer control the order of execution, so you cannot expect the actions of those tasks to execute in any predictable order - especially if the tasks can take varying lengths of time.
If you are parallelizing a task/function over a sequence of arguments and you want to reorder the results to match the order of the original sequence, you can pass sequence information into the task/function; it is returned by the task and can then be used to reconstruct the original order.
If the original function looks like this:
def f(arg):
    l, n = arg
    # do stuff
    time.sleep(random.uniform(.1, 10.))
    result = f'{l}{n}'
    return result
Refactor the function to accept the sequence information and pass it through with the return value.
def f(arg):
    indx, (l, n) = arg
    time.sleep(random.uniform(.1, 10.))
    result = (indx, f'{l}{n}')
    return result
enumerate could be used to add the sequence information to the sequence of data:
originaldata = list(zip('abcdefghijklmnopqrstuvwxyz', range(26)))
dataplus = enumerate(originaldata)
Now the arguments have the form (index, originalarg): (0, ('a', 0)), (1, ('b', 1)), and so on.
And the returned values from the worker processes look like this (if collected in a list):
[(14, 'o14'), (23, 'x23'), (1, 'b1'), (4, 'e4'), (13, 'n13'),...]
This is easily sorted on the first item of each result, key=lambda item: item[0], and the values you really want are obtained by picking out the second items after sorting: results = [item[1] for item in results].
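A minimal end-to-end sketch of this idea with a plain multiprocessing.Pool (the pool size, the data, and the sleep are arbitrary, just to make the completion order unpredictable):

import multiprocessing
import random
import time

def f(arg):
    indx, (l, n) = arg
    time.sleep(random.uniform(0.1, 1.0))   # simulate work of varying length
    return indx, f'{l}{n}'

if __name__ == '__main__':
    originaldata = list(zip('abcdefghij', range(10)))
    with multiprocessing.Pool(4) as pool:
        results = list(pool.imap_unordered(f, enumerate(originaldata)))
    results.sort(key=lambda item: item[0])   # restore the original order
    print([item[1] for item in results])     # ['a0', 'b1', 'c2', ...]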

Grouping lists by a common element

Assume we have a list of sets as follows:
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
S2 = []
I want to go over this list and, for each set in it, check whether a property holds between that set and the other sets of the list. If the property holds, join the two sets together and compare the new set to the remaining sets of S1. At the end, add this new set to S2.
Now, as an example, assume we say the property holds between two sets if all elements of those two sets begin with the same letter.
For the list S1 described above, I want S2 to be:
S2 = [{'A_1', 'A_3', 'A_2'}, {'B_1', 'B_3', 'B_2'}, {'C_1','C_2'}]
How should we write code for this?
This is my code. It works fine, but I think it is not efficient because it tries to add set(['A_3', 'A_2', 'A_1']) several times. Assume the Checker function is given and checks the property between two collections. The property I mentioned above is just an example; we may want to change it later, so Checker should stay a separate function.
def Checker(list1, list2):
    flag = 1
    for item1 in list1:
        for item2 in list2:
            if item1[0] != item2[0]:
                flag = 0
    if flag == 1:
        return 1
    else:
        return 0

S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]
S2 = []
for i in range(0, len(S1)):
    Temp = S1[i]
    for j in range(0, i) + range(i + 1, len(S1)):  # Python 2: concatenating range() lists skips index i
        if Checker(Temp, S1[j]) == 1:
            Temp = Temp.union(S1[j])
    if Temp not in S2:
        S2.append(Temp)
print S2
Output:
[set(['A_3', 'A_2', 'A_1']), set(['B_1', 'B_2', 'B_3']), set(['C_1', 'C_2'])]
def Checker(list1, list2):
    for item1 in list1:
        for item2 in list2:
            if item1[0] != item2[0]:
                return 0
    return 1
I have tried to reduce the complexity of the Checker() function.
You can flatten the list (there are many ways to do this, but a simple one is it.chain(*nested_list)), sort it using only the property as the key, and then use it.groupby() with the same key to create the new list:
In []:
import operator as op
import itertools as it
prop = op.itemgetter(0)
[set(v) for _, v in it.groupby(sorted(it.chain(*S1), key=prop), key=prop)]
Out[]:
[{'A_1', 'A_2', 'A_3'}, {'B_1', 'B_2', 'B_3'}, {'C_1', 'C_2'}]
If performance is a consideration, I suggest the canonical grouping approach in Python: using a defaultdict:
>>> S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
>>> from collections import defaultdict
>>> grouper = defaultdict(set)
>>> from itertools import chain
>>> for item in chain.from_iterable(S1):
...     grouper[item[0]].add(item)
...
>>> grouper
defaultdict(<class 'set'>, {'C': {'C_1', 'C_2'}, 'B': {'B_1', 'B_2', 'B_3'}, 'A': {'A_1', 'A_2', 'A_3'}})
Edit
Note, the following applies to Python 3. In Python 2, .values returns a list.
Note: you probably just want this dict; it is likely much more useful to you than a list of the groups. You can also use the .values() method, which returns a view of the values:
>>> grouper.values()
dict_values([{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}])
If you really want a list, you can always get it in a straight-forward way:
>>> S2 = list(grouper.values())
>>> S2
[{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}]
Given that N is the number of items across all the nested sets, this solution is O(N).
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]

from itertools import chain

l = list(chain.from_iterable(S1))
s = {i[0] for i in l}
t = []
for k in s:
    t.append([i for i in l if i[0] == k])
print(t)
Output:
[['B_1', 'B_3', 'B_2'], ['A_1', 'A_3', 'A_2'], ['C_1', 'C_2']]
Is your property 1. symmetric and 2. transitive? I.e. 1. prop(a,b) if and only if prop(b,a), and 2. prop(a,b) and prop(b,c) implies prop(a,c)? If so, you can write a function that takes a set and returns a key for the corresponding equivalence class. E.g.
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]

def eq_class(s):
    fs = set(w[0] for w in s)
    if len(fs) != 1:
        return None
    return fs.pop()

S2 = dict()
for s in S1:
    cls = eq_class(s)
    S2[cls] = S2.get(cls, set()).union(s)

S2 = list(S2.values())
This has the advantage of being amortized O(len(S1)). Also note that your final output may depend on the order of S1 if property 1 or 2 fails.
A bit more verbose version using itertools.groupby
from itertools import groupby

S1 = [['A_1'], ['B_1', 'B_3'], ['C_1'], ['A_3'], ['C_2'], ['B_2'], ['A_2']]

def group(data):
    # Flatten the data
    l = list(d for sub in data for d in sub)
    # Sort it
    l.sort()
    groups = []
    keys = []
    # Iterates once per group found
    for k, g in groupby(l, lambda x: x[0]):
        groups.append(list(g))
        keys.append(k)
    # Return the keys and the grouped data
    return keys, [set(x) for x in groups]

keys, S2 = group(S1)
print("Found the following keys", keys)
print("S2 =", S2)
The main idea here was to reduce the number of appends, as these really cripple performance. We flatten the data using a generator and sort it, then use groupby to group the data. The loop iterates only once per group. There is still a fair bit of data copying here that could potentially be removed.
A bonus is that the function also returns the group keys detected in the data.

Python, add key:value to dictionary in parallelised loop

I have written some code to perform some calculations in parallel (with joblib) and update a dictionary with the calculation results. The code consists of a main function which calls a generator function and a calculation function to be run in parallel. The calculation results (key:value pairs) are added by each instance of the calculation function to a dictionary created in the main function and marked as global.
Below is a simplified version of my code, illustrating the procedure described above.
When everything runs, the result dictionary (d_result) is empty, but it should have been populated with the results generated by the calculation function. Why is that?
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result[result_name] = result
    # d_result.setdefault(result_name, []).append(result) ## same result as above

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    global d_result
    d_result = {}
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                       for r, pair_index in compute_indices(d))
    print(d_result)

process()
I am glad you got your program to work. However, I think you have overlooked something important, and you might run into trouble if you use your example as a basis for larger programs.
I scanned the docs for joblib, and discovered that it's built on the Python multiprocessing module. So the multiprocessing programming guidelines apply.
At first I could not figure out why your new program ran successfully and the original one did not. Here is the reason (from the link above): "Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called."
This is because each child process has, at least conceptually, its own copy of the Python interpreter. In each child process, the code that is used by that process must be imported. If that code declares globals, the two processes will have separate copies of those globals, even though it doesn't look that way when you read the code. So when your original program's child process put data into the global d_result, it was actually a different object from the d_result in the parent process.
From the docs again: "Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
For example, under Windows running the following module would fail with a RuntimeError:
from multiprocessing import Process

def foo():
    print 'hello'

p = Process(target=foo)
p.start()
Instead one should protect the entry point of the program by using if __name__ == '__main__'."
So it is important to add one line of code to your program (the second version), right before the last line:
if __name__ == "__main__":
    process()
Failure to do this can result in some nasty bugs that you don't want to spend time with.
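To illustrate the quoted guideline, here is a minimal sketch (the names are made up for the example, not taken from the post above): a module-level dict filled in a child process stays empty in the parent, because each process works on its own copy of the global.

import multiprocessing

shared = {}   # module-level global; each worker process gets its own copy

def worker(key, value):
    shared[key] = value              # modifies only the child's copy
    print('child sees:', shared)     # {'a': 1}

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, args=('a', 1))
    p.start()
    p.join()
    print('parent sees:', shared)    # {} -- the child's update never comes back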
OK, I've figured it out. Answer and new code below:
The do_calc() function now generates an empty dict, then populates it with a single key:value pair and returns the dict.
The parallel bit in process() by default collects whatever do_calc() returns into a list, so what I end up with after the parallelised do_calc() is a list of dicts.
What I really want is a single dict, so using a dict comprehension I convert the list of dicts into one dict, and voila, she's all good!
This helped: python convert list of single key dictionaries into a single dictionary
import numpy as np
from joblib import Parallel, delayed

def do_calc(d, r, pair_index):  # calculation function to be run in parallel
    data_1 = d[str(r)][pair_index, 1]
    data_2 = d[str(r)][pair_index, 2]
    result_name = str(data_1) + " ^ " + str(data_2)
    result = data_1 ** data_2
    d_result = {}  # create empty dict
    d_result[result_name] = result  # add key:value pair to dict
    return d_result  # return dict

def compute_indices(d):  # generator function
    for r in d:
        num_pairs = d[str(r)].shape[0]
        for pair_index in range(num_pairs):
            yield r, pair_index

def process():  # main function
    r1 = np.array([['ab', 1, 2], ['vw', 10, 12]], dtype=object)
    r2 = np.array([['ac', 1, 3], ['vx', 10, 13]], dtype=object)
    r3 = np.array([['ad', 1, 4], ['vy', 10, 14]], dtype=object)
    r4 = np.array([['ae', 1, 5], ['vz', 10, 15]], dtype=object)
    d = {'r1': r1, 'r2': r2, 'r3': r3, 'r4': r4}
    # parallelised calc. Each run returns dict, final output is list of dicts
    d_result = Parallel(n_jobs=4)(delayed(do_calc)(d, r, pair_index)
                                  for r, pair_index in compute_indices(d))
    # transform list of dicts to dict
    d_result = {k: v for x in d_result for k, v in x.items()}
    print(d_result)

process()
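As a side note, collections.ChainMap is another way to do the final merge instead of the dict comprehension; it works in this example because every per-task dict has a distinct key. The sample values below are just the kind of output do_calc() produces, not taken from an actual run.

from collections import ChainMap

list_of_dicts = [{'1 ^ 2': 1}, {'10 ^ 12': 1000000000000}]   # e.g. what Parallel(...) returns
d_result = dict(ChainMap(*list_of_dicts))
print(d_result)   # one dict containing both key:value pairs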

How to merge two or more Counters into a single dictionary?

I have two Counter collections, C1 and C2. They have similar data sets but different counts (think of C1 and C2 as the numbers of apples and oranges a group of people have).
I want to merge these two collections into one dict that looks like
{
Person1: [1, 2],
Person2: [5, 1],
...
}
I haven't decided what data structure to store the merged counts in (perhaps lists), as long as I can easily write them to a CSV file with the numbers of apples and oranges as separate columns. There are a lot of tricks in Python's collections module that I am not aware of; I am looking for minimal code size. Thanks.
EDIT: From the answers below, I realized that my question is not as clear as I thought; let me elaborate on exactly what I am looking for:
Say I have two counter collections c1 and c2:
c1 = [
('orange', 10),
('apple', 20)
]
c2 = [
('orange', 15),
('apple', 30)
]
I want to merge these two collections into a single dict such that it looks like:
merged = {
'orange': [10, 15],
'apple': [20, 30]
}
Or any other data structure that can easily be converted and written out in CSV format.
Using pandas:
import pandas as pd
from collections import Counter
c1 = Counter('jdahfajksdasdhflajkdhflajh')
c2 = Counter('jahdflkjhdazzfldjhfadkhfs')
df = pd.DataFrame({'apples': c1, 'oranges': c2})
df.to_csv('apples_and_oranges.csv')
This works also if the keys of the counters are not all the same. There will be NaNs where the key only appeared in the other counter.
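If the NaNs are unwanted in the CSV, they can be filled before writing; a small, hedged variation on the snippet above that treats missing letters as a count of zero:

import pandas as pd
from collections import Counter

c1 = Counter('jdahfajksdasdhflajkdhflajh')
c2 = Counter('jahdflkjhdazzfldjhfadkhfs')
df = pd.DataFrame({'apples': c1, 'oranges': c2}).fillna(0).astype(int)
df.to_csv('apples_and_oranges.csv')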
You can use defaultdict() from the collections module to store the merged result, together with chain() from the itertools module. What chain does here is make an iterator that returns elements from each of your "counters", letting you avoid writing a nested loop.
>>> from collections import defaultdict
>>> from itertools import chain
>>> c1 = [
... ('orange', 10),
... ('apple', 20)
... ]
>>> c2 = [
... ('orange', 15),
... ('apple', 30)
... ]
>>> merged = defaultdict(list)
>>> for item in chain(c1, c2):
...     merged[item[0]].append(item[1])
...
>>> merged
defaultdict(<class 'list'>, {'apple': [20, 30], 'orange': [10, 15]})
>>>
You can use the Counter.update() function if you start from a counter collection as you specified. I added the item banana as well, which is only in one counter collection. Be aware that update used on a Counter adds the values to the key. This is in contrast to update used on a dict, where the value is replaced (!) by the update (check the docs: https://docs.python.org/3/library/collections.html#collections.Counter.update).
from collections import Counter
import pandas as pd
c1 = [('orange', 10),('apple', 20)]
c2 = [('orange', 15),('apple', 30),('banana',5)]
c = Counter()
for i in c1: c.update({i[0]:i[1]})
for i in c2: c.update({i[0]:i[1]})
However, if you start from a list of values, you can construct a Counter for each list and add the counters:
c1 = Counter(['orange'] * 10 + ['apple'] * 20)
c2 = Counter(['orange'] * 15 + ['apple'] * 30 + ['banana']* 5)
c = c1 + c2
Now we can write the counter to a CSV file:
df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
df.to_csv('counts.csv')
Yet another way is to convert the counter collections to dicts and from there to Counters, since you are looking for a small code size:
c1 = Counter(dict([('orange', 10),('apple', 20)]))
c2 = Counter(dict([('orange', 15),('apple', 30),('banana',5)]))
c = c1 + c2

Finding symmetric difference of 2 sets separating them by origin

I have two sets:
>>> a = {1,2,3}
>>> b = {2,3,4,5,6}
And I would like to get two new sets with the non-common elements, the first set containing elements from a and the second from b, like ({1}, {4,5,6}), for example like this:
>>> c = a&b # Common elements
>>> d = a^b # Symmetric difference
>>> (a-b, b-a)
({1}, {4, 5, 6})
>>> (a-c, b-c)
({1}, {4, 5, 6})
>>> (a&d, b&d)
({1}, {4, 5, 6})
My problem is that I'm going to use this on a large number of SHA-1 hashes and I'm worried about performance. What is the proper way of doing this efficiently?
Note: a and b are going to have around 95% of their elements in common, 1% will be only in a and 4% only in b.
The methods I mentioned in the question have the following performance:
>>> timeit.timeit('a-b; b-a', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
135.45828826893307
>>> timeit.timeit('c=a&b; a-c; b-c', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
189.98522938665735
>>> timeit.timeit('d=a^b; a&d; b&d', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
238.35084129583106
So the most effective way seems to be the (a-b, b-a) method.
I'm posting this as a reference so that other answers can add new methods rather than compare the ones I've already found.
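Since the timings above use integers, here is a sketch of how the same comparison could be rerun on data closer to the actual use case, i.e. sets of SHA-1 hex digests. The sizes and overlap are illustrative, roughly matching the 95%/1%/4% split mentioned in the question.

import hashlib
import timeit

digests = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(200000)]
common = set(digests[:190000])                     # ~95% shared
a = common | set(digests[190000:192000])           # ~1% only in a
b = common | set(digests[192000:])                 # ~4% only in b
print(timeit.timeit('a - b; b - a', globals={'a': a, 'b': b}, number=100))
print(timeit.timeit('c = a & b; a - c; b - c', globals={'a': a, 'b': b}, number=100))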
Python-implemented function
Just out of curiosity, I've tried implementing my own Python function to do this (it works on pre-sorted iterables):
def symmetric_diff(l1, l2):
    # l1 and l2 have to be sorted and contain comparable elements
    r1 = []
    r2 = []
    i1 = iter(l1)
    i2 = iter(l2)
    try:
        e1 = next(i1)
    except StopIteration:
        return ([], list(i2))
    try:
        e2 = next(i2)
    except StopIteration:
        return ([e1] + list(i1), [])
    try:
        while True:
            if e1 == e2:
                e1 = next(i1)
                e2 = next(i2)
            elif e1 > e2:
                r2.append(e2)
                e2 = next(i2)
            else:
                r1.append(e1)
                e1 = next(i1)
    except StopIteration:
        if e1 == e2:
            return (r1 + list(i1), r2 + list(i2))
        elif e1 > e2:
            return (r1 + [e1] + list(i1), r2 + list(i2))
        else:
            return (r1 + list(i1), r2 + [e2] + list(i2))
Compared to other methods, this one has quite low performance:
>>> t = timeit.Timer(lambda: symmetric_diff(a, b))
>>> t.timeit(1000)
542.3225249653769
So unless some other method is implemented somewhere (in a library for working with sets), I think using the two set differences is the most efficient way of doing this.
