Grouping lists by a common element - python

Assume we have a list of sets as follows:
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
S2 = []
I want to go over this list and, for each set, check whether a property holds between that set and each of the other sets in the list. If the property holds, I join the two sets and compare the joined set to the remaining sets of S1. At the end, I add the joined set to S2.
Now, as an example, assume we say the property holds between two sets if all elements of those two sets begin with the same letter.
For the list S1 described above, I want S2 to be:
S2 = [{'A_1', 'A_3', 'A_2'}, {'B_1', 'B_3', 'B_2'}, {'C_1','C_2'}]
How should we write code for this?
This is my code. It works, but I think it is inefficient because it tries to add set(['A_3', 'A_2', 'A_1']) several times. Assume the Checker function is given and checks the property between two sets. The property I mentioned above is just an example and may change later, so Checker should stay a function.
def Checker(list1, list2):
    flag = 1
    for item1 in list1:
        for item2 in list2:
            if item1[0] != item2[0]:
                flag = 0
    if flag == 1:
        return 1
    else:
        return 0

S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
S2 = []
for i in range(0, len(S1)):
    Temp = S1[i]
    # Every index except i (Python 2: range returns a list, so + concatenates)
    for j in range(0, i) + range(i+1, len(S1)):
        if Checker(Temp, S1[j]) == 1:
            Temp = Temp.union(S1[j])
    if Temp not in S2:
        S2.append(Temp)
print S2
Output:
[set(['A_3', 'A_2', 'A_1']), set(['B_1', 'B_2', 'B_3']), set(['C_1', 'C_2'])]

def Checker(list1, list2):
    for item1 in list1:
        for item2 in list2:
            if item1[0] != item2[0]:
                # Stop at the first mismatch instead of scanning every pair
                return 0
    return 1
I have tried to reduce the complexity of the Checker() function.

You can flatten the list (there are many ways to do this, but a simple one is it.chain(*nested_list)), sort it using only the property as the key, and then use it.groupby() with the same key to create the new list:
In []:
import operator as op
import itertools as it
prop = op.itemgetter(0)
[set(v) for _, v in it.groupby(sorted(it.chain(*S1), key=prop), key=prop)]
Out[]:
[{'A_1', 'A_2', 'A_3'}, {'B_1', 'B_2', 'B_3'}, {'C_1', 'C_2'}]

If performance is a consideration, I suggest the canonical grouping approach in Python: using a defaultdict:
>>> S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
>>> from collections import defaultdict
>>> grouper = defaultdict(set)
>>> from itertools import chain
>>> for item in chain.from_iterable(S1):
...     grouper[item[0]].add(item)
...
>>> grouper
defaultdict(<class 'set'>, {'C': {'C_1', 'C_2'}, 'B': {'B_1', 'B_2', 'B_3'}, 'A': {'A_1', 'A_2', 'A_3'}})
Edit
Note: the following applies to Python 3. In Python 2, .values() returns a list.
You probably just want this dict, since it is likely more useful to you than a list of the groups. You can also use the .values() method, which returns a view on the values:
>>> grouper.values()
dict_values([{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}])
If you really want a list, you can always get it in a straight-forward way:
>>> S2 = list(grouper.values())
>>> S2
[{'C_1', 'C_2'}, {'B_1', 'B_2', 'B_3'}, {'A_1', 'A_2', 'A_3'}]
Given that N is the number of items in all the nested sets, then this solution is O(N).
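Since the question says the property may change later, the same defaultdict idea can be wrapped in a function that takes the grouping key as a parameter. This is only a sketch: group_by_key and its key argument are names I made up for illustration, and it assumes the property can be expressed as "two items belong together when key(item) is equal".

```python
from collections import defaultdict
from itertools import chain


def group_by_key(sets, key):
    """Group every item found in `sets` by key(item) (hypothetical helper)."""
    groups = defaultdict(set)
    # One pass over all items, so the work is linear in the number of items
    for item in chain.from_iterable(sets):
        groups[key(item)].add(item)
    return groups


S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]

# Group by first letter, as in the question
by_letter = group_by_key(S1, key=lambda s: s[0])
S2 = list(by_letter.values())

# Swapping the property only means swapping the key, e.g. group by suffix
by_suffix = group_by_key(S1, key=lambda s: s.split('_')[1])
```

If the property really is only available as a pairwise Checker and cannot be turned into a key, this sketch does not apply as-is.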

S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]
from itertools import chain
l = list( chain.from_iterable(S1) )
s = {i[0] for i in l}
t = []
for k in s:
    t.append([i for i in l if i[0]==k])
print (t)
Output:
[['B_1', 'B_3', 'B_2'], ['A_1', 'A_3', 'A_2'], ['C_1', 'C_2']]

Is your property 1. symmetric and 2. transitive? i.e. 1. prop(a,b) if and only if prop(b,a) and 2. prop(a,b) and prop(b,c) implies prop(a,c)? If so, you can write a function that takes a set and gives some code for the corresponding equivalence class. E.g.
S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'},{'B_2'}, {'A_2'}]

def eq_class(s):
    fs = set(w[0] for w in s)
    if len(fs) != 1:
        return None
    return fs.pop()

S2 = dict()
for s in S1:
    cls = eq_class(s)
    S2[cls] = S2.get(cls,set()).union(s)

S2 = list(S2.values())
This has an advantage of being amortized O(len(S1)). Also note that your final output may depend on the order of S1 if 1 or 2 fails.
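If the property is only available as a pairwise check (like the Checker in the question) and you cannot derive a class code from it, one possible sketch is to merge each set into the first existing group that satisfies the property. The names same_first_letter and merge_by_property are made up for this example, and the result is only well defined under the symmetry and transitivity assumptions above. It needs up to len(S1) * (number of groups) property checks, so it is slower than the dict approach.

```python
def same_first_letter(set1, set2):
    """Example property: every element of both sets starts with the same letter."""
    return len({item[0] for item in set1 | set2}) == 1


def merge_by_property(sets, prop):
    """Merge each set into the first group for which prop(group, set) holds."""
    groups = []
    for s in sets:
        for i, g in enumerate(groups):
            if prop(g, s):
                groups[i] = g | s   # join the set into the matching group
                break
        else:
            # No existing group matched, so this set starts a new group
            groups.append(set(s))
    return groups


S1 = [{'A_1'}, {'B_1', 'B_3'}, {'C_1'}, {'A_3'}, {'C_2'}, {'B_2'}, {'A_2'}]
S2 = merge_by_property(S1, same_first_letter)
```

Because every set is merged into exactly one group, no group is built more than once, which avoids the repeated additions mentioned in the question.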

A bit more verbose version using itertools.groupby
from itertools import groupby
S1 = [['A_1'], ['B_1', 'B_3'], ['C_1'], ['A_3'], ['C_2'],['B_2'], ['A_2']]
def group(data):
    # Flatten the data
    l = list((d for sub in data for d in sub))
    # Sort it
    l.sort()
    groups = []
    keys = []
    # Iterates for each group found only
    for k, g in groupby(l, lambda x: x[0]):
        groups.append(list(g))
        keys.append(k)
    # Return keys group data
    return keys, [set(x) for x in groups]

keys, S2 = group(S1)
print "Found the following keys", keys
print "S2 = ", S2
The main thought here was to reduce the number of appends, as this really cripples performance. We flatten the data using a generator and sort it. Then we use groupby to group the data. The loop only iterates once per group. There is still a fair bit of data copying here that could potentially be removed.
A bonus is that the function also returns the group keys detected in the data.

Related

Is there any function equivalent to np.unique for generic object in Python

np.unique() can return indices of first occurrence, indices to reconstruct, and occurrence count. Is there any function/library that can do the same for any Python object?
Not as such. You can get similar functionality using different classes depending on your needs.
unique with no extra flags has a similar result to set:
unique_value = set(x)
collections.Counter simulates return_counts:
counts = collections.Counter(x)
unique_values = list(counts.keys())
unique_counts = list(counts.values())
To mimic return_index, call list.index for each key of a set or Counter. This assumes that x is a list:
first_indices = [x.index(k) for k in counts]
To simulate return_inverse, we look at how unique is actually implemented. unique sorts the input to get the runs of elements. A similar technique can be achieved via sorted (or in-place list.sort) and itertools.groupby:
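import itertools
import operator

# Pair each value with its original index, then sort to get runs of equal values
s = sorted(zip(x, itertools.count()))
inverse = [0] * len(x)
for i, (k, g) in enumerate(itertools.groupby(s, operator.itemgetter(0))):
    for v in g:
        inverse[v[1]] = i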
In fact, the groupby approach encodes all the options:
s = sorted(zip(x, itertools.count()))
unique_values = []
first_indices = []
unique_counts = []
inverse = [0] * len(x)
for i, (k, g) in enumerate(itertools.groupby(s, operator.itemgetter(0))):
    unique_values.append(k)
    count = 1
    v = next(g)
    inverse[v[1]] = i
    # v is (value, original index); the first item of a run has the smallest index
    first_indices.append(v[1])
    for v in g:
        inverse[v[1]] = i
        count += 1
    unique_counts.append(count)
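For objects that are hashable but not necessarily sortable, a dict-based pass is another option. The sketch below is not how np.unique works: unique_info is a hypothetical helper name, and it returns the unique values in order of first appearance rather than sorted.

```python
def unique_info(x):
    """Return (values, first_indices, inverse, counts) for hashable items.

    Values come back in order of first appearance, not sorted.
    """
    position = {}   # value -> its position in `values`
    values, first_indices, inverse, counts = [], [], [], []
    for idx, item in enumerate(x):
        pos = position.get(item)
        if pos is None:
            # First time we see this item: register it
            pos = position[item] = len(values)
            values.append(item)
            first_indices.append(idx)
            counts.append(0)
        counts[pos] += 1
        inverse.append(pos)
    return values, first_indices, inverse, counts


data = ['b', 'a', 'b', 'c', 'a', 'b']
values, first_indices, inverse, counts = unique_info(data)
# As with return_inverse, the input can be rebuilt from values and inverse
rebuilt = [values[i] for i in inverse]
```

This avoids the sort, so it does not require the items to be comparable with each other.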
You can use Counter:
> from collections import Counter
> bob = ['bob','bob','dob']
> Counter(bob)
Counter({'bob': 2, 'dob': 1})
> Counter(bob).keys()
dict_keys(['bob', 'dob'])

Filter a list of sets with specific criteria

I have a list of sets:
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
Expected outcome:
I want to count how often each individual string occurs across all the sets and keep only the strings that occur at least twice, in the original format:
a = [{'foo','cpu'}, {'foo'}, {'cpu'}]
Here's what I have so far but I'm stuck on the last part where I need to append the new list:
from collections import Counter
counter = Counter()
for a_set in a:
    # Created a counter to count the occurrences a word
    counter.update(a_set)
result = []
for a_set in a:
    for word in a_set:
        if counter[word] >= 2:
            # Not sure how I should append my new set below.
            result.append(a_set)
            break
print(result)
You are just appending the original set. So you should create a new set with the words that occur at least twice.
result = []
for a_set in a:
    new_set = {
        word for word in a_set
        if counter[word] >= 2
    }
    if new_set:  # check if new set is not empty
        result.append(new_set)
Alternatively, use the following short approach based on set intersection:
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter([i for s in a for i in s])
valid_keys = {k for k,v in c.items() if v >= 2}
res = [s & valid_keys for s in a if s & valid_keys]
print(res) # [{'cpu', 'foo'}, {'foo'}, {'cpu'}]
Here's what I ended up doing:
Build a counter then iterate over the original list of sets and filter items with <2 counts, then filter any empty sets:
from itertools import chain
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter(chain.from_iterable(map(list, a)))
res = list(filter(None, ({item for item in s if c[item] >= 2} for s in a)))
print(res)
Out: [{'foo', 'cpu'}, {'foo'}, {'cpu'}]

Finding the minimum value for different variables

If I am computing several math expressions from different variables, for example:
a = x - y
b = x**2 - y**2
c = (x-y)**2
d = x + y
how can I find the minimum value out of all the variables? For example:
a = 4
b = 7
c = 3
d = 10
So the minimum value is 3, for c. How can I make my program do this?
What I have thought of so far:
make a list
append a, b, c, d to the list
sort the list
print list[0], as it will be the smallest value.
The problem is that to append a, b, c, d to a list I have to do something like:
lst.append((a,b,c,d))
This makes the list:
[(4,7,3,10)]
so all the values end up under one index only (lst[0]).
Is there an alternative, or any other way to find the minimum?
Language: Python
Thank you
You can find the index of the smallest item like this
>>> L = [4,7,3,10]
>>> min(range(len(L)), key=L.__getitem__)
2
Now you know the index, you can get the actual item too. eg: L[2]
Another way which finds the answer in the form(index, item)
>>> min(enumerate(L), key=lambda x:x[1])
(2, 3)
I think you may be going the wrong way about solving your problem, but it's possible to pull the values of variables from the local namespace if you know their names, e.g.
>>> a = 4
>>> b = 7
>>> c = 3
>>> d = 10
>>> min(enumerate(['a', 'b', 'c', 'd']), key=lambda x, ns=locals(): ns[x[1]])
(2, 'c')
A better way is to use a dict, so you are not filling your working namespace with these "junk" variables:
>>> D = {}
>>> D['a'] = 4
>>> D['b'] = 7
>>> D['c'] = 3
>>> D['d'] = 10
>>> min(D, key=D.get)
'c'
>>> min(D.items(), key=lambda x:x[1])
('c', 3)
You can see that when the correct data structure is used, the amount of code required is much less.
If you store the numbers in a list, you can use reduce, which has O(n) complexity because the list does not need to be sorted.
from functools import reduce  # reduce is no longer a built-in in Python 3
numbers = [999, 1111, 222, -1111]
minimum = reduce(lambda mn, candidate: candidate if candidate < mn else mn, numbers[1:], numbers[0])
Pack the values into a dictionary, find the minimum value, and then find the keys with a matching value (there may be more than one minimum):
D = dict(a = 4, b = 7, c = 3, d = 10)
min_val = min(D.values())
for k,v in D.items():
    if v == min_val: print(k)
The built-in function min will do the trick. In your example, min(a,b,c,d) will yield 3.
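To tie the dict and min suggestions back to the formulas in the question, here is a small sketch. The inputs x = 5 and y = 3 are made-up values for illustration, and results is just a name I picked for the dict.

```python
x, y = 5, 3   # hypothetical inputs, not from the question

# Keep each result under its name instead of in separate variables
results = {
    'a': x - y,
    'b': x**2 - y**2,
    'c': (x - y)**2,
    'd': x + y,
}

# min over the keys, comparing by the stored value
name = min(results, key=results.get)
smallest = results[name]
print(name, smallest)
```

With these inputs the smallest result is a = 2. If several results tie for the minimum, min returns the first one it encounters.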

Finding index of values in a list dynamically

I have two lists as follows:
list_1
['A-1','A-1','A-1','A-2','A-2','A-3']
list_2
['iPad','iPod','iPhone','Windows','X-box','Kindle']
I would like to split the list_2 based on the index values in list_1. For instance,
list_a1
['iPad','iPod','iPhone']
list_a2
['Windows','X-box']
list_a3
['Kindle']
I know the index method, but it needs the value to match to be passed in. In this case, I would like to dynamically find the indexes of the entries in list_1 that share the same value. Is this possible? Any tips/hints would be deeply appreciated.
Thanks.
There are a few ways to do this.
I'd do it by using zip and groupby.
First:
>>> list(zip(list_1, list_2))
[('A-1', 'iPad'),
('A-1', 'iPod'),
('A-1', 'iPhone'),
('A-2', 'Windows'),
('A-2', 'X-box'),
('A-3', 'Kindle')]
Now:
>>> import itertools, operator
>>> [(key, list(group)) for key, group in
... itertools.groupby(zip(list_1, list_2), operator.itemgetter(0))]
[('A-1', [('A-1', 'iPad'), ('A-1', 'iPod'), ('A-1', 'iPhone')]),
('A-2', [('A-2', 'Windows'), ('A-2', 'X-box')]),
('A-3', [('A-3', 'Kindle')])]
So, you just want each group, ignoring the key, and you only want the second element of each element in the group. You can get the second element of each group with another comprehension, or just by unzipping:
>>> [list(zip(*group))[1] for key, group in
... itertools.groupby(zip(list_1, list_2), operator.itemgetter(0))]
[('iPad', 'iPod', 'iPhone'), ('Windows', 'X-box'), ('Kindle',)]
I would personally find this more readable as a sequence of separate iterator transformations than as one long expression. Taken to the extreme:
>>> ziplists = zip(list_1, list_2)
>>> pairs = itertools.groupby(ziplists, operator.itemgetter(0))
>>> groups = (group for key, group in pairs)
>>> values = (list(zip(*group))[1] for group in groups)
>>> [list(value) for value in values]
… but a happy medium of maybe 2 or 3 lines is usually better than either extreme.
Usually I'm the one rushing to a groupby solution ;^) but here I'll go the other way and manually insert into an OrderedDict:
list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
from collections import OrderedDict
d = OrderedDict()
for code, product in zip(list_1, list_2):
d.setdefault(code, []).append(product)
produces a d looking like
>>> d
OrderedDict([('A-1', ['iPad', 'iPod', 'iPhone']),
('A-2', ['Windows', 'X-box']), ('A-3', ['Kindle'])])
with easy access:
>>> d["A-2"]
['Windows', 'X-box']
and we can get the list-of-lists in list_1 order using .values():
>>> d.values()
[['iPad', 'iPod', 'iPhone'], ['Windows', 'X-box'], ['Kindle']]
If you've noticed that no one is telling you how to make a bunch of independent lists with names like list_a1 and so on-- that's because that's a bad idea. You want to keep the data together in something which you can (at a minimum) iterate over easily, and both dictionaries and list of lists qualify.
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import pprint
import collections
def main():
    list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
    list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
    result = collections.defaultdict(list)
    for list_1_element, list_2_element in zip(list_1, list_2):
        result[list_1_element].append(list_2_element)
    pprint.pprint(result)

main()
Using itertools.izip_longest and itertools.groupby:
>>> from itertools import groupby, izip_longest
>>> inds = [next(g)[0] for k, g in groupby(enumerate(list_1), key=lambda x:x[1])]
First group items of list_1 and find the starting index of each group:
>>> inds
[0, 3, 5]
Now use slicing and izip_longest as we need pairs list_2[0:3], list_2[3:5], list_2[5:]:
>>> [list_2[x:y] for x, y in izip_longest(inds, inds[1:])]
[['iPad', 'iPod', 'iPhone'], ['Windows', 'X-box'], ['Kindle']]
To get a dict of lists you can do something like:
>>> inds = [next(g) for k, g in groupby(enumerate(list_1), key=lambda x:x[1])]
>>> {k: list_2[ind1: ind2[0]] for (ind1, k), ind2 in
...     izip_longest(inds, inds[1:], fillvalue=[None])}
{'A-1': ['iPad', 'iPod', 'iPhone'], 'A-3': ['Kindle'], 'A-2': ['Windows', 'X-box']}
You could do this if you want simple code. It's not pretty, but it gets the job done.
list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
list_1a = []
list_1b = []
list_1c = []
place = 0
for i in list_1[::1]:
    if list_1[place] == 'A-1':
        list_1a.append(list_2[place])
    elif list_1[place] == 'A-2':
        list_1b.append(list_2[place])
    else:
        list_1c.append(list_2[place])
    place += 1

using FOR statement on 2 elements at once python

I have the following list of variables and a mastervariable
a = (1,5,7)
b = (1,3,5)
c = (2,2,2)
d = (5,2,8)
e = (5,5,8)
mastervariable = (3,2,5)
I'm trying to check whether at least 2 elements of each variable exist in the master variable, so that the above would show B (3,5) and D (5,2) as having at least 2 elements matching the mastervariable. Also note that using sets would result in C showing up as matching, but I don't want to count C because only one of the elements in C is in mastervariable (i.e. 2 only shows up once in mastervariable, not twice).
I currently have the very inefficient:
if current_variable[0] == mastervariable[0]:
    if current_variable[1] == mastervariable[1]:
        True
    elif current_variable[2] == mastervariable[1]:
        True
    #### I don't use OR here because I need to know which variables match.
elif current_variable[1] == mastervariable[0]: ##<-- I'm now checking 2nd element
    etc. etc.
I then continue to iterate like the above by checking each one at a time which is extremely inefficient. I did the above because using a FOR statement resulted in me checking the first element twice which was incorrect:
for i in a:
    for j in a:
        ### this checked if 1 was in the master variable and not 1,5 or 1,7
Is there a way to use 2 FOR statements that lets me check 2 elements in a list at once while skipping any element that has already been used? Alternatively, can you suggest an efficient way to do what I'm trying to do?
Edit: Mastervariable can have duplicates in it.
For the case where matching elements can be duplicated so that set breaks, use Counter as a multiset - the duplicates between a and master are found by:
from collections import Counter
count_a = Counter(a)
count_master = Counter(master)
count_both = count_a + count_master
dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
The logic is reasonably intuitive: if there's more of an item in the combined count of a and master, then it is duplicated, and the multiplicity is however many of that item are in whichever of a and master has less of them.
It gives a Counter of all the duplicates, where the count is their multiplicity. If you want it back as a tuple, you can do tuple(dups.elements()):
>>> a
(2, 2, 2)
>>> master
(1, 2, 2)
>>> dups = Counter({e : min((count_a[e], count_master[e])) for e in count_a if count_both[e] > count_a[e]})
>>> tuple(dups.elements())
(2, 2)
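Counter also supports the & operator, which keeps the minimum of the two counts for each element, so the same multiset intersection can be written more compactly. A sketch applied to the data from the question follows; shared_elements and the variables dict are names I introduced for illustration.

```python
from collections import Counter


def shared_elements(candidate, master):
    """Multiset intersection: each value is kept min(count in candidate, count in master) times."""
    return Counter(candidate) & Counter(master)


# Data from the question, stored in a dict so we can report names
variables = {'a': (1, 5, 7), 'b': (1, 3, 5), 'c': (2, 2, 2),
             'd': (5, 2, 8), 'e': (5, 5, 8)}
mastervariable = (3, 2, 5)

matches = {}
for name, values in variables.items():
    common = shared_elements(values, mastervariable)
    # Keep variables that share at least 2 elements, counting duplicates
    if sum(common.values()) >= 2:
        matches[name] = tuple(sorted(common.elements()))

print(matches)
```

For this data only b and d qualify. c shares the value 2 only once, because 2 appears once in mastervariable.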
Seems like a good job for sets. Edit: sets aren't suitable since mastervariable can contain duplicates. Here is a version using Counters.
>>> a = (1,5,7)
>>>
>>> b = (1,3,5)
>>>
>>> c = (2,2,2)
>>>
>>> d = (5,2,8)
>>>
>>> e = (5,5,8)
>>> D=dict(a=a, b=b, c=c, d=d, e=e)
>>>
>>> from collections import Counter
>>> mastervariable = (5,5,3)
>>> mvc = Counter(mastervariable)
>>> for k,v in D.items():
...     vc = Counter(v)
...     if sum(min(count, vc[item]) for item, count in mvc.items()) >= 2:
...         print k
...
b
e
