Group Python lists based on repeated items

Group Python lists based on repeated items - python

This question is very similar to this one Group Python list of lists into groups based on overlapping items, in fact it could be called a duplicate.
Basically, I have a list of sub-lists where each sub-list contains some number of integers (this number is not the same among sub-lists). I need to group all sub-lists that share one integer or more.
The reason I'm asking a new separate question is because I'm attempting to adapt Martijn Pieters' great answer with no luck.
Here's the MWE:
def grouper(sequence):
result = [] # will hold (members, group) tuples
for item in sequence:
for members, group in result:
if members.intersection(item): # overlap
members.update(item)
group.append(item)
break
else: # no group found, add new
result.append((set(item), [item]))
return [group for members, group in result]
gr = [[29, 27, 26, 28], [31, 11, 10, 3, 30], [71, 51, 52, 69],
[78, 67, 68, 39, 75], [86, 84, 81, 82, 83, 85], [84, 67, 78, 77, 81],
[86, 68, 67, 84]]
for i, group in enumerate(grouper(gr)):
print 'g{}:'.format(i), group
and the output I get is:
g0: [[29, 27, 26, 28]]
g1: [[31, 11, 10, 3, 30]]
g2: [[71, 51, 52, 69]]
g3: [[78, 67, 68, 39, 75], [84, 67, 78, 77, 81], [86, 68, 67, 84]]
g4: [[86, 84, 81, 82, 83, 85]]
The last group g4 should have been merged with g3, since the lists inside them share the items 81, 83 and 84, and even a single repeated element should be enough for them to be merged.
I'm not sure if I'm applying the code wrong, or if there's something wrong with the code.

You can describe the merge you want to do as a set consolidation or as a connected-components problem. I tend to use an off-the-shelf set consolidation algorithm and then adapt it to the particular situation. For example, IIUC, you could use something like
def consolidate(sets):
# http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
setlist = [s for s in sets if s]
for i, s1 in enumerate(setlist):
if s1:
for s2 in setlist[i+1:]:
intersection = s1.intersection(s2)
if intersection:
s2.update(s1)
s1.clear()
s1 = s2
return [s for s in setlist if s]
def wrapper(seqs):
consolidated = consolidate(map(set, seqs))
groupmap = {x: i for i,seq in enumerate(consolidated) for x in seq}
output = {}
for seq in seqs:
target = output.setdefault(groupmap[seq[0]], [])
target.append(seq)
return list(output.values())
which gives
>>> for i, group in enumerate(wrapper(gr)):
... print('g{}:'.format(i), group)
...
g0: [[29, 27, 26, 28]]
g1: [[31, 11, 10, 3, 30]]
g2: [[71, 51, 52, 69]]
g3: [[78, 67, 68, 39, 75], [86, 84, 81, 82, 83, 85], [84, 67, 78, 77, 81], [86, 68, 67, 84]]
(Order not guaranteed because of the use of the dictionaries.)

Sounds like set consolidation if you turn each sub list into a set instead as you are interested in the contents not the order so sets are the best data-structure choice. See this: http://rosettacode.org/wiki/Set_consolidation

Related

Faster/lazier way to evenly and randomly split m*n into n group (each has m elements) in python

I want to split m*n elements (e.g., 1, 2, ..., m*n) into n group randomly and evenly such that each group has m random elements. Each group will process k (k>=1) elements at one time from its own group and at the same speed (via some synchronization mechanism), until all group has processed all their own elements. Actually each group is in an independent process/thread.
I use numpy.random.choice(m*n, m*n, replace=False) to generate the permutation first, and then index the permuted result from each group.
The problem is that when m*n is very large (e.g., >=1e8), the speed is very slow (tens of seconds or minutes).
Is there any faster/lazier way to do this? I think maybe this can be done in a lazier way, which is not generating the permuted result in the first time, but generate a generator first, and in each group, generate k elements at each time, and its effect should be identical to the method I currently use. But I don't know how to achieve this lazy way. And I am not sure whether this can be implemented actually.

You can make a generator that will progressively shuffle (a copy of) the list and lazily yield distinct groups:
import random
def rndGroups(A,size):
A = A.copy() # work on a copy (if needed)
p = len(A) # target position of random item
for _ in range(0,len(A),size): # work in chunks of group size
for _ in range(size): # Create one group
i = random.randrange(p) # random index in remaining items
p -= 1 # update randomized position
A[i],A[p] = A[p],A[i] # swap items
yield A[p:p+size] # return shuffled sub-range
Output:
A = list(range(100))
iG = iter(rndGroups(A,10)) # 10 groups of 10 items
s = set() # set to validate uniqueness
for _ in range(10): # 10 groups
g = next(iG) # get the next group from generator
s.update(g) # to check that all items are distinct
print(g)
print(len(s)) # must get 100 distinct values from groups
[87, 19, 85, 90, 35, 55, 86, 58, 96, 68]
[38, 92, 93, 78, 39, 62, 43, 20, 66, 44]
[34, 75, 72, 50, 42, 52, 60, 81, 80, 41]
[13, 14, 83, 28, 53, 5, 94, 67, 79, 95]
[9, 33, 0, 76, 4, 23, 2, 3, 32, 65]
[61, 24, 31, 77, 36, 40, 47, 49, 7, 97]
[63, 15, 29, 25, 11, 82, 71, 89, 91, 30]
[12, 22, 99, 37, 73, 69, 45, 1, 88, 51]
[74, 70, 98, 26, 59, 6, 64, 46, 27, 21]
[48, 17, 18, 8, 54, 10, 57, 84, 16, 56]
100
This will take just as long as pre-shuffling the list (if not longer) but it will let you start/feed threads as you go, thus augmenting the parallelism

Remove elements in a list if difference with previous element less than value

Given a list of numbers in ascending order. It is necessary to leave only elements to get such a list where the difference between the elements was greater or equal than a certain value (10 in my case).
Given:
list = [10,15,17,21,34,36,42,67,75,84,92,94,103,115]
Goal:
list=[10,21,34,67,84,94,115]

you could use a while loop and a variable to track the current index you are currently looking at. So starting at index 1, check if the number at this index minus the number in the previous index is less than 10. If it is then delete this index but keep the index counter the same so we look at the next num that is now in this index. If the difference is 10 or more increase the index to look at the next num. I have an additional print line in the loop you can remove this is just to show the comparing.
nums = [10, 15, 17, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
index = 1
while index < len(nums):
print(f"comparing {nums[index-1]} with {nums[index]} nums list {nums}")
if nums[index] - nums[index - 1] < 10:
del nums[index]
else:
index += 1
print(nums)
OUTPUT
comparing 10 with 15 nums list [10, 15, 17, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 10 with 17 nums list [10, 17, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 10 with 21 nums list [10, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 21 with 34 nums list [10, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 34 with 36 nums list [10, 21, 34, 36, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 34 with 42 nums list [10, 21, 34, 42, 67, 75, 84, 92, 94, 103, 115]
comparing 34 with 67 nums list [10, 21, 34, 67, 75, 84, 92, 94, 103, 115]
comparing 67 with 75 nums list [10, 21, 34, 67, 75, 84, 92, 94, 103, 115]
comparing 67 with 84 nums list [10, 21, 34, 67, 84, 92, 94, 103, 115]
comparing 84 with 92 nums list [10, 21, 34, 67, 84, 92, 94, 103, 115]
comparing 84 with 94 nums list [10, 21, 34, 67, 84, 94, 103, 115]
comparing 94 with 103 nums list [10, 21, 34, 67, 84, 94, 103, 115]
comparing 94 with 115 nums list [10, 21, 34, 67, 84, 94, 115]
[10, 21, 34, 67, 84, 94, 115]

You could build up the list in a loop. Start with the first number in the list. Keep track of the last number chosen to be in the new list. Add an item to the new list only when it differs from the last number chosen by at least the target amount:
my_list = [10,15,17,21,34,36,42,67,75,84,92,94,103,115]
last_num = my_list[0]
new_list = [last_num]
for x in my_list[1:]:
if x - last_num >= 10:
new_list.append(x)
last_num = x
print(new_list) #prints [10, 21, 34, 67, 84, 94, 115]

This problem can be solved fairly simply by iterating over your initial set of values, and adding them to your new list only when your difference of x condition is met.
Additionally, by putting this functionality into a function, you can get easily swap out the values or the minimum distance.
values = [10,15,17,21,34,36,42,67,75,84,92,94,103,115]
def foo(elements, distance):
elements = sorted(elements) # sorting the user input
new_elements = [elements[0]] # make a new list for output
for element in elements[1:]: # Iterate over the remaining elements...
if element - new_elements[-1] >= distance:
# this is the condition you described above
new_elements.append(element)
return new_elements
print(foo(values, 10))
# >>> [10, 21, 34, 67, 84, 94, 115]
print(foo(values, 5))
# >>> [10, 15, 21, 34, 42, 67, 75, 84, 92, 103, 115]
A few other notes here...
I sorted the array before I processed it. You may not want to do that for your particular application, but it seemed to make sense, since your sample data was already sorted. In the case that you don't want to sort the data before you build the list, you can remove the sorted on the line that I commented above.
I named the function foo because I was lazy and didn't want to think about the name. I highly recommend that you give it a more descriptive name.

Count duplicate lists inside a list

lis = [ [12,34,56],[45,78,334],[56,90,78],[12,34,56] ]
I want the result to be 2 since number of duplicate lists are 2 in total. How do I do that?
I have done something like this
count=0
for i in range(0, len(lis)-1):
for j in range(i+1, len(lis)):
if lis[i] == lis[j]:
count+=1
But the count value is 1 as it returns matched lists. How do I get the total number of duplicate lists?

Solution
You can use collections.Counter if your sub-lists only contain numbers and therefore are hashable:
>>> from collections import Counter
>>> lis = [[12, 34, 56], [45, 78, 334], [56, 90, 78], [12, 34, 56]]
>>> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
2
>>> lis = [[12, 34, 56], [45, 78, 334], [56, 90, 78], [12, 34, 56], [56, 90, 78], [12, 34, 56]]
>>> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
5
In Steps
Convert your sub-list into tuples:
tuple(x) for x in lis
Count them:
>>> Counter(tuple(x) for x in lis)
Counter({(12, 34, 56): 3, (45, 78, 334): 1, (56, 90, 78): 2})
take only the values:
>>> Counter(tuple(x) for x in lis).values()
dict_values([3, 1, 2])
Finally, sum only the ones that have a count greater than 1:
> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
5
Make it Re-usable
Put it into a function, add a docstring, and a doc test:
"""Count duplicates of sub-lists.
"""
from collections import Counter
def count_duplicates(lis):
"""Count duplicates of sub-lists.
Assumption: Sub-list contain only hashable elements.
Result: If a sub-list appreas twice the result is 2.
If a sub-list aprears three time and a other twice the result is 5.
>>> count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
... [12, 34, 56]])
2
>>> count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
... [12, 34, 56], [56, 90, 78], [12, 34, 56]])
...
5
"""
# Make it a bit more verbose than necessary for readability and
# educational purposes.
tuples = (tuple(elem) for elem in lis)
counts = Counter(tuples).values()
return sum(elem for elem in counts if elem > 1)
if __name__ == '__main__':
import doctest
doctest.testmod(verbose=True)
Run the test:
python count_dupes.py
Trying:
count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
[12, 34, 56]])
Expecting:
2
ok
Trying:
count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
[12, 34, 56], [56, 90, 78], [12, 34, 56]])
Expecting:
5
ok
1 items had no tests:
__main__
1 items passed all tests:
2 tests in __main__.count_duplicates
2 tests in 2 items.
2 passed and 0 failed.
Test passed.

Why does my program output the wrong highest number?

When I run this program, it says that the max number is 1 digit lower than it actually is in the list. For example, I run the code below and it tells me the max number is 91 when it is 92 from the list.
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
for eachRow in range(len(examMarks)):
for eachColumn in range(len(examMarks[eachRow])):
eachExamMark = (examMarks[eachRow][eachColumn])
max = -100
for everyMark in range(eachExamMark):
if everyMark > max:
max = everyMark
print(max)

I don't see the reason for the 3 loops to be honest, have you tried something like this?
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
highest = 0
for marks in examMarks:
for mark in marks:
highest = max(mark, highest)
print('Highest mark: %d' % highest)

You should try this code!
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 12]]
eachExamMark =[]
for eachRow in range(len(examMarks)):
for eachColumn in range(len(examMarks[eachRow])):
eachExamMark.append(examMarks[eachRow][eachColumn])
max = -100
for everyMark in eachExamMark:
if everyMark > max:
max = everyMark
print(max)

eachExamMark will be set to (92) that is the number 92 as the last step of the first part of your program. If you do a for loop over range(92) it will end at 91.
You should at least do:
print(eachExamMark)
before the max = -100 line.
You probably want to do:
eachExamMark.append(examMarks[eachRow][eachColumn])
after defining eachExamMark = [] at the beginning.
I am not sure if you have to solve things this way, but IMHO you should not be using range() at all, and there is no need to build a flattened list either.
You could e.g. do:
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
print(max(max(x) for x in examMarks))

As others point out the problem is the last loop is performing a range of a number, not a list as you expect.
An alternative algorithm:
from itertools import chain
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
print(max(chain.from_iterable(examMarks)))
chain.from_iterable(examMarks) flattens the list to an iterator
max() finds the maximum number on the list
Note: My original answer used sum(examMarks, []) to flatten the list. Thanks #John Coleman for your comment on a faster solution.

Try this one:
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 12]]
eachExamMark =[]
for eachRow in examMarks:
eachExamMark.append(max(eachRow))
print max(eachExamMark)

What's going on here? Repeating rows in random list of lists

I expected to get a grid of unique random numbers. Instead each row is the same sequence of numbers. What's going on here?
from pprint import pprint
from random import random
nrows, ncols = 5, 5
grid = [[0] * ncols] * nrows
for r in range(nrows):
for c in range(ncols):
grid[r][c] = int(random() * 100)
pprint(grid)
Example output:
[[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36]]

I think that this is because python uses a weak copy of the list when you call
grid = [...] * nrows
I tried hard coding the list and it worked correctly:
>>> grid = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
>>> for r in range(nrows):
... for c in range(ncols):
... grid[r][c] = int(random() * 100)
...
>>> pprint(grid)
[[67, 40, 41, 50, 92],
[26, 42, 64, 77, 77],
[65, 67, 88, 77, 76],
[36, 21, 41, 29, 25],
[98, 77, 38, 40, 96]]
This tells me that when python copies the list 5 times, all it is doing is storing 5 pointers to your first list - then, when you change the values in that list, you are actually just changing the value in the first list and it gets reflected in all lists which point to that one.
Using your method, you can't update all the list independently.
Instead, I would suggest changing your list generation line to look more like this:
grid = [[0] * ncols for row in range(ncols)]
That should create 5 independent lists for you.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group Python lists based on repeated items - python

Sounds like set consolidation if you turn each sub list into a set instead as you are interested in the contents not the order so sets are the best data-structure choice. See this: http://rosettacode.org/wiki/Set_consolidation

Related

Faster/lazier way to evenly and randomly split m*n into n group (each has m elements) in python

Remove elements in a list if difference with previous element less than value

Count duplicate lists inside a list

Why does my program output the wrong highest number?

What's going on here? Repeating rows in random list of lists

Categories

Resources