Count duplicate lists inside a list - python

lis = [ [12,34,56],[45,78,334],[56,90,78],[12,34,56] ]
I want the result to be 2 since number of duplicate lists are 2 in total. How do I do that?
I have done something like this
count=0
for i in range(0, len(lis)-1):
    for j in range(i+1, len(lis)):
        if lis[i] == lis[j]:
            count+=1
But the count value is 1 as it returns matched lists. How do I get the total number of duplicate lists?

Solution
You can use collections.Counter once each sub-list is converted to a tuple. Lists themselves are not hashable, but tuples are, provided their elements (numbers here) are hashable:
>>> from collections import Counter
>>> lis = [[12, 34, 56], [45, 78, 334], [56, 90, 78], [12, 34, 56]]
>>> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
2
>>> lis = [[12, 34, 56], [45, 78, 334], [56, 90, 78], [12, 34, 56], [56, 90, 78], [12, 34, 56]]
>>> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
5
In Steps
Convert your sub-list into tuples:
tuple(x) for x in lis
Count them:
>>> Counter(tuple(x) for x in lis)
Counter({(12, 34, 56): 3, (45, 78, 334): 1, (56, 90, 78): 2})
Take only the values:
>>> Counter(tuple(x) for x in lis).values()
dict_values([3, 1, 2])
Finally, sum only the ones that have a count greater than 1:
>>> sum(y for y in Counter(tuple(x) for x in lis).values() if y > 1)
5
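To see why the tuple conversion in the first step matters, here is a minimal sketch (the variable names are illustrative and not from the answer above). Passing the sub-lists to Counter directly fails because lists are unhashable, while tuples of numbers work:

```python
from collections import Counter

lis = [[12, 34, 56], [45, 78, 334], [56, 90, 78], [12, 34, 56]]

error = None
try:
    # Counter stores its items as dict keys, so every item must be hashable.
    Counter(lis)
except TypeError as exc:
    error = str(exc)

# Tuples of numbers are hashable, so the converted sub-lists can be counted.
counts = Counter(tuple(x) for x in lis)
total = sum(n for n in counts.values() if n > 1)

print(error is not None)  # True
print(total)              # 2
```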
Make it Re-usable
Put it into a function, add a docstring, and a doc test:
"""Count duplicates of sub-lists.
"""

from collections import Counter


def count_duplicates(lis):
    """Count duplicates of sub-lists.

    Assumption: Sub-lists contain only hashable elements.
    Result: If a sub-list appears twice the result is 2.
            If a sub-list appears three times and another twice the result is 5.

    >>> count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
    ...                   [12, 34, 56]])
    2
    >>> count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
    ...                   [12, 34, 56], [56, 90, 78], [12, 34, 56]])
    5
    """
    # Make it a bit more verbose than necessary for readability and
    # educational purposes.
    tuples = (tuple(elem) for elem in lis)
    counts = Counter(tuples).values()
    return sum(elem for elem in counts if elem > 1)


if __name__ == '__main__':
    import doctest
    doctest.testmod(verbose=True)
Run the test:
python count_dupes.py
Trying:
count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
[12, 34, 56]])
Expecting:
2
ok
Trying:
count_duplicates([[12, 34, 56], [45, 78, 334], [56, 90, 78],
[12, 34, 56], [56, 90, 78], [12, 34, 56]])
Expecting:
5
ok
1 items had no tests:
__main__
1 items passed all tests:
2 tests in __main__.count_duplicates
2 tests in 2 items.
2 passed and 0 failed.
Test passed.

Related

Get second minimum values per column in 2D array

How can I get the second minimum value from each column? I have this array:
A = [[72 76 44 62 81 31]
[54 36 82 71 40 45]
[63 59 84 36 34 51]
[58 53 59 22 77 64]
[35 77 60 76 57 44]]
I wish to have output like:
A = [54 53 59 36 40 44]
Try this, in just one line:
[sorted(i)[1] for i in zip(*A)]
in action:
In [12]: A = [[72, 76, 44, 62, 81, 31],
...: [54 ,36 ,82 ,71 ,40, 45],
...: [63 ,59, 84, 36, 34 ,51],
...: [58, 53, 59, 22, 77 ,64],
...: [35 ,77, 60, 76, 57, 44]]
In [18]: [sorted(i)[1] for i in zip(*A)]
Out[18]: [54, 53, 59, 36, 40, 44]
zip(*A) transposes your list of lists, so the columns become rows.
If a column contains its minimum value more than once, for example:
In [19]: A = [[72, 76, 44, 62, 81, 31],
...: [54 ,36 ,82 ,71 ,40, 45],
...: [63 ,59, 84, 36, 34 ,51],
...: [35, 53, 59, 22, 77 ,64], # 35
...: [35 ,77, 50, 76, 57, 44],] # 35
If you need to skip both 35s, you can use set():
In [29]: [sorted(list(set(i)))[1] for i in zip(*A)]
Out[29]: [54, 53, 50, 36, 40, 44]
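As a small illustration of what zip(*A) does (a sketch on a 2x3 list made up for this example), each tuple it yields is one column of the original list:

```python
A = [[1, 2, 3],
     [4, 5, 6]]

# Unpacking A passes each row as a separate argument to zip,
# which then pairs up the items found at the same position.
columns = list(zip(*A))
print(columns)  # [(1, 4), (2, 5), (3, 6)]

# The second-smallest value of each column is then a per-tuple lookup.
second_smallest = [sorted(col)[1] for col in columns]
print(second_smallest)  # [4, 5, 6]
```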
Operations on numpy arrays should be done with numpy functions, so look at this one:
np.sort(A, axis=0)[1, :]
Out[61]: array([54, 53, 59, 36, 40, 44])
You can also use heapq.nsmallest:
from heapq import nsmallest
[nsmallest(2, e)[-1] for e in zip(*A)]
output:
[54, 53, 50, 36, 40, 44]
I added a simple benchmark to compare the performance of the different solutions already posted:
import heapq
from heapq import nsmallest
from random import randint

import numpy as np
from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()

@b.add_function()
def MehrdadPedramfar(A):
    return [sorted(i)[1] for i in zip(*A)]

@b.add_function()
def NicolasGervais(A):
    return np.sort(A, axis=0)[1, :]

@b.add_function()
def imcrazeegamerr(A):
    rotated = zip(*A[::-1])
    result = []
    for arr in rotated:
        # sort each 1d array from min to max
        arr = sorted(list(arr))
        # add the second minimum value to result array
        result.append(arr[1])
    return result

@b.add_function()
def Daweo(A):
    return np.apply_along_axis(lambda x: heapq.nsmallest(2, x)[-1], 0, A)

@b.add_function()
def kederrac(A):
    return [nsmallest(2, e)[-1] for e in zip(*A)]

@b.add_arguments('Number of row/cols (A is square matrix)')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, [[randint(0, 1000) for _ in range(size)] for _ in range(size)]

r = b.run()
r.plot()
Using zip with sorted is the fastest solution for small 2D lists, while zip with heapq.nsmallest proves to be the best on big 2D lists.
I hope I understood your question correctly, but either way here's my solution. I'm sure there is a more elegant way of doing this, but it works:
A = [[72, 76, 44, 62, 81, 31],
     [54, 36, 82, 71, 40, 45],
     [63, 59, 84, 36, 34, 51],
     [58, 53, 59, 22, 77, 64],
     [35, 77, 50, 76, 57, 44]]
# rotate the array 90 degrees
rotated = zip(*A[::-1])
result = []
for arr in rotated:
    # sort each 1d array from min to max
    arr = sorted(list(arr))
    # add the second minimum value to result array
    result.append(arr[1])
print(result)
Assuming that A is a numpy.array (if so, please consider adding the numpy tag to your question), you might use apply_along_axis in the following way:
import heapq
import numpy as np
A = np.array([[72, 76, 44, 62, 81, 31],
              [54, 36, 82, 71, 40, 45],
              [63, 59, 84, 36, 34, 51],
              [58, 53, 59, 22, 77, 64],
              [35, 77, 60, 76, 57, 44]])
second_mins = np.apply_along_axis(lambda x: heapq.nsmallest(2, x)[-1], 0, A)
print(second_mins) # [54 53 59 36 40 44]
Note that I used heapq.nsmallest as it does as much sorting as required to get 2 smallest elements, unlike sorted which does complete sort.
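A minimal sketch of that point, using one column made up for this example. nsmallest(2, ...) returns the same two values as a full sort would. As a caveat, neither skips a repeated minimum, so distinct values still require set:

```python
from heapq import nsmallest

# One column with a repeated minimum (illustrative data).
column = (72, 54, 63, 35, 35)

two_smallest = nsmallest(2, column)
print(two_smallest)                        # [35, 35]
print(two_smallest == sorted(column)[:2])  # True

# If the second *distinct* value is wanted, deduplicate first.
second_distinct = sorted(set(column))[1]
print(second_distinct)                     # 54
```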
>>> A = np.arange(30).reshape(5,6).tolist()
>>> A
[[0, 1, 2, 3, 4, 5],
[6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29]]
Updated:
Use set to skip duplicate values, and transpose the list using zip(*A):
>>> [sorted(set(items))[1] for items in zip(*A)]
[6, 7, 8, 9, 10, 11]
Old: second minimum item in each row:
>>> [sorted(set(items))[1] for items in A]
[1, 7, 13, 19, 25]

Why does my program output the wrong highest number?

When I run this program, it reports a max number that is 1 lower than the actual maximum in the list. For example, when I run the code below it tells me the max number is 91, although the largest value in the list is 92.
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
for eachRow in range(len(examMarks)):
    for eachColumn in range(len(examMarks[eachRow])):
        eachExamMark = (examMarks[eachRow][eachColumn])
max = -100
for everyMark in range(eachExamMark):
    if everyMark > max:
        max = everyMark
print(max)
To be honest, I don't see the reason for the 3 loops. Have you tried something like this?
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
highest = 0
for marks in examMarks:
    for mark in marks:
        highest = max(mark, highest)
print('Highest mark: %d' % highest)
You should try this code!
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 12]]
eachExamMark = []
for eachRow in range(len(examMarks)):
    for eachColumn in range(len(examMarks[eachRow])):
        eachExamMark.append(examMarks[eachRow][eachColumn])
max = -100
for everyMark in eachExamMark:
    if everyMark > max:
        max = everyMark
print(max)
eachExamMark will be set to (92), that is, the number 92, as the last step of the first part of your program. If you then do a for loop over range(92), it will end at 91.
You should at least do:
print(eachExamMark)
before the max = -100 line.
You probably want to do:
eachExamMark.append(examMarks[eachRow][eachColumn])
after defining eachExamMark = [] at the beginning.
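A short sketch of the behaviour described here (the variable names are illustrative). Looping over range(92) visits 0 through 91, so the largest value seen is 91. Looping over a list of the marks sees 92:

```python
each_exam_mark = 92  # what the nested loops leave behind: the last mark only

# range(92) yields 0, 1, ..., 91 and never includes 92 itself.
largest_in_range = max(range(each_exam_mark))
print(largest_in_range)  # 91

# Collecting every mark into a list keeps the real values.
exam_marks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
all_marks = []
for row in exam_marks:
    for mark in row:
        all_marks.append(mark)
largest_mark = max(all_marks)
print(largest_mark)  # 92
```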
I am not sure if you have to solve things this way, but IMHO you should not be using range() at all, and there is no need to build a flattened list either.
You could e.g. do:
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
print(max(max(x) for x in examMarks))
As others point out, the problem is that the last loop iterates over a range built from a single number, not over a list as you expect.
An alternative algorithm:
from itertools import chain
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]
print(max(chain.from_iterable(examMarks)))
chain.from_iterable(examMarks) flattens the list into an iterator.
max() finds the maximum number in it.
Note: My original answer used sum(examMarks, []) to flatten the list. Thanks @John Coleman for your comment on a faster solution.
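For comparison, a minimal sketch of both flattening approaches mentioned in the note (the variable names are illustrative). They produce the same items. The difference is that sum(..., []) builds a new list at every step, while chain.from_iterable yields the items lazily:

```python
from itertools import chain

exam_marks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 92]]

# Repeated list concatenation: simple, but copies the growing list each time.
flat_via_sum = sum(exam_marks, [])

# Lazy iterator over all items; wrapped in list() here only to compare.
flat_via_chain = list(chain.from_iterable(exam_marks))

print(flat_via_sum == flat_via_chain)  # True
print(max(flat_via_chain))             # 92
```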
Try this one:
examMarks = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 12]]
eachExamMark = []
for eachRow in examMarks:
    eachExamMark.append(max(eachRow))
print(max(eachExamMark))

Group Python lists based on repeated items

This question is very similar to "Group Python list of lists into groups based on overlapping items"; in fact, it could be called a duplicate.
Basically, I have a list of sub-lists where each sub-list contains some number of integers (this number is not the same among sub-lists). I need to group all sub-lists that share one integer or more.
The reason I'm asking a new separate question is because I'm attempting to adapt Martijn Pieters' great answer with no luck.
Here's the MWE:
def grouper(sequence):
    result = []  # will hold (members, group) tuples
    for item in sequence:
        for members, group in result:
            if members.intersection(item):  # overlap
                members.update(item)
                group.append(item)
                break
        else:  # no group found, add new
            result.append((set(item), [item]))
    return [group for members, group in result]

gr = [[29, 27, 26, 28], [31, 11, 10, 3, 30], [71, 51, 52, 69],
      [78, 67, 68, 39, 75], [86, 84, 81, 82, 83, 85], [84, 67, 78, 77, 81],
      [86, 68, 67, 84]]
for i, group in enumerate(grouper(gr)):
    print('g{}:'.format(i), group)
and the output I get is:
g0: [[29, 27, 26, 28]]
g1: [[31, 11, 10, 3, 30]]
g2: [[71, 51, 52, 69]]
g3: [[78, 67, 68, 39, 75], [84, 67, 78, 77, 81], [86, 68, 67, 84]]
g4: [[86, 84, 81, 82, 83, 85]]
The last group g4 should have been merged with g3, since the lists inside them share the items 81, 84 and 86, and even a single repeated element should be enough for them to be merged.
I'm not sure if I'm applying the code wrong, or if there's something wrong with the code.
You can describe the merge you want to do as a set consolidation or as a connected-components problem. I tend to use an off-the-shelf set consolidation algorithm and then adapt it to the particular situation. For example, IIUC, you could use something like
def consolidate(sets):
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def wrapper(seqs):
    consolidated = consolidate(map(set, seqs))
    groupmap = {x: i for i, seq in enumerate(consolidated) for x in seq}
    output = {}
    for seq in seqs:
        target = output.setdefault(groupmap[seq[0]], [])
        target.append(seq)
    return list(output.values())
which gives
>>> for i, group in enumerate(wrapper(gr)):
...     print('g{}:'.format(i), group)
...
g0: [[29, 27, 26, 28]]
g1: [[31, 11, 10, 3, 30]]
g2: [[71, 51, 52, 69]]
g3: [[78, 67, 68, 39, 75], [86, 84, 81, 82, 83, 85], [84, 67, 78, 77, 81], [86, 68, 67, 84]]
(Order not guaranteed because of the use of the dictionaries.)
This sounds like set consolidation, if you turn each sub-list into a set. You are interested in the contents, not the order, so sets are the best data-structure choice. See this: http://rosettacode.org/wiki/Set_consolidation
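The connected-components view mentioned above can also be sketched with a union-find (disjoint-set) structure. This is an alternative technique, not the Rosetta Code algorithm and not code from either answer. The function name group_overlapping and its helpers are hypothetical, and the sketch assumes the items are hashable:

```python
def group_overlapping(seqs):
    """Group sub-lists that share at least one item (union-find sketch)."""
    parent = {}  # maps each item to its parent in the disjoint-set forest

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # All items of one sub-list belong to the same component.
    for seq in seqs:
        for item in seq:
            parent.setdefault(item, item)
        for item in seq[1:]:
            union(seq[0], item)

    # Collect the sub-lists by the root of their first item.
    groups = {}
    for seq in seqs:
        if seq:  # empty sub-lists share nothing and are skipped
            groups.setdefault(find(seq[0]), []).append(seq)
    return list(groups.values())


gr = [[29, 27, 26, 28], [31, 11, 10, 3, 30], [71, 51, 52, 69],
      [78, 67, 68, 39, 75], [86, 84, 81, 82, 83, 85], [84, 67, 78, 77, 81],
      [86, 68, 67, 84]]
grouped = group_overlapping(gr)
for i, group in enumerate(grouped):
    print('g{}:'.format(i), group)
```

Unlike the grouper function in the question, this merges components that only become connected by a later sub-list, because the unions are applied to items rather than to already-formed groups.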

Looping and appending to a list

def makelist():
    list = []
    initial = [0, 100]
    a = 1
    while a <= 10:
        new_num = (initial[0] + initial[1])//2
        if new_num%2 == 0:
            initial[1] = new_num
        else:
            initial[0] = new_num
        list.append(initial)
        a += 1
    return list
Why does the above code return:
[[43, 44], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44]]
Rather than:
[[0, 50], [25, 50], [37, 50], [43, 50], [43, 46], [43,44], [43, 44], [43, 44], [43, 44], [43, 44]]
i.e. the successive divisions, rather than the final division repeated 10 times over.
Thanks
First things first: never name your variables after built-in types, in your case list. Your program doesn't work as expected because, with
list.append(initial)
you are appending a reference to the same list again and again. At the end, all the elements in the list point to the same list, initial. To fix that, you can create a copy of initial like this and append it:
list.append(initial[:])
Check this diagram out https://www.lucidchart.com/documents/edit/43c8-0c30-5288c928-a4fe-105f0a00c875. This might help you understand better.
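Putting both suggestions together, a corrected version might look like the sketch below. The names make_list, results, bounds and midpoint are illustrative replacements, not from the question:

```python
def make_list():
    results = []       # avoids shadowing the built-in name `list`
    bounds = [0, 100]
    for _ in range(10):
        midpoint = (bounds[0] + bounds[1]) // 2
        if midpoint % 2 == 0:
            bounds[1] = midpoint
        else:
            bounds[0] = midpoint
        # bounds[:] is a new list, so later changes to bounds
        # do not affect what was already appended.
        results.append(bounds[:])
    return results


print(make_list())
# [[0, 50], [25, 50], [37, 50], [43, 50], [43, 46], [43, 44], [43, 44], [43, 44], [43, 44], [43, 44]]
```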

What's going on here? Repeating rows in random list of lists

I expected to get a grid of unique random numbers. Instead each row is the same sequence of numbers. What's going on here?
from pprint import pprint
from random import random
nrows, ncols = 5, 5
grid = [[0] * ncols] * nrows
for r in range(nrows):
    for c in range(ncols):
        grid[r][c] = int(random() * 100)
pprint(grid)
Example output:
[[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36],
[64, 82, 90, 69, 36]]
I think that this is because Python makes only a shallow copy of the list when you call
grid = [...] * nrows
I tried hard coding the list and it worked correctly:
>>> grid = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
>>> for r in range(nrows):
...     for c in range(ncols):
...         grid[r][c] = int(random() * 100)
...
>>> pprint(grid)
[[67, 40, 41, 50, 92],
[26, 42, 64, 77, 77],
[65, 67, 88, 77, 76],
[36, 21, 41, 29, 25],
[98, 77, 38, 40, 96]]
This tells me that when Python copies the list 5 times, all it is doing is storing 5 pointers to your first list. Then, when you change the values in that list, you are actually just changing the value in the first list, and it gets reflected in all lists which point to that one.
Using your method, you can't update the rows independently.
Instead, I would suggest changing your list generation line to look more like this:
grid = [[0] * ncols for row in range(nrows)]
That should create 5 independent lists for you.
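The difference can be checked directly with the is operator. This is a minimal sketch with a smaller grid; the names shared and independent are illustrative:

```python
nrows, ncols = 3, 4

# Multiplying the outer list repeats a reference to one inner list.
shared = [[0] * ncols] * nrows
# The comprehension evaluates [0] * ncols once per row, giving separate lists.
independent = [[0] * ncols for _ in range(nrows)]

print(shared[0] is shared[1])            # True
print(independent[0] is independent[1])  # False

shared[0][0] = 1
independent[0][0] = 1
print(shared)       # [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
print(independent)  # [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```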
