Finding duplicates in few lists - python

In my case a duplicate is not just an item that reappears in one list; it must also reappear at the same positions in the other lists. For example:
list1 = [1,2,3,3,3,4,5,5]
list2 = ['a','b','b','c','b','d','e','e']
list3 = ['T1','T2','T3','T4','T3','T4','T5','T5']
So the positions of the real duplicates across all 3 lists are [2,4] and [6,7]: in list1 3 is repeated, in list2 'b' is repeated at the same positions as in list1, and in list3 'T3' is repeated there too. In the second case 5, 'e' and 'T5' are the duplicated items at the same positions in their lists. I am having a hard time producing the results "automatically" in one step.
1) I find the duplicates in the first list
# Find duplicated part numbers (exact matches)
def list_duplicates(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list(seen_twice)
# List of duplicated part numbers
D_list1 = list_duplicates(list1)
D_list2 = list_duplicates(list2)
2) Then I find the positions of a given duplicate and look at those positions in the second list
# find the row positions of duplicated part numbers
def list_position_duplicates(list1,n,D_list1):
    position = []
    gen = (i for i,x in enumerate(list1) if x == D_list1[n])
    for i in gen: position.append(i)
    return position
# Actual calculation: find the row positions of duplicated part numbers, beginning and end
lpd_part = list_position_duplicates(list1,1,D_list1)
start = lpd_part[0]
end = lpd_part[-1]
lpd_parent = list_position_duplicates(list2[start:end+1],0,D_list2)
So in step 2 I have to supply n (the index of the found duplicate in the list of duplicates) by hand. I would like to do this step automatically, so that I get the positions of the elements duplicated at the same positions in all lists, for all duplicates at the same time and not one by one "manually". I think it just needs a for loop or an if, but I'm new to Python; I tried many combinations and none of them worked.

You can use the items from all 3 lists at the same index as a key and store the corresponding indices as the value (in a list). If more than one index is stored for a key, that key is a duplicate:
def solve(*lists):
    d = {}
    # zip yields one tuple per index, holding the items of all lists at that index
    for i, k in enumerate(zip(*lists)):
        d.setdefault(k, []).append(i)
    for k, v in d.items():
        if len(v) > 1:
            print(k, v)
solve(list1, list2, list3)
#(3, 'b', 'T3') [2, 4]
#(5, 'e', 'T5') [6, 7]
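For reuse, the same idea can be written so that it returns the groups instead of printing them. This is only a minimal Python 3 sketch: the function name positional_duplicates is hypothetical and not part of the answer above, and it assumes the items are hashable.

```python
from collections import defaultdict

def positional_duplicates(*lists):
    """Map each repeated tuple of aligned items to the indices where it occurs."""
    positions = defaultdict(list)
    # zip pairs up the items that share an index across all lists
    for index, key in enumerate(zip(*lists)):
        positions[key].append(index)
    # keep only the tuples that were seen at more than one index
    return {key: found for key, found in positions.items() if len(found) > 1}

list1 = [1, 2, 3, 3, 3, 4, 5, 5]
list2 = ['a', 'b', 'b', 'c', 'b', 'd', 'e', 'e']
list3 = ['T1', 'T2', 'T3', 'T4', 'T3', 'T4', 'T5', 'T5']
result = positional_duplicates(list1, list2, list3)
print(result)  # {(3, 'b', 'T3'): [2, 4], (5, 'e', 'T5'): [6, 7]}
```

Returning a dictionary keeps the duplicated values and their positions together, so the caller can decide how to present them.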

Related

How to Iterate over two lists and position the elements of the output list differently in python pandas?

How can I iterate over two lists so that the output list has the first value of the first list as its first element, the first value of the second list as its last element, the second value of the first list as its second element, the second value of the second list as its second-to-last element, and so on, with duplicates removed?
Example: a=['A','C','B','E','D']
b=['B','D','A','E','C']
Output: c=['A','C','E','D','B']
This is a possible solution:
from itertools import zip_longest
lst = [[], []]
s = set()
for t in zip_longest(a, b):
    for i, x in enumerate(t):
        if x is not None and x not in s:
            lst[i].append(x)
            s.add(x)
c = lst[0] + lst[1][::-1]
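To check the approach against the example from the question, it can be wrapped in a function. This is a sketch under assumptions: the name merge_ends is hypothetical, and None is assumed never to be a real value, because zip_longest pads the shorter list with None.

```python
from itertools import zip_longest

def merge_ends(a, b):
    """Fill the result from the front with a and from the back with b, skipping repeats."""
    front, back = [], []
    seen = set()
    for x, y in zip_longest(a, b):
        # the item from a is considered first, so it wins when both are new and equal
        if x is not None and x not in seen:
            front.append(x)
            seen.add(x)
        if y is not None and y not in seen:
            back.append(y)
            seen.add(y)
    # items taken from b were collected front to back, so reverse them
    return front + back[::-1]

a = ['A', 'C', 'B', 'E', 'D']
b = ['B', 'D', 'A', 'E', 'C']
c = merge_ends(a, b)
print(c)  # ['A', 'C', 'E', 'D', 'B']
```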

Save list number within a list only if it contains elements in python

I have a list of lists such as:
my_list_of_list=[['A','B','C','E'],['A','B','C','E','F'],['D','G','A'],['X','Z'],['D','M'],['B','G'],['X','Z']]
As you can see, lists 1 and 2 share the most elements (4). So I want to keep a list from my_list_of_list only if at least one of those 4 shared elements (A, B, C or E) is present in that list.
Here I would then save in list_shared_number only lists 1, 2, 3 and 6 (indices 0, 1, 2 and 5), since the others do not contain A, B, C or E.
Expected output:
print(list_shared_number)
[0,1,2,5]
Probably suboptimal because I need to iterate 3 times over the lists, but it gives the expected result:
from itertools import combinations
from functools import reduce
common_elements = [set(i).intersection(j)
                   for i, j in combinations(my_list_of_list, r=2)]
common_element = reduce(lambda i, j: i if len(i) >= len(j) else j, common_elements)
list_shared_number = [idx for idx, l in enumerate(my_list_of_list)
                      if common_element.intersection(l)]
print(list_shared_number)
# Output
[0, 1, 2, 5]
Alternative with 2 iterations:
common_element = set()
for i, j in combinations(my_list_of_list, r=2):
    c = set(i).intersection(j)
    common_element = c if len(c) > len(common_element) else common_element
list_shared_number = [idx for idx, l in enumerate(my_list_of_list)
                      if common_element.intersection(l)]
print(list_shared_number)
# Output
[0, 1, 2, 5]
You can find the shared elements with a list comprehension that compares the lists at index 0 and index 1:
share = [x for x in my_list_of_list[0] if x in my_list_of_list[1]]
print(share)
Let x be an inner list and j each of its items; then [j for j in x if j in share] collects the items of x that are shared. If the length of this list is greater than 0, the index of x should be included in the output.
So the final code looks like this:
share = [x for x in my_list_of_list[0] if x in my_list_of_list[1]]
my_list = [i for i, x in enumerate(my_list_of_list) if len([j for j in x if j in share]) > 0]
print(my_list)
You can use itertools.combinations and set operations.
In the first line, you find the intersection that is the longest among pairs of lists. In the second line, you iterate over my_list_of_list to identify the lists that contain elements from the set you found in the first line.
from itertools import combinations
comparison = max(map(lambda x: (len(set(x[0]).intersection(x[1])), set(x[0]).intersection(x[1])), combinations(my_list_of_list, 2)))[1]
out = [i for i, lst in enumerate(my_list_of_list) if comparison - set(lst) != comparison]
Output:
[0, 1, 2, 5]
Oh boy, so mine is a bit messy, however I did not use any imports AND I included the initial "finding" of the two lists which have the most in common with one another. This can easily be optimised but it does do exactly what you wanted.
my_list_of_list=[['A','B','C','E'],['A','B','C','E','F'],['D','G','A'],['X','Z'],['D','M'],['B','G'],['X','Z']]
my_list_of_list = list(map(set,my_list_of_list))
mostIntersects = [0, (None,)]
for i, IndSet in enumerate(my_list_of_list):
    for j in range(i+1,len(my_list_of_list)):
        intersects = len(IndSet.intersection(my_list_of_list[j]))
        if intersects > mostIntersects[0]: mostIntersects = [intersects, (i,j)]
FinalIntersection = set(my_list_of_list[mostIntersects[1][0]]).intersection(my_list_of_list[mostIntersects[1][1]])
skipIndexes = set(mostIntersects[1])
for i,sub_list in enumerate(my_list_of_list):
    [skipIndexes.add(i) for char in sub_list
     if i not in skipIndexes and char in FinalIntersection]
print(*map(list,(mostIntersects, FinalIntersection, skipIndexes)), sep = '\n')
The print provides this :
[4, (0, 1)]
['E', 'C', 'B', 'A']
[0, 1, 2, 5]
This works by first converting the lists to sets using the map function (the result has to be turned back into a list so I can use len and iterate properly). I then intersect each set with the others in the list of lists and count how many elements each intersection has. Each time I find one with a larger count, I set mostIntersects to that count and the pair of indexes. Once I have gone through them all, I take the sets at the two indexes (0 and 1 in this case) and intersect them to get the elements [A,B,C,E] (variable FinalIntersection). From there, I iterate over all lists that are not already included and check whether any of their elements are found in FinalIntersection. If one is, the index of the list is added to skipIndexes. This results in the final collection of indexes that you were after. Technically the result is a set, but to convert it back you can use list({0,1,2,5}), which gives you the value you were after.
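The steps described in this thread (find the largest pairwise intersection, then keep every list that shares at least one element with it) can also be condensed into one function. This is a sketch rather than any answerer's exact code: the function name indices_sharing_largest_overlap is hypothetical, and ties between equally large intersections are resolved by taking the first one found.

```python
from itertools import combinations

def indices_sharing_largest_overlap(lists):
    """Return indices of lists that share an element with the largest pairwise intersection."""
    sets = [set(lst) for lst in lists]
    # largest intersection over all pairs; default covers inputs with fewer than two lists
    best = max((s & t for s, t in combinations(sets, 2)), key=len, default=set())
    # a non-empty intersection with best means the list contains a shared element
    return [i for i, s in enumerate(sets) if s & best]

my_list_of_list = [['A', 'B', 'C', 'E'], ['A', 'B', 'C', 'E', 'F'], ['D', 'G', 'A'],
                   ['X', 'Z'], ['D', 'M'], ['B', 'G'], ['X', 'Z']]
list_shared_number = indices_sharing_largest_overlap(my_list_of_list)
print(list_shared_number)  # [0, 1, 2, 5]
```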

Grouping a grouped list of str without duplicates

I have a grouped list of strings that looks roughly like this; each inner list always contains 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
             ['fff','ggg','hhh','iii','jjj'],
             ['xxx','mmm','ccc','bbb','aaa'],
             ['fff','xxx','aaa','bbb','ddd'],
             ['aaa','bbb','ccc','ddd','eee'],
             ['fff','xxx','aaa','ddd','eee'],
             ['iii','xxx','ggg','jjj','aaa']]
The objective is simple: take the first 3 elements of each list and group that list with every other list that contains all 3 of those elements.
So from the above example the output might look like this (the output holds the indices of the lists):
[[0,2,4],[3,5]]
Notice that a group containing the same indices as another group, just in a different order, is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate groups afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, using only the first 3 elements of each list for the comparison. For example, take the first 3 elements of list A and check whether lists B, C, D... contain all 3 of them. Repeat for every list, then remove any group that duplicates another.
You can build a set of frozensets of indices, where each group holds the indices of all lists that contain the first 3 items of a given list as a subset:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
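The <= comparison in the loop is Python's subset test for sets. A small illustration with rows from the question (the variable names are only for demonstration):

```python
first_three = {'aaa', 'bbb', 'ccc'}            # first 3 items of row 0
row_2 = {'xxx', 'mmm', 'ccc', 'bbb', 'aaa'}    # all items of row 2
row_3 = {'fff', 'xxx', 'aaa', 'bbb', 'ddd'}    # all items of row 3

in_row_2 = first_three <= row_2   # True: every item is present in row 2
in_row_3 = first_three <= row_3   # False: 'ccc' is missing from row 3
print(in_row_2, in_row_3)  # True False
```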
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use that set to filter all combinations of 3 items from each sub-list, so that the time complexity is essentially linear in the length of the input list rather than quadratic, despite the overhead of generating combinations:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
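The overhead of generating combinations stays constant per row because every row has exactly 5 elements, which gives 10 three-item combinations. A quick check, assuming Python 3.8 or later for math.comb:

```python
from itertools import combinations
from math import comb

row = ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
# every 3-item combination of the row, as a hashable frozenset
candidates = [frozenset(c) for c in combinations(row, 3)]
print(len(candidates))      # 10
print(comb(len(row), 3))    # 10, the binomial coefficient C(5, 3)
```

Each row therefore costs 10 set lookups regardless of how many rows there are.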

how to generate list of lists from conditional items in another list of lists

I have a list of lists and am trying to make another list of lists from specific items within the first list of lists
listOne = [[1,1,9,9],[1,4,9,6],[2,1,12,12]]
listTwo = []
For every group of inner lists with the same numbers in positions 0 and 2, I want to append to listTwo only the inner list with the largest value in position 3.
For example, inner list 0 and inner list 1 both have a 1 in position 0 and a 9 in position 2, but inner list 0 has a 9 in position 3 and inner list 1 has a 6 in position 3, so I want to append inner list 0 and not inner list 1 to listTwo. Since inner list 2 is the only list with a 2 in position 0 and a 12 in position 2, it need not be compared to anything else and can be appended to listTwo.
I'm thinking something like:
for items in listOne :
    #for all items where items[0] and items[2] are equal :
        #tempList = []
        #tempList.append(items)
        #tempList.sort(key = lambda x: (x[3]))
        #for value in tempList[0] :
            #listTwo.append(all lists with value in tempList[0])
but I'm not sure how to implement this without a lot of really bad-looking code. Any suggestions for a "pythonic" way of sorting these lists?
If you're looking to write concise Python, you will want to use list comprehensions wherever possible. Your description was a little confusing, but something like
list_two = [inner_list for inner_list in list_one if inner_list[0] == inner_list[2]]
will get you all of the inner lists in which the values at indices 0 and 2 match. Then you can search these to find the one with the largest value at index 3, assuming there aren't any ties
list_three = [0,0,0,0]
for i in list_two:
    if i[3] > list_three[3]:
        list_three = i
Perhaps throwing everything into a dictionary? Something like this:
def strangeFilter(listOne):
    listTwo = []
    d = {}
    # group the inner lists by their values at positions 0 and 2
    for innerList in listOne:
        positions = (innerList[0],innerList[2])
        if positions not in d:
            d[positions] = []
        d[positions].append(innerList)
    # from each group keep the inner list with the largest value at position 3
    for positions in d:
        listTwo.append(max(d[positions], key= lambda x: x[3]))
    return listTwo
Not sure how 'pythonic' this solution is, but it uses Python's built-in structures and runs in O(n) time.
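A variation on the dictionary idea keeps only the current best inner list per key, so no intermediate groups are stored. This is a sketch with assumptions: the name keep_largest is hypothetical, ties keep the first inner list seen, and the output order relies on dictionaries preserving insertion order (Python 3.7+).

```python
def keep_largest(rows):
    """For each (row[0], row[2]) key keep the row with the largest row[3]."""
    best = {}
    for row in rows:
        key = (row[0], row[2])
        # replace the stored row only when the new one is strictly larger at position 3
        if key not in best or row[3] > best[key][3]:
            best[key] = row
    return list(best.values())

listOne = [[1, 1, 9, 9], [1, 4, 9, 6], [2, 1, 12, 12]]
listTwo = keep_largest(listOne)
print(listTwo)  # [[1, 1, 9, 9], [2, 1, 12, 12]]
```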
Sort the list on items zero and two of the inner lists. Then, using itertools.groupby, extract the item in each group that has the maximum value at position 3.
import operator, itertools
# a couple of useful callables for the key functions
zero_two = operator.itemgetter(0,2)
three = operator.itemgetter(3)
a = [[2,1,12,22],[1,1,9,9],[2,1,12,10],
     [1,4,9,6],[8,8,8,1],[2,1,12,12],
     [1,3,9,8],[2,1,12,15],[8,8,8,0]
    ]
a.sort(key = zero_two)
for key, group in itertools.groupby(a, zero_two):
    print(key, max(group, key = three))
'''
>>>
(1, 9) [1, 1, 9, 9]
(2, 12) [2, 1, 12, 22]
(8, 8) [8, 8, 8, 1]
>>>
'''
result = [max(group, key = three) for key, group in itertools.groupby(a, zero_two)]
You could also sort on items zero, two, three. Then group by items zero and two and extract the last item of the group.
zero_two_three = operator.itemgetter(0,2,3)
zero_two = operator.itemgetter(0,2)
last_item = operator.itemgetter(-1)
a.sort(key = zero_two_three)
for key, group in itertools.groupby(a, zero_two):
    print(key, last_item(list(group)))

Get list based on occurrences in unknown number of sublists

I'm looking for a way to turn a list of lists (a below) into a single list (b below) under 2 conditions:
The order of the new list (b) is based on the number of times each value occurs across the lists in a.
A value can only appear once.
Basically turn a into b:
a = [[1,2,3,4], [2,3,4], [4,5,6]]
# value 4 occurs 3 times in list a and gets first position
# value 2 occurs 2 times in list a and get second position and so on...
b = [4,2,3,1,5,6]
I figure one could do this with set and some list magic, but I can't get my head around it when a can contain any number of lists. The list a is created from user input (I guess it can contain between 1 and 20 lists with up to 200-300 items in each).
I'm trying something along the lines of [set(l) for l in a], but I don't know how to perform set(l) & set(l)... to get all matched items.
Is this possible without a for loop iterating (sublist count * items in sublist) times?
I think this is probably the closest you're going to get:
from collections import defaultdict
d = defaultdict(int)
for sub in a:
    for val in sub:
        d[val] += 1
print(sorted(d.keys(), key=lambda k: d[k], reverse = True))
# Output: [4, 2, 3, 1, 5, 6]
There is an off chance that the order of elements that appear an identical number of times is indeterminate: before Python 3.7, the output of d.keys() is not guaranteed to be ordered.
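The same counting can be done with collections.Counter from the standard library. This is a sketch, not part of the answer above; it relies on Counter.most_common returning elements with equal counts in the order they were first encountered, which holds from Python 3.7 on.

```python
from collections import Counter
from itertools import chain

a = [[1, 2, 3, 4], [2, 3, 4], [4, 5, 6]]
# count every value across all sublists in a single pass
counts = Counter(chain.from_iterable(a))
# most_common() sorts by count, highest first; ties keep first-seen order
b = [value for value, _ in counts.most_common()]
print(b)  # [4, 2, 3, 1, 5, 6]
```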
import itertools
all_items = set(itertools.chain(*a))
b = sorted(all_items, key = lambda y: -sum(x.count(y) for x in a))
Try this -
a = [[1,2,3,4], [2,3,4], [4,5,6]]
s = set()
for l in a:
    s.update(l)
print(s)
#{1, 2, 3, 4, 5, 6}
b = list(s)
This will add each list to the set, which gives you a unique set of all elements in all the lists, if that is what you are after.
Edit: to preserve the order of elements in the original lists, you can't use sets.
a = [[1,2,3,4], [2,3,4], [4,5,6]]
b = []
for l in a:
    for i in l:
        if i not in b:
            b.append(i)
print(b)
#[1, 2, 3, 4, 5, 6] - the same order as the set in this case, since that's the order in which they appear in the lists
import itertools
from collections import defaultdict
def list_by_count(lists):
    data_stream = itertools.chain.from_iterable(lists)
    counts = defaultdict(int)
    for item in data_stream:
        counts[item] += 1
    # sort by descending count, then by the item itself to break ties
    return [item for (item, count) in
            sorted(counts.items(), key=lambda x: (-x[1], x[0]))]
Having x[0] in the sort key ensures that items with the same count come out in a well-defined order as well.
