Extract list of duplicate values and locations from array - python

Given an array a of length N, which is a list of integers, I want to extract the duplicate values, with a separate list for each value containing the locations of its duplicates. In pseudo-math:
For every value val with |M| > 1:
val -> M = { i | a[i] == val }
Example (N=11):
a = [0, 3, 1, 6, 8, 1, 3, 3, 2, 10, 10]
should give the following lists:
3 -> [1, 6, 7]
1 -> [2, 5]
10 -> [9, 10]
I added the python tag since I'm currently programming in that language (numpy and scipy are available), but I'm more interested in a general algorithm for how to do it. Code examples are fine, though.
One idea, which I did not yet flesh out: Construct a list of tuples, pairing each entry of a with its index: (i, a[i]). Sort the list with the second entry as key, then check consecutive entries for which the second entry is the same.
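The sorting idea above could look roughly like the following. This is only a sketch: the function name duplicate_positions is made up for illustration, and sorting costs O(N log N), whereas grouping with a hash table is roughly O(N).

```python
from itertools import groupby
from operator import itemgetter

def duplicate_positions(a):
    """Map each value occurring more than once to the sorted list of its indices."""
    # Pair every value with its index, then sort by value (ties broken by index).
    pairs = sorted((value, index) for index, value in enumerate(a))
    result = {}
    # After sorting, equal values are adjacent, so groupby collects each run.
    for value, run in groupby(pairs, key=itemgetter(0)):
        indices = [index for _, index in run]
        if len(indices) > 1:
            result[value] = indices
    return result

a = [0, 3, 1, 6, 8, 1, 3, 3, 2, 10, 10]
print(duplicate_positions(a))
# {1: [2, 5], 3: [1, 6, 7], 10: [9, 10]}
```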

Here's an implementation using a Python dictionary (actually a defaultdict, for convenience):
a = [0, 3, 1, 6, 8, 1, 3, 3, 2, 10, 10]
from collections import defaultdict
d = defaultdict(list)
for k, item in enumerate(a):
    d[item].append(k)
finalD = {key: value for key, value in d.items() if len(value) > 1}  # Keep only values that occurred more than once.
print(finalD)
# {3: [1, 6, 7], 1: [2, 5], 10: [9, 10]}  (key order may differ on Python versions before 3.7)

The idea is to create a dictionary that maps each value to the list of positions where it appears.
A simple way to do this is with setdefault; defaultdict works as well.
>>> a = [0, 3, 1, 6, 8, 1, 3, 3, 2, 10, 10]
>>> dup={}
>>> for i, x in enumerate(a):
...     dup.setdefault(x, []).append(i)
...
>>> dup
{0: [0], 3: [1, 6, 7], 1: [2, 5], 6: [3], 8: [4], 2: [8], 10: [9, 10]}
Then the actual duplicates can be extracted with a dict comprehension that filters out values appearing only once (on Python 2 use dup.iteritems(); the key order shown may also differ there).
>>> {i: x for i, x in dup.items() if len(x) > 1}
{3: [1, 6, 7], 1: [2, 5], 10: [9, 10]}

Populate a dictionary whose keys are the values of the integers, and whose values are the lists of positions of those keys. Then go through that dictionary and remove all key/value pairs with only one position. You will be left with the ones that are duplicated.

Related

How to create a dictionary from a list where the count of each element is the key and values are the lists of the according element?

For Example:
list = [1,2,2,3,3,3,4,4,4,4]
the output should be:
{1:[1],2:[2,2],3:[3,3,3],4:[4,4,4,4]}
where the key 1 is the count of the element 1 and the value is the list of all elements whose count is 1, and so on.
The following code creates three dictionaries, each handling the case where several values share the same count differently:
l = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 44, 44, 44, 44]
d_replace = dict()
d_flat = dict()
d_nested = dict()
for item in set(l):
    elements = list(filter(lambda x: x == item, l))
    key = len(elements)
    d_replace[key] = elements
    d_flat.setdefault(key, list()).extend(elements)
    d_nested.setdefault(key, list()).append(elements)
print('Dictionary with replaced elements:\n', d_replace)
print('\nDictionary with a flat list of elements\n', d_flat)
print('\nDictionary with a nested lists of elements\n', d_nested)
Output:
Dictionary with replaced elements:
{1: [1], 2: [2, 2], 3: [3, 3, 3], 4: [44, 44, 44, 44]}
Dictionary with a flat list of elements
{1: [1], 2: [2, 2], 3: [3, 3, 3], 4: [4, 4, 4, 4, 44, 44, 44, 44]}
Dictionary with a nested lists of elements
{1: [[1]], 2: [[2, 2]], 3: [[3, 3, 3]], 4: [[4, 4, 4, 4], [44, 44, 44, 44]]}
d_replace: the list stored for a count is overwritten by the last value that has that count.
d_flat: contains a single flat list of all elements whose values have that count.
d_nested: contains a list of lists, one per value that has that count.
You can try a dict comprehension with either a list comprehension or filter:
ls = [1,2,2,3,3,3,4,4,4,4]
print({ls.count(i): [el for el in ls if el == i] for i in set(ls)})
OR
print({ls.count(i): list(filter(lambda x: x == i, ls)) for i in set(ls)})
Output
{1: [1], 2: [2, 2], 3: [3, 3, 3], 4: [4, 4, 4, 4]}
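Note that ls.count(i) rescans the whole list once per distinct value. A possible single-pass variant, sketched here with collections.Counter (the name group_by_count is made up for illustration). As with the comprehensions above, values that share a count overwrite each other.

```python
from collections import Counter

def group_by_count(items):
    """Map each occurrence count to a list holding that many copies of the value."""
    counts = Counter(items)  # one pass over the input
    # If two values share a count, the one seen later wins.
    # [value] * n repeats the same object n times, which is fine for integers.
    return {n: [value] * n for value, n in counts.items()}

ls = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(group_by_count(ls))
# {1: [1], 2: [2, 2], 3: [3, 3, 3], 4: [4, 4, 4, 4]}
```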

Dict from two lists including multiple values for keys

Is there a possibility to create a dict from two lists with same key occurring multiple times without iterating over the whole dataset?
Minimal example:
keys = [1, 2, 3, 2, 3, 4, 5, 1]
values = [1, 2, 3, 4, 5, 6, 7, 8]
# hoped for result:
dictionary = dict(???)
dictionary = {1 : [1,8], 2:[2,4], 3:[3,5], 4:[6], 5:[7]}
When using zip the key-value-pair is inserted overwriting the old one:
dictionary = dict(zip(keys,values))
dictionary = {1: 8, 2: 4, 3: 5, 4: 6, 5: 7}
I would be happy with a Multidict as well.
This is one approach that doesn't require two for loops:
from collections import defaultdict
h = defaultdict(list)
for k, v in zip(keys, values):
    h[k].append(v)
print(h)
# defaultdict(<class 'list'>, {1: [1, 8], 2: [2, 4], 3: [3, 5], 4: [6], 5: [7]})
print(dict(h))
# {1: [1, 8], 2: [2, 4], 3: [3, 5], 4: [6], 5: [7]}
This is the only one-liner I could do.
dictionary = {k: [values[i] for i in [j for j, x in enumerate(keys) if x == k]] for k in set(keys)}
It is far from readable. Remember that clear code is always better than pseudo-clever code ;)
Here is an example that I think is easy to follow logically. Unfortunately it does not use zip as you would prefer, nor does it avoid iterating, because a task like this has to involve iterating in some form.
# Your data
keys = [1, 2, 3, 2, 3, 4, 5, 1]
values = [1, 2, 3, 4, 5, 6, 7, 8]
# Make result dict
result = {}
for x in range(1, max(keys)+1):
    result[x] = []
# Populate result dict
for index, num in enumerate(keys):
    result[num].append(values[index])
# Print result
print(result)
If you know the range of values in the keys array, you could make this faster by providing the results dictionary as a literal with integer keys and empty list values.
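One caveat: the range(1, max(keys)+1) loop above assumes the keys are positive integers. Gaps in the range leave unused empty lists in the result, and a key of zero or below raises a KeyError. A possible way around that, as a sketch, is to create the empty lists from the keys themselves:

```python
keys = [1, 2, 3, 2, 3, 4, 5, 1]
values = [1, 2, 3, 4, 5, 6, 7, 8]

# One empty list per distinct key; dict.fromkeys keeps first-seen order.
# A dict comprehension is used so that each key gets its own list object.
result = {k: [] for k in dict.fromkeys(keys)}

# zip pairs each key with its value, so no index lookup is needed.
for k, v in zip(keys, values):
    result[k].append(v)

print(result)
# {1: [1, 8], 2: [2, 4], 3: [3, 5], 4: [6], 5: [7]}
```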

Create sublists of indexes of equal values from list

I'm trying to split a list of integers into sublists of the indexes of equal integers. So say I have a list:
original_list = [1,2,1,4,4,4,3,4,4,1,4,3,3]
The desired output would be:
indexes : [[0,2,9], [1], [6,11,12], [3,4,5,7,8,10]]
# corresponds to sublists: [[1,1,1], [2], [3,3,3], [4,4,4,4,4,4]]
I can't figure out how to do this though, as most solutions require you to first sort the original list, but in my case, this messes up the indices. Itertools or np.arrays have not helped me for this reason, as they only group sequential equal elements.
Does anyone know of a solution for this problem? I would love to hear!
You can use enumerate:
original_list = [1,2,1,4,4,4,3,4,4,1,4,3,3]
groups = {a:[i for i, c in enumerate(original_list) if c == a] for a in set(original_list)}
Output:
{1: [0, 2, 9], 2: [1], 3: [6, 11, 12], 4: [3, 4, 5, 7, 8, 10]}
Using enumerate and a defaultdict, you can build a mapping of values to their indices with:
from collections import defaultdict
dd = defaultdict(list)
for index, value in enumerate(original_list):
    dd[value].append(index)
print(dd)
# defaultdict(<class 'list'>, {1: [0, 2, 9], 2: [1], 4: [3, 4, 5, 7, 8, 10], 3: [6, 11, 12]})
You can use collections.defaultdict for a one-pass solution. Then use sorted if you need, as in your desired result, to sort your indices by value.
original_list = [1,2,1,4,4,4,3,4,4,1,4,3,3]
from collections import defaultdict
from operator import itemgetter
dd = defaultdict(list)
for idx, value in enumerate(original_list):
    dd[value].append(idx)
keys, values = zip(*sorted(dd.items(), key=itemgetter(0)))
print(keys, values, sep='\n')
(1, 2, 3, 4)
([0, 2, 9], [1], [6, 11, 12], [3, 4, 5, 7, 8, 10])
For comparison, the values of dd are insertion ordered in Python 3.6+ (officially in 3.7+, as a CPython implementation detail in 3.6):
print(list(dd.values()))
[[0, 2, 9], [1], [3, 4, 5, 7, 8, 10], [6, 11, 12]]
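Here is how I would do it with numpy, using np.argsort. A stable sort keeps the indices within each group in ascending order.
import numpy as np
original = [1,2,1,4,4,4,3,4,4,1,4,3,3]
indexes = []
s = set()
for n in np.argsort(original, kind="stable"):
    n = int(n)  # plain int instead of a numpy integer
    if original[n] in s:
        indexes[-1].append(n)
    else:
        indexes.append([n])
        s.add(original[n])
print(indexes)
# [[0, 2, 9], [1], [6, 11, 12], [3, 4, 5, 7, 8, 10]]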
This can be achieved with a list comprehension.
>>> x = [1,2,1,4,4,4,3,4,4,1,4,3,3]
>>> [[i for i in range(len(x)) if x[i]==y] for y in sorted(set(x))]
[[0, 2, 9], [1], [6, 11, 12], [3, 4, 5, 7, 8, 10]]

Find intersection of dictionary values which are lists

I have a list of dictionaries that share the same keys but have different values:
[{1:[1,2,3,4,5], 2:[6,7,8], 3:[1,3,5,7,9]},
{1:[2,3,4], 2:[6,7], 3:[1,3,5]},
...]
I would like to get intersection as dictionary under same keys like this:
{1:[2,3,4], 2:[6,7], 3:[1,3,5]}
Give this a try?
dicts = [{1:[1,2,3,4,5], 2:[6,7,8], 3:[1,3,5,7,9]},{1:[2,3,4], 2:[6,7], 3:[1,3,5]}]
result = { k: set(dicts[0][k]).intersection(*(d[k] for d in dicts[1:])) for k in dicts[0].keys() }
print(result)
# Output:
# {1: {2, 3, 4}, 2: {6, 7}, 3: {1, 3, 5}}
If you want lists instead of sets as the output value type, just throw a list(...) around the set intersection.
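For instance, wrapping the intersection in sorted(...) rather than list(...) also gives a predictable element order, since sets are unordered. A small sketch, assuming the values are mutually comparable and every dict contains all keys of the first one:

```python
dicts = [
    {1: [1, 2, 3, 4, 5], 2: [6, 7, 8], 3: [1, 3, 5, 7, 9]},
    {1: [2, 3, 4], 2: [6, 7], 3: [1, 3, 5]},
]

first, rest = dicts[0], dicts[1:]
result = {
    # sorted() turns the resulting set into a list in ascending order
    k: sorted(set(first[k]).intersection(*(d[k] for d in rest)))
    for k in first
}
print(result)
# {1: [2, 3, 4], 2: [6, 7], 3: [1, 3, 5]}
```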
For a list of dictionaries, reduce the whole list like this:
>>> from functools import reduce
>>> d = [{1:[1,2,3,4,5], 2:[6,7,8], 3:[1,3,5,7,9]},{1:[2,3,4], 2:[6,7], 3:[1,3,5]}]
>>> reduce(lambda x, y: {k: sorted(list(set(x[k])&set(y[k]))) for k in x.keys()}, d)
{1: [2, 3, 4], 2: [6, 7], 3: [1, 3, 5]}
I would probably do something along these lines:
# `dictionaries` is the list of dicts from the question.
# Take the first dict and convert the values to `set`.
output = {k: set(v) for k, v in dictionaries[0].items()}
# For the rest of the dicts, intersect the set at each key with the list stored under the same key.
for d in dictionaries[1:]:
    for k, v in output.items():
        output[k] = v.intersection(d[k])
There are different variations on this same theme, but I find this one to be about as simple to read as it gets (and since code is read more often than it is written, I consider that a win :-)
On Python 2 you can use dict.viewkeys and dict.viewitems, which are set-like (on Python 3, dict.keys() and dict.items() already return such views):
In [103]: dict.viewkeys?
Docstring: D.viewkeys() -> a set-like object providing a view on D's keys
dict.viewitems?
Docstring: D.viewitems() -> a set-like object providing a view on D's items
Combine the views with & (set intersection); a plain `and` would just return its second operand. In Python 3 syntax:
a = [{1: [1, 2, 3, 4, 5], 2: [6, 7, 8], 3: [1, 3, 5, 7, 9]},
     {1: [2, 3, 4], 2: [6, 7], 3: [1, 3, 5]}]
In [100]: {k: sorted(set(a[0][k]) & set(a[1][k])) for k in a[0].keys() & a[1].keys()}
Out[100]: {1: [2, 3, 4], 2: [6, 7], 3: [1, 3, 5]}

Removing duplicates from a list of lists based on a comparison of an element of the inner lists

I have a large list of lists and need to remove duplicate elements based on specific criteria:
Uniqueness is determined by the first element of the lists.
Removal of duplicates is determined by comparing the value of the second element of the duplicate lists, namely keep the list with the lowest second element.
[[1, 4, 5], [1, 3, 4], [1, 2, 3]]
All the above lists are considered duplicates since their first elements are equal. The third list needs to be kept since its second element is the smallest. Note that the actual list of lists has over 4 million elements, is double sorted, and its ordering needs to be preserved.
The list is first sorted based on the second element of the inner lists and in reverse (descending) order, followed by normal (ascending) order based on the first element:
sorted(sorted(the_list, key=itemgetter(1), reverse=True), key=itemgetter(0))
An example of three duplicate lists in their actual ordering:
[...
[33554432, 50331647, 1695008306],
[33554432, 34603007, 1904606324],
[33554432, 33554687, 2208089473],
...]
The goal is to prepare the list for bisect searching. Can someone provide me with insight on how this might be achieved using Python?
You can group the elements using a dict, always keeping the sublist with the smaller second element:
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = {}
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub
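Also, you can give sorted a single key function that returns a tuple, so you don't need to call sorted twice. To match the double sort in the question, the first element (ascending) comes first in the tuple, followed by the negated second element (descending):
In [3]: l = [[1,4,6,2],[2,2,4,6],[1,2,4,5]]
In [4]: sorted(l,key=lambda x: (x[0],-x[1]))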
Out[4]: [[1, 4, 6, 2], [1, 2, 4, 5], [2, 2, 4, 6]]
If you want the dict to keep insertion order, since the question says the ordering needs to be preserved:
from collections import OrderedDict
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = OrderedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub
But I'm not sure how much that helps, since you sort the data afterwards and any earlier order is lost anyway.
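Since the list in the question is already sorted by first element, the duplicates sit next to each other, so the output order can also be kept without a dict. A sketch using itertools.groupby (the function name dedupe_sorted is made up here, and it assumes the input really is grouped by first element):

```python
from itertools import groupby
from operator import itemgetter

def dedupe_sorted(rows):
    """For each run of equal first elements, keep the row with the smallest second element."""
    # groupby only merges adjacent rows, so rows must already be sorted
    # (or at least grouped) by their first element.
    return [min(run, key=itemgetter(1)) for _, run in groupby(rows, key=itemgetter(0))]

# Same data as above, ordered as in the question:
# first element ascending, second element descending.
l = [[1, 4, 5], [1, 3, 4], [1, 2, 3], [2, 5, 6], [2, 4, 3], [2, 1, 3]]
print(dedupe_sorted(l))
# [[1, 2, 3], [2, 1, 3]]
```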
What you may find very useful is sortedcontainers.SortedDict:
A SortedDict provides the same methods as a dict. Additionally, a SortedDict efficiently maintains its keys in sorted order. Consequently, the keys method will return the keys in sorted order, the popitem method will remove the item with the highest key, etc.
An optional key argument defines a callable that, like the key argument to Python’s sorted function, extracts a comparison key from each dict key. If no function is specified, the default compares the dict keys directly. The key argument must be provided as a positional argument and must come before all other arguments.
from sortedcontainers import SortedDict
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = SortedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub
print(list(d.values()))
It also has the methods you want: bisect, bisect_left, etc.
If I got it correctly, the solution might be like this:
mylist = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [7, 3, 6], [7, 1, 8]]
ordering = []
newdata = {}
for a, b, c in mylist:
    if a in newdata:
        if b < newdata[a][1]:
            newdata[a] = [a, b, c]
    else:
        newdata[a] = [a, b, c]
        ordering.append(a)
newlist = [newdata[v] for v in ordering]
newlist then holds the reduced list [[1, 2, 3], [7, 1, 8]].
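As for the bisect search mentioned in the question: once the first elements are unique and ascending, the standard library bisect module can locate a row. A rough sketch, assuming a lookup should return the row whose first element is the largest one not exceeding x. That reading of the data and the name find_row are assumptions, not something stated in the question.

```python
from bisect import bisect_right

def find_row(rows, starts, x):
    """Return the row with the greatest first element <= x, or None if there is none."""
    i = bisect_right(starts, x) - 1
    return rows[i] if i >= 0 else None

newlist = [[1, 2, 3], [7, 1, 8]]      # deduplicated, ascending by first element
starts = [row[0] for row in newlist]  # bisect needs a plain sorted list of keys

print(find_row(newlist, starts, 5))   # [1, 2, 3]
print(find_row(newlist, starts, 7))   # [7, 1, 8]
print(find_row(newlist, starts, 0))   # None
```

Building starts once and reusing it keeps each lookup at O(log n), which matters for a list with millions of rows.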
