I have a list:
mylist = [('Item A','CA','10'),('Item B','CT','12'),('Item A','CA','14'),('Item A','NH','10')]
I would like to remove duplicates based on column 1 and 2. So my desired output would be:
[('Item A','CA','10'),('Item B','CT','12'),('Item A','NH','10')]
I'm not really sure how to go about this, so I haven't posted any code, but am just looking for some help :)
Use a dict. The other answer is good. For variety, here's a single expression that will give you the uniq'd list (though the order of elements is not preserved).
{tuple(item[0:2]): item for item in mylist[::-1]}.values()
This creates a dict from the elements of mylist, using elements 0 and 1 as the key (implicitly removing duplicates). Because mylist is iterated in reverse order, the entry that survives for each duplicate key is the one that appears first in the original list: it is inserted last and overwrites the later duplicates.
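For concreteness, here is roughly how that expression behaves on the sample data; the result order below assumes Python 3.7+ dict ordering and may differ from the original list order:

mylist = [('Item A', 'CA', '10'), ('Item B', 'CT', '12'),
          ('Item A', 'CA', '14'), ('Item A', 'NH', '10')]

# Key on the first two fields; reversing means the first occurrence
# (in original order) is inserted last and therefore wins.
deduped = list({tuple(item[0:2]): item for item in mylist[::-1]}.values())
print(deduped)
# [('Item A', 'NH', '10'), ('Item A', 'CA', '10'), ('Item B', 'CT', '12')]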
Dict keys can be of any hashable type. Create a dict with the first two columns of each item as the key, and only add to unique if those columns haven't been seen before.
unique = {}
for item in mylist:
    if item[0:2] not in unique:
        unique[item[0:2]] = item
print(list(unique.values()))
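If you need this in more than one place, a small helper along the same lines might look like this (dedupe_on is a hypothetical name, not from the original answer):

def dedupe_on(rows, key_cols=(0, 1)):
    # Keep the first row seen for each combination of the given columns.
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        if key not in seen:
            seen[key] = row
    return list(seen.values())

print(dedupe_on(mylist))
# [('Item A', 'CA', '10'), ('Item B', 'CT', '12'), ('Item A', 'NH', '10')]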
so I have a defaultdict(list) hashmap, potential_terms
potential_terms={9: ['leather'], 10: ['type', 'polyester'], 13:['hello','bye']}
What I want to output is the 2 values (words) with the lowest keys, so 'leather' is definitely the first output. But 'type' and 'polyester' both have k=10; when the key is the same, I want a random choice of either 'type' or 'polyester'.
What I did is:
out=[v for k,v in sorted(potential_terms.items(), key=lambda x:(x[0],random.choice(x[1])))][:2]
but when I print out I get :
[['leather'], ['type', 'polyester']]
My guess is of course that the problem is the 2nd part of the lambda function: random.choice(x[1]). Any ideas on how to make it work as expected by outputting either 'type' or 'polyester'?
Thanks
EDIT: See Karl's answer and comment as to why this solution isn't correct for OP's problem.
I leave it here because it does demonstrate what OP originally got wrong.
key= doesn't transform the data itself; it only tells sorted how to sort.
You want to apply choice to v when selecting it for the comprehension, like so:
out = [random.choice(v) for k, v in sorted(potential_terms.items())[:2]]
(I also moved the [:2] inside, to shorten the list before the comprehension)
Output:
['leather', 'type']
OR
['leather', 'polyester']
You have (with some extra formatting to highlight the structure):
out = [
    v
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: (x[0], random.choice(x[1]))
    )
][:2]
This means (reading from the inside out): sort the items according to the key, breaking ties using a random choice from the value list. Extract the values (which are lists) from those sorted items into a list (of lists). Finally, get the first two items of that list of lists.
This doesn't match the problem description, and is also somewhat nonsensical: since the keys are, well, keys, there cannot be duplicates, and thus there cannot be ties to break.
What we wanted: sort the items according to the key, then put all the contents of those individual lists next to each other to make a flattened list of strings, but randomizing the order within each sublist (i.e., shuffling those sublists). Then, get the first two items of that list of strings.
Thus, applying the technique from the link, and shuffling the sublists "inline" as they are discovered by the comprehension:
out = [
    term
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: x[0]  # not actually necessary now, since the natural
                            # sort order of the items will work
    )
    for term in random.sample(v, len(v))
][:2]
Please also see https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/ to understand how the list flattening and result ordering works in a two-level comprehension like this.
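The two-level comprehension above reads like stacked for loops: the for clauses run in the order written, and the leftmost expression is what lands in the result. A minimal illustration (not from the original answer):

pairs = {2: ['c'], 1: ['a', 'b']}

# Comprehension form.
flat = [item for k, v in sorted(pairs.items()) for item in v]

# Equivalent explicit loops.
flat_loops = []
for k, v in sorted(pairs.items()):
    for item in v:
        flat_loops.append(item)

print(flat, flat_loops)  # ['a', 'b', 'c'] ['a', 'b', 'c']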
Instead of the out comprehension, a simpler approach is:
d = list(potential_terms.values()), which stores all the values.
It will store the values as:
[['leather'], ['type', 'polyester'], ['hello', 'bye']]
You can access 'leather' as d[0][0] and the list ['type', 'polyester'] as d[1]. Now we just use random.shuffle(d[1]) and take d[1][0],
which gets us a random word, either 'type' or 'polyester'.
Final code should be like this:
import random

potential_terms = {9: ['leather'], 10: ['type', 'polyester'], 13: ['hello', 'bye']}

d = list(potential_terms.values())
random.shuffle(d[1])
c = []
c.append(d[0][0])
c.append(d[1][0])
Which gives the desired output,
either ['leather', 'polyester'] or ['leather', 'type'].
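Note that this relies on the dict's insertion order already matching ascending keys (which happens to hold for the literal above in Python 3.7+). If that isn't guaranteed, a variant that sorts explicitly might look like this sketch (not from the original answer):

import random

potential_terms = {13: ['hello', 'bye'], 9: ['leather'], 10: ['type', 'polyester']}

# Sort by key so the two lowest keys really come first, whatever the insertion order.
values_by_key = [v for k, v in sorted(potential_terms.items())]
c = [values_by_key[0][0], random.choice(values_by_key[1])]
print(c)  # ['leather', 'type'] or ['leather', 'polyester']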
I have a dictionary whose values are lists of lists.
mydict = {
    "A": [["gactacgat", "IE"], ["gactacgat", "IE"]],
    "G": [["ggctacgat", "EI"], ["gactacgat", "IE"]],
    "C": [["gcctacgat", "N"], ["gactacgat", "IE"]],
    "T": [["gtctacgat", "IE"], ["gactacgat", "IE"]]
}
And I am trying to create a list containing all of the elements in the second column of these 2D lists.
This is what I am doing:
mylist = []
for key in mydict.keys():
    for row in mydict[key]:
        mylist.append(row[1])
Is there a more efficient way of accessing a specific index from all the lists in these lists of lists than what I have done?
(I apologize if my wording was bad, and any feedback for a better title question is appreciated)
List comprehensions are your friend:
mylist = [row[1] for value in mydict.values() for row in value]
It still uses a nested loop, but list comprehensions are generally a bit better optimized. Additionally, using dict.values() saves you a key lookup at each iteration of the outer loop, and it returns the values in the same order as the keys.
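For the sample mydict above (assuming insertion order is preserved, as in Python 3.7+), the comprehension produces:

mylist = [row[1] for value in mydict.values() for row in value]
print(mylist)
# ['IE', 'IE', 'EI', 'IE', 'N', 'IE', 'IE', 'IE']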
I have a list of lists and I want to remove both the duplicate and original, if a duplicate exists in the first element:
x=[["1.2.3.4","thing1"], ["8.8.8.8","thing2"], ["8.8.8.8","thing3"], ["8.8.4.4","thing4"]]
So I would want the output to be:
[["1.2.3.4","thing1"], ["8.8.4.4","thing4"]]
I have tried the following:
print [a for a in x if x[0].count(a[0]) == 1]
However, I only get the first item:
[['1.2.3.4', 'thing1']]
Any assistance would be appreciated. I need to remove both the duplicate and the original value if a duplicate is found. Both "Most pythonic way to remove tuples from a list if first element is a duplicate" and "Removing duplicates from list of lists in Python" keep one of the duplicate values.
Thank you.
You can use collections.Counter:
from collections import Counter
counts = Counter([a[0] for a in x])
print([a for a in x if counts[a[0]] == 1])
#[['1.2.3.4', 'thing1'], ['8.8.4.4', 'thing4']]
First build a Counter to count the occurrences of the first element in each item of your list. Then filter your list and keep only the values where the count for that item is 1.
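Putting it together with the sample data as a quick check (not part of the original answer):

from collections import Counter

x = [["1.2.3.4", "thing1"], ["8.8.8.8", "thing2"],
     ["8.8.8.8", "thing3"], ["8.8.4.4", "thing4"]]

counts = Counter(a[0] for a in x)            # occurrences of each first element
print([a for a in x if counts[a[0]] == 1])   # [['1.2.3.4', 'thing1'], ['8.8.4.4', 'thing4']]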
I've always found dictionaries to be an odd thing in Python. I know it's just me, I'm sure, but I can't work out how to take two lists and add them to a dict. If both lists were mappable it wouldn't be a problem; something like dictionary = dict(zip(list1, list2)) would suffice. However, on each run list1 will always have one item, while list2 could have multiple items (or a single item) that I'd like as the values.
How could I approach adding the key and potentially multiple values to it?
After some deliberation, Kasramvd's second option seems to work well for this scenario:
dictionary.setdefault(list1[0], []).append(list2)
Based on your comment, all you need is to assign the second list as the value for the only item of the first list.
d = {}
d[list1[0]] = list2
And if you want to preserve the values for duplicate keys, you can use dict.setdefault() to build up a list of lists under each key.
d = {}
d.setdefault(list1[0], []).append(list2)
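A collections.defaultdict behaves the same way and saves the setdefault call on every insert; a minimal sketch, with purely illustrative list1/list2 contents:

from collections import defaultdict

d = defaultdict(list)

# Each run contributes one key (list1[0]) and one batch of values (list2).
list1, list2 = ["fruit"], ["apple", "pear"]
d[list1[0]].append(list2)

list1, list2 = ["fruit"], ["plum"]
d[list1[0]].append(list2)

print(dict(d))  # {'fruit': [['apple', 'pear'], ['plum']]}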
I have a dictionary and a list. The list is made up of values. The dictionary has all of the values plus some more values.
I'm trying to count the number of times the values in the list show up in the dictionary per key/values pair.
It looks something like this:
for k in dict:
    count = 0
    for value in dict[k]:
        if value in list:
            count += 1
            list.remove(value)
    dict[k].append(count)
I have something like ~1 million entries in the list so searching through each time is ultra slow.
Is there some faster way to do what I'm trying to do?
Thanks,
Rohan
You're going to have all manner of trouble with this code, since you're removing items from your list while you're still relying on it for lookups. Also, you're using list as a variable name, which gets you into interesting trouble because list is also a built-in type.
You should be able to get a huge performance improvement (once you fix the other defects in your code) by using a set instead of a list. What you lose by using a set is the ordering of the items and the ability to have an item appear in the list more than once. (Also your items have to be hashable.) What you gain is O(1) lookup time.
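To make that difference concrete, here is a rough micro-benchmark sketch (not from the original answer; the absolute numbers will vary by machine):

import timeit

data_list = list(range(1_000_000))
data_set = set(data_list)

# Membership tests against a list scan element by element; a set hashes once.
print(timeit.timeit('999_999 in data_list', globals=globals(), number=100))
print(timeit.timeit('999_999 in data_set', globals=globals(), number=100))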
If you repeatedly search in a list, converting the list to a set first will be much faster:
listSet = set(list)
for k, values in dict.items():
    count = 0
    for value in values:
        if value in listSet:
            count += 1
            listSet.remove(value)
    dict[k].append(count)

# rebuild the original list without the removed elements
list = [elem for elem in list if elem in listSet]
for val in my_list:
    if val in my_dict:
        my_dict[val] = my_dict[val] + 1
    else:
        my_dict[val] = 0
What you still need: handle the case when val is not in the dict.
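If the goal is simply to count occurrences of each value in my_list, collections.Counter is a common shortcut (a sketch, not from the original answer):

from collections import Counter

my_list = ['a', 'b', 'a', 'c', 'a']   # illustrative data
counts = Counter(my_list)
print(counts)  # Counter({'a': 3, 'b': 1, 'c': 1})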
I changed the last line to append to the dictionary. It's a defaultdict(list). Hopefully that clears up some of the questions. Thanks again.