numpy.unique gives wrong output for list of sets

numpy.unique gives wrong output for list of sets - python

I have a list of sets given by,
sets1 = [{1},{2},{1}]
When I find the unique elements in this list using numpy's unique, I get
np.unique(sets1)
Out[18]: array([{1}, {2}, {1}], dtype=object)
As can be seen seen, the result is wrong as {1} is repeated in the output.
When I change the order in the input by making similar elements adjacent, this doesn't happen.
sets2 = [{1},{1},{2}]
np.unique(sets2)
Out[21]: array([{1}, {2}], dtype=object)
Why does this occur? Or is there something wrong in the way I have done?

What happens here is that the np.unique function is based on the np._unique1d function from NumPy (see the code here), which itself uses the .sort() method.
Now, sorting a list of sets that contain only one integer in each set will not result in a list with each set ordered by the value of the integer present in the set. So we will have (and that is not what we want):
sets = [{1},{2},{1}]
sets.sort()
print(sets)
# > [{1},{2},{1}]
# ie. the list has not been "sorted" like we want it to
Now, as you have pointed out, if the list of sets is already ordered in the way you want, np.unique will work (since you would have sorted the list beforehand).
One specific solution (though, please be aware that it will only work for a list of sets that each contain a single integer) would then be:
np.unique(sorted(sets, key=lambda x: next(iter(x))))

That is because set is unhashable type
{1} is {1} # will give False
you can use python collections.Counter if you can can convert the set to tuple like below
from collections import Counter
sets1 = [{1},{2},{1}]
Counter([tuple(a) for a in sets1])

Related

Python pick a random value from hashmap that has a list as value?

so I have a defaultdict(list) hashmap, potential_terms
potential_terms={9: ['leather'], 10: ['type', 'polyester'], 13:['hello','bye']}
What I want to output is the 2 values (words) with the lowest keys, so 'leather' is definitely the first output, but 'type' and 'polyester' both have k=10, when the key is the same, I want a random choice either 'type' or 'polyester'
What I did is:
out=[v for k,v in sorted(potential_terms.items(), key=lambda x:(x[0],random.choice(x[1])))][:2]
but when I print out I get :
[['leather'], ['type', 'polyester']]
My guess is ofcourse the 2nd part of the lambda function: random.choice(x[1]). Any ideas on how to make it work as expected by outputting either 'type' or 'polyester' ?
Thanks

EDIT: See Karl's answer and comment as to why this solution isn't correct for OP's problem.
I leave it here because it does demonstrate what OP originally got wrong.
key= doesn't transform the data itself, it only tells sorted how to sort,
you want to apply choice on v when selecting it for the comprehension, like so:
out=[random.choice(v) for k,v in sorted(potential_terms.items())[:2]]
(I also moved the [:2] inside, to shorten the list before the comprehension)
Output:
['leather', 'type']
OR
['leather', 'polyester']

You have (with some extra formatting to highlight the structure):
out = [
v
for k, v in sorted(
potential_terms.items(),
key=lambda x:(x[0], random.choice(x[1]))
)
][:2]
This means (reading from the inside out): sort the items according to the key, breaking ties using a random choice from the value list. Extract the values (which are lists) from those sorted items into a list (of lists). Finally, get the first two items of that list of lists.
This doesn't match the problem description, and is also somewhat nonsensical: since the keys are, well, keys, there cannot be duplicates, and thus there cannot be ties to break.
What we wanted: sort the items according to the key, then put all the contents of those individual lists next to each other to make a flattened list of strings, but randomizing the order within each sublist (i.e., shuffling those sublists). Then, get the first two items of that list of strings.
Thus, applying the technique from the link, and shuffling the sublists "inline" as they are discovered by the comprehension:
out = [
term
for k, v in sorted(
potential_terms.items(),
key = lambda x:x[0] # this is not actually necessary now,
# since the natural sort order of the items will work.
)
for term in random.sample(v, len(v))
][:2]
Please also see https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/ to understand how the list flattening and result ordering works in a two-level comprehension like this.

Instead of the out, a simpler function, is:
d = list(p.values()) which stores all the values.
It will store the values as:
[['leather'], ['polyester', 'type'], ['hello', 'bye']]
You can access, leather as d[0] and the list, ['polyester', 'type'], as d[1]. Now we'll just use random.shuffle(d[1]), and use d[1][0].
Which would get us a random word, type or polyester.
Final code should be like this:
import random
potential_terms={9: ['leather'], 10: ['type', 'polyester'], 13:['hello','bye']}
d = list(p.values())
random.shuffle(d[1])
c = []
c.append(d[0][0])
c.append(d[1][0])
Which gives the desired output,
either ['leather', 'polyester'] or ['leather', 'type'].

numpy.unique has the problem with frozensets

Just run the code:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
print(set(a)) # out: {frozenset({3, 4}), frozenset({1, 2})}
print(np.unique(a)) # out: [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
The first out is correct, the second is not.
The problem exactly is here:
a[0]==a[-1] # out: True
But set from np.unique has 3 elements, not 2.
I used to utilize np.unique to work with duplicates for ex (using return_index=True and others). What can u advise for me to use instead np.unique for these purposes?

numpy.unique operates by sorting, then collapsing runs of identical elements. Per the doc string:
Returns the sorted unique elements of an array.
The "sorted" part implies it's using a sort-collapse-adjacent technique (similar to what the *NIX sort | uniq pipeline accomplishes).
The problem is that while frozenset does define __lt__ (the overload for <, which most Python sorting algorithms use as their basic building block), it's not using it for the purposes of a total ordering like numbers and sequences use it. It's overloaded to test "is a proper subset of" (not including direct equality). So frozenset({1,2}) < frozenset({3,4}) is False, and so is frozenset({3,4}) > frozenset({1,2}).
Because the expected sort invariant is broken, sorting sequences of set-like objects produces implementation-specific and largely useless results. Uniquifying strategies based on sorting will typically fail under those conditions; one possible result is that it will find the sequence to be sorted in order or reverse order already (since each element is "less than" both the prior and subsequent elements); if it determines it to be in order, nothing changes, if it's in reverse order, it swaps the element order (but in this case that's indistinguishable from preserving order). Then it removes adjacent duplicates (since post-sort, all duplicates should be grouped together), finds none (the duplicates aren't adjacent), and returns the original data.
For frozensets, you probably want to use hash based uniquification, e.g. via set or (to preserve original order of appearance on Python 3.7+), dict.fromkeys; the latter would be simply:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = list(dict.fromkeys(a)) # Works on CPython/PyPy 3.6 as implementation detail, and on 3.7+ everywhere
It's also possible to use sort-based uniquification, but numpy.unique doesn't seem to support a key function, so it's easier to stick to Python built-in tools:
from itertools import groupby # With no key argument, can be used much like uniq command line tool
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = [k for k, _ in groupby(sorted(a, key=sorted))]
That second line is a little dense, so I'll break it up:
sorted(a, key=sorted) - Returns a new list based on a where each element is sorted based on the sorted list form of the element (so the < comparison actually does put like with like)
groupby(...) returns an iterator of key/group-iterator pairs. With no key argument to groupby, it just means each key is a unique value, and the group-iterator produces that value as many times as it was seen.
[k for k, _ in ...] Since we don't care how many times each duplicate value was seen, so we ignore the group-iterator (assigning to _ means "ignored" by convention), and have the list comprehension produce only the keys (the unique values)

Find indices of x minimum values of a list

I have a list of length n.
I want to find the indices that hold the 5 minimum values of this list.
I know how to find the index holding the minimum value using operator
min_index,min_value = min(enumerate(list), key=operator.itemgetter(1))
Can this code be altered to get a list of the 5 indices I am after?

Although this requires sorting the entire list, you can get a slice of the sorted list:
data = sorted(enumerate(list), key=operator.itemgetter(1))[:5]

if use package heapq, it can be done by nsamllest:
heapq.nsmallest(5, enumerate(list), key=operator.itemgetter(1))

what about something like this?
map(lambda x: [a.index(x),x],sorted(list)[:5])
that will return a list of lists where list[x][0] = the index and list[x][1] = the value
EDIT:
This assumes the list doesn't have repeated minimum values. As adhg McDonald-Jensen pointed out, it this will only return the first instance of the give value.

Select maximum value in a list and the attributes related with that value

I am looking for a way to select the major value in a list of numbers in order to get the attributes.
data
[(14549.020163184512, 58.9615170298556),
(18235.00848249135, 39.73350448334156),
(12577.353023695543, 37.6940001866714)]
I wish to extract (18235.00848249135, 39.73350448334156) in order to have 39.73350448334156. The previous list (data) is derived from a a empty list data=[]. Is it the list the best format to store data in a loop?

You can get it by :
max(data)[1]
since tuples will be compared by the first element by default.

max(data)[1]
Sorting a tuple sorts according to the first elements, then the second. It means max(data) sorts according to the first element.
[1] returns then the second element from the "maximal" object.

Hmm, it seems easy or what?)
max(a)[1] ?

You can actually sort on any attribute of the list. You can use itemgetter. Another way to sort would be to use a definitive compare functions (when you might need multiple levels of itemgetter, so the below code is more readable).
dist = ((1, {'a':1}), (7, {'a': 99}), (-1, {'a':99}))
def my_cmp(x, y):
tmp = cmp(x[1][a], y[1][a])
if tmp==0:
return (-1 * cmp(x[0], y[0]))
else: return tmp
sorted = dist.sort(cmp=my_cmp) # sorts first descending on attr "a" of the second item, then sorts ascending on first item

Python .sort() not working as expected

Tackling a few puzzle problems on a quiet Saturday night (wooohoo... not) and am struggling with sort(). The results aren't quite what I expect. The program iterates through every combination from 100 - 999 and checks if the product is a palindome. If it is, append to the list. I need the list sorted :D Here's my program:
list = [] #list of numbers
for x in xrange(100,1000): #loops for first value of combination
for y in xrange(x,1000): #and 2nd value
mult = x*y
reversed = str(mult)[::-1] #reverses the number
if (reversed == str(mult)):
list.append(reversed)
list.sort()
print list[:10]
which nets:
['101101', '10201', '102201', '102201', '105501', '105501', '106601', '108801',
'108801', '110011']
Clearly index 0 is larger then 1. Any idea what's going on? I have a feeling it's got something to do with trailing/leading zeroes, but I had a quick look and I can't see the problem.
Bonus points if you know where the puzzle comes from :P

You are sorting strings, not numbers. '101101' < '10201' because '1' < '2'. Change list.append(reversed) to list.append(int(reversed)) and it will work (or use a different sorting function).

Sort is doing its job. If you intended to store integers in the list, take Lukáš advice. You can also tell sort how to sort, for example by making ints:
list.sort(key=int)
the key parameter takes a function that calculates an item to take the list object's place in all comparisons. An integer will compare numerically as you expect.
(By the way, list is a really bad variable name, as you override the builtin list() type!)

Your list contains strings so it is sorting them alphabetically - try converting the list to integers and then do the sort.

You're sorting strings, not numbers. Strings compare left-to-right.

No need to convert to int. mult already is an int and as you have checked it is a palindrome it will look the same as reversed, so just:
list.append(mult)

You have your numbers stored as strings, so python is sorting them accordingly. So: '101x' comes before '102x' (the same way that 'abcd' will come before 'az').

No, it is sorting properly, just that it is sorting lexographically and you want numeric sorting... so remove the "str()"

The comparator operator is treating your input as strings instead of integers. In string comparsion 2 as the 3rd letter is lexically greater than 1.
reversed = str(mult)[::-1]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

numpy.unique gives wrong output for list of sets - python

That is because set is unhashable type {1} is {1} # will give False you can use python collections.Counter if you can can convert the set to tuple like below from collections import Counter sets1 = [{1},{2},{1}] Counter([tuple(a) for a in sets1])

Related

Python pick a random value from hashmap that has a list as value?

numpy.unique has the problem with frozensets

Find indices of x minimum values of a list

Select maximum value in a list and the attributes related with that value

Python .sort() not working as expected

Categories

Resources