numpy.unique has the problem with frozensets - python

Just run the code:
import numpy as np
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
print(set(a)) # out: {frozenset({3, 4}), frozenset({1, 2})}
print(np.unique(a)) # out: [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
The first output is correct; the second is not.
The problem is exactly here:
a[0]==a[-1] # out: True
But the result of np.unique has 3 elements, not 2.
I used to rely on np.unique to work with duplicates (for example, using return_index=True and other options). What would you advise me to use instead of np.unique for these purposes?

numpy.unique operates by sorting, then collapsing runs of identical elements. Per the doc string:
Returns the sorted unique elements of an array.
The "sorted" part implies it's using a sort-collapse-adjacent technique (similar to what the *NIX sort | uniq pipeline accomplishes).
The problem is that while frozenset does define __lt__ (the overload for <, which most Python sorting algorithms use as their basic building block), it's not using it for the purposes of a total ordering like numbers and sequences use it. It's overloaded to test "is a proper subset of" (not including direct equality). So frozenset({1,2}) < frozenset({3,4}) is False, and so is frozenset({3,4}) > frozenset({1,2}).
Because the expected sort invariant is broken, sorting sequences of set-like objects produces implementation-specific and largely useless results. Uniquifying strategies based on sorting will typically fail under those conditions. One possible result is that the sort finds the sequence to be in order or in reverse order already (since no element compares "less than" its neighbor, nothing ever looks out of place); if it determines the sequence to be in order, nothing changes, and if it determines it to be in reverse order, it reverses the elements (but in this case that's indistinguishable from preserving order). numpy.unique then removes adjacent duplicates (since post-sort, all duplicates should be grouped together), finds none (the duplicates aren't adjacent), and returns the original data.
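The comparison behaviour described above can be checked directly. This is a small sketch using only built-in types; the variable names are illustrative and not part of the original question.

```python
a = frozenset({1, 2})
b = frozenset({3, 4})

# For sets, < means "is a proper subset of", not "sorts before".
disjoint_lt = (a < b, b < a)        # neither set is a subset of the other
subset_lt = frozenset({1}) < a      # a proper subset does compare "less than"
equal_lt = a < frozenset({2, 1})    # equal sets are not proper subsets

print(disjoint_lt, subset_lt, equal_lt)  # (False, False) True False
```

Since both a < b and b < a are False while a != b, there is no total ordering for a sort to rely on.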
For frozensets, you probably want to use hash based uniquification, e.g. via set or (to preserve original order of appearance on Python 3.7+), dict.fromkeys; the latter would be simply:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = list(dict.fromkeys(a)) # Works on CPython/PyPy 3.6 as implementation detail, and on 3.7+ everywhere
It's also possible to use sort-based uniquification, but numpy.unique doesn't seem to support a key function, so it's easier to stick to Python built-in tools:
from itertools import groupby # With no key argument, can be used much like uniq command line tool
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = [k for k, _ in groupby(sorted(a, key=sorted))]
That last line is a little dense, so I'll break it up:
sorted(a, key=sorted) - Returns a new list of a's elements, ordered by the sorted list form of each element (so the < comparison actually does put like with like)
groupby(...) - Returns an iterator of key/group-iterator pairs. With no key argument to groupby, it just means each key is a unique value, and the group-iterator produces that value as many times as it was seen.
[k for k, _ in ...] - We don't care how many times each duplicate value was seen, so we ignore the group-iterator (assigning to _ means "ignored" by convention) and have the list comprehension produce only the keys (the unique values).
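Since the question mentions return_index, here is a minimal hash-based sketch that recovers similar information for hashable items. The helper name unique_with_index is hypothetical (it is not a NumPy or standard-library function), and unlike np.unique it reports values in order of first appearance rather than sorted order.

```python
def unique_with_index(items):
    """Return (uniques, first_indices, counts) in order of first appearance."""
    first_index = {}
    counts = {}
    for i, item in enumerate(items):
        first_index.setdefault(item, i)         # remember only the first position
        counts[item] = counts.get(item, 0) + 1  # tally every occurrence
    uniques = list(first_index)
    return uniques, [first_index[u] for u in uniques], [counts[u] for u in uniques]

a = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
uniques, indices, counts = unique_with_index(a)
print(uniques, indices, counts)
```

This relies on dict preserving insertion order, which is guaranteed on Python 3.7+.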

Related

Tuple-key dictionary in python: Accessing a whole block of entries

I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from an object of key:value-pairs. However, calling only an incomplete tuple-key is a bit different than querying the whole key where I either get a value or nothing. I want to ask if python can still return the values in a time proportional to the number of key:value-pairs that match? Or alternatively, is the tuple-dictionary (plus list comprehension) better than using pandas.df.groupby() (but that would occupy a bit much memory space)?
The "standard" way would be something like
from random import randint
d = {(randint(1,10),i):"something" for i in range(200)}
def byfilter(n,d):
    return list(filter(lambda x: x[0]==n, d.keys()))
byfilter(5,d) ##returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
However there may be some use cases where we can optimize that. Suppose you need to do a couple or more accesses by key first element, and you know the dict keys are not changing meanwhile. Then you can extract the keys in a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile
ls = [x for x in d.keys()]
ls.sort() ##I do not know why but this seems faster than ls=sorted(d.keys())
def bysorted(n,ls):
    return list(takewhile(lambda x: x[0]==n, dropwhile(lambda x: x[0]!=n, ls)))
bysorted(5,ls) ##returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and take more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call.
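If the sorted key list is kept around, a binary search can locate the block without scanning from the start. The following is a sketch under the assumption that the first key element is an integer and the keys do not change after sorting; by_bisect is a hypothetical helper name.

```python
from bisect import bisect_left
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}
ls = sorted(d)  # sorted list of the (first, second) key tuples

def by_bisect(n, sorted_keys):
    # (n,) sorts before every (n, anything); (n + 1,) sorts after all of them
    lo = bisect_left(sorted_keys, (n,))
    hi = bisect_left(sorted_keys, (n + 1,))
    return sorted_keys[lo:hi]

block = by_bisect(5, ls)
print(block)
```

The search costs O(log n) comparisons, and the slice costs time proportional to the number of matching keys.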

numpy.unique gives wrong output for list of sets

I have a list of sets given by,
sets1 = [{1},{2},{1}]
When I find the unique elements in this list using numpy's unique, I get
np.unique(sets1)
Out[18]: array([{1}, {2}, {1}], dtype=object)
As can be seen, the result is wrong, as {1} is repeated in the output.
When I change the order in the input by making similar elements adjacent, this doesn't happen.
sets2 = [{1},{1},{2}]
np.unique(sets2)
Out[21]: array([{1}, {2}], dtype=object)
Why does this occur? Or is there something wrong in the way I have done it?
What happens here is that np.unique is built on NumPy's internal _unique1d helper, which itself uses the .sort() method.
Now, sorting a list of sets that contain only one integer in each set will not result in a list with each set ordered by the value of the integer present in the set. So we will have (and that is not what we want):
sets = [{1},{2},{1}]
sets.sort()
print(sets)
# > [{1},{2},{1}]
# ie. the list has not been "sorted" like we want it to
Now, as you have pointed out, if the list of sets is already ordered in the way you want, np.unique will work (since you would have sorted the list beforehand).
One specific solution (though, please be aware that it will only work for a list of sets that each contain a single integer) would then be:
np.unique(sorted(sets, key=lambda x: next(iter(x))))
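Note also that set is an unhashable type, so sets cannot be fed directly to hash-based tools, and that identity is not equality:
{1} is {1} # gives False, although {1} == {1} is True
You can use Python's collections.Counter if you can convert each set to a tuple, like below:
from collections import Counter
sets1 = [{1},{2},{1}]
Counter(tuple(sorted(a)) for a in sets1) # sorted() gives equal sets the same tuple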
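Another option, sketched here on the assumption that the order of first appearance is what matters, is to convert each set to a frozenset (which is hashable) and deduplicate with dict.fromkeys. The variable names are illustrative.

```python
sets1 = [{1}, {2}, {1}]

# frozenset is hashable, so it can serve as a dict key; convert back to set afterwards
unique_sets = [set(fs) for fs in dict.fromkeys(map(frozenset, sets1))]

print(unique_sets)  # [{1}, {2}]
```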

sort list of tuples with multiple criteria

I have a list of tuples of k elements. I'd like to sort with respect to element 0, then element 1 and so on and so forth. I googled but I still can't quite figure out how to do it. Would it be something like this?
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
In particular, I'd like to sort using different criteria, for example, descending on element 0, ascending on element 1 and so on.
Since python's sort is stable for versions after 2.2 (or perhaps 2.3), the easiest implementation I can think of is a serial repetition of sort using a series of index, reverse_value tuples:
# Specify the index, and whether reverse should be True/False
sort_spec = ((0, True), (1, False), (2, False), (3, True))
# Sort repeatedly from last tuple to the first, to have final output be
# sorted by first tuple, and ties sorted by second tuple etc
for index, reverse_value in sort_spec[::-1]:
    list_of_tuples.sort(key = lambda x: x[index], reverse=reverse_value)
This does multiple passes, so it may be inefficient in terms of constant-factor cost, but it is still O(n log n) in terms of asymptotic complexity.
If the sort order for indices is truly 0, 1... n-1, n for a list of n-sized tuples as shown in your example, then all you need is a sequence of True and False to denote whether you want reverse or not, and you can use enumerate to add the index.
sort_spec = (True, False, False, True)
for index, reverse_value in list(enumerate(sort_spec))[::-1]:
    list_of_tuples.sort(key = lambda x: x[index], reverse=reverse_value)
The original code, by contrast, allows the flexibility of sorting by any order of indices.
Incidentally, this "sequence of sorts" method is recommended in the Python Sorting HOWTO with minor modifications.
Edit
If you didn't have the requirement to sort ascending by some indices and descending by others, then
from operator import itemgetter
list_of_tuples.sort(key = itemgetter(1, 3, 5))
will sort by index 1, then ties will be sorted by index 3, and further ties by index 5. However, changing the ascending/descending order of each index is non-trivial in one-pass.
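For the special case where the elements are numbers, one pass is possible by negating the fields that should sort in descending order. This is a sketch under that assumption (it does not work for strings or other types that cannot be negated); the sample data is made up.

```python
list_of_tuples = [(1, 5, 3), (2, 4, 1), (1, 4, 2), (2, 4, 0)]

# descending on element 0, ascending on element 1, descending on element 2
list_of_tuples.sort(key=lambda x: (-x[0], x[1], -x[2]))

print(list_of_tuples)  # [(2, 4, 1), (2, 4, 0), (1, 4, 2), (1, 5, 3)]
```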
list.sort(key = lambda x : (x[0], x[1], ...., x[k-1]))
This is actually using the tuple as its own sort key. In other words, the same thing as calling sort() with no argument.
If I assume that you simplified the question, and the actual elements are actually not in the same order you want to sort by (for instance, the last value has the most precedence), you can use the same technique, but reorder the parts of the key based on precedence:
list.sort(key = lambda x : (x[k-1], x[1], ...., x[0]))
In general, this is a very handy trick, even in other languages like C++ (if you're using libraries): when you want to sort a list of objects by several members with varying precedence, you can construct a sort key by making a tuple containing all the relevant members, in the order of precedence.
Final trick (this one is off topic, but it may help you at some point): When using a library that doesn't support the idea of "sort by" keys, you can usually get the same effect by building a list that contains the sort key. So, instead of sorting a list of Obj, you would construct and then sort a list of tuples: (ObjSortKey, Obj). Also, just inserting the objects into a sorted set will work, if their sort key is unique. (The sort key would be the index, in that case.)
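The (ObjSortKey, Obj) trick can be sketched as follows. The records and field names are made up for illustration, and the extra index in the middle of each tuple is an assumption added so that ties on the key never fall through to comparing the objects themselves (dicts do not support <).

```python
records = [
    {"name": "pear", "price": 3},
    {"name": "apple", "price": 1},
    {"name": "fig", "price": 3},
]

# Decorate: (sort key, original position, object)
decorated = [(r["price"], i, r) for i, r in enumerate(records)]
decorated.sort()                      # plain sort, no key function needed
sorted_records = [r for _, _, r in decorated]  # undecorate

print([r["name"] for r in sorted_records])  # ['apple', 'pear', 'fig']
```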
So I am assuming you want to sort tuple_0 ascending, then tuple_1 descending, and so on. A bit verbose but this is what you might be looking for:
ctr = 0
for i in range(len(list_of_tuples)):
    # sorted() returns a list, so each tuple is replaced by a sorted list
    if ctr%2 == 0:
        list_of_tuples[i] = sorted(list_of_tuples[i])
    else:
        list_of_tuples[i] = sorted(list_of_tuples[i], reverse=True)
    ctr+=1
print(list_of_tuples)

Syntax of Lists in Python

I am learning Python, and I came across a code snippet which looks like this:
my_name={'sujit','amit','ajit','arijit'}
for i, names in enumerate(my_name):
    print("%s" % names[i])
OUTPUT
s
m
i
t
But when I modify the code as:
my_name=['sujit','amit','ajit','arijit']
for i, names in enumerate(my_name):
    print("%s" % names[i])
OUTPUT
s
m
i
j
What is the difference between {} and []? The [] gives me the desired result of printing the ith character of the current name from the list, but the use of {} does not.
{} creates a set, whereas [] creates a list. The key differences are:
the list preserves the order, whereas the set does not;
the list preserves duplicates, whereas the set does not;
the list can be accessed through indexing (i.e. l[5]), whereas the set can not.
The first point holds the key to your puzzle. When you use a list, the loop iterates over the names in order. When you're using a set, the loop iterates over the elements in an unspecified order, which in my Python interpreter happens to be sujit, amit, arijit, ajit.
P.S. {} can also be used to create a dictionary: {'a':1, 'b':2, 'c':3}.
The {} notation is set notation rather than list notation. That is basically the same as a list, but the items are stored in a jumbled up order, and duplicate elements are removed. (To make things even more confusing, {} is also dictionary syntax, but only when you use colons to separate keys and values -- the way you are using it, is a set.)
Secondly, you aren't using enumerate properly. (Or maybe you are, but I'm not sure...)
enumerate gives you corresponding index and value pairs. So enumerate(['sujit','amit','ajit','arijit']) gives you:
[(0, 'sujit'), (1, 'amit'), (2, 'ajit'), (3, 'arijit')]
So this will get you the first letter of "sujit", the second letter of "amit", and so on. Is that what you wanted?
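To make the pairing concrete, this short sketch (Python 3 syntax, using the list from the question) prints the pairs produced by enumerate and the character each loop iteration selects:

```python
my_name = ['sujit', 'amit', 'ajit', 'arijit']

pairs = list(enumerate(my_name))
letters = [name[i] for i, name in pairs]  # i-th character of the i-th name

print(pairs)    # [(0, 'sujit'), (1, 'amit'), (2, 'ajit'), (3, 'arijit')]
print(letters)  # ['s', 'm', 'i', 'j']
```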
{} do not enclose a list. They do not enclose any kind of sequence; they enclose (when used this way) a set (in the mathematical sense). The elements of a set do not have a specified order, so you get them enumerated in whatever order Python put them in. (It does this so that it can efficiently ensure the other important constraint on sets: they cannot contain a duplicate value).
Set literals of this form are available in Python 2.7 and 3.x. In earlier 2.x versions, {} cannot be used to create a set, but only to create a dict. (Dict syntax also works in Python 3.) To create a dict, you specify the key-value pairs separated by colons, thus: {'sujit': 'amit', 'ajit': 'arijit'}.
(Also, a general note: if you say "question" instead everywhere that you currently say "doubt", you will be wrong much less often, at least per the standards of English as spoken outside of India. I don't particularly understand how the overuse of 'doubt' has become so common in English as spoken by those from India, but I've seen it in many places across the Internet...)
sets do not preserve order:
[] is a list:
>>> print ['sujit','amit','ajit','arijit']
['sujit', 'amit', 'ajit', 'arijit']
{} is a set:
>>> print {'sujit','amit','ajit','arijit'}
set(['sujit', 'amit', 'arijit', 'ajit'])
So you get s,m,i,j in the first case; s,m,i,t in the second.

In Python, when to use a Dictionary, List or Set?

When should I use a dictionary, list or set?
Are there scenarios that are more suited for each data type?
A list keeps order, and a set doesn't (a dict preserves insertion order only since Python 3.7): when you care about order, therefore, you must use list (if your choice of containers is limited to these three, of course ;-) ).
dict associates each key with a value, while list and set just contain values: very different use cases, obviously.
set requires items to be hashable, list doesn't: if you have non-hashable items, therefore, you cannot use set and must instead use list.
set forbids duplicates, list does not: also a crucial distinction. (A "multiset", which maps duplicates into a different count for items present more than once, can be found in collections.Counter -- you could build one as a dict, if for some weird reason you couldn't import collections, or, in pre-2.7 Python as a collections.defaultdict(int), using the items as keys and the associated value as the count).
Checking for membership of a value in a set (or dict, for keys) is blazingly fast (taking about a constant, short time), while in a list it takes time proportional to the list's length in the average and worst cases. So, if you have hashable items, don't care either way about order or duplicates, and want speedy membership checking, set is better than list.
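A brief sketch of the last two points, using made-up sample data: a set answers membership questions and drops duplicates, while collections.Counter keeps a count per item and so behaves like a multiset.

```python
from collections import Counter

words = ["a", "b", "a", "c", "a"]

as_set = set(words)           # duplicates collapse: {'a', 'b', 'c'}
as_multiset = Counter(words)  # counts are kept per item

has_a = "a" in as_set         # hash-based membership test
count_a = as_multiset["a"]    # 3
count_z = as_multiset["z"]    # missing items count as 0

print(len(as_set), has_a, count_a, count_z)  # 3 True 3 0
```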
Do you just need an ordered sequence of items? Go for a list.
Do you just need to know whether or not you've already got a particular value, but without ordering (and you don't need to store duplicates)? Use a set.
Do you need to associate values with keys, so you can look them up efficiently (by key) later on? Use a dictionary.
When you want an unordered collection of unique elements, use a set. (For example, when you want the set of all the words used in a document).
When you want to collect an immutable ordered list of elements, use a tuple. (For example, when you want a (name, phone_number) pair that you wish to use as an element in a set, you would need a tuple rather than a list, since sets require their elements to be hashable, which lists are not).
When you want to collect a mutable ordered list of elements, use a list. (For example, when you want to append new phone numbers to a list: [number1, number2, ...]).
When you want a mapping from keys to values, use a dict. (For example, when you want a telephone book which maps names to phone numbers: {'John Smith' : '555-1212'}). Note that before Python 3.7 the keys in a dict were not guaranteed to keep any order (iterating through the telephone book could yield the names in any order); since 3.7, iteration follows insertion order.
Use a dictionary when you have a set of unique keys that map to values.
Use a list if you have an ordered collection of items.
Use a set to store an unordered set of items.
In short, use:
list - if you require an ordered sequence of items.
dict - if you require to relate values with keys
set - if you require to keep unique elements.
Detailed Explanation
List
A list is a mutable sequence, typically used to store collections of homogeneous items.
A list implements all of the common sequence operations:
x in l and x not in l
l[i], l[i:j], l[i:j:k]
len(l), min(l), max(l)
l.count(x)
l.index(x[, i[, j]]) - index of the 1st occurrence of x in l (at or after index i and before index j)
A list also implements all of the mutable sequence operations:
l[i] = x - item i of l is replaced by x
l[i:j] = t - slice of l from i to j is replaced by the contents of the iterable t
del l[i:j] - same as l[i:j] = []
l[i:j:k] = t - the elements of l[i:j:k] are replaced by those of t
del l[i:j:k] - removes the elements of l[i:j:k] from the list
l.append(x) - appends x to the end of the sequence
l.clear() - removes all items from l (same as del l[:])
l.copy() - creates a shallow copy of l (same as l[:])
l.extend(t) or l += t - extends l with the contents of t
l *= n - updates l with its contents repeated n times
l.insert(i, x) - inserts x into l at the index given by i
l.pop([i]) - retrieves the item at i and also removes it from l
l.remove(x) - remove the first item from l where l[i] is equal to x
l.reverse() - reverses the items of l in place
A list can be used as a stack by taking advantage of the methods append and pop.
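A minimal sketch of the stack usage, with illustrative values: append pushes onto the end of the list and pop removes the most recently pushed item.

```python
stack = []
stack.append(1)   # push
stack.append(2)
stack.append(3)

top = stack.pop()  # pop returns the last item pushed

print(top, stack)  # 3 [1, 2]
```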
Dictionary
A dictionary maps hashable values to arbitrary objects. A dictionary is a mutable object. The main operations on a dictionary are storing a value with some key and extracting the value given the key.
In a dictionary, you cannot use as keys values that are not hashable, that is, values containing lists, dictionaries or other mutable types.
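The hashability rule can be seen in a short sketch (the names and numbers are made up): a tuple of strings works as a key, while a list raises TypeError.

```python
phone_book = {("John", "Smith"): "555-1212"}  # a tuple of strings is hashable

try:
    phone_book[["Jane", "Doe"]] = "555-3434"  # a list is not hashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(list_key_ok, phone_book[("John", "Smith")])  # False 555-1212
```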
Set
A set is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
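The operations named above map onto set operators, as in this small sketch with illustrative values:

```python
a = {1, 2, 3}
b = {3, 4}

union = a | b                 # {1, 2, 3, 4}
intersection = a & b          # {3}
difference = a - b            # {1, 2}
symmetric_difference = a ^ b  # {1, 2, 4}

# Removing duplicates from a sequence (order is not preserved by the set itself)
deduped = sorted(set([3, 1, 3, 2, 1]))

print(union, intersection, difference, symmetric_difference, deduped)
```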
For C++ I was always having this flow chart in mind: In which scenario do I use a particular STL container?, so I was curious if something similar is available for Python3 as well, but I had no luck.
What you need to keep in mind for Python is: There is no single Python standard as for C++. Hence there might be huge differences for different Python interpreters (e.g. CPython, PyPy). The following flow chart is for CPython.
Additionally, I found no good way to incorporate the following data structures into the diagram: bytes, byte arrays, tuples, namedtuples, ChainMap, Counter, and arrays.
OrderedDict and deque are available via collections module.
heapq is available from the heapq module
LifoQueue, Queue, and PriorityQueue are available via the queue module which is designed for concurrent (threads) access. (There is also a multiprocessing.Queue available but I don't know the differences to queue.Queue but would assume that it should be used when concurrent access from processes is needed.)
dict, set, frozenset, and list are built-ins, of course.
I would be grateful to anyone who could improve this answer and provide a better diagram in every aspect. Feel free and welcome.
PS: the diagram has been made with yed. The graphml file is here
Although this doesn't cover sets, it is a good explanation of dicts and lists:
Lists are what they seem - a list of values. Each one of them is
numbered, starting from zero - the first one is numbered zero, the
second 1, the third 2, etc. You can remove values from the list, and
add new values to the end. Example: Your many cats' names.
Dictionaries are similar to what their name suggests - a dictionary.
In a dictionary, you have an 'index' of words, and for each of them a
definition. In python, the word is called a 'key', and the definition
a 'value'. The values in a dictionary aren't numbered - they aren't in
any specific order, either - the key does the same thing. You can add,
remove, and modify the values in dictionaries. Example: telephone book.
http://www.sthurlow.com/python/lesson06/
Alongside lists, dicts and sets, there is another interesting Python object: OrderedDict.
Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.
OrderedDicts could be useful when you need to preserve the order of the keys, for example working with documents: It's common to need the vector representation of all terms in a document. So using OrderedDicts you can efficiently verify if a term has been read before, add terms, extract terms, and after all the manipulations you can extract the ordered vector representation of them.
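The document-terms use case can be sketched as follows. The sentence is made up, and assigning each term its position on first sight is one assumed way of building the vector index.

```python
from collections import OrderedDict

vocab = OrderedDict()
for term in "the cat saw the other cat".split():
    # setdefault keeps the first index assigned to a term and ignores repeats
    vocab.setdefault(term, len(vocab))

terms = list(vocab)  # terms in the order they were first seen

print(terms, vocab["saw"])  # ['the', 'cat', 'saw', 'other'] 2
```

On Python 3.7+ a plain dict would give the same ordering here; OrderedDict additionally offers order-aware methods such as move_to_end.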
May be off topic in terms of the question OP asked-
List: An unhashable collection of ordered, mutable objects.
Tuple: A hashable collection of ordered, immutable objects, like a list (hashable provided its elements are hashable).
Set: An unhashable collection of unordered, mutable and distinct objects.
Frozenset: A hashable collection of unordered, immutable and distinct objects.
Dictionary: An unhashable collection of mutable objects that maps hashable keys to arbitrary values (unordered before Python 3.7, insertion-ordered since).
To compare them visually, at a glance, see the image-
Lists are what they seem - a list of values. Each one of them is numbered, starting from zero - the first one is numbered zero, the second 1, the third 2, etc. You can remove values from the list, and add new values to the end. Example: Your many cats' names.
Tuples are just like lists, but you can't change their values. The values that you give it first up, are the values that you are stuck with for the rest of the program. Again, each value is numbered starting from zero, for easy reference. Example: the names of the months of the year.
Dictionaries are similar to what their name suggests - a dictionary. In a dictionary, you have an 'index' of words, and for each of them a definition. In python, the word is called a 'key', and the definition a 'value'. The values in a dictionary aren't numbered - they aren't in any specific order, either - the key does the same thing. You can add, remove, and modify the values in dictionaries. Example: telephone book.
When using them, I made a cheatsheet of their methods for your reference:
class ContainerMethods:
    def __init__(self):
        self.list_methods_11 = {
            'Add':{'append','extend','insert'},
            'Subtract':{'pop','remove'},
            'Sort':{'reverse', 'sort'},
            'Search':{'count', 'index'},
            'Entire':{'clear','copy'},
        }
        self.tuple_methods_2 = {'Search':{'count','index'}}
        self.dict_methods_11 = {
            'Views':{'keys', 'values', 'items'},
            'Add':{'update'},
            'Subtract':{'pop', 'popitem',},
            'Extract':{'get','setdefault',},
            'Entire':{ 'clear', 'copy','fromkeys'},
        }
        # Groups are tuples because lists cannot be stored inside a set
        self.set_methods_17 ={
            'Add':{('add', 'update'),('difference_update','symmetric_difference_update','intersection_update')},
            'Subtract':{'pop', 'remove','discard'},
            'Relation':{'isdisjoint', 'issubset', 'issuperset'},
            'operation':{'union', 'intersection','difference', 'symmetric_difference'},
            'Entire':{'clear', 'copy'}}
Dictionary: A python dictionary is used like a hash table with key as index and object as value.
List: A list is used for holding objects in an array indexed by position of that object in the array.
Set: A set is a collection with functions that can tell if an object is present or not present in the set.
Dictionary: When you want to look up something using something other than indexes. Example:
dictionary_of_transport = {
    "cars": 8,
    "boats": 2,
    "planes": 0
}
print("I have the following amount of planes:")
print(dictionary_of_transport["planes"])
#Output: 0
Lists and sets: When you want to add and remove values.
Lists: To look up values using indexes.
Sets: To store values that you cannot access by index or key; you can only test whether a value is present, or iterate over all of them.
