Tuple-key dictionary in Python: Accessing a whole block of entries

I am looking for an efficient Python way to use a hash table whose keys are two-part tuples:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use a list comprehension (strictly, a generator expression), i.e.:
sum(val for key, val in d.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying with the whole key, where I either get a value or nothing. Can Python still return the matching values in time proportional to the number of key:value pairs that match? Or alternatively, is the tuple-keyed dictionary (plus the comprehension) better than pandas.DataFrame.groupby() (which would take up rather a lot of memory)?

The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  ## returns a list of tuples where x[0] == 5
In similar situations I have often used next() to iterate manually instead, when I didn't need the full list, as in the sketch below.
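A minimal sketch of that next() pattern (reusing d from above):
first = next((k for k in d if k[0] == 5), None)  ## first key with x[0] == 5, or None if there is none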
However, there are some use cases where we can optimize that. Suppose you need to do several lookups by the first element of the key, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list, sort it, and make use of two itertools functions, dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = list(d.keys())
ls.sort()  ## I do not know why, but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and takes more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to pass a key parameter to the sort() call, as sketched below.
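A sketch of that variant (same idea, but sorted and matched on the second element):
ls2 = sorted(d.keys(), key=lambda x: x[1])  ## sort on the second tuple element

def bysorted2(n, ls2):
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls2)))

bysorted2(7, ls2)  ## returns all keys where x[1] == 7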

Python: Create sorted list of keys moving one key to the head

Is there a more Pythonic way of obtaining a sorted list of dictionary keys with one key moved to the head? So far I have this:
# create a unique list of keys headed by 'event' and followed by a sorted list.
# dfs is a dict of dataframes.
for k in dict.fromkeys(['event'] + sorted(dfs)):
    display(k, dfs[k])  # ideally this should be (k, v)
I suppose you would be able to do
for k, v in [('event', dfs['event'])] + sorted(kv for kv in dfs.items() if kv[0] != 'event'):
.items() gives the key/value pairs (technically a dict_items view, not a list, which is why it normally has to be cast to a list before concatenating; here sorted() already returns a plain list). Iterating through a list of tuples allows for automatic unpacking (so you can do k, v in list instead of tup in list).
What we really want is an iterator, but that's not possible with sorted, because it must see all the keys before it knows what the first item should be.
Using dict.fromkeys to create a blank dictionary that keeps insertion order was pretty clever, but keep in mind that dicts preserving insertion order is only a language guarantee from Python 3.7 onward (in CPython 3.6 it was an implementation detail). I admit, it took me a while to figure out that line.
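For reference, the trick in isolation:
list(dict.fromkeys(['event', 'b', 'a', 'event']))  # ['event', 'b', 'a'] -- deduplicates, keeps first-seen order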
Since the code you posted is just working with the keys, I suggest you focus on that. Taking up a few more lines for readability is a good thing, especially if we can hide it in a testable function:
def display_by_keys(dfs, priority_items=None):
    if priority_items is None:
        priority_items = ['event']
    featured = [k for k in priority_items if k in dfs]  # a list, so the priority order survives
    others = {k for k in dfs.keys() if k not in featured}
    for key in featured + sorted(others):
        display(key, dfs[key])
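Hypothetical usage, assuming dfs as in the question:
display_by_keys(dfs)                                  # 'event' first, the rest sorted
display_by_keys(dfs, priority_items=['event', 'id'])  # several featured keys, in the given order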
The potential downside is you must sort the keys every time. If you do this much more often than the data store changes, on a large data set, that's a potential concern.
Of course you wouldn't be displaying a really large result, but if it becomes a problem, then you'll want to store them in a collections.OrderedDict (https://stackoverflow.com/a/13062357/1766544) or find a sorteddict module.
from collections import OrderedDict
# sort once
ordered_dfs = OrderedDict.fromkeys(sorted(dfs.keys()))
ordered_dfs.move_to_end('event', last=False)
ordered_dfs.update(dfs)
# display as often as you need
for k, v in ordered_dfs.items():
    print(k, v)
If you display different fields first in different views, that's not a problem. Just sort all the fields normally, and use a function like the one above, without the sort.

numpy.unique has a problem with frozensets

Just run the code:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
print(set(a)) # out: {frozenset({3, 4}), frozenset({1, 2})}
print(np.unique(a)) # out: [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
The first output is correct, the second is not.
The problem is exactly here:
a[0] == a[-1]  # out: True
But the result of np.unique has 3 elements, not 2.
I have been using np.unique to deal with duplicates (e.g. using return_index=True and others). What can you advise me to use instead of np.unique for these purposes?
numpy.unique operates by sorting, then collapsing runs of identical elements. Per the doc string:
Returns the sorted unique elements of an array.
The "sorted" part implies it's using a sort-collapse-adjacent technique (similar to what the *NIX sort | uniq pipeline accomplishes).
The problem is that while frozenset does define __lt__ (the overload for <, which most Python sorting algorithms use as their basic building block), it's not using it for the purposes of a total ordering like numbers and sequences use it. It's overloaded to test "is a proper subset of" (not including direct equality). So frozenset({1,2}) < frozenset({3,4}) is False, and so is frozenset({3,4}) > frozenset({1,2}).
Because the expected sort invariant is broken, sorting sequences of set-like objects produces implementation-specific and largely useless results, and uniquifying strategies based on sorting will typically fail under those conditions. One possible result is that the sort finds the sequence to be already in order, or in reverse order (since each element is "less than" both the prior and subsequent elements); if it decides the input is in order, nothing changes, and if it decides it is in reverse order, it reverses it (which in this case is indistinguishable from preserving order). Then it removes adjacent duplicates (post-sort, all duplicates should be grouped together), finds none (the duplicates aren't adjacent), and returns the original data.
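You can see the broken invariant directly (with CPython's sort, which leaves the order untouched here because no element ever compares "less than" its neighbor):
a = [frozenset({1,2}), frozenset({3,4}), frozenset({1,2})]
sorted(a)  # [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})] -- "sorted", duplicates still apart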
For frozensets, you probably want to use hash based uniquification, e.g. via set or (to preserve original order of appearance on Python 3.7+), dict.fromkeys; the latter would be simply:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = list(dict.fromkeys(a)) # Works on CPython/PyPy 3.6 as implementation detail, and on 3.7+ everywhere
It's also possible to use sort-based uniquification, but numpy.unique doesn't seem to support a key function, so it's easier to stick to Python built-in tools:
from itertools import groupby # With no key argument, can be used much like uniq command line tool
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = [k for k, _ in groupby(sorted(a, key=sorted))]
That second line is a little dense, so I'll break it up:
* sorted(a, key=sorted) returns a new list based on a, where each element is ordered by its sorted-list form (so the < comparison actually does put like with like).
* groupby(...) returns an iterator of key/group-iterator pairs. With no key argument to groupby, each key is simply a unique value, and the group-iterator produces that value as many times as it was seen.
* [k for k, _ in ...] ignores the group-iterators (we don't care how many times each duplicate value was seen; assigning to _ means "ignored" by convention) and has the list comprehension produce only the keys, i.e. the unique values.
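If you also relied on return_index=True, a plain dict can recover the index of each first occurrence while uniquifying; a minimal sketch (not a drop-in np.unique replacement):
a = [frozenset({1,2}), frozenset({3,4}), frozenset({1,2})]
first_seen = {}
for i, x in enumerate(a):
    first_seen.setdefault(x, i)    # record only the index of the first occurrence
uniq = list(first_seen)            # [frozenset({1, 2}), frozenset({3, 4})]
index = list(first_seen.values())  # [0, 1]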

How to calculate the Euclidean of a dictionary with a tuple as key

I have created a matrix by using a dictionary with a tuple as the key (e.g. {(user, place) : 1 } )
I need to calculate the Euclidean norm for each place in the matrix.
I've created a method to do this, but it is extremely inefficient because it iterates through the entire matrix for each place.
def calculateEuclidian(self, place):
    count = 0
    for key, value in self.matrix.items():
        if key[1] == place and value == 1:
            count += 1
    return math.sqrt(count)
Is there a way to do this more efficiently?
I need the result to be in a dictionary with the place as key and the Euclidean norm as the value.
You can use a dictionary comprehension with a generator expression (no intermediate list is built), accumulating the result of the conditionals (0 or 1) to get the count and then taking the square root:
def calculateEuclidian(self, place):
    return {place: math.sqrt(sum(p == place and val == 1 for (_, p), val in self.matrix.items()))}
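Since you want the result keyed by place anyway, one pass over the matrix can produce the values for all places at once; a sketch using collections.Counter (same self.matrix layout assumed, the method name is illustrative):
import math
from collections import Counter

def allEuclidean(self):
    # count the 1-valued entries per place in a single pass
    counts = Counter(place for (_, place), val in self.matrix.items() if val == 1)
    return {place: math.sqrt(n) for place, n in counts.items()}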
With your current data structure, I doubt there is any way you can avoid iterating through the entire dictionary.
If you cannot use another way (or an auxiliary way) of representing your data, iterating through every element of the dict is as efficient as you can get (asymptotically), since there is no way to ask a dict with tuple keys to give you all elements with keys matching (_, place) (where _ denotes "any value"). There are other, and more succinct, ways of writing the iteration code, but you cannot escape the asymptotic efficiency limitation.
If this is your most common operation, and you can in fact use another way of representing your data, you can use a dict[Place, list[User]] instead. That way, you can, in O(1) time, get the list of all users at a certain place, and all you would need to do is count the items in the list using the len(...) function which is also O(1). Obviously, you'll still need to take the sqrt in the end.
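A minimal sketch of that reshaping (the names are illustrative, not from the question):
import math
from collections import defaultdict

by_place = defaultdict(list)  # place -> users whose value is 1
for (user, place), val in matrix.items():
    if val == 1:
        by_place[place].append(user)

def euclidean(place):
    return math.sqrt(len(by_place[place]))  # O(1): dict lookup plus len()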
There may be ways to make it more Pythonic, but I do not think you can change the overall complexity, since you are making a query based on both key and value. I think you have to search the whole matrix for your instances.
You may want to derive, from your current dictionary (which isn't adapted to this kind of search), a second dictionary with place as key and a list of (user, value) tuples as values.
Get the tuple list under the place key (that'll be fast), then count the entries where value is 1 (linear, but over a small set of data).
Keep the original dictionary alongside it; just hope that you don't change the data too often in the program, because you'd need to keep both dicts in sync.

Python dict key delete if pattern match with other dict key

Delete a Python dict key if its pattern matches a key in another dict.
e.g.
a={'a.b.c.test':1, 'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
b={'a.b.c':1, 'b.p.q':20}
result:
a = {'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}
If "pattern match with other dict key" means "starts with any key in the other dict", the most direct way to write that would be like this:
a = {k: v for (k, v) in a.items() if not any(k.startswith(k2) for k2 in b)}
If that's hard to follow at first glance, it's basically the equivalent of this:
def matches(key1, d2):
    for key2 in d2:
        if key1.startswith(key2):
            return True
    return False

c = {}
for key in a:
    if not matches(key, b):
        c[key] = a[key]
a = c
This is going to be slower than necessary. If a has N keys, and b has M keys, the time taken is O(NM). While you can check "does key k exist in dict b" in constant time, there's no way to check "does any key starting with k exist in dict b" without iterating over the whole dict. So, if b is potentially large, you probably want to search sorted(b.keys()) with a binary search, which gets the time down to O(N log M). But if this isn't a bottleneck, you may be better off sticking with the simple version, just because it's simple.
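Because these keys are dot-separated, there is also an alternative to the binary search: test each dotted prefix of a key against a set of b's keys, which is O(L) hash lookups per key (L being the number of segments; note this matches whole segments only, which is slightly stricter than startswith). A sketch:
b_keys = set(b)  # O(1) membership tests

def has_matching_prefix(key, prefixes):
    # check 'a', 'a.b', 'a.b.c', ... against the prefix set
    parts = key.split('.')
    return any('.'.join(parts[:i]) in prefixes for i in range(1, len(parts) + 1))

a = {k: v for k, v in a.items() if not has_matching_prefix(k, b_keys)}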
Note that I'm generating a new a with the matches filtered out, rather than deleting the matches. This is almost always a better solution than deleting in-place, for multiple reasons:
* It's much easier to reason about. Treating objects as immutable and doing pure operations on them means you don't need to think about how states change over time. For example, the naive way to delete in place would run into the problem that you're changing the dictionary while iterating over it, which will raise an exception. Issues like that never come up without mutable operations.
* It's easier to read, and (once you get the hang of it) even to write.
* It's almost always faster. (One reason is that it takes a lot more memory allocations and deallocations to repeatedly modify a dictionary than to build one with a comprehension.)
The one tradeoff is memory usage. The delete-in-place implementation has to make a copy of all of the keys; the built-a-new-dict implementation has to have both the filtered dict and the original dict in memory. If you're keeping 99% of the values, and the values are much larger than the keys, this could hurt you. (On the other hand, if you're keeping 10% of the values, and the values are about the same size as the keys, you'll actually save space.) That's why it's "almost always" a better solution, rather than "always".
for key in list(a.keys()):  # list(...) so we can delete entries while iterating
    if any(key.startswith(k) for k in b):
        del a[key]
Replace key.startswith(k) with an appropriate condition for "matching".
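If "matching" really means a regular-expression pattern rather than a literal prefix, the same loop works with precompiled patterns; a hypothetical sketch that treats each key of b as a regex:
import re

patterns = [re.compile(p) for p in b]
for key in list(a.keys()):
    if any(p.match(key) for p in patterns):
        del a[key]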
c = {}  # result in dict c
for key in b.keys():
    # the key from b must not occur as a substring in any key of a
    if all(key not in z for z in a.keys()):
        c[key] = b[key]

Sorting a dict on __iter__

I am trying to sort a dict based on its key and return an iterator to the values from within an overridden iter method in a class. Is there a nicer and more efficient way of doing this than creating a new list, inserting into the list as I sort through the keys?
How about something like this:
def itersorted(d):
    for key in sorted(d):
        yield d[key]
By far the easiest approach, and almost certainly the fastest, is something along the lines of:
def sorted_dict(d):
    keys = list(d.keys())  # in Python 3, d.keys() is a view, so make a real list first
    keys.sort()
    for key in keys:
        yield d[key]
You can't sort without fetching all keys. Fetching all keys into a list and then sorting that list is the most efficient way to do that; list sorting is very fast, and fetching the keys list like that is as fast as it can be. You can then either create a new list of values or yield the values as the example does. Keep in mind that you can't modify the dict if you are iterating over it (the next iteration would fail) so if you want to modify the dict before you're done with the result of sorted_dict(), make it return a list.
def sortedDict(dictobj):
    return (value for key, value in sorted(dictobj.items()))
This will create a single intermediate list, since sorted() returns a real list. But at least it's only one.
Assuming you want the default sort order, you can use sorted(list) or list.sort(). If you want your own sort logic, Python lists support sorting based on a function you pass in: in Python 3 that is a key function, and an old-style comparison function like the one below can still be plugged in via functools.cmp_to_key. For example, the following sorts numbers from least to greatest (the default behavior) using a comparison function.
from functools import cmp_to_key

def compare_two(a, b):
    if a > b:
        return 1
    if a == b:
        return 0
    return -1

a = [3, 1, 2]
a.sort(key=cmp_to_key(compare_two))
print(a)  # [1, 2, 3]
This approach is conceptually a bit cleaner than manually creating a new list and appending the new values, and it allows you to control the sort logic.
