Efficiently mapping unhashable objects to their index in a list - python

A Python list
f = [x0, x1, x2]
may be seen as an efficient representation of a mapping from [0, 1, ..., len(f) - 1] to the set of its elements. By "efficient" I mean that f[i] returns the element associated with i in O(1) time.
The inverse mapping may be defined as follows:
class Inverse:
    def __init__(self, f):
        self.f = f
    def __getitem__(self, x):
        return self.f.index(x)
This works, but Inverse(f)[x] takes O(n) time on average.
Alternatively, one may use a dict:
f_inv = {x: i for i, x in enumerate(f)}
This has O(1) average time complexity, but it requires the objects in the list to be hashable.
Is there a way to define an inverse mapping that provides equality-based lookups, in O(1) average time, with unhashable objects?
Edit: sample input and expected output:
>>> f = [x0, x1, x2]
>>> f_inv = Inverse(f) # this is to be defined
>>> f_inv[x0] # in O(1) time
0
>>> f_inv[x2] # in O(1) time
2

You can create an associated dictionary mapping the object IDs back to the list index.
The obvious disadvantage is that you will have to search the index for the identical object, not for an object that is merely equal.
On the upside, by creating a custom MutableSequence class using collections.abc, you can, with minimal code, write a class that keeps your data both as a sequence and as the reverse dictionary.
from collections.abc import MutableSequence
from threading import RLock
class MD(dict):
    # No need for a full MutableMapping subclass, as the use is limited.
    # Every key is translated through id(), so lookups are identity-based
    # and unhashable objects work fine. (Translating on writes as well as
    # reads keeps in-place operations like "self.reverse[obj] -= 1"
    # consistent.)
    def __getitem__(self, key):
        return super().__getitem__(id(key))
    def __setitem__(self, key, value):
        super().__setitem__(id(key), value)
    def __delitem__(self, key):
        super().__delitem__(id(key))
class Reversible(MutableSequence):
    def __init__(self, args):
        self.seq = list()
        # NB: this attribute shadows the reverse() mixin method
        # provided by MutableSequence
        self.reverse = MD()
        self.lock = RLock()
        for element in args:
            self.append(element)
    def __getitem__(self, index):
        return self.seq[index]
    def __setitem__(self, index, value):
        with self.lock:
            del self.reverse[self.seq[index]]
            self.seq[index] = value
            self.reverse[value] = index
    def __delitem__(self, index):
        if index < 0:
            index += len(self)
        with self.lock:
            # Decrease all mapped indexes after the removed element
            for obj in self.seq[index + 1:]:
                self.reverse[obj] -= 1
            del self.reverse[self.seq[index]]
            del self.seq[index]
    def __len__(self):
        return len(self.seq)
    def insert(self, index, value):
        if index < 0:
            index += len(self)
        with self.lock:
            # Increase all mapped indexes from the insertion point onward
            for obj in self.seq[index:]:
                self.reverse[obj] += 1
            self.seq.insert(index, value)
            self.reverse[value] = index
And voilà: just use this object in place of your list, and use the public attribute "reverse" to look up the index of an object by identity.
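For instance, a quick usage sketch (my illustration, not part of the original answer):

r = Reversible([[1], [2], [3]])   # plain lists are unhashable
x = r[1]
print(r.reverse[x])   # 1 -- identity-based lookup in O(1) average time
r.insert(0, [0])
print(r.reverse[x])   # 2 -- the mapped index was kept in sync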
Note that you can augment the "intelligence" of the "MD" class by trying different strategies, such as using the objects themselves as keys when they are hashable, and only resorting to id() (or some custom key based on other object attributes) when needed. That way you could mitigate the need for the lookup to be for the very same object.
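A minimal sketch of that hybrid idea (my illustration, not the original code; in theory a hashable key could collide with some object's id(), so a production version might keep two separate mappings):

class HybridMD(dict):
    # Use the object itself as the key when it is hashable (giving
    # equality-based lookup), and fall back to id() only when it is not.
    @staticmethod
    def _key(obj):
        try:
            hash(obj)
            return obj
        except TypeError:
            return id(obj)
    def __getitem__(self, key):
        return super().__getitem__(self._key(key))
    def __setitem__(self, key, value):
        super().__setitem__(self._key(key), value)
    def __delitem__(self, key):
        super().__delitem__(self._key(key))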
So, for ordinary operations on the list, the Reversible class keeps the reverse dictionary synchronized. There is no support for slice indexing, though.
For more information, check the docs at https://docs.python.org/3/library/collections.abc.html

Unfortunately you're stuck with an algorithm limitation here. Fast lookup structures, like hash tables or binary trees, are efficient because they put objects in particular buckets or order them based on their values. This requires them to be hashable or comparable consistently for the entire time you are storing them in this structure, otherwise a lookup is very likely to fail.
If the objects you need are mutable (usually the reason they are not hashable) then any time an object you are tracking changes you need to update the data structure. The safest way to do this is to create immutable objects. If you need to change an object, then create a new one, remove the old one from the dictionary, and insert the new object as a key with the same value.
The operations here are still O(1) with respect to the size of the dictionary, you just need to consider whether the cost of copying objects on every change is worth it.
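A minimal sketch of that pattern (the frozen dataclass Point is a made-up example type; frozen dataclasses are immutable and hashable out of the box):

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Point:          # hypothetical element type
    x: float
    y: float

f = [Point(0, 0), Point(1, 2), Point(3, 4)]
f_inv = {p: i for i, p in enumerate(f)}

# "Mutating" an element: build a new object and update both structures.
i = f_inv.pop(f[1])
f[i] = replace(f[1], y=5.0)
f_inv[f[i]] = i

assert f_inv[Point(3, 4)] == 2   # equality-based lookup, O(1) on average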

Related

Python - overriding a method based on specific access type

I have a situation where I need to create a dictionary that keeps track of the global order of the values. I haven't been able to find a good way for the class itself to have an incrementing counter that's also tracked by the value.
Here's what I've written in the meanwhile to get around this:
from collections import defaultdict

class NotMyDict(object):
    """defaultdict(list) that tracks order globally across the dict.

    Will function as a normal defaultdict(list) unless you modify the
    'ordered' attribute and set it to a non-false evaluating value, in
    which case (value, index) pairs are returned instead.
    """
    ordered = False

    def __init__(self):
        # Instance state; a class-level defaultdict would be shared
        # between every instance.
        self._data = defaultdict(list)
        self._next_index = 0

    def __repr__(self):
        if self.ordered:
            return repr(self._data)
        else:
            temp = defaultdict(list)
            for key in self._data:
                for value in self._data[key]:
                    temp[key].append(value[0])
            return repr(temp)

    def __getitem__(self, key):
        if self.ordered:
            return self._data[key]
        else:
            return [val[0] for val in self._data[key]]

    def add_value_to_key(self, key, value):
        self._data[key].append((value, self._next_index))
        self._next_index += 1
So I can use this like a normal dictionary for pulling values. I could have instantiated a list if the key didn't exist, but defaultdict was simple and easy.
Here's an example of the use:
test = NotMyDict()
test.add_value_to_key('test', 'hi')
test.add_value_to_key('test', 'there')
test.add_value_to_key('test', 'buddy')
test['test']
Result:
['hi', 'there', 'buddy']
test.ordered = True
test['test']
Result:
[('hi', 0), ('there', 1), ('buddy', 2)]
Now - the example of use isn't super important, but the functionality I can't seem to figure out is this: instead of using .add_value_to_key(), I want to be able to use the normal defaultdict(list) convention of:
dict[key].append()
and still have it track the index. Do I need to pass global object locations with id() and reference those objects at a memory level, or is there a way I just don't understand to have a "class global" that's accessible by its members?
I had also tried to use nested classes, but the nested class didn't have access to the parent class's globals, so I'd have to:
Make a list-like class that references the parent class attribute somehow (maybe with id() and a direct memory location reference?)
modify/make its append() function so that it also updates the parent class's global counter, and tracks the value with this counter as a metadata field (see the sketch after this question).
I really just can't seem to wrap my head around how to create this object/class in a way that lets me use the same functionality of a defaultdict(list), where I can index/append directly AND have it track the global index order of that new value.
dict[key].append(value)
Help would be appreciated - I sunk three hours into trying different solutions before I scrapped it and went with the "just use this method to append" for now.
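A plain instance reference is enough here; no id() tricks or memory-level access are needed. Here is a minimal sketch of the nested list-like class described above (my names and code, not from the question):

class TrackingList(list):
    # A list that stamps appended values with its owner's global counter.
    def __init__(self, owner):
        super().__init__()
        self._owner = owner   # an ordinary reference to the parent dict

    def append(self, value):
        super().append((value, self._owner._next_index))
        self._owner._next_index += 1

class OrderTrackingDict(dict):
    def __init__(self):
        super().__init__()
        self._next_index = 0

    def __missing__(self, key):
        # auto-create a TrackingList on first access, like defaultdict
        value = TrackingList(self)
        self[key] = value
        return value

d = OrderTrackingDict()
d['test'].append('hi')
d['test'].append('there')
print(d['test'])   # [('hi', 0), ('there', 1)]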

How to set up a class that has all the methods of, and functions like, a built-in such as float, but holds onto extra data?

I am working with 2 data sets on the order of ~ 100,000 values. These 2 data sets are simply lists. Each item in the lists is an instance of a small class.
class Datum(object):
    def __init__(self, value, dtype, source, index1=None, index2=None):
        self.value = value
        self.dtype = dtype
        self.source = source
        self.index1 = index1
        self.index2 = index2
For each datum in one list, there is a matching datum in the other list that has the same dtype, source, index1, and index2, which I use to sort the two data sets such that they align. I then do various work with the matching data points' values, which are always floats.
Currently, if I want to determine the relative values of the floats in one data set, I do something like this.
minimum = min([x.value for x in data])
for datum in data:
    datum.value -= minimum
However, it would be nice to have my custom class inherit from float, and be able to act like this.
minimum = min(data)
data = [x - minimum for x in data]
I tried the following.
class Datum(float):
    def __new__(cls, value, dtype, source, index1=None, index2=None):
        new = float.__new__(cls, value)
        new.dtype = dtype
        new.source = source
        new.index1 = index1
        new.index2 = index2
        return new
However, doing
data = [x - minimum for x in data]
removes all of the extra attributes (dtype, source, index1, index2).
How should I set up a class that functions like a float, but holds onto the extra data that I instantiate it with?
UPDATE: I do many types of mathematical operations beyond subtraction, so rewriting all of the methods that work with a float would be very troublesome, and frankly I'm not sure I could rewrite them properly.
I suggest subclassing float and using a couple decorators to "capture" the float output from any method (except for __new__ of course) and returning a Datum object instead of a float object.
First we write the method decorator (which really isn't being used as a decorator below, it's just a function that modifies the output of another function, AKA a wrapper function):
def mydecorator(f, cls):
    # f is the method being modified, cls is its class (in this case, Datum)
    def func_wrapper(*args, **kwargs):
        # *args and **kwargs are all the arguments that were passed to f
        newvalue = f(*args, **kwargs)
        # newvalue now contains the output float would normally produce
        ## Now get the cls instance provided as part of args (we need one
        ## if we're going to reattach instance information later):
        try:
            self = args[0]
            ## Now check to make sure newvalue is an instance of some numerical
            ## type, but NOT a bool or a cls type (which might lead to recursion).
            ## Including ints so things like modulo and round will work right.
            if (isinstance(newvalue, (float, int))
                    and not isinstance(newvalue, bool)
                    and type(newvalue) != cls):
                ## If newvalue is a float or int, now we make a new cls instance
                ## using newvalue for value and the previous self instance
                ## information (args[0]) for the other fields
                return cls(newvalue, self.dtype, self.source,
                           self.index1, self.index2)
        # IndexError raised if no args provided,
        # AttributeError raised if self isn't a cls instance
        except (IndexError, AttributeError):
            pass
        ## If newvalue isn't numerical, or we don't have a self, just return
        ## what float would normally return
        return newvalue
    # the function has now been modified and we return the modified version
    # to be used instead of the original version, f
    return func_wrapper
The first decorator only applies to a method to which it is attached. But we want it to decorate all (actually, almost all) the methods inherited from float (well, those that appear in the float's __dict__, anyway). This second decorator will apply our first decorator to all of the methods in the float subclass except for those listed as exceptions (see this answer):
def for_all_methods_in_float(decorator, *exceptions):
    def decorate(cls):
        for attr in float.__dict__:
            if callable(getattr(float, attr)) and attr not in exceptions:
                setattr(cls, attr, decorator(getattr(float, attr), cls))
        return cls
    return decorate
Now we write the subclass much the same as you had before, but decorated, and excluding __new__ from decoration (I guess we could also exclude __init__ but __init__ doesn't return anything, anyway):
@for_all_methods_in_float(mydecorator, '__new__')
class Datum(float):
    def __new__(klass, value, dtype="dtype", source="source",
                index1="index1", index2="index2"):
        return super(Datum, klass).__new__(klass, value)
    def __init__(self, value, dtype="dtype", source="source",
                 index1="index1", index2="index2"):
        self.value = value
        self.dtype = dtype
        self.source = source
        self.index1 = index1
        self.index2 = index2
        super(Datum, self).__init__()
Here are our testing procedures; iteration seems to work correctly:
d1 = Datum(1.5)
d2 = Datum(3.2)
d3 = d1+d2
assert d3.source == 'source'
L=[d1,d2,d3]
d4=max(L)
assert d4.source == 'source'
L = [i for i in L]
assert L[0].source == 'source'
assert type(L[0]) == Datum
minimum = min(L)
assert [x - minimum for x in L][0].source == 'source'
Notes:
I am using Python 3. Not certain if that will make a difference for you.
This approach effectively overrides EVERY method of float other than the exceptions, even the ones for which the result isn't modified. There may be side effects to this (subclassing a built-in and then overriding all of its methods), e.g. a performance hit or something; I really don't know.
This will also decorate nested classes.
This same approach could also be implemented using a metaclass.
The problem is when you do:
x - minimum
in terms of types, you are doing either:
datum - float, or datum - integer
Either way, Python doesn't know how to do either of them, so it looks at the parent classes of the arguments if it can. Since datum is a type of float, it can easily use float, and the calculation ends up being
float - float
which will obviously result in a float; Python has no way of knowing how to construct your datum object unless you tell it.
To solve this you either need to implement the mathematical operators so that Python knows how to do datum - float, or come up with a different design.
Assuming that dtype, source, index1 and index2 need to stay the same after a calculation, then as an example your class needs:
def __sub__(self, other):
    return datum(self.value - other, self.dtype, self.source,
                 self.index1, self.index2)
this should work - not tested
and this will now allow you to do this
d = datum(23.0, dtype="float", source="me", index1=1)
e = d - 16
print e.value, e.dtype, e.source, e.index1, e.index2
which should result in:
7.0 float me 1 None
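For completeness, the reflected operator can be handled the same way, so that 16 - d also works; a sketch under the same assumptions (and, like the answer's own example, untested):

def __rsub__(self, other):
    # handles the mirrored case: 16 - d
    return datum(other - self.value, self.dtype, self.source,
                 self.index1, self.index2)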

Pythonic slicing of nested attributes

I am dealing with classes whose attributes are sometimes lists whose elements can be dictionaries or further nested objects with attributes, etc. I would like to perform some slicing that, with my grasp of Python, is only doable with what feels like profoundly un-Pythonic code.
My minimal code looks like this:
class X(object):
    def __init__(self):
        self.a = []
x=X()
x.a.append({'key1':'v1'})
x.a.append({'key1':'v2'})
x.a.append({'key1':'v3'})
# this works as desired
x.a[0]['key1'] # 'v1'
I would like to access a key in the nested dictionary, but make that call for all elements of the list containing those dictionaries. The standard Python way of doing this would be a list comprehension à la:
[v['key1'] for v in x.a]
However, my minimal example doesn't quite convey the full extent of nesting in my real-world scenario: The attribute list a in class X might contain objects, whose attributes are objects, whose attributes are dictionaries whose keys I want to select on while iterating over the outer list.
# I would like something like
useful_list = x.a[:]['key1'] # TypeError: list indices must be integers, not str
# or even better
cool_list = where(x.a[:]['key1'] == 'v2') # same TypeError
If I start writing list comprehensions for every interesting key, it quickly doesn't look all that Pythonic. Is there a nice way of doing this, or do I have to code 'getter' methods for all conceivable pairings of lists and dictionary keys?
UPDATE:
I have been reading about overloading lists. Apparently one can mess with the __getitem__ method, which is used for indices in lists and keys in dicts. Maybe a custom class that iterates over list members? This is starting to sound contrived...
So, you want to create a hierarchical structure, with an operation which means
different things for different types, and is defined recursively.
Polymorphism to the rescue.
You could override __getitem__ instead of my get_items below, but in your case it might be better to define a non-builtin operation to avoid risking ambiguity. It's up to you really.
class ItemsInterface(object):
    def get_items(self, key):
        raise NotImplementedError

class DictItems(ItemsInterface, dict):
    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
    def get_items(self, key):
        res = self[key]
        # apply recursively
        try:
            res = res.get_items(key)
        except AttributeError:
            pass
        return res

class ListItems(ItemsInterface, list):
    def __init__(self, *args, **kwargs):
        list.__init__(self, *args, **kwargs)
    def get_items(self, key):
        return [x.get_items(key) for x in self]
x = ListItems()
x.append(DictItems({'key1':'v1'}))
x.append(DictItems({'key1':'v2'}))
x.append(DictItems({'key1':'v3'}))
y = DictItems({'key1':'v999'})
x.append(ListItems([ y ]))
x.get_items('key1')
=> ['v1', 'v2', 'v3', ['v999']]
Of course, this solution might not be exactly what you need (you didn't explain what it should do if the key is missing, etc.)
but you can easily modify it to suit your needs.
This solution also supports ListItems as values of the DictItems. The get_items operation is applied recursively.

Python heapq module, heapify method on an object

Since I'm trying to be efficient in this program I'm making, I thought I'd use the built-in heapq module in Python, but some of my objects have multiple attributes, like name and number. Is there a way to use the heapify method to heapify my objects based on a certain attribute? I don't see anything in the documentation.
Right after I posted, I figured out you could build a list keyed by the needed attribute before using heapify, which would take O(n) linear time. This wouldn't affect the runtime of heapify or other heapq methods.
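A minimal sketch of that idea (my illustration; objects is assumed to be a list of instances exposing a number attribute, like the SomeObject class below, and the enumerate index acts as a tie-breaker so that equal numbers never force a direct comparison of the objects themselves):

import heapq

pairs = [(obj.number, i, obj) for i, obj in enumerate(objects)]
heapq.heapify(pairs)            # O(n)
number, _, smallest = pairs[0]  # the object with the smallest number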
@vsekhar and all the others wondering about the accepted answer:
Assumption:
class SomeObject():
    def __init__(self, name, number):
        self.name = name
        self.number = number
a_list = []
obj_1 = SomeObject("tim", 12)
obj_2 = SomeObject("tom", 13)
Now, instead of creating a heap with the objects only as elements:
heapq.heappush(a_list, obj_1)
heapq.heappush(a_list, obj_2)
you actually want to create the heap with 2-tuples as its elements. The idea is to have the attribute you want to sort by as the first value of the tuple and the object (as before) as the second:
# Sort by 'number'.
heapq.heappush(a_list, (obj_1.number, obj_1))
heapq.heappush(a_list, (obj_2.number, obj_2))
The heap considers this first value of the tuple as the value to sort by.
In case the element pushed to the heap is not of a simple data type like int or str, the underlying implementation needs to know how to compare elements.
If the element is a sequence such as a tuple, its first item is taken as the sort value.
Have a look at the examples here: https://docs.python.org/3/library/heapq.html#basic-examples (search for tuple)
Heap elements can be tuples. This is useful for assigning comparison values (such as task priorities) alongside the main record being tracked.
Another option might be to make comparison work with your custom class: this can be implemented so that the object itself can be used as the heap element (as in the first example).
Have a look here for reference and an example: "Enabling" comparison for classes
Have a look at the enhanced class SomeObject:
class SomeObject():
    def __init__(self, name, number):
        self.name = name
        self.number = number
    def __eq__(self, obj):
        return self.number == obj.number
    def __lt__(self, obj):
        return self.number < obj.number
    def __hash__(self):
        return hash(self.number)
This way you can create the heap with the objects only as elements:
heapq.heappush(a_list, obj_1)
heapq.heappush(a_list, obj_2)
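Popping then yields the objects in ascending number order; a quick check (continuing the snippet above):

print(heapq.heappop(a_list).name)   # 'tim' -- number 12 comes out first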

Python dictionary - binary search for a key?

I want to write a container class that acts like a dictionary (actually derives from a dict). The keys for this structure will be dates.
When a key (i.e. date) is used to retrieve a value from the class, if the date does not exist then the next available date that precedes the key is used to return the value.
The following data should help explain the concept further:
Date (key) Value
2001/01/01 123
2001/01/02 42
2001/01/03 100
2001/01/04 314
2001/01/07 312
2001/01/09 321
If I try to fetch the value associated with key (date) '2001/01/05' I should get the value stored under the key 2001/01/04 since that key occurs before where the key '2001/01/05' would be if it existed in the dictionary.
In order to do this, I need to be able to do a search (ideally binary, rather than naively looping through every key in the dictionary). I have searched for bsearch dictionary key lookups in Python dictionaries - but have not found anything useful.
Anyway, I want to write a class that encapsulates this behavior.
This is what I have so far (not much):
class NearestNeighborDict(dict):
    """
    a dictionary which returns the value of the nearest neighbor
    if the specified key is not found
    """
    def __init__(self, items={}):
        dict.__init__(self, items)
    def get_item(self, key):
        # returns the item stored with the key (if the key exists),
        # else the item stored with the nearest preceding key
        pass
You really don't want to subclass dict because you can't really reuse any of its functionality. Rather, subclass the abstract base class collections.Mapping (or MutableMapping if you want to also be able to modify an instance after creation), implement the indispensable special methods for the purpose, and you'll get other dict-like methods "for free" from the ABC.
The methods you need to code are __getitem__ (and __setitem__ and __delitem__ if you want mutability), __len__, __iter__, and __contains__.
The bisect module of the standard library gives you all you need to implement these efficiently on top of a sorted list. For example...:
import collections.abc
import bisect

class MyDict(collections.abc.Mapping):
    def __init__(self, contents):
        "contents must be a sequence of key/value pairs"
        self._list = sorted(contents)
    def __iter__(self):
        return (k for (k, _) in self._list)
    def __contains__(self, k):
        # probe with a 1-tuple so only keys are ever compared
        # ((k, None) would try to compare values in Python 3)
        i = bisect.bisect_left(self._list, (k,))
        return i < len(self._list) and self._list[i][0] == k
    def __len__(self):
        return len(self._list)
    def __getitem__(self, k):
        i = bisect.bisect_left(self._list, (k,))
        if i >= len(self._list):
            raise KeyError(k)
        return self._list[i][1]
You'll probably want to fiddle __getitem__ depending on what you want to return (or whether you want to raise) for various corner cases such as "k greater than all keys in self".
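For the "preceding key" behavior the question actually asks for, one possible adjustment (my sketch, not from the original answer) is to bisect to the right on the keys alone and step back one slot:

import bisect

def floor_value(sorted_pairs, k):
    # sorted_pairs is a sorted list of (key, value) tuples, as in MyDict._list;
    # returns the value for k, or for the nearest preceding key
    # (for many lookups, precompute keys once instead of per call)
    keys = [key for key, _ in sorted_pairs]
    i = bisect.bisect_right(keys, k)   # first index whose key is > k
    if i == 0:
        raise KeyError(k)              # every key is greater than k
    return sorted_pairs[i - 1][1]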
The sortedcontainers module provides a SortedDict type that maintains the keys in sorted order and supports bisecting on those keys. The module has both pure-Python and fast-as-C implementations, with 100% test coverage and hours of stress testing.
For example:
from sortedcontainers import SortedDict

sd = SortedDict((date, value) for date, value in data)

# Bisect for the insertion index of the desired key; the nearest
# preceding key sits one position earlier.
index = sd.bisect('2001/01/05')
# Look up the real key at that index.
key = sd.iloc[index - 1]
# Retrieve the value associated with that key.
value = sd[key]
Because SortedDict supports fast indexing, it's easy to look ahead or behind your key as well. SortedDict is also a MutableMapping so it should work nicely in your type system.
I'd extend a dict, and override the __getitem__ and __setitem__ methods to store a sorted list of keys.
from bisect import bisect

class NearestNeighborDict(dict):
    def __init__(self):
        dict.__init__(self)
        self._keylist = []
    def __getitem__(self, x):
        if x in self:
            return dict.__getitem__(self, x)
        index = bisect(self._keylist, x)
        if index == len(self._keylist):
            raise KeyError('No next date')
        return dict.__getitem__(self, self._keylist[index])
    def __setitem__(self, x, value):
        if x not in self:
            index = bisect(self._keylist, x)
            self._keylist.insert(index, x)  # insert the key, not the value
        dict.__setitem__(self, x, value)
It's true you're better off inheriting from MutableMapping, but the principle is the same, and the above code can be easily adapted.
Why not just maintain a sorted list from dict.keys() and search that? If you're subclassing dict you may even devise an opportunity to do a binary insert on that list when values are added.
Use the floor_key method on bintrees.RBTree: https://pypi.python.org/pypi/bintrees/2.0.1
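A rough sketch of how that might look (using the date/value pairs from the question):

from bintrees import RBTree

tree = RBTree([('2001/01/01', 123), ('2001/01/02', 42),
               ('2001/01/03', 100), ('2001/01/04', 314),
               ('2001/01/07', 312), ('2001/01/09', 321)])

key = tree.floor_key('2001/01/05')   # -> '2001/01/04'
value = tree[key]                    # -> 314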
