Set of non-hashable objects in Python

Is there an equivalent to a Python set for non-hashable objects? (For instance, instances of a custom class that can be compared to one another but not hashed?)

If your values are not hashable, then there is no point in using set.
Just use a list instead. If all your objects can do is test for equality, then you have to scan the elements every time you test for membership. someobj in somelist does just that: it scans the list until an equality match is found:
if someobj not in somelist:
    somelist.append(someobj)
would give you a list of 'unique' values.
Yes, this is going to be slower than a set, but sets can only achieve O(1) complexity through hashes.
If your objects are orderable, you could speed up membership tests by using the bisect module to bring them down to O(log N) complexity. Make sure you insert new values at the index found by the bisection search, so that the list stays ordered.
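As an illustration only, here is a minimal sketch of that idea, assuming the objects support < and == but are not hashable (the helper name add_unique is made up):

import bisect

def add_unique(sorted_items, obj):
    """Insert obj into the already-sorted list unless an equal item is present."""
    i = bisect.bisect_left(sorted_items, obj)             # O(log N) search
    if i < len(sorted_items) and sorted_items[i] == obj:
        return False                                      # already present
    sorted_items.insert(i, obj)                           # O(N) shift, but no hashing
    return True

values = []
for obj in [3, 1, 2, 3, 1]:
    add_unique(values, obj)
print(values)   # [1, 2, 3]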

There is the sortedset class from the blist library, which offers a set-like API for comparable (and potentially non-hashable) objects, using a storage mechanism based on a sorted list.
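A short usage sketch, assuming the third-party blist package installs on your Python version (sortedcontainers.SortedSet offers a very similar interface if it does not):

from blist import sortedset

s = sortedset([3, 1, 2])
s.add(2)            # duplicates are ignored, as with a regular set
print(list(s))      # [1, 2, 3] -- kept in sorted order
print(2 in s)       # True -- membership via binary search, no hashing required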

Related

How does Python determine if two sets are equal? Is there any possible optimization for this?

I understand that implementing set equality is easy if we just do bare-bones iteration:
def __eq__(self, b):
    if len(self) != len(b):  # fast size check
        return False
    for x in self:
        if x not in b:
            return False
    for x in b:
        if x not in self:
            return False
    return True
Is there a more efficient way to implement this? Is there any optimization Python performs (or any possible optimization, such as computing non-colliding hashes) for this equality check?
CPython's Implementation
Built-in optimizations for set equality depend on the implementation of Python that is being used, which is CPython in most cases. You can check out the internal definition of set objects and how set comparisons are implemented by looking at CPython's source code.
First of all, it is important to note that Python features the classes Set and FrozenSet. The main difference is that instances of FrozenSet are immutable, and therefore are hashable and can be used as hash keys; the hash member is only calculated for instances of FrozenSet (see struct definition). They largely share the same codebase in CPython. CPython uses the following steps to determine set equality:
1. compare the sizes of the sets
2. compare the hash values of the sets (FrozenSet only!)
3. check whether set v is a subset of w
Hashes may collide, hence why the other steps are still necessary when comparing instances of FrozenSet.
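A rough Python-level sketch of those three steps (an illustration, not CPython's actual C code):

def sets_equal(v, w):
    # Step 1: sets of different sizes can never be equal.
    if len(v) != len(w):
        return False
    # Step 2 (frozensets only): a cached-hash mismatch proves inequality;
    # matching hashes prove nothing because collisions are possible.
    if isinstance(v, frozenset) and isinstance(w, frozenset):
        if hash(v) != hash(w):
            return False
    # Step 3: with equal sizes, "v is a subset of w" implies equality.
    return v <= w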
CPython's Optimizations
A typical baseline implementation would check that both sets are subsets of each other. Your example also features a very intuitive optimization by comparing the sizes of the sets to determine set inequality.
CPython implements the optimization that Arya McCarthy and Tim Peters already pointed out: comparing sizes (step 1) and checking that one set is a subset of the other set (step 3) are enough to determine equality. This cuts down the time needed to determine set equality by half.
CPython also implements another optimization for instances of FrozenSet by comparing their hash values, but those values may collide. Therefore, we can only rely on mismatching values as an indicator that two sets are not equal, at a cost of O(1). However, matching hash values do not imply that two sets are equal, because collisions are possible.
I don't know if the extra overhead introduced by (re)calculating hash values for mutable sets is worth it, but perhaps CPython's implementation could benefit from a solution where the hash values are only calculated/used internally to determine equality, when needed. This is where mutable generic sets could potentially benefit from a hash-based optimization.
Non-Colliding Hashes
You mentioned an optimization strategy based on non-colliding hashes, but as far as I know, non-colliding hashes and generic sets cannot go together, regardless of implementation.
(Here's a link to a post on crypto.stackexchange.com that explains the problem pretty well.)
If the use of non-colliding hashes were possible for sets, then the only step needed to determine the equality of two sets would be to compare their hash values. This would be a massive optimization, as it would reduce the cost of determining set equality to O(1) in all cases; steps 1 and 3 of CPython's implementation would no longer be necessary.
That being said, the use of non-colliding hash values could still be possible for specific problem domains, where the contents of sets are constrained in some way, such that a hash function exists that promises that hash values do not collide. In that case, you probably should not rely on Python's built-in sets, or implement your own specialized set in Python (to avoid all that extra overhead). Instead, your best option would be to implement your own specialized implementation of a set in C/C++, and use Python bindings to integrate your specialized implementation into Python.
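As a toy illustration of such a constrained domain (in Python rather than C/C++, and with a made-up class name), suppose the elements are (x, y) points with 0 <= x, y < 50: the whole set can be encoded as one bitmask, and that encoding acts as a non-colliding hash of the set, so equality reduces to a single comparison:

class BitmaskSet:
    """Hypothetical set of (x, y) points with 0 <= x, y < 50."""
    SIDE = 50

    def __init__(self, items=()):
        self.mask = 0                     # one bit per possible point
        for item in items:
            self.add(item)

    def add(self, item):
        x, y = item
        self.mask |= 1 << (x * self.SIDE + y)

    def __contains__(self, item):
        x, y = item
        return bool((self.mask >> (x * self.SIDE + y)) & 1)

    def __eq__(self, other):
        # The bitmask is a collision-free "hash" of the whole set, so one
        # comparison decides equality -- no element-by-element scan needed.
        return self.mask == other.mask

print(BitmaskSet([(1, 2), (3, 4)]) == BitmaskSet([(3, 4), (1, 2)]))   # True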

Python fast duplicate detection: can I store only the hash but not the value?

I have a method for creating an image "hash" which is useful for duplicate frame detection. (Doesn't really matter for the question)
Currently I put each frame of a video in a set, and can do things like find videos that contain intersections by comparing the sets. (I have billions of hashes)
Since I have my own "hash" I don't need the values of the set, only the ability to detect duplicate items.
This would reduce my memory footprint by like half (since I would only have the hashes).
I know that internally a set actually stores (hash, value) pairs. There must be a way to make a "SparseSet" or a "hash-only" set.
Something like
>>> 2 in sparseset(1, 2, 3)
True
but where
>>> for s in sparseset(1, 2, 3): ...
would return nothing, or hashes rather than values.
That's not quite how sets work. Both the hash value and the value are required, because the values must be checked for equality in case of a hash collision.
If you don't care about collisions, you can use a Bloom filter instead of a set. These are very memory efficient, but give probabilistic answers (either definitely not in the set, or maybe in the set). There's no Bloom filter in the standard library, but there are several implementations on PyPI.
If you care more about optimizing space than time, you could just keep the hashes in a list and then when you need to check for an element, sort it in place and do a binary search. Python's Timsort is very efficient when the list is mostly sorted already, so subsequent sorts will be relatively fast. Python lists have a sort() method and you can implement a binary search fairly easily using the standard library bisect module.
You can combine both techniques, i.e. don't bother sorting if the Bloom filter indicates the element is not in the set. And of course, don't bother sorting again if you haven't added any elements since last time.
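A minimal sketch of the sorted-list-of-hashes idea (the class name HashOnlySet is made up; false positives on hash collisions are possible, exactly as noted above):

import bisect

class HashOnlySet:
    def __init__(self):
        self._hashes = []
        self._dirty = False          # True when unsorted hashes were appended

    def add(self, frame_hash):
        self._hashes.append(frame_hash)
        self._dirty = True

    def __contains__(self, frame_hash):
        if self._dirty:
            self._hashes.sort()      # Timsort is cheap on mostly-sorted data
            self._dirty = False
        i = bisect.bisect_left(self._hashes, frame_hash)
        return i < len(self._hashes) and self._hashes[i] == frame_hash

seen = HashOnlySet()
for h in (101, 205, 307):
    seen.add(h)
print(205 in seen, 999 in seen)   # True False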

More efficient use of dictionaries

I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking up with a known list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things? For example:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
If you want to benchmark, I'd suggest populating a dict with sample data, then using the timeit module (or IPython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test: don't look up the same key each time (using itertools.cycle to repeat a few hundred keys works better), since dict optimizes for that scenario, and make sure the key is constructed each time rather than reused (unless reuse would be common in the real scenario), so strings' caching of hash codes doesn't interfere.
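For example, a rough benchmark along those lines (the data shapes and key layouts here are invented for illustration):

import itertools
import timeit

# Hypothetical sample data: security IDs, a date pair, and a type flag.
keys_tuple = [(sec, "2023-01-01", "2023-12-31", t)
              for sec in range(10_000) for t in ("bid", "ask")]
keys_str = ["{}_{}_{}_{}".format(*k) for k in keys_tuple]

cache_tuple = {k: 0.0 for k in keys_tuple}
cache_str = {k: 0.0 for k in keys_str}

# Cycle through a few hundred keys so we don't hit the same-key fast path,
# and rebuild each key inside the loop so cached string hashes don't skew things.
probe = list(itertools.islice(itertools.cycle(keys_tuple), 500))

def lookup_tuple():
    for sec, d1, d2, t in probe:
        cache_tuple[(sec, d1, d2, t)]

def lookup_str():
    for sec, d1, d2, t in probe:
        cache_str["{}_{}_{}_{}".format(sec, d1, d2, t)]

print("tuple keys :", timeit.timeit(lookup_tuple, number=1_000))
print("string keys:", timeit.timeit(lookup_str, number=1_000))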

How to keep a list sorted as you read elements

What is an efficient way to read elements into a list and keep the list sorted, other than searching for each new element's place in the existing sorted list and inserting it there?
Use a specialised data structure, in Python you have the bisect module at your disposal:
This module provides support for maintaining a list in sorted order without having to sort the list after each insertion. For long lists of items with expensive comparison operations, this can be an improvement over the more common approach. The module is called bisect because it uses a basic bisection algorithm to do its work.
You're looking for the functions in heapq.
This module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.
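A quick sketch of what that looks like with heapq (a heap is not a fully sorted list, but it pops elements in sorted order):

import heapq

heap = []
for value in (5, 1, 4, 2, 3):
    heapq.heappush(heap, value)        # O(log n) per insert

print([heapq.heappop(heap) for _ in range(len(heap))])   # [1, 2, 3, 4, 5]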
@Aswin's comment is interesting. If you sort each time you insert an item, the call to sort() is O(n) rather than the usual O(n log n), due to the way the sort (Timsort) is implemented.
However, on top of this you need to shift a bunch of elements along the list to make space, which is also O(n), so overall calling .sort() after each insert is O(n).
There isn't a way to keep a sorted list in better than O(n) per insertion, because this shifting is always needed.
If you don't need an actual list, heapq (as mentioned in @Ignacio's answer) often covers the properties you do need in an efficient manner.
Otherwise, one of the many tree data structures will probably suit your needs better than a list.
Example with SortedList from the third-party sortedcontainers package:
from sortedcontainers import SortedList

sl = SortedList()
sl.add(2)
sl.add(1)
# sl is now SortedList([1, 2])

Time complexity of accessing a Python dict

I am writing a simple Python program.
My program seems to suffer from linear access to dictionaries,
its run-time grows exponentially even though the algorithm is quadratic.
I use a dictionary to memoize values. That seems to be a bottleneck.
The values I'm hashing are tuples of points.
Each point is: (x,y), 0 <= x,y <= 50
Each key in the dictionary is: A tuple of 2-5 points: ((x1,y1),(x2,y2),(x3,y3),(x4,y4))
The keys are read many times more often than they are written.
Am I correct that Python dicts suffer from linear access times with such inputs?
As far as I know, sets have guaranteed logarithmic access times.
How can I simulate dicts using sets (or something similar) in Python?
Edit: as per request, here's a (simplified) version of the memoization function:
def memoize(fun):
    memoized = {}
    def memo(*args):
        key = args
        if key not in memoized:
            memoized[key] = fun(*args)
        return memoized[key]
    return memo
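For concreteness, a hypothetical use of this decorator with a tuple-of-points key might look like this (path_cost is an invented stand-in for the real computation):

@memoize
def path_cost(points):
    # 'points' is a tuple of (x, y) pairs, matching the keys described above
    return sum(abs(x1 - x2) + abs(y1 - y2)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = ((0, 0), (3, 4), (10, 12))
print(path_cost(pts))   # computed on the first call
print(path_cost(pts))   # served from the memoized dict on the second call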
See Time Complexity. The Python dict is a hash map, so its worst case is O(n) if the hash function is bad and results in a lot of collisions. However, that is the very rare case where every item added has the same hash value and so collides, which would be extremely unlikely with a major Python implementation. The average time complexity is, of course, O(1).
The best approach would be to check the hashes of the objects you are using. The CPython dict uses PyObject_Hash(PyObject *o), which is the equivalent of hash(o).
After a quick check, I have not yet managed to find two tuples that hash to the same value, which would indicate that the lookup is O(1):
seen = set()
for x in range(50):
    for y in range(50):
        h = hash((x, y))
        if h in seen:
            print("Fail:", (x, y))
        seen.add(h)
print("Test finished")
You are not correct. dict access is unlikely to be your problem here. It is almost certainly O(1), unless you have some very weird inputs or a very bad hashing function. Paste some sample code from your application for a better diagnosis.
It would be easier to make suggestions if you provided example code and data.
Accessing the dictionary is unlikely to be a problem, as that operation is O(1) on average and O(N) amortized worst case. It's possible that the built-in hashing function is experiencing collisions for your data. If you're having problems with the built-in hashing function, you can provide your own.
Python's dictionary implementation reduces the average complexity of dictionary lookups to O(1) by requiring that key objects provide a "hash" function. Such a hash function takes the information in a key object and uses it to produce an integer, called a hash value. This hash value is then used to determine which "bucket" this (key, value) pair should be placed into.
You can override the __hash__ method in your class to implement a custom hash function like this:
def __hash__(self):
    return hash(str(self))
Depending on what your data actually looks like, you might be able to come up with a faster hash function that has fewer collisions than the standard function. However, this is unlikely. See the Python Wiki page on Dictionary Keys for more information.
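As a small sketch of what a consistent __eq__/__hash__ pair might look like for this kind of key (PointPath is a made-up class; the standard tuple hash is usually hard to beat):

class PointPath:
    """Hypothetical wrapper around a tuple of (x, y) points used as a dict key."""

    def __init__(self, points):
        self.points = tuple(points)

    def __eq__(self, other):
        return isinstance(other, PointPath) and self.points == other.points

    def __hash__(self):
        # Delegate to tuple hashing; __eq__ and __hash__ must agree,
        # otherwise dict lookups will silently miss.
        return hash(self.points)

cache = {PointPath([(1, 2), (3, 4)]): "value"}
print(PointPath([(1, 2), (3, 4)]) in cache)   # True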
To answer your specific questions:
Q1:
"Am I correct that python dicts suffer from linear access times with such inputs?"
A1: If you mean that average lookup time is O(N), where N is the number of entries in the dict, then it is highly likely that you are wrong. If you are correct, the Python community would very much like to know under what circumstances you are correct, so that the problem can be mitigated or at least warned about. Neither "sample" code nor "simplified" code is useful. Please show actual code and data that reproduce the problem. The code should be instrumented with things like the number of dict items and the number of dict accesses for each P, where P is the number of points in the key (2 <= P <= 5).
Q2:
"As far as I know, sets have guaranteed logarithmic access times.
How can I simulate dicts using sets(or something similar) in Python?"
A2: Sets have guaranteed logarithmic access times in what context? There is no such guarantee for Python implementations. Recent CPython versions in fact use a cut-down dict implementation (keys only, no values), so the expectation is average O(1) behaviour. How can you simulate dicts with sets or something similar in any language? Short answer: with extreme difficulty, if you want any functionality beyond dict.has_key(key).
As others have pointed out, accessing dicts in Python is fast. They are probably the best-oiled data structure in the language, given their central role. The problem lies elsewhere.
How many tuples are you memoizing? Have you considered the memory footprint? Perhaps you are spending all your time in the memory allocator or paging memory.
My program seems to suffer from linear access to dictionaries, its run-time grows exponentially even though the algorithm is quadratic.
I use a dictionary to memoize values. That seems to be a bottleneck.
This is evidence of a bug in your memoization method.
