Python hash table for fuzzy matching

I am trying to implement a data structure which allows rapid look-ups based on keys.
The Python dict is great when my look-ups involve an equality check
(e.g. key == somevalue translates to datadict[somevalue]).
The problem is that I also need to be able to efficiently look up keys based on a more complex comparison, e.g. key > 50, or key.startswith('abc').
Obviously I can't use the same solution in both cases, but at the moment I can't figure out how to solve either case. Can anyone suggest a way of doing this?

It doesn't sound like you want a hash table at all - rather some form of binary tree, or a sorted list used with the bisect module. It'd be worth looking at: Python's standard library - is there a module for balanced binary tree?
Another option (depending on your data) would be to use an in-memory sqlite3 database and create appropriate indices for the possible lookups -- but you'll be trading some performance/memory and putting up with SQL syntax in exchange for flexibility...
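A rough sketch of the in-memory sqlite3 route (the table and column names here are just illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (key TEXT, value TEXT)")
conn.execute("CREATE INDEX idx_data_key ON data (key)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [("abc1", "x"), ("abcz", "y"), ("zzz", "z")])

# Equality, range and prefix lookups can all hit the index
# (whether LIKE uses it depends on SQLite's case-sensitivity settings).
print(conn.execute("SELECT value FROM data WHERE key = ?", ("abc1",)).fetchall())
print(conn.execute("SELECT value FROM data WHERE key > ?", ("b",)).fetchall())
print(conn.execute("SELECT value FROM data WHERE key LIKE 'abc%'").fetchall())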

Put all data items in a list.
Sort the list on the key.
Use binary search to efficiently find items where key > 50 or where key.startswith('abc').
Of course, this only pays off if you have a large number of items. If you don't, simply loop through the list and apply your condition to every key.
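A minimal sketch of the sorted-list-plus-bisect approach, assuming the items are (key, value) pairs (the data here is made up):

import bisect

data = sorted([(12, "a"), (49, "d"), (51, "b"), (73, "c")])   # sorted by key
keys = [k for k, _ in data]                                   # parallel key list for bisect

# key > 50: everything to the right of the insertion point for 50
start = bisect.bisect_right(keys, 50)
print(data[start:])                                           # [(51, 'b'), (73, 'c')]

# key.startswith('abc') for string keys: the matching block is contiguous
skeys = sorted(["abb", "abc", "abcd", "abd"])
lo = bisect.bisect_left(skeys, "abc")
hi = bisect.bisect_left(skeys, "abc\uffff")                   # crude upper bound for the prefix range
print(skeys[lo:hi])                                           # ['abc', 'abcd']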

Related

Are nested sets and dictionaries anti-pattern in Python?

I need to create a sort of similarity matrix based on user_id values. I am currently using Pandas to store the majority of my data, but I know that iterating over a DataFrame is considered an anti-pattern, so I am considering creating a nested set/dictionary structure to store the similarities, similar to some of the proposed structures here.
I would only be storing N nearest similarities, so it would amount to something like this:
{
    'user_1': {'user_2': 0.5, 'user_4': 0.9, 'user_3': 1.0},
    'user_2': ...
}
This would allow me to access a neighbourhood quite easily with dict_name[user_id].
Essentially the outermost dictionary key would be a user_id, which returns another dictionary of its N closest neighbours as user_id: similarity_value key-value pairs.
For more context, I'm just writing a simple KNN recommender. I am doing it from scratch as I've tried using Surpriselib and sklearn but they don't have the context-aware flexibility I require.
This seems like a reasonable way to store these values to me, but is it very anti-pythonic, or should I be looking to do this using some other structures (e.g. NumPy or Pandas or something else I don't yet know about)?
As the comment says, there is nothing inherently wrong or anti-pythonic with using (one level of) nested dictionaries and writing everything from scratch.
Performance-wise, an existing data structure can probably beat your self-written solution if its API matches the transformations/operations you intend to perform on it. NumPy/Pandas will only help if your operations can be expressed as vectorized operations that act on all (pairs of) elements along a common axis, e.g. all users in your top-level dictionary.
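For instance, if your similarity measure can be vectorized, something along these lines would compute all pairwise similarities at once and then fold the N nearest back into the nested-dict shape from the question (the cosine-similarity choice and the ratings array are illustrative assumptions, not from the original post):

import numpy as np

ratings = np.array([[5.0, 3.0, 0.0],       # rows: users, columns: items (made-up values)
                    [4.0, 0.0, 1.0],
                    [1.0, 1.0, 5.0]])
user_ids = ["user_1", "user_2", "user_3"]
N = 2

# Cosine similarity between every pair of users in one vectorized step.
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
sims = (ratings @ ratings.T) / (norms * norms.T)
np.fill_diagonal(sims, -np.inf)            # exclude self-similarity

neighbours = {}
for i, uid in enumerate(user_ids):
    top = np.argsort(sims[i])[::-1][:N]    # indices of the N most similar users
    neighbours[uid] = {user_ids[j]: float(sims[i, j]) for j in top}

print(neighbours["user_1"])                # e.g. {'user_2': ..., 'user_3': ...}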

Python fast duplicate detection, can I store only the hash but not the value

I have a method for creating an image "hash" which is useful for duplicate frame detection. (Doesn't really matter for the question)
Currently I put each frame of a video in a set, and can do things like find videos that contain intersections by comparing the sets. (I have billions of hashes)
Since I have my own "hash" I don't need the values of the set, only the ability to detect duplicate items.
This would reduce my memory footprint by like half (since I would only have the hashes).
I know that internally a set actually stores hash/value pairs. There must be a way to make a "SparseSet" or a "hash-only" set.
Something like
2 in sparset(1,2,3)
True
but where
for s in sparset(1,2,3)
would return nothing, or hashes rather than values.
That's not quite how sets work. Both the hash value and the value are required, because the values must be checked for equality in case of a hash collision.
If you don't care about collisions, you can use a Bloom filter instead of a set. These are very memory efficient, but give probabilistic answers (either definitely not in the set, or maybe in the set). There's no Bloom filter in the standard library, but there are several implementations on PyPI.
If you care more about optimizing space than time, you could just keep the hashes in a list and then when you need to check for an element, sort it in place and do a binary search. Python's Timsort is very efficient when the list is mostly sorted already, so subsequent sorts will be relatively fast. Python lists have a sort() method and you can implement a binary search fairly easily using the standard library bisect module.
You can combine both techniques, i.e. don't bother sorting if the Bloom filter indicates the element is not in the set. And of course, don't bother sorting again if you haven't added any elements since last time.
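A minimal sketch of the sorted-hash-list idea with lazy re-sorting (the class and method names are mine, and the Bloom-filter pre-check is left out):

import bisect

class SortedHashSet:
    def __init__(self):
        self._hashes = []
        self._dirty = False

    def add(self, h):
        self._hashes.append(h)
        self._dirty = True

    def __contains__(self, h):
        if self._dirty:               # only re-sort when something was added
            self._hashes.sort()       # Timsort is cheap on mostly-sorted data
            self._dirty = False
        i = bisect.bisect_left(self._hashes, h)
        return i < len(self._hashes) and self._hashes[i] == h

s = SortedHashSet()
for h in (0x1A2B, 0x3C4D, 0x5E6F):
    s.add(h)
print(0x3C4D in s)                    # True
print(0x9999 in s)                    # False (barring a collision)

If the hashes fit in a fixed-width integer, storing them in an array.array('q') instead of a plain list would shrink the footprint further.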

More efficient use of dictionaries

I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking up using a list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be somewhat wasteful of memory.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things, e.g.:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, a plain dict is probably not the best idea; you may want a relational database, either the built-in sqlite3 module or a third-party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
If you want to benchmark, I'd suggest populating a dict with sample data, then using the timeit module (or IPython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't look up the same key each time (using itertools.cycle to repeat a few hundred keys works better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario), so string's caching of hash codes doesn't interfere.
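A rough benchmark sketch along those lines (the key sizes and shapes below are illustrative, not a definitive measurement):

import itertools
import random
import timeit

securities = ["SEC%05d" % i for i in range(1000)]
dates = ["2023-01-%02d" % d for d in range(1, 5)]
types = ["bid", "ask"]

tuple_keys = [(s, d1, d2, t) for s in securities
              for d1 in dates[:2] for d2 in dates[2:] for t in types]
tuple_store = {k: random.random() for k in tuple_keys}
string_store = {"_".join(k): v for k, v in tuple_store.items()}

cycler = itertools.cycle(random.sample(tuple_keys, 500))   # vary the key being looked up

def lookup_tuple():
    s, d1, d2, t = next(cycler)
    return tuple_store[(s, d1, d2, t)]        # key constructed on every call

def lookup_string():
    s, d1, d2, t = next(cycler)
    return string_store["_".join((s, d1, d2, t))]

print("tuple :", timeit.timeit(lookup_tuple, number=100_000))
print("string:", timeit.timeit(lookup_string, number=100_000))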

Fastest Get Python Data Structures

I am developing an AI to solve an MDP. I am getting states (just integers in this case) and assigning each one a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for deletion) and has a very fast get/update. Is there something faster than the regular dictionary? I am open to anything - native Python, open source - I just need fast gets.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None] * totalitems  # preallocate; the indices act as the integer keys
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
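A sketch of that kind of comparison (not the exact test behind the numbers above; results will differ by machine):

import timeit

total = 10_000
as_list = [None] * total
as_dict = {i: None for i in range(total)}

def read_list():
    for i in range(total):
        x = as_list[i]     # index lookup

def read_dict():
    for i in range(total):
        x = as_dict[i]     # hash lookup

print("list:", timeit.timeit(read_list, number=1_000))
print("dict:", timeit.timeit(read_dict, number=1_000))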
If you want a rapid prototype, use Python, and don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear-algebra work), write it in C or C++ (perhaps only to call from Python). If fast rather than ultra-fast is enough, you can also use Java or Scala.

Efficient large dicts of dicts to represent M:M relationships in Python

I have a very large dataset - millions of records - that I want to store in Python. I might be running on 32-bit machines so I want to keep the dataset down in the hundreds-of-MB range and not ballooning much larger than that.
These records represent a M:M relationship - two IDs (foo and bar) and some simple metadata like timestamps (baz).
Some foos have nearly all bars in them, and some bars have nearly all foos. But there are also many bars that have almost no foos, and many foos that have almost no bars.
If this were a relational database, a M:M relationship would be modelled as a table with a compound key. You can, of course, comfortably search on either component key individually.
If you store the rows in a hashtable, however, you would need to maintain three hashtables, because only the compound key is hashed and you can't search on the component keys with it.
If you have some kind of sorted index, you can abuse lexical sorting to iterate over the first key in the compound key, and you need a second index for the other key; but it's less obvious to me which actual data structure in the standard Python collections this equates to.
I am considering a dict of foos where each value is automatically promoted from a tuple (a single row) to a list (of row tuples) to a dict, depending on some size thresholds, and another dict of bars where each value is a single foo or a list of foos.
Are there more efficient - speedwise and spacewise - ways of doing this? Any kind of numpy for indices or something?
(I want to store them in Python because I am having performance problems with databases - both SQL and NoSQL varieties. You end up bound by IPC, memcpy and serialisation. That is another story; the key point is that I want to move the data into the application rather than get recommendations to move it out of the application ;) )
Have you considered using a NoSQL database that runs in memory, such as Redis? Redis supports a decent number of familiar data structures.
I realize you don't want to move the data outside of the application, but not reinventing the wheel can save time and, quite frankly, it may be more efficient.
If you need to query the data in a flexible way and maintain various relationships, I would suggest looking further into using a database, of which there are many options. How about using an in-memory database, like sqlite (using ":memory:" as the file)? You're not really moving the data "outside" of your program, and you will have much more flexibility than with multi-layered dicts.
Redis is also an interesting alternative, as it has other data-structures to play with, rather than using a relational model with SQL.
What you describe sounds like a sparse matrix, where the foos are along one axis and the bars along the other one. Each non-empty cell represents a relationship between one foo and one bar, and contains the "simple metadata" you describe.
There are efficient sparse matrix packages for Python (scipy.sparse, PySparse) you should look at. I found these two just by Googling "python sparse matrix".
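A hedged sketch of the sparse-matrix representation with scipy.sparse; mapping the foo/bar IDs to integer row/column indices and the choice of dok_matrix are my own assumptions:

import numpy as np
from scipy.sparse import dok_matrix

foo_ids = ["foo_a", "foo_b", "foo_c"]                 # illustrative IDs
bar_ids = ["bar_x", "bar_y"]
foo_index = {f: i for i, f in enumerate(foo_ids)}     # ID -> row index
bar_index = {b: j for j, b in enumerate(bar_ids)}     # ID -> column index

# One non-empty cell per (foo, bar) relationship, holding e.g. a timestamp (baz).
m = dok_matrix((len(foo_ids), len(bar_ids)), dtype=np.int64)
m[foo_index["foo_a"], bar_index["bar_y"]] = 1700000000

# All bars for a given foo, and all foos for a given bar:
print(m.tocsr()[foo_index["foo_a"]].nonzero()[1])     # column indices of related bars
print(m.tocsc()[:, bar_index["bar_y"]].nonzero()[0])  # row indices of related foos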
As to using a database, you claim that you've had performance problems. I'd like to suggest that you may not have chosen an optimal representation, but without more details on what your access patterns look like, and what database schema you used, it's awfully hard for anybody to contribute useful help. You might consider editing your post to provide more information.
NoSQL systems like Redis don't provide M:M tables.
In the end, the best I could come up with was a Python dict keyed by (foo, bar) pairs holding the values, plus a dict of the set of pairings for each term.
class MM:
    def __init__(self):
        self._a = {}   # Bs for each A
        self._b = {}   # As for each B
        self._ab = {}  # value for each (A, B) pair
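The post only shows the constructor; as a sketch, add/lookup methods on top of those three dicts might look like this (continuing the class above; the method names are my own, not from the original):

    def add(self, a, b, value):
        self._a.setdefault(a, set()).add(b)
        self._b.setdefault(b, set()).add(a)
        self._ab[(a, b)] = value

    def bs_for(self, a):                 # all Bs related to a given A
        return self._a.get(a, set())

    def as_for(self, b):                 # all As related to a given B
        return self._b.get(b, set())

    def value(self, a, b):               # metadata stored for one (A, B) pair
        return self._ab[(a, b)]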
