absolute fastest lookup in python / cython

I'd like to do a lookup mapping 32bit integer => 32bit integer.
The input keys aren't necessarily contiguous, nor do they cover the full 2^32 - 1 range (nor do I want this to consume that much space in memory!).
The use case is for a poker evaluator, so doing a lookup must be as fast as possible. Perfect hashing would be nice, but that might be a bit out of scope.
I feel like the answer is some kind of cython solution, but I'm not sure about the underpinnings of cython and if it really does any good with Python's dict() type. Of course a flat array with just a simple offset jump would be super fast, but then I'm allocating 2^32 - 1 places in memory for the table, which I don't want.
Any tips / strategies? Absolute speed with minimal memory footprint is the goal.

You aren't smart enough to write something faster than dict. Don't feel bad; 99.99999% of the people on the planet aren't. Use a dict.

First, you should actually define what "fast enough" means to you, before you do anything else. You can always make something faster, so you need to set a target so you don't go insane. It is perfectly reasonable for this target to be dual-headed - say something like "Mapping lookups must execute in these parameters (min/max/mean), and when/if we hit those numbers we're willing to spend X more development hours to optimize even further, but then we'll stop."
Second, the very first thing you should do to make this faster is to copy the code in Objects/dictobject.c in the CPython source tree (make something new like intdict.c or something) and then modify it so that the keys are not Python objects. Chasing after a better hash function will not likely be a good use of your time for integers, but eliminating INCREF/DECREF and PyObject_RichCompareBool calls for your keys will be a huge win. Since you're not deleting keys you could also elide any checks for dummy values (which exist to preserve the collision traversal for deleted entries), although it's possible that you'll get most of that win for free simply by having better branch prediction for your new object.

You are describing a perfect use case for a hash indexed collection. You are also describing a perfect scenario for the strategy of write it first, optimise it second.
So start with the Python dict. It's fast and it absolutely will do the job you need.
Then benchmark it. Figure out how fast it needs to go, and how near you are. Then 3 choices.
It's fast enough. You're done.
It's nearly fast enough, say within about a factor of two. Write your own hash indexing, paying attention to the hash function and the collision strategy.
It's much too slow. You're dead. There is nothing simple that will give you a 10x or 100x improvement. At least you didn't waste any time on a better hash index.
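The "benchmark it first" step could look something like the sketch below, which times plain dict lookups with the standard timeit module. The table contents, the sizes, and the names build_table and time_lookups are made up for illustration; substitute the real evaluator table and whatever target you settle on.

```python
import random
import timeit

def build_table(n_keys, seed=0):
    """Build a dict of sparse 32-bit int keys -> 32-bit int values (made-up data)."""
    rng = random.Random(seed)
    keys = set()
    while len(keys) < n_keys:
        keys.add(rng.getrandbits(32))
    return {k: rng.getrandbits(32) for k in keys}

def time_lookups(table, n_lookups=100_000, repeat=5):
    """Return the best observed seconds per lookup for keys known to be present."""
    rng = random.Random(1)
    keys = list(table)
    probes = [rng.choice(keys) for _ in range(n_lookups)]

    def run():
        get = table.__getitem__  # bind once so the loop measures mostly the lookup
        for k in probes:
            get(k)

    best = min(timeit.repeat(run, number=1, repeat=repeat))
    return best / n_lookups

if __name__ == "__main__":
    table = build_table(10_000)
    print(f"{time_lookups(table) * 1e9:.1f} ns per lookup")
```

The number it prints includes loop overhead, so treat it as an upper bound on the lookup cost rather than an exact figure.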

Related

Should I use two Hashmap for fast lookup on two entities instead of linear search of one hashmap?

I had an interview problem where I was asked to make an optimized solution to implement search on two instance: Student Number and class(only one per student).
sn_to_class() should return class for student number. Also, class_sns() should return list of student numbers for a given class.
My first solution was to use a hashmap sn_to_class_map (student number as key and class as data) and a hashmap class_to_sns_map (class as key and list of student numbers as data). So, the search will be minimized to O(1), but the amount of data stored will increase.
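Code (Python):
from collections import defaultdict
sn_map = dict()             # student number -> class
cl_map = defaultdict(list)  # class -> list of student numbers
def addStudents(sn, cl):
    sn_map[sn] = cl
    cl_map[cl].append(sn)
def getStudents(cl):
    return cl_map.get(cl, [])  # empty list for an unknown class
def getClass(sn):
    return sn_map[sn]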
Is my approach correct?

It is not always possible to optimize for everything; there's very often a tradeoff between time and space, or between consistency and availability, or between the time needed for one operation and the time needed for a different operation, . . .
In your case, you have been asked to make an "optimized" solution, and you're faced with such a tradeoff:
If you keep a map from student-numbers to classes, then getClass and addStudents are fast, and you only use the space for that one representation of the data, but getStudents is slower because it needs to read the entire map.
If you keep a map from classes to lists of student-numbers, and don't worry about the order of student-numbers in those lists, then getStudents and addStudents are fast, and you only use the space for that one representation of the data, but getClass is slower because it needs to read the entire map.
If you keep a map from classes to sorted lists of student-numbers, then getStudents is fast, getClass is a bit faster than with unsorted lists (it needs to examine every class in the map, but at least it can do binary search within each list), and you only use the space for that one representation of the data, but getClass is still relatively slow if classes are small, and addStudents is significantly slower because inserting a student into a list can take a lot of time.
If you keep two maps, as you propose, then all operations are pretty fast, but you now need the space for both representations of the data.
Your question is, what's the right tradeoff? And we can't answer that for you. Maybe memory is very limited, and one operation is only called very rarely, and only in non-time-sensitive contexts, such that it's better to make that operation slower than to waste memory; but maybe memory is not an issue at all, and speed is what matters. In a real program, I think it'd be much more likely that you'll care about speed than about a factor-of-two difference in memory usage, so your proposed two-maps solution would likely be the best one; but we can't know.
So in an interview situation like you describe, the best approach is to describe multiple options, explain the tradeoff, explain why you might choose one or the other, and optionally explain why the two-maps solution is likely to be best in a real program — but that last part is not the most important part IMHO.
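To make the tradeoff concrete, here is a small sketch of two of the options. The class names OneMapRegistry and TwoMapRegistry and the method names are hypothetical, chosen only for this illustration: the first keeps a single student-number-to-class map and scans it to list a class, the second keeps both maps as the question proposes.

```python
from collections import defaultdict

class OneMapRegistry:
    """One map only: less memory, but listing a class scans every student."""
    def __init__(self):
        self.sn_map = {}                      # student number -> class

    def add_student(self, sn, cl):
        self.sn_map[sn] = cl

    def get_class(self, sn):
        return self.sn_map[sn]                # O(1) on average

    def get_students(self, cl):
        # O(n): has to look at every entry in the map
        return sorted(sn for sn, c in self.sn_map.items() if c == cl)

class TwoMapRegistry:
    """Two maps: both queries are direct lookups, at the cost of extra space."""
    def __init__(self):
        self.sn_map = {}                      # student number -> class
        self.cl_map = defaultdict(set)        # class -> set of student numbers

    def add_student(self, sn, cl):
        if sn in self.sn_map:                 # keep both maps consistent on re-assignment
            self.cl_map[self.sn_map[sn]].discard(sn)
        self.sn_map[sn] = cl
        self.cl_map[cl].add(sn)

    def get_class(self, sn):
        return self.sn_map[sn]

    def get_students(self, cl):
        # sorted only to give a stable order; .get avoids creating empty entries
        return sorted(self.cl_map.get(cl, ()))
```

Both classes answer the same queries with the same results; they differ only in where the time and the memory go.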

Why hasn't iter.remove been implemented in python dicts?

Is there a good reason that iter.remove() is not currently implemented in python dicts?
Let us say I need to remove about half the elements in a set/dictionary. Then I'm forced to either:
Copy the entire set/dictionary (n space, n time)
Iterate over the copy to find elements to remove, remove it from the original dictionary (n/2 plus n/2 distinct lookups)
Or:
Iterate over the dictionary, add elements to remove to a new set (n space, n time)
Iterate over the new set, removing each element from the original dictionary (n/2 plus n/2 lookups)
While asymptotically everything is still "O(n)" time, this is horribly inefficient and about 3 times as slow when compared to the sane way of doing this:
Iterate over the dict, removing what you don't want as you go. This is truly n time, and O(1) space.
At least under the common implementation of hash sets as buckets of linked lists, the iterator should be able to remove the element it just visited without making a new lookup, by simply removing the node in the linked list.
More importantly, the bad solution also requires O(n) space, which really is bad even for those who tend to dismiss these kinds of optimization concerns in python.

In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
2.0n, 1.0n, 1.5n: Copy, delete half the original
2.0n, 1.0n, 1.5n: Copy, delete half the copy
1.5n, 1.0n, 1.5n: Build a deletions set then delete
1.0n, 1.0n, 0.5n: Delete half in-place
1.5n, 0.5n, 1.0n: Delete half in-place, then compact
1.5n, 0.5n, 0.5n: Copy half
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
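The point about table size can be checked directly. The following is a sketch that assumes CPython, where sys.getsizeof on a dict reflects its allocated table; the helper names are made up, and the exact byte counts differ between versions.

```python
import sys

def delete_half_in_place(d, pred):
    """Remove entries whose key fails pred; the dict object itself is mutated."""
    for k in list(d):            # snapshot of keys: deleting while iterating is an error
        if not pred(k):
            del d[k]
    return d

def copy_half(d, pred):
    """Build a new dict holding only the entries whose key passes pred."""
    return {k: v for k, v in d.items() if pred(k)}

if __name__ == "__main__":
    def keep_even(k):
        return k % 2 == 0

    original = {i: i for i in range(100_000)}
    size_before = sys.getsizeof(original)
    copied = copy_half(original, keep_even)
    delete_half_in_place(original, keep_even)
    print("full dict:          ", size_before, "bytes")
    print("after in-place del: ", sys.getsizeof(original), "bytes")
    print("fresh half-size dict:", sys.getsizeof(copied), "bytes")
```

On CPython the in-place version is expected to report the same size as before the deletions, while the fresh dict is smaller; other implementations may report differently.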
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
Fewer places to make "small" mistakes that are hard to detect and debug.
Generally more compact and more readable.
Semantically restricted looping, dict building, and exception handling provides opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the choices you assumed should obviously go the other way—e.g., consider that Python only stores references in containers, not actual values, and avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: You're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes. Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. Your algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
As your arguments in favor of not providing to the programmer the option of delete in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being an obvious, one way to do things, and having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing it = iter(mydict.items()) then for (a, b) in it:. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped StopIteration handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
and none of the options can take less than 1.0n time
Sorry, this was probably unclear, because here I was talking about added cost, and probably shouldn't have been, and didn't mention it. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, d = {k: v for k, v in d.items() if pred(k)} is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.

There is another approach:
for key in list(mydict.keys()):
    val = mydict[key]
    if should_drop(val):  # should_drop stands for your own "decide drop" test
        mydict.pop(key)
Which could be explained as:
Copy the keys of the original dictionary
Iterate the dictionary through individual lookups
Delete elements when required
I suspect that the overhead of individual lookups will be too high compared to the straightforward iteration. But I am curious (and have not tested it yet).
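Since that is left untested, one way it could be measured is sketched below. Everything here is illustrative: should_drop is a stand-in predicate, the dict contents are made up, and no particular outcome is claimed.

```python
import time

def should_drop(val):
    """Stand-in predicate: drop odd values."""
    return val % 2 == 1

def drop_via_key_copy(d):
    """Copy the keys, then look up and pop entries from the original dict."""
    for key in list(d.keys()):
        if should_drop(d[key]):
            d.pop(key)
    return d

def drop_via_comprehension(d):
    """Build a new dict that keeps only the wanted entries."""
    return {k: v for k, v in d.items() if not should_drop(v)}

def best_time(func, n_items=100_000, repeat=5):
    """Best wall-clock time of func over several runs, each on a fresh dict."""
    times = []
    for _ in range(repeat):
        d = {i: i for i in range(n_items)}   # rebuilt each run: one variant mutates it
        start = time.perf_counter()
        func(d)
        times.append(time.perf_counter() - start)
    return min(times)

if __name__ == "__main__":
    print("key copy + pop:", best_time(drop_via_key_copy))
    print("comprehension: ", best_time(drop_via_comprehension))
```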

Converting lists to dictionaries to check existence?

If I instantiate/update a few lists very, very few times, in most cases only once, but I check for the existence of an object in that list a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or in other words is it worth it for me to convert lists into dictionaries to achieve possible faster object existence checks?

Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.

You should be using a set, since from your description I am guessing you wouldn't have a value to associate. See Python: List vs Dict for look up table for more info.
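A minimal sketch of that conversion, using made-up data and a hypothetical helper count_known. The set is built once; after that each membership test is a hash lookup instead of a scan of the list. Note that this only works when the elements are hashable.

```python
allowed_list = ["alice", "bob", "carol"]   # built once, rarely updated
allowed = set(allowed_list)                # one-time O(n) conversion

def count_known(names, known):
    """Count how many of the given names are members of `known`."""
    return sum(1 for name in names if name in known)

if __name__ == "__main__":
    queries = ["bob", "dave", "alice", "bob"]
    print(count_known(queries, allowed))
```

The same helper accepts the original list as `known`, which makes it easy to time both variants on real data before deciding.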

Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However, consider that PyPy might do the linear search 100 times faster than CPython; then a "few" times might be "dozens". In other words, sometimes the constant part of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find that a bottleneck as the system scales than the other way around.
If you really need to microtune everything, keep in mind that the implementation, cpu cache, etc... can affect it, so you may need to remicrotune differently for different platforms, and if you need performance that badly, Python was probably a bad choice - although maybe you can pull the hotspots out into C. :)

Random access (lookup) in a dictionary is faster, but creating the hash table consumes more memory.
More performance = more memory usage.
It depends on how many items are in your list.

FSharp runs my algorithm slower than Python

Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699

Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The problem is simply that the program has been optimized for Python. This is common when the programmer is more familiar with one language than the other, of course. You just have to learn a different set of rules that dictate how F# programs should be optimized...
Several things jumped out at me such as the use of a "for i in 1..n do" loop rather than a "for i=1 to n do" loop (which is faster in general but not significant here), repeatedly doing List.mapi on a list to mimic an array index (which allocated intermediate lists unnecessarily) and your use of the F# TryGetValue for Dictionary which allocates unnecessarily (the .NET TryGetValue that accepts a ref is faster in general but not so much here)
... but the real killer problem turned out to be your use of a hash table to implement a dense 2D matrix. Using a hash table is ideal in Python because its hash table implementation has been extremely well optimized (as evidenced by the fact that your Python code is running as fast as F# compiled to native code!) but arrays are a much better way to represent dense matrices, particularly when you want a default value of zero.
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
Moral of the story:
F# Dictionary needs work... when using tuples as keys, compiled F# is slower than interpreted Python's hash tables!
Obvious, but no harm in repeating: Cleaner code sometimes means... much slower code.
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing Dictionary with Array makes F# finally run at the speeds a compiled language is expected to run, doesn't negate the need for a fix in Dictionary's speed (I hope F# people from MS are reading this). Other algorithms depend on dictionaries/hashes, and can't be easily switched to using arrays; making programs suffer "interpreter-speeds" whenever one uses a Dictionary, is arguably, a bug. If, as some have said in the comments, the problem is not with F# but with .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that) is to change this:
let optimalResults = new Dictionary<_,_>()
into this:
let optimalResults = new Dictionary<_,_>(HashIdentity.Structural)
This change makes the F# code run 2.7x faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a 'computed goto' that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.

As Jon Harrop has pointed out, simply constructing the dictionaries using Dictionary(HashIdentity.Structural) gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.

Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interesting discussion. My code partially fixed the performance issue, but this is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means, the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow, I believe there's no significant difference with Python or other languages implementations. Also, I don't agree with the "You can't 1-to-1 translate code expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to imagine that both languages should be almost as fast.
I think the performance issue is related to garbage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in a SO question (Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and an MSDN page (Building tuples):
"If they are reference types, this means there can be lots of garbage generated if you are changing elements in a tuple in a tight loop. [...] F# tuples were reference types, but there was a feeling from the team that they could realize a performance improvement if two, and perhaps three, element tuples were value types instead. Some teams that had created internal tuples had used value instead of reference types, because their scenarios were very sensitive to creating lots of managed objects."
Of course, as Jon said in another comment, the obvious optimization in your example is to replace hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, Python code is using hashtables, not arrays).
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
type Tup = {x: int; y: int}
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i,x)) fileSizes out of the enclosing loop. I believe Python enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or use Seq module, or use a mutable counter).

Hmm.. if the hashtable is the major bottleneck, then it is probably the hash function itself. I haven't looked at the specific hash function, but consider one of the most common hash functions, namely
((a * x + b) % p) % q
The modulus operation % is painfully slow; if p and q are of the form 2^k - 1, we can compute the modulus with an AND, an add and a shift operation.
Dietzfelbinger's universal hash function h_a : [2^w] -> [2^l] is
floor(((a * x) % 2^w) / 2^(w-l))
where a is a random odd w-bit seed.
It can be computed as (a*x) >> (w-l) in w-bit arithmetic (the multiplication simply wraps around), which is much faster than the first hash function. I had to implement a hash table with linked lists for collision handling. It took 10 minutes to implement and test; we had to test it with both functions and analyse the difference in speed. The second hash function gave, as I remember, around a 4-10x speed gain depending on the size of the table.
But the thing to learn here is that if your program's bottleneck is hashtable lookup, the hash function has to be fast too.
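For reference, the multiply-shift function described above can be sketched in Python as follows. Python integers do not overflow, so the w-bit wraparound is emulated with a mask; the function names are made up for this example, and in a language with fixed-width integers the mask disappears and only the multiply and the shift remain.

```python
import random

def random_odd_multiplier(w, rng=random):
    """Pick a random odd w-bit multiplier a."""
    return rng.getrandbits(w) | 1

def multiply_shift(x, a, w, l):
    """Hash a w-bit key x to an l-bit value: ((a * x) mod 2^w) >> (w - l)."""
    mask = (1 << w) - 1          # emulates w-bit overflow of the product
    return ((a * x) & mask) >> (w - l)

if __name__ == "__main__":
    w, l = 32, 10                # 32-bit keys, table of 2^10 buckets
    a = random_odd_multiplier(w)
    for key in (0, 1, 12345, 2**32 - 1):
        print(key, "->", multiply_shift(key, a, w, l))
```

In pure Python this will not be faster than the built-in hash; the sketch only shows what the formula computes.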

Performance difference in alternative switches in Python

I have read a few articles around alternatives to the switch statement in Python. Mainly using dicts instead of lots of if's and elif's. However none really answer the question: is there one with better performance or efficiency? I have read a few arguments that if's and elifs would have to check each statement and becomes inefficient with many ifs and elif's. However using dicts gets around that, but you end up having to create new modules to call which cancels the performance gain anyways. The only difference in the end being readability.
Can anyone comment on this, is there really any difference in the long run? Does anyone regularly use the alternative? Only reason I ask is because I am going to end up having 30-40 elif/if's and possibly more in the future. Any input is appreciated. Thanks.

dict's performance is typically going to be unbeatable, because a lookup into a dict is going to be O(1) except in rare and practically never-observed cases (where the key involves user-coded types with lousy hashing;-). You don't have to "create new modules" as you say, just arbitrary callables, and that creation, which is performed just once to prep the dict, is not particularly costly anyway -- during operation, it's just one lookup and one call, greased lightning time.
As others have suggested, try timeit to experiment with a few micro-benchmarks of the alternatives. My prediction: with a few dozen possibilities in play, as you mention you have, you'll be slapping your forehead about ever considering anything but a dict of callables!-)
If you find it too hard to run your own benchmarks and can supply some specs, I guess we can benchmark the alternatives for you, but it would be really more instructive if you tried to do it yourself before you ask SO for help!-)
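For readers who have not seen the pattern, a dict of callables could look like the sketch below. The command names and handler functions are invented for the example; the dict is built once, and each dispatch is one lookup plus one call.

```python
def handle_start(arg):
    return f"starting {arg}"

def handle_stop(arg):
    return f"stopping {arg}"

def handle_unknown(arg):
    return f"unknown command for {arg}"

# Built once at import time; plays the role of the switch / elif chain.
HANDLERS = {
    "start": handle_start,
    "stop": handle_stop,
}

def dispatch(command, arg):
    """Look up the handler for `command` and call it, with a default fallback."""
    return HANDLERS.get(command, handle_unknown)(arg)

if __name__ == "__main__":
    print(dispatch("start", "engine"))
    print(dispatch("reboot", "engine"))
```

Adding a new case means adding one function and one dict entry, with no change to dispatch itself.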

Your concern should be about the readability and maintainability of the code, rather than its efficiency. This applies in most scenarios, and particularly in the one you describe now. The efficiency difference is likely to be negligible (you can easily check it with a small amount of benchmarking code), but 30-40 elif's are a warning sign - perhaps something can be abstracted away and make the code more readable. Describe your case, and perhaps someone can come up with a better design.

With all performance/profiling questions, the right answer is "test each case yourself for your specific needs."
One great tool for this is timeit which you can learn about in the python docs.
In general I have seen no performance issues related to using a dictionary in place of other languages switch statement. My guess would be that the comparison in performance would depend on the number of alternatives. Who knows, there may be a tipping point where one becomes better than the other.
If you (or anyone else) tests it, feel free to post your results.

Times when you'd use a switch in many languages you would use a dict in Python. A switch statement, if added to Python (it's been considered), would not be able to give any real performance gain anyways.
dicts are used ubiquitously in Python. CPython dicts are an insanely-efficient, robust hashtable implementation. Lookup is O(1), as opposed to traversing an elif chain, which is O(n). (30-40 probably doesn't qualify as big enough for this to matter tons anyways). I am not sure what you mean about creating new modules to call, but using dicts is very scalable and easy.
As for actual performance gain, that is impossible to really tackle effectively abstractly. Write your code in the most straightforward and maintainable way (you're using Python forgoshsakes!) and then see if it's too slow. If it is, profile it and find out what places it needs to be sped up to make a real difference.

I think a dict will gain an advantage over the alternative sequence of if statements as the number of cases goes up, since the key lookup only requires one hash operation. Otherwise, if you only have a few cases, a few if statements are better. A dict is probably a more elegant solution for what you are doing. Either way, the performance difference won't really be noticeable in your case.

I ran a few benchmarks (see here). Using lists of function pointers is fastest, if your keys are sequential integers. For the general case: Alex Martelli is right, dictionaries are fastest.
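The list variant mentioned above only applies when the keys are small sequential integers, so that the key can be used directly as an index. A sketch with invented operation codes follows; it shows the shape of the technique only and does not reproduce the benchmark.

```python
def op_add(a, b):
    return a + b

def op_sub(a, b):
    return a - b

def op_mul(a, b):
    return a * b

# Index in the list is the operation code: 0 = add, 1 = sub, 2 = mul.
OPS = [op_add, op_sub, op_mul]

def apply_op(code, a, b):
    """Dispatch by integer code; reject codes outside the table."""
    if not 0 <= code < len(OPS):
        raise ValueError(f"unknown operation code: {code}")
    return OPS[code](a, b)

if __name__ == "__main__":
    print(apply_op(0, 2, 3), apply_op(1, 2, 3), apply_op(2, 2, 3))
```

The explicit range check matters because a negative index would otherwise silently pick a handler from the end of the list.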
