I have a really big (like, huge) dictionary (it isn't really, but pretend it is, because that's easier and not relevant) that contains the same strings over and over again. I have verified that I can store a lot more in memory if I do poor man's compression and instead store ints that correspond to the strings.
animals = ['ape', 'butterfly', 'cat', 'dog']
exists in a list and therefore has an index value such that animals.index('cat') returns 2
This allows me to store in my object BobsPets = set([2, 3])
rather than 'cat' and 'dog'.
For the number of items I have, the memory savings are astronomical. (Really, don't try to dissuade me; that part is well tested.)
Currently I then convert the ints back to strings with a for loop:
tempWordList = set()
for IntegOfIndex in TempSet:
    tempWordList.add(animals[IntegOfIndex])
return tempWordList
This code works. It feels "Pythonic," but it seems like there should be a better way. I am in Python 2.7 on App Engine, if that matters. It may, since I wonder if NumPy has something I missed.
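For what it's worth, the loop above can be collapsed into a single expression; a minimal sketch, using the animals and TempSet names from the snippet (set comprehensions are available in Python 2.7):

# Same conversion as the explicit loop above; the work done is identical.
tempWordList = {animals[IntegOfIndex] for IntegOfIndex in TempSet}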
I have about 2.5 million things in my object, each has an average of 3 of these "pets", and there are roughly 7,500 ints that represent the pets. (No, they aren't really pets.)
I have considered using a dictionary keyed by position instead of using the index. This doesn't seem faster, but I am interested if anyone thinks it should be. (It took more memory and seemed to be the same speed, or really close.)
I am considering running a bunch of tests with NumPy and its arrays rather than lists, but before I do, I thought I'd ask the audience and see whether I would be wasting time on something I have already reached the best solution for.
Last thing: the solution should be picklable, since I do that for loading and transferring data.
It turns out that since my list of strings is fixed, and I only ever look items up by index, I am building what is essentially an immutable index array, which is, in short, a tuple.
Moving to a tuple rather than a list gains about a 30% improvement in speed, far more than I would have anticipated.
The bonus is largest on very large lists. It seems that each time you cross a power-of-two threshold the bonus increases, so on lists under 1024 items there is basically no bonus, and at a million items it is pretty significant.
The Tuple also uses very slightly less memory for the same data.
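A rough sketch of how that list-versus-tuple lookup difference can be measured (the sizes and words here are made up, and timeit stands in for the original test harness):

import random
import timeit

animals_list = ['word%d' % i for i in range(1000000)]
animals_tuple = tuple(animals_list)
indices = [random.randrange(len(animals_list)) for _ in range(1000)]

def lookup(seq):
    # Index into the sequence the same way the conversion loop does.
    return [seq[i] for i in indices]

print(timeit.timeit(lambda: lookup(animals_list), number=1000))
print(timeit.timeit(lambda: lookup(animals_tuple), number=1000))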
As an aside: playing with the lists of integers, you can make these significantly smaller by using a NumPy array, but the advantage doesn't extend to pickling. The pickles will be about 15% larger. I think this is because of the object description being stored in the pickle, but I didn't spend much time looking.
So in short the only change was to make the Animals list a Tuple. I really was hoping the answer was something more exotic.
Related
I had an interview problem where I was asked to make an optimized solution to implement search on two fields: student number and class (only one class per student).
sn_to_class() should return the class for a student number, and class_sns() should return the list of student numbers for a given class.
My first solution was to use a hashmap sn_to_class_map (student number as key and class as value) and a hashmap class_to_sns_map (class as key and list of student numbers as value). So the searches are reduced to O(1), but the data stored is roughly doubled.
pseudo code:
from collections import defaultdict

sn_map = dict()             # student number -> class
cl_map = defaultdict(list)  # class -> list of student numbers

def addStudents(sn, cl):
    sn_map[sn] = cl
    cl_map[cl].append(sn)

def getStudents(cl):
    return cl_map[cl]

def getClass(sn):
    return sn_map[sn]
Is my approach correct?
It is not always possible to optimize for everything; there is very often a tradeoff between time and space, or between consistency and availability, or between the time needed for one operation and the time needed for a different operation, and so on.
In your case, you have been asked to make an "optimized" solution, and you're faced with such a tradeoff:
If you keep a map from student-numbers to classes, then getClass and addStudents are fast, and you only use the space for that one representation of the data, but getStudents is slower because it needs to read the entire map.
If you keep a map from classes to lists of student-numbers, and don't worry about the order of student-numbers in those lists, then getStudents and addStudents are fast, and you only use the space for that one representation of the data, but getClass is slower because it needs to read the entire map.
If you keep a map from classes to sorted lists of student-numbers, then getStudents is fast, getClass is a bit faster than with unsorted lists (it still needs to examine every class in the map, but at least it can do a binary search within each list), and you only use the space for that one representation of the data. But getClass is still relatively slow if classes are small, and addStudents is significantly slower because inserting a student into a sorted list can take a lot of time (see the sketch after this list).
If you keep two maps, as you propose, then all operations are pretty fast, but you now need the space for both representations of the data.
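To make the sorted-lists option concrete, here is a minimal sketch of what it could look like, reusing the function names from the question (not a full implementation):

import bisect

cl_map = {}  # class -> sorted list of student numbers

def addStudents(sn, cl):
    # Keeps each list sorted, but insertion shifts elements: O(len(list)).
    bisect.insort(cl_map.setdefault(cl, []), sn)

def getStudents(cl):
    return cl_map.get(cl, [])

def getClass(sn):
    # Still examines every class, but binary-searches within each list.
    for cl, sns in cl_map.items():
        i = bisect.bisect_left(sns, sn)
        if i < len(sns) and sns[i] == sn:
            return cl
    return None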
Your question is, what's the right tradeoff? And we can't answer that for you. Maybe memory is very limited, and one operation is only called very rarely, and only in non-time-sensitive contexts, such that it's better to make that operation slower than to waste memory; but maybe memory is not an issue at all, and speed is what matters. In a real program, I think it'd be much more likely that you'll care about speed than about a factor-of-two difference in memory usage, so your proposed two-maps solution would likely be the best one; but we can't know.
So in an interview situation like you describe, the best approach is to describe multiple options, explain the tradeoff, explain why you might choose one or the other, and optionally explain why the two-maps solution is likely to be best in a real program — but that last part is not the most important part IMHO.
Is there a good reason that iter.remove() is not currently implemented in python dicts?
Let us say I need to remove about half the elements in a set/dictionary. Then I'm forced to either:
Copy the entire set/dictionary (n space, n time)
Iterate over the copy to find elements to remove, removing each from the original dictionary (n/2 plus n/2 distinct lookups)
Or:
Iterate over the dictionary, add elements to remove to a new set (n space, n time)
Iterate over the new set, removing each element from the original dictionary (n/2 plus n/2 lookups)
While asymptotically everything is still "O(n)" time, this is horribly inefficient and about 3 times as slow when compared to the sane way of doing this:
Iterate over the dict, removing what you don't want as you go. This is truly n time, and O(1) space.
At least under the common implementation of hash sets as buckets of linked lists, the iterator should be able to remove the element it just visited without making a new lookup, by simply removing the node in the linked list.
More importantly, the bad solution also requires O(n) space, which really is bad even for those who tend to dismiss these kinds of optimization concerns in python.
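For reference, this is what the "remove as you go" version looks like if you write it naively against a plain dict, and why it can't be written that way today (a minimal sketch; CPython detects the size change on the next iteration step):

d = {i: i * i for i in range(10)}

for k in d:
    if k % 2 == 0:   # stand-in for "what you don't want"
        del d[k]     # the next iteration step raises
                     # RuntimeError: dictionary changed size during iteration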
In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
2.0n, 1.0n, 1.5n: Copy, delete half the original
2.0n, 1.0n, 1.5n: Copy, delete half the copy
1.5n, 1.0n, 1.5n: Build a deletions set, then delete
1.0n, 1.0n, 0.5n: Delete half in-place
1.5n, 0.5n, 1.0n: Delete half in-place, then compact
1.5n, 0.5n, 0.5n: Copy half
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
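Concretely, the two realistic options from that table look something like this (a sketch; keep is a made-up predicate deciding which half survives):

d = {i: str(i) for i in range(1000)}

def keep(k):
    # Made-up predicate: decides which entries survive.
    return k % 2 == 0

# "Copy half": build a new dict; the old table becomes garbage.
d = {k: v for k, v in d.items() if keep(k)}

# "Build a deletions set, then delete": mutate the original in place.
# (Shown on the same dict only for brevity; in practice you'd pick one.)
to_delete = {k for k in d if not keep(k)}
for k in to_delete:
    del d[k]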
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
Fewer places to make "small" mistakes that are hard to detect and debug.
Generally more compact and more readable.
Semantically restricted looping, dict building, and exception handling provide opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
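If you want to sanity-check that last figure yourself, a timeit comparison along these lines will do it (the sizes are illustrative, and the exact ratio will vary by version and workload):

import timeit

setup = "src = {i: str(i) for i in range(100000)}"

comprehension = "d = {k: v for k, v in src.items() if k % 2}"
explicit_loop = (
    "d = {}\n"
    "for k, v in src.items():\n"
    "    if k % 2:\n"
    "        d[k] = v\n"
)

print(timeit.timeit(comprehension, setup=setup, number=100))
print(timeit.timeit(explicit_loop, setup=setup, number=100))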
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the choices you assumed should obviously go the other way—e.g., consider that Python only stores references in containers, not actual values, and avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: You're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes. Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. Your algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
As for your arguments in favor of not giving the programmer the option of deleting in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being one obvious way to do things, and having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing it = iter(mydict.items()) then for (a, b) in it:. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped StopIteration handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
and none of the options can take less than 1.0n time
Sorry, this was probably unclear, because here I was talking about added cost, and probably shouldn't have been, and didn't mention it. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, d = {k: v for k, v in d.items() if pred(k)} is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.
There is another approach:
for key in list(mydict.keys()):     # snapshot the keys up front
    val = mydict[key]
    if should_drop(val):            # stand-in for the "<decide drop>" predicate
        mydict.pop(key)
Which could be explained as:
Copy the keys of the original dictionary
Iterate the dictionary through individual lookups
Delete elements when required
I suspect that the overhead of individual lookups will be too high compared to straightforward iteration. But I am curious (and have not tested it yet).
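One way to test that suspicion (a sketch; should_drop is the same stand-in predicate as in the snippet above):

import timeit

setup = """
mydict = {i: i * i for i in range(100000)}
def should_drop(val):
    return val % 2 == 0
"""

rebuild = "d = {k: v for k, v in mydict.items() if not should_drop(v)}"

in_place = """
for key in list(mydict.keys()):
    if should_drop(mydict[key]):
        mydict.pop(key)
"""

# number=1 because the in-place variant mutates mydict; repeated runs
# would have nothing left to delete and skew the comparison.
print(timeit.timeit(rebuild, setup=setup, number=1))
print(timeit.timeit(in_place, setup=setup, number=1))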
I'd like to do a lookup mapping 32-bit integer => 32-bit integer.
The input keys aren't necessarily contiguous, nor do they cover the full 2^32 - 1 range (nor do I want this in-memory table to consume that much space!).
The use case is for a poker evaluator, so doing a lookup must be as fast as possible. Perfect hashing would be nice, but that might be a bit out of scope.
I feel like the answer is some kind of Cython solution, but I'm not sure about the underpinnings of Cython and whether it really does any good with Python's dict() type. Of course a flat array with just a simple offset jump would be super fast, but then I'm allocating 2^32 - 1 places in memory for the table, which I don't want.
Any tips / strategies? Absolute speed with minimal memory footprint is the goal.
You aren't smart enough to write something faster than dict. Don't feel bad; 99.99999% of the people on the planet aren't. Use a dict.
First, you should actually define what "fast enough" means to you, before you do anything else. You can always make something faster, so you need to set a target so you don't go insane. It is perfectly reasonable for this target to be dual-headed - say something like "Mapping lookups must execute in these parameters (min/max/mean), and when/if we hit those numbers we're willing to spend X more development hours to optimize even further, but then we'll stop."
Second, the very first thing you should do to make this faster is to copy the code in Objects/dictobject.c in the CPython source tree (make something new like intdict.c or something) and then modify it so that the keys are not Python objects. Chasing after a better hash function will not likely be a good use of your time for integers, but eliminating INCREF/DECREF and PyObject_RichCompareBool calls for your keys will be a huge win. Since you're not deleting keys you could also elide any checks for dummy values (which exist to preserve the collision traversal for deleted entries), although it's possible that you'll get most of that win for free simply by having better branch prediction for your new object.
You are describing a perfect use case for a hash indexed collection. You are also describing a perfect scenario for the strategy of write it first, optimise it second.
So start with the Python dict. It's fast and it absolutely will do the job you need.
Then benchmark it. Figure out how fast it needs to go, and how near you are. Then you have three choices (a benchmarking sketch follows the list below).
It's fast enough. You're done.
It's nearly fast enough, say within about a factor of two. Write your own hash indexing, paying attention to the hash function and the collision strategy.
It's much too slow. You're dead. There is nothing simple that will give you a 10x or 100x improvement. At least you didn't waste any time on a better hash index.
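A minimal sketch of step one, benchmarking a plain dict with keys shaped like yours (the counts here are made up; your evaluator's real access pattern matters far more than this microbenchmark):

import random
import timeit

# Sparse, non-contiguous 32-bit keys mapped to 32-bit values.
table = {random.getrandbits(32): random.getrandbits(32)
         for _ in range(1000000)}
probes = random.sample(list(table), 1000)

def lookups():
    for k in probes:
        table[k]

print(timeit.timeit(lookups, number=1000))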
In the article linked below, the author compares the efficiency of different string concatenation methodologies in python:
http://www.skymind.com/~ocrow/python_string/
One thing that I did not understand is: why does method 3 (mutable character arrays) result in significantly slower performance than method 4 (joining a list of strings)?
Both of them are mutable, and I would think that they should have comparable performance.
"Both of them are mutable" is misleading you a bit.
It's true that in the list-append method, the list is mutable. But building up the list isn't the slow part. If you have 1000 strings of average length 1000, you're doing 1000000 mutations to the array, but only 1000 mutations to the list (plus 1000 increfs to string objects).
In particular, that means the array will have to spend 1000x as much time expanding (allocating new storage and copying the whole thing so far).
The slow part for the list method is the str.join call at the end. But that isn't mutable, and doesn't require any expanding. It uses two passes, to first calculate the size needed, then copy everything into it.
Also, that code inside str.join has had (and has continued to have since that article was written 9 years ago) a lot of work to optimize it, because it's a very common, and recommended, idiom that many real programs depend on every day; array has barely been touched since it was first added to the language.
But if you really want to understand the differences, you have to look at the source. In 2.7, the main work for the array method is in array_fromstring, while the main work for the list method is in string_join. You can see how the latter takes advantage of the fact that we already know all of the strings we're going to be joining up at the start, while the former can't.
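To see the gap yourself, a minimal sketch under Python 2.7 (mirroring the article's method 3 and method 4; the sizes are illustrative):

import array
import timeit

strings = ['x' * 1000 for _ in range(1000)]

def method3_char_array():
    # Mutable character array: the buffer repeatedly grows and is copied.
    buf = array.array('c')
    for s in strings:
        buf.fromstring(s)
    return buf.tostring()

def method4_list_join():
    # Accumulate references in a list; join() copies everything exactly once.
    parts = []
    for s in strings:
        parts.append(s)
    return ''.join(parts)

print(timeit.timeit(method3_char_array, number=100))
print(timeit.timeit(method4_list_join, number=100))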
If I instantiate/update a few lists very, very few times (in most cases only once), but I check for the existence of an object in those lists a bunch of times, is it worth it to convert the lists into dictionaries and then check by key existence?
Or, in other words: is it worth converting lists into dictionaries to achieve possibly faster object-existence checks?
Dictionary lookups are faster than list searches. A set would also be an option. That said:
If "a bunch of times" means "it would be a 50% performance increase" then go for it. If it doesn't but makes the code better to read then go for it. If you would have fun doing it and it does no harm then go for it. Otherwise it's most likely not worth it.
You should be using a set, since from your description I am guessing you wouldn't have a value to associate. See Python: List vs Dict for look up table for more info.
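A minimal sketch of the conversion and the membership check (the names and sizes are made up):

import timeit

items_list = ['item%d' % i for i in range(10000)]
items_set = set(items_list)  # one-time conversion cost

# Membership: O(n) scan for the list vs. O(1) average for the set.
print(timeit.timeit(lambda: 'item9999' in items_list, number=1000))
print(timeit.timeit(lambda: 'item9999' in items_set, number=1000))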
Usually it's not important to tune every line of code for utmost performance.
As a rule of thumb, if you need to look up more than a few times, creating a set is usually worthwhile.
However, consider that PyPy might do the linear search 100 times faster than CPython; then a "few" times might be "dozens". In other words, sometimes the constant part of the complexity matters.
It's probably safest to go ahead and use a set there. You're less likely to find that a bottleneck as the system scales than the other way around.
If you really need to microtune everything, keep in mind that the implementation, cpu cache, etc... can affect it, so you may need to remicrotune differently for different platforms, and if you need performance that badly, Python was probably a bad choice - although maybe you can pull the hotspots out into C. :)
Random access (lookup) in a dictionary is faster, but building the hash table consumes more memory.
More performance = more memory usage.
It depends on how many items are in your list.