I have a dict containing about 50,000 integer values, and a set that contains the keys of 100 of them. My inner loop increments or decrements the values of the dict items in an unpredictable way.
Periodically I need to replace one member of the set with the key of the largest element not already in the set. As an aside, if the dict items were sorted, the order of the sort would change slightly, not dramatically, between invocations of this routine.
Re-sorting the entire dict every time seems wasteful, although maybe less so given that it's already "almost" sorted. While I may be guilty of premature optimization, performance will matter as this will run a very large number of iterations, so I thought it worth asking my betters whether there's an obviously more efficient and pythonic approach.
I'm aware of the notion of dict "views" - windows onto the contents which are updated as the contents change. Is there such a thing as a "sorted view"?
Instead of using a dict you could use a collections.Counter object, which has a neat most_common(n) method that will (quoting the docs):
Return a list of the n most common elements and their counts from the most common to the least.
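A minimal sketch of how that could fit your scenario; the names counts, tracked, and largest_untracked are made up for illustration, and in CPython most_common(n) with an argument uses a heap internally rather than sorting everything:

from collections import Counter

# Hypothetical names: counts is the ~50,000-entry mapping, tracked is the set of ~100 keys.
counts = Counter({'a': 10, 'b': 7, 'c': 12, 'd': 3})
tracked = {'a', 'b'}

# The inner loop can increment/decrement entries just like a dict.
counts['c'] += 1
counts['d'] -= 1

def largest_untracked(counts, tracked):
    # At most len(tracked) of the top entries can already be in the set,
    # so the largest key not in the set is among the top len(tracked) + 1.
    for key, _value in counts.most_common(len(tracked) + 1):
        if key not in tracked:
            return key
    return None  # only possible if every key is tracked

print(largest_untracked(counts, tracked))  # 'c' in this toy example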
I understand hash tables: In principle, you store a key's data in a fixed-size array, with the index/slot to use given by the key's hash. However, Python's dict class has the method dict.keys() which returns a list of the dict's keys. Where does this list come from? (Also, iterating over a dictionary implicitly iterates over its keys).
I tried to think about this myself and I identified the following requirements:
insertion in O(1)
deletion in O(1)
iteration in O(n)
I thought maybe we could store, for each slot, the index of the next and previous non-empty slots, so we could jump to the next/prev element in O(1) and also clear a slot in O(1) (just update the previous slot's 'next' index and the next slot's 'prev' index). The problem is that insertion would then be O(log n), because we would have to binary-search over the 'next' indexes.
Another possibility I considered is that we simply iterate over all the slots, ignore the empty ones, and accept the runtime penalty of repeatedly checking empty slots every time we iterate over the keys. A disadvantage of this is that the hash table would need to be fairly full for this to be efficient, which in turn would slow down insertions.
The related question "How are Python's Built In Dictionaries Implemented" never mentioned this aspect of dicts.
Edit: Source code for iteration seems to be here.
I looked at the CPython source code here, and while I'm still not 100% sure, I think my second guess is correct: we just iterate over all the slots, ignore the empty ones, and return the keys found in the non-empty slots.
Python dictionaries get resized once they are 2/3 full and (assuming no deletions) double in size, after which they are 2/6 = 1/3 full*. That means iteration touches at most about 3n slots, which is still O(n) once the constant factor is dropped.
*: The comment seems to be outdated; I think the table would be even fuller with the current constants.
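To make that concrete, here is a toy sketch of that kind of slot-array iteration; it is purely illustrative Python, not CPython's actual C implementation:

# Toy model of an open-addressed table: a fixed-size slot array where
# EMPTY marks unused slots and used slots hold (key, value) pairs.
EMPTY = None
slots = [EMPTY, ('a', 1), EMPTY, ('b', 2), EMPTY, EMPTY, ('c', 3), EMPTY]

def iter_keys(slots):
    # Walk every potential entry, skipping the empty ones. The cost is
    # proportional to the table size, which stays within a constant
    # factor of the number of keys, so iteration is still O(n).
    for slot in slots:
        if slot is not EMPTY:
            yield slot[0]

print(list(iter_keys(slots)))  # ['a', 'b', 'c']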
I have a list which is made up out of three layers, looking something like this for illustrative purposes:
a = [[['1'],['2'],['3'],['']],[['5'],['21','33']]]
Thus I have a top list which contains several other lists each of which again contains lists.
The first layer will contain on the order of tens of lists. The next layer could contain possibly millions of lists, and the bottom layer will contain either an empty string, a single string, or a handful of values (each a string).
I now need to access the values in the bottom-most layer and store them in a new list in a particular order which is done inside a loop. What is the fastest way of accessing these values? The amount of memory used is not of primary concern to me (though I obviously don't want to squander it either).
I can think of two ways:
I access list a directly to retrieve the desired value, e.g. a[1][1][0] would return '21'.
I create a copy of the elements of a and then access these to flatten the nesting a bit more. In this case, e.g.: b = a[0], c = a[1], so instead of accessing a[1][1][0] I would now access c[1][0] to retrieve '21'.
Is there any performance penalty involved in accessing nested lists? That is, is there any benefit to be gained in splitting list a into separate lists, or am I merely incurring a RAM penalty in doing so?
Accessing elements via their index (i.e. a[1][1][0]) is an O(1) operation: source. You won't get much quicker than that.
Now, assignment is also an O(1) operation, so there's no real speed difference between the two methods you've described. The second one doesn't incur any memory problem either, because assignments of lists are by reference, not by copy (unless you explicitly ask for a copy).
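You can verify the no-copy behaviour directly with the list from the question:

a = [[['1'], ['2'], ['3'], ['']], [['5'], ['21', '33']]]
c = a[1]
print(c is a[1])       # True: c is the very same list object, not a copy
c[1][0] = 'changed'
print(a[1][1][0])      # 'changed': the mutation is visible through a as well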
The two methods are more or less identical, given that b = a[0] only binds another name to the list at that index; it does not copy the list. The only difference is that in your second method, in addition to accessing the nested lists, you also create and pass around extra references, so in theory it is a tiny bit slower.
As pointed out by @joaquinlpereyra, the Python Wiki has a list of the complexity of such operations: https://wiki.python.org/moin/TimeComplexity
So, long answer cut short: Just accessing the list items is faster.
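If you want to measure it on your own data, a rough timeit sketch might look like this (the absolute numbers will vary and the difference is tiny):

import timeit

setup = "a = [[['1'], ['2'], ['3'], ['']], [['5'], ['21', '33']]]; c = a[1]"
# Three subscript operations versus two on a pre-bound inner list.
print(timeit.timeit("a[1][1][0]", setup=setup))
print(timeit.timeit("c[1][0]", setup=setup))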
I'm going to store on the order of 10,000 securities × 300 date pairs × 2 types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking things up knowing a list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things, e.g.:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, a plain dict is probably not the best idea; you may want a relational database, either via the built-in sqlite3 module or a third-party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
If you want to benchmark, I'd suggest populating a dict with sample data, then using the timeit module (or IPython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test: don't look up the same key each time (using itertools.cycle to repeat a few hundred keys works better), since repeated hits on one key benefit from caching effects, and make sure the key is constructed each time rather than reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
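A rough sketch of such a benchmark; the sample data shapes are made up, so adjust them to your real cache:

import itertools
import timeit

# Made-up sample data; adjust shapes and sizes to your real cache.
securities = [f"SEC{i:05d}" for i in range(100)]
dates = [(d1, d1 + 30) for d1 in range(10)]
types = ("bid", "ask")

tuple_cache = {(s, d1, d2, t): 0.0
               for s in securities
               for d1, d2 in dates
               for t in types}
str_cache = {f"{s}_{d1}_{d2}_{t}": v
             for (s, d1, d2, t), v in tuple_cache.items()}

# Cycle a few hundred keys so we aren't just hitting one hot entry.
tuple_keys = itertools.cycle(list(tuple_cache)[:500])
str_keys = itertools.cycle(list(str_cache)[:500])

# Note: this only times the lookup itself; for a fairer test you would
# also construct the key inside the timed call, as suggested above.
print(timeit.timeit(lambda: tuple_cache[next(tuple_keys)], number=200000))
print(timeit.timeit(lambda: str_cache[next(str_keys)], number=200000))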
When working with dictionaries in Python, this page says that the time complexity of iterating through the element of the dictionary is O(n), where n is the largest size the dictionary has been.
However, I don't think that there is an obvious way to iterate through the elements of a hash table. Can I assume good performance of dict.iteritems() when iterating through the elements of a hash table, without too much overhead?
Since dictionaries are used a lot in Python, I assume that this is implemented in some smart way. Still, I need to make sure.
If you look at the notes on Python's dictionary source code, I think the relevant points are the following:
Those methods (iteration and key listing) loop over every potential entry
How many potential entries will there be, as a function of largest size (largest number of keys ever stored in that dictionary)? Look at the following two sections in the same document:
Maximum dictionary load in PyDict_SetItem. Currently set to 2/3
Growth rate upon hitting maximum load. Currently set to *2.
This would suggest that the load factor of a dictionary stays somewhere between 1/3 and 2/3 (or between 1/6 and 2/3 if the growth rate is set to *4), which means you will be checking up to 3 (or 6 with *4) potential entries for every key.
Of course, whether it's 3 entries or 1000, it's still O(n), but 3 seems like a pretty acceptable constant factor.
Incidentally here are the rest of the source & documentation, including that of the DictObject:
http://svn.python.org/projects/python/trunk/Objects/
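If you want to observe that cost model empirically, here is a rough, CPython-specific sketch; it relies on the (CPython) behaviour that deleting keys does not shrink the underlying table, so a dict that was once large iterates more slowly than a fresh small one:

import timeit

# A dict that once held a million keys but now holds only ten; in CPython,
# deletions do not shrink the underlying table.
big = {i: i for i in range(1_000_000)}
for i in range(999_990):
    del big[i]

small = {i: i for i in range(999_990, 1_000_000)}  # fresh dict, same ten keys

print(timeit.timeit(lambda: list(big), number=100))    # walks the big table
print(timeit.timeit(lambda: list(small), number=100))  # walks a small table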
So I have a list of 85 items. I would like to continually reduce this list in half (essentially a binary search on the items) -- my question is then, what is the most efficient way to reduce the list? A list comprehension would continually create copies of the list which is not ideal. I would like in-place removal of ranges of my list until I am left with one element.
I'm not sure if this is relevant but I'm using collections.deque instead of a standard list. They probably work the same way more or less though so I doubt this matters.
For a mere 85 items, truthfully, almost any method you want to use would be more than fast enough. Don't optimize prematurely.
That said, depending on what you're actually doing, a list may be faster than a deque. A deque is faster for adding and removing items at either end, but it doesn't support slicing.
With a list, if you want to copy or delete a contiguous range of items (say, the first 42) you can do this with a slice. Assuming half the list is eliminated at each pass, copying items to a new list would be slower on average than deleting items from the existing list (deleting requires moving the half of the list that's not being deleted "leftward" in memory, which would be about the same time cost as copying the other half, but you won't always need to do this; deleting the latter half of a list won't need to move anything).
To do this with a deque efficiently, you would want to pop() or popleft() the items rather than slicing them (lots of attribute access and method calls, which are relatively expensive in Python), and you'd have to write the loop that controls the operation in Python, which will be slower than the native slice operation.
Since you said it's basically a binary search, it is probably actually fastest to simply find the item you want to keep without modifying the original container at all, and then return a new container holding that single item. A list is going to be faster for this than a deque since you will be doing a lot of accessing items by index. To do this in a deque will require Python to follow the linked list from the beginning each time you access an item, while accessing an item by index is a simple, fast calculation for a list.
collections.deque is implemented via a linked list, hence binary search would be much slower than a linear search. Rethink your approach.
Not sure that this is what you really need but:
x = list(range(100))
while len(x) > 1:
    mid = len(x) // 2
    if condition:      # 'condition' stands for whatever test picks the half to keep
        x = x[:mid]    # keep the first half
    else:
        x = x[mid:]    # keep the second half
1. 85 items are not even worth thinking about. Computers are fast, really.
2. Why would you delete ranges from the list, instead of simply picking the one result?
3. If there is a good reason why you can't do (2): keep the original list and change two indices only, the start and end index of the sublist you're looking at (see the sketch below).
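A sketch of that index-based approach, with items and go_left() as made-up stand-ins for your actual data and test:

items = list(range(85))                 # stand-in for your 85 candidates
target = 57                             # stand-in for whatever you are searching for

def go_left(items, lo, mid, hi):
    # Hypothetical predicate: True if the answer lies in items[lo:mid].
    return target < items[mid]

lo, hi = 0, len(items)                  # the live sublist is items[lo:hi]
while hi - lo > 1:
    mid = (lo + hi) // 2
    if go_left(items, lo, mid, hi):
        hi = mid                        # keep the left half
    else:
        lo = mid                        # keep the right half

print(items[lo])                        # 57: the single remaining element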
On a previous question I compared a number of techniques for removing items from a list given a predicate. (That is, I have a function which returns True or False for whether to keep a particular item.) As I recall, using a list comprehension was the fastest. The fact is, copying is really, really cheap.
The only further improvement depends on which items you are removing, but you haven't indicated anything about that, so I can't suggest anything more specific.
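For reference, the comprehension pattern looks like this, with keep() as a stand-in for your actual predicate:

items = list(range(100))

def keep(item):
    # Hypothetical predicate: return True for items you want to retain.
    return item % 3 != 0

items = [x for x in items if keep(x)]   # builds a new list; the copy is cheap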