Dictionary versus Nested list/array? - python

In Python, dictionaries are used for key/value pairs. Nested lists, or arrays, however, can do the same thing with two-value lists inside a big list, for example [[1, 2], [3, 4]].
Arrays have more uses and are actually faster, but dictionaries are more straightforward. What are the pros and cons of using a dictionary versus an array?

If you use a list of key/value pairs, getting the value corresponding to a key requires a linear search, which is O(n). If the list is sorted by the keys you can improve that to O(log n), but then adding to the list becomes more complicated and expensive since you have to keep it sorted.
Dictionaries are implemented as hash tables, so getting the value corresponding to a key is amortized constant time.
Furthermore, Python provides convenient syntax for looking up keys in a dictionary. You can write dictname[key]. Since lists aren't intended to be used as lookup tables, there's no corresponding syntax for finding a value by key there. listname[index] gets an element by its numeric position, not looking up the key in a key/value pair.
Of course, if you want to use an association list, there's nothing stopping you from writing your own functions to do so. You could also embed them in a class and define appropriate methods so you can use [] syntax to get and set them.
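For instance, a minimal sketch of such a wrapper class (the name AssocList and its internals are illustrative, not a standard API):

```python
class AssocList:
    """A toy association list: [key, value] pairs in a plain list,
    with dict-like [] syntax. Lookup and assignment are O(n)."""

    def __init__(self):
        self._pairs = []

    def __getitem__(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        for pair in self._pairs:
            if pair[0] == key:
                pair[1] = value  # overwrite existing key
                return
        self._pairs.append([key, value])

al = AssocList()
al["a"] = 1
al["a"] = 2          # overwrites the earlier value
print(al["a"])       # -> 2
```

This buys you the convenient `[]` syntax, but every access still scans the list, which is exactly the O(n) cost described above.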


What is the fastest way to access elements in a nested list?

I have a list which is made up out of three layers, looking something like this for illustrative purposes:
a = [[['1'],['2'],['3'],['']],[['5'],['21','33']]]
Thus I have a top list which contains several other lists each of which again contains lists.
The first layer will contain on the order of tens of lists. The next layer could contain possibly millions of lists, and the bottom layer will contain either an empty string, a single string, or a handful of values (each a string).
I now need to access the values in the bottom-most layer and store them in a new list in a particular order which is done inside a loop. What is the fastest way of accessing these values? The amount of memory used is not of primary concern to me (though I obviously don't want to squander it either).
I can think of two ways:
I access list a directly to retrieve the desired value, e.g. a[1][1][0] would return '21'.
I create a copy of the elements of a and then access these to flatten the list a bit more. In this case thus, e.g.: b=a[0], c=a[1] so instead of accessing a[1][1][0] I would now access b[1][0] to retrieve '21'.
Is there any performance penalty involved in accessing nested lists? That is, is there any benefit to be gained in splitting list a into separate lists, or am I merely incurring a RAM penalty in doing so?
Accessing an element via its index (i.e. a[1][1][0]) is an O(1) operation: source. You won't get much quicker than that.
Assignment is also an O(1) operation, so there's no difference between the two methods you've described as far as speed goes. The second one doesn't incur any memory problem either, because assigning a list to a name binds a reference; it does not copy the list (unless you explicitly ask for a copy).
The two methods are more or less identical, given that b = a[0] only binds another name to the list at that index; it does not copy the list. The only difference is that, in your second method, you perform the extra name bindings on top of the same nested accesses, so in theory it is a tiny bit slower.
As pointed out by @joaquinlpereyra, the Python Wiki has a list of the complexity of such operations: https://wiki.python.org/moin/TimeComplexity
So, long answer cut short: Just accessing the list items is faster.
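If you want to convince yourself, a quick measurement with the standard timeit module might look like this (the list and repetition counts are illustrative):

```python
import timeit

# A small 3-level nested list mirroring the question's structure.
a = [[['1'], ['2'], ['3'], ['']], [['5'], ['21', '33']]]

# Method 1: index the full nested path each time.
direct = timeit.timeit("a[1][1][0]", globals={"a": a}, number=100_000)

# Method 2: bind an intermediate name once, then index through it.
b = a[1]
alias = timeit.timeit("b[1][0]", globals={"b": b}, number=100_000)

# Both are O(1); any difference is a constant factor per access.
print(f"direct: {direct:.4f}s  alias: {alias:.4f}s")
```

The absolute numbers will vary by machine; the point is that both forms are constant-time index operations.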

Maintaining a sorted view of dict values?

I have a dict containing about 50,000 integer values, and a set that contains the keys of 100 of them. My inner loop increments or decrements the values of the dict items in an unpredictable way.
Periodically I need to replace one member of the set with the key of the largest element not already in the set. As an aside, if the dict items were sorted, the order of the sort would change slightly, not dramatically, between invocations of this routine.
Re-sorting the entire dict every time seems wasteful, although maybe less so given that it's already "almost" sorted. While I may be guilty of premature optimization, performance will matter as this will run a very large number of iterations, so I thought it worth asking my betters whether there's an obviously more efficient and pythonic approach.
I'm aware of the notion of dict "views" - Windows onto the contents which are updated as the contents change. Is there such a thing as a "sorted view"?
Instead of using a dict you could use a collections.Counter object, which has a neat most_common(n) method that will:
Return a list of the n most common elements and their counts from the most common to the least.
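A sketch of how that could serve this use case (the counts and the active set are made up for illustration):

```python
from collections import Counter

# Counter behaves like a dict of integer values.
counts = Counter({"a": 10, "b": 50, "c": 30, "d": 40})
active = {"b"}  # keys currently in the set

# Find the largest-valued key not already in the set:
# most_common() yields (key, count) pairs in descending count order.
for key, count in counts.most_common():
    if key not in active:
        largest_outside = key
        break

print(largest_outside)  # -> d
```

Note that most_common() still sorts on every call, so for 50,000 entries probed frequently you may want to benchmark it against alternatives such as heapq.nlargest.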

More efficient use of dictionaries

I'm going to store on the order of 10,000 securities X 300 date pairs X 2 Types in some caching mechanism.
I'm assuming I'm going to use a dictionary.
Question Part 1:
Which is more efficient or faster? Assume that I'll generally be looking up using a list of security IDs plus the two dates and the type. If there is a big efficiency gain from tweaking my lookup, I'm happy to do that. Also assume I can be wasteful of memory to an extent.
Method 1: store and look up using keys that look like strings "securityID_date1_date2_type"
Method 2: store and look up using keys that look like tuples (securityID, date1, date2, type)
Method 3: store and look up using nested dictionaries of some variation mentioned in methods 1 and 2
Question Part 2:
Is there an easy and better way to do this?
It's going to depend a lot on your use case. Is lookup the only activity, or will you do other things, e.g.:
Iterate all keys/values? For simplicity, you wouldn't want to nest dictionaries if iteration is relatively common.
What about iterating a subset of keys with a given securityID, type, etc.? Nested dictionaries (each keyed on one or more components of your key) would be beneficial if you needed to iterate "keys" with one component having a given value.
What about if you need to iterate based on a different subset of the key components? If that's the case, plain dict is probably not the best idea; you may want relational database, either the built-in sqlite3 module or a third party module for a more "production grade" DBMS.
Aside from that, it matters quite a bit how you construct and use keys. Strings cache their hash code (and can be interned for even faster comparisons), so if you reuse a string for lookup having stored it elsewhere, it's going to be fast. But tuples are usually safer (strings constructed from multiple pieces can accidentally produce the same string from different keys if the separation between components in the string isn't well maintained). And you can easily recover the original components from a tuple, where a string would need to be parsed to recover the values. Nested dicts aren't likely to win (and require some finesse with methods like setdefault to populate properly) in a simple contest of lookup speed, so it's only when iterating a subset of the data for a single component of the key that they're likely to be beneficial.
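A small illustration of that collision risk with string keys (the key values here are invented):

```python
# Two distinct logical keys that collide once joined with "_",
# because one component itself contains the separator.
key1 = ("SEC_1", "2024-01-01", "2024-06-30", "bid")
key2 = ("SEC", "1_2024-01-01", "2024-06-30", "bid")

s1 = "_".join(key1)
s2 = "_".join(key2)

print(s1 == s2)      # -> True: the string keys are ambiguous
print(key1 == key2)  # -> False: tuple keys stay distinct
```

Tuples sidestep the problem entirely and keep the components recoverable without parsing.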
If you want to benchmark, I'd suggest populating a dict with sample data, then use the timeit module (or ipython's %timeit magic) to test something approximating your use case. Just make sure it's a fair test, e.g. don't lookup the same key each time (using itertools.cycle to repeat a few hundred keys would work better) since dict optimizes for that scenario, and make sure the key is constructed each time, not just reused (unless reuse would be common in the real scenario) so string's caching of hash codes doesn't interfere.
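A rough benchmarking skeleton along those lines (all names, sizes, and data are placeholders, not real market data):

```python
import itertools
import random
import timeit

# Hypothetical sample data: tuple keys -> float values.
random.seed(0)
keys = [(f"SEC{i}", "2024-01-01", "2024-06-30", t)
        for i in range(10_000) for t in ("bid", "ask")]
cache = {k: random.random() for k in keys}

# Cycle through a few hundred distinct keys so the benchmark
# doesn't hit dict's repeated-same-key fast path every time.
probe = itertools.cycle(random.sample(keys, 500))

t = timeit.timeit(lambda: cache[next(probe)], number=100_000)
print(f"{t:.3f}s for 100k tuple-key lookups")
```

The same skeleton works for string keys: build the string inside the timed callable if key construction would happen per-lookup in your real code.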

Tried to implement lists as dictionary keys within algorithm, what's a quick solution?

I am trying to implement the Apriori algorithm... http://codeding.com/articles/apriori-algorithm in Python.
The highest level data structuring goes something like this:
frequentItemSets[ k-level : itemSetDictionary ]
    |
    |__ itemSetDictionary[ listOfItems : supportValueOfItems ]
            |
            |__ listOfItems: a list of integers, sorted lexicographically
I need to keep track of an arbitrary number of sets, the cardinality (k-level) of those sets, and a value that I calculate for each of those sets. I thought that using a list for all of the sets would be a good idea as they maintain order and are iterable. I tried to use lists as the keys within the itemSetDictionary, as you can see above, but now I see that mutable data structures like lists are not allowed as keys within Python dictionaries.
I am trying to figure out the quickest way to fix this issue. I know that I can just create some classes so that the keys are now objects, and not iterable data structures, but I feel like that would take a lot of time for me to change.
Any ideas?
Dictionary keys must be hashable, which usually requires them to be immutable. Whether they are iterable is immaterial.
In this particular case, you probably want to use frozensets as keys.
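For example (the itemsets and support values here are made up):

```python
# Support values keyed by itemset. frozenset is hashable and,
# like a set, ignores element order, which suits Apriori itemsets.
support = {}
support[frozenset([1, 2, 3])] = 0.25
support[frozenset([2, 5])] = 0.40

# Lookup works regardless of the order you list the items in:
print(support[frozenset([3, 1, 2])])  # -> 0.25
```

If you do need a key that preserves order, a tuple of the sorted items (tuple(sorted(items))) is the other common choice.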

Best way to implement 2-D array of series elements in Python

I have a dynamic set consisting of a data series on the order of hundreds of objects, where each series must be identified (by integer) and consists of elements, also identified by an integer. Each element is a custom class.
I used a defaultdict to created a nested (2-D) dictionary. This enables me to quickly access a series and individual elements by key/ID. I needed to be able to add and delete elements and entire series, so the dict served me well. Also note that the IDs do not have to be sequential, due to add/delete. The IDs are important since they are unique and referenced elsewhere through my application.
For example, consider the following data set with keys/IDs,
[1][1,2,3,4,5]
[2][1,4,10]
[4][1]
However, now I realize I want to be able to insert elements in a series, but the dictionary doesn't quite support it. For example, I'd like to be able to insert a new element between 3 and 4 for series 1, causing the IDs above it (from 4,5) to increment (to 5,6):
[1][1,2,3,4,5] becomes
[1][1,2,3,4(new),5,6]
The order matters since the elements are part of a sequential series. I realize that this would be easier with a nested list since it supports insert(), but then I would be forced to iterate over the entire 2-D array to get the element indices right?
What would be the most optimal way to implement this data structure in Python?
I think what you want is a dict with list values (avoid naming the variable dict, since that shadows the builtin):

series = {1: [...], 3: [...], ...}

You can then operate on the lists as you please. If the list values are sequential ints, just use:

series[key].append(val)
series[key].sort()

Don't worry about the speed unless you find out it's a problem: premature optimization is the root of all evil. In fact, don't even sort your dict values until you have to, if you want to be really efficient.
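As a sketch of the suggested approach (names and data are illustrative): if each series is a plain list and the element IDs are implicit positions, list.insert gives you the insert-and-renumber behavior for free, because everything after the insertion point shifts up automatically.

```python
from collections import defaultdict

series = defaultdict(list)
series[1] = ["a", "b", "c", "d", "e"]  # element ID == index + 1

# Insert a new element between positions 3 and 4 of series 1;
# the former elements 4 and 5 implicitly become 5 and 6.
series[1].insert(3, "new")
print(series[1])  # -> ['a', 'b', 'c', 'new', 'd', 'e']
```

The trade-off is that IDs are no longer stable across inserts, so this only works if nothing else in the application holds on to the old numeric IDs.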
