I understand that a list is different from an array. But still, O(1)? That would mean accessing an element in a list would be as fast as accessing an element in a dict, which we all know is not true.
My question is based on this document:
list

| Operation | Average Case |
|-----------|--------------|
| ...       | ...          |
| Get Item  | O(1)         |
and this answer:
Lookups in lists are O(n), lookups in dictionaries are amortized O(1),
with regard to the number of items in the data structure.
If the first document is true, then why is accessing a dict faster than accessing a list if they have the same complexity?
Can anybody give a clear explanation of this, please? I would say it always depends on the size of the list/dict, but I need more insight on this.
Get item means getting an item at a specific index, while lookup means searching whether some element exists in the list. To do the latter, unless the list is sorted, you need to iterate over all elements, performing O(n) get-item operations, which leads to an O(n) lookup.
A dictionary maintains a smart data structure (a hash table) under the hood, so you do not need to query O(n) times to find whether the element exists, but only a constant number of times (in the average case), leading to an O(1) lookup.
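To make the distinction concrete, here is a small, illustrative timing sketch (the names big_list and big_dict are made up for the example; absolute times will vary, but the gap grows with n):

import timeit

n = 100_000
big_list = list(range(n))
big_dict = dict.fromkeys(range(n))

# Membership test: the list is scanned element by element (O(n)),
# while the dict hashes the key and jumps to its bucket (O(1) average).
print(timeit.timeit(lambda: n - 1 in big_list, number=100))
print(timeit.timeit(lambda: n - 1 in big_dict, number=100))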
Accessing a list l at index n (l[n]) is O(1) because a list is not implemented as a vanilla linked list, where one needs to follow pointers (value, next) n times to reach the cell at index n.
If the memory were contiguous and the entry size fixed, reaching a specific entry would be trivial: we know to jump n times the entry size (like classic arrays in C).
However, since list entries vary in size, the Python implementation uses a contiguous block of memory just for the pointers to the values. This makes indexing a list (l[n]) an operation whose cost is independent of the size of the list or the value of the index.
For more information see http://effbot.org/pyfaq/how-are-lists-implemented.htm
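A quick sketch of that claim: indexing time is essentially flat regardless of list size or index value (illustrative only; the variable names are made up):

import timeit

small = list(range(10))
large = list(range(10_000_000))

# All three take roughly the same time: indexing is one pointer
# lookup, independent of list size and index value.
print(timeit.timeit(lambda: small[5], number=1_000_000))
print(timeit.timeit(lambda: large[5], number=1_000_000))
print(timeit.timeit(lambda: large[9_999_999], number=1_000_000))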
That is because Python stores the address of each element of a list in a separate array. When we want the element at the nth position, all we have to do is look up the nth entry of that address array, which gives us the address of the nth element, by which we can get its value in O(1).
Python does some neat tricks to make these arrays expandable as the list grows. Thus we get the flexibility of lists and the speed of arrays. The trade-off is that the interpreter has to reallocate memory for the address array whenever the list grows past a certain point.
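You can observe that over-allocation from Python itself. This sketch prints the list's reported size only when it jumps; the exact numbers are CPython- and platform-specific:

import sys

l = []
last = sys.getsizeof(l)
print(f"len={len(l):2d} size={last} bytes")
for _ in range(20):
    l.append(None)
    size = sys.getsizeof(l)
    if size != last:
        # Capacity grows in occasional jumps: CPython over-allocates
        # so that repeated appends are amortized O(1).
        print(f"len={len(l):2d} size={size} bytes")
        last = size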
amit has explained in his answer to this question about why lookups in dictionaries are faster than in lists.
The minimum possible I can see is O(log n), from a strict computer-science standpoint.
Related
Today in class, we learned that retrieving an element from a list is O(1) in Python. Why is this the case? Suppose I have a list of four items, for example:
li = ["perry", 1, 23.5, "s"]
These items have different sizes in memory. And so it is not possible to take the memory location of li[0] and add three times the size of each element to get the memory location of li[3]. So how does the interpreter know where li[3] is without having to traverse the list in order to retrieve the element?
A list in Python is implemented as an array of pointers.[1] So, what's really happening when you create the list:
["perry", 1, 23.5, "s"]
is that you are actually creating an array of pointers like so:
[0xa3d25342, 0x635423fa, 0xff243546, 0x2545fade]
Each pointer "points" to the respective objects in memory, so that the string "perry" will be stored at address 0xa3d25342 and the number 1 will be stored at 0x635423fa, etc.
Since all pointers are the same size, the interpreter can in fact add 3 times the size of a pointer to the address of the first pointer to get to the pointer stored at li[3].
[1] Get more details from: the horse's mouth (CPython source code on GitHub).
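One way to see the pointer array indirectly is sys.getsizeof: each extra element adds one pointer-sized slot, no matter how large the object it refers to is (sizes below are examples from a 64-bit CPython build and may differ on yours):

import sys

print(sys.getsizeof([]))                   # e.g. 56
print(sys.getsizeof(["perry"]))            # e.g. 64: one 8-byte pointer more
print(sys.getsizeof(["perry", 1]))         # e.g. 72
print(sys.getsizeof(["perry" * 1000, 1]))  # still e.g. 72: the big string
                                           # lives elsewhere in memory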
When you say a = [...], a is effectively a pointer to a PyObject containing an array of pointers to PyObjects.
When you ask for a[2], the interpreter first follows the pointer to the list's PyObject, then adds 2 pointer-widths to the address of the array inside it, and follows the pointer stored there. The same happens whether you ask for a[0] or a[9999].
Basically, all Python objects are accessed by reference instead of by value, even integer literals like 2. There are just some tricks in the pointer system to keep this all efficient. And pointers have a known size, so they can be stored conveniently in C-style arrays.
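A small sketch of that by-reference behaviour: two list slots initialized from the same variable hold references to the same object, not copies:

x = 2
a = [x, x, "spam"]

# Both slots point at the very same int object.
print(a[0] is a[1])  # True
print(a[0] is x)     # True: the list stores references, not copies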
Short answer: Python lists are arrays.
Long answer: The computer-science term list usually means either a singly-linked list (as used in functional programming) or a doubly-linked list (as used in procedural programming). These data structures support O(1) insertion at either the head of the list (functionally) or at any position that does not need to be searched for (procedurally). A Python "list" has none of these characteristics. Instead it supports (amortized) O(1) appending at the end of the list (like a C++ std::vector or Java ArrayList). Python lists are really resizable arrays in CS terms.
The following passage from the Python documentation explains some of the performance characteristics of Python "lists":
It is also possible to use a list as a queue, where the first element added is the first element retrieved (“first-in, first-out”); however, lists are not efficient for this purpose. While appends and pops from the end of list are fast, doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one).
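That same documentation page recommends collections.deque when you need a queue. A minimal sketch of the difference:

from collections import deque

queue = deque(["perry", 1, 23.5])
queue.append("s")        # enqueue at the right end: O(1)
first = queue.popleft()  # dequeue from the left end: O(1)
print(first)             # perry

# The list equivalent of popleft() is list.pop(0), which must shift
# every remaining element left by one: O(n) per dequeue.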
I have a number of objects (roughly 530,000). These objects are randomly assigned to a set of lists (not actually random but let's assume it is). These lists are indexed consecutively and assigned to a dictionary, called groups, according to their index. I know the total number of objects but I do not know the length of each list ahead of time (which in this particular case happens to vary between 1 and 36000).
Next I have to process each object contained in the lists. To speed this up I am using MPI to send them to different processes. The naive way to do this is to assign each process len(groups)/size lists (where size is the number of processes used), assign any remainder, have it process the contained objects, return the results, and wait. This obviously means, however, that if one process gets, say, mostly very short lists and another all the very long lists, the first process will sit idle most of the time and the performance gain will not be very large.
What would be the most efficient way to assign the lists? One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible. But I am not sure how to best implement this. Does anybody have any suggestions?
One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible.
Assuming that processing time scales exactly with the sum of list lengths and your processors are homogeneous, this is in fact what you want. This is called the multiprocessor scheduling problem, which is very close to the bin packing problem, but with a fixed number of bins and the goal of minimizing the maximum load.
In general this is an NP-hard problem, so you will not get a perfect solution. The simplest reasonable approach is to greedily assign the largest remaining chunk of work to the processor that has the least work assigned to it so far.
It is trivial to implement this in Python (the example uses a list of lists):
import numpy as np

greedy = [[] for _ in range(nprocs)]
for group in sorted(groups, key=len, reverse=True):
    # index of the process with the smallest total assigned length
    smallest_index = np.argmin([sum(map(len, assignment)) for assignment in greedy])
    greedy[smallest_index].append(group)
If you have a large number of processors, you may want to optimize the smallest_index computation by using a priority queue. This will produce significantly better results than the naive sorted split recommended by Attersson:
(https://gist.github.com/Zulan/cef67fa436acd8edc5e5636482a239f8)
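For reference, here is a sketch of that priority-queue variant using the standard-library heapq, following the list-of-lists example above (groups and nprocs are made-up sample data); finding the least-loaded process becomes O(log nprocs) per group instead of O(nprocs):

import heapq

nprocs = 3
groups = [[0] * n for n in (9, 7, 5, 4, 4, 3, 1)]  # made-up list of lists

# Each heap entry is (total_assigned_length, rank, assigned_groups);
# ranks are unique, so tuple comparison never reaches the list.
heap = [(0, rank, []) for rank in range(nprocs)]
heapq.heapify(heap)

for group in sorted(groups, key=len, reverse=True):
    load, rank, assigned = heapq.heappop(heap)  # least-loaded process
    assigned.append(group)
    heapq.heappush(heap, (load + len(group), rank, assigned))

greedy = [assigned for _, _, assigned in sorted(heap, key=lambda t: t[1])]
print([sum(map(len, a)) for a in greedy])  # [12, 11, 10]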
On the assumption that a longer list has a larger memory size, your_list has a memory size retrievable by the following code:
import sys
sys.getsizeof(your_list)
(Note: it depends on the Python implementation. Please read How many bytes per element are there in a Python list (tuple)?)
There are several ways you can proceed then. If your original "pipeline" of lists can be sorted by key=sys.getsizeof, you can then slice it and assign to process N every Nth element (Pythonic way to return list of every nth item in a larger list).
Example:
sorted_pipeline = [list1,list2,list3,.......]
sorted_pipeline[0::10] # every 10th item, assign to the first sub-process of 10
This balances loads in a fair manner while keeping complexity at O(N log N) for the initial sort, plus constant (or linear, if the lists are copied) work to assign the lists.
Illustration (as requested) of splitting 10 elements into 3 groups:
>>> my_list = [0,1,2,3,4,5,6,7,8,9]
>>> my_list[0::3]
[0, 3, 6, 9]
>>> my_list[1::3]
[1, 4, 7]
>>> my_list[2::3]
[2, 5, 8]
And the final solution:
assigned_groups = {}
for i in range(size):  # use xrange(size) on Python 2
    assigned_groups[i] = sorted_pipeline[i::size]
If this is not possible, you can always keep a counter of total queue size, per sub-process pipeline, and tweak probability or selection logic to take that into account.
I have a list of ~30 floats. I want to see if a specific float is in my list. For example:
1 >> # For the example below my list has integers, not floats
2 >> list_a = range(30)
3 >> 5.5 in list_a
False
4 >> 1 in list_a
True
The bottleneck in my code is line 3. I search if an item is in my list numerous times, and I require a faster alternative. This bottleneck takes over 99% of my time.
I was able to speed up my code by making list_a a set instead of a list. Are there any other ways to significantly speed up this line?
The best possible time to check whether an element is in a list, if the list is not sorted, is O(n), because the element may be anywhere and you need to look at each item and check whether it is what you are looking for.
If the list were sorted, you could use binary search for O(log n) lookup time. You can also use hash maps for average O(1) lookup time (or use the built-in set, which is basically a dictionary that accomplishes the same task).
That does not make much sense for a list of length 30, though.
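Besides a set, the sorted-list option mentioned above can be sketched with the standard-library bisect module (illustrative only; for ~30 floats the constant factors dominate):

import bisect

list_a = [float(x) for x in range(30)]

# Hash-based membership: average O(1) per test.
set_a = set(list_a)
print(5.5 in set_a)  # False
print(1.0 in set_a)  # True

# Binary search on a sorted list: O(log n) per test.
list_a.sort()
def contains(sorted_list, value):
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value
print(contains(list_a, 5.5))  # False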
In my experience, Python indeed slows down when searching in a long list.
To complement the suggestions above, my suggestion is to subset the list, of course only if the list can be subset and each query can easily be assigned to the correct subset.
An example is searching for a word in an English dictionary: first subset the dictionary into 26 sections based on each word's initial letter. If the query is "apple", you only need to search the "a" section. The advantage is that you have greatly limited the search space, hence the speed boost.
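A minimal sketch of that idea, bucketing a made-up word list by initial letter with collections.defaultdict:

from collections import defaultdict

words = ["apple", "avocado", "banana", "cherry"]  # made-up example data

buckets = defaultdict(list)
for w in words:
    buckets[w[0]].append(w)  # group words by their first letter

query = "apple"
print(query in buckets[query[0]])  # True; only the "a" bucket is scanned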
For a numerical list, subset it based on either a value range or the first digit.
Hope this helps.
I have 40,000 documents, with 93.08 words per document on average, where every word is a number (which can index a dictionary) and every word has a count (frequency). Read more here.
I am torn between two data structures to store the data and was wondering which one I should choose, and which one the Python people would choose!
Triple-list:
A list, where every node:
    is a list, where every node:
        is a list of two values: word_id and count.
Double-dictionary:
A dictionary, with the doc_ids as keys and dictionaries as values.
Each of those value dictionaries has a word_id as key and the count as value.
I feel that the first will require less space (since it doesn't store the doc_id), while the second will be easier to handle and access. I mean, accessing the i-th element in the list is O(n), while it is constant time in the dictionary, I think. Which one should I choose?
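For concreteness, here is a tiny made-up corpus in both layouts (doc_ids 0 and 1, arbitrary word_ids and counts):

# Triple-list: the position in the outer list doubles as the doc_id.
triple_list = [
    [[17, 2], [42, 1]],  # doc 0: word 17 twice, word 42 once
    [[5, 4]],            # doc 1: word 5 four times
]

# Double-dictionary: doc_id and word_id are explicit keys.
double_dict = {
    0: {17: 2, 42: 1},
    1: {5: 4},
}

print(triple_list[0][1][1])  # 1: doc 0, second pair, count field
print(double_dict[0][42])    # 1: doc 0, word 42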
You should use a dictionary. It will make your code easier to understand and to write, and it will have lower complexity as well.
The only reason you would use a list is if you cared about the order of the documents.
If you don't care about the order of the items, you should definitely use a dictionary: dictionaries are used to group associated data, while lists are generally used to group more generic items.
Moreover, lookups in dictionaries are faster than in lists.
Lookups in lists are O(n), while lookups in dictionaries are O(1), though dictionaries take considerably more memory than lists.
Essentially you just want to store a large quantity of numbers, for which the most space-efficient choice is an array. Arrays are one-dimensional, so you could write a class that takes three indices (the last being 0 for word_id and 1 for count) and does some basic addition and multiplication to find the correct 1D index.
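A sketch of such a class using the standard-library array module. It assumes a fixed number of (word_id, count) slots per document (variable-length documents would additionally need an offsets array), and all names here are made up:

from array import array

class TripleArray:
    """Flat unsigned-int array addressed as (doc, slot, field),
    where field 0 is word_id and field 1 is count."""

    def __init__(self, n_docs, slots_per_doc):
        self.slots = slots_per_doc
        self.data = array('I', [0]) * (n_docs * slots_per_doc * 2)

    def _index(self, doc, slot, field):
        # Basic addition and multiplication to flatten 3D -> 1D.
        return (doc * self.slots + slot) * 2 + field

    def get(self, doc, slot, field):
        return self.data[self._index(doc, slot, field)]

    def set(self, doc, slot, field, value):
        self.data[self._index(doc, slot, field)] = value

ta = TripleArray(n_docs=2, slots_per_doc=3)
ta.set(0, 1, 0, 42)     # doc 0, slot 1: word_id 42
ta.set(0, 1, 1, 7)      # doc 0, slot 1: count 7
print(ta.get(0, 1, 1))  # 7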
I've noticed the table of the time complexity of set operations on the official Python website. But I just want to ask: what's the time complexity of converting a list to a set? For instance,
l = [1, 2, 3, 4, 5]
s = set(l)
I kind of know that a set is actually a hash table, but how exactly does it work? Is it O(n) then?
Yes. Iterating over the list is O(n) and adding each element to the hash set is O(1) on average, so the total operation is O(n) on average.
I was asked the same question in my last interview and didn't get it right.
As Trilarion commented on the first answer, the worst-case complexity is O(n^2). Iterating through the list still takes O(n), but you cannot always add each element to the hash table in constant time (sets are implemented using hash tables). In the worst case, the hash function hashes every element to the same value, so adding an element to the hash set is no longer O(1): all colliding elements end up in the same bucket (in a chaining implementation, a linked list; CPython's sets actually use open addressing, but the effect is the same). When adding an element we must make sure it doesn't already exist (a set by definition has no duplicates), so we have to compare it against every element already in the bucket, which takes a total of n(n-1)/2 = O(n^2) comparisons.
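You can provoke this worst case deliberately with a class whose instances all hash to the same value; building a set then degrades to roughly quadratic time (the class name is made up for the demo):

import timeit

class Collider:
    """Every instance hashes to the same bucket."""
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 1  # force all elements to collide
    def __eq__(self, other):
        return self.x == other.x

for n in (250, 500, 1000):
    items = [Collider(i) for i in range(n)]
    t = timeit.timeit(lambda: set(items), number=5)
    print(n, round(t, 4))  # time roughly quadruples as n doubles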