Today in class, we learned that retrieving an element from a list is O(1) in Python. Why is this the case? Suppose I have a list of four items, for example:
li = ["perry", 1, 23.5, "s"]
These items have different sizes in memory. And so it is not possible to take the memory location of li[0] and add three times the size of each element to get the memory location of li[3]. So how does the interpreter know where li[3] is without having to traverse the list in order to retrieve the element?
A list in Python is implemented as an array of pointers¹. So, what's really happening when you create the list:
["perry", 1, 23.5, "s"]
is that you are actually creating an array of pointers like so:
[0xa3d25342, 0x635423fa, 0xff243546, 0x2545fade]
Each pointer "points" to the respective objects in memory, so that the string "perry" will be stored at address 0xa3d25342 and the number 1 will be stored at 0x635423fa, etc.
Since all pointers are the same size, the interpreter can in fact add 3 times the size of an element to the address of li[0] to get to the pointer stored at li[3].
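You can peek at this indirection from Python itself. In CPython (an implementation detail, not a language guarantee), id() returns an object's memory address:

li = ["perry", 1, 23.5, "s"]
# Each element is a separate object living at its own address;
# the list itself only stores the references.
for item in li:
    print(hex(id(item)), repr(item))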
¹ For more details, go straight to the horse's mouth: the CPython source code on GitHub.
When you say a = [...], a is effectively a pointer to a PyObject containing an array of pointers to PyObjects.
When you ask for a[2], the interpreter first follows the pointer to the list's PyObject, then adds 2 (scaled by the pointer size) to the address of the array inside it, and returns the pointer stored there. The same happens if you ask for a[0] or a[9999].
Basically, all Python objects are accessed by reference instead of by value, even integer literals like 2. There are just some tricks in the pointer system to keep this all efficient. And pointers have a known size, so they can be stored conveniently in C-style arrays.
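One such trick, for example, is CPython's small-integer cache: integers from -5 to 256 are created once at startup and shared, so every 2 in your program is a reference to the same object (again, a CPython detail, not a language guarantee):

x = 1 + 1
y = 2
print(x is y)                  # True in CPython: both names reference the cached int 2
print(hex(id(x)), hex(id(y)))  # same address twice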
Short answer: Python lists are arrays.
Long answer: The computer science term list usually means either a singly-linked list (as used in functional programming) or a doubly-linked list (as used in procedural programming). These data structures support O(1) insertion at either the head of the list (functionally) or at any position that does not need to be searched for (procedurally). A Python "list" has none of these characteristics. Instead it supports (amortized) O(1) appending at the end of the list (like a C++ std::vector or Java ArrayList). Python lists are really resizable arrays in CS terms.
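You can watch the resizable-array behaviour from Python: sys.getsizeof reports the list's current allocation, which grows in occasional jumps rather than on every append (the exact step sizes are a CPython implementation detail):

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # The allocation grows in chunks, so most appends are free: amortized O(1).
        print(len(lst), size)
        last = size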
The following comment from the Python documentation explains some of the performance characteristics of Python "lists":
It is also possible to use a list as a queue, where the first element added is the first element retrieved (“first-in, first-out”); however, lists are not efficient for this purpose. While appends and pops from the end of list are fast, doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one).
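When you do need fast operations at the front, the standard library's collections.deque is the usual alternative:

from collections import deque

q = deque(["a", "b", "c"])
q.append("d")       # O(1) at the right end, like a list
print(q.popleft())  # O(1) at the left end; list.pop(0) would be O(n)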
Related
I was reading about Recursive Data Type which has the following quote:
In computer programming languages, a recursive data type (also known as a recursively-defined, inductively-defined or inductive data type) is a data type for values that may contain other values of the same type
I understand that Linked Lists and Trees can be recursive data types because they contain smaller versions of the same data structure, like a Tree can have a subtree.
However, it got me really confused: doesn't a fixed-size array also contain sub-arrays, which are still the same type?
Can someone explain with examples what makes a data structure recursive?
However, it got me really confused: doesn't a fixed-size array also contain sub-arrays?
Conceptually, you could say that every array "contains" sub-arrays, but the arrays themselves aren't made up of smaller arrays in the code. An array is made up of a contiguous chunk of elements, not other arrays.
A recursive structure (like, as you mentioned, a linked list) literally contains versions of itself:
class Node {
    int value;        // the element stored in this node
    Node next = null; // <-- Nodes can literally hold other Nodes
}
Whereas, if you think of an array represented as a class with fixed fields for elements, it contains elements, not other arrays:
class Array<E> {
    E elem1 = ...; // <-- In the code, an array isn't made up of other arrays,
    E elem2 = ...; //     it's made up of elements.
    ...
}
(This is a bad, inaccurate representation of an array, but it's the best I can communicate in simple code).
A structure is recursive if while navigating through it, you come across "smaller" versions of the whole structure. While navigating through an array, you'll only come across the elements that the array holds, not smaller arrays.
Note, though, that this depends entirely on the implementation of the structure. In Clojure, for example, "vectors" behave essentially identically to arrays and can be thought of as arrays while using them, but internally they're actually a tree (essentially a multi-child linked list).
A data type describes how data is stored (at least logically; on deeper layers there are only numbers, bits, transistors, etc. anyway). While an array (but only a non-empty one) can be considered as something plus another (sub-)array (such an approach is used commonly for lists in languages like Lisp and Prolog), it is usually stored element-wise.
For example, a list in Prolog is defined as either an empty list (a special case) or an element (called head) concatenated with another list (called tail). This makes a recursive data type.
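Here is a minimal Python sketch of the same recursive definition; the Cons class and length function are illustrative names, not from any library:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Cons:
    head: object                   # the first element
    tail: Optional["Cons"] = None  # the rest: a value of the same type, or None (empty)

def length(lst: Optional[Cons]) -> int:
    # The recursion in the code mirrors the recursion in the type.
    return 0 if lst is None else 1 + length(lst.tail)

print(length(Cons(1, Cons(2, Cons(3)))))  # -> 3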
On the other hand, a list in C can be defined as struct list { char data[100]; int length; };. Here, list is not used in the definition of list (only char[] and int are used), so it's not a recursive data type.
This question already has answers here:
How is Python's List Implemented?
Let's say we have
a = [1, 2, 3]
Whenever I use an index into the list, like 0, 1, or 2 in this case, how does Python 3 retrieve the element at that index? Is there a specific address for each element inside the list under the hood, other than the index?
The array (a list in Python) technically stores pointers rather than the objects themselves, which lets every slot in the array have the same fixed size even when the list holds objects of mixed types.
From the Python docs:
CPython's lists are really variable-length arrays, not Lisp-style linked lists. The implementation uses a contiguous array of references to other objects, and keeps a pointer to this array and the array's length in a list head structure.

This makes indexing a list a[i] an operation whose cost is independent of the size of the list or the value of the index.

When items are appended or inserted, the array of references is resized. Some cleverness is applied to improve the performance of appending items repeatedly; when the array must be grown, some extra space is allocated so the next few times don't require an actual resize.
Source: https://docs.python.org/3/faq/design.html#how-are-lists-implemented-in-cpython
More explanation:
What is a pointer?
A pointer is a variable that stores a memory address. Pointers are used to store the addresses of other variables or memory items.
And how does indexing work?
In C terms, a[i] means the same as *(a + i), where a is (or decays to) a pointer to the first element of the array:
*(a + i)
So if a pointer p points to an element of an array, adding n to p makes it point at the nth element after the original one. Indexing therefore involves adding the correct offset in bytes (n times the size of a reference) to the array's base address.
The size of a reference is the same as the CPU's word size: 4 bytes on a 32-bit system and 8 bytes on a 64-bit system.
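You can confirm the fixed per-slot size from Python itself (the numbers below assume 64-bit CPython; exact values vary by build and version):

import sys

empty = sys.getsizeof([])
mixed = sys.getsizeof(["perry", 1, 23.5, "s"])
print(mixed - empty)  # ~32 on 64-bit CPython: 8 bytes per reference, regardless of type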
[Figure: memory representation of an array of pointers]
Hope this clears things up.
I understand that a list is different from an array. But still, O(1)? That would mean accessing an element in a list would be as fast as accessing an element in a dict, which we all know is not true.
My question is based on this document:
list
----------------------------
| Operation | Average Case |
|-----------|--------------|
| ... | ... |
|-----------|--------------|
| Get Item | O(1) |
----------------------------
and this answer:
Lookups in lists are O(n); lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure.
If the first document is true, then why is accessing a dict faster than accessing a list if they have the same complexity?
Can anybody give a clear explanation on this please? I would say it always depends on the size of the list/dict, but I need more insight on this.
Get Item means fetching the item at a specific index, while lookup means searching for whether some element exists in the list. To do that, unless the list is sorted, you need to iterate over all elements, performing O(n) Get Item operations, which leads to an O(n) lookup.
A dictionary maintains a smart data structure (a hash table) under the hood, so you don't need O(n) queries to find whether an element exists, only a constant number of them (in the average case), leading to O(1) lookup.
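A quick sketch of the distinction (illustrative, not a benchmark):

data = list(range(1_000_000))
table = {x: None for x in data}

print(data[999_999])     # Get Item: one offset calculation, O(1)
print(999_999 in data)   # lookup in a list: scans elements one by one, O(n)
print(999_999 in table)  # lookup in a dict: hash once, O(1) on average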
Accessing a list l at index n (l[n]) is O(1) because a list is not implemented as a vanilla linked list, where one needs to hop between pointers (value, next) n times to reach cell n.
If the memory is contiguous and the entry size is fixed, reaching a specific entry is trivial: we know to jump n times the entry size from the start (like classic arrays in C).
However, since list entries vary in size, the Python implementation uses a contiguous block of memory just for the pointers to the values. This makes indexing a list (l[n]) an operation whose cost is independent of the size of the list or the value of the index.
For more information see http://effbot.org/pyfaq/how-are-lists-implemented.htm
That is because Python stores the address of each element of a list in a separate array. When we want the element at the nth position, all we have to do is look up the nth entry of that address array, which gives us the address of the nth value, so we can fetch it in O(1).
Python does some neat tricks to make these arrays expandable as the list grows. Thus we get the flexibility of lists and the speed of arrays. The trade-off is that the interpreter has to reallocate memory for the address array whenever the list grows beyond its current capacity.
amit has explained in his answer to this question why lookups in dictionaries are faster than in lists.
(The minimum possible I can see is O(log n), from a strict computer-science standpoint.)
I am a novice in Python. I was just wondering if one can find the length of a list or tuple in O(1) time. (I assumed len() is O(n).)
In C, I can achieve a similar thing as follows:
int a[] = {1, 2, 3, 4, 5};
printf("Length of Array a is :: %zu\n", sizeof(a) / sizeof(a[0]));
I know that the above concept works on addresses, and that is why it is possible in C, while as per my understanding, Python does not deal in addresses. However, I still wanted to ask this question for the sake of curiosity.
Calling len() on a list in Python is O(1): the length is stored in the list object itself, so nothing is counted at call time.
The reason this works in O(1) in C is not because it deals in "addresses" but rather because the size is known at compile time. In fact, if you were to use sizeof() on a dynamically allocated array, you would always get the size of a pointer.
In Python, though, you just use len().
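A quick sanity check that len() doesn't depend on the list's size (timings will vary by machine):

import timeit

small = list(range(10))
big = list(range(1_000_000))

# len() reads a count stored in the list header, so both should take about the same time.
print(timeit.timeit(lambda: len(small), number=100_000))
print(timeit.timeit(lambda: len(big), number=100_000))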
For example, how much memory is required to store a list of one million (32-bit) integers?
alist = range(1000000) # or list(range(1000000)) in Python 3.0
"It depends." Python allocates space for lists in such a way as to achieve amortized constant time for appending elements to the list.
In practice, what this means with the current implementation is that when a list must grow, CPython over-allocates: it reserves extra room roughly proportional to the current size (about an extra one-eighth), so the next several appends need no reallocation. A list built from a sequence of known length, like range(1000000), can be sized exactly, while a list grown by repeated appends carries some spare capacity.
This is only the space required to store the list structure itself (which is an array of pointers to the Python objects for each element). A 32-bit system will require 4 bytes per element, a 64-bit system will use 8 bytes per element.
Furthermore, you need space to store the actual elements. This varies widely. For small integers (-5 to 256 currently), no additional space is needed, but for larger numbers Python allocates a new object for each integer, which takes 10-100 bytes and tends to fragment memory.
Bottom line: it's complicated and Python lists are not a good way to store large homogeneous data structures. For that, use the array module or, if you need to do vectorized math, use NumPy.
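A rough illustration of the difference (sizes are from one 64-bit CPython build and don't count the shared small-int objects):

import sys
from array import array

lst = list(range(1_000_000))
arr = array('i', range(1_000_000))  # 'i' stores raw C ints, typically 4 bytes each

print(sys.getsizeof(lst))  # ~8 MB of pointers, plus the int objects they reference
print(sys.getsizeof(arr))  # ~4 MB of packed machine integers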
PS: Tuples, unlike lists, are not designed to have elements progressively appended to them. I don't know exactly how the allocator handles them, but don't even think about growing one element by element for large data structures :-)
Useful links:
How to get memory size/usage of python object
Memory sizes of python objects?
if you put data into dictionary, how do we calculate the data size?
However, they don't give a definitive answer. The way to go:
- Measure the memory consumed by the Python interpreter with/without the list (use OS tools).
- Use a third-party extension module which defines some sort of sizeof(PyObject).
Update:
Recipe 546530: Size of Python objects (revised)
import asizeof
N = 1000000
print asizeof.asizeof(range(N)) / N
# -> 20 (python 2.5, WinXP, 32-bit Linux)
# -> 33 (64-bit Linux)
Addressing "tuple" part of the question
Declaration of CPython's PyTuple in a typical build configuration boils down to this:
struct PyTuple {
    size_t refcount;     // tuple's reference count
    typeobject *type;    // tuple type object
    size_t n_items;      // number of items in tuple
    PyObject *items[1];  // contains space for n_items elements
};
The size of a PyTuple instance is fixed at construction and cannot be changed afterwards. The number of bytes occupied by a PyTuple can be calculated as:
sizeof(size_t) x 2 + sizeof(void*) x (n_items + 1)
This gives the shallow size of the tuple. To get the full size, you also need to add the total number of bytes consumed by the object graph rooted in the PyTuple::items[] array.
It's worth noting that the tuple construction routines make sure that only a single instance of the empty tuple is ever created (a singleton).
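Both points are easy to check with sys.getsizeof (byte counts below are from one 64-bit CPython 3 build and will vary):

import sys

print(sys.getsizeof(()))         # header only, e.g. 40 bytes
print(sys.getsizeof((1, 2, 3)))  # header + 3 pointer slots, e.g. 64 bytes

a = ()
b = tuple()
print(a is b)  # True in CPython: the empty tuple is a shared singleton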
References:
Python.h,
object.h,
tupleobject.h,
tupleobject.c
A new function, getsizeof(), takes a Python object and returns the amount of memory used by the object, measured in bytes. Built-in objects return correct results; third-party extensions may not, but can define a __sizeof__() method to return the object's size.
kveretennicov@nosignal:~/py/r26rc2$ ./python
Python 2.6rc2 (r26rc2:66712, Sep 2 2008, 13:11:55)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
>>> import sys
>>> sys.getsizeof(range(1000000))
4000032
>>> sys.getsizeof(tuple(range(1000000)))
4000024
Obviously, the returned numbers don't include the memory consumed by the contained objects (sys.getsizeof(1) == 12).
This is implementation-specific, I'm pretty sure. Certainly it depends on the internal representation of integers: you can't assume they'll be stored as 32 bits, since Python gives you arbitrarily large integers, so perhaps small ints are stored more compactly.
On my Python (2.5.1 on Fedora 9 on a Core 2 Duo), VmSize is 6896 kB before allocation and 22684 kB after. After one more million-element assignment, VmSize goes to 38340 kB. This very grossly indicates around 16000 kB for 1,000,000 integers, which is around 16 bytes per integer. That suggests a lot of overhead for the list. I'd take these numbers with a large pinch of salt.