I am learning how to program in python and am also learning theory as part of a computer science course. In programming i know that i can add additional variables to an array just by using the .append function, however in my theory classes we are told that arrays can neither be increase nor decreased in size.
How does this work in python?
Python uses resizable vectors under the hood. They maintain knowledge of how many elements are in the list as well as what the current total capacity is. When you try to add another element beyond the size of the collection, it allocates a new array with more capacity and populates it with the pointers to items from the original backing array. This is similar to java's ArrayList type, except that there's no way to specify the capacity in python
A detailed post on the implementation is here: http://www.laurentluce.com/posts/python-list-implementation/
They are not linked lists; there's no linked list type built into python, and the performance patterns are different.
Its not python but at one point in your future you will see this in other languages as well. Another common way this is solved that doesn't involve using a vector or a linked list is with dynamic arrays.
Essentially you create an array with a finite size. If the user calls append and you have no more room in your array. You create a new array that is 2x larger than the old array. Then copy all the elements over and append the new element.
The 2x is actually important because it keeps the insert time amortized constant. (That is more advanced algorithms though)
A list in python is akin to a linked list. They can grow dynamically and each element can point to anything.
If you're curious about what id dynamic and what isn't in Python then you should read about mutability vs immutability:
https://codehabitude.com/2013/12/24/python-objects-mutable-vs-immutable/
In the theory class, you learned about static arrays. We see these types of arrays in C usually. But in python, we have dynamic arrays which are extensible. Search for Linked List in google and you will gain further knowledge
Related
In python a Stack implementation can be created using list. However, what are the benefits of creating stacks using Linked List
It doesn't matter that much in Python, since it can expand the list capacity when needed in constant time. A linked list implementation allows for the stack to dynamically change size without redefining the structure. So in this case, the benefits are pretty minimal, but it would be more important in other languages where an array would have to be reallocated to change size.
I am no expert in how Python lists are implemented but from what I understand, they are implemented as dynamic arrays rather than linked lists. My question is therefore, if python lists are implemented as arrays, why are they called 'lists' and not 'arrays'.
Is this just a semantic issue or is there some deeper technical reason behind this. Is the dynamic array implementation in Python close to a list implementation? Or is it because the dynamic array implementation makes its behaviour closer to a list's behaviour than an array? Or some other reason I do not understand?
To be clear, I am not asking specifically how or why Python lists are implemented as dynamic arrays, although that might be relevant to the answer.
They're named after the list abstract data type, not linked lists. This is similar to the naming of Java's List interface and C#'s List<T>.
To further elaborate on user2357112's answer, as pointed out in the wikipedia article:
In computer science, a list or sequence is an abstract data type that
represents a countable number of ordered values, where the same value
may occur more than once.
Further,
List data types are often implemented using array data structures or linked lists of some sort, but other data structures may be more appropriate for some applications.
In CPython, lists are implemented as dynamic arrays of pointers, and their behaviour is much closer to the List abstract data type than the Array abstract data type. From this perspective, the naming of 'List' is accurate.
At the end of the day when implementing a list what you want is constant(O(1)) access(a[i]), insert(a.append(i)) and delete(a.remove(i)) times. With a linked list some of this operations could be as slow as O(n), i.e. deleting the last element of linked lists if you don't have a pointer to the tail.
With dynamic arrays you get constant delete and access times but what about deleting? Here we get amortized constant time. What is that? If the array is full of N elements, the insert will take O(N) and you'll end up with an array of size 2N. This is a rare event, thus we say we have amortized O(1).
Hope it helps.
Sources:
https://docs.python.org/2/faq/design.html
Actually, I am doing a sequence operation about numpy array, therefore, I want to know how to access a[i] quickly?
(Because I accessa[i-1] in the last loop, therefore, in c++, we may simply access a[i] by adding 1 to the address of a[i-1],but I don't know whether it is possible in numpy. Thanks.
I don't think this is possible/a[i] is the fastest way.
Python is a programming language that is easier to learn (and use) than c++, this of course comes at a cost, one of these costs is, that it's slower.
The references you're talking about can be "dangerous", hence python makes them not (easily) available to people, to protect them from things the do not understand.
While references are faster, you can't use them in python (as it is anyway slower, the difference in using references or not doesn't matter that much)
It's best not to think of a Python NumPy: ndarray as a C++ array. They're much different. Python also offers its own native list objects and includes an array module in its standard libraries.
A Python list behaves mostly like a generic array (as found in many programming languages). It's an ordered sequence; elements can be accessed by integer index from 0 up through (but not including) the length of the list (len(mylist)); ranges of elements can be access using "slice" notation (mylist[start_offset:end_offset]) returning another list object; negative indexes are treated as offsets from the end of the list (mylist[-1] is the last item of the list) and so on.
Additionally they support a number of methods such as .count(), .find() and .replace().
Unlike the arrays in most programming languages, Python lists are heterogenous. The elements can be any mixture of any object types in Python, including references to nested lists, dictionaries, code, classes, generator objects and other callable first class objects and, of course, instances of custom objects.
The Python array module allows one to instantiate homogenous list-like objects. That is you instantiate them using any of a dozen primitive data types (character or Unicode, signed or unsigned short or long integers, floating point or double precision floating point). Indexing and slicing are identical to Python native lists.
The primary advantage of Python array.array() instances is that they can store large numbers of their elements far more compactly than the more generalized list objects. Various operations on these arrays are likely to be somewhat faster than similar operations performed by iterating over or otherwise referencing elements in a native Python list because there's greater locality of reference in the more compact array layout (in memory) and because the type constraint obviates some dispatch overhead that's incurred when handling generalized objects.
NumPy, on the other hand, is far more sophisticated than the Python array module.
For one thing the ndarray can be multi-dimensional and can be dynamically reshaped. It's common to start with a linear ndarray and to reshape it into a matrix or other higher dimensional structure. Also the ndarray supports a much richer set of data types than the Python array module. NumPy also implements some rather advanced fancy indexingfeatures.
But the real performance advantages of NumPy relate to how it "vectorizes" most operations, broadcasts them across the data structures (possibly using any SIMD features supported by your CPU or even your GPU in the process. At the very list many common matrix operations, when properly written in Python for NumPy, are execute as native machine code speed. This performance edge does well beyond the minor effects of locality of references and obviating dispatch tables that one gains using the simple array module.
I have done over the time many things that require me using the list's .append() function, and also numpy.append() function for numpy arrays. I noticed that both grow really slow when sizes of the arrays are big.
I need an array that is dynamically growing for sizes of about 1 million elements. I can implement this myself, just like std::vector is made in C++, by adding buffer length (reserve length) that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it should be implemented somewhere. So my question is: Does such a thing exist already in Python?
What I mean: Is there in Python an array type that is capable of dynamically growing with time complexity of O(C) most of the time?
The memory of numpy arrays is well described in its docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just contrast to numpy.
A numpy array has a fixed size data buffer. 'growing' it requires creating a new array, and copying data to it. np.concatenate does that in compiled code. np.append as well as all the stack functions use concatenate.
A list has, as I understand it, a contiguous data buffer that contains pointers to objects else where in memeory. Python maintains some freespace in that buffer, so additions with list.append are relatively fast and easy. But when the freespace fills up, it has to create a new buffer and copy pointers. I can see where that could get expensive with large lists.
So a list will have store a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast the array of floats stores the floats themselves as contiguous bytes in its buffer. (Object dtype arrays are more like lists).
The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies that it stores data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there's a trade off between growth ease and access speed.
Adding elements to the start of a list requires making a new list of pointers, with the new one(s) at the start. So adding, and removing elements from the start of a regular list, is much more expensive than doing that at the end.
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that a designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
Both use an underlying array. Instead, you can use collections.deque which is made for specifically adding and removing elements at both ends with O(1) complexity
Just learning Python. Reading through the official tutorials. I ran across this:
While appends and pops from the end of list are fast, doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one).
I would have guessed that a mature language like Python would have all sorts of optimizations, so why doesn't Python [seem to] use linked lists so that inserts can be fast?
Python uses a linear list layout in memory so that indexing is fast (O(1)).
As Greg Hewgill has already pointed out, python lists use contiguous blocks of memory to make indexing fast. You can use a deque if you want the performance characteristics of a linked list. But your initial premise seems flawed to me. Indexed insertion into the middle of a (standard) linked list is also slow.
What Python calls "lists" aren't actually linked lists; they're more like arrays. See the list entry from the Python glossary and also How do you make an array in Python? from the Python FAQ.
list is implemented as an arraylist. If you want to insert frequently, you can use a deque (but note that traversal to the middle is expensive).
Alternatively, you can use a heap. It's all there if you take the time to look at the docs.
Python lists are implemented using a resizeable array of references to other objects. This provides O(1) lookup compared to O(n) lookup for a linked list implementation.
See How are lists implemented?
As you mentioned, this implementation makes insertions into the beginning or middle of a Python list slow because every element in the array to the right of the insertion point has to be shifted over one element. Also, sometimes the array will have to be resized to accommodate more elements. For inserting into a linked list, you'll still need O(n) time to find the location where you will insert, but the actual insertion itself will be O(1), since you only need to change the references in the nodes immediately before and after your insertion point (assuming a doubly-linked list).
So the decision to make Python lists use dynamic arrays rather than linked lists has nothing to do with the "maturity" of the language implementation. There are simply trade-offs between different data structures and the designers of Python decided that dynamic arrays were the best option overall. They may have assumed indexing a list is more common than inserting data into it, thus making dynamic arrays a better choice in this case.
See the following table in the Dynamic Array wikipedia article for a comparison of various data structure performance characteristics:
https://en.wikipedia.org/wiki/Dynamic_array#Performance