How does Python implement the indexing of strings?

The Data Structures and Algorithms book by Goodrich says that a Python array stores a group of related variables one after another in a contiguous area of the computer's memory, so an indexed element can be accessed directly by computing its address. For example, if the memory address of the first element of the array is 2146 and each element occupies two bytes of memory, then the memory address of the sixth element is 2146 + 2*5 = 2156, so the computer can go straight to address 2156 to get the sixth element.
But I tried to verify this, only to find that the results didn't match the theory.
str1 = "example"
for i in range(1,6):
print(id(str1[i])-id(str1[i-1]))
The output is as follows
-336384
471680
-492352
313664
178944
Why does this happen? If the memory addresses are not contiguous, how does Python use the index to compute a memory address and then access the element?

The closest thing to a "Python array" I know of is a numpy.array, and indeed:
In [1]: import numpy as np
In [2]: a = np.array([12, 4, 120, 24, 3, 0, 13, 13], dtype='int8')
In [3]: asint64 = a.view('int64')[0]
In [4]: for i in range(8):
   ...:     print(asint64 % 2**(8*(i+1)) // 2**(8*(i)))
   ...:
12
4
120
24
3
0
13
13
What is happening here is that you first build an array of 8 numbers, each using 8 bits; when you later ask numpy to view them as a single 64-bit number, you find that it is composed of the 8-bit representations of those 8 numbers, juxtaposed. So the original 8 integers were indeed adjacent in memory.
In general, asking Python "tell me what is at this arbitrary memory position", or "tell me where exactly in memory this item of a string or array is", is... slightly less straightforward.
EDIT: ... it is slightly less straightforward, but at least it dispels any suspicion that a.view is doing strange things, so here we are looking at the exact positions in memory of sub-arrays of our array:
In [5]: for i in range(8):
   ...:     print(a[i:].__array_interface__['data'][0])
   ...:
45993728
45993729
45993730
45993731
45993732
45993733
45993734
45993735
(as long as you trust .__array_interface__['data'][0] not to do strange things!)

As far as I understand, Python (or at least CPython) implements lists as dynamic arrays.
It might be worth checking out this article for a quick intro to the concept of dynamic arrays if you are unfamiliar.
However, the way that these dynamic arrays are implemented in Python is different from a "standard implementation" of the data structure - see below quote from the Wikipedia page on dynamic arrays.
... in languages like Python or Java that enforce reference semantics, the dynamic array generally will not store the actual data, but rather it will store references to the data that resides in other areas of memory. In this case, accessing items in the array sequentially will actually involve accessing multiple non-contiguous areas of memory, so the many advantages of the cache-friendliness of this data structure are lost. - Wikipedia
I have been unable to find any supporting articles or documentation on this at first glance; however, I would think that strings are implemented in a similar way to lists within Python, with some major modifications but similar underlying logic in parts.
If strings are implemented in such a way, that would neatly explain why you are seeing sub-strings (Python does not have an explicit char class; a char is simply a string of length 1 as far as Python is concerned) of a string at non-contiguous locations in memory at runtime.
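One way to see what id() is actually reporting in the original experiment (a small demonstration; the caching of one-character strings is a CPython implementation detail, so the equality check below may vary):
s = "example"

a = s[1]
b = s[1]
# Indexing doesn't hand back a pointer into s's character buffer; it returns
# a length-1 str object. id() therefore reports where that new object lives,
# which says nothing about how the characters of s are laid out internally.
print(type(a), len(a))    # <class 'str'> 1
print(id(a) == id(b))     # often True in CPython (one-char strings are cached)
print(id(a) - id(s))      # an essentially arbitrary difference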

Related

What are 'recursive data types' in Python, as referenced by the shelve doc? [duplicate]

I was reading about Recursive Data Type which has the following quote:
In computer programming languages, a recursive data type (also known as a recursively-defined, inductively-defined or inductive data type) is a data type for values that may contain other values of the same type
I understand that Linked Lists and Trees can be recursive data types because they contain smaller versions of the same data structure; for example, a Tree can have a subtree.
However, it got me really confused, because doesn't a fixed-size array also contain sub-arrays, which are still the same type?
Can someone explain with examples that what makes a data structure recursive?
However, it got me really confused, because doesn't a fixed-size array also contain sub-arrays?
Conceptually, you could say that every array "contains" sub-arrays, but the arrays themselves aren't made up of smaller arrays in the code. An array is made up of a contiguous chunk of elements, not other arrays.
A recursive structure (like, as you mentioned, a linked list), literally contains versions of itself:
class Node {
    Node head = null; // <-- Nodes can literally hold other Nodes
}
Whereas, if you think of an array represented as a class with fixed fields for elements, it contains elements, not other arrays:
class Array<E> {
    E elem1 = ...; // <-- In the code, an array isn't made up of other arrays,
    E elem2 = ...; //     it's made up of elements.
    ...
}
(This is a bad, inaccurate representation of an array, but it's the best I can communicate in simple code).
A structure is recursive if while navigating through it, you come across "smaller" versions of the whole structure. While navigating through an array, you'll only come across the elements that the array holds, not smaller arrays.
Note, though, that this depends entirely on the implementation of the structure. In Clojure, for example, "vectors" behave essentially identically to arrays and can be thought of as arrays while using them, but internally they're actually a tree (essentially a multi-child linked list).
A data type describes how data is stored (at least logically; on deeper layers there are only numbers, bits, transistors, etc. anyway). While an array (but only a non-empty one) can be considered as something plus another (sub-)array (such an approach is used commonly for lists in languages like Lisp and Prolog), it is usually stored element-wise.
For example, a list in Prolog is defined as either an empty list (a special case) or an element (called head) concatenated with another list (called tail). This makes a recursive data type.
On the other hand, a list in C can be defined as struct list { char data[100]; int length; }. Here list is not used in how list is defined (only char[] and int are used), so it's not a recursive data type.
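To restate the distinction in Python terms, here is a small sketch (the Cons class and names are mine, purely illustrative):
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cons:
    """A recursive list: an element plus another (possibly empty) list."""
    head: int
    tail: Optional["Cons"]   # the tail is itself a Cons, or None for the empty list

nums = Cons(1, Cons(2, Cons(3, None)))   # 1 -> 2 -> 3

# A built-in list, by contrast, is stored element-wise: walking it yields
# only the elements, never a "smaller list".
flat = [1, 2, 3]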

How come list element lookup is O(1) in Python?

Today in class, we learned that retrieving an element from a list is O(1) in Python. Why is this the case? Suppose I have a list of four items, for example:
li = ["perry", 1, 23.5, "s"]
These items have different sizes in memory. And so it is not possible to take the memory location of li[0] and add three times the size of each element to get the memory location of li[3]. So how does the interpreter know where li[3] is without having to traverse the list in order to retrieve the element?
A list in Python is implemented as an array of pointers [1]. So, what's really happening when you create the list:
["perry", 1, 23.5, "s"]
is that you are actually creating an array of pointers like so:
[0xa3d25342, 0x635423fa, 0xff243546, 0x2545fade]
Each pointer "points" to the respective objects in memory, so that the string "perry" will be stored at address 0xa3d25342 and the number 1 will be stored at 0x635423fa, etc.
Since all pointers are the same size, the interpreter can in fact add 3 times the size of an element to the address of li[0] to get to the pointer stored at li[3].
[1] Get more details from the horse's mouth (the CPython source code on GitHub).
When you say a = [...], a is effectively a pointer to a PyObject containing an array of pointers to PyObjects.
When you ask for a[2], the interpreter first follows the pointer to the list's PyObject, then adds 2 to the address of the array inside it, then returns that pointer. The same happens if you ask for a[0] or a[9999].
Basically, all Python objects are accessed by reference instead of by value, even integer literals like 2. There are just some tricks in the pointer system to keep this all efficient. And pointers have a known size, so they can be stored conveniently in C-style arrays.
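A rough way to see those fixed-size pointer slots from Python (figures assume a 64-bit CPython build; exact values vary by version):
import sys

# The list object stores one fixed-size pointer per element, so its shallow
# size grows by the pointer size (8 bytes on a 64-bit build) per item,
# no matter how big the referenced objects are.
for li in (["perry"], ["perry", 1], ["perry", 1, 23.5], ["perry", 1, 23.5, "s"]):
    print(len(li), sys.getsizeof(li))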
Short answer: Python lists are arrays.
Long answer: The computer science term list usually means either a singly-linked list (as used in functional programming) or a doubly-linked list (as used in procedural programming). These data structures support O(1) insertion at either the head of the list (functionally) or at any position that does not need to be searched for (procedurally). A Python "list" has none of these characteristics. Instead it supports (amortized) O(1) appending at the end of the list (like a C++ std::vector or Java ArrayList). Python lists are really resizable arrays in CS terms.
The following comment from the Python documentation explains some of the performance characteristics of Python "lists":
It is also possible to use a list as a queue, where the first element added is the first element retrieved (“first-in, first-out”); however, lists are not efficient for this purpose. While appends and pops from the end of list are fast, doing inserts or pops from the beginning of a list is slow (because all of the other elements have to be shifted by one).
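For the FIFO case the documentation warns about, collections.deque from the standard library avoids the O(n) shift; a minimal sketch:
from collections import deque

queue = deque(["perry", 1, 23.5, "s"])
queue.append("new")        # O(1) append on the right
first = queue.popleft()    # O(1) pop from the left, no shifting of the rest
print(first, list(queue))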

Repeatedly appending to a large list (Python 2.6.6)

I have a project where I am reading in ASCII values from a microcontroller through a serial port (it looks like this: AA FF BA 11 43 CF, etc.).
The input is coming in quickly (38 two-character sets per second).
I'm taking this input and appending it to a running list of all measurements.
After about 5 hours, my list has grown to ~ 855000 entries.
I'm given to understand that the larger a list becomes, the slower list operations become. My intent is to have this test run for 24 hours, which should yield around 3M results.
Is there a more efficient, faster way to append to a list than list.append()?
Thanks Everyone.
I'm given to understand that the larger a list becomes, the slower list operations become.
That's not true in general. Lists in Python are, despite the name, not linked lists but arrays. There are operations that are O(n) on arrays (copying and searching, for instance), but you don't seem to use any of these. As a rule of thumb: If it's widely used and idiomatic, some smart people went and chose a smart way to do it. list.append is a widely-used builtin (and the underlying C function is also used in other places, e.g. list comprehensions). If there was a faster way, it would already be in use.
As you will see when you inspect the source code, lists overallocate: when they are resized, they allocate more than is needed for one item, so the next n items can be appended without needing another resize (which is O(n)). The growth isn't constant; it is proportional to the list size, so resizing becomes rarer as the list grows larger. Here's the snippet from listobject.c:list_resize that determines the overallocation:
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
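For illustration, here is a small Python transcription of that formula, assuming (as in the full list_resize source) that the extra head-room is added on top of newsize; it reproduces the documented growth pattern:
def next_allocation(newsize):
    # Extra head-room from the C snippet above, added on top of the requested size.
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)

allocated, growth = 0, []
for wanted in range(1, 100):
    if wanted > allocated:            # this append would trigger a resize
        allocated = next_allocation(wanted)
        growth.append(allocated)
print(growth)   # [4, 8, 16, 25, 35, 46, 58, 72, 88, 106]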
As Mark Ransom points out, older Python versions (<2.7, 3.0) have a bug that makes the GC sabotage this. If you have such a Python version, you may want to disable the gc. If you can't, because you generate too much garbage of the kind that escapes refcounting, you're out of luck, though.
One thing you might want to consider is writing your data to a file as it's collected. I don't know (or really care) if it will affect performance, but it will help ensure that you don't lose all your data if power blips. Once you've got all the data, you can suck it out of the file and jam it in a list or an array or a numpy matrix or whatever for processing.
Appending to a python list has a constant cost. It is not affected by the number of items in the list (in theory). In practice appending to a list will get slower once you run out of memory and the system starts swapping.
http://wiki.python.org/moin/TimeComplexity
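A quick empirical check that append cost stays flat as the list grows (a rough sketch; absolute times depend on the machine):
import timeit

# Time 10,000 appends onto lists of very different starting sizes; if append
# were O(n), the last line would be dramatically slower than the first.
for start_size in (0, 100000, 1000000):
    lst = list(range(start_size))
    t = timeit.timeit(lambda: lst.append(0), number=10000)
    print(start_size, round(t, 4))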
It would be helpful to understand why you actually append things to a list, and what you are planning to do with the items. If you don't need all of them, you could build a ring buffer; if you don't need to do computation, you could write the list to a file, etc.
First of all, 38 two-character sets per second, 1 stop bit, 8 data bits, and no parity, is only 760 baud, not fast at all.
But anyway, my suggestion, if you're worried about having overly large lists / don't want to use one huge list, is just to store a list on disk once it reaches a certain size and start a new list, repeating until you've gotten all the data, then combining all the lists into one once you're done receiving the data.
Though you may skip the sublists completely and just go with nmichaels' suggestion: write the data to a file as you get it, and use a small circular buffer to hold the received data that has not yet been written.
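A minimal sketch of that combined approach (read_pair below is a hypothetical stand-in for the poster's serial-port read, not code from the question; a real version would use pyserial):
import random
from collections import deque

def read_pair():
    # Hypothetical stand-in for reading one two-character hex value
    # from the serial port.
    return "%02X" % random.randint(0, 255)

recent = deque(maxlen=1024)              # small circular buffer of recent readings

with open("measurements.log", "a") as logfile:
    for _ in range(100):                 # short demo run instead of 24 hours
        value = read_pair()
        recent.append(value)             # keep only the latest readings in memory
        logfile.write(value + "\n")      # everything else goes straight to disk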
It might be faster to use numpy if you know how long the array is going to be and you can convert your hex codes to ints:
import numpy
a = numpy.zeros(3000000, numpy.int32)
for i in range(3000000):
    a[i] = int(scanHexFromSerial(), 16)
This will leave you with an array of integers (which you could convert back to hex with hex()), but depending on your application maybe that will work just as well for you.

Why do dicts of defaultdict(int)'s use so much memory? (and other simple python performance questions)

I do understand that querying a non-existent key in a defaultdict the way I do will add items to the defaultdict. That is why it is fair to compare my 2nd code snippet to my first one in terms of performance.
import numpy as num
from collections import defaultdict
topKeys = range(16384)
keys = range(8192)
table = dict((k,defaultdict(int)) for k in topKeys)
dat = num.zeros((16384,8192), dtype="int32")
print "looping begins"
#how much memory should this use? I think it shouldn't use more that a few
#times the memory required to hold (16384*8192) int32's (512 mb), but
#it uses 11 GB!
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
print "done"
What is going on here? Furthermore, this similar script takes eons to run compared to the first one, and also uses an absurd quantity of memory.
topKeys = range(16384)
keys = range(8192)
table = [(j,0) for k in topKeys for j in keys]
I guess python ints might be 64 bit ints, which would account for some of this, but do these relatively natural and simple constructions really produce such a massive overhead?
I guess these scripts show that they do, so my question is: what exactly is causing the high memory usage in the first script and the long runtime and high memory usage of the second script and is there any way to avoid these costs?
Edit:
Python 2.6.4 on 64 bit machine.
Edit 2: I can see why, to a first approximation, my table should take up 3 GB
16384*8192*(12+12) bytes
and 6GB with a defaultdict load factor that forces it to reserve double the space.
Then inefficiencies in memory allocation eat up another factor of 2.
So here are my remaining questions:
Is there a way for me to tell it to use 32 bit ints somehow?
And why does my second code snippet take FOREVER to run compared to the first one? The first one takes about a minute and I killed the second one after 80 minutes.
Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.
The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)
A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer to a value. That's 6x as many pointers as numbers, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, hand-wavy estimate, 16384 defaultdicts * (8192 items/defaultdict) * (7 pointers/item) * (8 bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique one of which is itself a Python object), the outer dict, that numpy array, or the bookkeeping Python does to try to optimize things.
Your overhead sounds a little higher than I would expect, and I would be interested to know whether that 11 GB was for the whole process or whether you calculated it for just table. In any event, I do expect this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.
As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.
Well, look at what your code is actually doing:
topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)
This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build). That's just the object itself; unless the dict holds only a couple of items, the data is a separate block of memory, between 12 and 24 bytes per entry, and it's always between 1/2 and 2/3rds filled. And defaultdicts are 4 to 8 bytes bigger because they have this extra thing to store. And ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them. So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. So that's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
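For a rough sanity check of those per-object figures, sys.getsizeof reports shallow sizes (exact numbers vary by Python version and build):
import sys
from collections import defaultdict

print(sys.getsizeof({}))                 # an empty dict object
print(sys.getsizeof(defaultdict(int)))   # typically a few bytes bigger than a plain dict
print(sys.getsizeof(0))                  # a small int object
print(sys.getsizeof(10**10))             # larger ints cost a little more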
Then you create a numpy array, which is actually pretty conservative with memory:
dat = num.zeros((16384,8192), dtype="int32")
This will have some overhead for the array itself: the usual Python object overhead plus the dimensions and type of the array and such, but it wouldn't be much more than 100 bytes, and only for the one array. It does store 16384*8192 int32's, though, which is your 512 MB.
And then you have this rather peculiar way of filling this numpy array:
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
The two loops themselves don't use much memory, and they re-use it each iteration. However, table[k][j] creates a new Python integer for each value you request and stores it in the defaultdict. The integer created is always 0, and it so happens that it always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2). That lands you close to 3 GB of actual data right there, and 6 GB in a 64-bit build.
On top of that, because you keep adding data, the defaultdicts have to keep growing, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc), the way it allocates smaller objects in blocks of its own, and the way process memory works on most operating systems, your process will allocate more memory and not be able to free it; it won't actually use all of the 11 GB, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11 GB.
Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.
The way the defaultdict (DD) is set up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 in your case), but the DD also now stores a key, with the default value of 0, that previously wasn't in the DD. I personally don't like this, but that's how it goes. However, it means that for every iteration of the inner loop, new memory is being allocated, which is why it is taking forever. If you change the lines
for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]
to
for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0
then default values aren't being assigned to keys in the DDs, and so the memory stays around 540 MB for me, which is mostly just the memory allocated for dat. DDs are decent for sparse matrices, though you probably should just use the sparse matrices in SciPy if that's what you want.
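If the data really is sparse, a sketch of the SciPy route mentioned above might look like this (dok_matrix is SciPy's dictionary-of-keys sparse format):
import numpy as np
from scipy.sparse import dok_matrix

# Only the entries that are actually assigned consume memory.
mat = dok_matrix((16384, 8192), dtype=np.int32)
mat[5, 7] = 42
print(mat.nnz)       # number of stored entries: 1
csr = mat.tocsr()    # convert to CSR for fast arithmetic once filling is done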

How many bytes per element are there in a Python list (tuple)?

For example, how much memory is required to store a list of one million (32-bit) integers?
alist = range(1000000) # or list(range(1000000)) in Python 3.0
"It depends." Python allocates space for lists in such a way as to achieve amortized constant time for appending elements to the list.
In practice, what this means with the current implementation is... the list always has space allocated for a power-of-two number of elements. So range(1000000) will actually allocate a list big enough to hold 2^20 elements (~1.05 million).
This is only the space required to store the list structure itself (which is an array of pointers to the Python objects for each element). A 32-bit system will require 4 bytes per element, a 64-bit system will use 8 bytes per element.
Furthermore, you need space to store the actual elements. This varies widely. For small integers (-5 to 256 currently), no additional space is needed, but for larger numbers Python allocates a new object for each integer, which takes 10-100 bytes and tends to fragment memory.
Bottom line: it's complicated and Python lists are not a good way to store large homogeneous data structures. For that, use the array module or, if you need to do vectorized math, use NumPy.
PS- Tuples, unlike lists, are not designed to have elements progressively appended to them. I don't know how the allocator works, but don't even think about using it for large data structures :-)
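As a quick illustration of the array-module suggestion above (figures assume a 64-bit CPython; exact numbers vary):
import array
import sys

# One contiguous block of raw 32-bit values, with no per-element Python objects.
compact = array.array("i", range(1000000))   # "i" is a signed C int, usually 4 bytes
plain = list(range(1000000))                 # an array of pointers to int objects

print(sys.getsizeof(compact))   # roughly 4 MB plus a small header
print(sys.getsizeof(plain))     # about 8 MB for the pointer array alone,
                                # not counting the int objects themselves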
Useful links:
How to get memory size/usage of python object
Memory sizes of python objects?
if you put data into dictionary, how do we calculate the data size?
However they don't give a definitive answer. The way to go:
Measure memory consumed by Python interpreter with/without the list (use OS tools).
Use a third-party extension module which defines some sort of sizeof(PyObject).
Update:
Recipe 546530: Size of Python objects (revised)
import asizeof
N = 1000000
print asizeof.asizeof(range(N)) / N
# -> 20 (python 2.5, WinXP, 32-bit Linux)
# -> 33 (64-bit Linux)
Addressing "tuple" part of the question
Declaration of CPython's PyTuple in a typical build configuration boils down to this:
struct PyTuple {
    size_t refcount;      // tuple's reference count
    typeobject *type;     // tuple type object
    size_t n_items;       // number of items in tuple
    PyObject *items[1];   // contains space for n_items elements
};
The size of a PyTuple instance is fixed during its construction and cannot be changed afterwards. The number of bytes occupied by a PyTuple can be calculated as
sizeof(size_t) x 2 + sizeof(void*) x (n_items + 1).
This gives the shallow size of the tuple. To get the full size, you also need to add the total number of bytes consumed by the object graph rooted in the PyTuple::items[] array.
It's worth noting that the tuple construction routines make sure that only a single instance of the empty tuple is ever created (a singleton).
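That formula can be checked from Python with sys.getsizeof (the exact header size depends on the build):
import sys

# Shallow tuple size: a fixed header plus one pointer slot per element.
for n in range(5):
    t = tuple(range(n))
    print(n, sys.getsizeof(t))

print(tuple() is tuple())   # True: the empty tuple is the shared singleton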
References:
Python.h,
object.h,
tupleobject.h,
tupleobject.c
A new function, getsizeof(), takes a Python object and returns the amount of memory used by the object, measured in bytes. Built-in objects return correct results; third-party extensions may not, but can define a __sizeof__() method to return the object’s size.
kveretennicov#nosignal:~/py/r26rc2$ ./python
Python 2.6rc2 (r26rc2:66712, Sep 2 2008, 13:11:55)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
>>> import sys
>>> sys.getsizeof(range(1000000))
4000032
>>> sys.getsizeof(tuple(range(1000000)))
4000024
Obviously, the returned numbers don't include memory consumed by the contained objects (sys.getsizeof(1) == 12).
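To also count the contained objects, one rough approach (it ignores the sharing of cached small ints) is to add their shallow sizes:
import sys

items = list(range(1000000))
shallow = sys.getsizeof(items)                          # header plus pointer array only
deep = shallow + sum(sys.getsizeof(x) for x in items)   # plus each int object
print(shallow, deep)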
This is implementation specific, I'm pretty sure. Certainly it depends on the internal representation of integers - you can't assume they'll be stored as 32-bit since Python gives you arbitrarily large integers so perhaps small ints are stored more compactly.
On my Python (2.5.1 on Fedora 9 on a Core 2 Duo), the VmSize before allocation is 6896 kB; after, it is 22684 kB. After one more million-element assignment, VmSize goes to 38340 kB. This very roughly indicates around 16000 kB for 1,000,000 integers, which is around 16 bytes per integer. That suggests a lot of overhead for the list. I'd take these numbers with a large pinch of salt.
