I was wondering whether someone might know the answer to the following.
I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.
When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.
When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?
Update
Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.
I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and Set types.
Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)
After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.
I'll let you know how it goes.
If you try to pickle an empty list, you get:
>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'
and similarly '(d.' for an empty dict. That's three bytes. The in-memory representation of a list, however, contains
a reference count
a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
a pointer to a vector of pointers to actual elements
and yet more bookkeeping info.
On my machine, which has 64-bit pointers, the sizeof a Python list header object is 40 bytes, so that's one order of magnitude. I assume an empty dict will have similar size.
Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations, malloc introduces overhead, there's alignment, member attributes that you may or may not even be aware of and various other factors that get you the second order of magnitude.
Summing up: pickle is a pretty good compression algorithm for Python objects :)
Do you construct your tree once and then use it without modifying it further? In that case you might want to consider using separate structures for the dynamic construction and the static usage.
Dicts and objects are very good for dynamic modification, but they are not very space efficient in a read-only scenario. I don't know exactly what you are using your suffix tree for, but you could let each node be represented by a 2-tuple of a sorted array.array('c') and an equally long tuple of subnodes (a tuple instead of a vector to avoid overallocation). You traverse the tree using the bisect-module for lookup in the array. The index of a character in the array will correspond to a subnode in the subnode-tuple. This way you avoid dicts, objects and vector.
You could do something similar during the construction process, perhaps using a subnode-vector instead of subnode-tuple. But this will of course make construction slower, since inserting new nodes in a sorted vector is O(N).
Related
How are lists implemented to be heterogeneous? I know that a list is a dynamic array of pointers that point to the memory location of the required element but how does this work when the elements the indexes point to are of different sizes.
Does the location the pointer points to contain information about what type this location holds and if so, how is that kind of information formatted, used, returned?
I understand how [1,2,3] should look in memory but not how [1,2,"abcdefg"] would look in memory.
Everything is an object in python: As you mention, lists are collections of 'pointers' to the memory location on the objects contained in the list.
As objects, they each know their type, attributes, properties, etc.
They therefore cannot exceed their own memory boundary, and a tally/accounting of the number of pointers pointed at them is invisibly kept by the python interpreter.
From a technical perspective (i.e. how to implement something like this), look up the Abstract Datatype (ADT) concept in C, e.g. here.
I am no expert in how Python lists are implemented but from what I understand, they are implemented as dynamic arrays rather than linked lists. My question is therefore, if python lists are implemented as arrays, why are they called 'lists' and not 'arrays'.
Is this just a semantic issue or is there some deeper technical reason behind this. Is the dynamic array implementation in Python close to a list implementation? Or is it because the dynamic array implementation makes its behaviour closer to a list's behaviour than an array? Or some other reason I do not understand?
To be clear, I am not asking specifically how or why Python lists are implemented as dynamic arrays, although that might be relevant to the answer.
They're named after the list abstract data type, not linked lists. This is similar to the naming of Java's List interface and C#'s List<T>.
To further elaborate on user2357112's answer, as pointed out in the wikipedia article:
In computer science, a list or sequence is an abstract data type that
represents a countable number of ordered values, where the same value
may occur more than once.
Further,
List data types are often implemented using array data structures or linked lists of some sort, but other data structures may be more appropriate for some applications.
In CPython, lists are implemented as dynamic arrays of pointers, and their behaviour is much closer to the List abstract data type than the Array abstract data type. From this perspective, the naming of 'List' is accurate.
At the end of the day when implementing a list what you want is constant(O(1)) access(a[i]), insert(a.append(i)) and delete(a.remove(i)) times. With a linked list some of this operations could be as slow as O(n), i.e. deleting the last element of linked lists if you don't have a pointer to the tail.
With dynamic arrays you get constant delete and access times but what about deleting? Here we get amortized constant time. What is that? If the array is full of N elements, the insert will take O(N) and you'll end up with an array of size 2N. This is a rare event, thus we say we have amortized O(1).
Hope it helps.
Sources:
https://docs.python.org/2/faq/design.html
I have done over the time many things that require me using the list's .append() function, and also numpy.append() function for numpy arrays. I noticed that both grow really slow when sizes of the arrays are big.
I need an array that is dynamically growing for sizes of about 1 million elements. I can implement this myself, just like std::vector is made in C++, by adding buffer length (reserve length) that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it should be implemented somewhere. So my question is: Does such a thing exist already in Python?
What I mean: Is there in Python an array type that is capable of dynamically growing with time complexity of O(C) most of the time?
The memory of numpy arrays is well described in its docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just contrast to numpy.
A numpy array has a fixed size data buffer. 'growing' it requires creating a new array, and copying data to it. np.concatenate does that in compiled code. np.append as well as all the stack functions use concatenate.
A list has, as I understand it, a contiguous data buffer that contains pointers to objects else where in memeory. Python maintains some freespace in that buffer, so additions with list.append are relatively fast and easy. But when the freespace fills up, it has to create a new buffer and copy pointers. I can see where that could get expensive with large lists.
So a list will have store a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast the array of floats stores the floats themselves as contiguous bytes in its buffer. (Object dtype arrays are more like lists).
The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies that it stores data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there's a trade off between growth ease and access speed.
Adding elements to the start of a list requires making a new list of pointers, with the new one(s) at the start. So adding, and removing elements from the start of a regular list, is much more expensive than doing that at the end.
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that a designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
Both use an underlying array. Instead, you can use collections.deque which is made for specifically adding and removing elements at both ends with O(1) complexity
I'm using python to set up a computationally intense simulation, then running it in a custom built C-extension and finally processing the results in python. During the simulation, I want to store a fixed-length number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass back the results to python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) continuous array, piece-wise continuous array or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is a computationally heavy task already. I expect that a combination of A. buying lots of memory and B. setting up your experiment wisely will allow the entire log to fit in the RAM. However, with my current dict-of-lists solution keeping every variable's log in a continuous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything that has an append method. This adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the Append method instead of PyList_Append or similar. Using arrays allows you to reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve on storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping in windows into it as needed.
To make your C module do the work, instead of actually returning a list, return a custom object that meets the sequence protocol, so when someone calls, say, foo.getitem(i) you convert _array[i] to a PyFloatObject on the fly.
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part; you can just move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).
GAE has various limitations, one of which is size of biggest allocatable block of memory amounting to 1Mb (now 10 times more, but that doesn't change the question). The limitation means that one cannot put more then some number of items in list() as CPython would try to allocate contiguous memory block for element pointers. Having huge list()s can be considered bad programming practice, but even if no huge structure is created in program itself, CPython maintains some behind the scenes.
It appears that CPython is maintaining single global list of objects or something. I.e. application that has many small objects tend to allocate bigger and bigger single blocks of memory.
First idea was gc, and disabling it changes application behavior a bit but still some structures are maintained.
A simplest short application that experience the issue is:
a = b = []
number_of_lists = 8000000
for i in xrange(number_of_lists):
b.append([])
b = b[0]
Can anyone enlighten me how to prevent CPython from allocating huge internal structures when having many objects in application?
On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20+16) * 8000000 = 20168000000 bytes, about 20 GB. And that's in the best case, if the system malloc only allocates exactly as much memory as requested.
I calculated the size of the list object as follows:
2 Pointers in the PyListObject structure itself (see listobject.h)
1 Pointer and one Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
one Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)
The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it - all your lists have only a single element. You could avoid the 12 bytes overallocation by using tuples instead of lists, but to further reduce the memory consumption, you will have to use a different data structure that uses fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.
What are you trying to accomplish with the
a = b = []
and
b = b[0]
statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.
So that you're aware of it, Python has its own allocator. You can disable it using --without-pyalloc during the configure step.
However, the largest arena is 256KB so that shouldn't be the problem. You can also compile Python with debugging enabled, using --with-pydebug. This would give you more information about memory use.
I suspect your hunch and am sure that oefe's diagnosis are correct. A list uses contiguous memory, so if your list gets too large for a system arena then you're out of luck. If you're really adventurous you can reimplement PyList to use multiple blocks, but that's going to be a lot of work since various bits of Python expect contiguous data.