I am trying to adapt the underlying structure of plotting code (matplotlib) that is updated on a timer to go from using Python lists for the plot data to using numpy arrays. I want to be able to lower the time step for the plot as much as possible, and since the data may get up into the thousands of points, I start to lose valuable time fast if I can't. I know that numpy arrays are preferred for this sort of thing, but I am having trouble figuring out when I need to think like a Python programmer and when I need to think like a C++ programmer to maximize the efficiency of my memory access.
It says in the scipy.org docs for the append() function that it returns a copy of the arrays appended together. Do all these copies get garbage-collected properly? For example:
import numpy as np
a = np.arange(10)
a = np.append(a,10)
print(a)
This is my reading of what is going on on the C++-level, but if I knew what I was talking about, I wouldn't be asking the question, so please correct me if I'm wrong! =P
First a block of 10 integers gets allocated, and the symbol a points to the beginning of that block. Then a new block of 11 integers is allocated, for a total of 21 ints (84 bytes) being used. Then the a pointer is moved to the start of the 11-int block. My guess is that this would result in the garbage-collection algorithm decrementing the reference count of the 10-int block to zero and de-allocating it. Is this right? If not, how do I ensure I don't create overhead when appending?
I also am not sure how to properly delete a numpy array when I am done using it. I have a reset button on my plots that just flushes out all the data and starts over. When I had lists, this was done using del data[:]. Is there an equivalent function for numpy arrays? Or should I just say data = np.array([]) and count on the garbage collector to do the work for me?
The point of automatic memory management is that you don't think about it. In the code that you wrote, the copies will be garbage-collected fine (it's nigh on impossible to confuse Python's memory management). However, because np.append is not in-place, the code will create a new array in memory (containing the concatenation of a and 10) and then the variable a will be updated to point to this new array. Since a now no longer points to the original array, which had a refcount of 1, its refcount is decremented to 0 and it will be cleaned up automatically. You can use gc.collect to force a full cleanup.
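To see the reference counting described above in action, here is a minimal sketch. It assumes CPython (where an object is freed as soon as its reference count reaches zero); the name `old_array` is only there for illustration. A weak reference lets us watch the original array without keeping it alive.

```python
import weakref

import numpy as np

a = np.arange(10)
old_array = weakref.ref(a)   # weak reference: does not keep the array alive

a = np.append(a, 10)         # new 11-element array; the name `a` is rebound

# On CPython the original 10-element array is freed as soon as its last
# reference goes away, without any call to gc.collect().
original_freed = old_array() is None
print(a.size, original_freed)
```

Other Python implementations may delay the cleanup, but the array is still reclaimed without any action on your part.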
Python's strength does not lie in fine-tuning memory access, although it is possible to optimise. You are probably best sorted pre-allocating a (using e.g. a = np.zeros( <size> )); if you need finer tuning than that it starts to get a bit hairy. You could have a look at the Cython + Numpy tutorial for a very neat and easy way to integrate C with Python for efficiency.
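As a rough sketch of the pre-allocation idea: allocate the buffer once, keep a counter of how many entries are valid, and hand the plot a slice. The names `capacity`, `count` and `add_sample` are made up for this example, and the capacity of 5000 is just an assumed upper bound.

```python
import numpy as np

capacity = 5000                 # assumed upper bound on points per run
data = np.zeros(capacity)       # allocated once, up front
count = 0                       # number of valid samples in `data`

def add_sample(value):
    """Store one sample in place; no new array is created."""
    global count
    if count == capacity:
        raise OverflowError("buffer full; choose a larger capacity")
    data[count] = value
    count += 1

for t in range(100):
    add_sample(t * 0.1)

visible = data[:count]          # a view, not a copy; pass this to the plot
```

With this layout a reset button only has to set `count = 0`; the buffer itself is reused.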
Variables in Python just point to the location where their contents are stored; you can del any variable and it will decrease the reference count of its target by one. The target will be cleaned automatically after its reference count hits zero. The moral of this is, don't worry about cleaning up your memory. It will happen automatically.
I have put my full LMDB database into a single extremely large bytestring array associated with a single key (the only one in my LMDB database). I access the values that I need by an offset, which is an index into the array, as you can see in the code snippet. With such a structure my access time should be O(1). The problem is that querying my database is very slow, and I have absolutely no idea why it takes so long. Is it a good idea to store my huge array under a single key in the first place? Is there a particular mechanism in Python that makes accessing an element by its index in an array so slow? Is the data not contiguous? I am struggling to figure out what is wrong, please help!
import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    cursor.first()
    for i in range(18000000):  # I have around 180000 elements
        cursor.value()[4*i:4*i+4]  # this loop lasts an eternity
I think the problem is that cursor.value() is expensive. I don't know enough about the guts of LMDB or its Python bindings to know how much work it has to do, but it could be doing a partial B-tree traversal, invoking the OS to set up memory mappings, constructing complicated proxy objects, perhaps even copying the entire array out of LMDB into a Python buffer object. And you're calling it on every iteration of the loop, so it has to repeat that work every time. Destroying the object returned by cursor.value() may also be expensive and you're repeating that work every time too.
If I'm right, you should be able to get a substantial speedup by hoisting the invocation of value() out of the loop:
import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    if cursor.first():
        data = cursor.value()  # fetch the buffer once, outside the loop
        for i in range(18000000):
            data[4*i:4*i+4]
Python's interpreter is not very efficient and its bytecode compiler doesn't do very many optimizations, so you will probably see a small but measurable further speedup from using three-argument range to avoid having to multiply by 4 twice on every loop iteration:
import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    if cursor.first():
        data = cursor.value()
        for i in range(0, 18000000*4, 4):  # step by the record size
            data[i:i+4]
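If the goal is to decode the records rather than just slice them, the standard `struct` module can do that without a Python-level slice per record. This is only a sketch: the question does not say what the 4-byte records contain, so the little-endian unsigned 32-bit format (`"<I"`) is an assumption, and `raw` is a stand-in for the buffer returned by `cursor.value()`.

```python
import struct

# Stand-in for the buffer returned by cursor.value(): 1000 packed records.
# Assumption: each record is a little-endian unsigned 32-bit integer.
raw = b"".join(struct.pack("<I", n) for n in range(1000))
data = memoryview(raw)

# unpack_from reads at a byte offset without building an intermediate slice.
record_7 = struct.unpack_from("<I", data, 4 * 7)[0]

# Decode every record in a single pass instead of indexing in a Python loop.
all_records = [value for (value,) in struct.iter_unpack("<I", data)]
```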
Over time I have done many things that required the list's .append() method, and also the numpy.append() function for numpy arrays. I noticed that both become really slow when the arrays are big.
I need an array that is dynamically growing for sizes of about 1 million elements. I can implement this myself, just like std::vector is made in C++, by adding buffer length (reserve length) that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it should be implemented somewhere. So my question is: Does such a thing exist already in Python?
What I mean: Is there an array type in Python that can grow dynamically with O(1) (constant) time complexity for most appends?
The memory layout of numpy arrays is well described in its docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just in contrast to numpy.
A numpy array has a fixed size data buffer. 'growing' it requires creating a new array, and copying data to it. np.concatenate does that in compiled code. np.append as well as all the stack functions use concatenate.
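Because the buffer is fixed in size, amortized growth has to be layered on top of it. The following is a minimal sketch of the capacity-doubling idea behind `std::vector`; `GrowableArray` is a hypothetical helper written for this illustration, not something numpy provides, and it assumes an initial capacity of at least 1.

```python
import numpy as np

class GrowableArray:
    """Append-only 1-D float buffer that doubles its capacity when full."""

    def __init__(self, capacity=16):
        self._buf = np.empty(capacity, dtype=float)
        self._size = 0

    def append(self, value):
        if self._size == self._buf.size:
            # Full: allocate twice the space and copy the old contents once.
            bigger = np.empty(2 * self._buf.size, dtype=float)
            bigger[:self._size] = self._buf
            self._buf = bigger
        self._buf[self._size] = value
        self._size += 1

    def view(self):
        """Return the filled part of the buffer without copying."""
        return self._buf[:self._size]

g = GrowableArray()
for i in range(1000):
    g.append(i)
```

Each element is copied only a bounded number of times on average, so appends are amortized O(1), at the cost of up to twice the memory.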
A list has, as I understand it, a contiguous data buffer that contains pointers to objects elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see where that could get expensive with large lists.
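One way to observe that free space, at least on CPython, is to watch `sys.getsizeof` while appending. This is a sketch; the exact growth pattern is an implementation detail and may differ between versions.

```python
import sys

items = []
sizes = []
for i in range(1000):
    items.append(i)
    sizes.append(sys.getsizeof(items))   # size of the list object itself

# The reported size changes only when the pointer buffer is reallocated.
reallocations = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
print(reallocations, "buffer resizes for", len(items), "appends")
```

The number of resizes should come out far smaller than the number of appends, which is why appending is cheap on average.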
So a list has to store a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast, an array of floats stores the floats themselves as contiguous bytes in its buffer. (Object dtype arrays are more like lists.)
The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
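The pattern looks roughly like this (a minimal sketch; the sample values are made up):

```python
import numpy as np

samples = []
for step in range(1000):
    samples.append(step * 0.5)      # cheap: amortized list growth

arr = np.array(samples)             # single allocation and copy at the end
```

Calling `np.append` inside the loop instead would allocate and copy a whole new array on every iteration.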
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies that it stores data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there's a trade off between growth ease and access speed.
Adding elements to the start of a list requires making a new list of pointers, with the new one(s) at the start. So adding, and removing elements from the start of a regular list, is much more expensive than doing that at the end.
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
Both use an underlying array. Instead, you can use collections.deque, which is made specifically for adding and removing elements at both ends with O(1) complexity.
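A short sketch of what that looks like in practice. The `maxlen` argument is optional; it is used here only to show a bounded window of recent values.

```python
from collections import deque

window = deque(maxlen=5)     # keeps only the 5 most recent items
for i in range(8):
    window.append(i)         # O(1); once full, the oldest item drops off the left

window.appendleft(-1)        # O(1) at the front; a full deque drops one from the right
first = window.popleft()     # O(1) removal from the front
```

Keep in mind that indexing into the middle of a deque is slower than for a list, so it suits queue-like access better than random access.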
If I am using CPython or Jython (Python 2.7) and keep adding new elements to a list ([]), will there be memory reallocation issues like those of Java's ArrayList? (A Java ArrayList requires contiguous memory, so when the pre-allocated space is full it has to allocate a new, larger contiguous block and move the existing elements into it.)
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
regards,
Lin
The basic story, at least for the main Python, is that a list contains pointers to objects elsewhere in memory. The list is created with a certain free space (eg. for 8 pointers). When that fills up, it allocates more memory, and so on. Whether it moves the pointers from one block of memory to another, is a detail that most users ignore. In practice we just append/extend a list as needed and don't worry about memory use.
Why does creating a list from a list make it larger?
I assume jython uses the same approach, but you'd have to dig into its code to see how that translates to Java.
I mostly answer numpy questions. This is a numerical package that creates fixed sized multidimensional arrays. If a user needs to build such an array incrementally, we often recommend that they start with a list and append values. At the end they create the array. Appending to a list is much cheaper than rebuilding an array multiple times.
Internally, Python lists are arrays of pointers, as mentioned by hpaulj.
The next question, then, is how you can extend an array in C. As explained in the answer, this can be done using the realloc function.
This led me to look into the behavior of realloc, whose documentation says:
The function may move the memory block to a new location (whose address is returned by the function).
From this, my understanding is that the array is extended in place if contiguous memory is available; otherwise the memory block (containing the array of pointers, not the list object itself) is copied to a newly allocated, larger block.
This is my understanding, corrections are welcome if I am wrong.
I read that one could manually collect garbage using
gc.collect()
Now I'm wondering when it is useful to do so. I suppose it is to some extent general Python logic. Say I have a large loop and in each loop will use big matrices Z and rewrite them again and again. Is it useful to remove the matrices and collect garbage in the end, if I don't change the size of Z?
The general question: under which circumstances can one actually observe the impact of forced garbage collection, especially when doing lots of numerical computation with numpy?
As you can see in the answers in the comment, the simplest way to release memory is del array and let the garbage collector do its job.
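For what it's worth, the case where `gc.collect()` makes an observable difference is reference cycles. The sketch below assumes CPython and uses a made-up `Node` class; automatic collection is switched off temporarily so the effect is deterministic.

```python
import gc
import weakref

class Node:
    """Tiny object that can point at another Node."""
    def __init__(self):
        self.other = None

gc.disable()                 # keep the automatic collector out of the demo

# No cycle: `del` drops the last reference and the object is freed at once.
plain = Node()
plain_ref = weakref.ref(plain)
del plain
plain_freed_by_del = plain_ref() is None

# Cycle: the two objects keep each other alive after `del`.
a, b = Node(), Node()
a.other, b.other = b, a
cycle_ref = weakref.ref(a)
del a, b
cycle_alive_after_del = cycle_ref() is not None

gc.collect()                 # the cycle detector reclaims the pair
cycle_freed_after_collect = cycle_ref() is None

gc.enable()
```

A plain numpy array that you overwrite in a loop contains no cycles, so `del` (or simply rebinding the name) is enough there.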
I'm using Python to set up a computationally intense simulation, then running it in a custom-built C extension and finally processing the results in Python. During the simulation, I want to store a fixed number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know in advance how many time steps there will be. Once the simulation is done, I need to pass the results back to Python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) contiguous array, a piecewise contiguous array, or a column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is already a computationally heavy task. I expect that a combination of A. buying lots of memory and B. setting up your experiment wisely will allow the entire log to fit in RAM. However, with my current dict-of-lists solution, keeping every variable's log in a contiguous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything that has an append method. This adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the Append method instead of PyList_Append or similar. Using arrays allows you to reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
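A minimal sketch of what that looks like for doubles (the values are made up):

```python
from array import array

log = array("d")                 # typecode "d": C double
for step in range(1000):
    log.append(step * 0.25)      # stored as a raw double, not a float object

payload_bytes = log.itemsize * len(log)
```

Each entry costs `itemsize` bytes of payload (8 for a double on typical platforms) instead of a pointer plus a separate float object.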
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve around storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping in windows into it as needed.
To make your C module do the work, instead of actually returning a list, return a custom object that meets the sequence protocol, so when someone calls, say, foo.__getitem__(i) you convert _array[i] to a PyFloatObject on the fly.
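To illustrate the idea in pure Python rather than C, here is a sketch of such a sequence object. `DoubleView` is a hypothetical class made up for this example; it wraps any bytes-like buffer of packed native doubles and only builds a Python float when an item is requested. In a real extension the same logic would live in the C type's sequence slots.

```python
import struct
from collections.abc import Sequence

class DoubleView(Sequence):
    """Read-only sequence over a bytes-like buffer of packed C doubles."""

    _ITEM = struct.Struct("d")       # native-endian C double

    def __init__(self, buffer):
        self._buffer = buffer

    def __len__(self):
        return len(self._buffer) // self._ITEM.size

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # The Python float is created here, on demand.
        return self._ITEM.unpack_from(self._buffer, index * self._ITEM.size)[0]

raw = struct.pack("4d", 0.5, 1.5, 2.5, 3.5)   # stand-in for the C-side array
view = DoubleView(raw)
```

Because it derives from `Sequence`, iteration, `in`, `index` and `count` come for free on top of `__len__` and `__getitem__`.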
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
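A rough sketch of that stream-then-map workflow using only the standard library. The file name `log.bin` and the sample values are placeholders, and the data is written as native doubles.

```python
import mmap
import os
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.bin")      # placeholder file name

    # Phase 1: stream doubles to disk as they are produced.
    with open(path, "wb") as f:
        for step in range(1000):
            f.write(struct.pack("d", step * 0.5))

    # Phase 2: map the file back and index it as an array of doubles.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The views are released before the mapping is closed.
            with memoryview(mapped) as raw, raw.cast("d") as doubles:
                count = len(doubles)
                last = doubles[-1]
```

The operating system pages the data in as it is touched, so the mapped file does not have to fit in RAM all at once.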
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part; you can just move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).