I read that one could manually collect garbage using
gc.collect()
Now I'm wondering when it is useful to do so. I suppose this is to some extent general Python logic. Say I have a large loop, and in each iteration I use big matrices Z and rewrite them again and again. Is it useful to delete the matrices and collect garbage at the end, if I don't change the size of Z?
The general question: under which circumstances can one actually observe the impact of forced garbage collection, especially when doing lots of numerical computation with numpy?
As you can see in the answers and comments, the simplest way to release memory is to del the array and let the garbage collector do its job.
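For instance, a minimal sketch of the pattern from the question (the matrix size and the work done on Z are made up):
import gc
import numpy as np

for step in range(1000):
    Z = np.random.rand(2000, 2000)   # large temporary matrix, overwritten every iteration
    # ... heavy numerical work on Z ...
    del Z          # drop the only reference; CPython frees the array immediately
    gc.collect()   # usually redundant here, it only matters if reference cycles keep Z alive
In most runs the del (or simply rebinding Z on the next iteration) is enough; gc.collect() only pays off when reference cycles keep the arrays alive.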
Related
I have been using .pop() and .append() extensively for Leetcode-style programming problems, especially in cases where you have to accumulate palindromes, subsets, permutations, etc.
Would I get a substantial performance gain from migrating to using a fixed size list instead? My concern is that internally the python list reallocates to a smaller internal array when I execute a bunch of pops, and then has to "allocate up" again when I append.
I know that the amortized time complexity of append and pop is O(1), but I want to get better performance if I can.
Yes.
Python (at least the CPython implementation) uses magic under the hood to make lists as efficient as possible. According to this blog post (2011), calls to append and pop dynamically allocate and deallocate memory in chunks (over-allocating where necessary) for efficiency. The list will only deallocate memory if it shrinks below the chunk size. So in most cases, if you are doing a lot of appends and pops, no memory allocation or deallocation will be performed at all.
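A quick way to see the chunked growth for yourself (the exact sizes are an implementation detail and vary between CPython versions):
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(20):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:                              # the size only jumps when a new chunk is allocated
        print(f"len={len(lst):2d} size={size} bytes")
        last = size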
Basically, the idea with these high-level languages is that you should be able to use the data structure best suited to your use case, and the interpreter will make sure you don't have to worry about the background workings (e.g. avoid micro-optimisation and instead focus on the efficiency of the algorithm in general). If you're that worried about performance, I'd suggest using a language where you have more control over memory, like C/C++ or Rust.
Python guarantees amortized O(1) complexity for append and pop, as you noted, so it sounds like it will be perfectly suited to your case. If you wanted to use the list like a queue, with things like list.pop(0) or list.insert(0, obj) which are O(n), then you could look into a dedicated queue data structure instead, for example.
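If you do end up needing queue-like behaviour, collections.deque gives O(1) appends and pops at both ends, whereas list.pop(0) shifts every remaining element:
from collections import deque

q = deque()
q.append(1)          # enqueue at the back, O(1)
q.append(2)
first = q.popleft()  # dequeue from the front, O(1); list.pop(0) would be O(n)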
This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2D numpy array (say, of floats), with one axis to be fit along and the other to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being that I can then have all my cores working on fits at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine, go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that it falls within the Cython scope of a C environment for Python.
Thanks for any help/suggestions.
Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.
You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]
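If the goal is simply to keep all cores busy, one common workaround is to parallelize at the process level rather than trying to release the GIL around Python-level scipy calls; a rough sketch (the model, data shape and function names below are illustrative, not from the question):
import numpy as np
from multiprocessing import Pool
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

def fit_row(row):
    # one independent fit per row; each worker process has its own interpreter and GIL
    x = np.arange(row.size, dtype=float)
    popt, _ = curve_fit(model, x, row)
    return popt

if __name__ == "__main__":
    data = np.random.rand(1000, 50)       # rows are the axis to parallelize over
    with Pool() as pool:
        params = pool.map(fit_row, data)  # list of fitted (a, b) pairs, one per row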
This is a simple question, but since I didn't find any answers to it, I assume the answer is negative. However, just to make sure, I'm asking it:
Does it make the python code more efficient to set the variables to None after we're done with them in a function?
So as an example:
def foo(fname):
    temp_1, temp_2 = load_file_data(fname)
    # do some processing on temp_1, temp_2
    temp_1 = None
    temp_2 = None
    # continue with the rest of the function
Does the answer change if we do this at the end of the function (since I assume python itself would do it at that point)?
It depends on what you mean by "more efficient".
Setting the variables to None, assuming they're the only references to their values, will allow the garbage collector to collect them. And in CPython (which uses ref counting for its garbage collector), it will even do so right away.
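You can watch the immediate collection happen in CPython with a weakref (a small demonstration only, not something you'd write in real code):
import weakref

class Blob:
    pass

obj = Blob()
ref = weakref.ref(obj)
obj = None        # the last strong reference is gone; CPython destroys the object right away
print(ref())      # prints None: the object has already been collected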
But on the other hand, you're also adding more bytecodes to the function that have to be executed by the interpreter, and that make the code object harder to keep in cache, and so on.
And keep in mind that freeing up memory almost never means actually freeing memory to the OS. Most Python implementations have multiple levels of free lists, and it usually sits on top of something like malloc that does as well. So, if you were about to allocate enough additional memory to increase your peak memory size, having a lot of stuff on the free list may prevent that; if you've already hit your peak, releasing values is unlikely to make any difference. (And that's assuming peak memory usage is what matters to your app—just because it's by far the easiest thing to measure doesn't mean it's what's most relevant to every problem.)
In almost all real-life code, this is unlikely to make any difference either way. If it does, you'll need to test, and to understand how things like memory pressure and cache locality affect your application. You may be making your code better, you may be making it worse (at least assuming that some particular memory measurement isn't the only thing you care about optimizing); most likely you're having no effect except to make the code longer and therefore less readable. This is a perfect example of the maxim "premature optimization is the root of all evil".
Does the answer change if we do this at the end of the function (since I assume python itself would do it at that point)?
You're right that Python frees local variables when the function returns. So yes, in that case, you're still getting almost all of the negatives with almost none of the positives, which probably changes the answer.
But, all those caveats aside, there are cases where this could improve things.* So, if you've profiled your app and discovered that holding onto that memory too long is causing a real problem, by all means, fix it!
Still, note that del temp_1 will have the same effect you're looking for, and it's a lot more explicit in what you're doing and why. And in most cases, it would probably be better to refactor your code into smaller functions, so that temp_1 and friends go out of scope as soon as you're done with them naturally, without the need for any extra work.
* For example, imagine that the rest of the function is just an exact copy of the first half, with three new values. Having a perfect set of candidates at the top of the free lists is probably better than having to search the free lists more deeply—and definitely better than having to allocate more memory and possibly trigger swapping…
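For instance, a minimal sketch of the refactoring suggestion above (load_file_data comes from the question; the helper name and do_some_processing are placeholders):
def _load_and_process(fname):
    temp_1, temp_2 = load_file_data(fname)
    result = do_some_processing(temp_1, temp_2)   # placeholder for the work in the question
    return result                                 # temp_1 and temp_2 go out of scope here

def foo(fname):
    result = _load_and_process(fname)
    # continue with the rest of the function; the big temporaries are already gone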
I disagree that it would be faster, unless you are running into a situation where you are running out of memory.
In a normal application, as soon as the variables in your function go out of scope they are flagged as no longer used and freed (or whatever the specific Python interpreter does). Setting them to None actually means more work for Python, and it only allows the memory pointed to by the variable to be freed, not the variable itself.
Also, CPython relies primarily on reference counting (its cyclic garbage collector only exists to break reference cycles), so an object is freed as soon as its reference count falls to zero.
I'm using Python to set up a computationally intense simulation, then running it in a custom-built C extension, and finally processing the results in Python. During the simulation, I want to store a fixed number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass the results back to Python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) contiguous array, a piece-wise contiguous array, or a column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial, since the simulation is a computationally heavy task already. I expect that a combination of (a) buying lots of memory and (b) setting up the experiment wisely will allow the entire log to fit in RAM. However, with my current dict-of-lists solution, keeping every variable's log in a contiguous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything else that has an append method. It adds a little time overhead because I'm using PyObject_CallMethodObjArgs to call the append method instead of PyList_Append or similar. Using arrays lets you reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
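For reference, the Python-side usage would look roughly like this (run_simulation, params and the variable names are made up; the extension is assumed to call .append() on each sequence as described in the update):
from array import array

log = {
    "position": array("d"),   # compact C doubles instead of a list of PyFloatObjects
    "velocity": array("d"),
}
# run_simulation(params, log=log)   # the extension appends one value per variable per time step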
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
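For a sense of the difference, a quick sketch comparing the per-element footprint (the numbers printed are approximate and platform-dependent):
import sys
from array import array

values = [float(i) for i in range(1_000_000)]
packed = array("d", values)

print(sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values))  # list plus one PyFloatObject per entry
print(sys.getsizeof(packed))                                          # one contiguous block of raw doubles

# numpy can later wrap the raw buffer without converting each value into a Python object:
# import numpy as np; a = np.frombuffer(packed, dtype=np.float64)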
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve on storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping in windows into it as needed.
To make your C module do the work, instead of actually returning a list, return a custom object that implements the sequence protocol, so that when someone calls, say, foo.__getitem__(i) (i.e. foo[i]), you convert _array[i] to a PyFloatObject on the fly.
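The same idea expressed at the Python level (in a real extension this would be the C-level sq_item slot; the class name is made up, and the argument is assumed to be a bytes-like block of raw native doubles):
class DoubleLog:
    """Read-only, list-like view over a bytes-like block of raw C doubles."""
    def __init__(self, raw):
        self._view = memoryview(raw).cast("d")   # no PyFloatObjects created yet
    def __len__(self):
        return len(self._view)
    def __getitem__(self, i):
        return self._view[i]                     # a float object is built only when an item is read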
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
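A rough sketch of that streaming pattern using numpy (the file name is made up; np.memmap avoids loading everything eagerly, and plain np.fromfile would also work):
import numpy as np

# during the simulation: stream raw doubles to disk as they are produced
with open("log.bin", "ab") as f:
    chunk = np.random.rand(1024)      # stand-in for one batch of logged values
    chunk.tofile(f)

# afterwards: map the whole file back as one contiguous array of doubles
data = np.memmap("log.bin", dtype=np.float64, mode="r")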
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help with some of this. Not so much for the "exposing lots of floats to Python" part; rather, you can move pieces of the Python side into Cython, where they'll get compiled to C. If you're lucky, all of the code that needs to deal with the lots-of-floats data will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).
I am trying to adapt the underlying structure of plotting code (matplotlib) that is updated on a timer, moving from Python lists for the plot data to numpy arrays. I want to be able to lower the time step for the plot as much as possible, and since the data may get up into the thousands of points, I start to lose valuable time fast if I can't. I know that numpy arrays are preferred for this sort of thing, but I am having trouble figuring out when I need to think like a Python programmer and when I need to think like a C++ programmer to maximize the efficiency of my memory access.
It says in the scipy.org docs for the append() function that it returns a copy of the arrays appended together. Do all these copies get garbage-collected properly? For example:
import numpy as np
a = np.arange(10)
a = np.append(a,10)
print(a)
This is my reading of what is going on on the C++-level, but if I knew what I was talking about, I wouldn't be asking the question, so please correct me if I'm wrong! =P
First a block of 10 integers gets allocated, and the symbol a points to the beginning of that block. Then a new block of 11 integers is allocated, for a total of 21 ints (84 bytes) being used. Then the a pointer is moved to the start of the 11-int block. My guess is that this would result in the garbage-collection algorithm decrementing the reference count of the 10-int block to zero and de-allocating it. Is this right? If not, how do I ensure I don't create overhead when appending?
I also am not sure how to properly delete a numpy array when I am done using it. I have a reset button on my plots that just flushes out all the data and starts over. When I had lists, this was done using del data[:]. Is there an equivalent function for numpy arrays? Or should I just say data = np.array([]) and count on the garbage collector to do the work for me?
The point of automatic memory management is that you don't think about it. In the code that you wrote, the copies will be garbage-collected fine (it's nigh on impossible to confuse Python's memory management). However, because np.append is not in-place, the code will create a new array in memory (containing the concatenation of a and 10) and then the variable a will be updated to point to this new array. Since a now no longer points to the original array, which had a refcount of 1, its refcount is decremented to 0 and it will be cleaned up automatically. You can use gc.collect to force a full cleanup.
Python's strength does not lie in fine-tuning memory access, although it is possible to optimise. You are probably best off pre-allocating a (using e.g. a = np.zeros(<size>)); if you need finer tuning than that, it starts to get a bit hairy. You could have a look at the Cython + Numpy tutorial for a very neat and easy way to integrate C with Python for efficiency.
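If the final size really isn't known up front, one common middle ground is to pre-allocate and grow geometrically rather than calling np.append once per point (the class name and growth factor here are arbitrary):
import numpy as np

class GrowableArray:
    """Amortized O(1) appends by doubling a pre-allocated numpy buffer."""
    def __init__(self, capacity=1024):
        self._buf = np.zeros(capacity)
        self._n = 0

    def append(self, value):
        if self._n == len(self._buf):            # out of room: double the buffer once
            self._buf = np.concatenate([self._buf, np.zeros(len(self._buf))])
        self._buf[self._n] = value
        self._n += 1

    def data(self):
        return self._buf[:self._n]               # view of the valid prefix, no copy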
Variables in Python just point to the location where their contents are stored; you can del any variable and it will decrease the reference count of its target by one. The target will be cleaned automatically after its reference count hits zero. The moral of this is, don't worry about cleaning up your memory. It will happen automatically.