Python list implementation and pympler measurement

I need to parse a file (~500 MB) and load part of it into a list; I don't need the entire file.
I had a feeling that Python allocates much more memory for the list than the size of the data it contains.
I tried to use pympler's asizeof to estimate the overhead, but it fails with a MemoryError, which is strange to me: I thought that if the list is already in memory, asizeof should just walk over it, sum the sizes of all entries, and that's it.
Then I took a chunk of the initial file, and I was shocked by the size asizeof reported for the list: it was three times bigger than the file size.
So my questions are: is the size given by asizeof correct, what is a more memory-efficient way to use a list in Python, and how can I check the size of a bigger list when asizeof fails with a MemoryError?

It would be helpful to see the code you use for reading/parsing the file and also how you invoke pympler.asizeof.
asizeof and all other facilities in Pympler work inside the profiled process (using Python's introspection facilities to navigate reference graphs). That means the profiling overhead might become a problem when sizing reference graphs with a large number of nodes (objects), especially if you are already tight on memory before you start profiling. Be sure to set all=False and code=False when calling asizeof. In any case, please file a bug on GitHub; maybe running out of memory can be avoided in this scenario.
To the best of my knowledge, the sizes reported by asizeof are accurate as long as sys.getsizeof returns the correct size for the individual objects (assuming Python >= 2.6). You could set align=1 when calling asizeof and see if the numbers are more in line with what you expect.
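For reference, a minimal sketch of that invocation; the variable data and the parsing step are placeholders for whatever you actually load:
from pympler import asizeof

data = [line.split(",") for line in open("chunk.txt")]  # placeholder parsing
# Skip sizing code objects, don't size all gc objects, report unaligned sizes.
size_bytes = asizeof.asizeof(data, all=False, code=False, align=1)
print("asizeof reports %d bytes" % size_bytes)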
You could also check the virtual size of your process via your platform's tools or pympler.process:
from pympler.process import ProcessMemoryInfo
pmi = ProcessMemoryInfo()
print ("Process virtual size [Byte]: " + str(pmi.vsz))
This metric should always be higher than what asizeof reports when sizing objects.

Related

At what point am I using too much memory on a Mac?

I've tried really hard to figure out why my Python process is using 8 GB of memory. I've even used gc.get_objects() and measured the size of each object, and only one of them was larger than 10 MB. Still, all of the objects (there were about 100,000 of them) added up to 5.5 GB. On the other hand, my computer is working fine, and the program is running at a reasonable speed. So is the fact that I'm using so much memory cause for concern?
As @bnaecker said, this doesn't have a simple (i.e., yes/no) answer. It's only a problem if the combined RSS (resident set size) of all running processes exceeds the available memory, thus causing excessive demand paging.
You didn't say how you calculated the size of each object. Hopefully it was by using sys.getsizeof(), which should accurately include the overhead associated with each object. If you used some other method (such as calling the __sizeof__() method directly), then your answer will be far lower than the correct value. However, even sys.getsizeof() won't account for wasted space due to memory alignment. For example, consider this experiment (using Python 3.6 on macOS):
In [25]: x='x'*8193
In [26]: sys.getsizeof(x)
Out[26]: 8242
In [28]: 8242/4
Out[28]: 2060.5
Notice that last value. It implies that the object is using 2060 and a half words of memory, which can't be right, since every allocation consumes a whole number of words. In fact, it looks to me like sys.getsizeof() does not correctly account for word alignment and padding of either the underlying object or the data structure that describes the object, which means the value is smaller than the amount of memory actually used by the object. Multiplied over 100,000 objects, that could represent a substantial amount of memory.
Also, many memory allocators will round up large allocations to a page size (typically a multiple of 4 KiB), which results in "wasted" space that is probably not included in the sys.getsizeof() return value.
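To get a feel for how much that rounding can matter, one can round the reported size up to an assumed granularity; the 8-byte word and 4 KiB page below are assumptions for a typical 64-bit build, not values Python reports:
import sys

WORD = 8      # assumed word size on a 64-bit build
PAGE = 4096   # assumed allocator page size

def round_up(n, granularity):
    # Round n up to the next multiple of granularity.
    return -(-n // granularity) * granularity

x = 'x' * 8193
reported = sys.getsizeof(x)
print(reported, round_up(reported, WORD), round_up(reported, PAGE))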

What is python's strategy to manage allocation/freeing of large variables?

As a follow-up to this question, it appears that there are different allocation/deallocation strategies for small and large objects in (C)Python.
More precisely, there seems to be a boundary in the object size above which the memory used by the allocated object can be given back to the OS. Below this size, the memory is not given back to the OS.
To quote the answer taken from the Numpy policy for releasing memory:
The exception is that for large single allocations (e.g. if you create a multi-megabyte array), a different mechanism is used. Such large memory allocations can be released back to the OS. So it might specifically be the non-numpy parts of your program that are producing the issues you see.
Indeed, these two allocation strategies are easy to demonstrate. For example:
1st strategy: no memory is given back to the OS
import numpy as np
import psutil
import gc
proc = psutil.Process()
print(proc.memory_info().rss / 10**3)  # RSS before allocation, in KB
# Allocate a small array
x = np.random.uniform(0, 1, size=(10**4))
# Delete it and force a collection
del x
gc.collect()
print(proc.memory_info().rss / 10**3)  # RSS after the collection, in KB
# We go from 41295.872 KB to 41295.872 KB (same behavior for VMS)
=> No memory is given back to the OS
2nd strategy: freed memory is given back to the OS
When doing the same experiment, but with a bigger array:
print(proc.memory_info().rss / 10**3)  # RSS before allocation, in KB
x = np.random.uniform(0, 1, size=(10**5))
del x
gc.collect()
print(proc.memory_info().rss / 10**3)  # RSS after the collection, in KB
# We go from 41582.592 KB to 41017.344 KB
=> Memory is released to the OS
It seems that objects approximately bigger than 8*10**4 bytes get allocated using the 2nd strategy.
So:
Is this behavior documented? (And what is the exact boundary at which the allocation strategy changes?)
What are the internals of these strategies (beyond assuming the use of mmap/munmap to release the memory back to the OS)?
Is this 100% done by the Python runtime, or does numpy have a specific way of handling this? (The numpy docs mention NPY_USE_PYMEM, which switches between memory allocators.)
What you observe isn't CPython's strategy, but the strategy of the memory allocator which comes with the C-runtime your CPython-version is using.
When CPython allocates/deallocates memory via malloc/free, it doesn't communicate directly with the underlying OS but with a concrete memory-allocator implementation. In my case, on Linux, that is the GNU Allocator.
The GNU Allocator has different so-called arenas, where memory isn't returned to the OS but kept around so it can be reused without the need to communicate with the OS. However, if a large amount of memory is requested (whatever the definition of "large"), the allocator doesn't use memory from the arenas but requests it from the OS, and as a consequence can give it back to the OS directly once free is called.
CPython has its own memory allocator - pymalloc, which is built on top of the C runtime's allocator. It is optimized for small objects, which live in a special arena; there is less overhead when creating/freeing these objects compared to the underlying C-runtime allocator. However, objects bigger than 512 bytes don't use this arena but are managed directly by the C-runtime allocator.
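One rough way to see this difference from Python itself is to watch the process RSS after freeing many small objects versus one large object. Whether the memory actually goes back to the OS depends on the platform's C allocator, so this is only an illustrative sketch (it assumes psutil is installed):
import gc
import psutil

proc = psutil.Process()

def rss_kb():
    return proc.memory_info().rss // 1024

# Many small objects: these go through pymalloc's arenas (<= 512 bytes each).
before = rss_kb()
small = [bytes(100) for _ in range(10**6)]
del small
gc.collect()
print("small objects: %d KB -> %d KB" % (before, rss_kb()))

# One large allocation: handed directly to the C runtime's allocator.
before = rss_kb()
big = bytes(200 * 1024 * 1024)
del big
gc.collect()
print("large object:  %d KB -> %d KB" % (before, rss_kb()))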
The situation is even more complex with numpy's arrays, because different memory allocators are used for the meta-data (like shape, datatype and other flags) and for the actual data itself:
For the meta-data, PyArray_malloc, i.e. CPython's memory allocator (pymalloc), is used.
For the data itself, PyDataMem_NEW is used, which uses the underlying C-runtime functionality directly:
NPY_NO_EXPORT void *
PyDataMem_NEW(size_t size)
{
    void *result;

    result = malloc(size);
    ...
    return result;
}
I'm not sure what the exact idea behind this design was: obviously one would like to profit from the small-object optimization of pymalloc, and for the data this optimization would never work, but then one could use PyMem_RawMalloc instead of malloc. Maybe the goal was to be able to wrap numpy arrays around memory allocated by C routines and take over the ownership of that memory (but this will not work in some circumstances, see my comment at the end of this post).
This explains the behavior you are observing: for the data (whose size changes depending on the size argument passed in), PyDataMem_NEW is used, which bypasses CPython's memory allocator, so you see the original behavior of the C runtime's allocator.
One should try to avoid mixing different allocation/deallocation routines - PyArray_malloc/PyDataMem_NEW/malloc and PyArray_free/PyDataMem_FREE/free: even if it works for the OS + Python version at hand, it might fail for another combination.
For example, on Windows, when an extension is built with a different compiler version, one executable might end up with memory allocators from different C runtimes, and malloc/free might talk to different C memory allocators, which could lead to hard-to-track-down errors.

Faster repetitive uses of bz2.BZ2File for pickling

I'm pickling multiple objects repeatedly, but not consecutively. As it turned out, the pickled output files were too large (about 256 MB each).
So I tried bz2.BZ2File instead of open, and each file became 1.3 MB. (Yeah, wow.) The problem is that it takes too long (around 95 seconds to pickle one object), and I want to speed it up.
Each object is a dictionary, and most of them have similar structures (or hierarchies, if that describes it better: almost the same set of keys, and each value that corresponds to each key normally has some specific structure, and so on). Many of the dictionary values are numpy arrays, and I think many zeros will appear there.
Can you give me some advice to make it faster?
Thank you!
I ended up using lz4, which is a blazingly fast compression algorithm.
There is a python wrapper, which can be installed easily:
pip install lz4
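As a rough sketch, pickling through the lz4 frame interface might look like this (the file interface comes from the python lz4 package; check its documentation for the exact API of the version you install):
import pickle
import lz4.frame  # pip install lz4

def save(obj, path):
    # Compress the pickle stream with LZ4 instead of bz2.
    with lz4.frame.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(path):
    with lz4.frame.open(path, "rb") as f:
        return pickle.load(f)
The compression ratio will usually be worse than bz2, but for dictionaries full of mostly-zero numpy arrays the speed difference is typically what matters.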

Prime number hard drive storage for very large primes - Sieve of Atkin

I have implemented the Sieve of Atkin and it works great up to primes nearing 100,000,000 or so. Beyond that, it breaks down because of memory problems.
In the algorithm, I want to replace the memory-based array with a hard-drive-based array. Python's "wb" file functions and seek functions may do the trick. Before I go off inventing new wheels, can anyone offer advice? Two issues appear at the outset:
Is there a way to "chunk" the Sieve of Atkin to work on a segment in memory, and
is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them?
Why am I doing this? An old geezer looking for entertainment and to keep the noodle working.
Implementing the SoA in Python sounds fun, but note it will probably be slower than the SoE in practice. For some good monolithic SoE implementations, see RWH's StackOverflow post. These can give you some idea of the speed and memory use of very basic implementations. The numpy version will sieve to over 10,000M on my laptop.
What you really want is a segmented sieve. This lets you constrain memory use to some reasonable limit (e.g. 1M + O(sqrt(n)), and the latter can be reduced if needed). A nice discussion and code in C++ is shown at primesieve.org. You can find various other examples in Python. primegen, Bernstein's implementation of SoA, is implemented as a segmented sieve (Your question 1: Yes the SoA can be segmented). This is closely related (but not identical) to sieving a range. This is how we can use a sieve to find primes between 10^18 and 10^18+1e6 in a fraction of a second -- we certainly don't sieve all numbers to 10^18+1e6.
Involving the hard drive is, IMO, going the wrong direction. We ought to be able to sieve faster than we can read values from the drive (at least with a good C implementation). A ranged and/or segmented sieve should do what you need.
There are better ways to do storage, which will help some. My SoE, like a few others, uses a mod-30 wheel so has 8 candidates per 30 integers, hence uses a single byte per 30 values. It looks like Bernstein's SoA does something similar, using 2 bytes per 60 values. RWH's python implementations aren't quite there, but are close enough at 10 bits per 30 values. Unfortunately it looks like Python's native bool array is using about 10 bytes per bit, and numpy is a byte per bit. Either you use a segmented sieve and don't worry about it too much, or find a way to be more efficient in the Python storage.
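For illustration, here is a minimal segmented Sieve of Eratosthenes (not Atkin); the function name and segment size are arbitrary, and it favours clarity over the byte-per-30-values packing described above:
import math

def segmented_primes(limit, segment_size=10**6):
    # Base primes up to sqrt(limit) via a plain sieve.
    root = math.isqrt(limit)
    base = bytearray([1]) * (root + 1)
    base[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(root) + 1):
        if base[i]:
            base[i * i::i] = bytearray(len(base[i * i::i]))
    base_primes = [i for i in range(2, root + 1) if base[i]]

    # Sieve [low, high) one segment at a time; memory stays O(segment_size).
    for low in range(2, limit + 1, segment_size):
        high = min(low + segment_size, limit + 1)
        seg = bytearray([1]) * (high - low)
        for p in base_primes:
            start = max(p * p, -(-low // p) * p)  # first multiple of p >= low
            seg[start - low::p] = bytearray(len(seg[start - low::p]))
        for i, flag in enumerate(seg):
            if flag:
                yield low + i
Something like sum(1 for _ in segmented_primes(10**8)) should count the roughly 5.76 million primes below 10**8 while holding only one segment plus the base primes in memory.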
First of all, you should make sure that you store your data in an efficient manner. You could easily store the data for numbers up to 100,000,000 in 12.5 MB of memory by using a bitmap, and by skipping obvious non-primes (even numbers and so on) you could make the representation even more compact. This also helps when storing the data on the hard drive. That you're getting into trouble at 100,000,000 suggests that you're not storing the data efficiently.
Some hints if you don't receive a better answer.
1. Is there a way to "chunk" the Sieve of Atkin to work on a segment in memory?
Yes, for the Eratosthenes-like part, what you can do is run multiple elements of the sieve list in "parallel" (one block at a time) and that way minimize the disk accesses.
The first part is somewhat more tricky: what you would want to do is process the 4*x**2+y**2, 3*x**2+y**2 and 3*x**2-y**2 terms in a more sorted order. One way is to first compute them and then sort the numbers; there are sorting algorithms that work well on drive storage (still being O(N log N)), but that would hurt the time complexity. A better way is to iterate over x and y so that you work on one block at a time: since a block is determined by an interval, you could for example simply iterate over all x and y such that lo <= 4*x**2+y**2 <= hi (see the sketch after these hints).
2. Is there a way to suspend the activity and come back to it later - suggesting I could serialize the memory variables and restore them?
In order to achieve this (no matter how and when the program is terminated), you first have to journal your disk accesses (e.g. use an SQL database to keep the data, though with care you could do it yourself).
Second, since the operations in the first part are not idempotent, you have to make sure that you don't repeat those operations. However, since you would be running that part block by block, you could simply detect which block was processed last and resume there (if you can end up with a partially processed block, you'd just discard it and redo that block). The Eratosthenes part is idempotent, so you could just run through all of it, but to increase speed you could store a list of produced primes after they have been sieved (so you would resume sieving after the last produced prime).
As a by-product, you should even be able to construct the program in a way that makes it possible to keep the data from the first step even while the second step is running, and thereby extend the limit at a later moment by continuing the first step and then running the second step again. Perhaps even have two programs, where you terminate the first when you've grown tired of it and then feed its output to the Eratosthenes part (thereby not having to define a limit).
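Here is the sketch referenced in hint 1: it generates only the 4*x**2 + y**2 candidates that fall inside one block [lo, hi). The modulo tests and the other two quadratic forms of Atkin are omitted, and the function name is made up:
import math

def quadratic_4x2_y2_block(lo, hi):
    # Yield n = 4*x**2 + y**2 with lo <= n < hi, block-restricted so the
    # rest of the block can be processed entirely in memory.
    x = 1
    while 4 * x * x + 1 < hi:
        # Smallest y that can reach lo for this x (clamped to 1).
        y = max(1, math.isqrt(max(lo - 4 * x * x, 0)))
        while True:
            n = 4 * x * x + y * y
            if n >= hi:
                break
            if n >= lo:
                yield n
            y += 1
        x += 1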
You could try using a signal handler to catch when your application is terminated. This could then save your current state before terminating. The following script shows a simple number count continuing when it is restarted.
import signal, os, cPickle

class MyState:
    def __init__(self):
        self.count = 1

def stop_handler(signum, frame):
    # Stop the main loop cleanly on Ctrl+C so the last state gets saved.
    global running
    running = False

signal.signal(signal.SIGINT, stop_handler)
running = True
state_filename = "state.txt"

# Resume from the saved state if it exists, otherwise start fresh.
if os.path.isfile(state_filename):
    with open(state_filename, "rb") as f_state:
        my_state = cPickle.load(f_state)
else:
    my_state = MyState()

while running:
    print my_state.count
    my_state.count += 1
    # Persist the state after every iteration.
    with open(state_filename, "wb") as f_state:
        cPickle.dump(my_state, f_state)
As for improving disk writes, you could try experimenting with increasing Python's own file buffering by passing a 1 MB or larger buffer size, e.g. open('output.txt', 'w', 2**20). Using a with statement should also ensure your file gets flushed and closed.
There is a way to compress the array. It may cost some efficiency depending on the python interpreter, but you'll be able to keep more in memory before having to resort to disk. If you search online, you'll probably find other sieve implementations that use compression.
Neglecting compression though, one of the easier ways to persist memory to disk would be through a memory mapped file. Python has an mmap module that provides the functionality. You would have to encode to and from raw bytes, but it is fairly straightforward using the struct module.
>>> import struct
>>> struct.pack('H', 0xcafe)
b'\xfe\xca'
>>> struct.unpack('H', b'\xfe\xca')
(51966,)
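Putting the two together, here is a rough sketch of a disk-backed bit array via mmap; the class and file layout are made up for illustration, and real sieve code would also want the wheel packing mentioned in the other answer:
import mmap
import os

class DiskBitArray:
    # One bit per candidate number, backed by a memory-mapped file.
    def __init__(self, path, nbits):
        nbytes = (nbits + 7) // 8
        # Create the backing file if needed and make sure it is big enough.
        mode = "r+b" if os.path.exists(path) else "w+b"
        self._f = open(path, mode)
        if os.path.getsize(path) < nbytes:
            self._f.truncate(nbytes)  # extends the file with zero bytes
        self._map = mmap.mmap(self._f.fileno(), nbytes)

    def set(self, i):
        self._map[i >> 3] |= 1 << (i & 7)

    def get(self, i):
        return (self._map[i >> 3] >> (i & 7)) & 1

    def close(self):
        self._map.flush()
        self._map.close()
        self._f.close()
With something like that, a segment of the sieve can be updated through set()/get() and flushed without ever holding the whole bit array in RAM; the OS pages the file in and out as needed.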

Error saving and loading a list of matrices

I have a list "data_list" that I would like to save so I can load it in another script.
First of all I converted it to an array, like this:
data_array = np.array(data_list)
Then I saved it:
np.savez("File", data_array)
Then, in another script, I want to access "File", so:
a = np.load("File.npz")
b = a['arr_0']
I used this code until two weeks ago and it worked fine. These days I am trying to work with my program, but it ends with an error identified in the line
b = a['arr_0']
"File" is a 300 MB file. The strangest thing is that it has suddenly stopped working.
Any idea about what could have happened?
PS: some more information. My list contains 180 matrices of size 511x511. Each matrix contains decimal numbers (I tried creating 180 matrices of zeros, and the error occurs just the same). If I reduce the number of matrices, the script works fine: with 130 matrices or fewer it is OK, while above that the program doesn't work.
Here I report the error message
  b = a['arr_0']
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 241, in __getitem__
    return format.read_array(value)
  File "C:\Python27\lib\site-packages\numpy\lib\format.py", line 459, in read_array
    array = numpy.fromstring(data, dtype=dtype, count=count)
MemoryError
MemoryError is an out-of-memory condition. This explains why it happens with objects of at least a certain size: more and bigger arrays, as you would expect, require more memory. What the maximum size is, and why it seems to have changed, is harder to say. That can be highly specific to your system, especially in regard to considerations like:
How much memory (physical RAM and swap space) exists and is available to the operating system
How much virtual memory the OS gives to Python
How much of that you're already using
The implementation of the C library, especially of its malloc function, which can affect how Python uses the memory it is allocated
And possibly quite a few other things.
Per the comments, it seems the biggest problem here is that you are running a 32-bit build of Python. On Windows, 32-bit processes apparently have an effective maximum memory address space of around 2 GB. By my tests, the list of arrays you are using might by itself take around a quarter of that. The fact that your error only comes up when reading the file back in suggests that numpy deserialization is relatively memory intensive, but I don't know enough about its implementation to say why that would be. In any case, it seems like installing a 64-bit build of Python is your best bet.
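A quick way to confirm which build you are running (standard library only; both lines work on Python 2.7 and 3.x):
import struct, sys

print(struct.calcsize("P") * 8)   # 32 on a 32-bit build, 64 on a 64-bit build
print(sys.maxsize > 2**32)        # True only on a 64-bit build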
