I'm attempting to parallelize a program that reads chunked numpy arrays over the network using shared memory. It seems to work (my data comes out the other side), but the memory usage of each of my child processes blows up to roughly the size of the shared memory (~100-250 MB each), and it happens when I write to it. Is there some way to avoid these copies being created? They seem unnecessary, since the data is propagating back to the actual shared memory array.
Here's how I've set up my array using posix_ipc, mmap, and numpy (np):
import mmap, posix_ipc
import numpy as np
shared = posix_ipc.SharedMemory(vol.uuid, flags=posix_ipc.O_CREAT, size=int(nbytes))
array_like = mmap.mmap(shared.fd, shared.size)
renderbuffer = np.ndarray(buffer=array_like, dtype=vol.dtype, shape=mcshape)
The memory increases when I do this:
renderbuffer[ startx:endx, starty:endy, startz:endz, : ] = 1
Thanks for your help!
Your actual data has 4 dimensions, but I'm going to work through a simpler 2D example.
Imagine you have this array (renderbuffer):
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Now imagine your startx/endx/starty/endy parameters select this slice in one process:
8 9
13 14
18 19
The entire array is 4x5 times 8 bytes, so 160 bytes. The "window" is 3x2 times 8 bytes, so 48 bytes.
Your expectation seems to be that accessing this 48 byte slice would require 48 bytes of memory in this one process. But actually it requires closer to the full 160 bytes. Why?
The reason is that memory is mapped in pages, which are commonly 4096 bytes each. So when you access the first element (here, the number 8), you will map the entire page of 4096 bytes containing that element.
Memory mapping on many systems is guaranteed to start on a page boundary, so the first element in your array will be at the beginning of a page. So if your array is 4096 bytes or smaller, accessing any element of it will map the entire array into memory in each process.
In your actual use case, each element you access in the slice will result in mapping the entire page which contains it. Elements adjacent in memory (meaning either the first or the last index in the slice increments by one, depending on your array order) will usually reside in the same page, but elements which are adjacent in other dimensions will likely be in separate pages.
But take heart: the memory mappings are shared between processes, so if your entire array is 200 MB, even though each process will end up mapping most or all of it, the total memory usage is still 200 MB combined for all processes. Many memory measurement tools will report that each process uses 200 MB, which is sort of true but useless: they are sharing a 200 MB view of the same memory.
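If you want a rough feel for how much of the buffer a single write faults in, you can count the 4096-byte pages a slice touches. This is a minimal sketch, assuming a C-ordered float64 array and a made-up chunk shape and slice (not your actual vol/mcshape):

import numpy as np

PAGE = 4096
shape = (128, 128, 64, 4)   # hypothetical chunk shape
itemsize = 8                # float64

# C-order flat index of every element, sliced the same way as the write
flat = np.arange(np.prod(shape)).reshape(shape)[32:64, 32:64, 10:20, :]
offsets = flat * itemsize                         # byte offset of each element in the buffer
nominal = flat.size * itemsize                    # what the slice "should" cost
touched = np.unique(offsets // PAGE).size * PAGE  # what actually gets mapped in

print(nominal, touched)   # touched is typically several times larger than nominal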
As seen in Find the memory size of a set of strings vs. set of bytestrings, it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimation/upper bound:
import os, psutil
process = psutil.Process(os.getpid())
a = process.memory_info().rss
L = [b"a%09i" % i for i in range(10_000_000)]
b = process.memory_info().rss
print(L[:10]) # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b-a)
# 568762368 bytes
i.e. 569 MB for 100 MB of actual data.
Solutions to improve this (for example with other data structures) have been found in Memory-efficient data structure for a set of short bytes-strings and Set of 10-char strings in Python is 10 times bigger in RAM as expected, so here my question is not "how to improve", but:
How can we precisely explain this size in the case of a standard list of byte-strings?
How many bytes for each byte-string, for each (linked?) list item to finally obtain 569 MB?
This will help to understand the internals of lists and bytes-strings in CPython (platform: Windows 64 bit).
Summary:
89 MB for the list object
480 MB for the string objects
=> total 569 MB
sys.getsizeof(L) will tell you the list object itself is about 89 MB. That's a few dozen organizational bytes, 8 bytes per bytestring reference, and up to 12.5% overallocation to allow efficient insertions.
sys.getsizeof(one_of_your_bytestrings) will tell you they're 43 bytes each. That's:
8 bytes for the reference counter
8 bytes for the pointer to the type
8 bytes for the length (since bytestrings aren't fixed size)
8 bytes for the cached hash
10 bytes for your actual bytestring content
1 byte for a terminating 0-byte.
Storing the objects every 43 bytes in memory would cross memory word boundaries, which is slower, so in practice they're stored every 48 bytes. You can use id(one_of_your_bytestrings) to get the addresses and check.
(There's some variance here and there, partly due to the exact memory allocations that happen, but 569 MB is about what's expected knowing the above reasons, and it matches what you measured.)
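As a rough back-of-the-envelope check of those numbers (approximate, not an exact accounting):

import sys

n = 10_000_000
L = [b"a%09i" % i for i in range(n)]
list_bytes = sys.getsizeof(L)     # ~89 MB: n 8-byte references plus overallocation
string_bytes = 48 * n             # each 43-byte bytes object padded out to 48 bytes
print(list_bytes + string_bytes)  # ~569_000_000, close to the measured RSS growth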
Which object is generally smaller in memory given the exact same data: a numpy array with dtype int64 or a C++ vector of type int? For example:
v = np.array([34, 23])
std::vector<int> v { 34,23 };
There are effectively 2 parts to an np.array: the object overhead plus attributes like shape and strides, and a data buffer. The first has roughly the same size for all arrays; the second scales with the number of elements (and the size of each element). In numpy the data buffer is 1d, regardless of the array shape.
With only 2 elements the overhead part of your example array is probably larger than the data buffer. But with 1000s of elements the size proportion goes the other way.
Saving the array with np.save will give a rough idea of the memory use. That file format writes a header buffer (256 bytes?), and the rest is the data buffer.
I'm less familiar with C++ storage, though I think that's more transparent (if you know the language).
But remember efficiency in storing one array is only part of the story. In practice you need to think about the memory use when doing math and indexing. The ndarray distinction between view and copy makes it harder to predict just how much memory is being used.
In [1155]: np.save('test.npy',np.array([1,2]))
In [1156]: ls -l test.npy
-rw-rw-r-- 1 paul paul 88 Jun 30 17:08 test.npy
In [1157]: np.save('test.npy',np.arange(1000))
In [1158]: ls -l test.npy
-rw-rw-r-- 1 paul paul 4080 Jun 30 17:08 test.npy
This looks like 80 bytes of header, and 4*len bytes for the data.
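To see the same overhead-versus-data split in memory rather than on disk, you can compare sys.getsizeof (the object plus the data buffer it owns) with nbytes (just the data); the exact overhead varies by numpy version and platform:

import sys
import numpy as np

small = np.array([34, 23])
big = np.arange(100_000)
print(sys.getsizeof(small), small.nbytes)  # overhead dominates the tiny array
print(sys.getsizeof(big), big.nbytes)      # the data buffer dominates the big one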
I am new to Python and I want to have a list with 2 elements: the first one is an integer between 0 and 2 billion, the other one is a number between 0 and 10. I have a large number of these lists (billions).
Suppose I use the chr() function to produce the second element of the list. For example:
first_number = 123456678
second_number = chr(1)
mylist = [first_number,second_number]
In this case how does Python allocate memory? Will it treat the second element as a char and give it 1 byte (plus overhead), or will it treat it as a string? If it treats it as a string, is there any way that I can define and enforce something as a char, or make this somehow more memory efficient?
Edit --> added some more information about why I need this data structure
Here is some more information about what I want to do:
I have a sparse weighted graph with 2 billion edges and 25 million nodes. To represent this graph I tried to create a dictionary (because I need a fast lookup) in which the keys are the nodes (as integers). These nodes are represented by a number between 0 to 2 billion (there is no relation between this and the number of edges). The edges are represented like this: For each of the nodes (or the keys in the dictionary ) I am keeping a list of list. Each element of this list of list is a list that I have explained above. The first one represent the other node and the second argument represents the weight of the edge between the key and the first argument. For example, for a graph that contain 5 nodes, if I have something like
{1: [[2, 1], [3, 1], [4, 2], [5, 1]], 2: [[5, 1]], 3: [[5, 2]], 4: [[6, 1]], 5: [[6, 1]]}
it means that node 1 has 4 edges: one that goes to node 2 with weight 1, one that goes to node 3 with weight 1, one that goes to node 4 with weight 2, etc.
I was looking to see if I could make this more memory efficient by making the second argument of the edge smaller.
Using a single-character string will take up about the same amount of memory as a small integer, because CPython creates only one object for each such value and reuses it every time it needs a string or integer of that value. Using strings will take up a bit more space, but it'll be insignificant.
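You can check this caching behaviour directly; the identity test below is a CPython implementation detail, so treat it as illustrative rather than guaranteed:

import sys

print(chr(1) is chr(1))        # True in CPython: one shared 1-char string object
print(sys.getsizeof(chr(1)))   # a few dozen bytes, but only paid once
print(sys.getsizeof(5))        # small ints are likewise cached and shared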
But let's answer your real question: how can you reduce the amount of memory your Python program uses? First I'll calculate roughly how much memory the objects you want to create will use. I'm using the 64-bit version of Python 2.7 to get my numbers, but other 64-bit versions of Python should be similar.
Starting off you have only one dict object, but it has 25 million nodes. Python will use 2^26 hash buckets for a dict of this size, and each bucket is 24 bytes. That comes to about 1.5 GB for the dict itself.
The dict will have 25 million keys, all of them int objects, and each of them is 24 bytes. That comes to a total of about 570 MB for all the integers that represent nodes. It will also have 25 million list objects as values. Each list will take up 72 bytes plus 8 bytes per element in the list. These lists will have a total of 2 billion elements, so they'll take up a total of 16.6 GB.
Each of these 2 billion list elements will refer to another list object that's two elements long. That comes to a whopping 164 GB. Each of the two-element lists will refer to two different int objects. Now the good news: while that appears to be a total of about 4 billion integer objects, it's actually only 2 billion distinct integer objects, since there will be only one object created for each of the small integer values used in the second element. So that's a total of 44.7 GB of memory used by the integer objects referred to by the first element.
That comes to at least 227 GB of memory you'll need for the data structure as you plan to implement it. Working back through this list, I'll explain how it's possible for you to reduce the memory you'll need to something more practical.
The 44.7 GB of memory used by the int objects that represent nodes in your two-element edge lists is the easiest to deal with. Since there are only 25 million nodes, you don't need 2 billion different objects, just one for each node value. Also, since you're already using the node values as keys, you can just reuse those objects. So that's 44.7 GB gone right there, and depending on how you build your data structure it might not take much effort to ensure that no redundant node value objects are created. That brings the total down to 183 GB.
Next let's tackle the 164 GB needed for all the two-element edge list objects. It's possible that you could share list objects that happen to have the same node value and weighting, but you can do better: eliminate all the edge lists by flattening the lists of lists. You'll have to do a bit of arithmetic to access the correct elements, but unless you have a system with a huge amount of memory you're going to have to make compromises. The list objects used as dict values will have to double in length, increasing their total size from 16.6 GB to 31.5 GB. That makes your net savings from flattening the lists a nice 149 GB, bringing the total down to a more reasonable 33.5 GB.
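For illustration, the flattening might look roughly like this, using the small example graph from the question (the edges helper is hypothetical, just to show the index arithmetic):

# before: {1: [[2, 1], [3, 1], [4, 2], [5, 1]], ...}
# after:  {1: [2, 1, 3, 1, 4, 2, 5, 1], ...}
graph = {1: [2, 1, 3, 1, 4, 2, 5, 1]}

def edges(graph, node):
    # even slots hold the neighbor node, odd slots hold the weight
    adj = graph[node]
    for i in range(0, len(adj), 2):
        yield adj[i], adj[i + 1]

print(list(edges(graph, 1)))   # [(2, 1), (3, 1), (4, 2), (5, 1)]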
Going farther than this is trickier. One possibility is to use arrays. Unlike lists, their elements don't refer to other objects; the value is stored directly in each element. An array.array object is 56 bytes long plus the size of the elements, which in this case are 32-bit integers. That adds up to 16.2 GB, for a net savings of 15.3 GB. The total is now only 18.3 GB.
It's possible to squeeze out a little more space by taking advantage of the fact that your weights are small integers that fit in single-byte characters. Create two array.array objects for each node: one with 32-bit integers for the node values, and the other with 8-bit integers for the weights. Because there are now two array objects, use a tuple object to hold the pair. The total size of all these objects is 13.6 GB. Not a big savings over a single array, but now you don't need to do any arithmetic to access elements, you just need to switch how you index them. The total is down to 15.66 GB.
Finally, the last thing I can think of to save memory is to have only two array.array objects in total. The dict values then become tuple objects that refer to two int objects: the first is an index into the two arrays, the second is a length. This representation takes up 11.6 GB of memory, another small net decrease, with the total becoming 13.6 GB.
That final total of 13.6 GB should work on a machine with 16 GB of RAM without much swapping, but it won't leave much room for anything else.
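Here is a minimal sketch of that final layout (the helper names and the 'I'/'B' typecodes are my assumptions; a real build would stream the 2 billion edges in rather than hold them in Python lists first):

from array import array

neighbors = array('I')   # one big array of node ids, typically 4 bytes each
weights = array('B')     # one big array of weights, 1 byte each
index = {}               # node -> (offset, count) into the two arrays

def add_node(node, edge_list):
    # edge_list is an iterable of (other_node, weight) pairs
    offset = len(neighbors)
    for other, weight in edge_list:
        neighbors.append(other)
        weights.append(weight)
    index[node] = (offset, len(neighbors) - offset)

def edges(node):
    offset, count = index[node]
    for i in range(offset, offset + count):
        yield neighbors[i], weights[i]

add_node(1, [(2, 1), (3, 1), (4, 2), (5, 1)])
print(list(edges(1)))    # [(2, 1), (3, 1), (4, 2), (5, 1)]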
I recall reading that it is hard to pin down the exact memory usage of objects in Python. However, that thread is from 2009, and since then I have read about various memory profilers in Python (see the examples in this thread). Also, IPython has matured substantially in recent months (version 1.0 was released a few days ago).
IPython already has a magic called whos that prints the variable names, their types and some basic Data/Info.
In a similar fashion, is there any way to get the size in memory of each of the objects returned by who? Are there any utilities available for this purpose already in IPython?
Using Guppy
Guppy (suggested in this thread) has a command that allows one to get the cumulative memory usage per object type, but unfortunately:
It does not show memory usage per object
It prints the sizes in bytes (not in a human-readable format)
For the second one, it may be possible to apply bytes2human from this answer, but I would need to first collect the output of h.heap() in a format that I can parse.
But for the first one (the most important one), is there any way to have Guppy show memory usage per object?
In [6]: import guppy
In [7]: h = guppy.hpy()
In [8]: h.heap()
Out[8]:
Partition of a set of 2871824 objects. Total size = 359064216 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 522453 18 151469304 42 151469304 42 dict (no owner)
1 451503 16 36120240 10 187589544 52 numpy.ndarray
2 425700 15 34056000 9 221645544 62 sklearn.grid_search._CVScoreTuple
3 193439 7 26904688 7 248550232 69 unicode
4 191061 7 22696072 6 271246304 76 str
5 751128 26 18027072 5 289273376 81 numpy.float64
6 31160 1 12235584 3 301508960 84 list
7 106035 4 9441640 3 310950600 87 tuple
8 3300 0 7260000 2 318210600 89 dict of 0xb8670d0
9 1255 0 3788968 1 321999568 90 dict of module
<1716 more rows. Type e.g. '_.more' to view.>
Why not use something like:
h.heap().byid
But this will only show you immediate sizes (i.e. not the total size of a list including the other lists it might refer to).
If you have a particular object you wish to get the size of you can use:
h.iso(object).domisize
To find the approximate amount of memory that would be freed if it were deleted.
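For example, something along these lines, assuming guppy (or guppy3 on Python 3) is installed and using only the calls mentioned above:

import guppy

hp = guppy.hpy()
print(hp.heap())             # cumulative usage per type, as in the output above
print(hp.heap().byid)        # per-object view, immediate sizes only

big = [list(range(100)) for _ in range(1000)]
print(hp.iso(big).domisize)  # approximate memory freed if `big` were deleted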
The program I've written stores a large amount of data in dictionaries. Specifically, I'm creating 1588 instances of a class, each of which contains 15 dictionaries with 1500 float to float mappings. This process has been using up the 2GB of memory on my laptop pretty quickly (I start writing to swap at about the 1000th instance of the class).
My question is, which of the following is using up my memory?
The 34-million-odd pairs of floats?
The overhead of 22,500 dictionaries?
The overhead of the ~1,500 class instances?
To me it seems like the memory hog should be the huge number of floating point numbers that I'm holding in memory. However, if what I've read so far is correct, each of my floating point numbers takes up 16 bytes. Since I have 34 million pairs, this should be about 108 million bytes, which should be just over a gigabyte.
Is there something I'm not taking into consideration here?
The floats do take up 16 bytes apiece, and a dict with 1500 entries about 100k:
>>> sys.getsizeof(1.0)
16
>>> d = dict.fromkeys((float(i) for i in range(1500)), 2.0)
>>> sys.getsizeof(d)
98444
so the 22,500 dicts take over 2 GB all by themselves, and the 68 million floats another GB or so. Not sure how you computed 68 million times 16 as equal to only 100M -- you may have dropped a zero somewhere.
The class itself takes up a negligible amount, and 1500 instances thereof (net of the objects they refer to, of course, just as getsizeof gives us such net amounts for the dicts) not much more than a smallish dict each, so that's hardly the problem. I.e. (where Sic is a minimal example class):
>>> sys.getsizeof(Sic)
452
>>> sys.getsizeof(Sic())
32
>>> sys.getsizeof(Sic().__dict__)
524
452 for the class, (524 + 32) * 1550 = 862K for all the instances; as you can see, that's not the worry when you have gigabytes each in dicts and floats.
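Putting those numbers together as a rough back-of-the-envelope total, using the sizes measured above and the counts from the question:

dicts = 22500 * 98444              # the dictionaries themselves: ~2.2 GB
floats = 68000000 * 16             # the float objects: ~1.1 GB
instances = (524 + 32) * 1588      # the class instances: under 1 MB
print(dicts + floats + instances)  # ~3.3 GB, well past a 2 GB laptop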