python memory usage dict and variables large dataset

python memory usage dict and variables large dataset - python

So, I'm making a game in Python 3.4. In the game I need to keep track of a map. It is a map of joined rooms, starting at (0,0) and continuing in every direction, generated in a filtered-random way(only correct matches for the next position are used for a random list select).
I have several types of rooms, which have a name, and a list of doors:
RoomType = namedtuple('Room','Type,EntranceLst')
typeA = RoomType("A",["Bottom"])
...
For the map at the moment I keep a dict of positions and the type of room:
currentRoomType = typeA
currentRoomPos = (0,0)
navMap = {currentRoomPos: currentRoomType}
I have loop that generates 9.000.000 rooms, to test the memory usage.
I get around 600 and 800Mb when I run it.
I was wondering if there is a way to optimize that.
I tried with instead of doing
navMap = {currentRoomPos: currentRoomType}
I would do
navMap = {currentRoomPos: "A"}
but this doesn't have a real change in usage.
Now I was wondering if I could - and should - keep a list of all the types, and for every type keep the positions on which it occurs. I do not know however if it will make a difference with the way python manages its variables.
This is pretty much a thought-experiment, but if anything useful comes from it I will probably implement it.

You can use sys.getsizeof(object) to get the size of a Python object. However, you have to be careful when calling sys.getsizeof on containers: it only gives the size of the container, not the content -- see this recipe for an explanation of how to get the total size of a container, including contents. In this case, we don't need to go quite so deep: we can just manually add up the size of the container and the size of its contents.
The sizes of the types in question are:
# room type size
>>> sys.getsizeof(RoomType("A",["Bottom"])) + sys.getsizeof("A") + sys.getsizeof(["Bottom"]) + sys.getsizeof("Bottom")
233
# position size
>>> sys.getsizeof((0,0)) + 2*sys.getsizeof(0)
120
# One character size
>>> sys.getsizeof("A")
38
Let's look at the different options, assuming you have N rooms:
Dictionary from position -> room_type. This involves keeping N*(size(position) + size(room_type)) = 353 N bytes in memory.
Dictionary from position -> 1-character string. This involves keeping N*158 bytes in memory.
Dictionary from type -> set of positions. This involves keeping N*120 bytes plus a tiny overhead with storing dictionary keys.
In terms of memory usage, the third option is clearly better. However, as is often the case, you have a CPU memory tradeoff. It's worth thinking briefly about the computational complexity of the queries you are likely to do. To find the type of a room given its position, with each of the three choices above you have to:
Look up the position in a dictionary. This is an O(1) lookup, so you'll always have the same run time (approximately), independent of the number of rooms (for a large number of rooms).
Same
Look at each type, and for each type, ask if that position is in the set of positions for that type. This is an O(ntypes) lookup, that is, the time it takes is proportional to the number of types that you have. Note that, if you had gone for list instead of a set to store the rooms of a given type, this would grow to O(nrooms * ntypes), which would kill your performance.
As always, when optimising, it is important to consider the effect of an optimisation on both memory usage and CPU time. The two are often at odds.
As an alternative, you could consider keeping the types in a 2-dimensional numpy array of characters, if your map is sufficiently rectangular. I believe this would be far more efficient. Each character in a numpy array is a single byte, so the memory usage would be much less, and the CPU time would still be O(1) lookup from room position to type:
# Generate random 20 x 10 rectangular map
>>> map = np.repeat('a', 100).reshape(20, 10)
>>> map.nbytes
200 # ie. 1 byte per character.
Some additionally small scale optimisations:
Encode the room type as an int rather than a string. Ints have size 24 bytes, while one-character strings have size 38.
Encode the position as a single integer, rather than a tuple. For instance:
# Random position
xpos = 5
ypos = 92
# Encode the position as a single int, using high-order bits for x and low-order bits for y
pos = 5*1000 + ypos
# Recover the x and y values of the position.
xpos = pos / 1000
ypos = pos % 1000
Note that this kills readability, so it's only worth doing if you want to squeeze the last bits of performance. In practice, you might want to use a power of 2, rather than a power of 10, as your delimiter (but a power of 10 helps with debugging and readability). Note that this brings your number of bytes per position from 120 to 24. If you do go down this route, consider defining a Position class using __slots__ to tell Python how to allocate memory, and add xpos and ypos properties to the class. You don't want to litter your code with pos / 1000 and pos % 1000 statements.

Related

Is there away to allocate memory to a long string in python?

I am working with strings that are recursively concatenated to lengths of around 80 million characters. Python slows down dramatically as the string length increases.
Consider the following loop:
s = ''
for n in range(0,r):
s += 't'
I measure the time it takes to run to be 86ms for r = 800,000, 3.11 seconds for r = 8,000,000 and 222 seconds for r = 80,000,000
I am guessing this has something to do with how python allocates additional memory for the string. Is there a way to speed this up, such as allocating the full 80MB to the string s when it is declared?

When you have a string value and change it in your program, then the previous value will remain in a part of memory and the changed string will be placed in a new part of RAM.
As a result, the old values in RAM remain unused.
For this purpose, Garbage Collector is used and cleans your RAM from old, unused values But it will take time.
You can do this yourself. You can use the gc module to see different generations of your objects See this:
import gc
print(gc.get_threshold())
result:
(596, 2, 1)
In this example, we have 596 objects in our youngest generation, two objects in the next generation, and one object in the oldest generation.
For this reason, the allocation speed may be slow and your program may slow down
use this link to efficient String Concatenation in Python
Good luck.

It can't be done in a straightforward way with text (string) objects, but it it is trivial if you are dealing with bytes - in that case, you can create a bytearray object larger than your final outcome and insert your values into it.
If you need the final object as text, you can then decode it to text, the single step will be fast enough.
As you don't state what is the nature of your data, it may become harder - if no single-byte encoding can cover all the characters you need, you have to resort to a variable-lenght encoding, such as utf-8, or a multibyte encoding, such as utf-16 or 32. In both cases, it is no problem if you keep proper track of your insetion index - which will be also your final datasize for re-encoding. (If all you are using are genetic "GATACA" strings, just use ASCII encoding, and you are done)
data = bytearray(100_000_000) # 100 million positions -
index = 0
for character in input_data:
v = character.encode("utf-8")
s = len(v)
if s == 1:
data[index] = v
else:
data[index: index + len(v)] = v
index += len(v)
data_as_text = data[:index].decode("utf-8")

Efficient Way to Repeatedly Split Large NumPy Array and Record Middle

I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
Recording and then removing the middle value in the array
Split the array into the left half and right half
Repeat Steps 1-2 for each half
Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np
def split(node, out):
mid = len(node) // 2
out.append(node[mid])
return node[:mid], node[mid+1:]
def reorder(a):
nodes = [a.tolist()]
out = []
while nodes:
tmp = []
for node in nodes:
for n in split(node, out):
if n:
tmp.append(n)
nodes = tmp
return np.array(out)
if __name__ == "__main__":
nodes = np.arange(10)
print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.

You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
# Similar to [e for tmp in zip(a, b) for e in tmp] ,
# but on Numpy arrays and much faster
def interleave(a, b):
assert len(a) == len(b)
return np.column_stack((a, b)).reshape(len(a) * 2)
# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
if n == 0:
return np.empty(0, dtype=np.int32)
startSlices = np.array([0], dtype=np.int32)
endSlices = np.array([n], dtype=np.int32)
allMidSlices = np.empty(n, dtype=np.int32) # Similar to "out" in your implementation
midInsertCount = 0 # Actual size of allMidSlices
# Generate a bunch of middle values as long as there is valid slices to split
while midInsertCount < n:
# Generate the new mid/left/right slices
midSlices = (endSlices + startSlices) // 2
# Computing the next slices is not needed for the last step
if midInsertCount + len(midSlices) < n:
# Generate the nexts slices (possibly with invalid ones)
newStartSlices = interleave(startSlices, midSlices+1)
newEndSlices = interleave(midSlices, endSlices)
# Discard invalid slices
isValidSlices = newStartSlices < newEndSlices
startSlices = newStartSlices[isValidSlices]
endSlices = newEndSlices[isValidSlices]
# Fast appending
allMidSlices[midInsertCount:midInsertCount+len(midSlices)] = midSlices
midInsertCount += len(midSlices)
return allMidSlices[0:midInsertCount]
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000) dropping from 2min35 to 1.75s. It also consume far less memory (rougthly 3~4 times less). Note that if you want a faster code, then you probably need to use a native language like C or C++.

Edit:
The question has been updated to have a much smaller input array so I leave the below for historical reasons. Basically it was likely a typo but we often get accustomed to computers working with insanely large numbers and when memory is involved they can be a real problem.
There is already a numpy based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64 bit integers. Do you have 800GB of RAM? Then you convert the numpy array to a list which will be substantially larger than the array (each packed 64 bit int in the numpy array will become a much less memory efficient python int object and the list will have a pointer to that object). Then you make a lot of slices of the list which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append all the result values to a list a single value at a time. Lists are very fast for adding items generally but with such an extreme size this will not only be slow but the way the list is allocated is likely to be extremely wasteful RAM wise and contribute to major problems (I believe they double in size when they get to a certain level of fullness so you will end up allocating more RAM than you need and doing many allocations and likely copies). What kind of machine are you running this on? There are ways to improve your code but unless you're running it on a super computer I don't know that you're going to ever finish that calculation. I only..only? have 32GB of RAM and I'm not going to even try to create a 100B int_64 numpy array as I don't want to use up ssd write life for a mass of virtual memory.
As for improving your code stick to numpy arrays don't change to a python list it will greatly increase the RAM you need. Preallocate a numpy array to put the answer in. Then you need a new algorithm. Anything recursive or recursive like (ie a loop splitting the input,) will require tracking a lot of state, your nodes list is going to be extraordinarily gigantic and again use a lot of RAM. You could use len(a) to indicate values that are removed from your list and scan through the entire array each time to figure out what to do next but that will save RAM in favour of a tremendous amount of searching a gigantic array. I feel like there is an algorithm to cut numbers from each end and place them in the output and just track the beginning and end but I haven't figured it out at least not yet.
I also think there is a simpler algorithm where you just track the number of splits you've done instead of making a giant list of slices and keeping it all in memory. Take the middle of the left half and then the middle of the right then count up one and when you take the middle of the left half's left half you know you have to jump to the right half then the count is one so you jump over to the original right half's left half and on and on... Based on the depth into the halves and the length of the input you should be able to jump around without scanning or tracking all of those slices though I haven't been able to dedicate much time to thinking this through in my head.
With a problem of this nature if you really need to push the limits you should consider using C/C++ so you can be as efficient as possible with RAM usage and because you're doing an insane number of tiny things which doesn't map well to python performance.

Justification of constants used in random.sample

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _log(3*k,4). The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is there justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand, size of a small set but find the "minus size of an empty list" confusing. Can someone shed any light on this?
what is meant specifically by "table" size, as apposed to say "set size".
Looking on the github repository, it looks like a very old version simply used the ratio k : 6*k, as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence or set.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use range as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
if isinstance(population, _Set):
population = tuple(population)
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
randbelow = self._randbelow
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n-i)
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
(I apologise is this question would be better placed in math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though, it was maybe something to do with the amount of space that sets and lists use - but could't find any details anywhere).

This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)

Detect significant changes in a data-set that gradually changes

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?

This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol = 2):
iterable = iter(iterable) # this is necessary if it is container rather than generator.
# note that if the iterable is already a generator iter(iterable) returns itself.
base = next(iterable)
for x in iterable:
if x >= (base * tol) or x <= (base/tol):
yield x
base = x
my_list = [10,15,17,20,30,40,50,70,80,60,40,20]
print(list(gen_significant_changes(my_list)))

I can't help with the Python part, but in terms of math, the problem you're asking is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant can be reached by raising 2 to a different power (as an integer) than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] /Array[0], 2)
previous = math.log(Array[t-1]/Array[0], 2)
if math.floor(current) <> math.floor(previous) a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.

python memory allocation when using chr()

I am new to python and I want to have a list with 2 elements the first one is an integer between 0 and 2 billion, the other one is a number between 0 to 10. I have a large number of these lists (billions).
Suppose I use chr() function to add the second argument for the list. For example:
first_number = 123456678
second_number = chr(1)
mylist = [first_number,second_number]
In this case how does python allocate memory? Will it assume that the second argument is a char and give it (1 byte + overheads) or will it assume that the second argument is a string? If it thinks that it is a string is there any way that I can define and enforce something as char or make this some how more memory efficient?
Edit --> added some more information about why I need this data structure
Here is some more information about what I want to do:
I have a sparse weighted graph with 2 billion edges and 25 million nodes. To represent this graph I tried to create a dictionary (because I need a fast lookup) in which the keys are the nodes (as integers). These nodes are represented by a number between 0 to 2 billion (there is no relation between this and the number of edges). The edges are represented like this: For each of the nodes (or the keys in the dictionary ) I am keeping a list of list. Each element of this list of list is a list that I have explained above. The first one represent the other node and the second argument represents the weight of the edge between the key and the first argument. For example, for a graph that contain 5 nodes, if I have something like
{1: [[2, 1], [3, 1], [4, 2], [5, 1]], 2: [[5, 1]], 3: [[5, 2]], 4: [[6, 1]], 5: [[6, 1]]}
it means that node 1 has 4 edges: one that goes to node 2 with weight 1, one that goes to node 3, with weight 1, one that goes to node 4 with weight 2, etc.
I was looking to see if I could make this more memory efficient by making the second argument of the edge smaller.

Using a single character string will take up about the same amount memory as a small integer because CPython will only create one object of each value, and use that object every time it needs a string or integer of that value. Using strings will take up a bit more space, but it'll be insignificant.
But lets answer you real question, how can you reduce the amount of memory your Python program uses? First I'll calculate about how much memory the objects you want to create will use. I'm using the 64-bit version of Python 2.7 to get my numbers but other 64-bit versions of Python should be similar.
Starting off you have only one dict object, but it has 25 million nodes. Python will use 2^26 hash buckets for a dict of this size, and each bucket is 24 bytes. That comes to about 1.5 GB for the dict itself.
The dict will have 25 million keys, all of them int objects, and each of them is 24 bytes. That comes to total of about 570 MB for all the integers that represent nodes. It will also have 25 million list objects as values. Each list will take up 72 bytes plus 8 bytes per element in the list. These lists will have a total of 2 billion elements, so they'll take up a total of 16.6 GB.
Each of these 2 billion list elements will refer to another list object that's two elements long. That comes to whopping 164 GB. Each of the two element lists will refer two different int objects. Now the good news, while that appears to be a total of about 4 billion integer objects, it's actually only 2 billion different integer objects. There will be only one object created for each of the small integer values used in the second element. So that's a total 44.7 GB of memory used by the integer objects referred to by the first element.
That comes to at least 227 GB of memory you'll need for the data structure as you plan to implement it. Working back through this list I'll explain how its possible for you reduce the memory you'll need to something more practical.
The 44.7 GB of memory used by the int objects that represent nodes in your two element edge lists is the easiest to deal with. Since there are only 25 million nodes, you don't need 2 billion different objects, just one for each node value. Also since you're already using the node values as keys you can just reuse those objects. So that's 44.7 GB gone right there, and depending on how you build your data structure it might not take much effort to ensure only that no redudant node value objects are created. That brings the total down to 183 GB.
Next lets tackle the 164 GB needed for all the two element edge list objects. It's possible that you can share list objects that happen to have the same node value and weighting, but you can do better. Eliminate all the edges lists by flatting the lists of lists. You'll have to do a bit arithmetic access the correct elements, but unless you have a system with a huge amount of memory you're going to have to make compromises. The list objects used as dict values will have to double in length, increasing their total size from 16.1 GB to 31.5 GB. That makes your net savings from flatting the lists a nice 149 GB, bringing the total down to a more reasonable 33.5 GB.
Going farther than this is trickier. One possibility is to use arrays. Unlike lists their elements don't refer to other objects, the value is stored in each element. An array.array object is 56 bytes long plus the size of the elements which in this case are 32-bit integers. That adds up to 16.2 GB for a net savings of 15.3 GB. The total is now only 18.3 GB.
It's possible to squeeze a little more space by taking advantage of the fact that your weights are small integers that fit in single byte characters. Create two array.array objects for each node, one with 32-bit integers for the node values, and the other with 8-bit integers for the weights. Because there are now two array objects, use a tuple object to hold the pair. The total size of all these objects is 13.6 GB. Not a big savings over a single array but now you don't need to any arithmetic to access elements, you just need switch how you index them. The total is down to 15.66 GB.
Finally the last thing I can think of to save memory is to only have two array.array objects. The dict values then become tuple objects that refer to two int objects. The first is an index into the two arrays, the second is a length. This representation takes up 11.6 GB of memory, another small net decrease, with the total becoming 13.6 GB.
That final total of 13.6 GB should work on machine with 16 GB of RAM without much swapping, but it won't leave much room for anything else.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.