How to dynamically allocate memory in Python

Is there any method in Python that I can use to get a block of memory from the heap and use a variable to reference it, just like the keyword "new" or the function malloc() in other languages?
Object *obj = (Object *) malloc(sizeof(Object));
Object *obj = new Object();
In the project, my program waits to receive data at irregular intervals; each message has a fixed length in bytes when it is valid.
I used to do it like this:
void receive() // callback
{
    if (getSize() <= sizeof(DataStruct))
    {
        DataStruct *pData = malloc(sizeof(DataStruct));
        if (recvData(pData) > 0)
            list_add(globalList, pData);
    }
}
void worker()
{
    init();
    while (!isFinish)
    {
        dataProcess(globalList);
    }
}
Now I want to migrate this old project to Python, and I tried to do it like this:
def receive():  # callback
    data = dataRecv()
    globalList.append(data)
However, all the items in the list turn out to be the same, equal to the latest received item. It is obvious that all the list items point to the same memory address, and I want to get a new memory address each time the function is called.

The equivalent of "new" in Python is to just use a constructor, e.g.:
new_list = list() # or [] - expandable heterogeneous list
new_dict = dict() # expandable hash table
new_obj = CustomObject() # assuming CustomObject has been defined
Since you are porting from C, some things to note.
Everything is an object in Python, including integers, and most variables are just references, but the behaviour of immutable objects such as integers and strings differs from that of mutable containers, e.g.:
a = 2 # a is a reference to the int object 2
b = a # b now refers to the same object that a does (not to 'a' itself)
b = 3 # b now points to 3, while 'a' continues to point to 2
However:
alist = ['eggs', 2, 'juice'] # alist is reference to a new list
blist = alist # blist is a reference; changing blist affects alist
blist.append('coffee') # alist and blist both point to
# ['eggs', 2, 'juice', 'coffee']
You can pre-allocate sizes if you'd like, but it often doesn't buy you much benefit in Python. The following is valid:
new_list4k = [None]*4096 # initialize to list of 4096 None's
new_list4k = [0]*4096 # initialize to 4096 0's
big_list = []
big_list.extend(new_list4k) # resizes big_list to accommodate at least 4k items
If you want to ensure memory leaks do not occur, use local variables as often as possible, e.g. within a function, so that as things go out of scope you don't have to worry.
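For example (a minimal sketch; process_chunk and the byte values are made up for illustration), doing the work inside a function means the temporaries disappear as soon as it returns:
def process_chunk(raw_bytes):
    # decoded and filtered are locals: once the function returns,
    # nothing references them and CPython can reclaim them right away.
    decoded = list(raw_bytes)
    filtered = [b for b in decoded if b != 0]
    return sum(filtered)

total = process_chunk(b"\x01\x00\x02\x03")  # temporaries are gone after this call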
For efficient vectorized operations (and much lower memory footprint) use numpy arrays.
import numpy as np
my_array = np.zeros(8192) # create a fixed array length of 8K elements
my_array += 4 # fills everything with 4
My added two cents:
I'd probably start by asking what your primary goal is. There are the pythonic ways of doing things, there is optimizing for speed of execution or minimum memory footprint, and then there is the effort of porting a program in as little time as possible. Sometimes they all intersect, but more often you will find the pythonic way quick to translate but with higher memory requirements. Getting higher performance out of Python will probably take focused experience.
Good luck!

You should read the Python tutorial.
You can create lists, dictionaries, objects and closures in Python. All these live in the (Python) heap, and Python has a naive garbage collector (reference counting + marking for circularity).
(The Python GC is naive because it does not use sophisticated GC techniques; hence it is slower than, e.g., OCaml's or many JVM generational copying garbage collectors; read the GC Handbook for more. However, the Python GC is much friendlier to external C code.)

Keep in mind that interpreted languages usually don't flatten types in memory the way compiled languages do. The memory layout is (probably) completely different from that of the raw data. Therefore, you cannot simply cast raw data to a class instance or vice versa. You have to read the raw data, interpret it, and fill your objects manually.
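For example, a minimal sketch using the standard struct module (the record layout, field names and DataStruct class here are invented for illustration):
import struct

# Hypothetical wire format: a 4-byte little-endian id followed by an 8-byte double.
RECORD_FORMAT = "<id"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 12 bytes

class DataStruct:
    def __init__(self, record_id, value):
        self.record_id = record_id
        self.value = value

def parse_record(raw):
    # Read the raw bytes, interpret them, and fill a Python object manually.
    record_id, value = struct.unpack(RECORD_FORMAT, raw[:RECORD_SIZE])
    return DataStruct(record_id, value)

data = parse_record(struct.pack("<id", 7, 3.14))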

Related

Is there a way to change a Python object's byte representation at run-time?

The goal is to simulate a high-radiation environment.
Normally, code like the following:
a = 5
print(a)
print(a)
would print:
5
5
I want to be able to change the underlying byte representation of a randomly during runtime (according to some predefined function that takes a seed). In that case, the following code:
a = RandomlyChangingInteger(5)
print(a)
print(a)
could result in:
4
2
One way this can be done for languages like C and C++ is to insert extra instructions that could potentially modify a, before every usage of a in the compiled code.
Something like BITFLIPS (which uses valgrind) is what I'm thinking about.
Is this even possible in Python?
You can do it, sort of. The built-in int is immutable, therefore you cannot modify its value. You can, however, create a custom class that emulates an int:
import random

class RandomlyChangingInteger(object):
    def __int__(self):
        return random.randint(0, 10)

    def __str__(self):
        return str(self.__int__())
then
a = RandomlyChangingInteger()
print(a)
print(a)
should print something like
4
5
Note that you can't use this class to do math as it stands. You must implement the other int methods (such as __add__, __mul__, etc.) first.
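For instance, a rough sketch of what adding a couple of those methods could look like (the exact semantics, e.g. re-randomizing on every read, are up to you):
import random

class RandomlyChangingInteger(object):
    def __int__(self):
        return random.randint(0, 10)

    def __str__(self):
        return str(self.__int__())

    def __add__(self, other):
        # every read of self produces a fresh random value
        return self.__int__() + int(other)

    def __mul__(self, other):
        return self.__int__() * int(other)

a = RandomlyChangingInteger()
print(a + 1)   # e.g. 7
print(a * 2)   # e.g. 18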
You're trying to simulate radiation-induced bitflips, but your expectations of what that would do are way off target. Radiation effects are much more likely to crash a Python program than they are to change an object's value to another valid value. This makes simulating radiation effects not very useful.
The CPython implementation relies on so many pointers and so much indirection that after a few bit flips in your data, at least one of them is almost certain to hit something that causes a crash. Perhaps corrupting an object's type pointer, causing a bad memory access the next time you try to do almost anything with the object, or perhaps corrupting a reference count, causing an object to be freed while still in use. Maybe corrupting the length of an int (Python ints are variable-width), causing Python to try to read past the end of the allocation.
Whereas a C array of ints might just be a giant block of numerical data, in which random bit corruption could be detected or managed, a Python list of ints is mostly pointers and other metadata.
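A rough way to see that overhead from Python itself (a sketch; exact byte counts vary between CPython builds, and small-int caching skews the sum a little):
import sys
from array import array

ints = list(range(1000))
packed = array("i", range(1000))   # C-style contiguous block of machine ints

# The list stores ~1000 pointers plus a separate int object per element.
list_bytes = sys.getsizeof(ints) + sum(sys.getsizeof(n) for n in ints)
array_bytes = sys.getsizeof(packed)  # one buffer of raw numbers

print(list_bytes, array_bytes)     # roughly tens of kilobytes vs. about 4 KB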
If you really want to simulate random bit flips, the best way to go would likely be to rebuild CPython with a tool like the BITFLIPS thing you linked.

How is Python statement x=x+1 implemented?

In C, the statement x=x+1 changes the content of the memory that is allocated for x. But in Python, since a variable can have different types, the x on the left and right sides of = may be of different types, which means they may refer to different pieces of memory. If so, after x changes its reference from the old memory to the new memory, the old memory can be reclaimed by the garbage collection mechanism. If that is the case, the following code may trigger the garbage collection process many times and thus be very inefficient:
for i in range(1000000000):
    i = i + 1
Is my guess correct?
Update:
I need to correct the typo in the code to make the question clearer:
x = 0
for i in range(1000000000):
    x = x + 1
@SvenMarnach, do you mean that the integers 0, 1, 2, ..., 999999999 (which the label x once referred to) all exist in memory if garbage collection is not activated?
id can be used to track the 'allocation' of memory to objects. It should be used with caution, but here I think it's illuminating. id is a bit like a C pointer - that is, somehow related to 'where' the object is located in memory.
In [18]: for i in range(0, 1000, 100):
    ...:     print(i, id(i))
    ...:     i = i + 1
    ...:     print(i, id(i))
    ...:
0 10914464
1 10914496
100 10917664
101 10917696
200 10920864
201 10920896
300 140186080959760
301 140185597404720
400 140186080959760
401 140185597404720
...
900 140186080959760
901 140185597404720
In [19]: id(1)
Out[19]: 10914496
Small integers (-5 through 256) are cached - that is, the integer 1, once created, is 'reused'.
In [20]: id(202)
Out[20]: 10920928 # same id as in the loop
In [21]: id(302)
Out[21]: 140185451618128 # different id
In [22]: id(901)
Out[22]: 140185597404208
In [23]: id(i)
Out[23]: 140185597404720 # = 901, but different id
In this loop, the first few iterations create or reuse small integers. But it appears that when creating larger integers, memory is being 'reused'. It may not be full-blown garbage collection, but the interpreter is somehow optimized to avoid unnecessary memory use.
Generally, Python programmers don't focus on those details. Write clean, reliable Python code. In this example, modifying the iteration variable inside the loop is poor practice (even if it is just an example).
You are mostly correct, though I think a few clarifications may help.
First, the concept of a variable is rather different in C and in Python. In C, a variable generally references a fixed location in memory, as you stated yourself. In Python, a variable is just a label that can be attached to any object. An object can have multiple such labels, or none at all, and labels can be freely moved between objects. An assignment in C copies a new value to a memory location, while an assignment in Python attaches a new label to an object.
Integers are also very different in the two languages. In C, an integer has a fixed size and stores its value in a format native to the hardware. In Python, integers have arbitrary precision. They are stored as an array of "digits" (usually 30-bit chunks in CPython) together with a header storing type information. Bigger integers occupy more memory than smaller integers.
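You can see this directly with sys.getsizeof (the exact byte counts below depend on the CPython build; these are typical for 64-bit CPython 3):
import sys

print(sys.getsizeof(1))        # e.g. 28 bytes
print(sys.getsizeof(2**30))    # one more 30-bit digit, e.g. 32 bytes
print(sys.getsizeof(2**100))   # several digits, e.g. 40 bytes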
Moreover, integer objects in Python are immutable – they can't be changed once created. This means every arithmetic operation creates a new integer object. So the loop in your code indeed creates a new integer object in each iteration.
However, this isn't the only overhead. It also creates a new integer object for i in each iteration, which is dropped at the end of the loop body. And the arithmetic operation is dynamic – Python needs to look up the type of x and its __add__() method in each iteration to figure out how to add objects of this type. And function call overhead in Python is rather high.
Garbage collection and memory allocation on the other hand are rather fast in CPython. Garbage collection for integers relies completely on reference counting (no reference cycles possible here), which is fast. And for allocation, CPython uses an arena allocator for small objects that can quickly reuse memory slots without calling the system allocator.
So in summary, yes, compared to the same code in C, this code will run awfully slow in Python. A modern C compiler would simply compute the result of this loop at compile time and load the result to a register, so it would finish basically immediately. If raw speed for integer arithmetic is what you want, don't write that code in Python.
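If you want to see the per-iteration work the interpreter does for x = x + 1, the standard dis module shows the bytecode (the exact opcode names differ between CPython versions):
import dis

def count(n):
    x = 0
    for i in range(n):
        x = x + 1
    return x

dis.dis(count)
# Each iteration runs a generic add instruction (BINARY_ADD, or BINARY_OP in
# newer CPython), which dispatches on the operands' types at run time and
# creates a new int object for the result.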

How can I call values of a np.array rather than memory address? [duplicate]

This question already has an answer here:
Python The appended element in the list changes as its original variable changes
(1 answer)
Closed 4 years ago.
In the following code, I intended to build a list, starting from an empty one, by appending (random) numpy arrays. I initialized a numpy array variable 'sample_pt' to serve as a temporary holding each (random) numpy array. While I expected to get a list of different random numpy arrays, the output was a list filled with the same (final) numpy array. I suspect that referring to a numpy array by its variable name yields its memory address. Am I on the right track, or is there anything else that would be good to know?
[Code]
import numpy as np

sample_pt = np.array([0.])  # initial point
sample_list = []
number_iter = 3
for _ in range(number_iter):
    sample_pt[0] = np.random.randn()
    sample_list.append(sample_pt)
    print(sample_list)
[Output]
[array([-0.78614157])]
[array([0.7172035]), array([0.7172035])]
[array([0.47565398]), array([0.47565398]), array([0.47565398])]
I don't know what you mean by "call values", or "rather than memory address", or… most of the text of your question.
But the problem is pretty simple. You're appending the same array over and over, instead of creating new ones.
If you want to create a new array, you have to do that explicitly. Which is trivial to do; just move the np.array constructor into the loop, like this:
sample_list = []
number_iter = 3
for _ in range(number_iter):
    sample_pt = np.array([0.])  # initial point, created anew each iteration
    sample_pt[0] = np.random.randn()
    sample_list.append(sample_pt)
    print(sample_list)
But this can be dramatically simplified.
First, instead of creating an array of 1 zero and then replacing that zero, why not just create an array of the element you want?
sample_pt = np.array([np.random.randn()])
Or, even better, why not just let np.random build the array for you?
sample_pt = np.random.randn(1)
At which point you could replace the whole thing with a list comprehension:
number_iter = 3
sample_list = [np.random.randn(1) for _ in range(number_iter)]
Or, even better, why not make a 3x1 array instead of a list of 3 single-element arrays?
number_iter = 3
sample_array = np.random.randn(number_iter, 1)
If you really need to change that into a list of 3 arrays for some reason, you can always call list on it later:
sample_list = list(sample_array)
… or right at the start:
sample_list = list(np.random.randn(number_iter, 1))
Meanwhile, I think you misunderstand how values and variables work in Python.
First, forget about "memory address" for a second:
An object is a value, with a type, somewhere in the heap. You don't care where.
Variables don't have memory addresses, or types; they're just names in some namespace (globals, locals, attributes of some instance, etc.) that refer to some value somewhere.
Notice that this is very different from, say, C++, where variables are typed memory locations, and objects live in those memory locations. This means there's no "copy constructor" or "assignment operator" or anything like that in Python. When you write a = b, all that means is that a is now another name for the same value as b. If you want a copy, you have to explicitly ask for a copy.
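A quick sketch of the difference between aliasing, shallow copies, and deep copies:
import copy

b = [1, 2, [3, 4]]
a = b                      # just another name for the same list
shallow = list(b)          # new outer list, but it shares the element objects
deep = copy.deepcopy(b)    # new list and new nested objects

b[2].append(5)
print(a[2])        # [3, 4, 5] - a is the same object as b
print(shallow[2])  # [3, 4, 5] - the shallow copy shares the nested list
print(deep[2])     # [3, 4]    - the deep copy does not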
Now, if you look at how CPython implements things under the hood:
The CPython interpreter represents all objects as pointers to PyObject structs, which are always allocated on the heap.
Variables are just string keys in a dict, owned by the module (for globals), an instance (for attributes), or whatever. The values in the dict are just objects like any other. Which means that, under the covers, what's actually stored in the hash table is pointers to string objects for the variable names in the keys, and pointers to whatever value you've assigned in the values.
There is a special optimization for locals, involving an array of object pointers stored on the frame, but you usually don't have to worry about that.
There's another special trick for closure captures, involving pointers to cell objects that hold pointers to the actual objects, which you have to worry about even less often.
As you can see, thinking about the pointers is harder to understand, and potentially misleading, unless you really care about how CPython works under the covers.
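You can peek at some of this machinery from Python itself, purely for illustration (none of it is needed for everyday code):
x = 42
print(globals()["x"])        # module-level names really are entries in a dict

def make_counter():
    count = 0
    def bump():
        nonlocal count
        count += 1
        return count
    return bump

counter = make_counter()
print(counter.__closure__)                   # a tuple of cell objects
print(counter.__closure__[0].cell_contents)  # 0, the captured value of count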

memory consumption and lifetime of temporaries

I have Python code where the memory consumption steadily grows with time. While there are several objects which can legitimately grow quite large, I'm trying to understand whether the memory footprint I'm observing is due to these objects, or whether I am just littering memory with temporaries that don't get properly disposed of. Being a recent convert from a world of manual memory management, I guess I just don't exactly understand some very basic aspects of how the Python runtime deals with temporary objects.
Consider code with roughly this general structure (I'm omitting irrelevant details):
def tweak_list(lst):
    new_lst = copy.deepcopy(lst)
    if numpy.random.rand() > 0.5:
        new_lst[0] += 1  # in real code, the operation is a little more sensible :-)
        return new_lst
    else:
        return lst
lst = [1, 2, 3]
cache = {}

# main loop
for step in xrange(some_large_number):
    lst = tweak_list(lst)      # <<-----(1)
    # do something with lst here, cut out for clarity
    cache[tuple(lst)] = 42     # <<-----(2)
    if step % chunk_size == 0:
        # dump the cache dict to a DB, free the memory (?)
        cache = {}             # <<-----(3)
Questions:
What is the lifetime of new_lst created in tweak_list? Will it be destroyed on exit from the function, or will it be garbage collected (and if so, at what point)? Will repeated calls to tweak_list generate a gazillion small lists lingering around for a long time?
Is there a temporary creation when converting a list to a tuple to be used as a dict key?
Will setting a dict to an empty one release the memory?
Or, am I approaching the issue at hand from a completely wrong perspective?
new_lst is cleaned up when the function exits, if it is not returned. Its reference count drops to 0, and it can be garbage collected. On current CPython implementations that happens immediately.
If it is returned, the value referenced by new_lst replaces lst; the list previously referred to by lst sees its reference count drop by 1, but the value originally referred to by new_lst is still being referred to by another variable.
The tuple() key is a value stored in the dict, so that's not a temporary. No extra objects are created other than that tuple.
Replacing the old cache dict with a new one reduces its reference count by one. If cache was the only reference to the dict, it'll be garbage collected. This then causes the reference count for all contained tuple keys to drop by one. If nothing else references those, they will be garbage collected.
Note that when Python frees memory, that does not necessarily mean the operating system reclaims it immediately. Most operating systems will only reclaim the memory when it is needed for something else, instead presuming the program might need some or all of that memory again soon.
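A small sketch that makes the "immediately" visible (this relies on CPython's reference counting; Tracked and tweak are made-up names standing in for your deepcopy temporary):
class Tracked(list):
    def __del__(self):
        print("temporary list collected")

def tweak(lst):
    new_lst = Tracked(lst)   # stand-in for the deepcopy temporary
    new_lst[0] += 1
    return lst               # new_lst is NOT returned

result = tweak([1, 2, 3])
print("back in the caller")
# On CPython, "temporary list collected" prints before "back in the caller":
# new_lst's reference count hits zero as soon as tweak() returns.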
You might want to take a look at Heapy as a way of profiling memory usage. I think PySizer is also used in some instances for this, but I am not familiar with it. objgraph is also a strong tool to take a look at.

Deletion of a list in python with and without ':' operator

I've been working with Python for quite a bit of time and I'm confused about a few issues in the areas of garbage collection and memory management, as well as the real deal with deleting variables and freeing memory.
>>> pop = range(1000)
>>> p = pop[100:700]
>>> del pop[:]
>>> pop
[]
>>> p
[100, 101, 102, ..., 699]
In the above piece of code, this happens. But,
>>> pop = range(1000)
>>> k = pop
>>> del pop[:]
>>> pop
[]
>>> k
[]
Here in the 2nd case, it implies that k is just pointing to the list 'pop'.
First Part of the question :
But what's happening in the 1st code block? Is the memory containing the elements pop[100:700] not getting deleted, or is it duplicated when the list 'p' is created?
Second Part of the question :
Also, I've tried including gc.enable and gc.collect statements in between wherever possible, but there's no change in the memory utilization in either case. This is kind of puzzling. Isn't it bad that Python is not returning free memory back to the OS? Correct me if I'm wrong in the little research I've done. Thanks in advance.
Slicing a sequence results in a new sequence, with a shallow copy of the appropriate elements.
Returning the memory to the OS might be bad, since the script may turn around and create new objects, at which point Python would have to request the memory from the OS again.
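A quick way to convince yourself of the first point (a sketch using is):
pop = list(range(1000))
p = pop[100:700]           # a new list object ...
print(p is pop)            # False: different list
print(p[599] is pop[699])  # True: ... that shares (references to) the same elements

del pop[:]                 # empties pop in place
print(len(pop), len(p))    # 0 600 - p still holds its own references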
1st part:
In the 1st code block, you create a new list object into which references to the elements of the old one are copied, before you empty the old one.
In the 2nd code block, however, you just assign a reference to the same object to another variable. Then you empty the list, which, of course, is visible via both references.
2nd part: Memory is returned when appropriate, but not always. Under the hood of Python, there is a memory allocator which has control over where the memory comes from. There are 2 ways: via the brk()/sbrk() mechanism (for smaller memory blocks) and via mmap() (larger blocks).
Here we have rather small blocks which get allocated directly at the end of the data segment:
data data data | object1 object1 | object2 object2
If we only free object1, we have a memory gap which can be reused for the next object, but it cannot easily be freed and returned to the OS.
If we free both objects, the memory could be returned. But there is probably a threshold for keeping memory around for a while, because returning everything immediately is not always the best strategy.
