import numpy as np

def foo():
    x = np.ones((10, 10))
    return x[:5, :5]
If I call y = foo() I'll get a 5x5 array (1/4 of the values in x). But what happens to the other values in x: do they persist in memory, or are they garbage collected in some way? I'd like to understand this.
As kindall says in the comments, basic slicing on a NumPy array creates a view of the original array. The view has to keep the entire original object alive; you can see the reference it uses to do so in the view's base attribute.
In [2]: x = numpy.ones((10, 10))
In [3]: y = x[:5, :5]
In [4]: y.base is x
Out[4]: True
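If the goal is to let the rest of x be reclaimed, one option (a small sketch, not part of the original question) is to copy the slice before returning it; the copy owns its own data and keeps no reference to the original, so the full 10x10 array can be garbage collected once foo returns:
import numpy as np

def foo_copy():
    x = np.ones((10, 10))
    # .copy() materialises the slice into a new array that owns its data,
    # so nothing keeps the full 10x10 buffer alive after the function returns
    return x[:5, :5].copy()

y = foo_copy()
print(y.base is None)    # True: y is not a view of anything
print(y.flags.owndata)   # True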
I am a beginner and ran into some confusion while learning Python. If I have the following Python code:
import numpy as np
X = np.array([1,0,0])
Y = X
X[0] = 2
print Y
Y will be shown to be array([2, 0, 0])
However, if I do the following:
import numpy as np
X = np.array([1,0,0])
Y = X
X = 2*X
print Y
Y is still array([1,0,0])
What is going on?
Think of it this way:
the equals sign in Python assigns references.
Y = X makes Y point to the same object X points to.
X[0] = 2 makes X[0] refer to 2.
X = 2*X makes X point to a new object, but Y still points to the original array, so Y is unchanged.
This isn't exactly true, but it's close enough to understand the principle.
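One quick way to check this yourself (a minimal sketch of the behaviour described above) is to look at id(), which identifies the object each name currently refers to:
import numpy as np

X = np.array([1, 0, 0])
Y = X
print(id(X) == id(Y))   # True: the same object sits behind both names

X[0] = 2                # changes that one shared object
X = 2 * X               # X now names a different object
print(id(X) == id(Y))   # False: Y still names the original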
That's because X and Y are references to the same object, the array created by np.array([1,0,0]). This means that it does not matter whether an operation is performed through X or Y: the result is the same. Rebinding one of the names, however, has no effect on the other.
If you write:
X = np.array([1,0,0])
Y = X
basically what happens is that there are two local variables X and Y that refer to the same object. So the memory looks like:
+--------+
Y -> |np.array| <- X
+--------+
|[1,0,0] |
+--------+
Now if you do X[0] = 2 that is basically short for:
X.__setitem__(0,2)
so you call a method on the object. So now the memory looks like:
+--------+
Y -> |np.array| <- X
+--------+
|[2,0,0] |
+--------+
If you however write:
X = 2*X
first 2*X is evaluated. Now 2*X is short for:
X.__rmul__(2)
(Python first checks whether 2 supports __mul__ with an array operand; since int.__mul__ returns NotImplemented in that case, Python falls back to X.__rmul__.) Now X.__rmul__ does not change X: it leaves X intact, constructs a new array, and returns that.
The result is a new array object, array([4, 0, 0]), and the name X is then rebound to it. So now the memory looks like:
+--------+ +--------+
Y -> |np.array| X ->|np.array|
+--------+ +--------+
|[2,0,0] | |[4,0,0] |
+--------+ +--------+
But as you can see, Y still references the old object.
This is more about convention and names than reference and value.
When you assign:
Y = X
Then the name Y refers to the object that the name X points to. In a sense, the names X and Y point to the same object:
X is Y # True
The is checks if the names point to the same object!
Then it gets tricky: you do some operations on the arrays.
X[0] = 2
That's called "item assignment" and calls
X.__setitem__(0, 2)
What __setitem__ should do (convention) is to update some value in the container X. So X should still point to the same object afterwards.
However, X * 2 is "multiplication", and the convention states that this should create a new object (again, only a convention; you can change that behaviour by overriding X.__mul__). So when you do
X = X * 2
The name X now refers to the new object that X * 2 created:
X is Y # False
Normally, common libraries follow these conventions, but it's important to highlight that you can completely change this!
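To make that last point concrete, here is a deliberately contrived sketch (a made-up class, nothing NumPy actually does) in which __mul__ breaks the convention and mutates the object in place:
class Mutating:
    def __init__(self, data):
        self.data = data
    def __mul__(self, factor):
        # breaks the convention: mutates self instead of building a new object
        self.data = [v * factor for v in self.data]
        return self

a = Mutating([1, 2, 3])
b = a
a = a * 2
print(b.data)  # [2, 4, 6] -- b sees the change because __mul__ mutated in place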
When you say X = np.array([1, 0, 0]), you create an object that has some methods and some internal buffers that contain the actual data and other information in it.
Doing Y = X sets Y to refer to the same actual object. This is called binding to a name in Python. You have bound the same object that was bound to X to the name Y as well.
Doing X[0] = 2 calls the object's __setitem__ method, which does some stuff to the underlying buffers. It modifies the object in place. Now when you print the values of either X or Y, the numbers that come out of that object's buffers are 2, 0, 0.
Doing X = 2 * X translates to X.__rmul__(2). This method does not modify X in place. It creates and returns a new array object, each of whose elements is twice the corresponding element of X. Then you bind the new object to the name X. However, the name Y is still bound to the original array because you have not done anything to change that. As an aside, X.__rmul__ is used because 2.__mul__(X) returns NotImplemented. NumPy arrays naturally define multiplication to be commutative, so X.__mul__ and X.__rmul__ should do the same thing.
It is interesting to note that you can also do X *= 2, which will propagate the changes to Y. This is because the *= operator translates to the __imul__ method, which does modify the input in place.
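A short check of that difference, assuming NumPy is imported as np:
import numpy as np

X = np.array([1, 0, 0])
Y = X

X *= 2          # __imul__: modifies the shared buffer in place
print(Y)        # [2 0 0]
print(X is Y)   # True

X = 2 * X       # __rmul__: builds a new array and rebinds the name X
print(Y)        # [2 0 0] -- unchanged
print(X is Y)   # False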
I'm trying to create efficient broadcast arrays in numpy, e.g. a set of shape=[1000,1000,1000] arrays that have only 1000 elements, but repeated 1e6 times. This can be achieved both through np.lib.stride_tricks.as_strided and np.broadcast_arrays.
However, I am having trouble verifying that there is no duplication in memory, and this is critical since tests that actually duplicate the arrays in memory tend to crash my machine leaving no traceback.
I've tried examining the size of the arrays using .nbytes, but that doesn't seem to correspond to the actual memory usage:
>>> import numpy as np
>>> import resource
>>> initial_memuse = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> pagesize = resource.getpagesize()
>>>
>>> x = np.arange(1000)
>>> memuse_x = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of x = {0} MB".format(x.nbytes/1e6))
Size of x = 0.008 MB
>>> print("Memory used = {0} MB".format((memuse_x-initial_memuse)*resource.getpagesize()/1e6))
Memory used = 150.994944 MB
>>>
>>> y = np.lib.stride_tricks.as_strided(x, [1000,10,10], strides=x.strides + (0, 0))
>>> memuse_y = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of y = {0} MB".format(y.nbytes/1e6))
Size of y = 0.8 MB
>>> print("Memory used = {0} MB".format((memuse_y-memuse_x)*resource.getpagesize()/1e6))
Memory used = 201.326592 MB
>>>
>>> z = np.lib.stride_tricks.as_strided(x, [1000,100,100], strides=x.strides + (0, 0))
>>> memuse_z = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of z = {0} MB".format(z.nbytes/1e6))
Size of z = 80.0 MB
>>> print("Memory used = {0} MB".format((memuse_z-memuse_y)*resource.getpagesize()/1e6))
Memory used = 0.0 MB
So .nbytes reports the "theoretical" size of the array, but apparently not the actual size. The resource checking is a little awkward, as it looks like there are some things being loaded & cached (perhaps?) that result in the first striding taking up some amount of memory, but future strides take none.
tl;dr: How do you determine the actual size of a numpy array or array view in memory?
One way would be to examine the .base attribute of the array, which references the object from which an array "borrows" its memory. For example:
x = np.arange(1000)
print(x.flags.owndata) # x "owns" its data
# True
print(x.base is None) # its base is therefore 'None'
# True
a = x.reshape(100, 10) # a is a reshaped view onto x
print(a.flags.owndata) # it therefore "borrows" its data
# False
print(a.base is x) # its .base is x
# True
Things are slightly more complicated with np.lib.stride_tricks:
b = np.lib.stride_tricks.as_strided(x, [1000,100,100], strides=x.strides + (0, 0))
print(b.flags.owndata)
# False
print(b.base)
# <numpy.lib.stride_tricks.DummyArray object at 0x7fb40c02b0f0>
Here, b.base is a numpy.lib.stride_tricks.DummyArray instance, which looks like this:
class DummyArray(object):
    """Dummy object that just exists to hang __array_interface__ dictionaries
    and possibly keep alive a reference to a base array.
    """

    def __init__(self, interface, base=None):
        self.__array_interface__ = interface
        self.base = base
We can therefore examine b.base.base:
print(b.base.base is x)
# True
Once you have the base array then its .nbytes attribute should accurately reflect the amount of memory it occupies.
In principle it's possible to have a view of a view of an array, or to create a strided array from another strided array. Assuming that your view or strided array is ultimately backed by another numpy array, you could recursively reference its .base attribute. Once you find an object whose .base is None, you have found the underlying object from which your array is borrowing its memory:
def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes
As expected,
print(find_base_nbytes(x))
# 8000
print(find_base_nbytes(y))
# 8000
print(find_base_nbytes(z))
# 8000
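As a complementary check, np.shares_memory (available in NumPy 1.11 and later) can confirm that a strided or broadcast view reuses the original buffer rather than duplicating it; a small sketch:
x = np.arange(1000)
y = np.lib.stride_tricks.as_strided(x, [1000, 10, 10], strides=x.strides + (0, 0))
b = np.broadcast_to(x, (1000, 1000))   # read-only broadcast view of x

print(np.shares_memory(x, y))   # True: y reuses x's 8000-byte buffer
print(np.shares_memory(x, b))   # True: so does the broadcast view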
Let's suppose that we have a Python function that takes in Numpy arrays and returns another array:
import numpy as np
def f(x, y, method='p'):
    """Parameters: x (np.ndarray), y (np.ndarray), method (str)
    Returns: np.ndarray"""
    z = x.copy()
    if method == 'p':
        mask = x < 0
    else:
        mask = x > 0
    z[mask] = 0
    return z * y
although the actual implementation does not matter. We can assume that x and y will always be arrays of the same shape, and that the output is of the same shape as x.
The question is what would be the simplest/most elegant way of wrapping such function so it would work with both ND arrays (N>1) and scalar arguments, in a manner somewhat similar to universal functions in Numpy.
For instance, the expected output for the above function should be,
In [1]: f_ufunc(np.arange(-1,2), np.ones(3), method='p')
Out[1]: array([ 0., 0., 1.]) # random array input -> output of the same shape
In [2]: f_ufunc(np.array([1]), np.array([1]), method='p')
Out[2]: array([1]) # array input of len 1 -> output of len 1
In [3]: f_ufunc(1, 1, method='p')
Out[3]: 1 # scalar input -> scalar output
The function f cannot be changed, and it will fail if given a scalar argument for x or y.
When x and y are scalars, we transform them to 1D arrays, do the calculation then transform them back to scalars at the end.
f is optimized to work with arrays, scalar input being mostly a convenience. So writing a function that works with scalars and then using np.vectorize or np.frompyfunc would not be acceptable.
A beginning of an implementation could be,
def atleast_1d_inverse(res, x):
    # this function fails in some cases (see point 1 below).
    if res.shape[0] == 1:
        return res[0]
    else:
        return res

def ufunc_wrapper(func, args=[]):
    """ func: the wrapped function
        args: arguments of func to which we apply np.atleast_1d """
    # this needs to be generated dynamically depending on the definition of func
    def wrapper(x, y, method='p'):
        # we apply np.atleast_1d to the variables given in args
        x = np.atleast_1d(x)
        y = np.atleast_1d(y)
        res = func(x, y, method=method)
        return atleast_1d_inverse(res, x)
    return wrapper

f_ufunc = ufunc_wrapper(f, args=['x', 'y'])
which mostly works, but will fail test 2 above, producing a scalar output instead of a vector one. If we want to fix that, we would need to add more tests on the type of the input (e.g. isinstance(x, np.ndarray), x.ndim > 0, etc.), but I'm afraid of forgetting some corner cases there. Furthermore, the above implementation is not generic enough to wrap a function with a different number of arguments (see point 2 below).
This seems to be a rather common problem when working with Cython / f2py functions, and I was wondering if there is a generic solution for this somewhere?
Edit: a bit more precision, following @hpaulj's comments. Essentially, I'm looking for:
1. A function that would be the inverse of np.atleast_1d, such that atleast_1d_inverse(np.atleast_1d(x), x) == x, where the second argument is only used to determine the type or the number of dimensions of the original object x. Returning NumPy scalars (i.e. arrays with ndim = 0) instead of a Python scalar is OK.
2. A way to inspect the function f and generate a wrapper that is consistent with its definition. For instance, such a wrapper could be used as,
f_ufunc = ufunc_wrapper(f, args=['x', 'y'])
and then if we have a different function def f2(x, option=2): return x**2, we could also use
f2_ufunc = ufunc_wrapper(f2, args=['x']).
Note: the analogy with ufuncs might be a bit limited, as this corresponds to the opposite problem. Instead of having a scalar function that we transform to accept both vector and scalar input, I have a function designed to work with vectors (that can be seen as something that was previously vectorized), that I would like to accept scalars again, without changing the original function.
This doesn't fully answer the question of making a vectorized function truly behave like a ufunc, but I did recently run into a slight annoyance with numpy.vectorize that sounds similar to your issue. That wrapper insists on returning an array (with ndim=0 and shape=()) even if passed scalar inputs.
But it appears that the following does the right thing. In this case I am vectorizing a simple function that rounds a floating-point value to a certain number of significant digits.
def signif(x, digits):
    return round(x, digits - int(np.floor(np.log10(abs(x)))) - 1)

def vectorize(f):
    vf = np.vectorize(f)
    def newfunc(*args, **kwargs):
        return vf(*args, **kwargs)[()]
    return newfunc
vsignif = vectorize(signif)
This gives
>>> vsignif(0.123123, 2)
0.12
>>> vsignif([[0.123123, 123.2]], 2)
array([[ 0.12, 120. ]])
>>> vsignif([[0.123123, 123.2]], [2, 1])
array([[ 0.12, 100. ]])
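For a function like f above, which should not be re-vectorized element-wise, a similar effect can be sketched by hand: promote the arguments with np.atleast_1d, and undo the promotion only when the caller actually passed scalars. The wrapper name scalar_compat below is made up for illustration:
import numpy as np

def scalar_compat(func):
    def wrapper(x, y, **kwargs):
        scalar_input = np.ndim(x) == 0 and np.ndim(y) == 0
        res = func(np.atleast_1d(x), np.atleast_1d(y), **kwargs)
        # collapse back to a scalar only for genuine scalar input,
        # so length-1 array inputs keep their shape
        return res[0] if scalar_input else res
    return wrapper

f_ufunc = scalar_compat(f)
print(f_ufunc(np.arange(-1, 2), np.ones(3), method='p'))  # [0. 0. 1.]
print(f_ufunc(np.array([1]), np.array([1]), method='p'))  # [1]
print(f_ufunc(1, 1, method='p'))                          # 1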
This may be a well-known question answered in some FAQ, but I can't find the solution by googling. I'm trying to write a scalar function of a scalar argument that also accepts an ndarray argument. The function should check its argument for domain correctness, because a domain violation may cause an exception. This example demonstrates what I tried to do:
import numpy as np
def f(x):
    x = np.asarray(x)
    y = np.zeros_like(x)
    y[x > 0.0] = 1.0 / x[x > 0.0]
    return y
print f(1.0)
On the assignment y[x>0.0] = ... Python says 0-d arrays can't be indexed.
What's the right way to handle this?
This will work fine in NumPy >= 1.9 (not released as of writing this). On previous versions you can work around it with an extra np.asarray call:
x[np.asarray(x > 0)] = 0
Could you call f([1.0]) instead?
Otherwise you can do:
x = np.asarray(x)
if x.ndim == 0:
x = x[..., None]
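Putting those pieces together, a minimal sketch of the original function that accepts scalars as well as arrays (using np.atleast_1d instead of the manual ndim check) could look like this:
import numpy as np

def f(x):
    x = np.atleast_1d(np.asarray(x, dtype=float))  # promote scalars to 1-d arrays
    y = np.zeros_like(x)
    mask = x > 0.0
    y[mask] = 1.0 / x[mask]   # divide only where the domain check passes
    return y

print(f(1.0))           # [ 1.]
print(f([-2.0, 4.0]))   # [ 0.    0.25]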
This behavior has me puzzled:
import code
class foo():
    def __init__(self):
        self.x = 1

    def interact(self):
        v = globals()
        v.update(vars(self))
        code.interact(local=v)
c = foo()
c.interact()
Python 2.6.6 (r266:84292, Sep 11 2012, 08:34:23)
(InteractiveConsole)
>>> id(x)
29082424
>>> id(c.x)
29082424
>>> x
1
>>> c.x
1
>>> x=2
>>> c.x
1
Why doesn't 'c.x' behave like an alias for 'x'? If I understand the id() function correctly, they are both at the same memory address.
Small integers from -5 to 256 are cached in Python, i.e. their id() is always going to be the same.
From the docs:
The current implementation keeps an array of integer objects for all
integers between -5 and 256, when you create an int in that range you
actually just get back a reference to the existing object.
>>> x = 1
>>> y = 1 #same id() here as integer 1 is cached by python.
>>> x is y
True
Update:
If two identifiers return the same value of id(), it doesn't mean they can act as aliases of each other; it depends entirely on the type of the object they are pointing to.
For an immutable object you cannot create an alias in Python. Rebinding one of the references to an immutable object will simply make it point to a new object, while the other references to the older object remain intact.
>>> x = y = 300
>>> x is y # x and y point to the same object
True
>>> x += 1 # modify x
>>> x # x now points to a different object
301
>>> y #y still points to the old object
300
A mutable object can be modified through any of its references, but those modifications must be in-place modifications.
>>> x = y = []
>>> x is y
True
>>> x.append(1) # list.append is an in-place operation
>>> y.append(2) # in-place operation
>>> x
[1, 2]
>>> y #works fine so far
[1, 2]
>>> x = x + [1] #not an in-place operation
>>> x
[1, 2, 1] #assigns a new object to x
>>> y #y still points to the same old object
[1, 2]
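Note that augmented assignment on a list is also an in-place operation: x += [...] calls list.__iadd__, which extends the existing object, unlike x = x + [...]:
>>> x = y = [1, 2]
>>> x += [3]      # in-place: extends the same list object
>>> y
[1, 2, 3]
>>> x = x + [4]   # builds a new list and rebinds x
>>> y
[1, 2, 3]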
code.interact simply did (effectively) x = c.x for you. So when you checked their ids, they were pointing to the exact same object. But x = 2 creates a new binding for the name x; it is not an alias. Python does not have aliases, as far as I am aware.
Yes, in CPython id(x) is the memory address of the object x points to. It is not the memory address of the variable x itself (which is, after all, just a key in a dictionary).
If I understand the id() function correctly, they are both at the same memory address.
You don't understand it correctly. id returns an integer for which the following guarantee holds: id(x) == id(y) if and only if x is y.
Accordingly, id tells you about the objects (values) that variables point to, not about the variables themselves.
Any relationship to memory addresses is purely an implementation detail. Python, unlike e.g. C, does not assume any particular relationship to the underlying machine (whether physical or virtual). Variables in Python are opaque and not accessible from the language (i.e. not first class).