How to subclass numpy.`ma.core.masked_array`? - python

I'm trying to write a subclass a masked_array. What I've got so far is this:
class gridded_array(ma.core.masked_array):
def __init__(self, data, dimensions, mask=False, dtype=None,
copy=False, subok=True, ndmin=0, fill_value=None,
keep_mask=True, hard_mask=None, shrink=True):
ma.core.masked_array.__init__(data, mask, dtype, copy, subok,
ndmin, fill_value, keep_mask, hard_mask,
shrink)
self.dimensions = dimensions
However, when now I create a gridded_array, I don't get what I expect:
dims = OrderedDict()
dims['x'] = np.arange(4)
gridded_array(np.random.randn(4), dims)
masked_array(data = [-- -- -- --],
mask = [ True True True True],
fill_value = 1e+20)
I would expect an unmasked array. I have the suspicion that the dimensions argument I'm passing gets passed on the the masked_array.__init__ call, but since I'm quite new to OOP, I don't know how to resolve this.
Any help is greatly appreciated.
PS: I'm on Python 2.7

A word of warning: if you're brand new to OOP, subclassing ndarrays and MaskedArrays is not the easiest way to get started, by far...
Before anything else, you should go and check this tutorial. That should introduce you to the mechanisms involved in subclassing ndarrays.
MaskedArrays, like ndarrays, uses the __new__ method for creating class instances, not the __init__. By the time you get to the __init__ of your subclass, you already have a fully instanciated object, with the actual initialization delegated to the __array_finalize__ method. In simpler terms: your __init__ doesn't work as you would expect with standard Python object. (actually, I wonder whether it's called at all... After __array_finalize__, if I recall correctly...)
Now that you've been warned, you may want to consider whether you really need to go through the hassle of subclassing a ndarray:
What are your objectives with your gridded_array?
Should you support all methods of ndarrays, or only some? All dtypes?
What should happen when you take a single element or a slice of your object?
Will you be using gridded_arrays extensively as inputs of NumPy functions ?
If you have a doubt, then it might be easier to design gridded_array as a generic class that takes a ndarray (or a MaskedArray) as attribute (say, gridded_array._array), and add only the methods you would need to operate on your self._array.
Suggestions
If you only need to "tag" each item of your gridded_array, you may be interested in pandas.
If you only have to deal with floats, MaskedArray might be a bit overkill: just use nans to represent invalid data, a lot of numpy functions have nans equivalent. At worst, you can always mask your gridded_array when needed: taking a view of a subclass of ndarray with .view(np.ma.MaskedArray) should return a masked version of your input...

The issue is that masked_array uses __new__ instead of __init__, so your dimensions argument is being misinterpreted.
To override __new__, use:
class gridded_array(ma.core.masked_array):
def __new__(cls, data, dimensions, *args, **kwargs):
self = super(gridded_array, cls).__new__(cls, data, *args, **kwargs)
self.dimensions = dimensions
return self

Related

Which way to call numpy functions should I prefer? [duplicate]

A few examples:
numpy.sum()
ndarray.sum()
numpy.amax()
ndarray.max()
numpy.dot()
ndarray.dot()
... and quite a few more. Is it to support some legacy code, or is there a better reason for that? And, do I choose only on the basis of how my code 'looks', or is one of the two ways better than the other?
I can imagine that one might want numpy.dot() to use reduce (e.g., reduce(numpy.dot, A, B, C, D)) but I don't think that would be as useful for something like numpy.sum().
As others have noted, the identically-named NumPy functions and array methods are often equivalent (they end up calling the same underlying code). One might be preferred over the other if it makes for easier reading.
However, in some instances the two behave different slightly differently. In particular, using the ndarray method sometimes emphasises the fact that the method is modifying the array in-place.
For example, np.resize returns a new array with the specified shape. On the other hand, ndarray.resize changes the shape of the array in-place. The fill values used in each case are also different.
Similarly, a.sort() sorts the array a in-place, while np.sort(a) returns a sorted copy.
In most cases the method is the basic compiled version. The function uses that method when available, but also has some sort of backup when the argument(s) is not an array. It helps to look at the code and/or docs of the function or method.
For example if in Ipython I ask to look at the code for the sum method, I see that it is compiled code
In [711]: x.sum??
Type: builtin_function_or_method
String form: <built-in method sum of numpy.ndarray object at 0xac1bce0>
...
Refer to `numpy.sum` for full documentation.
Do the same on np.sum I get many lines of documentation plus some Python code:
if isinstance(a, _gentype):
res = _sum_(a)
if out is not None:
out[...] = res
return out
return res
elif type(a) is not mu.ndarray:
try:
sum = a.sum
except AttributeError:
return _methods._sum(a, axis=axis, dtype=dtype,
out=out, keepdims=keepdims)
# NOTE: Dropping the keepdims parameters here...
return sum(axis=axis, dtype=dtype, out=out)
else:
return _methods._sum(a, axis=axis, dtype=dtype,
out=out, keepdims=keepdims)
If I call np.sum(x) where x is an array, it ends up calling x.sum():
sum = a.sum
return sum(axis=axis, dtype=dtype, out=out)
np.amax similar (but simpler). Note that the np. form can handle a an object that isn't an array (that doesn't have the method), e.g. a list: np.amax([1,2,3]).
np.dot and x.dot both show as 'built-in' function, so we can't say anything about priority. They probably both end up calling some underlying C function.
np.reshape is another that deligates if possible:
try:
reshape = a.reshape
except AttributeError:
return _wrapit(a, 'reshape', newshape, order=order)
return reshape(newshape, order=order)
So np.reshape(x,(2,3)) is identical in functionality to x.reshape((2,3)). But the _wrapit expression enables np.reshape([1,2,3,4],(2,2)).
np.sort returns a copy by doing an inplace sort on a copy:
a = asanyarray(a).copy()
a.sort(axis, kind, order)
return a
x.resize is built-in, while np.resize ends up doing a np.concatenate and reshape.
If your array is a subclass, like matrix or masked, it may have its own variant. The action of a matrix .sum is:
return N.ndarray.sum(self, axis, dtype, out, keepdims=True)._collapse(axis)
Elaborating on Peter's comment for visibility:
We could make it more consistent by removing methods from ndarray and sticking to just functions. But this is impossible because it would break everyone's existing code that uses methods.
Or, we could move all functions to also be methods. But this is impossible because new users and packages are constantly defining new functions. Plus continuing to multiply these duplicate methods violates "there should be one obvious way to do it".
If we could go back in time then I'd probably argue for not having these methods on ndarray at all, and using functions exclusively. ... So this all argues for using functions exclusively
numpy issue: More consistency with array-methods #7452

Avoid creating new arrays as results for numpy/scipy operations?

For doing repeated operations in numpy/scipy, there's a lot of overhead because most operation return a new object.
For example
for i in range(100):
x = A*x
I would like to avoid this by passing a reference to the operation, like you would in C
for i in range(100):
np.dot(A,x,x_new) #x_new would now store the result of the multiplication
x,x_new = x_new,x
Is there any way to do this? I would like this not for just mutiplication but all operations that return a matrix or a vector.
See Learning to avoid unnecessary array copies in IPython Books. From there, note e.g. these guidelines:
a *= b
will not produce a copy, whereas:
a = a * b
will produce a copy. Also, flatten() will copy, while ravel() only copies if necessary and returns a view otherwise (and thus should in general be preferred). reshape() also does not produce a copy, but returns a view.
Furthermore, as #hpaulj and #ali_m noted in their comments, many numpy functions support an out parameter, so have a look at the docs. From numpy.dot() docs:
out : ndarray, optional
Output argument.
This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be C-contiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.

Subclassing numpy scalar types

I'm trying to subclass numpy.complex64 in order to make use of the way numpy stores the data, (contiguous, alternating real and imaginary part) but use my own __add__, __sub__, ... routines.
My problem is that when I make a numpy.ndarray, setting dtype=mysubclass, I get a numpy.ndarray with dtype='numpy.complex64' in stead, which results in numpy not using my own functions for additions, subtractions and so on.
Example:
import numpy as np
class mysubclass(np.complex64):
pass
a = mysubclass(1+1j)
A = np.empty(2, dtype=mysubclass)
print type(a)
print repr(A)
Output:
<class '__main__.mysubclass'>
array([ -2.07782988e-20 +4.58546896e-41j, -2.07782988e-20 +4.58546896e-41j], dtype=complex64)'
Does anyone know how to do this?
Thanks in advance - Soren
The NumPy type system is only designed to be extended from C, via the PyArray_RegisterDataType function. It may be possible to access this functionality from Python using ctypes but I wouldn't recommend it; better to write an extension in C or Cython, or subclass ndarray as #seberg describes.
There's a simple example dtype in the NumPy source tree: newdtype_example/floatint.c. If you're into Pyrex, reference.pyx in the pytables source may be worth a look.
Note that scalars and arrays are quite different in numpy. np.complex64 (this is 32-bit float, just to note, not double precision). You will not be able to change the array like that, you will need to subclass the array instead and then override its __add__ and __sub__.
If that is all you want to do, it should just work otherwise look at http://docs.scipy.org/doc/numpy/user/basics.subclassing.html since subclassing an array is not that simple.
However if you want to use this type also as a scalar. For example you want to index scalars out, it gets more difficult at least currently. You can get a little further by defining __array_wrap__ to convert to scalars to your own scalar type for some reduce functions, for indexing to work in all cases it appears to me that you may have define your own __getitem__ currently.
In all cases with this approach, you still use the complex datatype, and all functions that are not explicitly overridden will still behave the same. #ecatmur mentioned that you can create new datatypes from the C side, if that is really what you want.

Python-numpy test for ndarray using ndim

I'm working on a project in Python requiring a lot of numerical array calculations. Unfortunately (or fortunately, depending on your POV), I'm very new to Python, but have been doing MATLAB and Octave programming (APL before that) for years. I'm very used to having every variable automatically typed to a matrix float, and still getting used to checking input types.
In many of my functions, I require the input S to be a numpy.ndarray of size (n,p), so I have to both test that type(S) is numpy.ndarray and get the values (n,p) = numpy.shape(S). One potential problem is that the input could be a list/tuple/int/etc..., another problem is that the input could be an array of shape (): S.ndim = 0. It occurred to me that I could simultaneously test the variable type, fix the S.ndim = 0problem, then get my dimensions like this:
# first simultaneously test for ndarray and get proper dimensions
try:
if (S.ndim == 0):
S = S.copy(); S.shape = (1,1);
# define dimensions p, and p2
(p,p2) = numpy.shape(S);
except AttributeError: # got here because input is not something array-like
raise AttributeError("blah blah blah");
Though it works, I'm wondering if this is a valid thing to do? The docstring for ndim says
If it is not already an ndarray, a conversion is
attempted.
and we surely know that numpy can easily convert an int/tuple/list to an array, so I'm confused why an AttributeError is being raised for these types inputs, when numpy should be doing this
numpy.array(S).ndim;
which should work.
When doing input validation for NumPy code, I always use np.asarray:
>>> np.asarray(np.array([1,2,3]))
array([1, 2, 3])
>>> np.asarray([1,2,3])
array([1, 2, 3])
>>> np.asarray((1,2,3))
array([1, 2, 3])
>>> np.asarray(1)
array(1)
>>> np.asarray(1).shape
()
This function has the nice feature that it only copies data when necessary; if the input is already an ndarray, the data is left in-place (only the type may be changed, because it also gets rid of that pesky np.matrix).
The docstring for ndim says
That's the docstring for the function np.ndim, not the ndim attribute, which non-NumPy objects don't have. You could use that function, but the effect would be that the data might be copied twice, so instead do:
S = np.asarray(S)
(p, p2) = S.shape
This will raise a ValueError if S.ndim != 2.
[Final note: you don't need ; in Python if you just follow the indentation rules. In fact, Python programmers eschew the semicolon.]
Given the comments to #larsmans answer, you could try:
if not isinstance(S, np.ndarray):
raise TypeError("Input not a ndarray")
if S.ndim == 0:
S = np.reshape(S, (1,1))
(p, p2) = S.shape
First, you check explicitly whether S is a (subclass of) ndarray. Then, you use the np.reshape to copy your data (and reshaping it, of course) if needed. At last, you get the dimension.
Note that in most cases, the np functions will first try to access the corresponding method of a ndarray, then attempt to convert the input to a ndarray (sometimes keeping it a subclass, as in np.asanyarray, sometimes not (as in np.asarray(...)). In other terms, it's always more efficient to use the method rather than the function: that's why we're using S.shape and not np.shape(S).
Another point: the np.asarray, np.asanyarray, np.atleast_1D... are all particular cases of the more generic function np.array. For example, asarray sets the optional copy argument of array to False, asanyarray does the same and sets subok=True, atleast_1D sets ndmin=1, atleast_2d sets ndmin=2... In other terms, it's always easier to use np.array with the appropriate arguments. But as mentioned in some comments, it's a matter of style. Shortcuts can often improve readability, which is always an objective to keep in mind.
In any case, when you use np.array(..., copy=True), you're explicitly asking for a copy of your initial data, a bit like doing a list([....]). Even if nothing else changed, your data will be copied. That has the advantages of its drawbacks (as we say in French), you could for example change the order from row-first C to column-first F. But anyway, you get the copy you wanted.
With np.array(input, copy=False), a new array is always created. It will either point to the same block of memory as input if this latter was already a ndarray (that is, no waste of memory), or will create a new one "from scratch" if input wasn't. The interesting case is of course if input was a ndarray.
Using this new array in a function may or may not change the original input, depending on the function. You have to check the documentation of the function you want to use to see whether it returns a copy or not. The NumPy developers try hard to limit unnecessary copies (following the Python example), but sometimes it can't be avoided. The documentation should tell explicitly what happens, if it doesn't or it's unclear, please mention it.
np.array(...) may raise some exceptions if something goes awry. For example, trying to use a dtype=float with an input like ["STRING", 1] will raise a ValueError. However, I must admit I can't remember which exceptions in all the cases, please edit this post accordingly.
Welcome to stack-overflow. This comes down to almost a style choice, but the most common way I've seen to deal with this kind of situation is to convert the input to an array. Numpy provides some useful tools for this. numpy.asarray has already been mentioned, but here are a few more. numpy.at_least1d is similar to asarray, but reshapes () arrays to be (1,) numpy.at_least2d is the same as above but reshapes 0d and 1d arrays to be 2d, ie (3,) to (1, 3). The reason we convert "array_like" inputs to arrays is partly just because we're lazy, for example sometimes it can be easier to write foo([1, 2, 3]) than foo(numpy.array([1, 2, 3])), but this is also the design choice made within numpy itself. Notice that the following works:
>>> numpy.mean([1., 2., 3.])
>>> 2.0
In the docs for numpy.mean we can see that x should be "array_like".
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
That being said, there are situations when you want to only accept arrays as arguments and not all "array_like" types.

Custom data types in numpy arrays

I'm creating a numpy array which is to be filled with objects of a particular class I've made. I'd like to initialize the array such that it will only ever contain objects of that class. For example, here's what I'd like to do, and what happens if I do it.
class Kernel:
pass
>>> L = np.empty(4,dtype=Kernel)
TypeError: data type not understood
I can do this:
>>> L = np.empty(4,dtype=object)
and then assign each element of L as a Kernel object (or any other type of object). It would be so neat were I able to have an array of Kernels, though, from both a programming point of view (type checking) and a mathematical one (operations on sets of functions).
Is there any way for me to specify the data type of a numpy array using an arbitrary class?
If your Kernel class has a predictable amount of member data, then you could define a dtype for it instead of a class. e.g. if it's parameterized by 9 floats and an int, you could do
kerneldt = np.dtype([('myintname', np.int32), ('myfloats', np.float64, 9)])
arr = np.empty(dims, dtype=kerneldt)
You'll have to do some coercion to turn them into objects of class Kernel every time you want to manipulate methods of a single kernel but that's one way to store the actual data in a NumPy array. If you want to only store a reference, then the object dtype is the best you can do without subclassing ndarray.
It has to be a Numpy scalar type:
http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars-built-in
or a subclass of ndarray:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray
As far as I know, enforcing a single type for elements in a numpy.ndarray has to be done manually (unless the array contains Numpy scalars): there is no built-in checking mechanism (your array has dtype=object). If you really want to enforce a single type, you have to subclass ndarray and implement the checks in the appropriate methods (__setitem__, etc.).
If you want to implement operations on a set of functions (Kernel objects), you might be able to do so by defining the proper operations directly in your Kernel class. This is what I did for my uncertainties.py module, which handles numpy.ndarrays of numbers with uncertainties.

Categories