A few examples:
numpy.sum()
ndarray.sum()
numpy.amax()
ndarray.max()
numpy.dot()
ndarray.dot()
... and quite a few more. Is it to support some legacy code, or is there a better reason for that? And, do I choose only on the basis of how my code 'looks', or is one of the two ways better than the other?
I can imagine that one might want numpy.dot() to use reduce (e.g., reduce(numpy.dot, [A, B, C, D])), but I don't think that would be as useful for something like numpy.sum().
As others have noted, the identically-named NumPy functions and array methods are often equivalent (they end up calling the same underlying code). One might be preferred over the other if it makes for easier reading.
However, in some instances the two behave slightly differently. In particular, using the ndarray method sometimes emphasises the fact that the method is modifying the array in-place.
For example, np.resize returns a new array with the specified shape. On the other hand, ndarray.resize changes the shape of the array in-place. The fill values used in each case are also different.
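For example (a quick interactive check; note the different fill values):
>>> a = np.array([1, 2, 3])
>>> np.resize(a, (2, 3))      # new array; repeats the data to fill
array([[1, 2, 3],
       [1, 2, 3]])
>>> b = np.array([1, 2, 3])
>>> b.resize((2, 3))          # in-place; pads with zeros
>>> b
array([[1, 2, 3],
       [0, 0, 0]])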
Similarly, a.sort() sorts the array a in-place, while np.sort(a) returns a sorted copy.
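For example:
>>> a = np.array([3, 1, 2])
>>> np.sort(a)                # sorted copy; a is untouched
array([1, 2, 3])
>>> a.sort()                  # in-place; returns None
>>> a
array([1, 2, 3])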
In most cases the method is the basic compiled version. The function uses that method when available, but also has some sort of backup when the argument is not an array. It helps to look at the code and/or docs of the function or method.
For example, if in IPython I ask to look at the code for the sum method, I see that it is compiled code:
In [711]: x.sum??
Type: builtin_function_or_method
String form: <built-in method sum of numpy.ndarray object at 0xac1bce0>
...
Refer to `numpy.sum` for full documentation.
Doing the same on np.sum, I get many lines of documentation plus some Python code:
if isinstance(a, _gentype):
    res = _sum_(a)
    if out is not None:
        out[...] = res
        return out
    return res
elif type(a) is not mu.ndarray:
    try:
        sum = a.sum
    except AttributeError:
        return _methods._sum(a, axis=axis, dtype=dtype,
                             out=out, keepdims=keepdims)
    # NOTE: Dropping the keepdims parameters here...
    return sum(axis=axis, dtype=dtype, out=out)
else:
    return _methods._sum(a, axis=axis, dtype=dtype,
                         out=out, keepdims=keepdims)
If I call np.sum(x) where x is an array, it ends up calling x.sum():
sum = a.sum
return sum(axis=axis, dtype=dtype, out=out)
np.amax is similar (but simpler). Note that the np. form can handle an object that isn't an array (one that doesn't have the method), e.g. a list: np.amax([1,2,3]).
np.dot and x.dot both show as 'built-in' functions, so we can't say anything about priority. They probably both end up calling the same underlying C function.
np.reshape is another that delegates if possible:
try:
    reshape = a.reshape
except AttributeError:
    return _wrapit(a, 'reshape', newshape, order=order)
return reshape(newshape, order=order)
So np.reshape(x,(2,3)) is identical in functionality to x.reshape((2,3)). But the _wrapit expression enables np.reshape([1,2,3,4],(2,2)).
np.sort returns a copy by doing an in-place sort on a copy:
a = asanyarray(a).copy()
a.sort(axis, kind, order)
return a
x.resize is built-in, while np.resize ends up doing a np.concatenate and reshape.
If your array is a subclass, like matrix or masked, it may have its own variant. The action of a matrix .sum is:
return N.ndarray.sum(self, axis, dtype, out, keepdims=True)._collapse(axis)
Elaborating on Peter's comment for visibility:
We could make it more consistent by removing methods from ndarray and sticking to just functions. But this is impossible because it would break everyone's existing code that uses methods.
Or, we could move all functions to also be methods. But this is impossible because new users and packages are constantly defining new functions. Plus continuing to multiply these duplicate methods violates "there should be one obvious way to do it".
If we could go back in time then I'd probably argue for not having these methods on ndarray at all, and using functions exclusively. ... So this all argues for using functions exclusively
numpy issue: More consistency with array-methods #7452
For example (from https://numpy.org/doc/stable/reference/generated/numpy.nditer.html):
def iter_add_py(x, y, out=None):
    addop = np.add
    it = np.nditer([x, y, out], [],
                   [['readonly'], ['readonly'], ['writeonly', 'allocate']])
    with it:
        for (a, b, c) in it:
            addop(a, b, out=c)
        return it.operands[2]
In the case out = None, the only way to access the result of the operation is with it.operands[2]. Numba's nditer supports None as an operand, creating the correct empty array, but it's inaccessible because Numba doesn't support the operands attribute. Speed is critical in my solution.
Things I've tried:
Sending a new empty array (which we can reference) into out is a seemingly obvious solution, but you'd need to manually compute the broadcasted shape, dtype, etc. I couldn't find a way to do this, with/without numpy, that was quick enough to justify ignoring that nditer does the job anyway.
Even if I could, Numba's nditer doesn't support flags in its constructor, rendering it read-only, so the best way I could find to fill the array is to enumerate nditer and set out.flat in the loop, which is also too slow (and a little gross).
Unsurprisingly, a jitted function can neither return nor take nditer objects. I'm new to Numba, so I could be wrong about this, but I couldn't find anything to the contrary. I can only iterate inside the function.
There's no Numba support for np.broadcast, so that won't work.
I should mention that Numba.vectorize isn't an option, for a host of irrelevant reasons.
Any other ideas?
I can verify that my function receives inputs of the correct type using:
def foo(x: np.ndarray, y: float):
    return x * y
This makes sure that if I try to use this function with an x that is not an np.ndarray, I will get an error even before running the code.
What I don't know is how to verify the array type. For example:
def return_valid_points_only(points: np.ndarray, valid: np.ndarray):
    assert points.shape == valid.shape
    return points[valid]
I wish to check that valid is not only an np.ndarray but also that valid.dtype == bool.
For this example, if valid is supplied with 0s and 1s to indicate validity, the program won't fail, but I will get terrible results.
Thanks
Python is all about asking for forgiveness, not permission. That means that even in your first definition, def foo(x: np.ndarray, y: float): is really relying on the user to honor the hint, unless you are using something like mypy.
There are a couple of approaches you can take here, usually in tandem. One is to write the function in a way that works with the inputs that are passed in, which can mean failing on or coercing invalid inputs. The other is to document your code carefully, so users can make intelligent decisions. The second method is especially important, but I will focus on the first.
Numpy does most of the checking for you. For example, rather than expecting an array, it is idiomatic to coerce the input to one:
x = np.asanyarray(x)
np.asanyarray is usually an alias for array(a, dtype, copy=False, order=order, subok=True). You can do something similar for y:
y = np.asanyarray(y).item()
This will allow any array-like as long as it has one element, whether scalar or not. Another way is to respect numpy's ability to broadcast arrays together, so that if the user passes in y as, say, a list of x.shape[-1] elements, the multiplication still works through broadcasting.
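As a sketch, both coercions combined in a hypothetical wrapper (this foo is illustrative, not code from the question):
def foo(x, y):
    x = np.asanyarray(x)   # accept any array-like
    y = np.asanyarray(y)   # scalar, list, or array; broadcasting does the rest
    return x * y

foo([[1, 2], [3, 4]], 10)        # scalar y
foo([[1, 2], [3, 4]], [10, 20])  # y broadcasts along the last axis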
For your second function, you have a couple of options. One is to allow fancy indexing: whether the user passes in a list of indices or a boolean mask, you can support both. If, on the other hand, you insist on a boolean mask, you can either check or coerce the dtype.
If you check, keep in mind that the numpy indexing operation will raise an error for you if the array sizes don't match. You only need to check the type itself:
points = np.asanyarray(points)
valid = np.asanyarray(valid)
if valid.dtype != bool:
    raise ValueError('valid argument must be a boolean mask')
If you choose to coerce instead, the user will be allowed to use zeros and ones, but valid inputs will not be copied unnecessarily:
valid = np.asanyarray(valid, bool)
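Putting it together, a sketch of the coercing version of the function from the question:
def return_valid_points_only(points, valid):
    points = np.asanyarray(points)
    valid = np.asanyarray(valid, bool)  # 0/1 input becomes False/True; boolean input is not copied
    assert points.shape == valid.shape
    return points[valid]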
For doing repeated operations in numpy/scipy, there's a lot of overhead because most operations return a new object.
For example
for i in range(100):
    x = A*x
I would like to avoid this by passing a reference to the operation, like you would in C
for i in range(100):
    np.dot(A, x, x_new)  # x_new would now store the result of the multiplication
    x, x_new = x_new, x
Is there any way to do this? I would like this not just for multiplication but for all operations that return a matrix or a vector.
See Learning to avoid unnecessary array copies in IPython Books. From there, note e.g. these guidelines:
a *= b
will not produce a copy, whereas:
a = a * b
will produce a copy. Also, flatten() always copies, while ravel() only copies if necessary and returns a view otherwise (and thus should in general be preferred). Similarly, reshape() returns a view when possible and a copy otherwise.
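A quick way to see the difference is to check object identity:
>>> a = np.ones(5); orig = a
>>> a *= 2            # in-place: same underlying buffer
>>> a is orig
True
>>> a = a * 2         # allocates a new array and rebinds a
>>> a is orig
False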
Furthermore, as @hpaulj and @ali_m noted in their comments, many numpy functions support an out parameter, so have a look at the docs. From the numpy.dot() docs:
out : ndarray, optional
Output argument.
This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be C-contiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
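For example, a sketch of how the loop from the question could use out (assuming A is square so the swapped buffers keep compatible shapes):
A = np.random.rand(100, 100)
x = np.random.rand(100)
x_new = np.empty_like(x)      # preallocated: right shape, dtype, C-contiguous

for i in range(100):
    np.dot(A, x, out=x_new)   # result written into x_new, no new allocation
    x, x_new = x_new, x       # swap buffers for the next iteration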
I'm trying to write a subclass of masked_array. What I've got so far is this:
class gridded_array(ma.core.masked_array):
    def __init__(self, data, dimensions, mask=False, dtype=None,
                 copy=False, subok=True, ndmin=0, fill_value=None,
                 keep_mask=True, hard_mask=None, shrink=True):
        ma.core.masked_array.__init__(data, mask, dtype, copy, subok,
                                      ndmin, fill_value, keep_mask,
                                      hard_mask, shrink)
        self.dimensions = dimensions
However, when now I create a gridded_array, I don't get what I expect:
dims = OrderedDict()
dims['x'] = np.arange(4)
gridded_array(np.random.randn(4), dims)

masked_array(data = [-- -- -- --],
             mask = [ True  True  True  True],
       fill_value = 1e+20)
I would expect an unmasked array. I have the suspicion that the dimensions argument I'm passing gets passed on to the masked_array.__init__ call, but since I'm quite new to OOP, I don't know how to resolve this.
Any help is greatly appreciated.
PS: I'm on Python 2.7
A word of warning: if you're brand new to OOP, subclassing ndarrays and MaskedArrays is not the easiest way to get started, by far...
Before anything else, you should go and check this tutorial. That should introduce you to the mechanisms involved in subclassing ndarrays.
MaskedArrays, like ndarrays, use the __new__ method for creating class instances, not __init__. By the time you get to the __init__ of your subclass, you already have a fully instantiated object, with the actual initialization delegated to the __array_finalize__ method. In simpler terms: your __init__ doesn't work as you would expect with standard Python objects. (Actually, I wonder whether it's called at all... after __array_finalize__, if I recall correctly...)
Now that you've been warned, you may want to consider whether you really need to go through the hassle of subclassing a ndarray:
What are your objectives with your gridded_array?
Should you support all methods of ndarrays, or only some? All dtypes?
What should happen when you take a single element or a slice of your object?
Will you be using gridded_arrays extensively as inputs to NumPy functions?
If you have a doubt, then it might be easier to design gridded_array as a generic class that takes a ndarray (or a MaskedArray) as attribute (say, gridded_array._array), and add only the methods you would need to operate on your self._array.
Suggestions
If you only need to "tag" each item of your gridded_array, you may be interested in pandas.
If you only have to deal with floats, MaskedArray might be a bit overkill: just use NaNs to represent invalid data; a lot of numpy functions have NaN-aware equivalents. At worst, you can always mask your gridded_array when needed: taking a view of a subclass of ndarray with .view(np.ma.MaskedArray) should return a masked version of your input...
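For instance (a minimal sketch of both routes):
>>> a = np.array([1.0, np.nan, 3.0])
>>> np.nansum(a)                    # NaN-aware equivalent of sum
4.0
>>> m = a.view(np.ma.MaskedArray)   # masked view of the same data
>>> m.mask = np.isnan(a)
>>> m.sum()
4.0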
The issue is that masked_array uses __new__ instead of __init__, so your dimensions argument is being misinterpreted.
To override __new__, use:
class gridded_array(ma.core.masked_array):
    def __new__(cls, data, dimensions, *args, **kwargs):
        self = super(gridded_array, cls).__new__(cls, data, *args, **kwargs)
        self.dimensions = dimensions
        return self
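With this, the construction from the question should give an unmasked array (a quick sketch; note that attributes set in __new__ are not propagated through slicing or views unless you also define __array_finalize__):
dims = OrderedDict()
dims['x'] = np.arange(4)
g = gridded_array(np.random.randn(4), dims)  # no longer fully masked
g.dimensions                                 # the OrderedDict survives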
I'm working on a project in Python requiring a lot of numerical array calculations. Unfortunately (or fortunately, depending on your POV), I'm very new to Python, but have been doing MATLAB and Octave programming (APL before that) for years. I'm very used to having every variable automatically typed as a float matrix, and I'm still getting used to checking input types.
In many of my functions, I require the input S to be a numpy.ndarray of size (n,p), so I have to both test that type(S) is numpy.ndarray and get the values (n,p) = numpy.shape(S). One potential problem is that the input could be a list/tuple/int/etc.; another is that the input could be an array of shape (): S.ndim == 0. It occurred to me that I could simultaneously test the variable type, fix the S.ndim == 0 problem, and then get my dimensions like this:
# first simultaneously test for ndarray and get proper dimensions
try:
    if (S.ndim == 0):
        S = S.copy(); S.shape = (1,1);
    # define dimensions p, and p2
    (p,p2) = numpy.shape(S);
except AttributeError:  # got here because input is not something array-like
    raise AttributeError("blah blah blah");
Though it works, I'm wondering if this is a valid thing to do. The docstring for ndim says:
If it is not already an ndarray, a conversion is attempted.
and we surely know that numpy can easily convert an int/tuple/list to an array, so I'm confused why an AttributeError is being raised for these types of inputs, when numpy should be doing this:
numpy.array(S).ndim;
which should work.
When doing input validation for NumPy code, I always use np.asarray:
>>> np.asarray(np.array([1,2,3]))
array([1, 2, 3])
>>> np.asarray([1,2,3])
array([1, 2, 3])
>>> np.asarray((1,2,3))
array([1, 2, 3])
>>> np.asarray(1)
array(1)
>>> np.asarray(1).shape
()
This function has the nice feature that it only copies data when necessary; if the input is already an ndarray, the data is left in-place (only the type may be changed, because it also gets rid of that pesky np.matrix).
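For example:
>>> a = np.array([1, 2, 3])
>>> np.asarray(a) is a    # already an ndarray: returned as-is, no copy
True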
The docstring for ndim says
That's the docstring for the function np.ndim, not the ndim attribute, which non-NumPy objects don't have. You could use that function, but the effect would be that the data might be copied twice, so instead do:
S = np.asarray(S)
(p, p2) = S.shape
This will raise a ValueError if S.ndim != 2.
[Final note: you don't need ; in Python if you just follow the indentation rules. In fact, Python programmers eschew the semicolon.]
Given the comments to @larsmans' answer, you could try:
if not isinstance(S, np.ndarray):
    raise TypeError("Input not a ndarray")
if S.ndim == 0:
    S = np.reshape(S, (1,1))
(p, p2) = S.shape
First, you check explicitly whether S is a (subclass of) ndarray. Then, you use np.reshape to copy (and reshape) your data if needed. Finally, you get the dimensions.
Note that in most cases, the np functions will first try to access the corresponding method of an ndarray, then attempt to convert the input to an ndarray (sometimes keeping it a subclass, as in np.asanyarray, sometimes not, as in np.asarray). In other words, it's usually more efficient to use the method rather than the function: that's why we're using S.shape and not np.shape(S).
Another point: np.asarray, np.asanyarray, np.atleast_1d... are all particular cases of the more generic function np.array. For example, asarray sets the optional copy argument of array to False, asanyarray does the same and also sets subok=True, atleast_1d sets ndmin=1, atleast_2d sets ndmin=2... In other words, you can always use np.array with the appropriate arguments. But as mentioned in some comments, it's a matter of style: shortcuts can often improve readability, which is always an objective to keep in mind.
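Roughly, the correspondences described above (a sketch, not the exact implementations):
a = [1, 2, 3]
np.asarray(a)       # ~ np.array(a, copy=False)
np.asanyarray(a)    # ~ np.array(a, copy=False, subok=True)
np.atleast_1d(5)    # ~ np.array(5, copy=False, ndmin=1)
np.atleast_2d(a)    # ~ np.array(a, copy=False, ndmin=2)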
In any case, when you use np.array(..., copy=True), you're explicitly asking for a copy of your initial data, a bit like doing list([...]). Even if nothing else changed, your data will be copied. That has the advantages of its drawbacks (as we say in French): you could, for example, change the order from row-first C to column-first F. But anyway, you get the copy you wanted.
With np.array(input, copy=False), no unnecessary copy is made: the result will share the same block of memory as input if the latter was already an ndarray (that is, no memory is wasted), or a new array will be created "from scratch" if it wasn't. The interesting case is of course when input was an ndarray.
Using this new array in a function may or may not change the original input, depending on the function. You have to check the documentation of the function you want to use to see whether it returns a copy or not. The NumPy developers try hard to limit unnecessary copies (following the Python example), but sometimes it can't be avoided. The documentation should tell explicitly what happens, if it doesn't or it's unclear, please mention it.
np.array(...) may raise some exceptions if something goes awry. For example, trying to use dtype=float with an input like ["STRING", 1] will raise a ValueError. However, I must admit I can't remember which exceptions are raised in all the cases; please edit this post accordingly.
Welcome to Stack Overflow. This comes down almost to a style choice, but the most common way I've seen to deal with this kind of situation is to convert the input to an array. Numpy provides some useful tools for this. numpy.asarray has already been mentioned, but here are a few more: numpy.atleast_1d is similar to asarray, but reshapes () arrays to be (1,); numpy.atleast_2d is the same, but also reshapes 0d and 1d arrays to be 2d, i.e. (3,) to (1, 3).
The reason we convert "array_like" inputs to arrays is partly just because we're lazy (sometimes it can be easier to write foo([1, 2, 3]) than foo(numpy.array([1, 2, 3]))), but this is also the design choice made within numpy itself. Notice that the following works:
>>> numpy.mean([1., 2., 3.])
2.0
In the docs for numpy.mean we can see that the argument a should be "array_like".
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
That being said, there are situations when you want to only accept arrays as arguments and not all "array_like" types.