I'm working on a project in Python requiring a lot of numerical array calculations. Unfortunately (or fortunately, depending on your POV), I'm very new to Python, but have been doing MATLAB and Octave programming (APL before that) for years. I'm very used to having every variable automatically typed to a matrix float, and still getting used to checking input types.
In many of my functions, I require the input S to be a numpy.ndarray of size (n,p), so I have to both test that type(S) is numpy.ndarray and get the values (n,p) = numpy.shape(S). One potential problem is that the input could be a list/tuple/int/etc..., another problem is that the input could be an array of shape (): S.ndim = 0. It occurred to me that I could simultaneously test the variable type, fix the S.ndim = 0problem, then get my dimensions like this:
# first simultaneously test for ndarray and get proper dimensions
try:
if (S.ndim == 0):
S = S.copy(); S.shape = (1,1);
# define dimensions p, and p2
(p,p2) = numpy.shape(S);
except AttributeError: # got here because input is not something array-like
raise AttributeError("blah blah blah");
Though it works, I'm wondering if this is a valid thing to do? The docstring for ndim says
If it is not already an ndarray, a conversion is
attempted.
and we surely know that numpy can easily convert an int/tuple/list to an array, so I'm confused why an AttributeError is being raised for these types inputs, when numpy should be doing this
numpy.array(S).ndim;
which should work.
When doing input validation for NumPy code, I always use np.asarray:
>>> np.asarray(np.array([1,2,3]))
array([1, 2, 3])
>>> np.asarray([1,2,3])
array([1, 2, 3])
>>> np.asarray((1,2,3))
array([1, 2, 3])
>>> np.asarray(1)
array(1)
>>> np.asarray(1).shape
()
This function has the nice feature that it only copies data when necessary; if the input is already an ndarray, the data is left in-place (only the type may be changed, because it also gets rid of that pesky np.matrix).
The docstring for ndim says
That's the docstring for the function np.ndim, not the ndim attribute, which non-NumPy objects don't have. You could use that function, but the effect would be that the data might be copied twice, so instead do:
S = np.asarray(S)
(p, p2) = S.shape
This will raise a ValueError if S.ndim != 2.
[Final note: you don't need ; in Python if you just follow the indentation rules. In fact, Python programmers eschew the semicolon.]
Given the comments to #larsmans answer, you could try:
if not isinstance(S, np.ndarray):
raise TypeError("Input not a ndarray")
if S.ndim == 0:
S = np.reshape(S, (1,1))
(p, p2) = S.shape
First, you check explicitly whether S is a (subclass of) ndarray. Then, you use the np.reshape to copy your data (and reshaping it, of course) if needed. At last, you get the dimension.
Note that in most cases, the np functions will first try to access the corresponding method of a ndarray, then attempt to convert the input to a ndarray (sometimes keeping it a subclass, as in np.asanyarray, sometimes not (as in np.asarray(...)). In other terms, it's always more efficient to use the method rather than the function: that's why we're using S.shape and not np.shape(S).
Another point: the np.asarray, np.asanyarray, np.atleast_1D... are all particular cases of the more generic function np.array. For example, asarray sets the optional copy argument of array to False, asanyarray does the same and sets subok=True, atleast_1D sets ndmin=1, atleast_2d sets ndmin=2... In other terms, it's always easier to use np.array with the appropriate arguments. But as mentioned in some comments, it's a matter of style. Shortcuts can often improve readability, which is always an objective to keep in mind.
In any case, when you use np.array(..., copy=True), you're explicitly asking for a copy of your initial data, a bit like doing a list([....]). Even if nothing else changed, your data will be copied. That has the advantages of its drawbacks (as we say in French), you could for example change the order from row-first C to column-first F. But anyway, you get the copy you wanted.
With np.array(input, copy=False), a new array is always created. It will either point to the same block of memory as input if this latter was already a ndarray (that is, no waste of memory), or will create a new one "from scratch" if input wasn't. The interesting case is of course if input was a ndarray.
Using this new array in a function may or may not change the original input, depending on the function. You have to check the documentation of the function you want to use to see whether it returns a copy or not. The NumPy developers try hard to limit unnecessary copies (following the Python example), but sometimes it can't be avoided. The documentation should tell explicitly what happens, if it doesn't or it's unclear, please mention it.
np.array(...) may raise some exceptions if something goes awry. For example, trying to use a dtype=float with an input like ["STRING", 1] will raise a ValueError. However, I must admit I can't remember which exceptions in all the cases, please edit this post accordingly.
Welcome to stack-overflow. This comes down to almost a style choice, but the most common way I've seen to deal with this kind of situation is to convert the input to an array. Numpy provides some useful tools for this. numpy.asarray has already been mentioned, but here are a few more. numpy.at_least1d is similar to asarray, but reshapes () arrays to be (1,) numpy.at_least2d is the same as above but reshapes 0d and 1d arrays to be 2d, ie (3,) to (1, 3). The reason we convert "array_like" inputs to arrays is partly just because we're lazy, for example sometimes it can be easier to write foo([1, 2, 3]) than foo(numpy.array([1, 2, 3])), but this is also the design choice made within numpy itself. Notice that the following works:
>>> numpy.mean([1., 2., 3.])
>>> 2.0
In the docs for numpy.mean we can see that x should be "array_like".
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
That being said, there are situations when you want to only accept arrays as arguments and not all "array_like" types.
Related
For doing repeated operations in numpy/scipy, there's a lot of overhead because most operation return a new object.
For example
for i in range(100):
x = A*x
I would like to avoid this by passing a reference to the operation, like you would in C
for i in range(100):
np.dot(A,x,x_new) #x_new would now store the result of the multiplication
x,x_new = x_new,x
Is there any way to do this? I would like this not for just mutiplication but all operations that return a matrix or a vector.
See Learning to avoid unnecessary array copies in IPython Books. From there, note e.g. these guidelines:
a *= b
will not produce a copy, whereas:
a = a * b
will produce a copy. Also, flatten() will copy, while ravel() only copies if necessary and returns a view otherwise (and thus should in general be preferred). reshape() also does not produce a copy, but returns a view.
Furthermore, as #hpaulj and #ali_m noted in their comments, many numpy functions support an out parameter, so have a look at the docs. From numpy.dot() docs:
out : ndarray, optional
Output argument.
This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be C-contiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
I suppose this is a bit of a beginner's question, but I'm wondering about the more pythonic approach to use when you're presented with a situation where you're using class methods from a class defined in a module, as well as methods defined within the module itself. I'll use numpy as an example.
import numpy as np
foo = np.matrix([[3, 4], [9, 12]])
# Get norm (without using linalg)
norm = np.sqrt(foo.dot(foo.T)).diagonal()
I can use a mixed case, like this, where I'm calling methods of foo and methods defined in numpy, or I can write the code as below:
norm = np.diagonal(np.sqrt(np.dot(foo, foo.T)))
I would prefer using foo.bar.baz.shoop.doop syntax, myself, but in this case I can't, as sqrt isn't a method of foo. So, what would be the more pythonic way to write a line like this?
Incidentally, as a side-question, are class methods usually more optimized, compared to methods defined in a module? I don't understand too well what's going on under the hood, but I assumed (again using numpy as an example) that numpy has an np.dot method that is written for the general case where arg can be an array or a matrix, while np.matrix.dot is reimplemented and optimized only for matrix operations. Please correct me if I'm wrong in this.
The questions you're asking don't really have an answer, because the case you're asking about simply doesn't exist.
Typically, in Python, you do not have the same function available as a method and as a global function.
NumPy is a special case, because some, but not all, top-level functions are also available as methods on the appropriate object. Even then, they often don't have the same semantics, so the answer isn't a question of style, but of which one is the right function.
For example, in your case, the only one you have a choice on is diagonal. And the two options give different results.
>>> m = matrix([[1,2,3], [4,5,6], [7,8,9]]
>>> np.diagonal(m)
array([1, 5, 9])
>>> m.diagonal()
matrix([[1, 5, 9]])
The module function takes a 2D array of shape (N, N) and returns a 1D array of shape (N,). The method takes a 2D matrix of shape (N, N) and returns a 2D matrix of shape (1, N).
It's possible that the matrix method will be faster. But that's not as important as the fact that if one of them is correct, the other one is wrong. It's like asking whether + or * is a faster way to multiply two numbers. Whether + is faster than * or not, it's not a faster way to multiply, because it doesn't multiply.
I'm trying to subclass numpy.complex64 in order to make use of the way numpy stores the data, (contiguous, alternating real and imaginary part) but use my own __add__, __sub__, ... routines.
My problem is that when I make a numpy.ndarray, setting dtype=mysubclass, I get a numpy.ndarray with dtype='numpy.complex64' in stead, which results in numpy not using my own functions for additions, subtractions and so on.
Example:
import numpy as np
class mysubclass(np.complex64):
pass
a = mysubclass(1+1j)
A = np.empty(2, dtype=mysubclass)
print type(a)
print repr(A)
Output:
<class '__main__.mysubclass'>
array([ -2.07782988e-20 +4.58546896e-41j, -2.07782988e-20 +4.58546896e-41j], dtype=complex64)'
Does anyone know how to do this?
Thanks in advance - Soren
The NumPy type system is only designed to be extended from C, via the PyArray_RegisterDataType function. It may be possible to access this functionality from Python using ctypes but I wouldn't recommend it; better to write an extension in C or Cython, or subclass ndarray as #seberg describes.
There's a simple example dtype in the NumPy source tree: newdtype_example/floatint.c. If you're into Pyrex, reference.pyx in the pytables source may be worth a look.
Note that scalars and arrays are quite different in numpy. np.complex64 (this is 32-bit float, just to note, not double precision). You will not be able to change the array like that, you will need to subclass the array instead and then override its __add__ and __sub__.
If that is all you want to do, it should just work otherwise look at http://docs.scipy.org/doc/numpy/user/basics.subclassing.html since subclassing an array is not that simple.
However if you want to use this type also as a scalar. For example you want to index scalars out, it gets more difficult at least currently. You can get a little further by defining __array_wrap__ to convert to scalars to your own scalar type for some reduce functions, for indexing to work in all cases it appears to me that you may have define your own __getitem__ currently.
In all cases with this approach, you still use the complex datatype, and all functions that are not explicitly overridden will still behave the same. #ecatmur mentioned that you can create new datatypes from the C side, if that is really what you want.
I am learning Python and numpy, and am new to the idea of duck typing. I'm writing functions into which something/someone should pass a numpy array. Trying to embrace duck typing, I'm writing my code to use numpy.array with the copy= and ndmin= options to convert array_likes or 1d/0d arrays into the shape I need. Specifically, I use the ndmin= option in cases where I can accept either a (p,p) array or a scalar; the scalar can be coded as an int, (1,) array, (1,1) array, [1] list, etc...
So to take care of this, I'm using something like S = numpy.array(S,copy=False,ndmin=2) to get an array (if possible) with the right ndim, then test for the shape as I need. I know I should embed this in a Try-Except block, but can't find any documentation about what kind of exception numpy.array() is likely to throw. Thus I currently just have this:
# duck covariance matrix into a 2d matrix
try:
S = numpy.array(S, ndmin = 2, copy=False)
except Exception as e:
raise e
What specific exception(s) should I try to catch here? Thanks.
Document your function as accepting an array_like object and leave handling of exceptions to a caller.
numpy.array() is very permissive function it will convert to an array almost anything.
Try using np.asarray to convert inputs into arrays instead. It's guaranteed not to copy anything if the input is already a Numpy array. If you expect to receive array subclasses, use np.asanyarray.
Note that a lot of Numpy interfaces don't care if the input is 1- or 2-dimensional -- for example, np.dot works with both 1- and 2-dimensional input. It's probably best to leave it that way -- that way, things like scalar multiplication just work.
Surely a 0d array is scalar, but Numpy does not seem to think so... am I missing something or am I just misunderstanding the concept?
>>> foo = numpy.array(1.11111111111, numpy.float64)
>>> numpy.ndim(foo)
0
>>> numpy.isscalar(foo)
False
>>> foo.item()
1.11111111111
One should not think too hard about it. It's ultimately better for the mental health and longevity of the individual.
The curious situation with Numpy scalar-types was bore out of the fact that there is no graceful and consistent way to degrade the 1x1 matrix to scalar types. Even though mathematically they are the same thing, they are handled by very different code.
If you've been doing any amount of scientific code, ultimately you'd want things like max(a) to work on matrices of all sizes, even scalars. Mathematically, this is a perfectly sensible thing to expect. However for programmers this means that whatever presents scalars in Numpy should have the .shape and .ndim attirbute, so at least the ufuncs don't have to do explicit type checking on its input for the 21 possible scalar types in Numpy.
On the other hand, they should also work with existing Python libraries that does do explicit type-checks on scalar type. This is a dilemma, since a Numpy ndarray have to individually change its type when they've been reduced to a scalar, and there is no way of knowing whether that has occurred without it having do checks on all access. Actually going that route would probably make bit ridiculously slow to work with by scalar type standards.
The Numpy developer's solution is to inherit from both ndarray and Python scalars for its own scalary type, so that all scalars also have .shape, .ndim, .T, etc etc. The 1x1 matrix will still be there, but its use will be discouraged if you know you'll be dealing with a scalar. While this should work fine in theory, occasionally you could still see some places where they missed with the paint roller, and the ugly innards are exposed for all to see:
>>> from numpy import *
>>> a = array(1)
>>> b = int_(1)
>>> a.ndim
0
>>> b.ndim
0
>>> a[...]
array(1)
>>> a[()]
1
>>> b[...]
array(1)
>>> b[()]
1
There's really no reason why a[...] and a[()] should return different things, but it does. There are proposals in place to change this, but looks like they forgot to finish the job for 1x1 arrays.
A potentially bigger, and possibly non-resolvable issue, is the fact that Numpy scalars are immutable. Therefore "spraying" a scalar into a ndarray, mathematically the adjoint operation of collapsing an array into a scalar, is a PITA to implement. You can't actually grow a Numpy scalar, it cannot by definition be cast into an ndarray, even though newaxis mysteriously works on it:
>>> b[0,1,2,3] = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'numpy.int32' object does not support item assignment
>>> b[newaxis]
array([1])
In Matlab, growing the size of a scalar is a perfectly acceptable and brainless operation. In Numpy you have to stick jarring a = array(a) everywhere you think you'd have the possibility of starting with a scalar and ending up with an array. I understand why Numpy has to be this way to play nice with Python, but that doesn't change the fact that many new switchers are deeply confused about this. Some have explicit memory of struggling with this behaviour and eventually persevering, while others who are too far gone are generally left with some deep shapeless mental scar that frequently haunts their most innocent dreams. It's an ugly situation for all.
You have to create the scalar array a little bit differently:
>>> x = numpy.float64(1.111)
>>> x
1.111
>>> numpy.isscalar(x)
True
>>> numpy.ndim(x)
0
It looks like scalars in numpy may be a bit different concept from what you may be used to from a purely mathematical standpoint. I'm guessing you're thinking in terms of scalar matricies?