I have a function foo(x,y) that takes two scalars (or lists of scalars) and returns a scalar output (or list of scalars computed pairwise from the input). I want to be able to evaluate this function over 2 orthogonal arrays such that the output is a matrix whose (i, j) entry is foo(x[i], y[j]).
I have a for-loop version that solves this problem as below:
import numpy as np
x = np.arange(50) # Could be linspaces, whatever the axis in the vector space is
y = np.arange(50)
mat = np.zeros((len(x), len(y))) # To hold the result for plotting
for i in range(len(x)):
    for j in range(len(y)):
        mat[i][j] = foo(x[i], y[j])
where my result is stored in mat. However, this is dreadfully slow, and looks to me as if it could easily be vectorized. I'm not aware of how Python solves this problem however, as this doesn't appear to be something like zip or map. Is there another such function or concept (beyond trivially making extremely long arrays of the same array rotated by a value and passing them that way) that could vectorize this successfully? Or is the nature of the foo function limiting the ability to vectorize this?
In this case, itertools.product is the tool you want. It generates an iterable sequence of elements from the Cartesian product of N inputs, which you can use to discretely map a vector space. You can then evaluate foo on these. This isn't vectorization per se, but does reduce the nested for loop.
See docs at https://docs.python.org/3/library/itertools.html#itertools.product
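As a minimal sketch of that approach: the foo below is a hypothetical stand-in, since the real function isn't shown in the question. The last line shows an alternative that is not part of this answer, plain NumPy broadcasting, which only works if foo operates elementwise on arrays.

```python
import itertools
import numpy as np

def foo(x, y):
    # Hypothetical stand-in for the asker's function; works on scalars or arrays.
    return x * 10 + y

x = np.arange(3)
y = np.arange(4)

# itertools.product yields (x[i], y[j]) pairs with the last input varying
# fastest, so the flat results reshape directly into a (len(x), len(y)) matrix.
flat = [foo(xi, yj) for xi, yj in itertools.product(x, y)]
mat_product = np.array(flat).reshape(len(x), len(y))

# Alternative (an assumption, not from the answer): broadcasting a column
# against a row, valid only when foo is elementwise on NumPy arrays.
mat_broadcast = foo(x[:, np.newaxis], y[np.newaxis, :])
```

The product version still calls foo once per pair in Python, so it tidies the loop rather than speeding it up; the broadcasting version is the one that actually vectorizes.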
I am trying to calculate the first and second order moments for a portfolio of stocks (i.e. expected return and standard deviation).
expected_returns_annual
Out[54]:
ticker
adj_close CNP 0.091859
F -0.007358
GE 0.095399
TSLA 0.204873
WMT -0.000943
dtype: float64
type(expected_returns_annual)
Out[55]: pandas.core.series.Series
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(expected_returns_annual, weights)
So normally the expected return is calculated by
(x1,...,xn' * (R1,...,Rn)
where x1,...,xn are weights, subject to the constraint that all the weights sum to 1, and ' means that the vector is transposed.
Now I am wondering a bit about the numpy dot function, because
returns = np.dot(expected_returns_annual, weights)
and
returns = np.dot(expected_returns_annual, weights.T)
give the same results.
I tested also the shape of weights.T and weights.
weights.shape
Out[58]: (5,)
weights.T.shape
Out[59]: (5,)
The shape of weights.T should be (,5) and not (5,), but numpy displays them as equal (I also tried np.transpose, but there is the same result)
Does anybody know why numpy behaves this way? In my opinion np.dot automatically shapes the vector the right way so that the vector product works. Is that correct?
Best regards
Tom
The semantics of np.dot are not great
As Dominique Paul points out, np.dot has very heterogeneous behavior depending on the shapes of the inputs. Adding to the confusion, as the OP points out in his question, given that weights is a 1D array, np.array_equal(weights, weights.T) is True (array_equal tests for equality of both value and shape).
Recommendation: use np.matmul or the equivalent @ instead
If you are someone just starting out with Numpy, my advice to you would be to ditch np.dot completely. Don't use it in your code at all. Instead, use np.matmul, or the equivalent operator @. The behavior of @ is more predictable than that of np.dot, while still being convenient to use. For example, you would get the same dot product for the two 1D arrays you have in your code like so:
returns = expected_returns_annual @ weights
You can prove to yourself that this gives the same answer as np.dot with this assert:
assert expected_returns_annual @ weights == expected_returns_annual.dot(weights)
Conceptually, @ handles this case by promoting the two 1D arrays to appropriate 2D arrays (though the implementation doesn't necessarily do this). For example, if you have x with shape (N,) and y with shape (M,), if you do x @ y the shapes will be promoted such that:
x.shape == (1, N)
y.shape == (M, 1)
Complete behavior of matmul/@
Here's what the docs have to say about matmul/@ and the shapes of inputs/outputs:
If both arguments are 2-D they are multiplied like conventional matrices.
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.
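The promotion rules above can be checked with a small sketch. The arrays here are made-up illustrative values, not the asker's data.

```python
import numpy as np

w = np.array([0.2, 0.3, 0.5])      # 1-D weights, shape (3,)
r = np.array([0.10, -0.05, 0.20])  # 1-D returns, shape (3,)

# Transposing a 1-D array is a no-op: there is only one axis to reverse.
same_shape = w.T.shape == w.shape

# 1-D @ 1-D: both operands are promoted, multiplied, and the added axes
# are removed again, leaving a scalar.
port_return = w @ r

# 2-D @ 1-D: the vector is treated as a column and the result is 1-D.
A = np.arange(6.0).reshape(2, 3)
Aw = A @ w                          # shape (2,)

# N-D @ 2-D: a stack of ten (2, 3) matrices, each multiplied by B.
stack = np.ones((10, 2, 3))
B = np.ones((3, 4))
stacked = stack @ B                 # shape (10, 2, 4)
```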
Notes: the arguments for using @ over dot
As hpaulj points out in the comments, np.array_equal(x.dot(y), x @ y) for all x and y that are 1D or 2D arrays. So why do I (and why should you) prefer @? I think the best argument for using @ is that it helps to improve your code in small but significant ways:
@ is explicitly a matrix multiplication operator. x @ y will raise an error if y is a scalar, whereas dot will make the assumption that you actually just wanted elementwise multiplication. This can potentially result in a hard-to-localize bug in which dot silently returns a garbage result (I've personally run into that one). Thus, @ allows you to be explicit about your own intent for the behavior of a line of code.
Because @ is an operator, it has some nice short syntax for coercing various sequence types into arrays, without having to explicitly cast them. For example, [0,1,2] @ np.arange(3) is valid syntax.
To be fair, while [0,1,2].dot(arr) is obviously not valid, np.dot([0,1,2], arr) is valid (though more verbose than using @).
When you do need to extend your code to deal with many matrix multiplications instead of just one, the ND cases for @ are a conceptually straightforward generalization/vectorization of the lower-D cases.
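A short sketch of the first two points, using a throwaway array arr. The exact exception type raised for a scalar operand is an assumption on my part (ValueError in the NumPy versions I know of), so the example catches TypeError as well.

```python
import numpy as np

arr = np.arange(3)

# dot accepts a scalar and quietly falls back to elementwise multiplication.
dot_with_scalar = np.dot(arr, 2)

# The matrix multiplication operator refuses a scalar operand instead of guessing.
try:
    arr @ 2
    matmul_raised = False
except (ValueError, TypeError):
    matmul_raised = True

# A plain list on the left is coerced via the array's reflected operator.
list_times_array = [0, 1, 2] @ arr
```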
I had the same question some time ago. It seems that when one of your matrices is one dimensional, then numpy will figure out automatically what you are trying to do.
The documentation for the dot function has a more specific explanation of the logic applied:
If both a and b are 1-D arrays, it is inner product of vectors
(without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using
matmul or a @ b is preferred.
If either a or b is 0-D (scalar), it is equivalent to multiply and
using numpy.multiply(a, b) or a * b is preferred.
If a is an N-D array and b is a 1-D array, it is a sum product over
the last axis of a and b.
If a is an N-D array and b is an M-D array (where M>=2), it is a sum
product over the last axis of a and the second-to-last axis of b:
In NumPy, a transpose .T reverses the order of dimensions, which means that it doesn't do anything to your one-dimensional array weights.
This is a common source of confusion for people coming from Matlab, in which one-dimensional arrays do not exist. See Transposing a NumPy Array for some earlier discussion of this.
np.dot(x,y) has complicated behavior on higher-dimensional arrays, but its behavior when it's fed two one-dimensional arrays is very simple: it takes the inner product. If we wanted to get the equivalent result as a matrix product of a row and column instead, we'd have to write something like
(x @ y[:, np.newaxis]).item()
adding a trailing dimension to y to turn it into a "column", multiplying, and then converting our one-element array back into a scalar. But np.dot(x,y) is much faster and more efficient, so we just use that.
Edit: actually, this was dumb on my part. You can, of course, just write matrix multiplication x @ y to get equivalent behavior to np.dot for one-dimensional arrays, as tel's excellent answer points out.
The shape of weights.T should be (,5) and not (5,),
suggests some confusion over the shape attribute. shape is an ordinary Python tuple, i.e. just a set of numbers, one for each dimension of the array. That's analogous to the size of a MATLAB matrix.
(5,) is just the way of displaying a 1 element tuple. The , is required because of older Python history of using () as a simple grouping.
In [22]: tuple([5])
Out[22]: (5,)
Thus the , in (5,) does not have a special numpy meaning, and
In [23]: (,5)
File "<ipython-input-23-08574acbf5a7>", line 1
(,5)
^
SyntaxError: invalid syntax
A key difference between numpy and MATLAB is that arrays can have any number of dimensions (up to 32). MATLAB has a lower boundary of 2.
The result is that a 5 element numpy array can have shapes (5,), (1,5), (5,1), (1,5,1), etc.
The handling of a 1d weight array in your example is best explained by the np.dot documentation. Describing it as inner product seems clear enough to me. But I'm also happy with the
sum product over the last axis of a and the second-to-last axis of b
description, adjusted for the case where b has only one axis.
(5,) with (5,n) => (n,) # 5 is the common dimension
(n,5) with (5,) => (n,)
(n,5) with (5,1) => (n,1)
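Those three shape rules can be confirmed directly. This is a sketch with arrays of ones and an arbitrary n = 4.

```python
import numpy as np

n = 4
v = np.ones(5)                                         # shape (5,)

# (5,) with (5, n) => (n,): 5 is the common dimension.
s1 = np.dot(v, np.ones((5, n))).shape

# (n, 5) with (5,) => (n,)
s2 = np.dot(np.ones((n, 5)), v).shape

# (n, 5) with (5, 1) => (n, 1): a 2-D second argument keeps its trailing axis.
s3 = np.dot(np.ones((n, 5)), np.ones((5, 1))).shape
```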
In:
(x1,...,xn' * (R1,...,Rn)
are you missing a )?
(x1,...,xn)' * (R1,...,Rn)
And the * means matrix product? Not elementwise product (.* in MATLAB)? (R1,...,Rn) would have size (n,1). (x1,...,xn)' size (1,n). The product (1,1).
By the way, that raises another difference. MATLAB expands dimensions to the right (n,1,1...). numpy expands them to the left (1,1,n) (if needed by broadcasting). The initial dimensions are the outermost ones. That's not as critical a difference as the lower size 2 boundary, but shouldn't be ignored.
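To illustrate the left-side expansion, here is a small sketch with made-up shapes: a (3,) array lines up with the last axis of a (2, 4, 3) array, while a (4,) array does not, even though 4 matches the middle axis.

```python
import numpy as np

big = np.zeros((2, 4, 3))

# A (3,) array is treated as (1, 1, 3): new axes are added on the left.
a = np.arange(3)
summed_shape = (big + a).shape

# A (4,) array would become (1, 1, 4), whose last axis (4) clashes with 3.
b = np.arange(4)
try:
    big + b
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Adding the trailing axis by hand makes the intent explicit: (4, 1) fits.
fixed_shape = (big + b[:, np.newaxis]).shape
```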
What I am trying to do is take a numpy array representing 3D image data and calculate the hessian matrix for every voxel. My input is a matrix of shape (Z,X,Y) and I can easily take a slice along z and retrieve a single original image.
gx, gy, gz = np.gradient(imgs)
gxx, gxy, gxz = np.gradient(gx)
gyx, gyy, gyz = np.gradient(gy)
gzx, gzy, gzz = np.gradient(gz)
And I can access the hessian for an individual voxel as follows:
x = 100
y = 100
z = 63
H = [[gxx[z][x][y], gxy[z][x][y], gxz[z][x][y]],
[gyx[z][x][y], gyy[z][x][y], gyz[z][x][y]],
[gzx[z][x][y], gzy[z][x][y], gzz[z][x][y]]]
But this is cumbersome and I can't easily slice the data.
I have tried using reshape as follows
H = H.reshape(Z, X, Y, 3, 3)
But when I test this by retrieving the hessian for a specific voxel, the value returned from the reshaped array is completely different from the original array.
I think I could use zip somehow but I have only been able to find that for making lists of tuples.
Bonus: If there's a faster way to accomplish this please let me know. I essentially need to calculate the three eigenvalues of the hessian matrix for every voxel in the 3D data set. Calculating the hessian values is really fast, but finding the eigenvalues for a single 2D image slice takes about 20 seconds. Are there any GPU- or TensorFlow-accelerated libraries for image processing?
We can use a list comprehension to get the hessians -
H_all = np.array([np.gradient(i) for i in np.gradient(imgs)]).transpose(2,3,4,0,1)
Just to give it a bit of explanation : [np.gradient(i) for i in np.gradient(imgs)] loops through the two levels of outputs from np.gradient calls, resulting in a (3 x 3) shaped tensor at the outer two axes. We need these two as the last two axes in the final output. So, we push those at the end with the transpose.
Thus, H_all holds all the hessians and hence we can extract our specific hessian given x,y,z, like so -
x = 100
y = 100
z = 63
H = H_all[z,x,y]
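A small self-contained check of this layout, using a random volume as a hypothetical stand-in for imgs. It also sketches one possible approach to the bonus question that is not part of the answer above: np.linalg.eigvalsh accepts a stack of matrices, so it can be applied to the whole (Z, X, Y, 3, 3) array at once. It assumes each hessian is symmetric and reads only one triangle of each matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
imgs = rng.random((6, 7, 8))          # hypothetical (Z, X, Y) volume

first = np.gradient(imgs)             # three first-derivative arrays
H_all = np.array([np.gradient(g) for g in first]).transpose(2, 3, 4, 0, 1)

# Compare one entry against the explicit per-voxel construction.
z, x, y = 3, 4, 5
explicit_entry = np.gradient(first[0])[1][z, x, y]
stacked_entry = H_all[z, x, y, 0, 1]

# Eigenvalues for every voxel in one call; result has shape (Z, X, Y, 3).
eigvals = np.linalg.eigvalsh(H_all)
```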
I have multiple arrays of the same dimension, or rather a matrix say
data.shape
# (n, m)
I want to interpolate along the m-axis and leave the n-axis alone. Ideally I would get a function which I can call with an x-array of length n.
interpolated(x)
x.shape
# (n,)
I tried
from scipy import interpolate
interpolated = interpolate.interp1d(x=x_points, y=data)
interpolated(x).shape
# (n, n)
but this evaluates every array at the given point. Is there a better way to do it than ugly loops like
interpolated = array(interpolate.interp1d(x=x_points, y=array_) for
array_ in data)
array(func_(xi) for func_, xi in zip(interpolated, x))
Your (n,m)-shaped data is, as you said, a collection of n datasets, each of length m. You're trying to pass this an n-length x array, and expect to obtain an n-length result. That is, you're querying the n independent datasets at n unrelated points.
This makes me believe that you need to use n independent interpolators. There is no real benefit in trying to get away with a single call to an interpolation routine. Interpolation routines as far as I know assume that the target of the interpolation is a single object. Either a multivariate function, or a function that has an array-shaped value; in either case you can query the function one (optionally higher-dimensional) point at a time. For instance, multilinear interpolation works across rows of the input, so there's (again, as far as I know) no way to "interpolate linearly along an axis". In your case, there is absolutely no relationship between the rows of your data, and there's no relationship between query points, so it's also semantically motivated to use n independent interpolators for your problem.
As for convenience, you can shove all those interpolating functions into a single function for ease of use:
interpolated = [interpolate.interp1d(x=x_points, y=array_) for
                array_ in data]
def common_interpolator(x):
    '''interpolate n separate datasets at n separate input points'''
    return array([fun(xx) for fun,xx in zip(interpolated,x)])
This will allow you to use a single call to common_interpolator with an input array_like of length n.
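For a dependency-light illustration of the same one-interpolator-per-row idea, here is a sketch that swaps scipy's interp1d for np.interp (plain linear interpolation). The data values are made up.

```python
import numpy as np

x_points = np.array([0.0, 1.0, 2.0])      # shared sample locations, length m
data = np.array([[0.0, 10.0, 20.0],       # n = 2 independent datasets
                 [5.0, 5.0, 9.0]])
x = np.array([0.5, 1.5])                  # one query point per dataset

# Pair each row with its own query point and interpolate it separately.
result = np.array([np.interp(xi, x_points, row) for xi, row in zip(x, data)])
```

The first row is queried at 0.5 and the second at 1.5, so the result has shape (n,) rather than the (n, n) array the original attempt produced.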
But since you mentioned it in comments, you can actually make use of np.vectorize if you want to pass multiple sets of query points to this function. Here's a complete example with three trivial dummy functions:
import numpy as np
# three scalar (well, or vectorized) functions:
funs = [lambda x,i=i: x+i for i in range(3)]
# define a wrapper for calling them together
def allfuns(xs):
    '''bundled call to functions: n-length input to n-length output'''
    return np.array([fun(x) for fun,x in zip(funs,xs)])
# define a vectorized version of the wrapper, (...,n) to (...,n)-shape
allfuns_vector = np.vectorize(allfuns,signature='(n)->(n)')
# print some examples
x = np.arange(3)
print([fun(xx) for fun,xx in zip(funs,x)])
# [0, 2, 4]
print(allfuns(x))
# [0 2 4]
print(allfuns_vector(x))
# [0 2 4]
print(allfuns_vector([x,x+10]))
# [[ 0  2  4]
#  [10 12 14]]
As you can see, all of the above work the same way for a 1d input array. But we can pass a (k,n)-shaped array to the vectorized version and it will perform the interpolation row-wise, that is, each length-n row (each [i,:] slice) will be fed to the original interpolator bundle. As far as I know np.vectorize is essentially a wrapper for a for loop, but at least it makes calling your functions more convenient.
I have the following line of code in MATLAB which I am trying to convert to Python numpy:
pred = traindata(:,2:257)*beta;
In Python, I have:
pred = traindata[ : , 1:257]*beta
beta is a 256 x 1 array.
In MATLAB,
size(pred) = 1389 x 1
But in Python,
pred.shape = (1389L, 256L)
So, I found out that multiplying by the beta array is producing the difference between the two arrays.
How do I write the original Python line, so that the size of pred is 1389 x 1, like it is in MATLAB when I multiply by my beta array?
I suspect that beta is in fact a 1D numpy array. In numpy, 1D arrays are neither row nor column vectors, whereas MATLAB always makes this distinction. They simply have a single axis and no orientation. If you need a column vector, you have to manually introduce a new singleton dimension to the beta vector to facilitate the multiplication. On top of this, the * operator actually performs element-wise multiplication. To perform matrix-vector or matrix-matrix multiplication, use numpy's dot function (or the equivalent @ operator).
Therefore, you must do something like this:
import numpy as np # Just in case
pred = np.dot(traindata[:, 1:257], beta[:,None])
beta[:,None] will create a 2D numpy array where the elements from the 1D array are populated along the rows, effectively making a column vector (i.e. 256 x 1). However, if you have already done this on beta, then you don't need to introduce the new singleton dimension. Just use dot normally:
pred = np.dot(traindata[:, 1:257], beta)
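To see the three outcomes side by side, here is a sketch with random data of the sizes mentioned in the question. The contents of traindata and beta are made up; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
traindata = rng.random((1389, 257))   # stand-in for the real training data
beta = rng.random(256)                # 1-D array, shape (256,)

# Element-wise multiplication broadcasts beta across every row.
elementwise = traindata[:, 1:257] * beta           # shape (1389, 256)

# Matrix product with an explicit column vector matches MATLAB's 1389 x 1.
pred = np.dot(traindata[:, 1:257], beta[:, None])  # shape (1389, 1)

# Matrix product with the 1-D beta gives the same numbers as a flat array.
pred_1d = np.dot(traindata[:, 1:257], beta)        # shape (1389,)
```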