Related
I am trying to calculate the first and second order moments for a portfolio of stocks (i.e. expected return and standard deviation).
expected_returns_annual
Out[54]:
ticker
adj_close CNP 0.091859
F -0.007358
GE 0.095399
TSLA 0.204873
WMT -0.000943
dtype: float64
type(expected_returns_annual)
Out[55]: pandas.core.series.Series
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(expected_returns_annual, weights)
So normally the expected return is calculated by
(x1,...,xn' * (R1,...,Rn)
with x1,...,xn are weights with a constraint that all the weights have to sum up to 1 and ' means that the vector is transposed.
Now I am wondering a bit about the numpy dot function, because
returns = np.dot(expected_returns_annual, weights)
and
returns = np.dot(expected_returns_annual, weights.T)
give the same results.
I tested also the shape of weights.T and weights.
weights.shape
Out[58]: (5,)
weights.T.shape
Out[59]: (5,)
The shape of weights.T should be (,5) and not (5,), but numpy displays them as equal (I also tried np.transpose, but there is the same result)
Does anybody know why numpy behave this way? In my opinion the np.dot product automatically shape the vector the right why so that the vector product work well. Is that correct?
Best regards
Tom
The semantics of np.dot are not great
As Dominique Paul points out, np.dot has very heterogenous behavior depending on the shapes of the inputs. Adding to the confusion, as the OP points out in his question, given that weights is a 1D array, np.array_equal(weights, weights.T) is True (array_equal tests for equality of both value and shape).
Recommendation: use np.matmul or the equivalent # instead
If you are someone just starting out with Numpy, my advice to you would be to ditch np.dot completely. Don't use it in your code at all. Instead, use np.matmul, or the equivalent operator #. The behavior of # is more predictable than that of np.dot, while still being convenient to use. For example, you would get the same dot product for the two 1D arrays you have in your code like so:
returns = expected_returns_annual # weights
You can prove to yourself that this gives the same answer as np.dot with this assert:
assert expected_returns_annual # weights == expected_returns_annual.dot(weights)
Conceptually, # handles this case by promoting the two 1D arrays to appropriate 2D arrays (though the implementation doesn't necessarily do this). For example, if you have x with shape (N,) and y with shape (M,), if you do x # y the shapes will be promoted such that:
x.shape == (1, N)
y.shape == (M, 1)
Complete behavior of matmul/#
Here's what the docs have to say about matmul/# and the shapes of inputs/outputs:
If both arguments are 2-D they are multiplied like conventional matrices.
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.
Notes: the arguments for using # over dot
As hpaulj points out in the comments, np.array_equal(x.dot(y), x # y) for all x and y that are 1D or 2D arrays. So why do I (and why should you) prefer #? I think the best argument for using # is that it helps to improve your code in small but significant ways:
# is explicitly a matrix multiplication operator. x # y will raise an error if y is a scalar, whereas dot will make the assumption that you actually just wanted elementwise multiplication. This can potentially result in a hard-to-localize bug in which dot silently returns a garbage result (I've personally run into that one). Thus, # allows you to be explicit about your own intent for the behavior of a line of code.
Because # is an operator, it has some nice short syntax for coercing various sequence types into arrays, without having to explicitly cast them. For example, [0,1,2] # np.arange(3) is valid syntax.
To be fair, while [0,1,2].dot(arr) is obviously not valid, np.dot([0,1,2], arr) is valid (though more verbose than using #).
When you do need to extend your code to deal with many matrix multiplications instead of just one, the ND cases for # are a conceptually straightforward generalization/vectorization of the lower-D cases.
I had the same question some time ago. It seems that when one of your matrices is one dimensional, then numpy will figure out automatically what you are trying to do.
The documentation for the dot function has a more specific explanation of the logic applied:
If both a and b are 1-D arrays, it is inner product of vectors
(without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using
matmul or a # b is preferred.
If either a or b is 0-D (scalar), it is equivalent to multiply and
using numpy.multiply(a, b) or a * b is preferred.
If a is an N-D array and b is a 1-D array, it is a sum product over
the last axis of a and b.
If a is an N-D array and b is an M-D array (where M>=2), it is a sum
product over the last axis of a and the second-to-last axis of b:
In NumPy, a transpose .T reverses the order of dimensions, which means that it doesn't do anything to your one-dimensional array weights.
This is a common source of confusion for people coming from Matlab, in which one-dimensional arrays do not exist. See Transposing a NumPy Array for some earlier discussion of this.
np.dot(x,y) has complicated behavior on higher-dimensional arrays, but its behavior when it's fed two one-dimensional arrays is very simple: it takes the inner product. If we wanted to get the equivalent result as a matrix product of a row and column instead, we'd have to write something like
np.asscalar(x # y[:, np.newaxis])
adding a trailing dimension to y to turn it into a "column", multiplying, and then converting our one-element array back into a scalar. But np.dot(x,y) is much faster and more efficient, so we just use that.
Edit: actually, this was dumb on my part. You can, of course, just write matrix multiplication x # y to get equivalent behavior to np.dot for one-dimensional arrays, as tel's excellent answer points out.
The shape of weights.T should be (,5) and not (5,),
suggests some confusion over the shape attribute. shape is an ordinary Python tuple, i.e. just a set of numbers, one for each dimension of the array. That's analogous to the size of a MATLAB matrix.
(5,) is just the way of displaying a 1 element tuple. The , is required because of older Python history of using () as a simple grouping.
In [22]: tuple([5])
Out[22]: (5,)
Thus the , in (5,) does not have a special numpy meaning, and
In [23]: (,5)
File "<ipython-input-23-08574acbf5a7>", line 1
(,5)
^
SyntaxError: invalid syntax
A key difference between numpy and MATLAB is that arrays can have any number of dimensions (upto 32). MATLAB has a lower boundary of 2.
The result is that a 5 element numpy array can have shapes (5,), (1,5), (5,1), (1,5,1)`, etc.
The handling of a 1d weight array in your example is best explained the np.dot documentation. Describing it as inner product seems clear enough to me. But I'm also happy with the
sum product over the last axis of a and the second-to-last axis of b
description, adjusted for the case where b has only one axis.
(5,) with (5,n) => (n,) # 5 is the common dimension
(n,5) with (5,) => (n,)
(n,5) with (5,1) => (n,1)
In:
(x1,...,xn' * (R1,...,Rn)
are you missing a )?
(x1,...,xn)' * (R1,...,Rn)
And the * means matrix product? Not elementwise product (.* in MATLAB)? (R1,...,Rn) would have size (n,1). (x1,...,xn)' size (1,n). The product (1,1).
By the way, that raises another difference. MATLAB expands dimensions to the right (n,1,1...). numpy expands them to the left (1,1,n) (if needed by broadcasting). The initial dimensions are the outermost ones. That's not as critical a difference as the lower size 2 boundary, but shouldn't be ignored.
I have a question how to efficiently apply a function which takes an m-dimensional slice of a n-dimensional array as an input.
For example, I have a n-dimensional array of shape (i,j,k,l). And on the dimensions (j,l), I want to apply the function, which gives me back a matrix of shape (j,l). The resulting numpy array should again have the shape (i,j,k,l).
For example I want to apply the following, normalisation function
def norm(arr2d):
return arr2d - np.mean(arr2d)
over the array
arrnd = np.arange(2*3*4*5).reshape(2,3,4,5) # Shape is (2,3,4,5)
on the slice (j,l).
The result I want to achieve I would get via a (slow?) Python list comprehension and moving axes.
result = np.asarray([ [ f(arrnd[:,j,:,l]) for l in range(5) ] for j in range(3)]) # Shape is (3,5,2,4)
result = np.moveaxis(np.moveaxis(result,2,0),2,3).shape # Shape is (2,3,4,5) again
Is there any better, more "numpyic" way to achieve this, without any involved loops?
I alreay looked at np.apply_along_axis() and np.apply_over_axes() but the former only works for 1-d functions, and the latter might only work, if my function is implemented as a ufunc.
The example I provided is just a toy example. The solution should work for any python function.
((If normalising a slice would be my specific problem, I could have circumenvented the python loop and moveaxis by using the ufunc's axes=(..).))
I have a different shape of 3D matrices. Such as:
Matrix shape = [5,10,2048]
Matrix shape = [5,6,2048]
Matrix shape = [5,1,2048]
and so on....
I would like to put them into big matrix, but I am normally getting a shape error (since they have different shape) when I am trying to use numpy.asarray(list_of_matrix) function.
What would be your recommendation to handle such a case?
My implementation was like the following:
matrices = []
matrices.append(mat1)
matrices.append(mat2)
matrices.append(mat3)
result_matrix = numpy.asarray(matrices)
and having shape error!!
UPDATE
I am willing to have a result matrix that is 4D.
Thank you.
I'm not entirely certain if this would work for you, but it looks as though your matrices only disagree along the 1st axis, so why not concatenate them:
e.g.
>>> import numpy as np
>>> c=np.zeros((5,10,2048))
>>> d=np.zeros((5,6,2048))
>>> e=np.zeros((5,1,2048))
>>> f=np.concatenate((c,d,e),axis=1)
>>> f.shape
(5, 17, 2048)
Now, you'd have to keep track of which indices of the 1st axis corresponds to which matrices, but maybe this could work for you?
I'm trying to figure out what the parameters to the reshape() function are below.
I didn't find anything about pickle having a reshape() method, and import cPickle as pickle, import numpy as np were given in the file, so I'm assuming (maybe a bad assumption) that the reshape function is because of numpy. I found the definition of the reshape method for numpy (also below). However, I can't tell which arguments belong to which parameter.
Because this thing is supposed to load in picture data, I'm guessing 32,32 might be the image size, and would correspond to the newshape parameter?
I don't have a clue what 1000,3 are doing: the term "array_like" for the a parameter is confusing, and I don't know why 4 parameters are given if there's only 3 for the method, or how python would know that 32,32 is one argument, if it really is (why no []?)
Basically, what parameter does each argument (passed in) belong to? And how on earth can it tell? And how did X go from being an object from the pickle load that has numpy methods on it? Is that even possible?
datadict = pickle.load(f)
X = datadict['data']
Y = datadict['labels']
X = X.reshape(10000, 3, 32, 32)
The documentation you've linked is slightly different than what is actually happening, which may explain your confusion. The actual documentation, which is effectively the same function but set up as an object method instead of a library method, is here.
In this case, the (10000, 3, 32, 32) corresponds to the shape of the output array. So your output is actually a 4-dimensional array with shape (10000, 3, 32, 32). I suspect that if this is supposed to be image data, you could have a 32x32 image with RGB values and 1,000 images.
Additionally, pickle stores type information when you store objects, so this is how it knows that the object is a numpy array!
This loads a dictionary from the file:
datadict = pickle.load(f)
Then select two values from the dictionary. Ordinary dictionary key indexing:
X = datadict['data']
Y = datadict['labels']
Evidently X is a numpy array. reshape is a method (a function that 'belongs' to the array).
X = X.reshape(10000, 3, 32, 32)
A numpy array has a property called shape. After this reshape, X.shape should return (10000, 3, 32, 32), the shape of a 4dimensional array. The numbers are the newshape parameter described in the documentation.
The documentation is for the function version of reshape. It would be used as:
X = np.reshape(X, (10000, 3, 32, 32))
Same functionality, just a different way of invoking it.
To go on from here you probably need to study numpy documentation.
The documentation for the method version is:
a.reshape(shape, order='C')
Returns an array containing the same data with a new shape.
Refer to numpy.reshape for full documentation.
I have an array of values and would like to create a matrix from that, where each row is my starting point vector multiplied by a sample from a (normal) distribution.
The number of rows of this matrix will then vary in dependence from the number of samples I want.
%pylab
my_vec = array([1,2,3])
my_rand_vec = my_vec*randn(100)
Last command does not work, because array shapes do not match.
I could think of using a for loop, but I am trying to leverage on array operations.
Try this
my_rand_vec = my_vec[None,:]*randn(100)[:,None]
For small numbers I get for example
import numpy as np
my_vec = np.array([1,2,3])
my_rand_vec = my_vec[None,:]*np.random.randn(5)[:,None]
my_rand_vec
# array([[ 0.45422416, 0.90844831, 1.36267247],
# [-0.80639766, -1.61279531, -2.41919297],
# [ 0.34203295, 0.6840659 , 1.02609885],
# [-0.55246431, -1.10492863, -1.65739294],
# [-0.83023829, -1.66047658, -2.49071486]])
Your solution my_vec*rand(100) does not work because * corresponds to the element-wise multiplication which only works if both arrays have identical shapes.
What you have to do is adding an additional dimension using [None,:] and [:,None] such that numpy's broadcasting works.
As a side note I would recommend not to use pylab. Instead, use import as in order to include modules as pointed out here.
It is the outer product of vectors:
my_rand_vec = numpy.outer(randn(100), my_vec)
You can pass the dimensions of the array you require to numpy.random.randn:
my_rand_vec = my_vec*np.random.randn(100,3)
To multiply each vector by the same random number, you need to add an extra axis:
my_rand_vec = my_vec*np.random.randn(100)[:,np.newaxis]