How to use numpy empty_like - python

I've got the following array:
rxn_probability = [1. 0. 0.]
And I want to create another array, num_rxn, that has the same shape and size of rxn_probability, which contains a number of the reactions, so in this case num_rxn would be: [1, 2, 3]. Starting at 1 and increasing until it reaches the same size and shape of rxn_probability, so that if I change the size of rxn_probability the size and shape of num_rxn will automatically be changed.
So far I've tried:
num_rxn = np.array(range(len(rxn_probability + 1)))
(also tried using arange in a similar way)
But this outputs:
[0 1 2]
which isn't what I want because it doesn't start at 1 or end at 3.
I've been reading about numpy.empty_like but I'm not sure if that would be the best or right solution. Any ideas?
Cheers

You can use np.arange and then reshape to fit the shape of rxn_probability:
num_rxn = np.arange(1, rxn_probability.size + 1).reshape(rxn_probability.shape)
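As a quick check of that one-liner, here is a minimal sketch. The helper name `number_like` and the 2-D array `grid` are hypothetical, added only to show that the result follows the shape of whatever array is passed in.

```python
import numpy as np

def number_like(arr):
    """Return 1, 2, ..., arr.size arranged in the same shape as arr."""
    return np.arange(1, arr.size + 1).reshape(arr.shape)

rxn_probability = np.array([1.0, 0.0, 0.0])
num_rxn = number_like(rxn_probability)
print(num_rxn)  # [1 2 3]

grid = np.zeros((2, 3))  # hypothetical 2-D input
print(number_like(grid))
# [[1 2 3]
#  [4 5 6]]
```

np.empty_like only allocates an uninitialized array with the same shape and dtype; it does not fill in any values, so np.arange is the better fit here.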

Related

Numpy appending two-dimensional arrays together

I am trying to create a function which exponentiates a 2-D matrix and keeps the result in a 3D array, where the first dimension is indexing the exponent. This is important because the rows of the matrix I am exponentiating represent information about different vertices on a graph. So for example if we have A, A^2, A^3, each is shape (50,50) and I want a matrix D = (3,50,50) so that I can go D[:,1,:] to retrieve all the information about node 1 and be able to do matrix multiplication with that. My code is currently as
def expo(times,A,n):
    temp = A
    result = csr_matrix.toarray(temp)
    for i in range(0,times):
        temp = np.dot(temp,A)
        if i == 0:
            result = np.array([result,csr_matrix.toarray(temp)]) # this creates a (2,50,50) array
        if i > 0:
            result = np.append(result,csr_matrix.toarray(temp),axis=0) # this does not work
    return result
However, this is not working because in the "i>0" case the temp array has shape (50,50) and cannot be appended. I am not sure how to make this work and I am rather confused by the dimensionality in Numpy, e.g. why things are (50,1) sometimes and just (50,) other times. Would anyone be able to help me make this code work and explain generally how these things should be done in Numpy?
Documentation reference
If you want to stack matrices in numpy, you can use the stack function.
If you also want the index to correspond to the exponent, you might want to add an identity matrix to the beginning of your output:
MWE
import numpy as np
def expo(A, n):
    result = [np.eye(len(A)), A]
    for _ in range(n-1):
        result.append(result[-1].dot(A))
    return np.stack(result, axis=0)
    # If you do not really need the 3D array,
    # you could also just return the list
result = expo(np.array([[1,-2],[-2,1]]), 3)
print(result)
# [[[  1.   0.]
#   [  0.   1.]]
#
#  [[  1.  -2.]
#   [ -2.   1.]]
#
#  [[  5.  -4.]
#   [ -4.   5.]]
#
#  [[ 13. -14.]
#   [-14.  13.]]]
print(result[1])
# [[ 1. -2.]
#  [-2.  1.]]
Comments
As you can see, we first simply create the list of matrices, and then convert them to an array at the end. I am not sure if you really need the 3D array though, as you could also just index the list that was created, but that depends on your use case, if that is convenient or not.
I guess the axis keyword argument for a lot of numpy functions can be confusing at first, but the documentation usually has good examples that, combined with some trial and error, should get you pretty far. For example for numpy.stack, the very first example is indeed exactly what you want to do.
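On the (50,) versus (50,1) confusion and the failing np.append call, a small sketch may help. The 2 x 3 array below is a hypothetical stand-in for the (50,50) matrices; the point is only how indexing and np.newaxis change the number of axes.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # small stand-in for a (50, 50) matrix

row = A[0]                       # an integer index drops that axis
col = A[:, 0]                    # also 1-D
col_2d = A[:, [0]]               # a list index keeps the axis
col_2d_alt = A[:, 0][:, np.newaxis]  # np.newaxis adds it back explicitly

print(row.shape, col.shape, col_2d.shape, col_2d_alt.shape)
# (3,) (2,) (2, 1) (2, 1)

# np.stack creates a NEW leading axis ...
stacked = np.stack([A, A], axis=0)
print(stacked.shape)             # (2, 2, 3)

# ... while np.append with axis=0 joins along an EXISTING axis, so the
# appended matrix needs a leading axis of length 1 first.
appended = np.append(stacked, A[np.newaxis], axis=0)
print(appended.shape)            # (3, 2, 3)
```

This is why appending a (50,50) array to a (2,50,50) array fails: the two operands must have the same number of dimensions.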

What is the fastest way to stack numpy arrays in a loop?

I have code that generates two numpy arrays (data_transform) within a for loop. The first iteration generates an array of shape (40, 2) and the second one of shape (175, 2). I want to concatenate these two arrays into one array of shape (215, 2). I tried np.concatenate and np.append, but both give me an error saying the arrays must be the same size. Here is an example of what my code looks like:
result_arr = np.array([])
for label in labels_set:
    data = [index for index, value in enumerate(labels_list) if value == label]
    for i in data:
        sub_corpus.append(corpus[i])
    data_sub_tfidf = vec.fit_transform(sub_corpus)
    data_transform = pca.fit_transform(data_sub_tfidf)
    #Append array
    sub_corpus = []
I have also used np.row_stack but nothing else gives me a value of (175, 2) which is the second array I want to concatenate.
What @hpaulj was trying to say with
Stick with list append when doing loops.
is
# use a normal list
result_arr = []
for label in labels_set:
    data_transform = pca.fit_transform(data_sub_tfidf)
    # append the data_transform object to that list
    # Note: this is not np.append(), which is slow here
    result_arr.append(data_transform)
# and stack it after the loop
# This prevents slow memory allocation in the loop.
# So only one large chunk of memory is allocated since
# the final size of the concatenated array is known.
result_arr = np.concatenate(result_arr)
# or
result_arr = np.vstack(result_arr)
# or, only if every array has the same shape (np.stack adds a new axis
# and raises ValueError for shapes like (40, 2) and (175, 2)):
# result_arr = np.stack(result_arr, axis=0)
Your arrays don't really have different dimensions. They have one different dimension, the other one is identical. And in that case you can always stack along the "different" dimension.
Using np.append (which calls np.concatenate internally), initializing "c" as an empty array with the right number of columns:
a = np.array([[8,3,1],[2,5,1],[6,5,2]])
b = np.array([[2,5,1],[2,5,2]])
matrix = [a,b]
c = np.empty([0,matrix[0].shape[1]])
for v in matrix:
    c = np.append(c, v, axis=0)
Output:
[[8. 3. 1.]
[2. 5. 1.]
[6. 5. 2.]
[2. 5. 1.]
[2. 5. 2.]]
If you have an array a of size (40, 2) and an array b of size (175,2), you can simply have a final array of size (215, 2) using np.concatenate([a,b]).
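Putting the list-then-concatenate advice together, a minimal sketch might look like the following. The random arrays are placeholders for the `pca.fit_transform(...)` output, and the row counts 40 and 175 are taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
row_counts = [40, 175]  # one entry per loop iteration, as in the question

chunks = []
for n_rows in row_counts:
    # placeholder for data_transform = pca.fit_transform(data_sub_tfidf)
    data_transform = rng.normal(size=(n_rows, 2))
    chunks.append(data_transform)  # plain list append inside the loop

# a single concatenation after the loop; only axis 0 differs between chunks
result_arr = np.concatenate(chunks, axis=0)
print(result_arr.shape)  # (215, 2)
```

The arrays only need to agree on every axis except the one being joined, which is why (40, 2) and (175, 2) concatenate cleanly along axis 0.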

padding a list of torch tensors (or numpy arrays)

Let's say I have a list as the following:
l = [torch.randn(2,3), torch.randn(2,4),torch.randn(2,5)]
I want to zero pad all of them in the second dimension, so they will extend as far as 5 elements (5 being the largest size of the second dimension among the three tensors). How can I do this? I tried this but failed:
from torch.nn.utils.rnn import pad_sequence
pad_sequence(l, batch_first=True, padding_value=0)
which caused the following error:
RuntimeError: The expanded size of the tensor (3) must match the existing size (4) at non-singleton dimension 1. Target sizes: [2, 3]. Tensor sizes: [2, 4]
The equivalent answer in Numpy would also be appreciated.
One option is to use np.pad.
Example:
import numpy as np
a = np.random.randn(2, 3)
b = np.pad(a, [(0, 0), (0, 2)], mode='constant')
Printing a gives
[[ 1.22721163 1.23456672 0.51948003]
[ 0.16545496 0.06609003 -0.32071653]]
Printing b gives
[[ 1.22721163 1.23456672 0.51948003 0. 0. ]
[ 0.16545496 0.06609003 -0.32071653 0. 0. ]]
The second argument of pad is pad_width which is a list of before/after paddings for each dimension. So in this example no padding in the first dimension and two paddings at the end of the second dimension.
There are lots of other mode options you can use so check out the docs.
For your particular problem you'll need to add an extra step to work out the padding for each array.
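That extra step could be sketched as follows. The helper name `pad_to_common_width` is hypothetical, and the sketch assumes every array is 2-D and only the second dimension varies.

```python
import numpy as np

def pad_to_common_width(arrays):
    """Zero-pad 2-D arrays on the right of axis 1 up to the widest array."""
    max_width = max(arr.shape[1] for arr in arrays)
    return [
        np.pad(arr, [(0, 0), (0, max_width - arr.shape[1])], mode="constant")
        for arr in arrays
    ]

arrays = [np.random.randn(2, 3), np.random.randn(2, 4), np.random.randn(2, 5)]
padded = pad_to_common_width(arrays)
print([p.shape for p in padded])  # [(2, 5), (2, 5), (2, 5)]

# once the shapes agree, the list can be stacked into one array
batch = np.stack(padded)
print(batch.shape)  # (3, 2, 5)
```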
Edit
For pytorch I think you want torch.nn.functional.pad e.g.
import torch
t = torch.randn(2, 3)
torch.nn.functional.pad(t, (0, 2))
Edit 2
torch.nn.utils.rnn.pad_sequence requires the trailing dimensions of all the tensors in the list to be the same, so you need to do some transposing for it to work nicely:
import torch
# l = [torch.randn(2,3), torch.randn(2,4),torch.randn(2,5)]
# l = [i.transpose(0, 1) for i in l]
# or simply make your tensors with switched dimensions
l = [torch.randn(3,2), torch.randn(4,2),torch.randn(5,2)]
out = torch.nn.utils.rnn.pad_sequence(l, batch_first=True)
# out will now be a tensor with shape (3, 5, 2)
# You can transpose it back to (3, 2, 5) with
out = out.transpose(1, 2)

python, tensorflow, how to get a tensor shape with half the features

I need the shape of a tensor, except instead of feature_size as the -1 dimension I need feature_size//2
The code I'm currently using is
_, half_output = tf.split(output,2,axis=-1)
half_shape = tf.shape(half_output)
This works but it's incredibly inelegant. I don't need an extra copy of half the tensor, I just need that shape. I've tried to do this other ways but nothing besides this bosh solution has worked yet.
Anyone know a simple way to do this?
A simple way to get the shape with the last value halved:
half_shape = tf.shape(output[..., 1::2])
What it does is simply iterate output in its last dimension with step 2, starting from the second element (index 1).
The ... doesn't touch other dimensions. As a result, you will have output[..., 1::2] with the same dimensions as output, except for the last one, which will be sampled like the following example, resulting in half the original value.
>>> a = np.random.rand(5,5)
>>> a
array([[ 0.21553665, 0.62008421, 0.67069869, 0.74136913, 0.97809012],
[ 0.70765302, 0.14858418, 0.47908281, 0.75706245, 0.70175868],
[ 0.13786186, 0.23760233, 0.31895335, 0.69977537, 0.40196103],
[ 0.7601455 , 0.09566717, 0.02146819, 0.80189659, 0.41992885],
[ 0.88053697, 0.33472285, 0.84303012, 0.10148065, 0.46584882]])
>>> a[..., 1::2]
array([[ 0.62008421, 0.74136913],
[ 0.14858418, 0.75706245],
[ 0.23760233, 0.69977537],
[ 0.09566717, 0.80189659],
[ 0.33472285, 0.10148065]])
This half_shape prints the following Tensor:
Tensor("Shape:0", shape=(3,), dtype=int32)
Alternatively you could get the shape of output and create the shape you want manually:
s = output.get_shape().as_list()
half_shape = tf.TensorShape(s[:-1] + [s[-1] // 2])
This half_shape prints a TensorShape showing the shape halved in the last dimension.

Numpy Broadcast to perform euclidean distance vectorized

I have matrices that are 2 x 4 and 3 x 4. I want to find the euclidean distance across rows, and get a 2 x 3 matrix at the end. Here is the code with one for loop that computes the euclidean distance for every row vector in a against all b row vectors. How do I do the same without using for loops?
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
dists = np.zeros((2, 3))
for i in range(2):
    dists[i] = np.sqrt(np.sum(np.square(a[i] - b), axis=1))
Here are the original input variables:
A = np.array([[1,1,1,1],[2,2,2,2]])
B = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
A
# array([[1, 1, 1, 1],
# [2, 2, 2, 2]])
B
# array([[1, 2, 3, 4],
# [1, 1, 1, 1],
# [1, 2, 1, 9]])
A is a 2x4 array.
B is a 3x4 array.
We want to compute the Euclidean distance matrix in one entirely vectorized operation, where dist[i,j] contains the distance between the ith instance in A and the jth instance in B. So dist is 2x3 in this example.
The distance
$$\mathrm{dist}_{ij} = \sqrt{\sum_{k} (A_{ik} - B_{jk})^2}$$
could ostensibly be written with numpy as
dist = np.sqrt(np.sum(np.square(A-B))) # DOES NOT WORK
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# ValueError: operands could not be broadcast together with shapes (2,4) (3,4)
However, as shown above, the problem is that the element-wise subtraction operation A-B involves incompatible array sizes, specifically the 2 and 3 in the first dimension.
A has dimensions 2 x 4
B has dimensions 3 x 4
In order to do element-wise subtraction, we have to pad either A or B to satisfy numpy's broadcast rules. I'll choose to pad A with an extra dimension so that it becomes 2 x 1 x 4, which allows the arrays' dimensions to line up for broadcasting. For more on numpy broadcasting, see the tutorial in the scipy manual and the final example in this tutorial.
You can perform the padding with either np.newaxis value or with the np.reshape command. I show both below:
# First approach is to add the extra dimension to A with np.newaxis
A[:,np.newaxis,:] has dimensions 2 x 1 x 4
B has dimensions 3 x 4
# Second approach is to reshape A with np.reshape
np.reshape(A, (2,1,4)) has dimensions 2 x 1 x 4
B has dimensions 3 x 4
As you can see, using either approach will allow the dimensions to line up. I'll use the first approach with np.newaxis. So now, this will work to create A-B, which is a 2x3x4 array:
diff = A[:,np.newaxis,:] - B
# Alternative approach:
# diff = np.reshape(A, (2,1,4)) - B
diff.shape
# (2, 3, 4)
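If you want to check whether two shapes line up before doing the arithmetic, np.broadcast_shapes can do that without allocating any data. This is a small sketch that assumes NumPy 1.20 or newer, where that function is available.

```python
import numpy as np

# (2, 1, 4) against (3, 4): shapes are compared from the right,
# and a missing or length-1 axis is stretched to match.
ok_shape = np.broadcast_shapes((2, 1, 4), (3, 4))
print(ok_shape)  # (2, 3, 4)

# (2, 4) against (3, 4): the first axes are 2 and 3, neither is 1,
# so broadcasting fails just like A - B did.
try:
    np.broadcast_shapes((2, 4), (3, 4))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False
```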
Now we can put that difference expression into the dist equation statement to get the final result:
dist = np.sqrt(np.sum(np.square(A[:,np.newaxis,:] - B), axis=2))
dist
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Note that the sum is over axis=2, which means take the sum over the 2x3x4 array's third axis (where the axis id starts with 0).
If your arrays are small, then the above command will work just fine. However, if you have large arrays, then you may run into memory issues. Note that in the above example, numpy internally created a 2x3x4 array to perform the broadcasting. If we generalize A to have dimensions a x z and B to have dimensions b x z, then numpy will internally create an a x b x z array for broadcasting.
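To put rough numbers on that, here is a back-of-the-envelope sketch. The helper name `broadcast_diff_nbytes` and the larger sizes are hypothetical, and the estimate assumes the intermediate array is stored densely with the given dtype (8 bytes per element for float64).

```python
import numpy as np

def broadcast_diff_nbytes(a_rows, b_rows, z, dtype=np.float64):
    """Bytes needed for the a x b x z intermediate difference array."""
    return a_rows * b_rows * z * np.dtype(dtype).itemsize

small = broadcast_diff_nbytes(2, 3, 4)
print(small)  # 192

# hypothetical larger problem: 1000 x 512 against 10000 x 512
large = broadcast_diff_nbytes(1000, 10000, 512)
print(large / 2**30, "GiB")
```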
We can avoid creating this intermediate array by doing some mathematical manipulation. Because you are computing the Euclidean distance as a sum-of-squared-differences, we can take advantage of the mathematical fact that a sum-of-squared-differences can be expanded:
$$\sum_{k} (A_{ik} - B_{jk})^2 = \sum_{k} A_{ik}^2 - 2 \sum_{k} A_{ik} B_{jk} + \sum_{k} B_{jk}^2$$
Note that the middle term involves the sum over element-wise multiplication. This sum over multiplications is better known as a dot product. Because A and B are each a matrix, this operation is actually a matrix multiplication. We can thus rewrite the above as:
$$\sum_{k} (A_{ik} - B_{jk})^2 = \sum_{k} A_{ik}^2 - 2\,(A B^{T})_{ij} + \sum_{k} B_{jk}^2$$
We can then write the following numpy code:
threeSums = np.sum(np.square(A)[:,np.newaxis,:], axis=2) - 2 * A.dot(B.T) + np.sum(np.square(B), axis=1)
dist = np.sqrt(threeSums)
dist
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Note that the answer above is the same as in the previous implementation. Again, the advantage here is that we do not need to create the intermediate 2x3x4 array for broadcasting.
For completeness, let's double-check that the dimensions of each summand in threeSums allowed broadcasting.
np.sum(np.square(A)[:,np.newaxis,:], axis=2) has dimensions 2 x 1
2 * A.dot(B.T) has dimensions 2 x 3
np.sum(np.square(B), axis=1) has shape (3,), which broadcasts as 1 x 3
So, as expected, the final dist array has dimensions 2x3.
This use of the dot product in lieu of sum of element-wise multiplication is also discussed in this tutorial.
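As a sanity check, the expanded form can be compared against the direct broadcasting version. This is a minimal sketch: the function name `pairwise_dist` is hypothetical, and the clipping step is an added precaution, since floating-point round-off in the expanded form can produce tiny negative values that would turn into NaN under np.sqrt.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (a x z) and rows of B (b x z)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a_sq = np.sum(A * A, axis=1)[:, np.newaxis]   # shape (a, 1)
    b_sq = np.sum(B * B, axis=1)[np.newaxis, :]   # shape (1, b)
    sq = a_sq - 2.0 * (A @ B.T) + b_sq            # shape (a, b)
    np.maximum(sq, 0.0, out=sq)                   # clip round-off negatives
    return np.sqrt(sq)

A = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])
B = np.array([[1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 1, 9]])

fast = pairwise_dist(A, B)
direct = np.sqrt(np.sum(np.square(A[:, np.newaxis, :] - B), axis=2))
print(np.allclose(fast, direct))  # True
```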
I had the same problem recently while working on a deep learning assignment (Stanford cs231n, Assignment 1), but when I used
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
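there was an error:
MemoryError
That means I ran out of memory (in fact, that produced an intermediate array of 500*5000*1024, which is huge).
To prevent that error, we can expand the squared distance so that no 3D intermediate is needed:
$$\lVert a_i - b_j \rVert^2 = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i \cdot b_j$$
code: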
import numpy as np
aSumSquare = np.sum(np.square(a), axis=1)
bSumSquare = np.sum(np.square(b), axis=1)
mul = np.dot(a, b.T)
dists = np.sqrt(aSumSquare[:, np.newaxis] + bSumSquare - 2*mul)
Simply use np.newaxis at the right place:
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
This functionality is already included in scipy's spatial module and I recommend using it, as it is vectorized and highly optimized under the hood. But, as the other answer shows, there are ways you can do this yourself.
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
np.sqrt((np.square(a[:,np.newaxis]-b).sum(axis=2)))
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
from scipy.spatial.distance import cdist
cdist(a,b)
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
Using numpy.linalg.norm also works well with broadcasting. Specifying an integer value for axis will use a vector norm, which defaults to Euclidean norm.
import numpy as np
a = np.array([[1,1,1,1],[2,2,2,2]])
b = np.array([[1,2,3,4],[1,1,1,1],[1,2,1,9]])
np.linalg.norm(a[:, np.newaxis] - b, axis = 2)
# array([[ 3.74165739, 0. , 8.06225775],
# [ 2.44948974, 2. , 7.14142843]])
