I have a 2D MxN array A, each row of which is a sequence of indices, padded with -1's at the end, e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B, i.e. for each row, only the positions listed in A retain their values, and the values at all other locations are set to 0.0. The result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in jax.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast against each other. The column np.arange(B.shape[0]).reshape(-1, 1) pairs every element of a given row of A with the corresponding row index in B and result.
This does not yet address the fact that -1 is a valid numpy index (it selects the last column), so each -1 in A spuriously copies the last column's value. You need to clear that element whenever 4 (the last column's index) is not genuinely present in that row:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True], indicating that even though the second row has a -1 in it, it also contains a 4.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
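Putting both steps together, a minimal runnable sketch using the arrays from the question (the variable names are mine):

import numpy as np

A = np.array([[ 2,  1, -1, -1, -1],
              [ 1,  4,  3, -1, -1],
              [ 3,  1,  0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5,  2.0,  4.4 ],
              [0.8, 4.0, 0.3,  0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

# broadcasted fancy index: row i of A indexes into row i of B
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]

# undo the spurious writes caused by -1 indexing the last column
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0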
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...
B = ...
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
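Since the question mentions jitting in JAX, this where-based version translates almost verbatim; a minimal sketch, assuming jax is installed (the jnp calls simply mirror their numpy counterparts):

import jax
import jax.numpy as jnp

@jax.jit
def filter_by_indices(A, B):
    # mask[i, j] is True iff column index j appears in row i of A;
    # -1 entries never match any of 0..N-1, so they drop out naturally
    N = B.shape[1]
    mask = jnp.any(A[..., None] == jnp.arange(N), axis=1)
    return jnp.where(mask, B, 0.0)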
Another possible solution: replace each -1 with that row's maximum index (which already appears elsewhere in the row, so the duplicate writes are harmless), then scatter True into a boolean mask with np.put_along_axis:
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
np.where(mask, B, 0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when there are rows containing only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1; only two lines need to be added to the previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
Related
I have 2 numpy matrices with slightly different alignment:
X
id, value
1, 0.78
2, 0.65
3, 0.77
...
...
98, 0.88
99, 0.77
100, 0.87
Y
id, value
1, 0.79
2, 0.65
3, 0.78
...
...
98, 0.89
100, 0.80
Y is simply missing a particular ID.
I would like to perform vector operations on X and Y (e.g. correlation, difference, etc.), meaning I need to drop the value in X whose ID is missing from Y. How would I do that?
All the shared values are identical, so the extra element in x is exactly the difference between the two sums.
This solution is O(n); the other solutions here are at best O(n log n).
Data generation:
import numpy as np
# x = np.arange(10)
x = np.random.rand(10)
y = np.r_[x[:6], x[7:]] # exclude 6
print(x)
np.random.shuffle(y)
print(y)
Solution:
Note that np.isclose() is used for the floating-point comparison, since accumulated rounding in the two sums makes exact equality unreliable.
sum_x = np.sum(x)
sum_y = np.sum(y)
diff = sum_x - sum_y
value_index = np.argwhere(np.isclose(x, diff))
print(value_index)
Delete the relevant index:
deleted = np.delete(x, value_index)
print(deleted)
Output:
[0.36373441 0.5030346 0.895204 0.03352821 0.20693263 0.28651572
0.25859596 0.97969841 0.77368822 0.80105397]
[0.97969841 0.77368822 0.28651572 0.36373441 0.5030346 0.895204
0.03352821 0.80105397 0.20693263]
[[6]]
[0.36373441 0.5030346 0.895204 0.03352821 0.20693263 0.28651572
0.97969841 0.77368822 0.80105397]
Use in1d:
>>> X
array([[ 1. , 0.53],
[ 2. , 0.72],
[ 3. , 0.44],
[ 4. , 0.35],
[ 5. , 0.32],
[ 6. , 0.14],
[ 7. , 0.52],
[ 8. , 0.4 ],
[ 9. , 0.1 ],
[10. , 0.1 ]])
>>> Y
array([[ 1. , 0.19],
[ 2. , 0.96],
[ 3. , 0.24],
[ 4. , 0.44],
[ 5. , 0.12],
[ 6. , 0.91],
[ 7. , 0.7 ],
[ 8. , 0.54],
[10. , 0.09]])
>>> X[np.in1d(X[:, 0], Y[:, 0])]
array([[ 1. , 0.53],
[ 2. , 0.72],
[ 3. , 0.44],
[ 4. , 0.35],
[ 5. , 0.32],
[ 6. , 0.14],
[ 7. , 0.52],
[ 8. , 0.4 ],
[10. , 0.1 ]])
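In recent NumPy versions, np.isin is the recommended replacement for np.in1d; the same filtering reads:

# keep only the rows of X whose ID also appears in Y's first column
X_common = X[np.isin(X[:, 0], Y[:, 0])]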
If your missing entries are encoded as NaN, you can try this:
X = X[~numpy.isnan(X)]
Y = Y[~numpy.isnan(Y)]
Then you can perform whatever operation you want.
I have a big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (tXt) where a value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to perform it better?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1/p0 - 1
A vectorized solution uses np.meshgrid with prices and 1/prices as arguments (note that prices must be an array); multiplying the two outputs and subtracting 1 computes matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells (0-based indices):
(i,j)  prices[j]/prices[i] - 1
--------------------------------
(0,0)  1/1 - 1 = 0
(0,1)  4/1 - 1 = 3
(0,2)  2/1 - 1 = 1
(1,0)  1/4 - 1 = -0.75
Another solution, dividing a row by a column directly (NumPy converts the list [p] to a (1, n) row array, and np.array([p]).T is the matching (n, 1) column):
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
This p[None, :] could also be spelled as a reshape, p.reshape((1, len(p))), but p[None, :] is more readable.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
...: for j in range(len(p)):
...: o[i, j] = p[i] / p[j]
...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
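Note that np.divide.outer(p, p) computes p[i] / p[j], the transpose of what the question asks for; to get exactly matrix[i][j] = prices[j]/prices[i] - 1 with the same idioms:

matrix = np.divide.outer(p, p).T - 1
# or, with broadcasting:
matrix = p[None, :] / p[:, None] - 1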
I guess it can be done in this way
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. The matrix is the product of a column vector of element-wise reciprocal prices and a row vector of the original prices (an outer product, which broadcasting performs here). Then a matrix of ones of the same size is subtracted from the result.
First of all, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here. Otherwise your vector will have shape (len(prices),), not the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.
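One caveat applying to all of the vectorized answers: at the size in the question (t = 200K), the full t x t float64 matrix takes roughly 320 GB, so it cannot be materialized at once. A minimal sketch of producing it in row blocks instead (the block size is an arbitrary choice of mine):

import numpy as np

def ratio_matrix_blocks(prices, block=10_000):
    # yields matrix[i][j] = prices[j]/prices[i] - 1, one block of rows at a time
    p = np.asarray(prices, dtype=float)
    for start in range(0, len(p), block):
        yield p[None, :] / p[start:start + block, None] - 1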
I was reading and came across the formula for cosine similarity, cos(u, v) = u·v / (||u|| ||v||). I thought this looked interesting, and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entries are the ratings person u has given to item i, with u as the row and i as the column. I want to calculate the cosine similarity between items (columns) of this matrix, which should yield a 5 x 5 matrix, I believe. I tried to do:
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.
Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could do something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v:
    1 - (u · v) / (||u||_2 ||v||_2)
where ||*||_2 is the 2-norm of its argument, and u · v is the dot product of u and v.
Array transposition is performed through the .T attribute.
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
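If you would rather stay in pure NumPy (no scipy), the same matrix can be computed by normalizing the item columns and taking a dot product; a minimal sketch, assuming no item column is entirely zero (a zero norm would cause a division by zero):

items = item_mean_subtracted.T                        # items as rows, shape (5, 4)
norms = np.linalg.norm(items, axis=1, keepdims=True)  # 2-norm of each item vector
unit = items / norms
similarity_matrix = unit @ unit.T                     # matches the pdist result above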
I have a lower triangular array, like B:
B = np.array([[1,0,0,0],[.25,.75,0,0], [.1,.2,.7,0],[.2,.3,.4,.1]])
>>> B
array([[ 1. , 0. , 0. , 0. ],
[ 0.25, 0.75, 0. , 0. ],
[ 0.1 , 0.2 , 0.7 , 0. ],
[ 0.2 , 0.3 , 0.4 , 0.1 ]])
I want to flip it to look like:
array([[ 1. , 0. , 0. , 0. ],
[ 0.75, 0.25, 0. , 0. ],
[ 0.7 , 0.2 , 0.1 , 0. ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
That is, I want to take all the positive values and reverse their order within each row, leaving the trailing zeros in place. This is not what fliplr does:
>>> np.fliplr(B)
array([[ 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0.75, 0.25],
[ 0. , 0.7 , 0.2 , 0.1 ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
Any tips? Also, the actual array I am working with has shape B.shape = (200, 20, 4, 4) instead of (4, 4); each (4, 4) block looks like the example above (with different numbers across the 200 x 20 entries).
How about this:
# row, column indices of the lower triangle of B
r, c = np.tril_indices_from(B)
# flip the column indices by subtracting them from r, which is equal to the number
# of nonzero elements in each row minus one
B[r, c] = B[r, r - c]
print(repr(B))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.75, 0.25, 0. , 0. ],
# [ 0.7 , 0.2 , 0.1 , 0. ],
# [ 0.1 , 0.4 , 0.3 , 0.2 ]])
The same approach will generalize to any arbitrary N-dimensional array that consists of multiple lower triangular submatrices:
# creates a (200, 20, 4, 4) array consisting of tiled copies of B
B2 = np.tile(B[None, None, ...], (200, 20, 1, 1))
print(repr(B2[100, 10]))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.25, 0.75, 0. , 0. ],
# [ 0.1 , 0.2 , 0.7 , 0. ],
# [ 0.2 , 0.3 , 0.4 , 0.1 ]])
r, c = np.tril_indices_from(B2[0, 0])
B2[:, :, r, c] = B2[:, :, r, r - c]
print(repr(B2[100, 10]))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.75, 0.25, 0. , 0. ],
# [ 0.7 , 0.2 , 0.1 , 0. ],
# [ 0.1 , 0.4 , 0.3 , 0.2 ]])
For an upper triangular matrix you could simply subtract r from c instead, e.g.:
r, c = np.triu_indices_from(B.T)
B.T[r, c] = B.T[c - r, c]
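A quick check of that variant on a fresh upper triangular array (a copy, since the snippets above modify their input in place; this example is the transpose of the original B):

U = np.array([[1.  , 0.25, 0.1 , 0.2 ],
              [0.  , 0.75, 0.2 , 0.3 ],
              [0.  , 0.  , 0.7 , 0.4 ],
              [0.  , 0.  , 0.  , 0.1 ]])
r, c = np.triu_indices_from(U)
U[r, c] = U[c - r, c]   # flip within each column's nonzero entries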
Here's one approach for a 2D array case -
mask = np.tril(np.ones((4,4),dtype=bool))
out = np.zeros_like(B)
out[mask] = B[:,::-1][mask[:,::-1]]
You can extend it to a 3D array case using the same 2D mask by masking the last two axes with it, like so -
out = np.zeros_like(B)
out[:,mask] = B[:,:,::-1][:,mask[:,::-1]]
.. and similarly for a 4D array case, like so -
out = np.zeros_like(B)
out[:,:,mask] = B[:,:,:,::-1][:,:,mask[:,::-1]]
As one can see, we are keeping the masking process to the last two axes of (4,4) and the solution basically stays the same.
Sample run -
In [95]: B
Out[95]:
array([[ 1. , 0. , 0. , 0. ],
[ 0.25, 0.75, 0. , 0. ],
[ 0.1 , 0.2 , 0.7 , 0. ],
[ 0.2 , 0.3 , 0.4 , 0.1 ]])
In [96]: mask = np.tril(np.ones((4,4),dtype=bool))
...: out = np.zeros_like(B)
...: out[mask] = B[:,::-1][mask[:,::-1]]
...:
In [97]: out
Out[97]:
array([[ 1. , 0. , 0. , 0. ],
[ 0.75, 0.25, 0. , 0. ],
[ 0.7 , 0.2 , 0.1 , 0. ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
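Since the masking only ever touches the last two axes, the 3D and 4D variants above collapse into one dimension-agnostic form with an ellipsis; a short sketch under the same assumption that the trailing shape is (4, 4):

out = np.zeros_like(B)
out[..., mask] = B[..., ::-1][..., mask[:, ::-1]]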
I have an ndarray. From this array I need to choose the N entries with the biggest values. I found heapq.nlargest to find the N largest entries, but I need to extract their indexes.
I want to build a new array where only the N rows with the largest weights in the first column will survive. The rest of the rows will be replaced by random values
import numpy as np
import heapq  # for choosing the list of max values
a = [[1.1,2.1,3.1], [2.1,3.1,4.1], [5.1,0.1,7.1],[0.1,1.1,1.1],[4.1,3.1,9.1]]
a = np.asarray(a)
maxVal = heapq.nlargest(2, a[:,0])
if __name__ == '__main__':
    print(a)
    print(maxVal)
The output I have is:
[[ 1.1 2.1 3.1]
[ 2.1 3.1 4.1]
[ 5.1 0.1 7.1]
[ 0.1 1.1 1.1]
[ 4.1 3.1 9.1]]
[5.0999999999999996, 4.0999999999999996]
but what I need is [2, 4]: the indexes of the rows to keep. So if in this example I want to replace the rest by 0, I need to finish with:
[[ 0.0 0.0 0.0]
[ 0.0 0.0 0.0]
[ 5.1 0.1 7.1]
[ 0.0 0.0 0.0]
[ 4.1 3.1 9.1]]
I am stuck at the point where I need the indexes. The original array has 1000 rows and 100 columns. The weights are normalized floating-point values, and I don't want to do something like if a[:,0] == maxVal[0]: because sometimes the weights are very close and I can finish with more matches of maxVal[0] than my original N.
Is there any simple way to extract indexes on this setup to replace the rest of the array?
If you only have 1000 rows, I would forget about the heap and use np.argsort on the first column:
>>> np.argsort(a[:,0])[::-1][:2]
array([2, 4])
If you want to put it all together, it would look something like:
def trim_rows(a, n):
    # zero out every row except the n with the largest first-column weights
    idx = np.argsort(a[:,0])[:-n]
    a[idx] = 0
>>> a = np.random.rand(10, 4)
>>> a
array([[ 0.34416425, 0.89021968, 0.06260404, 0.0218131 ],
[ 0.72344948, 0.79637177, 0.70029863, 0.20096129],
[ 0.27772833, 0.05372373, 0.00372941, 0.18454153],
[ 0.09124461, 0.38676351, 0.98478492, 0.72986697],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0.27597241, 0.26705301, 0.62124467, 0.43337711],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0.3943888 , 0.61586129, 0.02776393, 0.2560126 ],
[ 0.5934556 , 0.23093912, 0.12550062, 0.58542137]])
>>> trim_rows(a, 3)
>>> a
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0. , 0. , 0. , 0. ],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
And for your data size it's probably fast enough:
In [7]: a = np.random.rand(1000, 100)
In [8]: %timeit -n1 -r1 trim_rows(a, 50)
1 loops, best of 1: 7.65 ms per loop
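If the row count ever grows much beyond this, np.argpartition performs the same selection in linear time without a full sort; a minimal sketch:

def trim_rows_argpartition(a, n):
    # indices of all but the n rows with the largest first-column weights
    idx = np.argpartition(a[:, 0], -n)[:-n]
    a[idx] = 0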