numpy - align 2 vectors with potentially missing values - python

I have 2 numpy matrices with slightly different alignment:
X
id, value
1, 0.78
2, 0.65
3, 0.77
...
...
98, 0.88
99, 0.77
100, 0.87
Y
id, value
1, 0.79
2, 0.65
3, 0.78
...
...
98, 0.89
100, 0.80
Y is simply missing a particular ID.
I would like to perform vector operations on X and Y (e.g. correlation, difference, etc.), which means I need to drop the value in X whose ID is missing from Y. How would I do that?

All the values are the same, so the extra element in x will equal the difference between the sums.
This solution is O(n); other solutions here are O(n^2).
Data generation:
import numpy as np
# x = np.arange(10)
x = np.random.rand(10)
y = np.r_[x[:6], x[7:]] # exclude 6
print(x)
np.random.shuffle(y)
print(y)
Solution:
Notice np.isclose() is used for floating-point comparison.
sum_x = np.sum(x)
sum_y = np.sum(y)
diff = sum_x - sum_y
value_index = np.argwhere(np.isclose(x, diff))
print(value_index)
Delete the relevant index:
deleted = np.delete(x, value_index)
print(deleted)
out:
[0.36373441 0.5030346 0.895204 0.03352821 0.20693263 0.28651572
0.25859596 0.97969841 0.77368822 0.80105397]
[0.97969841 0.77368822 0.28651572 0.36373441 0.5030346 0.895204
0.03352821 0.80105397 0.20693263]
[[6]]
[0.36373441 0.5030346 0.895204 0.03352821 0.20693263 0.28651572
0.97969841 0.77368822 0.80105397]

Use in1d:
>>> X
array([[ 1. , 0.53],
[ 2. , 0.72],
[ 3. , 0.44],
[ 4. , 0.35],
[ 5. , 0.32],
[ 6. , 0.14],
[ 7. , 0.52],
[ 8. , 0.4 ],
[ 9. , 0.1 ],
[10. , 0.1 ]])
>>> Y
array([[ 1. , 0.19],
[ 2. , 0.96],
[ 3. , 0.24],
[ 4. , 0.44],
[ 5. , 0.12],
[ 6. , 0.91],
[ 7. , 0.7 ],
[ 8. , 0.54],
[10. , 0.09]])
>>> X[np.in1d(X[:, 0], Y[:, 0])]
array([[ 1. , 0.53],
[ 2. , 0.72],
[ 3. , 0.44],
[ 4. , 0.35],
[ 5. , 0.32],
[ 6. , 0.14],
[ 7. , 0.52],
[ 8. , 0.4 ],
[10. , 0.1 ]])
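If IDs can be missing from either array, a symmetric variant (a sketch, assuming the ID column of each array is sorted and has no duplicates) is to intersect the ID columns first and filter both arrays:
common = np.intersect1d(X[:, 0], Y[:, 0])
X_aligned = X[np.in1d(X[:, 0], common)]
Y_aligned = Y[np.in1d(Y[:, 0], common)]
# the value columns now line up row for row, so e.g.
np.corrcoef(X_aligned[:, 1], Y_aligned[:, 1])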

You can try this:
X = X[~numpy.isnan(X)]
Y = Y[~numpy.isnan(Y)]
And then you can do whatever operation you want.
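Note that this only keeps the two vectors aligned if you drop the same positions from both. A sketch that masks both vectors jointly (assuming the missing entry is marked with NaN and X and Y already have equal length):
mask = ~(numpy.isnan(X) | numpy.isnan(Y))
X = X[mask]
Y = Y[mask]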

Related

how to reverse index a 2-d array

I have a 2d MxN array A, each row of which is a sequence of indices, padded with -1's at the end, e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B i.e. for each row, only the indices present in A retain their values, and the values at all other locations are set to 0.0, i.e. the result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in jax.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast against each other. The column np.arange(B.shape[0]).reshape(-1, 1) matches every element of a given row of A to the corresponding row in B and result.
This example does not yet address the fact that -1 is a valid numpy index (it refers to the last column). You need to clear the last-column elements that were written because of a -1 in A, in rows where 4 (the actual last-column index) is not itself present:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True], indicating that even though the second row has a -1 in it, it also contains a 4.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
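Putting the steps above together on the arrays from the question (a sketch; the expected output is reproduced in the comments):
import numpy as np

A = np.array([[ 2,  1, -1, -1, -1],
              [ 1,  4,  3, -1, -1],
              [ 3,  1,  0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5 , 2.0 , 4.4 ],
              [0.8, 4.0, 0.3 , 0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

# row indices broadcast against A: each row of A indexes into its own row of B
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]

# undo spurious writes to the last column caused by the -1 padding
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
print(result)
# [[0.   0.4  1.5  0.   0.  ]
#  [0.   4.   0.   0.11 0.53]
#  [0.6  7.4  0.   0.71 0.  ]]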
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...
B = ...
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
out:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
Another possible solution:
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
np.where(mask, B, 0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when there are rows containing only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1 -- only 2 lines need to be added to the previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])

Cartesian product from 2 series

I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (t x t) where each value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to do it faster?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1/p0 - 1
A vectorized solution uses np.meshgrid with prices and 1/prices as arguments (note that prices must be an array), then multiplies the two outputs and subtracts 1 in order to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells:
(i,j) prices[j]/prices[i] - 1
--------------------------------
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
The p[None, :] itself can be spelled as a reshape, p.reshape((1, len(p))), but the None form is more readable.
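A quick check that the two spellings produce the same array:
np.array_equal(p[None, :], p.reshape((1, len(p))))  # True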
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
...: for j in range(len(p)):
...: o[i, j] = p[i] / p[j]
...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
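Note that the question asks for matrix[i][j] = prices[j]/prices[i] - 1, which is the transpose of the quotients above; a sketch in that orientation:
p[None, :] / p[:, None] - 1      # entry (i, j) is p[j]/p[i] - 1
# or equivalently
np.divide.outer(p, p).T - 1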
I guess it can be done this way:
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. This matrix is the product of a column vector of element-wise reciprocal price values and a row vector of the original price values. Then a matrix of ones of the same size needs to be subtracted from the result.
First of all, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here; otherwise the vector would have shape (len(prices),) rather than the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.

Dimensions of array don't match

I have a numpy array, and when I inspect it I get the output below. But I expected print(feat.shape) to give (105835, 99, 13), i.e. I expected feat to have 3 dimensions.
print(feat.ndim)
print(feat.shape)
print(feat.size)
print(feat[1].ndim)
print(feat[1].shape)
print(feat[1].size)
1
(105835,)
105835
2
(99, 13)
1287
I don't know how to fix this. feat is an MFCC feature array. If I print feat, this is what I get:
array([array([[-1.0160675e+01, -1.3804866e+01, 9.1880971e-01, ...,
1.5415058e+00, 1.1875046e-02, -5.8664594e+00],
[-9.9697800e+00, -1.3823588e+01, -7.0778362e-02, ...,
1.5948311e+00, 4.3481258e-01, -5.1646194e+00],
[-9.9518738e+00, -1.2771760e+01, -1.2623003e-01, ...,
3.4290311e+00, 2.7361808e+00, -6.0621500e+00],
...,
[-11.605266 , -7.1909204, -33.44656 , ..., -11.974911 ,
12.825395 , 10.635098 ],
[-11.769397 , -9.340318 , -34.413307 , ..., -10.077869 ,
8.821722 , 7.704534 ],
[-12.301968 , -10.67318 , -32.46104 , ..., -6.829077 ,
15.29837 , 13.100596 ]], dtype=float32)], dtype=object)
The same structure can be created in a simpler way:
import numpy as np
ain = np.random.rand(2, 2)
a = np.ndarray(3, dtype=object)
a[:] = [ain] * 3
#array([array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]),
# array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]),
# array([[ 0.14, 0.56],
# [ 0.9 , 0.9 ]])], dtype=object)
The problem arises because a.dtype is object. You can reconstruct your data with:
a = np.array(list(a))
#array([
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]],
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]],
# [[ 0.14, 0.56],
# [ 0.9 , 0.9 ]]])
It will have the float dtype inherited from the base arrays.
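np.stack gives a slightly more direct reconstruction (a sketch, assuming every element of the object array has the same shape, as the (99, 13) MFCC frames do):
a = np.stack(a)        # same result as np.array(list(a))
# applied to the original data:
feat = np.stack(feat)  # feat.shape becomes (105835, 99, 13)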

Cosine Similarity

I was reading and came across the formula for cosine similarity. I thought it looked interesting, so I created a numpy array that has user_id as rows and item_id as columns. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entry in row u and column i is the rating person u has given to item i. I want to calculate the cosine similarity between items (columns) for this matrix; this should yield a 5 x 5 matrix, I believe. I tried:
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.
Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could do something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v,
1 − (u⋅v) / (||u||₂ ||v||₂)
where ||·||₂ is the 2-norm of its argument, and u⋅v is the dot product of u and v.
Array transposition is performed through the .T attribute.
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
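A quick manual check of one entry, using the centered matrix from the demo (the value matches similarity_matrix[0, 1] above):
u = item_mean_subtracted[:, 0]
v = item_mean_subtracted[:, 1]
u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # ~0.87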

getting a list of coordinates from a 2D matrix

Let's say I have a 10 x 20 matrix of values (so 200 data points)
values = np.random.rand(10,20)
with a known regular spacing between coordinates so that the x and y coordinates are defined by
coord_x = np.arange(0,5,0.5) --> gives [0.0,0.5,1.0,1.5...4.5]
coord_y = np.arange(0,5,0.25) --> gives [0.0,0.25,0.50,0.75...4.75]
I'd like to get an array representing each coordinate point, so that the shape of the array is (200, 2), 200 being the total number of points and the extra dimension simply representing x and y, such as:
coord[0][0]=0.0, coord[0][1]=0.0
coord[1][0]=0.0, coord[1][1]=0.25
coord[2][0]=0.0, coord[2][1]=0.50
...
coord[19][0]=0.0, coord[19][1]=4.75
coord[20][0]=0.5, coord[20][1]=0.0
coord[21][0]=0.5, coord[21][1]=0.25
coord[22][0]=0.5, coord[22][1]=0.50
...
coord[199][0]=4.5, coord[199][1]=4.75
That would be a fairly easy thing to do with a double for loop, but I wonder if there is a more elegant solution using built-in numpy (or other) functions?
I think meshgrid may be what you're looking for.
Here's an example with a smaller number of data points:
>>> from numpy import fliplr, dstack, meshgrid, linspace
>>> x, y, nx, ny = 4.5, 4.5, 3, 10
>>> Xs = linspace(0, x, nx)
>>> Ys = linspace(0, y, ny)
>>> fliplr(dstack(meshgrid(Xs, Ys)).reshape(nx * ny, 2))
array([[ 0. , 0. ],
[ 0. , 2.25],
[ 0. , 4.5 ],
[ 0.5 , 0. ],
[ 0.5 , 2.25],
[ 0.5 , 4.5 ],
[ 1. , 0. ],
[ 1. , 2.25],
[ 1. , 4.5 ],
[ 1.5 , 0. ],
[ 1.5 , 2.25],
[ 1.5 , 4.5 ],
[ 2. , 0. ],
[ 2. , 2.25],
[ 2. , 4.5 ],
[ 2.5 , 0. ],
[ 2.5 , 2.25],
[ 2.5 , 4.5 ],
[ 3. , 0. ],
[ 3. , 2.25],
[ 3. , 4.5 ],
[ 3.5 , 0. ],
[ 3.5 , 2.25],
[ 3.5 , 4.5 ],
[ 4. , 0. ],
[ 4. , 2.25],
[ 4. , 4.5 ],
[ 4.5 , 0. ],
[ 4.5 , 2.25],
[ 4.5 , 4.5 ]])
I think you meant coord_y = np.arange(0,5,0.25) in your question. You can do
from numpy import meshgrid,column_stack
x,y=meshgrid(coord_x,coord_y)
coord = column_stack((x.T.flatten(),y.T.flatten()))
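Applied to the arrays from the question, this produces the expected (200, 2) result in the requested order (a quick check):
import numpy as np
coord_x = np.arange(0, 5, 0.5)    # 10 values
coord_y = np.arange(0, 5, 0.25)   # 20 values
x, y = np.meshgrid(coord_x, coord_y)
coord = np.column_stack((x.T.flatten(), y.T.flatten()))
coord.shape    # (200, 2)
coord[:3]      # [[0., 0.], [0., 0.25], [0., 0.5]]
coord[20]      # [0.5, 0.]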
