Input dimensions for distance function for nearest neighbors - python

In the context of unsupervised nearest neighbors with scikit-learn, I have implemented my own distance function to deal with my uncertain points (i.e. a point is represented as a normal distribution):
def my_mahalanobis_distance(x, y):
    '''
    x: array of shape (4,)  x[0]: mu_x_1, x[1]: mu_x_2,
                            x[2]: cov_x_11, x[3]: cov_x_22
    y: array of shape (4,)  y[0]: mu_y_1, y[1]: mu_y_2,
                            y[2]: cov_y_11, y[3]: cov_y_22
    '''
    cov_inv = np.linalg.inv(np.diag(x[2:]) + np.diag(y[2:]))
    return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv)
However, when I set my nearest neighbors:
nnbrs = NearestNeighbors(n_neighbors=1, metric='pyfunc', func=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)
where X is an (N, 4) (n_samples, n_features) array. If I print x and y inside my_mahalanobis_distance, I get shapes of (10,) instead of the (4,) I would expect.
Example:
I add the following line to my_mahalanobis_distance:
print(x.shape)
Then in my main:
n_features = 4
n_samples = 10
# generate X array:
X = np.random.rand(n_samples, n_features)
nnbrs = NearestNeighbors(n_neighbors=1, metric='pyfunc', func=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)
The result is:
(10,)
ValueError: shapes (2,) and (8,8) not aligned: 2 (dim 0) != 8 (dim 0)
I perfectly understand the error, but I do not understand why my x.shape is (10,) while my number of features is 4 in X.
I am using Python 2.7.10 and scikit-learn 0.16.1.
EDIT:
Replacing return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv) with return 1, just to test the calls, the output is:
(10,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
So only the first call to my_mahalanobis_distance is wrong. Looking at the x and y values at this first iteration, my observations are:
x and y are identical
if I run my code multiple times, x and y are still identical, but their values have changed compared to the previous run.
these values seem to come from a numpy.random function.
I would conclude that this first call is a piece of debugging code which has not been removed.

This is not an answer, but it is too long for a comment. I cannot reproduce the error.
Using:
Python 3.5.2 and
Sklearn 0.18.1
with the code:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import scipy as sp
import scipy.spatial  # make sp.spatial available

n_features = 4
n_samples = 10
# generate X array:
X = np.random.rand(n_samples, n_features)

def my_mahalanobis_distance(x, y):
    cov_inv = np.linalg.inv(np.diag(x[2:]) + np.diag(y[2:]))
    print(x.shape)
    return sp.spatial.distance.mahalanobis(x[:2], y[:2], cov_inv)

nnbrs = NearestNeighbors(n_neighbors=1, metric=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)
The output is
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)
(4,)

I customized my_mahalanobis_distance to handle this issue:
import warnings

def my_mahalanobis_distance(x, y):
    '''
    x: array of shape (4,)  x[0]: mu_x_1, x[1]: mu_x_2,
                            x[2]: cov_x_11, x[3]: cov_x_22
    y: array of shape (4,)  y[0]: mu_y_1, y[1]: mu_y_2,
                            y[2]: cov_y_11, y[3]: cov_y_22
    '''
    if (x.size, y.size) == (4, 4):
        return sp.spatial.distance.mahalanobis(x[:2], y[:2],
                                               np.linalg.inv(np.diag(x[2:])
                                                             + np.diag(y[2:])))
    else:
        # handle the buggy first call made by NearestNeighbors.fit()
        warnings.warn('x and y are respectively of size %i and %i' % (x.size, y.size))
        return sp.spatial.distance.euclidean(x, y)
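For reference, a minimal usage sketch of the guarded metric (assuming a scikit-learn version that accepts a callable directly for metric, as in the answer above; the data and the kneighbors query are just for illustration):
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10, 4)  # columns: mu_1, mu_2, cov_11, cov_22

nnbrs = NearestNeighbors(n_neighbors=1, metric=my_mahalanobis_distance)
nearest_neighbors = nnbrs.fit(X)  # the buggy first call only triggers the warning
dist, idx = nearest_neighbors.kneighbors(X[:1])  # nearest neighbor of the first point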

Related

python implementation of numpy.lib.stride_tricks.as_strided

I'm trying to reimplement NumPy's as_strided function in plain Python, first converting the strides from bytes to element counts according to the array's dtype (for float32 I divide each stride by 4, etc.).
The code I implemented:
def as_strided(x, shape, strides):
    x = x.flatten()
    size = 1
    for value in shape:
        size *= value
    arr = np.zeros(size, dtype=np.float32)
    curr = 0
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                arr[curr] = x[i * strides[0] + j * strides[1] + k * strides[2]]
                curr = curr + 1
    return np.reshape(arr, shape)
In order to test the function I wrote 2 auxiliary functions:
def sliding_window(x, shape, strides):
    f_mine = as_strided(x, shape, [stride // 4 for stride in strides])
    f_np = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides).copy()
    check_strides(x.flatten(), f_mine)
    check_strides(x.flatten(), f_np)
    return f_mine, f_np

def check_strides(original, strided):
    s1 = int(np.where(original == strided[1][0][0])[0])
    s2 = int(np.where(original == strided[0][1][0])[0])
    s3 = int(np.where(original == strided[0][0][1])[0])
    print([s1, s2, s3])
    return [s1, s2, s3]
In the main code, I selected some shape and strides values and ran 2 cases:
Uploaded a .npy file that includes a matrix in float32 - variable x.
Created random matrix of the same size and type as variable x - variable y.
When I check the strides of the resulting matrices I get a strange phenomenon.
For case 1, the strides obtained using the NumPy function differ from the requested strides (and from my implementation).
For case 2, the outputs are identical.
The main code:
shape = (30, 818, 300)
strides = (4, 120, 120)
# case 1
x = np.load('x.npy')
s_mine, s_np = sliding_window(x, shape, strides)
print(np.array_equal(s_mine, s_np))
# case 2
y = np.random.randn(x.shape[0], x.shape[1]).astype(np.float32)
s_mine, s_np = sliding_window(y, shape, strides)
print(np.array_equal(s_mine, s_np))
Here you can find the x.npy file that causes the desired stride change in the numpy function. I'd be happy if anyone could explain to me why this is happening.
I downloaded x.npy, loaded it, and ran as_strided on y. I haven't looked at your code.
Normally when playing with as_strided I like to look at the arrays, but in this case they are large enough that I'll focus more on making sense of the strides and shape.
In [39]: x.shape, x.strides
Out[39]: ((30, 1117), (4, 120))
In [40]: y.shape, y.strides
Out[40]: ((30, 1117), (4468, 4))
I wondered where you got
shape = (30, 818, 300)
strides = (4, 120, 120)
OK, the 30 is shared, but the 4 applies only to x. With those strides, x looks like it is F-ordered, maybe even the transpose of a (1117, 30) array. Your y, which was constructed with random, has the typical strides of a C-ordered array: 4 bytes for the inner, trailing dimension, and 4*1117 for the leading dimension.
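A quick sketch of that diagnosis, using empty arrays of the same shape and dtype (the values don't matter):
import numpy as np

c = np.zeros((30, 1117), dtype=np.float32)    # C-ordered, like y
f = np.asfortranarray(c)                      # F-ordered copy
t = np.zeros((1117, 30), dtype=np.float32).T  # transpose of a C-ordered array

print(c.strides)  # (4468, 4)
print(f.strides)  # (4, 120), matches x
print(t.strides)  # (4, 120), also matches x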

np.concatenate doesn't allow sequential concatenation

I have been trying to concatenate two 1D arrays using np.concatenate but it doesn't work as expected. Can someone please let me know where I'm making a mistake?
My code is as follows:
x = np.array([1.13793103, 0.24137931, 0.48275862, 1.24137931, 1.00000000, 1.89655172])
y = np.array([0.03666667, 0.00888889, 0.01555556, 0.04 , 0.03222222, 0.06111111])
z = np.concatenate((x,y), axis=0)
print(z)
array([1.13793103, 0.24137931, 0.48275862, ... 0.04, 0.03222222, 0.06111111])
print(f'{type(x)} {type(y)} {type(z)}')
<class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'numpy.ndarray'>
print(f'{x.shape} {y.shape} {z.shape}')
(6,) (6,) (12,)
So, instead of adding y as a new row, it joins the two arrays end to end, which isn't my intention. I am looking for something as follows:
array([[1.13793103, 0.24137931, 0.48275862, 1.24137931, 1.00000000, 1.89655172],
       [0.03666667, 0.00888889, 0.01555556, 0.04      , 0.03222222, 0.06111111]])
You can use np.concatenate to concatenate along some axis if that dimension exists in the arrays that you want to concatenate:
x = np.array([1,2,3])
y = np.array([4,5,6])
here, x and y have shape (3,) so only one axis.
This means you can only concatenate along that axis (i.e. axis=0):
z = np.concatenate((x,y))
z.shape
out : (6,)
concatenating along axis=1 will throw an error:
z = np.concatenate((x,y), axis=1)
AxisError: axis 1 is out of bounds for array of dimension 1
You can make np.concatenate work if you reshape x and y:
x, y = x.reshape(-1,1), y.reshape(-1,1)
Now both have shape (3,1) and can be concatenated along axis 1:
z = np.concatenate((x,y), axis=1)
z.shape
(3, 2)
alternatively, you can reshape to (1,3) and concatenate along axis 0:
z = np.concatenate((x.reshape(1,-1), y.reshape(1,-1)), axis=0)
z.shape
(2, 3)
Or you can use np.vstack, which does not require the reshaping.
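For the original arrays, a minimal sketch with np.vstack:
import numpy as np

x = np.array([1.13793103, 0.24137931, 0.48275862, 1.24137931, 1.00000000, 1.89655172])
y = np.array([0.03666667, 0.00888889, 0.01555556, 0.04, 0.03222222, 0.06111111])

z = np.vstack((x, y))  # stacks the 1-d arrays as rows
print(z.shape)  # (2, 6)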

scipy.integrate.solve_ivp vectorized

Trying to use the vectorized option for solve_ivp, and strangely it throws an error saying y0 must be 1-dimensional.
MWE:
from scipy.integrate import solve_ivp
import numpy as np
import math

def f(t, y):
    theta = math.pi/4
    ham = np.array([[1, 0], [1, np.exp(-1j*theta*t)]])
    return -1j * np.dot(ham, y)

def main():
    y0 = np.eye(2, dtype=np.complex128)
    t0 = 0
    tmax = 10**(-6)
    sol = solve_ivp(lambda t, y: f(t, y), (t0, tmax), y0, method='RK45', vectorized=True)
    print(sol.y)

if __name__ == '__main__':
    main()
From the solve_ivp documentation: The calling signature is fun(t, y). Here t is a scalar, and there are two options for the ndarray y: It can either have shape (n,); then fun must return array_like with shape (n,). Alternatively it can have shape (n, k); then fun must return an array_like with shape (n, k), i.e. each column corresponds to a single column in y. The choice between the two options is determined by the vectorized argument (see below). The vectorized implementation allows a faster approximation of the Jacobian by finite differences (required for stiff solvers).
Error :
ValueError: y0 must be 1-dimensional.
Python 3.6.8
scipy.version
'1.2.1'
The meaning of vectorized here is a bit confusing. It doesn't mean that y0 can be 2d, but rather that the y passed to your function can be 2d. In other words, func may be evaluated at multiple points at once, if the solver so desires. How many points is up to the solver, not you.
Change f to show the shape of y at each call:
def f(t, y):
    print(y.shape)
    theta = math.pi/4
    ham = np.array([[1, 0], [1, np.exp(-1j*theta*t)]])
    return -1j * np.dot(ham, y)
A sample call:
In [47]: integrate.solve_ivp(f,(t0,tmax),[1j,0],method='RK45',vectorized=False)
(2,)
(2,)
(2,)
(2,)
(2,)
(2,)
(2,)
(2,)
Out[47]:
message: 'The solver successfully reached the end of the integration interval.'
nfev: 8
njev: 0
nlu: 0
sol: None
status: 0
success: True
t: array([0.e+00, 1.e-06])
t_events: None
y: array([[0.e+00+1.e+00j, 1.e-06+1.e+00j],
[0.e+00+0.e+00j, 1.e-06-1.e-12j]])
Same call, but with vectorized=True:
In [48]: integrate.solve_ivp(f,(t0,tmax),[1j,0],method='RK45',vectorized=True)
(2, 1)
(2, 1)
(2, 1)
(2, 1)
(2, 1)
(2, 1)
(2, 1)
(2, 1)
Out[48]:
message: 'The solver successfully reached the end of the integration interval.'
nfev: 8
njev: 0
nlu: 0
sol: None
status: 0
success: True
t: array([0.e+00, 1.e-06])
t_events: None
y: array([[0.e+00+1.e+00j, 1.e-06+1.e+00j],
[0.e+00+0.e+00j, 1.e-06-1.e-12j]])
With False, the y passed to f is (2,), 1d; with True it is (2,1). I'm guessing it could be (2,2) or even (2,3) if the solver method so desires. That could speed up the execution, with fewer calls to f. In this case, it doesn't matter.
quadrature has a similar vec_func boolean parameter:
Numerical Quadrature of scalar valued function with vector input using scipy
A related bug/issue discussion:
https://github.com/scipy/scipy/issues/8922
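As for the original ValueError: y0 = np.eye(2) is 2d, and solve_ivp requires a 1d y0 regardless of vectorized. A minimal sketch of one workaround, flattening the matrix state (the name f_flat is mine, and this uses the default vectorized=False):
import math
import numpy as np
from scipy.integrate import solve_ivp

def f_flat(t, y):
    theta = math.pi/4
    ham = np.array([[1, 0], [1, np.exp(-1j*theta*t)]])
    # view the flat state as a (2, 2) matrix, apply -1j*ham, flatten again
    return (-1j * np.dot(ham, y.reshape(2, 2))).ravel()

y0 = np.eye(2, dtype=np.complex128).ravel()  # shape (4,)
sol = solve_ivp(f_flat, (0, 1e-6), y0, method='RK45')
print(sol.y.shape)  # (4, n_time_points)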

Multiply each row of numpy array with matrix

What is the most pythonic way to multiply each row (axis=2) of a np array with a matrix? For example, I am working with images read as np arrays of shape (480, 512, 3), and I want to multiply each img[i,j] with a 3x3 matrix. I don't want to use for loops for this. This is what I tried, but it gives an error:
A = np.array([
    [.412453, .35758, .180423],
    [.212671, .71516, .072169],
    [.019334, .119193, .950227]
])
lin_XYZ = lambda x: np.dot(A, x[::-1])
#lin_XYZ = np.vectorize(lin_XYZ)
tmp_img = lin_XYZ(tmp_img[:,:])

  File ".\proj1a.py", line 24, in color2luv
    tmp_img = lin_XYZ(tmp_img[:,:])
  File ".\proj1a.py", line 22, in <lambda>
    lin_XYZ = lambda x: np.dot(A, x)
ValueError: shapes (3,3) and (480,512,3) not aligned: 3 (dim 1) != 512 (dim 1)
So A is (3,3) and x is (480, 512, 3), and you want a dot product over the size-3 dimension. The key thing to remember with dot(A,B) is: the last dimension of A pairs with the second-to-last dimension of B. (That's what the error is complaining about: 3 (dim 1) != 512 (dim 1).)
x.dot(A)
x.dot(A.T)
would meet that requirement.
A.dot(x.transpose(0,2,1))  # (3,3) with (480,3,512)
would also work, though the resulting array may need further transposing - assuming you want the 3 to be last.
You can also pair dimensions with einsum or tensordot:
np.einsum('ij,kli->klj', A, x)
x[::-1] flips x on its first dimension, the 480 one; the shape remains the same. Did you want the transpose?
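A quick sketch checking that these pairings agree (toy data; this applies A to each pixel and leaves out the [::-1] flip):
import numpy as np

A = np.random.rand(3, 3)
x = np.random.rand(480, 512, 3)

out1 = x.dot(A.T)                      # (480, 512, 3)
out2 = np.einsum('ij,klj->kli', A, x)  # same pairing spelled out explicitly
pixel = A.dot(x[0, 0])                 # A applied to a single pixel

print(np.allclose(out1, out2))         # True
print(np.allclose(out1[0, 0], pixel))  # True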

Python Numpy Logistic Regression

I'm trying to implement vectorized logistic regression in Python using numpy. My cost function (CF) seems to work OK. However, there is a problem with the gradient calculation: it returns a 3x100 array, whereas it should return 3x1. I think there is a problem with the (hypo-y) part.
def sigmoid(a):
    return 1/(1+np.exp(-a))

def CF(theta,X,y):
    m=len(y)
    hypo=sigmoid(np.matmul(X,theta))
    J=(-1./m)*((np.matmul(y.T,np.log(hypo)))+(np.matmul((1-y).T,np.log(1-hypo))))
    return(J)

def gr(theta,X,y):
    m=len(y)
    hypo=sigmoid(np.matmul(X,theta))
    grad=(1/m)*(np.matmul(X.T,(hypo-y)))
    return(grad)
X is a 100x3 array, y is 100x1, and theta is a 3x1 array. It seems both functions work individually; however, this optimization call gives an error:
optim = minimize(CF, theta, method='BFGS', jac=gr, args=(X,y))
The error: "ValueError: shapes (3,100) and (3,100) not aligned: 100 (dim 1) != 3 (dim 0)"
I think there is a problem with the (hypo-y) part.
Spot on!
hypo is of shape (100,) and y is of shape (100, 1). In the element-wise - operation, hypo is broadcasted to shape (1, 100) according to numpy's broadcasting rules. This results in a (100, 100) array, which causes the matrix multiplication to result in a (3, 100) array.
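A two-line illustration of that broadcasting pitfall:
import numpy as np

hypo = np.zeros(100)     # shape (100,)
y = np.zeros((100, 1))   # shape (100, 1)
print((hypo - y).shape)  # (100, 100): (100,) broadcasts as (1, 100)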
Fix this by bringing hypo into the same shape as y:
hypo = sigmoid(np.matmul(X, theta)).reshape(-1, 1) # -1 means automatic size on first dimension
There is one more issue: scipy.optimize.minimize (which I assume you are using) expects the gradient to be an array of shape (k,) but the function gr returns a vector of shape (k, 1). This is easy to fix:
return grad.reshape(-1)
The final function becomes
def gr(theta,X,y):
    m=len(y)
    hypo=sigmoid(np.matmul(X,theta)).reshape(-1, 1)
    grad=(1/m)*(np.matmul(X.T,(hypo-y)))
    return grad.reshape(-1)
and running it with toy data works (I have not checked the math or the plausibility of the results):
theta = np.reshape([1, 2, 3], (3, 1))
X = np.random.randn(100, 3)
y = np.round(np.random.rand(100, 1))
optim = minimize(CF, theta, method='BFGS', jac=gr, args=(X,y))
print(optim)
# fun: 0.6830931976615066
# hess_inv: array([[ 4.51307367, -0.13048255, 0.9400538 ],
# [-0.13048255, 3.53320257, 0.32364498],
# [ 0.9400538 , 0.32364498, 5.08740428]])
# jac: array([ -9.20709950e-07, 3.34459058e-08, 2.21354905e-07])
# message: 'Optimization terminated successfully.'
# nfev: 15
# nit: 13
# njev: 15
# status: 0
# success: True
# x: array([-0.07794477, 0.14840167, 0.24572182])
