More pythonic way of creating within-class scatter matrix - python

I am looking for a better way of calculating the following:
import numpy as np
np.random.seed(123)
# test code
t = np.random.randint(3, size = 100)
X = np.random.random((100, 3))
m = np.random.random((3, 3))
# current method
res = 0
for k in np.unique(t):
    for row in X[t == k] - m[k]:
        res += np.outer(row, row)
res
"""
Output:
array([[12.45661335, -3.51124346,  3.75900294],
       [-3.51124346, 14.85327689, -3.02281263],
       [ 3.75900294, -3.02281263, 18.30868772]])
"""
I would prefer to get rid of the for loops using numpy.
This is the within-class scatter matrix for Fisher's linear discriminant.

You can write it as follows:
Y = X - m[t]
np.matmul(Y.T, Y)
This works because sum_i x_i x_i' = X' X, where X is the (N, 3) matrix whose i-th row is x_i = X[i, :], and ' denotes the transpose.
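As a quick sanity check (reusing t, X, m, and res from the snippet above), the vectorized form reproduces the loop result:
Y = X - m[t]                  # subtract each sample's class mean
vectorized = np.matmul(Y.T, Y)  # equals the accumulated sum of outer products
np.allclose(vectorized, res)    # True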

Related

Arrange and Sub-sample 3d Points in Coordinate Grid With Numpy

I have a list of 3d points such as
np.array([
    [220, 114, 2000],
    [125.24, 214, 2519],
    ...
    [54.1, 254, 1249]
])
The points are in no meaningful order. I'd like to sort and reshape the array in a way that better represents a coordinate grid (such that I have a known width and height and can retrieve Z values by index). I would also like to downsample the points into, say, whole integers to handle collisions, applying min, max, or mean during the downsampling.
I know I can downsample a 1D array using np.mean and np.shape.
The approach I'm currently using finds the min and max in X, Y and then puts the Z values into a 2D array while doing the downsampling manually.
This iterates the giant array numerous times, and I'm wondering if there is a way to do this with np.meshgrid or some other numpy functionality that I'm overlooking.
Thanks
You can use the binning method from Most efficient way to sort an array into bins specified by an index array?
To get an index array from y, x coordinates, you can use np.searchsorted and np.ravel_multi_index.
Here is a sample implementation; the stb module is the code from the linked post.
import numpy as np
from stb import sort_to_bins_sparse as sort_to_bins

def grid1D(u, N):
    mn, mx = u.min(), u.max()
    return np.linspace(mn, mx, N, endpoint=False)

def gridify(yxz, N):
    try:
        Ny, Nx = N
    except TypeError:
        Ny = Nx = N
    y, x, z = yxz.T
    yg, xg = grid1D(y, Ny), grid1D(x, Nx)
    yidx, xidx = yg.searchsorted(y, 'right') - 1, xg.searchsorted(x, 'right') - 1
    yx = np.ravel_multi_index((yidx, xidx), (Ny, Nx))
    zs = sort_to_bins(yx, z)
    return np.concatenate([[0], np.bincount(yx).cumsum()]), zs, yg, xg

def bin(yxz, N, binning_method='min'):
    boundaries, binned, yg, xg = gridify(yxz, N)
    result = np.full((yg.size, xg.size), np.nan)
    if binning_method == 'min':
        result.reshape(-1)[:len(boundaries)-1] = np.minimum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'max':
        result.reshape(-1)[:len(boundaries)-1] = np.maximum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'mean':
        result.reshape(-1)[:len(boundaries)-1] = np.add.reduceat(binned, boundaries[:-1]) / np.diff(boundaries)
    else:
        raise ValueError
    result.reshape(-1)[np.where(boundaries[1:] == boundaries[:-1])] = np.nan
    return result

def test():
    yxz = np.random.uniform(0, 100, (100000, 3))
    N = 20
    boundaries, binned, yg, xg = gridify(yxz, N)
    binmin = bin(yxz, N)
    binmean = bin(yxz, N, 'mean')
    y, x, z = yxz.T
    for i in range(N-1):
        for j in range(N-1):
            msk = (y >= yg[i]) & (y < yg[i+1]) & (x >= xg[j]) & (x < xg[j+1])
            assert (z[msk].min() == binmin[i, j]) if msk.any() else np.isnan(binmin[i, j])
            assert np.isclose(z[msk].mean(), binmean[i, j]) if msk.any() else np.isnan(binmean[i, j])
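A minimal usage sketch, assuming the stb module from the linked post is importable and using a hypothetical yxz array with columns in y, x, z order:
yxz = np.random.uniform(0, 100, (10000, 3))  # hypothetical sample points
grid_mean = bin(yxz, (20, 30), 'mean')       # 20 x 30 grid of per-cell mean Z values
grid_min = bin(yxz, 50)                      # square 50 x 50 grid of per-cell min Z values
grid_mean.shape, grid_min.shape              # ((20, 30), (50, 50))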

Numpy find covariance of two 2-dimensional ndarray

I am new to numpy and am stuck at this problem.
I have two 2-dimensional numpy arrays such as
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
I want to use numpy's cov function to find the covariance of these two ndarrays row-wise, i.e., for the above example the output array should consist of 10 elements, each denoting the covariance of the corresponding rows of the ndarrays. I know I can do this by traversing the rows and finding the covariance of two 1D arrays, but it isn't pythonic.
Edit 1: By the covariance of two arrays I mean the element at index [0, 1] of numpy.cov.
Edit 2: Currently this is my implementation:
s = numpy.empty((x.shape[0], 1))
for i in range(x.shape[0]):
    s[i] = numpy.cov(x[i], y[i])[0][1]
Use the definition of the covariance: E(XY) - E(X)E(Y).
import numpy as np
x = np.random.random((10, 5))
y = np.random.random((10, 5))
n = x.shape[1]
cov_bias = np.mean(x * y, axis=1) - np.mean(x, axis=1) * np.mean(y, axis=1)
cov_bias * n / (n-1)
Note that cov_bias corresponds to the result of numpy.cov(bias=True).
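As a quick check, the corrected result matches a per-row numpy.cov loop (reusing x, y, n, and cov_bias from the snippet above):
cov_unbiased = cov_bias * n / (n - 1)  # Bessel-corrected, like np.cov's default ddof=1
expected = np.array([np.cov(x[i], y[i])[0, 1] for i in range(x.shape[0])])
np.allclose(cov_unbiased, expected)    # True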
Here's one using the definition of covariance and inspired by corr2_coeff_rowwise -
def covariance_rowwise(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(-1, keepdims=True)
    B_mB = B - B.mean(-1, keepdims=True)
    # Finally get covariance
    N = A.shape[1]
    return np.einsum('ij,ij->i', A_mA, B_mB) / (N - 1)
Sample run -
In [66]: np.random.seed(0)
    ...: x = np.random.random((10, 5))
    ...: y = np.random.random((10, 5))

In [67]: s = np.empty((x.shape[0]))
    ...: for i in range(x.shape[0]):
    ...:     s[i] = np.cov(x[i], y[i])[0][1]

In [68]: np.allclose(covariance_rowwise(x, y), s)
Out[68]: True
This works, but I'm not sure if it is faster for larger matrices x and y: the call numpy.cov(x, y) computes many entries that we then discard with numpy.diag:
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
# with loop
for (xi, yi) in zip(x, y):
    print(numpy.cov(xi, yi)[0][1])
# vectorized
cov_mat = numpy.cov(x, y)
covariances = numpy.diag(cov_mat, x.shape[0])
print(covariances)
I also did some timing for square matrices of size n x n:
import time
import numpy

def run(n):
    x = numpy.random.random((n, n))
    y = numpy.random.random((n, n))

    started = time.time()
    for (xi, yi) in zip(x, y):
        numpy.cov(xi, yi)[0][1]
    needed_loop = time.time() - started

    started = time.time()
    cov_mat = numpy.cov(x, y)
    covariances = numpy.diag(cov_mat, x.shape[0])
    needed_vectorized = time.time() - started

    print(
        f"n={n:4d} needed_loop={needed_loop:.3f} s "
        f"needed_vectorized={needed_vectorized:.3f} s"
    )

for n in (100, 200, 500, 600, 700, 1000, 2000, 3000):
    run(n)
The output on my slow MacBook Air is:
n= 100 needed_loop=0.006 s needed_vectorized=0.001 s
n= 200 needed_loop=0.011 s needed_vectorized=0.003 s
n= 500 needed_loop=0.033 s needed_vectorized=0.023 s
n= 600 needed_loop=0.041 s needed_vectorized=0.039 s
n= 700 needed_loop=0.043 s needed_vectorized=0.049 s
n=1000 needed_loop=0.061 s needed_vectorized=0.130 s
n=2000 needed_loop=0.137 s needed_vectorized=0.742 s
n=3000 needed_loop=0.224 s needed_vectorized=2.264 s
So the break-even point is around n=600.
Pick the diagonal vector of cov(x,y) and expand dims:
numpy.expand_dims(numpy.diag(numpy.cov(x,y),x.shape[0]),1)
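For example (regenerating x and y as in the question), this one-liner reproduces the loop-based s as a (10, 1) column vector:
import numpy
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
s = numpy.expand_dims(numpy.diag(numpy.cov(x, y), x.shape[0]), 1)
s.shape  # (10, 1), each entry equal to numpy.cov(x[i], y[i])[0][1]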

matrix in object datatype python

Can you please explain how to create a matrix in Python with object datatype? My code:
w, h = 8, 5;
Matrix = ([[0 for x in range(w)] for y in range(h)],dtype=object)
gives a syntax error. I tried various other ways, but none of them work.
Thanks a lot
In your code the Matrix line tries to create a tuple; however, you are giving it dtype=object, which is keyword-argument syntax and only valid inside a function call.
Matrix = ([[0 for x in range(w)] for y in range(h)],dtype=object)
The line reads: set Matrix to the tuple (2D list, dtype=object). However, the second part is not a valid expression on its own, hence the syntax error. You can create the matrix as follows:
Matrix = [[0 for x in range(w)] for y in range(h)]
Or if you would like to have a numpy array with dtype object:
import numpy as np
Matrix = np.array([[0 for x in range(w)] for y in range(h)], dtype=object)
Or even cleaner:
import numpy as np
Matrix = np.zeros((h, w), dtype=object)
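A quick illustrative sketch of why dtype=object can be useful: each cell can hold an arbitrary Python object.
import numpy as np
w, h = 8, 5
Matrix = np.zeros((h, w), dtype=object)
Matrix[0, 0] = "text"            # a string
Matrix[0, 1] = [1, 2, 3]         # a list
Matrix[0, 2] = {"key": "value"}  # a dict
Matrix.dtype, Matrix.shape       # (dtype('O'), (5, 8))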
Let me present two options: one using the numpy module and one using plain loops.
import numpy as np
print("Using numpy module:")
x = np.array([1,5,2])
y = np.array([7,4,1])
sum = x + y
subtract = x - y
mult = x * y
div = x / y
print("Sum: {}".format(sum))
print("Subtraction: {}".format(subtract))
print("Multiplication: {}".format(mult))
print("Division: {}".format(div))
print("----------------------------------------")
print("Using for loops:")
x = [1,5,2]
y = [7,4,1]
sum = []
subtract = []
mult =[]
div = []
for i, j in zip(x, y):
    sum.append(i+j)
    subtract.append(i-j)
    mult.append(i*j)
    div.append(i/j)
print(sum)
print(subtract)
print(mult)
print(div)

dot product with diagonal matrix, without creating it full matrix

I'd like to calculate a dot product of two matrices, where one of them is a diagonal matrix. However, I don't want to use np.diag or np.diagflat in order to create the full matrix, but instead use the 1D array directly filled with the diagonal values. Is there any way or numpy operation which I can use for this kind of problem?
x = np.arange(9).reshape(3,3)
y = np.arange(3) # diagonal elements
z = np.dot(x, np.diag(y))
and the solution I'm looking for should be without np.diag
z = x ??? y
Directly multiplying the ndarray by your vector will work. Numpy conveniently assumes that you want to multiply the nth column of x by the nth element of your y.
x = np.random.random((5, 5))
y = np.random.random(5)
diagonal_y = np.diag(y)
z = np.dot(x, diagonal_y)
np.allclose(z, x * y) # Will return True
The Einstein summation is an elegant solution to this kind of problem:
import numpy as np
x = np.random.uniform(0,1, size=5)
w = np.random.uniform(0,1, size=(5, 3))
diagonal_x = np.diagflat(x)
z = np.dot(diagonal_x, w)
zz = np.einsum('i,ij->ij', x, w)
np.allclose(z, zz) # Will return True
See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html#numpy.einsum
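The same einsum idea also covers the original question's x @ diag(y) case; a quick sketch reusing the x and y from the question:
import numpy as np
x = np.arange(9).reshape(3, 3)
y = np.arange(3)                      # diagonal elements
z_ref = np.dot(x, np.diag(y))         # with the full diagonal matrix
z_vec = np.einsum('ij,j->ij', x, y)   # scales the j-th column of x by y[j]
np.allclose(z_ref, z_vec)             # True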

Scipy Fast 1-D interpolation without any loop

I have two 2D arrays, x(ni, nj) and y(ni, nj), that I need to interpolate over one axis. I want to interpolate along the last axis for every ni.
I wrote:
import numpy as np
from scipy.interpolate import interp1d
z = np.asarray([200,300,400,500,600])
out = []
for i in range(ni):
    f = interp1d(x[i,:], y[i,:], kind='linear')
    out.append(f(z))
out = np.asarray(out)
However, I think this method is inefficient and slow due to the loop if the array size is large. What is the fastest way to interpolate a multi-dimensional array like this? Is there any way to perform linear and cubic interpolation without a loop? Thanks.
The method you propose does have a python loop, so for large values of ni it is going to get slow. That said, unless you are going to have a large ni, you shouldn't worry much.
I have created sample input data with the following code:
def sample_data(n_i, n_j, z_shape):
    x = np.random.rand(n_i, n_j) * 1000
    x.sort()
    x[:, 0] = 0
    x[:, -1] = 1000
    y = np.random.rand(n_i, n_j)
    z = np.random.rand(*z_shape) * 1000
    return x, y, z
And have tested them with these two versions of linear interpolation:
def interp_1(x, y, z):
    rows, cols = x.shape
    out = np.empty((rows,) + z.shape, dtype=y.dtype)
    for j in range(rows):
        out[j] = interp1d(x[j], y[j], kind='linear', copy=False)(z)
    return out

def interp_2(x, y, z):
    rows, cols = x.shape
    row_idx = np.arange(rows).reshape((rows,) + (1,) * z.ndim)
    col_idx = np.argmax(x.reshape(x.shape + (1,) * z.ndim) > z, axis=1) - 1
    ret = y[row_idx, col_idx + 1] - y[row_idx, col_idx]
    ret /= x[row_idx, col_idx + 1] - x[row_idx, col_idx]
    ret *= z - x[row_idx, col_idx]
    ret += y[row_idx, col_idx]
    return ret
interp_1 is an optimized version of your code, following Dave's answer. interp_2 is a vectorized implementation of linear interpolation that avoids any python loop whatsoever. Coding something like this requires a sound understanding of broadcasting and indexing in numpy, and some things are going to be less optimized than what interp1d does. A prime example is finding the bin in which to interpolate a value: interp1d will surely break out of its loop early once it finds the bin, while the above function compares the value to all bins.
So the result is going to be very dependent on what n_i and n_j are, and even how long your array z of values to interpolate is. If n_j is small and n_i is large, you should expect an advantage from interp_2, and from interp_1 if it is the other way around. Smaller z should be an advantage to interp_2, longer ones to interp_1.
I have actually timed both approaches with a variety of n_i and n_j, for z of shape (5,) and (50,); the code used to produce the timing graphs is at the end of this answer.
So it seems that for z of shape (5,) you should go with interp_2 whenever n_j < 1000, and with interp_1 elsewhere. Not surprisingly, the threshold is different for z of shape (50,), now being around n_j < 100. It seems tempting to conclude that you should stick with your code if n_j * len(z) > 5000, but change it to something like interp_2 above if not, but there is a great deal of extrapolating in that statement! If you want to further experiment yourself, here's the code I used to produce the graphs.
import timeit
import matplotlib.pyplot as plt

n_s = np.logspace(1, 3.3, 25)
int_1 = np.empty((len(n_s),) * 2)
int_2 = np.empty((len(n_s),) * 2)
z_shape = (5,)

for i, n_i in enumerate(n_s):
    print(int(n_i))
    for j, n_j in enumerate(n_s):
        x, y, z = sample_data(int(n_i), int(n_j), z_shape)
        int_1[i, j] = min(timeit.repeat('interp_1(x, y, z)',
                                        'from __main__ import interp_1, x, y, z',
                                        repeat=10, number=1))
        int_2[i, j] = min(timeit.repeat('interp_2(x, y, z)',
                                        'from __main__ import interp_2, x, y, z',
                                        repeat=10, number=1))

cs = plt.contour(n_s, n_s, np.transpose(int_1 - int_2))
plt.clabel(cs, inline=1, fontsize=10)
plt.xlabel('n_i')
plt.ylabel('n_j')
plt.title('timeit(interp_2) - timeit(interp_1), z.shape=' + str(z_shape))
plt.show()
One optimization is to allocate the result array once like so:
import numpy as np
from scipy.interpolate import interp1d
z = np.asarray([200,300,400,500,600])
out = np.zeros([ni, len(z)], dtype=np.float32)
for i in range(ni):
    f = interp1d(x[i,:], y[i,:], kind='linear')
    out[i,:] = f(z)
This will save you some of the memory copying that occurs in your implementation, in the calls to out.append(...).
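If only linear interpolation is needed, a lighter-weight variant of the same loop uses np.interp, which skips building an interp1d object per row. A sketch reusing ni, x, and y from the question (note that, unlike interp1d, np.interp clamps values of z outside the range of x[i] instead of raising):
import numpy as np
z = np.asarray([200, 300, 400, 500, 600])
out = np.empty((ni, len(z)))
for i in range(ni):
    out[i, :] = np.interp(z, x[i, :], y[i, :])  # piecewise-linear interpolation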
