Per-cell standardization efficiently in NumPy - python

I'm standardizing each cell in my train/test matrices across all users (1st dimension) using the following code. This is of course highly inefficient, but I wanted to make sure the idea worked. How do I do it using NumPy's optimized methods?
X_dims = X.shape
channels = 14  # not all columns as binary variables should stay untouched
mu_cell = np.zeros(shape=(channels, X_dims[2], X_dims[3]))
sigma_cell = np.zeros(shape=(channels, X_dims[2], X_dims[3]))
for j in range(channels):
    for k in range(X_dims[2]):
        for l in range(X_dims[3]):
            mu_cell[j,k,l] = np.mean(X_train[:,j,k,l])
            sigma_cell[j,k,l] = np.std(X_train[:,j,k,l])
def standardizeCellWise(matrix):
    for i in range(matrix.shape[0]):
        for j in range(channels):
            for k in range(matrix.shape[2]):
                for l in range(matrix.shape[3]):
                    matrix[i, j, k, l] -= mu_cell[j,k,l]
                    matrix[i, j, k, l] = matrix[i, j, k, l] / sigma_cell[j,k,l] if sigma_cell[j,k,l] != 0 else 0
    return matrix
X_train = standardizeCellWise(X_train)
X_test = standardizeCellWise(X_test)

The mu and sigma arrays can be calculated in a numpythonic way as shown here -
import numpy as np
mu_cell = X_train[:,0:channels,:,:].mean(0)
sigma_cell = X_train[:,0:channels,:,:].std(0)
Next up, if you know that you don't have any infinite number or NaN in the input matrix, you can use this vectorized approach to standardize cells -
def standardizeCellWise(matrix,mu_cell,sigma_cell):
    matrix_cut = matrix[:,0:channels,:,:]
    matrix_cut = (matrix_cut - mu_cell[None,:])/sigma_cell[None,:]
    mask = ~np.isfinite(matrix_cut)
    matrix_cut[mask] = 0
    matrix[:,0:channels,:,:] = matrix_cut
    return matrix
For a general input matrix case, you just need to change the calculation of the mask like so -
mask = np.tile(sigma_cell[None,:]==0,[matrix.shape[0],1,1,1])
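As a possible alternative to building a mask after the division, the division itself can skip zero-variance cells through the `where` argument of `np.divide`. This is only a minimal sketch, assuming the (users, channels, rows, columns) layout from the question; the name `standardize_cellwise`, the explicit `channels` parameter and the random test data are illustrative and not part of the original answer.

```python
import numpy as np

def standardize_cellwise(matrix, mu_cell, sigma_cell, channels):
    """Return a float copy with the first `channels` channels standardized per cell."""
    out = matrix.astype(float)               # astype copies, so the input is not modified
    centered = out[:, :channels] - mu_cell   # mu_cell broadcasts over the user axis
    scaled = np.zeros_like(centered)         # cells with sigma == 0 keep this 0
    np.divide(centered, sigma_cell, out=scaled, where=sigma_cell != 0)
    out[:, :channels] = scaled
    return out

# Hypothetical data: 6 users, 3 channels (only the first 2 are standardized), 2x2 cells
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 3, 2, 2))
X_train[:, 0, 0, 0] = 4.0                    # a constant cell, so its sigma is 0
channels = 2
mu_cell = X_train[:, :channels].mean(0)
sigma_cell = X_train[:, :channels].std(0)
X_std = standardize_cellwise(X_train, mu_cell, sigma_cell, channels)
```

Because no division by zero is ever carried out, this version should not emit runtime warnings, and it leaves genuine NaN or infinite inputs visible instead of silently zeroing them.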

Related

Why does my own implementation of Fast Fourier Transform in N dimensions give different results than NumPy's?

I've written the following code for the N-dimensional Fast Fourier Transform but it doesn't give me the same result as numpy's function.
def nffourier(f, direct):
    dim = f.ndim
    N = f.shape
    G = np.zeros(f.shape, dtype=complex)
    G = f
    for k in range(dim):
        for i in range(N[k]):
            aux = G[(slice(None),) * (k) + (i,)]
            trans = ffourier(aux, direct)
            G[(slice(None),) * (k) + (i,)] = trans
    return G
My code for calculating FFT in 1d is the following:
import math as m
import numpy as np
def ffourier(f, direct):
    N = len(f)
    M = int(m.log(N)/m.log(2))
    G = []
    order = []
    for i in range(N):
        order.append(int(bin(i)[2:]))
    digitos = len(bin(N - 1)[2:])  # number of binary digits of the largest index
    for i in range(N):
        contenido_aux = str(int(order[i]))
        aux = len(str(order[i]))
        if(aux<digitos):
            añadir=digitos-aux
            for k in range(añadir):
                contenido_aux = '0'+contenido_aux
        G.append(contenido_aux)
    for i in range(len(G)):
        G[i] = G[i][::-1]
    for i in range(len(G)):
        G[i] = int(G[i], 2)
    for i in range(len(G)):
        G[i] = f[G[i]]
    if direct == False:
        signo = 1
    else:
        signo = -1
    kmax = 1
    kmax = int(kmax)
    for alfa in range(1,M+1):
        w1 = np.exp(signo*1j*2*m.pi/(2**alfa))
        kmax = int(2*kmax)
        W = 1
        for k in range(0, int(kmax/2)-1+1):
            for s in range(0, N-1+1, int(kmax)):
                T0 = G[s+k]
                T1 = G[s+k+int(kmax/2)]*W
                G[s+k]=T0+T1
                G[s+k+int(kmax/2)]=T0-T1
            W=W*w1
    cte = 1/m.sqrt(N)
    for i in range(0, N-1+1):
        G[i] = G[i]*cte
    return G
The fundamentals are quite hard to explain (it's based on bit reversal), but I've checked that it works properly, so the problem must be in the N-dimensional code.
Your indexing G[(slice(None),) * (k) + (i,)] works in 2D but not in higher dimensions. Let’s see what it does:
Say G is 2D. Now when k=0, your indexing is the same as G[i], which is the same as G[i,:]. You are selecting rows. When k=1, then that indexing is G[:,i]. You are selecting columns.
But now say G is 3D. Now when k=0, you get G[i] again, which now is equivalent to G[i,:,:]. You are selecting a 2D subarray! What you need is a 1D subarray. You need to get G[i,j,:] for all i and all j. And then G[i,:,j], and then G[:,i,j].
Likewise, for a 5D array, you want G[i,j,k,l,:], etc. That is to say, you want to loop over all dimensions minus one.
To loop over all i and j, you could do a double loop, but then you have specific 3D code. It is possible to write a loop over an arbitrary number of dimensions, but it’s not pretty. So we’ll look for an alternative.
I think the simplest way to get this to work is to flatten those N-1 dimensions, turning a MxNxOxPxQ array into a 2D (N*M*O*P)xQ array. Now you can do a 1D loop over the first dimension.
Next you need to loop over the dimensions, leaving out a different dimension each time. We can simplify this by “rolling” the dimensions, making a different dimension the last one each time, and then applying the same flattening code. Now it’s easy to write a loop (not tested):
def nffourier(f, direct):
    dim = f.ndim
    G = f.astype(complex)
    for k in range(dim):
        G = np.moveaxis(G, 0, -1)  # shifts the dimensions by one to the left
        shape = G.shape
        m = shape[-1]
        G = np.reshape(G, (-1, m))  # flattens all but last dimension
        m = G.shape[0]
        for i in range(m):  # loop over first dimension
            G[i, :] = ffourier(G[i, :], direct)  # apply over last dimension
        G = np.reshape(G, shape)  # return to original shape
    # After applying moveaxis dim times, G should have the same dimension order it had at the start
    return G
(Note also, as we already discussed in the comments, that the G = f line causes the output array G to be of the same type as f, likely not complex, and so will cause errors also.)
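One way to gain confidence in the moveaxis/reshape idea independently of the hand-written 1D routine is to plug in NumPy's own 1D FFT and compare against `np.fft.fftn`. The sketch below does that; the function name `fft_nd_by_axes` and the `fft1d` parameter are illustrative, not taken from the question or answer. Since the question's `ffourier` multiplies by 1/sqrt(N), its results would presumably need to be compared with `norm="ortho"` rather than NumPy's default scaling.

```python
import numpy as np

def fft_nd_by_axes(f, fft1d=np.fft.fft):
    """Apply a 1D transform along every axis using the moveaxis/reshape idea."""
    G = np.asarray(f, dtype=complex)
    for _ in range(G.ndim):
        G = np.moveaxis(G, 0, -1)        # bring a not-yet-transformed axis to the end
        shape = G.shape
        rows = G.reshape(-1, shape[-1])  # flatten all but the last axis
        rows = np.stack([fft1d(row) for row in rows])
        G = rows.reshape(shape)
    return G  # after ndim moves the axes are back in their original order

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))
same_as_numpy = np.allclose(fft_nd_by_axes(x), np.fft.fftn(x))

# A 1D transform scaled by 1/sqrt(N) should match the "ortho" normalization
ortho_1d = lambda row: np.fft.fft(row, norm="ortho")
same_with_ortho = np.allclose(fft_nd_by_axes(x, ortho_1d),
                              np.fft.fftn(x, norm="ortho"))
```

If both comparisons hold, any remaining mismatch with a custom 1D routine points at that routine (or its normalization) rather than at the N-dimensional loop.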

Is there a way to speed up these nested loops (Laplacian case) Python?

I am trying to speed up the nested loop in my function Gram.
The function causing the big delay is the Laplacian (Abel) kernel, because each cell of the matrix requires the norm of the difference between a row of X and a row of Y.
abel = lambda x,y,t,p: np.exp(-np.abs(p) * np.linalg.norm(x-y))
def Gram(X,Y,function,t,p):
    n = X.shape[0]
    s = Y.shape[0]
    K = np.zeros((n,s))
    if function==abel:
        for i in range(n):
            for j in range(s):
                K[i,j] = abel(X[i,:],Y[j,:],t,p)
    else:
        K = polynomial(X,Y,t,p)
    return K
I was able to speed up the function a bit by keeping the exponential part out of the abel equation and then I apply it for the whole matrix.
abel_2 = lambda x,y,t,p: np.linalg.norm(x-y)  # t and p are unused here
def Gram_2(X,Y,function,t,p):
    n = X.shape[0]
    s = Y.shape[0]
    K = np.zeros((n,s))
    if function==abel_2:
        for i in range(n):
            for j in range(s):
                K[i,j] = abel_2(X[i,:],Y[j,:],0,0)
        K = np.exp(-abs(p)*K)
    else:
        K = polynomial(X,Y,t,p)
    return K
The time is reduced by 50%, however, the double loops (nested) are still a major problem, I believe.
Can someone help with this?
Thank you!
Basically, instead of going through the loops one by one to subtract Y[j,:] from X[i,:], it saves a lot of time to select X[i,:], subtract all of Y from it at once, and then apply the norm along the right axis.
In my case that was axis=1.
def Gram_10(X,Y,function,t,p):
    n = X.shape[0]
    s = Y.shape[0]
    K = np.zeros((n,s))
    if function==abel_2:
        for i in range(n):
            # it is important to use the correct slice (:s), so that the vector
            # returned by the norm goes to the right place in K
            K[i,:s] = np.linalg.norm(X[i,:]-Y,axis=1)
        K = np.exp(-abs(p)*K)
    else:
        K = polynomial(X,Y,t,p)
    return K
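The remaining loop over i can in principle be removed as well by letting broadcasting form every row pair at once. The following is a sketch under the assumption that the intermediate array of shape (n, s, d) fits in memory; the name `gram_abel` is illustrative and not from the question.

```python
import numpy as np

def gram_abel(X, Y, p):
    """Laplacian (Abel) Gram matrix K[i, j] = exp(-|p| * ||X[i] - Y[j]||)."""
    diff = X[:, None, :] - Y[None, :, :]   # all row pairs, shape (n, s, d)
    dist = np.linalg.norm(diff, axis=2)    # Euclidean norms, shape (n, s)
    return np.exp(-np.abs(p) * dist)
```

For large n and s the temporary (n, s, d) array can become the bottleneck; in that case the one-loop version above, or a pairwise-distance routine such as `scipy.spatial.distance.cdist`, may be the better trade-off.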

Quick way to divide matrix entries K_ij by K_ii*K_jj in Python

In Python, I have a matrix K of dimensions (N x N). I want to normalize K by dividing every entry K_ij by sqrt(K_(i,i)*K_(j,j)). What is a fast way to achieve this in Python without iterating through every entry?
My current solution is:
import numpy as np
K = np.random.rand(3,3)
diag = np.diag(K)
for i in range(np.shape(K)[0]):
    for j in range(np.shape(K)[1]):
        K[i,j] = K[i,j]/np.sqrt(diag[i]*diag[j])
Of course you have to iterate through every entry, at least internally. For square matrices:
K / np.sqrt(np.einsum('ii,jj->ij', K, K))
If the matrix is not square, you first have to define what should replace the "missing" values K[i,i] where i > j etc.
Alternative: use numba to leave your loop as is, get free speedup, and even avoid intermediate allocation:
from numba import njit
@njit
def normalize(K):
    M = np.empty_like(K)
    m, n = K.shape
    for i in range(m):
        Kii = K[i,i]
        for j in range(n):
            Kjj = K[j,j]
            M[i,j] = K[i,j] / np.sqrt(Kii * Kjj)
    return M
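Another way to express the same normalization in plain NumPy is an outer product of the square roots of the diagonal. This is a small sketch that assumes a square matrix with strictly positive diagonal entries (so that sqrt(K_ii * K_jj) equals sqrt(K_ii) * sqrt(K_jj)); the function name `normalize_by_diagonal` is illustrative.

```python
import numpy as np

def normalize_by_diagonal(K):
    """Return K[i, j] / sqrt(K[i, i] * K[j, j]) for a square matrix K."""
    d = np.sqrt(np.diag(K))      # sqrt(K_ii), assumed strictly positive
    return K / np.outer(d, d)    # outer(d, d)[i, j] = sqrt(K_ii) * sqrt(K_jj)
```

Unlike the in-place double loop in the question, this returns a new array and reads the diagonal once, before any entry is changed.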

Compute an adjacency matrix efficiently

I have a recommendation dataset that I have transformed into a matrix of the form:
       item1  item2  item3  ...
user1  NaN    2.3    NaN
user2  1.7    3.4    NaN
user3  NaN    1.1    2.6
...
where NaN are items that the particular user has not reviewed yet. The above is in the form of a pandas dataframe. I want to construct an adjacency matrix from this, based on a predefined distance metric. I have a working function:
def compute_adjacency_matrix(reccomender_matrix):
    # replace nan with 0
    rec_num = reccomender_matrix.fillna(value=0)
    # compute the distances between every two users
    result = np.array([[compute_distance(li[2:], lj[2:]) for lj in rec_num.itertuples()] for li in rec_num.itertuples()])
    adjacency_matrix = (result > 0.0).astype(int)
    return adjacency_matrix
The problem is that, for large matrices, the line that computes result takes very long. What is the most efficient way of doing this, that would scale for larger datasets?
EDIT: Here is the compute distance function:
def compute_distance(vec1, vec2):
    rez = sum(abs(vec1[(vec1>0)&(vec2>0)] - vec2[(vec1>0)&(vec2>0)]))
    norm = np.count_nonzero(vec1) if np.count_nonzero(vec1) < np.count_nonzero(vec2) else np.count_nonzero(vec2)
    norm_rez = rez / norm
    return norm_rez
So it looks like you want a mean absolute distance metric, although that's not exactly what you wrote (since you're normalizing not by the size of the intersection but the size of the smaller vector). If you want mean absolute distance, it's simply:
def compute_distance(vec1, vec2):
    return np.nanmean(np.abs(vec1 - vec2))
You can then use that metric with scipy.spatial.distance.pdist and squareform:
from scipy.spatial.distance import pdist, squareform
def compute_adjacency_matrix(reccomender_matrix):
    result = squareform(pdist(reccomender_matrix.values.T, metric = compute_distance))
    result = np.nan_to_num(result)
    adjacency_matrix = (result > 0.0).astype(int)
    return adjacency_matrix
As noted in my comment, I think you need to rethink your metrics and outputs. That code will make anyone who's recommended the same item adjacent, no matter what score they gave, unless they gave the same scores, in which case they won't be adjacent. Not sure that's what you want.
A slightly better method would be carrying through the nans and using them to make your adjacency matrix.
def compute_adjacency_matrix(reccomender_matrix):
    result = squareform(pdist(reccomender_matrix.values.T, metric = compute_distance))
    adjacency_matrix = np.logical_not(np.isnan(result)).astype(int)
    return adjacency_matrix
If you don't need the distances, you can do it all with binary operations:
def adjacency(x, y):
    return np.any(np.logical_and(x, y))
def compute_adjacency_matrix(reccomender_matrix):
    return squareform(pdist(np.isfinite(reccomender_matrix.values.T),
                            metric = adjacency)).astype(int)
Finally, you can do it all with numba if that's all too slow:
import numpy as np
import numba as nb
@nb.njit
def compute_adjacency_matrix(reccomender_matrix):
    n, m = reccomender_matrix.shape
    out = np.zeros((m, m))
    count = np.zeros((m, m))
    dists = np.zeros((m, m))
    adj = np.zeros((m, m))
    for i in range(m):  # every column, including the first
        for j in range(i + 1, m):
            for k in range(n):
                if not (np.isnan(reccomender_matrix[k, i]) or
                        np.isnan(reccomender_matrix[k, j])):
                    out[i, j] += np.abs(reccomender_matrix[k, i] - reccomender_matrix[k, j])
                    count[i, j] += 1
    for i in range(m):
        for j in range(m):
            if i == j:
                dists[i, j] = 0.
            elif i < j:
                if count[i, j] != 0:
                    dists[i, j] = out[i, j] / count[i, j]
                    adj[i, j] = 1
                else:
                    dists[i, j] = 0.
            else:
                dists[i, j] = dists[j, i]
                adj[i, j] = adj[j, i]
    return dists, adj
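If only the "have at least one rated item in common" notion of adjacency is needed, a single matrix product on the NaN mask may be enough. The sketch below assumes a plain 2D NumPy array in which each row is one entity to connect (users, as in the question; pass the transpose to connect columns instead); the name `co_rated_adjacency` and the small example array are illustrative.

```python
import numpy as np

def co_rated_adjacency(values):
    """Adjacency between rows that share at least one non-NaN column."""
    rated = np.isfinite(values).astype(int)   # 1 where a rating exists
    shared = rated @ rated.T                  # shared[i, j] = number of co-rated items
    adj = (shared > 0).astype(int)
    np.fill_diagonal(adj, 0)                  # no self-loops
    return adj

nan = np.nan
ratings = np.array([[nan, 2.3, nan],
                    [1.7, 3.4, nan],
                    [nan, 1.1, 2.6],
                    [0.5, nan, nan]])         # hypothetical fourth user added for illustration
adj = co_rated_adjacency(ratings)
```

With a pandas DataFrame, something like `co_rated_adjacency(df.to_numpy())` should work, provided the frame holds only numeric columns.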

Find closest k points for every point in row of numpy array

I have a np array, X that is size 1000 x 1000 where each element is a real number. I want to find the 5 closest points for every point in each row of this np array. Here the distance metric can just be abs(x-y). I have tried to do
for i in range(X.shape[0]):
    knn = NearestNeighbors(n_neighbors=5)
    knn.fit(X[i])
    for j in range(X.shape[1]):
        d = knn.kneighbors(X[i,j], return_distance=False)
However, this does not work for me and I am not sure how efficient this is. Is there a way around this? I have seen a lot of methods for comparing vectors but not any for comparing single elements. I know that I could use a for loop and loop over and find the k smallest, but this would be computationally expensive. Could a KD tree work for this? I have tried a method similar to
"Finding index of nearest point in numpy arrays of x and y coordinates".
However, I can not get this to work. Is there some numpy function I don't know about that could accomplish this?
Construct a kdtree with scipy.spatial.cKDTree for each row of your data.
import numpy as np
import scipy.spatial
def nearest_neighbors(arr, k):
    k_lst = list(range(k + 2))[2:]  # [2, ..., k+1]: skip the 1st neighbor, which is the point itself
    neighbors = []
    for row in arr:
        # stack the data so each element is in its own row
        data = np.vstack(row)
        # construct a kd-tree
        tree = scipy.spatial.cKDTree(data)
        # find k nearest neighbors for each element of data, squeezing out the zero result (the first nearest neighbor is always itself)
        dd, ii = tree.query(data, k=k_lst)
        # apply an index filter on data to get the nearest neighbor elements
        closest = data[ii].reshape(-1, k)
        neighbors.append(closest)
    return np.stack(neighbors)
N = 1000
k = 5
A = np.random.random((N, N))
nearest_neighbors(A, k)
I'm not really sure how you want the final results. But this definitely gets you what you need.
np.random.seed([3,1415])
X = np.random.rand(1000, 1000)
Grab the upper-triangle indices to track every combination of points per row:
x1, x2 = np.triu_indices(X.shape[1], 1)
Generate an array of all distances:
d = np.abs(X[:, x1] - X[:, x2])
Find the closest 5 for every row:
tpos = np.argpartition(d, 5)[:, :5]
Then x1[tpos] gives the row-wise positions of the first point in the closest pairs while x2[tpos] gives the second position of the closest pairs.
Here is an argsorting solution that strives to take advantage of the simple metric:
import numpy as np
from numpy.lib import stride_tricks
from timeit import timeit
def nn(A, k):
    out = np.zeros((A.shape[0], A.shape[1] + 2*k), dtype=int)
    out[:, k:-k] = np.argsort(A, axis=-1)
    out[:, :k] = out[:, -k-1, None]
    out[:, -k:] = out[:, k, None]
    strd = stride_tricks.as_strided(
        out, strides=out.strides + (out.strides[-1],), shape=A.shape + (2*k+1,))
    delta = A[np.arange(A.shape[0])[:, None, None], strd]
    delta -= delta[..., k, None]
    delta = np.abs(delta)
    s = np.argpartition(delta,(0, k), axis = -1)[..., 1:k+1]
    inds = tuple(np.ogrid[:strd.shape[0], :strd.shape[1], :0][:2])
    res = np.empty(A.shape + (k,), dtype=int)
    res[np.arange(strd.shape[0])[:, None, None], out[:, k:-k, None],
        np.arange(k)[None, None, :]] = strd[inds + (s,)]
    return res
N = 1000
k = 5
r = 10
A = np.random.random((N, N))
res = nn(A, k)
# crude test
print(np.abs(A[np.arange(N)[:, None, None], res]-A[..., None]).mean())
# timings
print(timeit(lambda: nn(A, k), number=r) / r)
Output:
# 0.00150537172454
# 0.4567880852999224
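To check any of the faster approaches above on small inputs, a brute-force reference for a single row can be handy. The sketch below is not meant to be fast (it builds an n x n distance matrix per row) and the name `knn_1d_bruteforce` is illustrative; it assumes the abs(x-y) metric from the question and k smaller than the row length.

```python
import numpy as np

def knn_1d_bruteforce(row, k):
    """Indices of the k nearest other elements for each element of a 1D array."""
    d = np.abs(row[:, None] - row[None, :])          # all pairwise distances, shape (n, n)
    np.fill_diagonal(d, np.inf)                      # a point is not its own neighbor
    idx = np.argpartition(d, k - 1, axis=1)[:, :k]   # k smallest per row, in no particular order
    return idx

x = np.array([0.0, 1.0, 1.1, 5.0, 5.2, 9.0])
idx = knn_1d_bruteforce(x, 2)
```

Here `idx[i]` holds the positions in `x` of the two values closest to `x[i]`; applying the function to each row of the 1000 x 1000 array in a loop would give a slow but simple baseline to compare against.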
