Conditional sum over matrices in python/numpy - python

I have two numpy arrays X and W each with shape (N,N) that result from the end of a calculation. Subdivide the range of X into equal intervals [min(X), min(X)+delta, min(X)+2*delta,..., max(X)]. I'd like to know, given an interval starting point v, the total of the corresponding W values:
idx = (X>=v) & (X<(v+delta))
W[idx].sum()
I need this sum for all starting intervals (ie. the entire range of X) and I need to do this for many different matrices X and W. Profiling has determined that this is the bottleneck. What I'm doing now amounts to:
W_total = []
for v0, v1 in zip(X, X[1:]):
idx = (X>=x0) & (X<x1)
W_total.append( W[idx].sum() )
How can I speed this up?

You can use numpy.histogram() to compute all those sums in a single operation:
sums, bins = numpy.histogram(
X, bins=numpy.arange(X.min(), X.max(), delta), weights=W)

Have you tried numpy.histogram?
nbins = (X.max() - X.min()) / delta
W_total = np.histogram(X, weights=W, bins=nbins)

Related

Efficient way to map 3D function to a meshgrid with NumPy

I have a set of data values for a scalar 3D function, arranged as inputs x,y,z in an array of shape (n,3) and the function values f(x,y,z) in an array of shape (n,).
EDIT: For instance, consider the following simple function
data = np.array([np.arange(n)]*3).T
F = np.linalg.norm(data,axis=1)**2
I would like to convolve this function with a spherical kernel in order to perform a 3D smoothing. The easiest way I found to perform this is to map the function values in a 3D spatial grid and then apply a 3D convolution with the kernel I want.
This works fine, however the part that maps the 3D function to the 3D grid is very slow, as I did not find a way to do it with NumPy only. The code below is my actual implementation, where data is the (n,3) array containing the 3D positions in which the function is evaluated, F is the (n,) array containing the corresponding values of the function and M is the (N,N,N) array that contains the 3D space grid.
step = 0.1
# Create meshgrid
xmin = data[:,0].min()
xmax = data[:,0].max()
ymin = data[:,1].min()
ymax = data[:,1].max()
zmin = data[:,2].min()
zmax = data[:,2].max()
x = np.linspace(xmin,xmax,int((xmax-xmin)/step)+1)
y = np.linspace(ymin,ymax,int((ymax-ymin)/step)+1)
z = np.linspace(zmin,zmax,int((zmax-zmin)/step)+1)
# Build image
M = np.zeros((len(x),len(y),len(z)))
for l in range(len(data)):
for i in range(len(x)-1):
if x[i] < data[l,0] < x[i+1]:
for j in range(len(y)-1):
if y[j] < data[l,1] < y[j+1]:
for k in range(len(z)-1):
if z[k] < data[l,2] < z[k+1]:
M[i,j,k] = F[l]
Is there a more efficient way to fill a 3D spatial grid with the values of a 3D function ?
For each item of data you're scanning pixels of cuboid to check if it's inside. There is an option to skip this scan. You could calculate corresponding indices of these pixels by yourself, for example:
data = np.array([[1, 2, 3], #14 (corner1)
[4, 5, 6], #77 (corner2)
[2.5, 3.5, 4.5], #38.75 (duplicated pixel)
[2.9, 3.9, 4.9], #47.63 (duplicated pixel)
[1.5, 2, 3]]) #15.25 (one step up from [1, 2, 3])
step = 0.5
data_idx = ((data - data.min(axis=0))//step).astype(int)
M = np.zeros(np.max(data_idx, axis=0) + 1)
x, y, z = data_idx.T
M[x, y, z] = F
Note that only one value of duplicated pixels is being mapped to M.
All you need is just reshape F[:, 3] (only f(x, y, z)) into a grid. Hard to be more precise without sample data:
If the data is not sorted, you need to sort it:
F_sorted = F[np.lexsort((F[:,0], F[:,1], F[:,2]))] # sort by x, then y, then z
Choose only f(x, y, z)
F_values = F_sorted[:, 3]
Finally, reshape data into a grid:
M = F_sorted.reshape(N, N, N)
This method is faster than the original (approximatly 20x speed up):
step = 0.1
mins = np.min(data, axis=0)
maxs = np.max(data, axis=0)
ranges = np.floor((maxs - mins) / step + 1).astype(int)
indx = np.zeros(data.shape,dtype=int)
for i in range(3):
x = np.linspace(mins[i], maxs[i], ranges[i])
indx[:,i] = np.argmax(data[:,i,np.newaxis] <= (x[np.newaxis,:]), axis=1) -1
M = np.zeros(ranges)
M[indx[:,0],indx[:,1],indx[:,2]] = F
The first part sets up the required grid variables. The argmax function provides a simple (and fast) way to find the first true value of the broadcasted array. This produces a set of indices for x, y and z directions for each of the function values.
The resulting array M is not the same as that produced by the original code as the original code loses data. The logic of y[j] < data[l,1] < y[j+1] where y is a vector produced using linspace means the minimum and maximum values for each direction will be missed (data[l,1] might be equal to either y[j] or y[j+1]!). Run it with a dataset of two values each with their own coordinates and the M array will be all zeros.

Correlate a 1d array with every 4d array in python in short time

I have a a 4d array (called a) with shape(35, 2000, 60, 180) that I need to correlate with a 1d array (called b) that has length of 2000, while detrending and smoothing both arrays.
I managed to use a nested for-loop to correlate the 1d array with the 3d array (called c) shape(x, y, z) by iterating through each point y, z, detrending c[x, y, :] and storing the correlation coefficient between b and at that point.
However, using a 3x-nested for-loop to calculate the correlation with a 4d array takes too much time to compute. Is there a more efficient way to produce an array that contains the correlation coefficient between each timeseries in a 4d array and a 1d array?
Here is my code for calculating the correlation with only 3 dimensions involved. It takes around a minute to execute on an array with shape(2000, 60, 180).
Also, the larger array has nan's, in which case, i set the correlation for the entire x,y point t be nan.
def correlation_detrended(cs, ts, smooth=360):
cs_det = cs
ts_det = ts
signal.detrend(ts_det[~np.isnan(ts_det)], overwrite_data=True)
ts_det = pd.DataFrame(ts_det).rolling(smooth, center=True).mean().to_numpy()[:, 0]
for i in range(len(cs_det[0, :, 0])):
for j in range(len(cs_det[0, i, :])):
print(str(i) + ":" + str(j) )
if np.any(np.isnan(cs_det[:, i, j])):
r, p = (np.nan, np.nan)
else:
signal.detrend(cs_det[:, i, j], overwrite_data=True)
cs_det[:, i, j] = pd.DataFrame(cs_det[:, i, j]).rolling(smooth, center=True).mean().to_numpy()[:, 0]
offset = int((smooth/2+120))
r, p = stats.pearsonr(cs_det[offset:-(offset), i, j], ts_det[offset:-(offset)])
correlation[i, j] = r
return correlation```

Efficiently select elements from an (x,y) field with a 2D mask in Python

I have a large field of 2D-position data, given as two arrays x and y, where len(x) == len(y). I would like to return the array of indices idx_masked at which (x[idx_masked], y[idx_masked]) is masked by an N x N int array called mask. That is, mask[x[idx_masked], y[idx_masked]] == 1. The mask array consists of 0s and 1s only.
I have come up with the following solution, but it (specifically, the last line below) is very slow, given that I have N x N = 5000 x 5000, repeated 1000s of times:
import numpy as np
import matplotlib.pyplot as plt
# example mask of one corner of a square
N = 100
mask = np.zeros((N, N))
mask[0:10, 0:10] = 1
# example x and y position arrays in arbitrary units
x = np.random.uniform(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)
x_bins = np.linspace(np.min(x), np.max(x), N)
y_bins = np.linspace(np.min(y), np.max(y), N)
x_bin_idx = np.digitize(x, x_bins)
y_bin_idx = np.digitize(y, y_bins)
idx_masked = np.ravel(np.where(mask[y_bin_idx - 1, x_bin_idx - 1] == 1))
plt.imshow(mask[::-1, :])
plt.scatter(x, y, color='red')
plt.scatter(x[idx_masked], y[idx_masked], color='blue')
Is there a more efficient way of doing this?
Given that mask overlays your field with identically-sized bins, you do not need to define the bins explicitly. *_bin_idx can be determined at each location from a simple floor division, since you know that each bin is 1 / N in size. I would recommend using 1 - 0 for the total width (what you passed into np.random.uniform) instead of x.max() - x.min(), if of course you know the expected size of the range.
x0 = 0 # or x.min()
x1 = 1 # or x.max()
x_bin = (x1 - x0) / N
x_bin_idx = ((x - x0) // x_bin).astype(int)
# ditto for y
This will be faster and simpler than digitizing, and avoids the extra bin at the beginning.
For most purposes, you do not need np.where. 90% of the questions asking about it (including this one) should not be using where. If you want a fast way to access the necessary elements of x and y, just use a boolean mask. The mask is simply
selction = mask[x_bin_idx, y_bin_idx].astype(bool)
If mask is already a boolean (which it should be anyway), the expression mask[x_bin_idx, y_bin_idx] is sufficient. It results in an array of the same size as x_bin_idx and y_bin_idx (which are the same size as x and y) containing the mask value for each of your points. You can use the mask as
x[selection] # Elements of x in mask
y[selection] # Elements of y in mask
If you absolutely need the integer indices, where is sill not your best option.
indices = np.flatnonzero(selection)
OR
indices = selection.nonzero()[0]
If your goal is simply to extract values from x and y, I would recommend stacking them together into a single array:
coords = np.stack((x, y), axis=1)
This way, instead of having to apply indices twice, you can extract the values with just
coords[selection, :]
OR
coords[indices, :]
Depending on the relative densities of mask and x and y, either the boolean masking or linear indexing may be faster. You will have to time some relevant cases to get a better intuition.

Manipulating 2D matrix using numpy boolean indexing takes a long time

I've generated a huge amount of random data like so:
ndata = np.random.binomial(1, 0.25, (100000, 1000))
which is a 100,000 by 1000 matrix(!)
I'm generating new matrix where for each row, each column is true if the mean of all the columns beforehand (minus the expectancy of bernoulli RV with p=0.25) is greater than equal some epsilon.
like so:
def true_false_inequality(data, eps, data_len):
return [abs(np.mean(data[:index + 1]) - 0.25) >= eps for index in range(data_len)]
After doing so I'm generating a 1-d array (finally!) where each column represents how many true values I had in the same column in the matrix, and then I'm dividing every column by some number (exp_numer = 100,000)
def final_res(data, eps):
tf = np.array([true_false_inequality(data[seq], eps, data_len) for seq in range(exp_number)])
percentage = tf.sum(axis=0)/exp_number
return percentage
Also I have 5 different epsilons which I iterate from to get my final result 5 times.
(epsilons = [0.001, 0.1, 0.5, 0.25, 0.025])
My code does work, but it takes a long while for 100,000 rows by 1000 columns, I know I can make it faster by exploring the numpy functionality a little bit more but I just don't know how.
You can perform the whole calculation with vectorized operations on the full data array:
mean = np.cumsum(data, axis=1) / np.arange(1, data.shape[1]+1)
condition = np.abs(mean - 0.25) >= eps
percentage = condition.sum(axis=0) / len(data)
You can calculate the cummulative mean with:
np.cumsum(ndata, axis=0).sum(axis=1) / np.arange(1, 100001)
so we can optimize the true_false_inequality to:
def true_false_inequality(data, eps, data_len):
cummean = np.cumsum(ndata, axis=0).sum(axis=1) / np.arange(1, data_len)
return abs(cummean - 0.25) >= eps
Or like #a_guest suggests, we can first sum up the elements, and then calculate the cumulative sum:
def true_false_inequality(data, eps, data_len):
cummean = ndata.sum(axis=1).cumsum(axis=0) / np.arange(1, 100001)
return abs(cummean - 0.25) >= eps

Minimum Euclidean distance between points in two different Numpy arrays, not within

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:
xy1=numpy.array(
[[ 243, 3173],
[ 525, 2997]])
xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])
My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.
mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))
for i,xy in enumerate(xy1):
dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
mindist[i],minid[i]=dists.min(),dists.argmin()
Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.
Another way to look at the problem. Say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate a n x n distance matrix from those coordinates from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?
(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed
here .
# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist
#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1
# change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )
title = "%s dim %d nx %d ny %d metric %s" % (
__file__, dim, nx, ny, metric )
print "\n", title
#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric ) # -> (nx, ny) distances
#...............................................................................
print "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
X.shape, Y.shape, dist.shape )
print "dist average %.3g +- %.2g" % (dist.mean(), dist.std())
print "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
dist[0,3], cdist( [X[0]], [Y[3]] ))
# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2 ~ .4 +- .2/sqrt dim
To compute the m by p matrix of distances, this should work:
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
the .outer calls make two such matrices (of scalar differences along the two axes), the .hypot calls turns those into a same-shape matrix (of scalar euclidean distances).
The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.
Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.
All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.
This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows
import numpy as np
from scipy import spatial
xy1 = np.array(
[[243, 3173],
[525, 2997]])
xy2 = np.array(
[[682, 2644],
[277, 2651],
[396, 2640]])
# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)
# This solution by #denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)
where mindist is the minimum distance between each point in xy1 and the set of points in xy2
For what you're trying to do:
dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)
Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:
dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])
import numpy as np
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)
I think the following function also works.
import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
Y = X if Y is None else Y
xx = (X ** 2).sum(axis = 1)[:, None]
yy = (Y ** 2).sum(axis = 1)[:, None]
return xx + yy.T - 2 * (X # Y.T)
Explanation
Suppose each row of X and Y are coordinates of the two sets of points.
Let their sizes be m X p and p X n respectively.
The result will produce a numpy array of size m X n with the (i, j)-th entry being the distance between the i-th row and the j-th row of X and Y respectively.
I highly recommend using some inbuilt method for calculating squares, and roots for they are customized for optimized way to calculate and very safe against overflows.
#alex answer below is the most safest in terms of overflow and should also be very fast. Also for single points you can use math.hypot which now supports more than 2 dimensions.
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
Safety concerns
i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# np.hypot for 2d points
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
overflow/underflow/speeds
I think that the most straightforward and efficient solution is to do it like this:
distances = np.linalg.norm(xy1, xy2) # calculate the euclidean distances between the test point and the training features.
min_dist = numpy.min(dists, axis=1) # get the minimum distance
min_id = np.argmi(distances) # get the index of the class with the minimum distance, i.e., the minimum difference.
Although many answers here are great, there is another way which has not been mentioned here, using numpy's vectorization / broadcasting properties to compute the distance between each points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elengantly while remaining very efficient.
Assuming you have two arrays like so:
# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)
You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.
deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the distance between each points
distance_matrix = (deltas ** 2).sum(axis=2)
The distance_matrix has a shape (6,4): for each point in a, the distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do :
distance_matrix.argmin(axis=1)
This returns the index of the point in b that is closest to each point of a.

Categories