I have a vector, a, which I wish to cross with every point in a defined 3D space.
import numpy as np
# Grid
x = np.arange(-4,4,0.1)
y = np.arange(-4,4,0.1)
z = np.arange(-4,4,0.1)
a = [1,0,0]
result = [[] for i in range(3)]
for j in range(len(x)): # loop on x coords
    for k in range(len(y)): # loop on y coords
        for l in range(len(z)): # loop on z coords
            r = [x[j], y[k], z[l]]
            result[0].append(np.cross(a, r)[0])
            result[1].append(np.cross(a, r)[1])
            result[2].append(np.cross(a, r)[2])
This produces an array which has taken the cross product of a with every point in space. However, the process takes far too long due to the nested loops. Is there any way to exploit vectorization (meshgrid perhaps?) to make this process faster?
Here's one vectorized approach -
np.cross(a, np.array(np.meshgrid(x,y,z)).transpose(2,1,3,0)).reshape(-1,3).T
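If the transpose(2,1,3,0) step looks opaque, an equivalent formulation (a sketch, assuming the same x-then-y-then-z flattening order as the original loops) builds the grid with indexing='ij' and stacks along the last axis -
# Equivalent form: an (nx, ny, nz, 3) array of points whose C-order
# flattening matches the original nested x, y, z loops.
pts = np.stack(np.meshgrid(x, y, z, indexing='ij'), axis=-1)
out = np.cross(a, pts).reshape(-1, 3).T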
Sample run -
In [403]: x = np.random.rand(4)
...: y = np.random.rand(5)
...: z = np.random.rand(6)
...:
In [404]: result = original_app(x,y,z,a)
In [405]: out = np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
In [406]: np.allclose(result[0], out[0])
Out[406]: True
In [407]: np.allclose(result[1], out[1])
Out[407]: True
In [408]: np.allclose(result[2], out[2])
Out[408]: True
Runtime test -
# Original setup used in the question
In [393]: # Grid
...: x = np.arange(-4,4,0.1)
...: y = np.arange(-4,4,0.1)
...: z = np.arange(-4,4,0.1)
...:
# Original approach
In [397]: %timeit original_app(x,y,z,a)
1 loops, best of 3: 21.5 s per loop
# @Denziloe's solution
In [395]: %timeit [np.cross(a, r) for r in product(x, y, z)]
1 loops, best of 3: 7.34 s per loop
# Proposed in this post
In [396]: %timeit np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
100 loops, best of 3: 16 ms per loop
More than 1000x speedup over the original one and more than 450x over the loopy approach from the other answer.
This takes a couple of seconds to run on my machine:
from itertools import product
result = [np.cross(a, r) for r in product(x, y, z)]
I don't know if that's fast enough for you, but there are a lot of calculations involved. It's certainly cleaner, and it removes at least some redundancy (the original computes np.cross(a, r) three times per point). It also gives the result in a slightly different format, but this is the natural way to store the result and is hopefully fine for your purposes.
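If you do need the original three-list layout afterwards, converting back is a single transpose (a sketch, assuming result from the list comprehension above):
result_arr = np.array(result).T  # shape (3, len(x)*len(y)*len(z)); row 0 holds the x components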
Related
Suppose I have a very computationally expensive function f(x). I want to compute some values of it, and then just access them instead of evaluating the function every time with new x values.
See the following simple example to illustrate what I mean:
import numpy as np
x = np.linspace(-3, 3, 6001)
fx = x**2
x = np.round(x, 3)
#I want to evaluate the function for the following w:
w = np.random.rand(10000)
#Rounding is necessary, so that the w match the x.
w = np.round(w, 3)
fx_w = []
for i in range(w.size):
    fx_w.append(fx[x==w[i]])
fx_w = np.asarray(fx_w)
So, I'd like to have f(w) computed from the values already generated for x. Of course, a for loop is out of the question, so my question is: how can I implement this somewhat efficiently?
You can use searchsorted to find the corresponding indices of your prepared function array. This will be an approximation. Rounding is not necessary.
import numpy as np
np.random.seed(42)
x = np.linspace(-3, 3, 6001)
fx = x ** 2
w = np.random.rand(10000)
result = fx[np.searchsorted(x, w)]
print('approx. F(x):', result)
print('real F(x):', w ** 2)
Output
approx. F(x): [0.140625 0.904401 0.535824 ... 0.896809 0.158404 0.047524 ]
real F(x): [0.1402803 0.90385769 0.53581513 ... 0.89625588 0.1579967 0.04714996]
Your function has to be much more computationally intensive to justify this approach
%timeit fx[np.searchsorted(x, w)] #1000 loops, best of 5: 992 µs per loop
%timeit w ** 2 #100000 loops, best of 5: 3.81 µs per loop
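For illustration only, here is a hedged sketch with a stand-in for a genuinely expensive f (a numerical integral per point via scipy.integrate.quad is an assumption made for demonstration, not part of the original question); np.clip guards against query values falling outside the precomputed range:
import numpy as np
from scipy.integrate import quad  # only used to make f artificially expensive

def f(t):
    # Hypothetical costly scalar function: one numerical integration per call
    return quad(lambda s: np.exp(-s**2) * np.cos(t * s), 0, 10)[0]

x = np.linspace(-3, 3, 6001)
fx = np.array([f(t) for t in x])              # precompute once (the slow part)

w = np.random.rand(10000)
idx = np.clip(np.searchsorted(x, w), 0, x.size - 1)
fx_w = fx[idx]                                # fast approximate lookup for all w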
My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think the problem is that .dot does not know that the mask only applies along p, but I don't know if it's possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot!
We can use matrix-multiplication with and without the masked columns; subtracting the masked product from the full one yields the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the inverted mask; this would be slower, though, for a sparse mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
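A quick sanity check of the subtraction identity on small arrays (a sketch: the columns split into masked and unmasked parts must add back up to the full product):
n, p = 5, 20
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
mask = np.random.rand(p) > 0.8
# Full product = masked-columns part + unmasked-columns part
assert np.allclose(X.dot(Y.T) - X[:, mask].dot(Y[:, mask].T),
                   X[:, ~mask].dot(Y[:, ~mask].T))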
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...:     for j in range(X.shape[0]):
...:         inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
I am struggling with a slow numpy operation, using python 3.
I have the following operation:
np.sum(np.log(X.T * b + a).T, 1)
where
(30000,1000) = X.shape
(1000,1) = b.shape
(1000,1) = a.shape
My problem is that this operation is pretty slow (around 1.5 seconds), and it is inside a loop that repeats around 100 times, which makes the running time of my code very long.
I am wondering if there is a faster implementation of this function.
Maybe useful fact: X is extremely sparse (only 0.08% of the entries are nonzero), but it is stored as a dense NumPy array.
We can optimize the logarithm operation, which seems to be the bottleneck. Being one of the transcendental functions, it can be sped up with the numexpr module, while the sum-reduction is left to NumPy, which does that part much better. This gives us a hybrid approach, like so -
import numexpr as ne
def numexpr_app(X, a, b):
    XT = X.T
    return ne.evaluate('log(XT * b + a)').sum(0)
Looking closely at the broadcasting operations in XT * b + a, we see that there are two stages of broadcasting, which we can optimize further. The intention is to reduce this to one stage, and that is possible here with a division, since log(X.T*b + a) = log(b*(X.T + a/b)) = log(b) + log(X.T + a/b). This gives us a slightly modified version, shown below -
def numexpr_app2(X, a, b):
    ab = (a/b)
    XT = X.T
    return np.log(b).sum() + ne.evaluate('log(ab + XT)').sum(0)
Runtime test and verification
Original approach -
def numpy_app(X, a, b):
    return np.sum(np.log(X.T * b + a).T, 1)
Timings -
In [111]: # Setup inputs
...: density = 0.08/100 # 0.08 % sparse
...: m,n = 30000, 1000
...: X = scipy.sparse.rand(m,n,density=density,format="csr").toarray()
...: a = np.random.rand(n,1)
...: b = np.random.rand(n,1)
...:
In [112]: out0 = numpy_app(X, a, b)
...: out1 = numexpr_app(X, a, b)
...: out2 = numexpr_app2(X, a, b)
...: print(np.allclose(out0, out1))
...: print(np.allclose(out0, out2))
...:
True
True
In [114]: %timeit numpy_app(X, a, b)
1 loop, best of 3: 691 ms per loop
In [115]: %timeit numexpr_app(X, a, b)
10 loops, best of 3: 153 ms per loop
In [116]: %timeit numexpr_app2(X, a, b)
10 loops, best of 3: 149 ms per loop
Just to prove the observation stated at the start that the log part is the bottleneck with the original NumPy approach, here's the timing on it -
In [44]: %timeit np.log(X.T * b + a)
1 loop, best of 3: 682 ms per loop
The improvement on this part was significant -
In [120]: XT = X.T
In [121]: %timeit ne.evaluate('log(XT * b + a)')
10 loops, best of 3: 142 ms per loop
It's a bit unclear why you would do np.sum(your_array.T, axis=1) instead of np.sum(your_array, axis=0).
You can use a scipy sparse matrix (use compressed sparse column format for X, so that X.T is compressed sparse row; multiplying by b then scales each row of X.T, and b has one entry per row):
X_sparse = scipy.sparse.csc_matrix(X)
and replace X.T * b by:
X_sparse.T.multiply(b)
However, if a is not sparse, it will not help you as much as it could.
These are the speed ups I obtain for this operation:
In [16]: %timeit X_sparse.T.multiply(b)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 374 µs per loop
In [17]: %timeit X.T * b
10 loops, best of 3: 44.5 ms per loop
with:
import numpy as np
from scipy import sparse
X = np.random.randn(30000, 1000)
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
X[X < 3] = 0
print(np.sum(X != 0))
X_sparse = sparse.csc_matrix(X)
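A hedged sketch of plugging the sparse product back into the full expression (the log still operates on a dense array since a is dense, so only the multiplication step benefits; this assumes a and b keep the argument of log positive, as in the original problem):
XTb = X_sparse.T.multiply(b).toarray()   # sparse multiply, then densify; shape (1000, 30000)
out = np.sum(np.log(XTb + a), axis=0)    # same result as np.sum(np.log(X.T * b + a).T, 1)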
I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
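For a point of comparison, a brute-force baseline (a sketch, assuming both arrays share the same number of columns T and that N and M are modest) stacks the rows, takes the full correlation matrix, and slices out the cross block; memory grows as (N + M)^2, so it is only practical for small sizes:
import numpy as np

N, M, T = 10, 20, 100
x = np.random.rand(N, T)
y = np.random.rand(M, T)
# np.corrcoef stacks x and y row-wise; the top-right N x M block holds the
# cross correlations between rows of x and rows of y.
cross = np.corrcoef(x, y)[:N, N:]   # shape (N, M)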
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
With the default "valid" case, the correlation between each pairwise row combination (row1, row2) of the two input arrays corresponds to the multiplication result at each (row1, row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
    # Row-wise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
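A quick sanity check on small inputs (a sketch; each entry of corr2_coeff's output should match np.corrcoef on the corresponding pair of rows):
A = np.random.rand(4, 50)
B = np.random.rand(3, 50)
C = corr2_coeff(A, B)
assert np.isclose(C[1, 2], np.corrcoef(A[1], B[2])[0, 1])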
Benchmarking
This section compares the runtime performance of the proposed approach against generate_correlation_map and the loopy pearsonr-based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr-based approach seemed too slow, but here are the runtimes for one small data size -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
"""Correlate each n with each m.
Parameters
----------
x : np.array
Shape N X T.
y : np.array
Shape M X T.
Returns
-------
np.array
N X M array in which each element is a correlation coefficient.
"""
mu_x = x.mean(1)
mu_y = y.mean(1)
n = x.shape[1]
if n != y.shape[1]:
raise ValueError('x and y must ' +
'have the same number of timepoints.')
s_x = x.std(1, ddof=n - 1)
s_y = y.std(1, ddof=n - 1)
cov = np.dot(x,
y.T) - n * np.dot(mu_x[:, np.newaxis],
mu_y[np.newaxis, :])
return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
"""computes pearson correlation coefficient
where x is a 1D and y a 2D array"""
upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:,None]), axis=1)
lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) * np.sum(np.power(y - np.mean(y, axis=1)[:,None], 2), axis=1))
rho = upper / lower
return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1, 2, 3],
[ 6, 7, 12],
[ 9, 3, 1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])
I'm fitting a line to 3D points with OpenCV fitLine. What's the best way to calculate residuals of the resulting fit? Or, since I want residuals in addition to the fit, is there a better method than fitLine?
The following works, but there must be a better (faster) way.
# fit points
u, v, w, x, y, z = cv2.fitLine(points, cv2.DIST_L2, 0, 1, 0.01)
v = np.array([u[0], v[0], w[0]])
p = np.array([x[0], y[0], z[0]])
# rotate fit to z axis
k = np.cross(v, [0, 0, 1])
mag = np.linalg.norm(k)
R, _ = cv2.Rodrigues(k * np.arcsin(mag) / mag)
# rotate points and calculate distance to z-axis
rot_points = np.dot(R, (points-p).T)
err = rot_points[0]**2 + rot_points[1]**2
I'm assuming that fitLine computes the residuals err while estimating the line, so it is a waste to have to recompute them myself. Basically, knowing that I want the line and the residuals, is there a better alternative than fitLine, which only returns the line?
I am not aware of any direct way to get the sum of residuals from cv2.fitLine itself, so I will focus solely on speeding up the existing code. Benchmarking with a relatively high number of points shows that most of the runtime is spent in the last two lines, where we get rot_points and err. Also, we are not actually using the last row of rot_points when calculating err, so we can hopefully shave off some runtime by slicing into the first two rows only.
Let's look into efficient ways to obtain rot_points and err.
1) rot_points = np.dot(R, (points-p).T)
This step involves broadcasting in points-p, which is then reduced by the matrix-multiplication. Broadcasting involves heavy memory usage, which in this case can be avoided if we split up the matrix-multiplication of R with points and p separately. Also, let's bring in the slicing to the first two rows, as discussed earlier. Thus, we could get the first two rows of rot_points like so -
rot_points_row2 = np.dot(R[:2], (points.T)) - np.dot(R[:2],p[:,None])
2) err = rot_points[0]**2 + rot_points[1]**2
This second step could be sped up with np.einsum for efficient squaring and sum-reduction, like so -
err = np.einsum('ij,ij->j',rot_points_row2,rot_points_row2)
For a relatively small number of points as 2000, the step to calculate mag : mag = np.linalg.norm(k) might become significant too in terms of runtime. So, to speed that up, one could alternatively use np.einsum again, like so -
mag = np.sqrt(np.einsum('i,i->',k,k))
Runtime test
Let's use a random array of 2000 points in 3D space as the input points and look at the associated runtime numbers with the original approach and the proposed ones for the last two lines.
In [44]: # Setup input points
...: N = 2000
...: points = np.random.rand(N,3)
...:
...: u, v, w, x, y, z = cv2.fitLine(points, cv2.DIST_L2, 0, 1, 0.01)
...: v = np.array([u[0], v[0], w[0]])
...: p = np.array([x[0], y[0], z[0]])
...:
...: # rotate fit to z axis
...: k = np.cross(v, [0, 0, 1])
...: mag = np.linalg.norm(k)
...: R, _ = cv2.Rodrigues(k * np.arcsin(mag) / mag)
...:
...: # rotate points and calculate distance to z-axis
...: rot_points = np.dot(R, (points-p).T)
...: err = rot_points[0]**2 + rot_points[1]**2
...:
Let's run our proposed methods and verify their outputs against the original outputs -
In [45]: rot_points_row2 = np.dot(R[:2], (points.T)) - np.dot(R[:2],p[:,None])
...: err2 = np.einsum('ij,ij->j',rot_points_row2,rot_points_row2)
...:
In [46]: np.allclose(rot_points[:2],rot_points_row2)
Out[46]: True
In [47]: np.allclose(err,err2)
Out[47]: True
Finally and most importantly, let's time these sections of code -
In [48]: %timeit np.dot(R, (points-p).T) # Original code
10000 loops, best of 3: 79.5 µs per loop
In [49]: %timeit np.dot(R[:2], (points.T)) - np.dot(R[:2],p[:,None]) # Proposed
10000 loops, best of 3: 49.7 µs per loop
In [50]: %timeit rot_points[0]**2 + rot_points[1]**2 # Original code
100000 loops, best of 3: 12.6 µs per loop
In [51]: %timeit np.einsum('ij,ij->j',rot_points_row2,rot_points_row2) # Proposed
100000 loops, best of 3: 11.7 µs per loop
When the number of points is increased to a huge number, the runtimes look more promising. With N = 5000000 points, we get -
In [59]: %timeit np.dot(R, (points-p).T) # Original code
1 loops, best of 3: 410 ms per loop
In [60]: %timeit np.dot(R[:2], (points.T)) - np.dot(R[:2],p[:,None]) # Proposed
1 loops, best of 3: 254 ms per loop
In [61]: %timeit rot_points[0]**2 + rot_points[1]**2 # Original code
10 loops, best of 3: 144 ms per loop
In [62]: %timeit np.einsum('ij,ij->j',rot_points_row2,rot_points_row2) # Proposed
10 loops, best of 3: 77.5 ms per loop
Just wanted to mention that I compared PCA with fitLine for speed. fitLine was faster than PCA for all the cases where the number of points was greater than ~1000.
PCA directly gives you the eigenvectors, so you don't need the Rodrigues step (though that time should be negligible). So maybe the key to optimization lies in the rest of the code, unless there's a faster way to fit the model other than fitLine and PCA.
I don't remember the PCA related math very well, so I could be wrong in the paragraph that follows.
Eigenvalues give us the variance along each dimension of the new eigenspace. If we consider a simple 2D case, I think you can take the smaller eigenvalue and multiply it by the number of points N (or is it N-1?) in the dataset to obtain the squared-residual sum. Similarly, we can extend it to the 3D case. As PCA gives us the eigenvalues, it takes simple scalar multiplications and additions to get the squared-residual sum.
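Here is a hedged NumPy sketch checking that claim (assuming np.cov's default ddof=1, the factor comes out as N-1): the best-fit line runs through the centroid along the principal eigenvector, and the sum of squared point-to-line distances equals (N-1) times the sum of the remaining eigenvalues.
import numpy as np

points = np.random.rand(2000, 3)
centered = points - points.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))  # eigenvalues in ascending order
direction = eigvecs[:, -1]                             # principal axis

# Direct residuals: squared distance of each point to the best-fit line
proj = centered.dot(direction)
err = (centered**2).sum(axis=1) - proj**2

# Same quantity recovered from the two smaller eigenvalues alone
n = points.shape[0]
assert np.isclose(err.sum(), (n - 1) * eigvals[:2].sum())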
I'm adding the C++ code I used, for your reference.
RNG rng;
float a = -0.1, b = 0.1;
int rows = 3, cols = 1024*2;
Mat data = Mat::zeros(rows, cols, CV_32F);
for (int i = 0; i < cols; i++)
{
Vec3f& v = data.at<Vec3f>(i);
v = Vec3f(i+rng.uniform(a, b), i+rng.uniform(a, b), i+rng.uniform(a, b));
}
Mat datat = data.t();
Vec6f line;
fitLine(datat, line, CV_DIST_L2, 0, 1, 0.01);
PCA pca(datat, Mat(), CV_PCA_DATA_AS_ROW);
cout << "fitLine:\n" << line << endl;
cout << "\nPCA:\n" << pca.eigenvalues << endl;
cout << pca.eigenvectors << endl;