I have 10,000 matrices with shape (32, 32, 3). I want to create a Euclidean distance matrix between all of them. At the end, it is going to look like:
[0, d2, d3, d4, ...]
[d1, 0, d3, d4, ...]
[d1, d2, 0, d4, ...]
[d1, d2, d3, 0, ...]
How can I do this in the fastest way? I have tried the following, but it takes ages to finish.
import numpy as np
dists = []
for a in range(len(X_test)):
    dists.append([])
    for b in range(len(X_test)):
        dists[a].append(np.linalg.norm(X_test[a] - X_test[b]))
print(dists)
You can cut the time in half by exploiting the fact that the distance matrix is symmetric and computing only the upper triangular portion, using
for b in range(a+1, len(X_test)):
as the inner loop.
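A minimal sketch of that idea (assuming X_test is the (10000, 32, 32, 3) array from the question), preallocating the matrix and mirroring each distance into the lower triangle:

import numpy as np

n = len(X_test)
dists = np.zeros((n, n))
for a in range(n):
    for b in range(a + 1, n):   # upper triangle only
        d = np.linalg.norm(X_test[a] - X_test[b])
        dists[a, b] = d
        dists[b, a] = d         # mirror, since the matrix is symmetric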
I don't see any other obvious optimizations while keeping the problem exactly the same, but it also seems that you're working with 32x32 images in a three-channel format. That's 3072 dimensions! Why not first down-sample to 4x4, convert to HSL color space, and keep only Hue and Lightness to get a (4, 4, 2) "signature" for each image? If your problem is mostly about shape, you can throw away Hue too and basically work with black-and-white images.
(4, 4, 2) has only 32 dimensions, a saving of nearly a factor of 100 compared to the 3072 of (32, 32, 3). And if you did want to do the full comparison in the (32, 32, 3) space, you could do that only on images that are already very similar in the (4, 4, 2) space.
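A rough sketch of building such a signature (this is only one possible implementation, using block averaging for the down-sampling and the standard-library colorsys module for the HLS conversion; make_signature is a made-up name):

import colorsys
import numpy as np

def make_signature(img):
    # img: one (32, 32, 3) RGB image with values in [0, 1]
    small = img.reshape(4, 8, 4, 8, 3).mean(axis=(1, 3))   # block-average down to (4, 4, 3)
    sig = np.empty((4, 4, 2))
    for i in range(4):
        for j in range(4):
            h, l, s = colorsys.rgb_to_hls(*small[i, j])     # convert the pixel to HLS
            sig[i, j] = (h, l)                              # keep only Hue and Lightness
    return sig                                              # (4, 4, 2), i.e. 32 numbers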
I have read Divakar's comment.
Rather than asking Divakar to show me, I asked myself "What is this pdist/cdist stuff?" — I read about pdist and norm and came up with the following code.
Import stuff:
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import pdist
Generate a random sample, not necessarily as large as the OP's one, and reshape it as suggested by Divakar
In [3]: a = np.random.random((100,32,32,3))
In [4]: b = a.reshape((100,32*32*3))
Using the magic of IPython, let's benchmark the two approaches
In [5]: %%timeit
   ...: dists = []
   ...: for i in range(len(a)):
   ...:     dists.append([])
   ...:     for j in range(len(a)):
   ...:         dists[i].append(np.linalg.norm(a[i] - a[j]))
128 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit pdist(b)
12.3 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Divakar's was 1 order of magnitude faster — but what about the accuracy?
Let's repeat the computations...
In [7]: dists1 = []
   ...: for i in range(len(a)):
   ...:     dists1.append([])
   ...:     for j in range(len(a)):
   ...:         dists1[i].append(np.linalg.norm(a[i] - a[j]))
In [8]: dists2 = pdist(b)
To compare the results, we must be aware that pdist computes only the upper triangle of the square matrix of distances (because the matrix is symmetric and the principal diagonal is identically zero), so we must be careful when checking our results: hence I check the off-diagonal part of the first row of dists1 against the first 99 elements of dists2 using allclose.
In [9]: np.allclose(dists1[0][1:], dists2[:99])
Out[9]: True
The result is the same, nice.
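If you need the full square matrix rather than the condensed upper-triangle form, scipy.spatial.distance.squareform expands it; a quick sketch of that check:

from scipy.spatial.distance import squareform

dists2_square = squareform(dists2)         # full symmetric (100, 100) matrix, zeros on the diagonal
same = np.allclose(dists1, dists2_square)  # expected to be True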
What about an estimate of the time required for 10,000 elements? The feeling is that it's quadratic, but let's experiment by doubling the number of elements.
In [10]: b = np.random.random((200,32*32*3))
In [11]: %timeit pdist(b)
48 ms ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The new timing is 4 times the initial one, as expected from quadratic scaling, so my estimate for your computation, on my feeble PC and using Divakar's proposal, is 12 ms x 100 x 100 = 120,000 ms = 120 s (going from 100 to 10,000 elements multiplies the work by 100 x 100). You should read carefully the excellent answer by olooney and decide what you really want to do.
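Putting it together for your data, the whole computation would look something like this (a sketch, assuming X_test has shape (10000, 32, 32, 3)):

from scipy.spatial.distance import pdist, squareform

flat = X_test.reshape(len(X_test), -1)   # (10000, 3072)
dists = squareform(pdist(flat))          # full (10000, 10000) symmetric distance matrix

Keep in mind that the full square matrix holds 10,000 x 10,000 float64 values, roughly 800 MB, so you may prefer to work with the condensed form returned by pdist.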
Related
Currently, I am working with video data, so I am performing statistical operations on multiple frames at once. During a debugging session I observed that computing numpy statistics (mean computation, in this case) over multiple axes takes longer when computed directly over the desired axes than when computed over each axis separately, one after the other. I created a simple example to explain my observations.
from timeit import default_timer as timer
import numpy as np
rnd_frames = np.random.randn(100, 128, 128, 3)
n_reps = 1000
# -----------------------------------
# mean computation over multiple axes
# -----------------------------------
# all axes at once
ts = timer()
for i in range(n_reps):
    mean_1 = np.mean(rnd_frames, axis=(1, 2))
print('Mean all at once: ', (timer()-ts)/n_reps)
# one after the other
ts = timer()
for i in range(n_reps):
    mean_2 = np.mean(rnd_frames, axis=1)
    mean_2 = np.mean(mean_2, axis=1)
print('Mean one after the other: ', (timer()-ts)/n_reps)
print('Difference in means: ', np.sum(np.abs(mean_1-mean_2)))
The difference is very small and results from float64 precision.
Does someone have an explanation for this? The time difference is quite significant: computing one axis after the other is 10x faster.
Is this some kind of bug? Can anyone explain it?
The time for 2 axes is the same as for one axis on the equivalent reshape:
In [7]: timeit mean_1 = np.mean(rnd_frames, axis=(1, 2))
54.2 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: timeit mean_3 = np.mean(rnd_frames.reshape(100,-1,3), axis=1)
54.5 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: rnd_frames.reshape(100,-1,3).shape
Out[12]: (100, 16384, 3)
As you note, this is quite a bit slower than the sequential calculation:
In [13]: %%timeit
...: mean_2 = np.mean(rnd_frames, axis=1)
...: mean_2 = np.mean(mean_2, axis=1)
7.63 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Without getting deep into the woods of the compiled code, it's hard to say why there is this difference. While "avoiding loops" is a common performance strategy in numpy, that applies mostly to "many loops over a simple task". A few loops over a complex task can be faster. I'm not sure that applies here, but I'm not surprised that there are differences like this.
We could also explore whether putting those 2 axes at the end (innermost) or at the beginning of the dimensions shows this difference or not.
edit
If I move the 2 axes to either the beginning or the end, the time difference is much smaller. There's something about having that small size-3 dimension at the end (innermost) that makes your example unusually slow.
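One way to experiment with that, for example (a sketch; run each mean call under %timeit to compare the layouts):

import numpy as np

rnd_frames = np.random.randn(100, 128, 128, 3)

# original layout: the small size-3 axis is innermost, reduce over axes (1, 2)
m_orig = rnd_frames.mean(axis=(1, 2))

# same data with the size-3 axis moved to the front and made contiguous, reduce over axes (2, 3)
moved = np.moveaxis(rnd_frames, -1, 0).copy()   # shape (3, 100, 128, 128)
m_moved = moved.mean(axis=(2, 3))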
I have a 2D uint16 numpy array, and I want to calculate the histogram of this array.
The function I use is:
import numpy as np

def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
        for j in range(source.shape[1]):
            hist[source[i, j]] = hist[source[i, j]] + 1
    return hist
This function takes too much time to execute.
As I understand it, there's a histogram function in the numpy module, but I can't figure out how to use it.
I've tried:
hist,_ = np.histogram(source.flatten(), bins=range(2**16))
But I get different results than with my own function.
How can I call numpy.histogram to achieve the same result? Or is there any other option?
For an input with data type uint16, numpy.bincount should work well:
hist = np.bincount(source.ravel(), minlength=2**16)
Your function is doing almost exactly what bincount does, but bincount is implemented in C.
For example, the following checks that this use of bincount gives the same result as your calc_hist function:
In [159]: rng = np.random.default_rng()
In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))
In [161]: h1 = calc_hist(x)
In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)
In [163]: (h1 == h2).all() # Verify that they are the same.
Out[163]: True
Check the performance with the ipython %timeit command. You can see that using bincount is much faster.
In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As Corley Brigman pointed out, passing bins=range(x) determines the bin edges [1]. So you will end up with x-1 bins, with corresponding edges [0, 1), [1, 2), ..., [x-2, x-1].
In your case, you will have 2^16-1 bins. To fix it, simply use range(2**16+1).
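In other words, something like this should line up with bincount (a sketch, using the source array from the question):

import numpy as np

hist, edges = np.histogram(source, bins=range(2**16 + 1))
# 2**16 bins with edges [0, 1), [1, 2), ..., [65534, 65535), [65535, 65536];
# hist should now match np.bincount(source.ravel(), minlength=2**16)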
[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram
I wanted to compute the fourier transform of several images.
I was therefore benchmarking numpy's fft.fftn against a brute force for loop.
This is the code I used to benchmark the 2 approaches (in a jupyter notebook):
import numpy as np
x = np.random.rand(32, 256, 256)
def iterate_fft(arr):
    k = np.empty_like(arr, dtype=np.complex64)
    for i, a in enumerate(arr):
        k[i] = np.fft.fft2(a)
    return k
k_it = iterate_fft(x)
k_np = np.fft.fftn(x, axes=(1, 2))
np.testing.assert_allclose(k_it.real, k_np.real)
np.testing.assert_allclose(k_it.imag, k_np.imag)
%%timeit
k_it = iterate_fft(x)
Output: 63.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
k_np = np.fft.fftn(x, axes=(1, 2))
Output: 122 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is there such a huge difference?
So a person involved in numpy's FFT development has answered the underlying question on GitHub, and it turns out that the slowdown is most likely coming from some multidimensional array rearrangement used by pocketfft.
It will all be a memory when numpy switches to the scipy 1.4 implementation, which my benchmark shows does not have these drawbacks.
These routines in numpy currently seem to assume that the last dimension will always be the smallest. When this is actually true, fftn is faster, sometimes by a lot.
That said, I get a much smaller difference in performance between these two methods than you do (with Python 3.7.4, numpy 1.17.2). For your example, iterate_fft takes 46 ms while fftn takes 50 ms. But if I flip the axes around, to (256, 256, 32), I get 55 ms and 40 ms respectively. Pushing even further with a shape of (256, 256, 2), I get 21 ms and 4 ms respectively.
Note that if performance is really an issue, there are other FFT libraries available that perform better in some situations. Also the full fftpack in scipy can have very different performance than the more limited code in numpy.
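For instance, the newer scipy.fft interface (the pocketfft-based implementation that landed in SciPy 1.4) can be tried as a drop-in (a sketch; whether it is actually faster will depend on your versions and hardware):

import numpy as np
import scipy.fft

x = np.random.rand(32, 256, 256)
k_sp = scipy.fft.fftn(x, axes=(1, 2), workers=-1)   # workers=-1 lets scipy use multiple cores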
Note that your usage of fftn basically does:
x = np.random.rand(32, 256, 256)
a = np.fft.fft(x, n=256, axis=2)
a = np.fft.fft(a, n=256, axis=1)
np.testing.assert_allclose(np.fft.fftn(x, axes=(1, 2)), a)
I am trying to understand the necessity of LU decomposition using the numpy and scipy libraries. From what I understand, if we want to solve Ax = b, we first factorize A into two triangular matrices L and U, then solve LUx = b by solving Ly = b followed by Ux = y. By solving triangular systems, we can reduce the time compared to Gaussian elimination.
So, I tried this idea in Python using numpy and scipy.
I first construct A and b using toy examples:
A = np.array([[2, 1, 0, 5], [1, 2, 1, 2], [0, 1, 2, 4], [1, 3, 6, 4.5]])
b = np.array([9, 10, -2, 3])
Then I first solve this toy example with np.linalg.solve:
%timeit np.linalg.solve(A, b)
The time is
9.76 µs ± 782 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Then I use factorization to solve this system:
from scipy import linalg
lu, piv = linalg.lu_factor(A)
%timeit linalg.lu_solve((lu, piv), b)
The output is
18.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
, which is quite slow compared to np.linalg.solve.
So, my question is: why is np.linalg.solve faster than lu_factor/lu_solve? My guess is that numpy.linalg.solve does not use Gaussian elimination to solve the equations? I am a little confused by the result here.
Edit
Now, I use a much larger matrix to do the experiment (10000 x 10000).
Here is the result:
for np.linalg.solve
8.64 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each);
for scipy.linalg.lu_solve
121 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
For lu_solve, I only count the time spent solving; the decomposition part is not counted. It is now much faster!
Here is a partial answer, since I dispute one of your premises.
You write that "LU solve should be faster than Gaussian Elimination." You seem to misunderstand the purpose of LU decomposition. If you are solving just one such problem (Ax=b where matrix A and vector b are given), LU decomp is no faster than Gaussian elimination. Indeed, the decomposition's algorithm is very similar to the elimination and is no faster.
The advantage of LU decomposition comes when you are given matrix A and you want to solve the equation Ax=b for multiple different given vectors b. Gaussian elimination needs to start over from scratch, and each solution will take the same amount of time. In LU decomposition you can store the resulting matrices L and U from the first calculation, and that greatly speeds up the solutions to the succeeding equations that use different vectors b.
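A small sketch of that reuse pattern with scipy's lu_factor/lu_solve pair (b2 here is just a second example right-hand side):

import numpy as np
from scipy import linalg

A = np.array([[2, 1, 0, 5], [1, 2, 1, 2], [0, 1, 2, 4], [1, 3, 6, 4.5]])

# factor once (the expensive, roughly O(n^3) part) ...
lu, piv = linalg.lu_factor(A)

# ... then solve cheaply (roughly O(n^2)) for as many right-hand sides as you like
b1 = np.array([9, 10, -2, 3])
b2 = np.array([1, 0, 0, 1])
x1 = linalg.lu_solve((lu, piv), b1)
x2 = linalg.lu_solve((lu, piv), b2)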
You can read more about this at the section in Numerical Recipes in C about LU Decomposition and Its Applications.
Look at the docstring for numpy.linalg.solve. It says in the "Notes" section "The solutions are computed using LAPACK routine _gesv". (The underscore is a place-holder for a character that corresponds to a data type. For example, dgesv uses double precision.)
The documentation for dgesv explains that it uses the LU decomposition. So you are more-or-less replicating the calculation, but you are doing more steps in Python, so your code is slower.
I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays, all with shape (7000, 7000), so I wonder if there are better ways to do it. Thanks!
import numpy as np

a = np.array([[1,2,3,4,5,6,7,8,9], [4,5,6,7,8,9,10,11,12], [3,4,5,6,7,8,9,10,11]])
row, col = a.shape
new_arr = np.ndarray(a.shape)
for x in range(row):
    for y in range(col):
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2), min_y:(y+2)].mean()
print(new_arr)
Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. You are working a bit differently on the near-boundary elements. So, if exactness at the boundary elements can be let go, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a, np.ones((3,3)), 'same') / 9.0
This specific operation is built into the OpenCV module as cv2.blur, which is very efficient at it. The name basically describes its operation of blurring the input arrays representing images. I believe the efficiency comes from the fact that internally it is implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick showdown of timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
Following the discussion with @Divakar, find below a comparison of the different convolution methods present in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)

def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)

def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return exactly the same value. However, note that uniform_filter takes a parameter mode='constant', which indicates the boundary conditions of the filter; constant == 0 is the same zero-padding boundary condition that the other two methods enforce. For different use cases you can change the boundary conditions.
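A quick way to convince yourself that the three agree (a sketch, reusing the three functions defined above):

import numpy as np

A_check = np.random.randn(100, 100)
assert np.allclose(conv2(A_check, 3), fftconv(A_check, 3))   # direct vs FFT convolution
assert np.allclose(conv2(A_check, 3), uniform(A_check, 3))   # convolution vs uniform filter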
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So in short, uniform_filter is the fastest, and that's because the convolution is separable into two 1D convolutions (similar to gaussian_filter, which is also separable); see the sketch below.
Other, non-separable filters with different kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter remains roughly constant for different kernel sizes, while convolve2d gets considerably slower as the kernel size grows.
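The separability that uniform_filter exploits can be written out by hand with two 1D passes (a sketch; uniform_filter1d is its 1D counterpart in scipy.ndimage):

import numpy as np
from scipy import ndimage

A = np.random.randn(1000, 1000)

sep = ndimage.uniform_filter1d(A, 3, axis=0, mode='constant')    # 1D uniform filter along axis 0 ...
sep = ndimage.uniform_filter1d(sep, 3, axis=1, mode='constant')  # ... then along axis 1
full = ndimage.uniform_filter(A, 3, mode='constant')             # the 2D filter in a single call
same = np.allclose(sep, full)                                    # expected to be True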
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000, 7000))   # array of 7000 x 7000
row, col = a.shape
column_totals = a.sum(axis=0)                   # per-column totals (sum over all rows)
new_array = np.zeros([row, col])                # receiving array
for i in range(row):
    # Each resulting row = (total of all rows minus the original row) / (number of rows minus one)
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)