I am trying to understand the necessity of LU decomposition using the numpy and scipy libraries. As I understand it, to solve Ax = b we first factorize A into two triangular matrices L and U, then solve LUx = b by solving Ly = b followed by Ux = y. Because both systems are triangular, this should take less time than Gaussian elimination.
So, I tried this idea in Python using numpy and scipy.
I first construct A and b using toy examples:
A = np.array([[2, 1, 0, 5], [1, 2, 1, 2], [0, 1, 2, 4], [1, 3, 6, 4.5]])
b = np.array([9, 10, -2, 3])
Then I first solve this toy example with np.linalg.solve:
%timeit np.linalg.solve(A, b )
The time is
9.76 µs ± 782 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Then I use factorization to solve this system:
lu, piv = linalg.lu_factor(A)
%timeit linalg.lu_solve((lu, piv), b)
I saw the output is
18.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
which is quite slow compared to np.linalg.solve.
So, my question is: why is np.linalg.solve faster than linalg.lu_solve? My guess is that numpy.linalg.solve does not use Gaussian elimination to solve the equations? I am a little confused by the result here.
Edit
Now, I use a much larger matrix to do the experiment (10000 x 10000).
Here is the result:
for np.linalg.solve
8.64 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each);
for scipy.linalg.lu_solve
121 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each).
For lu_solve, I only count the time on solving, the decomposition part is not counted. It is now much faster!
Here is a partial answer, since I dispute one of your premises.
You write that "LU solve should be faster than Gaussian Elimination." You seem to misunderstand the purpose of LU decomposition. If you are solving just one such problem (Ax=b where matrix A and vector b are given), LU decomp is no faster than Gaussian elimination. Indeed, the decomposition's algorithm is very similar to the elimination and is no faster.
The advantage of LU decomposition comes when you are given matrix A and you want to solve the equation Ax=b for multiple different given vectors b. Gaussian elimination needs to start over from scratch, and each solution will take the same amount of time. In LU decomposition you can store the resulting matrices L and U from the first calculation, and that greatly speeds up the solutions to the succeeding equations that use different vectors b.
You can read more about this at the section in Numerical Recipes in C about LU Decomposition and Its Applications.
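To make the point about reusing the factors concrete, here is a minimal sketch. It takes the matrix A from the question; the right-hand sides are random vectors made up for the illustration. It factorizes once with scipy.linalg.lu_factor and then reuses the stored factors for every new b.

```python
import numpy as np
from scipy import linalg

A = np.array([[2, 1, 0, 5], [1, 2, 1, 2], [0, 1, 2, 4], [1, 3, 6, 4.5]])

# Factorize once; this is the expensive O(n^3) step.
lu, piv = linalg.lu_factor(A)

# Hypothetical right-hand sides, made up for this illustration.
rng = np.random.default_rng(0)
rhs_list = [rng.standard_normal(4) for _ in range(5)]

# Each solve reuses the stored factors and costs only O(n^2).
solutions = [linalg.lu_solve((lu, piv), b) for b in rhs_list]

# np.linalg.solve factorizes A again on every call but gives the same answers.
reference = [np.linalg.solve(A, b) for b in rhs_list]
all_match = all(np.allclose(x, r) for x, r in zip(solutions, reference))
print(all_match)
```

On a 4 x 4 matrix the saving is invisible next to the call overhead; it only starts to matter when A is large and there are many right-hand sides.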
Look at the docstring for numpy.linalg.solve. It says in the "Notes" section "The solutions are computed using LAPACK routine _gesv". (The underscore is a place-holder for a character that corresponds to a data type. For example, dgesv uses double precision.)
The documentation for dgesv explains that it uses the LU decomposition. So you are more-or-less replicating the calculation, but you are doing more steps in Python, so your code is slower.
Currently, I am working with video data, so I perform statistical operations on multiple frames at once. During a debugging session I observed that computing numpy statistics (the mean, in this case) over multiple axes takes longer when computed directly over the desired axes than when computed over each axis separately, one after the other. I created a simple example to illustrate my observations.
from timeit import default_timer as timer
import numpy as np
rnd_frames = np.random.randn(100, 128, 128, 3)
n_reps = 1000
# -----------------------------------
# mean computation over multiple axes
# -----------------------------------
# all axes at once
ts = timer()
for i in range(n_reps):
    mean_1 = np.mean(rnd_frames, axis=(1, 2))
print('Mean all at once: ', (timer()-ts)/n_reps)
# one after the other
ts = timer()
for i in range(n_reps):
    mean_2 = np.mean(rnd_frames, axis=1)
    mean_2 = np.mean(mean_2, axis=1)
print('Mean one after the other: ', (timer()-ts)/n_reps)
print('Difference in means: ', np.sum(np.abs(mean_1-mean_2)))
The difference is very small and results from float64 precision.
Does someone have an explanation for this? The time differences are quite significant: computing one axis after the other is about 10x faster.
I wonder if this is some kind of bug. Can anyone explain this?
The time for 2 axes is the same as for a 1 axis on the equivalent reshape:
In [7]: timeit mean_1 = np.mean(rnd_frames, axis=(1, 2))
54.2 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: timeit mean_3 = np.mean(rnd_frames.reshape(100,-1,3), axis=1)
54.5 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [12]: rnd_frames.reshape(100,-1,3).shape
Out[12]: (100, 16384, 3)
As you note, this is quite a bit slower than the sequential calculation:
In [13]: %%timeit
...: mean_2 = np.mean(rnd_frames, axis=1)
...: mean_2 = np.mean(mean_2, axis=1)
7.63 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Without getting deep into the woods of compiled code it's hard to say why there is this difference. While "avoiding loops" is a common performance strategy in numpy, that applies mostly to "many loops on a simple task". A few loops on a complex task can be faster. I'm not sure that applies here, but I'm not surprised that there are differences like this.
We could also explore whether putting those 2 axes at the end (innermost) or at the beginning of the dimensions shows this difference or not.
edit
If I move the 2 axes to either the beginning, or end, the time difference is much smaller. There's something about having that small size 3 dimension at the end (inner most) that's making your example unusually slow.
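A small sketch of the variants discussed here, on a deliberately smaller array than in the question. It only checks that the direct, sequential, reshaped and channels-first reductions agree numerically; the timings are left to %timeit on your own machine, and the channels-first layout is just one assumed way of moving the size-3 axis out of the innermost position.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 32, 32, 3))  # smaller than the original example

# Three equivalent ways to average over the two spatial axes.
mean_direct = np.mean(frames, axis=(1, 2))
mean_sequential = np.mean(np.mean(frames, axis=1), axis=1)
mean_reshaped = np.mean(frames.reshape(10, -1, 3), axis=1)

# Move the small channel axis away from the innermost position and copy,
# so the reduction runs over a different memory layout.
channels_first = np.ascontiguousarray(np.moveaxis(frames, -1, 1))  # (10, 3, 32, 32)
mean_channels_first = np.mean(channels_first, axis=(2, 3))

print(mean_direct.shape, channels_first.shape)
```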
I have a 2D uint16 numpy array, I want to calculate the histogram for this array.
The function I use is:
def calc_hist(source):
    hist = np.zeros(2**16, dtype='uint16')
    for i in range(source.shape[0]):
        for j in range(source.shape[1]):
            hist[source[i, j]] = hist[source[i, j]] + 1
    return hist
This function takes too much time to execute.
As I understand it, there's a histogram function in the numpy module, but I can't figure out how to use it.
I've tried:
hist,_ = np.histogram(source.flatten(), bins=range(2**16))
But I get different results than from my own function.
How can I call numpy.histogram to achieve the same result? Or are there any other options?
For an input with data type uint16, numpy.bincount should work well:
hist = np.bincount(source.ravel(), minlength=2**16)
Your function is doing almost exactly what bincount does, but bincount is implemented in C.
For example, the following checks that this use of bincount gives the same result as your calc_hist function:
In [159]: rng = np.random.default_rng()
In [160]: x = rng.integers(0, 2**16, size=(1000, 1000))
In [161]: h1 = calc_hist(x)
In [162]: h2 = np.bincount(x.ravel(), minlength=2**16)
In [163]: (h1 == h2).all() # Verify that they are the same.
Out[163]: True
Check the performance with the IPython %timeit command. You can see that using bincount is much faster.
In [164]: %timeit calc_hist(x)
2.66 s ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [165]: %timeit np.bincount(x.ravel(), minlength=2**16)
3.13 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As Corley Brigman pointed out, passing bins=range(x) specifies the bin edges [1]. So, you will end up with x-1 bins with corresponding edges [0, 1), [1, 2), ..., [x-2, x-1], where the last bin is closed on both sides.
In your case, you will have 2^16-1 bins. To fix it, simply use range(2**16+1).
[1] https://numpy.org/doc/stable/reference/generated/numpy.histogram.html?highlight=histogram#numpy.histogram
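As a quick check of the bin-edge argument, the sketch below uses random uint16 data (made up for the illustration). It compares np.histogram with 2**16 + 1 edges against np.bincount, and shows that range(2**16) merges the two largest values into one final bin.

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.integers(0, 2**16, size=(200, 200)).astype(np.uint16)

# 2**16 + 1 edges give 2**16 unit-width bins, one per possible uint16 value.
edges = np.arange(2**16 + 1)
hist_np, _ = np.histogram(source.ravel(), bins=edges)

hist_bc = np.bincount(source.ravel(), minlength=2**16)

# With range(2**16) there is one bin fewer, and the last bin [65534, 65535]
# collects the counts of both 65534 and 65535.
hist_short, _ = np.histogram(source.ravel(), bins=range(2**16))

print(len(hist_np), len(hist_short))
```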
Given tensors a of shape (n, f) and b of shape (m, f), I have created a function to calculate euclidean distances between these two tensors
import tensorflow as tf
nr = tf.reduce_sum(tf.square(a), 1)
nw = tf.reduce_sum(tf.square(b), 1)
nr = tf.reshape(nr, [-1, 1])
nw = tf.reshape(nw, [1, -1])
res = nr - 2*tf.matmul(a, b, False, True) + nw
res = tf.argmin(res, axis=1)
So far so good, the code runs reasonably fast (I got better performance with cKDTree when n=1000, m=1600, f=4, but this is not the issue now). I will check the performance versus different input sizes later.
In this example the b tensor is a rank 2, flattened version of a rank 3 tensor. I do that to be able to evaluate the euclidean distances using two tensors with the same rank (that is simpler). But after evaluating the distances I need to know where in the original tensor each of the nearest elements is. For that I have created the custom lambda function fn to convert back to the rank 3 tensor coordinates.
fn = lambda x: (x//N, x%N)
# This map takes an enormous amount of time
out = tf.map_fn(fn, res, dtype=(tf.int64, tf.int64))
return tf.stack(out, axis=1)
But sadly this tf.map_fn takes a HUGE time to run, around 300ms.
Just for comparison, if I perform a np.apply_along_axis on a dataset that holds exactly the same data (but as a numpy array), the footprint is barely noticeable: around 50 microseconds vs. 300ms for the tensorflow equivalent.
Are there better approaches in tensorflow for this mapping?
TF version 2.1.0 and CUDA is enabled and working.
Just to add some timings
%timeit eucl_dist_tf_vecmap(R_tf, W_tf)
28.1 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit eucl_dist_tf_nomap(R_tf, W_tf)
2.07 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit eucl_dist_ckdtree_applyaxis(R, W)
878 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit eucl_dist_ckdtree_noapplyaxis(R, W)
817 µs ± 51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The first two timings use the custom function shown here: the first with vectorized_map, the second without vectorized_map and without the stack (I tested this: the overhead is in vectorized_map).
The last two timings are from an implementation based on scipy's cKDTree. The first uses np.apply_along_axis exactly as in the vectorized_map version. We can see that the overhead is much smaller with the numpy array.
You could try tf.vectorized_map. https://www.tensorflow.org/api_docs/python/tf/vectorized_map
If you need to change the data type, you can try changing the parallel_iterations value in the map_fn parameters, which is set to 1 by default in eager mode.
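Another option worth considering is to drop the per-element map entirely, since x//N and x%N are plain elementwise operations. The sketch below shows the idea in NumPy; N, the grid shape and the random res array are hypothetical stand-ins for the values in the question. TensorFlow tensors overload // and % as well, so tf.stack([res // N, res % N], axis=1) should be the direct counterpart, although that line is not run here.

```python
import numpy as np

N = 40  # hypothetical width of the original rank-3 tensor's second axis
rng = np.random.default_rng(0)
res = rng.integers(0, 40 * N, size=1000)  # stand-in for the argmin output

# Apply the index arithmetic to the whole array at once, no per-element map.
coords = np.stack((res // N, res % N), axis=1)

# np.unravel_index performs the same conversion for a (rows, N) grid.
rows, cols = np.unravel_index(res, (40, N))

print(coords.shape)
```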
I wanted to compute the fourier transform of several images.
I was therefore benchmarking numpy's fft.fftn against a brute force for loop.
This is the code I used to benchmark the 2 approaches (in a jupyter notebook):
import numpy as np
x = np.random.rand(32, 256, 256)
def iterate_fft(arr):
    k = np.empty_like(arr, dtype=np.complex64)
    for i, a in enumerate(arr):
        k[i] = np.fft.fft2(a)
    return k
k_it = iterate_fft(x)
k_np = np.fft.fftn(x, axes=(1, 2))
np.testing.assert_allclose(k_it.real, k_np.real)
np.testing.assert_allclose(k_it.imag, k_np.imag)
%%timeit
k_it = iterate_fft(x)
Output: 63.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
k_np = np.fft.fftn(x, axes=(1, 2))
Output: 122 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why is there such a huge difference?
So a person involved in the numpy fft development has answered the deep question on GitHub and it turns out that the slowdown is most likely coming from some multi dimensional array rearrangement used by pocketfft.
This will all be a thing of the past once numpy switches to the scipy 1.4 implementation, which my benchmark shows does not have these drawbacks.
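For anyone who wants to try the scipy implementation directly, here is a minimal sketch, assuming SciPy 1.4 or later so that the scipy.fft module is available. It only verifies that scipy.fft.fftn and np.fft.fftn agree on a small made-up array; no timing is claimed, so run %timeit on both calls with your own shapes.

```python
import numpy as np
import scipy.fft

rng = np.random.default_rng(0)
x = rng.random((4, 64, 64))  # smaller than the benchmark array

# Same transform over the last two axes, computed by both libraries.
k_np = np.fft.fftn(x, axes=(1, 2))
k_sp = scipy.fft.fftn(x, axes=(1, 2))

same = np.allclose(k_np, k_sp)
print(same)
```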
These routines in numpy seem to currently assume that the last dimension will always be the smallest. When this is actually true fftn is faster, sometimes by a lot.
That said, I get a much smaller difference in performance between these two methods than you (with Python 3.7.4, numpy 1.17.2). For your example, iterate_fft takes 46ms while fftn takes 50. But if I flip the axes around, to (256, 256, 32), I get 55ms and 40ms respectively. Pushing even further with a shape of (256, 256, 2) I get 21ms and 4ms respectively.
Note that if performance is really an issue, there are other FFT libraries available that perform better in some situations. Also the full fftpack in scipy can have very different performance than the more limited code in numpy.
Note that your usage of fftn basically does:
x = np.random.rand(32, 256, 256)
a = np.fft.fft(x, n=256, axis=2)
a = np.fft.fft(a, n=256, axis=1)
np.testing.assert_allclose(np.fft.fftn(x, axes=(1, 2)), a)
I have 10000 matrices with the shape (32, 32, 3). I want to create a euclidean distance matrix between all the matrices. At the end, it is going to be like,
[0, d2, d3, d4, ...]
[d1, 0, d3, d4, ...]
[d1, d2, 0, d4, ...]
[d1, d2, d3, 0, ...]
How can I do this in the fastest way? I have tried the following, but it takes ages to finish.
import numpy as np
dists = []
for a in range(len(X_test)):
    dists.append([])
    for b in range(len(X_test)):
        dists[a].append(np.linalg.norm(X_test[a] - X_test[b]))
print(dists)
You can cut the time in half by exploiting the fact that the distance matrix is symmetric and computing only the upper triangular portion, by using
for b in range(a+1, len(X_test)):
on line 5.
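A minimal sketch of that idea, assuming the result is stored in a preallocated square array instead of nested lists (a change from the question's code), and using a small random X_test made up for the illustration. Each distance is computed once and written to both mirrored positions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test = rng.random((20, 32, 32, 3))  # small hypothetical stand-in for the data

n = len(X_test)
dists = np.zeros((n, n))  # the diagonal stays zero
for a in range(n):
    for b in range(a + 1, n):  # upper triangle only
        d = np.linalg.norm(X_test[a] - X_test[b])
        dists[a, b] = d
        dists[b, a] = d  # mirror into the lower triangle

print(dists.shape)
```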
I don't see any other obvious optimizations while keeping the problem exactly the same, but it also seems that you're working with 32x32 images in a three channel format. That's 3072 dimensions! Why not first down-sample to 4x4, convert to HSL color space, and keep only Hue and Lightness to get a (4,4,2) "signature" for each image. If your problem is mostly about shape, you can throw away Hue too and basically work with black-and-white images.
(4,4,2) has only 32 dimensions, a reduction by a factor of about 100 compared to the 3072 dimensions of (32,32,3). And if you did want to do the full comparison in the (32,32,3) space, you could do that only on images that are already very similar in the (4,4,2) space.
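Here is a rough sketch of the lightness-only variant of this signature, under several assumptions: the images are RGB with values in [0, 1], the down-sampling is done by averaging 8x8 blocks, and hue is discarded entirely, so the signature is (4,4) rather than (4,4,2). The random images are placeholders, not real data.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
images = rng.random((50, 32, 32, 3))  # hypothetical RGB data in [0, 1]

# HSL lightness per pixel: midpoint of the largest and smallest channel.
lightness = (images.max(axis=-1) + images.min(axis=-1)) / 2  # (50, 32, 32)

# Down-sample 32x32 to 4x4 by averaging each 8x8 block.
signature = lightness.reshape(50, 4, 8, 4, 8).mean(axis=(2, 4))  # (50, 4, 4)

# Pairwise distances in the 16-dimensional signature space.
coarse = pdist(signature.reshape(50, -1))

print(signature.shape, coarse.shape)
```

The coarse distances could then serve as a cheap filter for picking the pairs that deserve a full (32,32,3) comparison.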
I have read Divakar's comment.
Rather than asking "Show me Divakar" I asked myself "What is this pdist/cdist stuff?" — I read about pdist and norm and I came out with the following code
Import stuff:
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import pdist
Generate a random sample, not necessarily as large as the OP's one, and reshape it as suggested by Divakar
In [3]: a = np.random.random((100,32,32,3))
In [4]: b = a.reshape((100,32*32*3))
Using the magic of IPython, let's benchmark the two approaches
In [5]: %%timeit
...: dists = []
...: for i in range(len(a)):
...:     dists.append([])
...:     for j in range(len(a)):
...:         dists[i].append(np.linalg.norm(a[i] - a[j]))
128 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit pdist(b)
12.3 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Divakar's was 1 order of magnitude faster — but what about the accuracy?
Let's repeat the computations...
In [7]: dists1 = []
...: for i in range(len(a)):
...:     dists1.append([])
...:     for j in range(len(a)):
...:         dists1[i].append(np.linalg.norm(a[i] - a[j]))
In [8]: dists2 = pdist(b)
To compare the results, we must be aware that pdist computes only the upper triangle of the square matrix of distances (because the matrix is symmetric and the principal diagonal is identically equal to zero) so we must be careful in checking our results: hence I check the off diagonal part of the first row of dists1 with the first 99 elements of dists2 using allclose
In [9]: np.allclose(dists1[0][1:], dists2[:99])
Out[9]: True
The result is the same, nice.
What about an estimate of the time required for 10,000 elements? The feeling is that's quadratic, but let's experiment doubling the number of elements
In [10]: b = np.random.random((200,32*32*3))
In [11]: %timeit pdist(b)
48 ms ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The new timing is 4 times the initial one, which confirms the quadratic scaling. Going from 100 to 10,000 elements multiplies the count by 100, so my estimate for your computation, on my feeble PC and using Divakar's proposal, is 12ms x 100 x 100 = 120,000ms = 120s. You should carefully read the excellent answer by olooney and decide what you really want to do.
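If the full square matrix from the question is what you need in the end, scipy.spatial.distance.squareform expands the condensed pdist output into it. A minimal sketch on a small random sample (made up for the illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
a = rng.random((30, 32, 32, 3))
b = a.reshape(len(a), -1)

condensed = pdist(b)          # length n*(n-1)/2, upper triangle only
full = squareform(condensed)  # (n, n), symmetric, zero diagonal

# Spot-check one entry against the direct norm.
check = np.linalg.norm(a[3] - a[7])
print(full.shape, np.isclose(full[3, 7], check))
```

Keep in mind that for 10,000 elements the square float64 matrix alone takes 10,000 x 10,000 x 8 bytes = 800 MB, about twice the size of the condensed form.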