I have been browsing through the questions, and could find some help, but I prefer having confirmation by asking it directly. So here is my problem.
I have an (numpy) array u of dimension N, from which I want to build a square matrix k of dimension N^2. Basically, each matrix element k(i,j) is defined as k(i,j)=exp(-|u_i-u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
for j in range(N):
k[i][j]=np.exp(np.sum(-(u[i]-u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast ?
2) Is there a faster way to do it ? For N=10000, it is starting to take quite some time already, so I really don't know if this was the "right" way to do it.
Thank you in advance !
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark, there is no need to use np.sum if u can be re-written as u = np.arange(N). Which seems to be the case since you wrote that it is of dimension N.
1) First question:
Accessing indices in Python is slow, so best is to not use [] if there is a way to not use it. Plus you call multiple times np.exp and np.sum, whereas they can be called for vectors and matrices. So, your second proposal is better since you compute your k all in once, instead of elements by elements.
2) Second question:
Yes there is. You should consider using only numpy functions and not using indices (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: You can keep **2 instead of np.power, which is a bit faster but has smaller precision)
edit (Take into account that u is an array of tuples)
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You'll have to use twice np.substract.outer since it will return a 4 dimensions array if you do it in one time (and compute lots of useless data), whereas u[i]-u[j] returns a 3 dimensions array.
I used np.add instead of np.sum since it keep the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
I returns the same as your proposals. (But 1.7 times faster)
Related
I would like to know the fastest way to extract a sub array from a very large numpy array.
I have an algorithm that needs to run in real time and I often have to extract a sub array which is very time consuming.
Here is how it is currently done:
array[max(0,y-q):max(0,y+q+1),max(0,x-q):max(0,x+q+1)]
To select a q*q array centered in x, y from the original one.
In most of the cases I use q=6
Is there a way to make it faster?
EDIT:
Here is the code using it
res_1 = np.mean(arr_1[max(0,y-q):max(0,y+q+1),max(0,x-q):max(0,x+q+1)])
res_2 = np.mean(arr_2[max(0,y-q):max(0,y+q+1),max(0,x-q):max(0,x+q+1)])
if (arr_0[y, x] - res_1)>0.035 and (arr_0[y, x] - res_2)>0.035:
return True
else:
return False
I did a quick benchmark with the timeit module. It is as I expected: Creating the subarray consumes almost no time. The mean computation takes all the time. Your real question should be, "How do I improve the mean computation here?"
Interestingly enough, mean runs faster in double precision rather than single precision. I guess the mean function always works in double precision internally.
You could also use array.sum() instead of mean. You know the sub-array size, so your comparison could be rewritten as ((arr_0[y, x] * N - res_1)>0.035 * N and ... I have not benchmarked this.
Can one get the cartesian outer product of two set of vectors without using two for loops? It is slow because the data is large.
[[f(x,y) for x in vectors] for y in vectors]
I am trying to do an agglomerative clustering project in python and for this, I need to create a distance matrix.
This is the code that I have to define the function for the distance matrix:
def distance_matrix(vectors):
s = np.zeros((len(vectors), len(vectors)))
for i in range(len(vectors)):
for v in range(len(vectors)):
s[i, v] = dissimilarity(vectors[i], vectors[v])
return s
What it should do is take a take a list of NumPy arrays and return a 2D NumPy array d where the entry d[i,j] should contain the dissimilarity between vectors[i] and vectors[j].
In this case, vectors is the list of NumPy arrays and the dissimilarity is calculated by:
def dissimilarity(v1, v2):
return (1-(v1.dot(v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))))
or in other words, the dissimilarity is the cosine dissimilarity between two 1D NumPy arrays.
My goal is to find a way to get the distance matrix without the double for loops but still have the computational time be very small.
In general one cannot do this. In general (but not in this case), the computation time will not change because the work is physically there and exists and nomatter the accounting tricks one might try to rewrite it (a single for loop that acts like two for loops, or a recursive implementation), at the end of the day the work must still be done irrespective of the ordering you do it.
Is there a way to get rid of the for loops? Even if it slows down the run time a bit. – Cat_Smithyyyy03
Your case is special. While one normally has no reason to consider a speedup is possible, you are asking to consider tensor products, or in this case a matrix product. When you think about general matrix multiplication AB=C, a matrix product C is basically the outer product {a dot b} of all vectors a in the rowspace of A and b in the colspace of B.
So, for your case, first normalize all your vectors (so the normalization is not required later), then stack them to form A, then let B=transp(A), then do a matrix multiply.
[[a0 a1 a2] [[a0 [b0 [z0 [aa ab ac .. az
[b0 b1 b2] a1 b1 ... z1 ba bb bc .. bz
( . )( a2] b2] z2]] ) = ( . . )
. . .
. . .
[z0 z1 z2]] za zb zc zz]
Interestingly, now you can then plug this into a fast matrix-multiply algorithm, which is actually faster than O(dim * #vecs^2). Hopefully it's also optimized for self-transpose matrices that would generate symmetric matrices (which might save a factor of 2 work... maybe it has some flag like matrixmult(a,b,outputWillBeSymmetric)).
This is faster than "O(N^2)", unintuitively: This rewrite exposed a substructure in the problem, which can be leveraged to get faster than O(dim * #vecs^2). The leveragable substructure is namely the fact that you are computing the outer product of THE SAME vectors. The fast matrix-multiply algorithm will leverage this.
edit: my original answer was wrong
You have a set of size N and you wish to compute f(a,b) for all a and b in the set.
Unless you know some values are trivial, there is no way to be asymptotically faster than this, because you have to imagine the worst case: Every pair f(a,b) may be unique... so there's no way to do less than roughly N^2 work.
However since your function f is symmetric, you could do half the work, then duplicate it:
N = len(vectors)
for i in range(N):
for v in range(N):
dissim = #...
s[i,v] = dissim
s[v,i] = dissim
(You can avoid calculating your metric in the reflexive case f(a,a) because it's trivial, but that doesn't make things asymptotically faster since the fraction of such work N/N^2 tends to zero as N increases, so it's not that great an optimization... it is reasonable only if you are working with a very small number of vectors.)
Whether you should optimize further depends on whether you need to. Such code should be able to easily handle millions of small vectors. Next steps:
Is there something fishy that is making my code very slow? What could it be? We don't have enough information to comment.
If there is nothing fishy of the sort, you can try rephrasing as matrix operations, in order to stay as much as possible in numpy's optimized C routines instead of bouncing back and forth into python. This is ugly and I would avoid doing it, because your code readability will decrease.
If you are dealing with hundreds of millions of vectors, perhaps consider a more cache-friendly approach where you do for blockI in range(N//10**6): for blockV in range(N//10**6): for i in range(blockI*10**6, (blockI+1)*10**6): for v in range(blockV*...):
If you're dealing with billions of vectors, then look into leveraging gpgpu. This is quite ideal for the gpu and might be a factor of thousands speedup.
very new to Python so apologies for the lack of vocabulary/knowledge. I would like to know if there is a better way to achieve what the code below provides. Using the loop I have made, I can generate and append all of the matrices/arrays formed from multiplying matrix A by each and every element within A. The last line of code then sums all of the elements in this array of arrays and prints out the result I want.
The problem is, when I get to about d = 600, I get SIGKILL errors, due to a lack of memory on my computer.
I have considered the mathematics behind it, which included breaking the summation into parts that dealt with different values of indices, but nothing seems to speed it up significantly.
This may be purely a memory-based issue, but I thought I would ask in case there are any Python/code based tips that could help. The code is as follows:
A = numpy.random.randint(0, 4, size=(d, d))
All = []
for n in range(0, d):
for m in range(0, d):
All.append(A*(A[n,m]))
print(numpy.sum(All))
So overall, I achieve the correct result, but due to the large size of the matrices and the number of multiplications, I cannot achieve the required d = 2000 I am looking for without a memory error. Thanks in advance.
You don't need to do looping here and building a new list if all you want is the total sum... what you're doing mathematically comes down to:
total = A.sum() ** 2
Have a look at this image:
In my application I receive from an iterator an arbitrary amount (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, use each entry e as index and then to increase the value of res2[i][e] by 1. Alter looping, I would apply something like argmax. However, this is too slow.
So: Is the a way to perform the task in a fast way, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise – avoiding to concatenate the arrays:
def foo(length, n):
counts = np.zeros((length, n), dtype=np.int_)
for arr in array_iterator():
i = 0
for e in arr:
counts[i][e] += 1
i += 1
return np.argmax(counts, axis=1)
It takes already 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which results into that time – however, this work scales linearly with the amount of arrays).
Regarding the real sizes:
The amount of different arrays is really arbitrary. It's a parameter of experiments and I'd like to have the opportunity even to set this to values like 10^6. The length of each array is depending of my data set I'm working with. This could be 10000, or 100000 or even worse. However – spitting this into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above leads to a wrong impression. Actually, the running time which just belongs to the inner loop (for e in arr) in the above mentioned scenario is just 5 seconds – which is now ok for me, since it's negligible compared to the remaining running time. I will leave this question open anyway for a moment, since there might be an even faster method waiting out there.
I need to invert a large, dense matrix which I hoped to use Scipy's gmres to do. Fortunately, the dense matrix A follows a pattern and I do not need to store the matrix in memory. The LinearOperator class allows us to construct an object which acts as the matrix for GMRES and can compute directly the matrix vector product A*v. That is, we write a function mv(v) which takes as input a vector v and returns mv(v) = A*v. Then, we can use the LinearOperator class to create A_LinOp = LinearOperator(shape = shape, matvec = mv). We can put the linear operator into the Scipy gmres command to evaluate the matrix vector products without ever having to fully load A into memory.
The documentation for the LinearOperator is found here: LinearOperator Documentation.
Here is my problem: to write the routine to compute the matrix vector product mv(v) = A*v, I need another input vector C. The entries in A are of the form A[i,j] = f(C[i] - C[j]). So, what I really want is for mv to be of two inputs, one fixed vector input C, and one variable input v which we want to compute A*v.
MATLAB has a similar setup, where would write x = gmres(#(v) mv(v,C),b) where b is the right hand side of the problem Ax = b, , and mv is the function that takes as variable input v which we want to compute A*v and C is the fixed, known vector which we need for the assembly of A.
My problem is that I can't figure out how to allow the LinearOperator class to accept two inputs, one variable and one "fixed" like I can in MATLAB.
Is there a way to do the analogous operation in SciPy? Alternatively, if anyone knows of a better way of inverting a large, dense matrix (50000, 50000) where the entries follow a pattern, I would greatly appreciate any suggestions.
Thanks!
EDIT: I should have stated this information actually. The matrix is actually (in block form) [A C; C^T 0], where A is N x N (N large) and C is N x 3, and the 0 is 3 x 3 and C^T is the transpose of C. This array C is the same array as the one mentioned above. The entries of A follow a pattern A[i,j] = f(C[i] - C[j]).
I wrote mv(v,C) to go row by row construct A*v[i] for i=0,N, by computing sum f(C[i]-C[j)*v[j] (actually, I do numpy.dot(FC,v) where FC[j] = f(C[i]-C[j]) which works well). Then, at the end doing the computations for the C^T rows. I was hoping to eventually replace the large for loop with a multiprocessing call to parallelize the for loop, but that's a future thing to consider. I will also look into using Cython to speed up the computations.
This is very late, but if you're still interested...
Your A matrix must be very low rank since it's a nonlinearly transformed version of a rank-2 matrix. Plus it's symmetric. That means it's trivial to inverse: get the truncated eigenvalue decompostion with, say, 5 eigenvalues: A = U*S*U', then invert that: A^-1 = U*S^-1*U'. S is diagonal so this is inexpensive. You can get the truncated eigenvalue decomposition with eigh.
That takes care of A. Then for the rest: use the block matrix inversion formula. Looks nasty, but I will bet you 100,000,000 prussian francs that it's 50x faster than the direct method you were using.
I faced the same situation (some years later than you) of trying to use more than one argument to LinearOperator, but for another problem. The solution I found was the use of global variables, to avoid passing the variables as arguments to the function.