efficient way of column sum and reciprocal in a matrix - python

I am working with large matrices (up to a million by a million). I want to sum each column of a matrix and put the reciprocal of each column sum in the respective column elements where there are non-zero elements. I have made two attempts at this, but I still want a faster method of computation, and since some columns are zero I can't use np.reciprocal directly.
Here are my attempts:
import numpy as np
import scipy as sc

A = np.array([[0,1,1,1],[0,0,1,0],[0,1,0,0],[0,0,0,0]])
d = sc.shape(A)[0]
V = sc.zeros(d)
sc.sum(A, axis=0, out=V, dtype='int')
with sc.errstate(divide='ignore', invalid='ignore'):
    Vs = sc.true_divide(1, V)
    Vs[~sc.isfinite(Vs)] = 0  # -inf inf NaN
print(Vs)
Second attempt:
A = np.array([[0,1,1,1],[0,0,1,0],[0,1,0,0],[0,0,0,0]])
d = sc.shape(A)[0]
V = sc.zeros(d)
sc.sum(A, axis=0, out=V, dtype='int')
for i in range(0, d):
    if V[i] != 0:
        V[i] = 1 / V[i]
print(V)
Is there a faster way than this? My running time is currently very poor.
Thanks
edit1: Do you think changing everything to the CSR sparse matrix format would make it faster?

NumPy: Return 0 with divide by zero
discusses various divide by zero options. The accepted answer looks a lot like your first try. But there's a new answer that might (?) be faster
https://stackoverflow.com/a/37977222/901925
In [240]: V=A.sum(axis=0)
In [241]: np.divide(1,V,out=np.zeros(V.shape),where=V>0)
Out[241]: array([ 0. , 0.5, 0.5, 1. ])
Your example is too small to make meaningful time tests on. I don't have any intuition about the relative speeds (beyond my comment).
A recent SO question pointed out that the out parameter is required with where in the latest release (1.13) but optional in earlier ones.
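For the full task in the question (putting the reciprocal of each column sum back into that column's non-zero entries), a minimal sketch along these lines may help; the broadcasting step and the CSR variant for edit1 are assumptions of mine, not part of the linked answer:

import numpy as np
from scipy import sparse

A = np.array([[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)

# Safe reciprocal of the column sums (0 where a column sums to 0).
V = A.sum(axis=0)
Vs = np.divide(1, V, out=np.zeros_like(V), where=V > 0)

# Broadcasting scales each column by its reciprocal sum; zero entries stay zero.
A_scaled = A * Vs

# Rough CSR variant for very large sparse matrices: scale the columns by
# right-multiplying with a sparse diagonal matrix built from Vs.
A_csr = sparse.csr_matrix(A)
A_csr_scaled = A_csr @ sparse.diags(Vs)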


Find all the times in a (very large) array that have a difference greater than x

My array is time, so it is sorted and increasing.
I have to pull out the beginning/end where the difference in the array is greater than 30. The problem, which other solutions don't cover, is that the array contains thousands of values, so looping through it seems inefficient.
hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])
My desired output for the above array would be something like: (2.072,50) (50,90).
Is there a way to accomplish this?
You can use np.diff and np.where to find the correct indices:
>>> idxs = np.where(np.diff(hugeArr) > 30)[0]
>>> list(zip(hugeArr[idxs], hugeArr[idxs + 1]))
[(2.072, 50.0), (50.0, 90.0)]
(Assuming you require only consecutive values)
And as @not_speshal mentioned, you can use np.column_stack instead of list(zip(...)) to stay within NumPy:
>>> np.column_stack((hugeArr[idxs], hugeArr[idxs+1]))
array([[  2.072,  50.   ],
       [ 50.   ,  90.   ]])
Try to think about what you're trying to do. For each value in the array, if the next value is larger by more than 30, you'd like to save the pair of them.
The key words here are for each. This is a classic O(n) complexity algorithm, so decreasing its time complexity seems impossible to me.
However, you can make changes specific to your array to make the algorithm faster.
For example, if you're looking for a difference of 30 and you know that the average difference is 1, you might be better off looking, at index i, at
difference = hugeArr[i+15] - hugeArr[i]
and seeing if this is bigger than 30. If it isn't (and it probably won't be), you can skip these 15 indices, since you know that no gap between two consecutive values within them can be larger than that whole span.
If this works for you, run tests; 15 is completely arbitrary and maybe your magic number is 25. Change it a bit and time how long your function takes to run.
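A minimal sketch of that skip-ahead idea (the step size, the fallback to np.diff inside a window, and the function name are arbitrary choices of mine):

import numpy as np

def gaps_with_skip(arr, threshold=30.0, step=15):
    # Find (arr[i], arr[i+1]) pairs with a gap > threshold in a sorted array.
    pairs = []
    n = len(arr)
    i = 0
    while i < n - 1:
        j = min(i + step, n - 1)
        if arr[j] - arr[i] <= threshold:
            # No consecutive gap inside arr[i:j+1] can exceed the whole span.
            i = j
        else:
            # Fall back to checking this window element by element.
            d = np.diff(arr[i:j + 1])
            for k in np.where(d > threshold)[0]:
                pairs.append((arr[i + k], arr[i + k + 1]))
            i = j
    return pairs

hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])
print(gaps_with_skip(hugeArr))   # [(2.072, 50.0), (50.0, 90.0)]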
A strategy that comes to mind is that we don't have to check numbers between two numbers whose distance is smaller than 30; we can do this because the array is sorted. For example, if abs(hugeArr[0] - hugeArr[-1]) < 30 we don't have to check anything, because nothing will have a distance of over 30.
We would start at the ends and work our way inwards. So check the starting number and ending number first. Then we go halfway, hugeArr[len(hugeArr)//2], and check that number's distance against hugeArr[0] and hugeArr[-1]. Then we go into the ranges hugeArr[0:len(hugeArr)//2] and hugeArr[len(hugeArr)//2:-1]. We break those two ranges in half again, and wherever the end-to-end distance is smaller than 30 we don't check that range. We can make this a recursive algorithm.
Worst case you'll have a distance over 30 everywhere and end up with O(n), but it could give you some advantage.
Something like this, although you might want to refactor it to NumPy:
def check(arr):
    pairs = []
    def check_range(hugeArr):
        difference = abs(hugeArr[0] - hugeArr[-1])
        if difference < 30:
            return
        if len(hugeArr) == 2:
            pairs.append((hugeArr[0], hugeArr[1]))
            return
        halfway = len(hugeArr)//2
        check_range(hugeArr[:halfway+1])
        check_range(hugeArr[halfway:])
    check_range(arr)
    return pairs
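For the sample array above, this should return the expected pairs (a quick check, assuming NumPy is imported as np and the 30 threshold stays hard-coded as in the snippet):

hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])
print(check(hugeArr))   # [(2.072, 50.0), (50.0, 90.0)]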

How to find the common eigenvectors of two matrices with distinct eigenvalues

I am looking to find, or rather build, a common eigenvector matrix X between 2 matrices A and B such that:
AX=aX with "a" the diagonal matrix corresponding to the eigenvalues
BX=bX with "b" the diagonal matrix corresponding to the eigenvalues
where A and B are square and diagonalizable matrices.
I took a look at a similar post but did not manage to reach a conclusion, i.e. to get valid results when I build the final wanted endomorphism F defined by: F = P D P^-1.
I have also read the Wikipedia topic and this interesting paper, but couldn't extract methods that are easy enough to implement.
Particularly, I am interested by the eig(A,B) Matlab function.
I tried to use it like this :
% Search for common build eigen vectors between FISH_sp and FISH_xc
[V,D] = eig(FISH_sp,FISH_xc);
% Diagonalize the matrix (A B^-1) to compute Lambda since we have AX=Lambda B X
[eigenv, eigen_final] = eig(inv(FISH_xc)*FISH_sp);
% Compute the final endomorphism : F = P D P^-1
FISH_final = V*eye(7).*eigen_final*inv(V)
But the matrix FISH_final doesn't give good results, since I can do other computations from this matrix FISH_final (this is actually a Fisher matrix) and the results of these computations are not valid.
So surely I must have made an error in the code snippet above. For now, I prefer to prototype in Matlab, and once it works, look at doing this synthesis with MKL or with Python functions. Hence also tagging python.
How can I build these common eigenvectors and also find the associated eigenvalues? I am a little lost among all the potential methods that exist to carry this out.
A screen capture (not reproduced here) shows that the kernel of the commutator has to be different from the null vector.
EDIT 1: On Maths Exchange, one advises to use a Singular Value Decomposition (SVD) on the commutator [A,B]:
"If 𝑣 is a common eigenvector, then ‖(𝐴𝐵−𝐵𝐴)𝑣‖=0. The SVD approach gives you a unit-vector 𝑣 that minimizes ‖(𝐴𝐵−𝐵𝐴)𝑣‖ (with the constraint that ‖𝑣‖=1)"
So I extract the approximate eigenvectors V from:
[U,S,V] = svd(A*B-B*A)
Is there a way to increase the accuracy, to minimize ‖(𝐴𝐵−𝐵𝐴)𝑣‖ as much as possible?
IMPORTANT REMARK: Maybe some of you didn't fully understand my goal.
Concerning the common basis of eigenvectors, I am looking for a combination (vector or matrix) of V1 and V2, or directly for the null operator applied to the 2 input Fisher matrices, to build this new basis "P" in which, with eigenvalues other than the known D1 and D2 (noted D1a and D2a), we could have:
F = P (D1a+D2a) P^-1
To compute the new Fisher matrix F, I need to know P, assuming that D1a and D2a are respectively equal to the D1 and D2 diagonal matrices (coming from the diagonalization of the A and B matrices).
If I know the common basis of eigenvectors P, I could deduce D1a and D2a from D1 and D2, couldn't I?
The 2 Fisher matrices are available on these links :
Matrix A
Matrix B
I don't think there is a built-in facility in Matlab for computing common eigenvectors of two matrices. I'll just outline a brute-force way and do it in Matlab in order to highlight some of its eigenvector-related methods. We will assume the matrices A and B are square and diagonalizable.
Outline of steps:
Get eigenvectors/values for A and B respectively.
Group the resultant eigenvectors by their eigenspaces.
Check for intersection of the eigenspaces by checking linear dependency among the eigenvectors of A and B one pair eigenspaces at a time.
Matlab does provide methods for (efficiently) completing each step! Except of course step 3 involves checking linear dependency many many times, which in turn means we are likely doing unnecessary computation. Not to mention, finding common eigenvectors may not require finding all eigenvectors. So this is not meant to be a general numerical recipe.
How to get eigenvector/values
The syntax is
[V,D] = eig(A)
where D(i), V(:,i) are the corresponding eigenpairs.
Just be wary of numerical errors. In other words, if you check
tol=sum(abs(A*V(:,i)-D(i)*V(:,i)));
tol<n*eps should be true for some small n for a smallish matrix A, but it's probably not true for n = 0 or 1.
Example:
>> A = gallery('lehmer',4);
>> [V,D] = eig(A);
>> sum(abs(A*V(:,1)-D(1)*V(:,1)))<eps
ans =
logical
0
>> sum(abs(A*V(:,1)-D(1)*V(:,1)))<10*eps
ans =
logical
1
How to group eigenvectors by their eigenspaces
In Matlab, eigenvalues are not automatically sorted in the output of [V,D] = eig(A). So you need to do that.
Get diagonal entries of matrix: diag(D)
Sort and keep track of the required permutation for sorting: [d,I]=sort(diag(D))
Identify repeating elements in d: [~,ia,~]=unique(d,'stable')
ia(i) tells you the beginning index of the ith eigenspace. So you can expect d(ia(i):ia(i+1)-1) to be identical eigenvalues and thus the eigenvectors belonging to the ith eigenspace are the columns W(:,ia(i):ia(i+1)-1) where W=V(:,I). Of course, for the last one, the index is ia(end):end
The last step happens to be answered here in true generality. Here, unique is sufficient at least for small A.
(Feel free to ask a separate question on how to do this whole step of "shuffling columns of one matrix based on another diagonal matrix" efficiently. There are probably other efficient methods using built-in Matlab functions.)
For example,
>> A=[1,2,0;1,2,2;3,6,1];
>> [V,D] = eig(A),
V =
         0         0    0.7071
    1.0000   -0.7071         0
         0    0.7071   -0.7071
D =
     3     0     0
     0     5     0
     0     0     3
>> [d,I]=sort(diag(D));
>> W=V(:,I),
W =
         0    0.7071         0
    1.0000         0   -0.7071
         0   -0.7071    0.7071
>> [~,ia,~]=unique(d,'stable'),
ia =
1
3
which makes sense because the 1st eigenspace is the one with eigenvalue 3, comprising the span of columns 1 and 2 of W, and similarly for the 2nd space.
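If you later port this step to Python/NumPy, as the question suggests, a rough equivalent could look like the sketch below; np.linalg.eig and the rounding tolerance used for grouping are assumptions of mine:

import numpy as np

def grouped_eigenspaces(A, decimals=10):
    # Eigendecompose A and group eigenvectors by (numerically equal) eigenvalues.
    d, V = np.linalg.eig(A)
    order = np.argsort(d)               # sort eigenvalues, permute eigenvectors to match
    d, W = d[order], V[:, order]
    # Round before grouping so eigenvalues equal up to noise fall into one group.
    _, start = np.unique(np.round(d, decimals), return_index=True)
    bounds = sorted(start) + [len(d)]
    return [(d[bounds[k]], W[:, bounds[k]:bounds[k + 1]]) for k in range(len(start))]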
How to get linear intersect of (the span of) two sets
To complete the task of finding common eigenvectors, you do the above for both A and B. Next, for each pair of eigenspaces, you check for linear dependency. If there is linear dependency, the linear intersect is an answer.
There are a number of ways for checking linear dependency. One is to use other people's tools. Example: https://www.mathworks.com/matlabcentral/fileexchange/32060-intersection-of-linear-subspaces
One is to get the RREF of the matrix formed by concatenating the column vectors column-wise.
Let's say you did the computation in step 2 and arrived at V1, D1, d1, W1, ia1 for A and V2, D2, d2, W2, ia2 for B. You need to do
for i=1:numel(ia1)
    for j=1:numel(ia2)
        check_linear_dependency(col1,col2);
    end
end
where col1 is W1(:,ia1(i):ia1(i+1)-1) as mentioned in step 2 but with the caveat for the last space, and similarly for col2, and by check_linear_dependency we mean the following. First we get the RREF:
[R,p] = rref([col1,col2]);
You are looking for, first, rank([col1,col2])<size([col1,col2],2). If you have computed rref anyway, you already have the rank. You can check the Matlab documentation for details. You will need to profile your code for selecting the more efficient method. I shall refrain from guess-estimating what Matlab does in rank(). Although whether doing rank() implies doing the work in rref can make a good separate question.
In cases where rank([col1,col2])<size([col1,col2],2) is true, some rows don't have leading 1s and I believe p will help you trace back to which columns are dependent on which other columns. And you can build the intersect from here. As usual, be alert of numerical errors getting in the way of == statements. We are getting to the point of a different question -- ie. how to get linear intersect from rref() in Matlab, so I am going to leave it here.
There is yet another way using fundamental theorem of linear algebra (*sigh at that unfortunate naming):
null( [null(col1.').' ; null(col2.').'] )
The formula I got from here. I think ftla is why it should work. If that's not why or if you want to be sure that the formula works (which you probably should), please ask a separate question. Just beware that purely math questions should go on a different stackexchange site.
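A Python/SciPy sketch of that same nullspace formula (scipy.linalg.null_space and the helper name are my choices, not from the answer):

import numpy as np
from scipy.linalg import null_space

def span_intersection(col1, col2):
    # Orthonormal basis of the intersection of span(col1) and span(col2).
    n1 = null_space(col1.T).T   # rows orthogonal to the column space of col1
    n2 = null_space(col2.T).T   # rows orthogonal to the column space of col2
    # Vectors orthogonal to both sets of rows lie in both column spaces.
    return null_space(np.vstack([n1, n2]))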
Now I guess we are done!
EDIT 1:
Let's be extra clear about how ia works, with an example. Let's say we named everything with a trailing 1 for A and 2 for B. We need
for i=1:numel(ia1)
    for j=1:numel(ia2)
        if i==numel(ia1)
            col1 = W1(:,ia1(end):end);   % last eigenspace runs to the end of W1
        else
            col1 = W1(:,ia1(i):ia1(i+1)-1);
        end
        if j==numel(ia2)
            col2 = W2(:,ia2(end):end);   % last eigenspace runs to the end of W2
        else
            col2 = W2(:,ia2(j):ia2(j+1)-1);
        end
        check_linear_dependency(col1,col2);
    end
end
EDIT 2:
I should mention the observation that common eigenvectors should be those in the nullspace of the commutator. Thus, perhaps null(A*B-B*A) yields the same result.
But still be alert of numerical errors. With the brute force method, we started with eigenpairs with low tol (see definition in earlier sections) and so we already verified the "eigen" part in the eigenvectors. With null(A*B-B*A), the same should be done as well.
Of course, with multiple methods at hand, it's a good idea to compare results across methods.
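Sketched on the Python side, and hedged accordingly (the rcond tolerance, the helper name, and the residual check are my additions):

import numpy as np
from scipy.linalg import null_space

def common_eigvec_candidates(A, B, rcond=1e-10):
    # Candidate common eigenvectors: an orthonormal basis of null([A, B]).
    K = A @ B - B @ A                  # the commutator [A, B]
    X = null_space(K, rcond=rcond)     # columns span its numerical nullspace
    # Verify the "eigen" part, as recommended above: each unit column v should
    # (nearly) satisfy A v = (v'Av) v and B v = (v'Bv) v.
    resid = [max(np.linalg.norm(A @ v - (v @ A @ v) * v),
                 np.linalg.norm(B @ v - (v @ B @ v) * v)) for v in X.T]
    return X, resid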
I suspect this is rather a delicate matter.
First off, mathematically, A and B are simultaneously diagonalisable iff they commute, that is iff
A*B - B*A == 0 (often A*B-B*A is written [A,B])
(for if A*X = X*a and B*X = X*b with a, b diagonal then
A = X*a*inv(X), B = X*b*inv(X)
[A,B] = X*[a,b]*inv(X) = 0 since a and b, being diagonal, commute)
So I'd say the first thing to check is that your A and B do commute, and here is the first awkward issue: since [A,B] as computed is unlikely to be all zeroes due to rounding error, you'll need to decide if [A,B] being non-zero is just due to rounding error or if, actually, A and B don't commute.
Now suppose x is an eigenvector of A, with eigenvalue e. Then
A*B*x = B*A*x = B*e*x = e*B*x
And so we have, mathematically, two possibilities: either Bx is 0, or Bx is also an eigenvector of A with eigenvalue e.
A nice case is when all the elements of a are different, that is when each eigenspace of A is one dimensional. In that case:
if AX = Xa for diagonal a, then BX = Xb for diagonal b (which you'll need to compute). If you diagonalize A, and all the eigenvalues are sufficiently different, then you can assume each eigenspace is of dimension 1, but what does 'sufficiently' mean? Another delicate question, alas. If two computed eigenvalues are very close, are the eigenvalues different or is the difference rounding error?
Anyway, to compute the eigenvalues of b, for each eigenvector x of A compute B*x. If ||Bx|| is small enough compared to ||x|| then the eigenvalue of B is 0; otherwise it's
x'*B*x/(x'*x)
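A small Python sketch of this one-dimensional case (assuming A and B are symmetric, as Fisher matrices are, so np.linalg.eigh applies; the tolerance is arbitrary):

import numpy as np

def paired_eigenvalues(A, B, tol=1e-10):
    # If every eigenspace of A is one-dimensional, read off B's eigenvalues
    # directly on A's eigenvectors.
    a, X = np.linalg.eigh(A)           # columns of X: orthonormal eigenvectors of A
    b = np.empty_like(a)
    for k, x in enumerate(X.T):
        Bx = B @ x
        # If B*x is essentially zero, the eigenvalue of B here is 0;
        # otherwise use the Rayleigh quotient x'*B*x / (x'*x).
        b[k] = 0.0 if np.linalg.norm(Bx) < tol * np.linalg.norm(x) else x @ Bx / (x @ x)
    return a, b, X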
In the general case, some of the eigenspaces may have dimension greater than 1. The one dimensional eigen spaces can be dealt with as above, but more computation is required for the higher dimensional ones.
Suppose that m eigenvectors x[1].. x[m] of A correspond to the eigenvalue e. Since A and B commute, it is easy to see that B maps the space spanned by the xs to itself. Let C be the mxm matrix
C[i,j] = x[j]'*B*x[i]
Then C is symmetric and so we can diagonalize it, ie find orthogonal V and diagonal c with
C = V'*c*V
If we define
y[l] = Sum_k V[l,k]*x[k],   l = 1..m
then a little algebra shows that y[l] is an eigenvector of B, with eigenvalue c[l]. Moreover, since each x[i] is an eigenvector of A with the same eigenvalue e, each y[l] is also an eigenvector of A with eigenvalue e.
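The same construction in Python, under the assumption that B is symmetric and that X holds an orthonormal basis of one eigenspace of A (the function name is mine):

import numpy as np

def split_eigenspace(X, B):
    # X: n x m, orthonormal columns spanning one eigenspace of A, with [A, B] = 0.
    C = X.T @ B @ X              # m x m restriction of B to span(X); symmetric if B is
    c, V = np.linalg.eigh(C)     # C = V * diag(c) * V'
    Y = X @ V                    # columns of Y are eigenvectors of B (and still of A)
    return c, Y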
So all in all, I think a method would be:
Compute [A,B] and if it's really not 0, give up
Diagonalise A, and sort the eigenvalues to be increasing (and sort the eigenvectors!)
Identify the eigenspaces of A. For the 1 dimensional spaces the corresponding eigenvector of A is an eigenvector of B, and all you need to compute is the eigenvalue of B. For higher dimensional ones, proceed as in the previous paragraph.
A relatively expensive (in computational effort) but reasonably reliable way to test whether the commutator is zero would be to compute the SVD of the commutator and take its largest singular value, c say, and also to take the largest singular value (or largest absolute value of the eigenvalues) a of A and b of B. Unless c is a lot smaller than (e.g. 1e-10 times) the lesser of a and b, you should conclude the commutator is not zero.
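A compact version of that test in Python (the 1e-10 factor is the example value above; using the spectral norm for the largest singular value is my choice):

import numpy as np

def commutator_is_negligible(A, B, rtol=1e-10):
    # Largest singular value of [A, B] versus the smaller of ||A||_2 and ||B||_2.
    c = np.linalg.norm(A @ B - B @ A, 2)
    return c <= rtol * min(np.linalg.norm(A, 2), np.linalg.norm(B, 2))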

NumPy: Compute mode row-wise spanning over multiple arrays from iterator

(An illustrative image from the original question is not reproduced here.)
In my application I receive from an iterator an arbitrary number (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, use each entry e as an index, and increase the value of res2[i][e] by 1. After looping, I would apply something like argmax. However, this is too slow.
So: is there a way to perform the task quickly, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise, avoiding concatenating the arrays:
def foo(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    for arr in array_iterator():
        i = 0
        for e in arr:
            counts[i][e] += 1
            i += 1
    return np.argmax(counts, axis=1)
It already takes 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which accounts for that time; that work, however, scales linearly with the number of arrays).
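The inner Python loop can be replaced with NumPy indexing; a hedged sketch (array_iterator() is the question's own placeholder, and np.add.at is my suggestion):

import numpy as np

def foo_vectorized(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    rows = np.arange(length)
    for arr in array_iterator():
        # Scatter-add: equivalent to counts[i, arr[i]] += 1 for every row i.
        np.add.at(counts, (rows, arr), 1)
    return np.argmax(counts, axis=1)

Since the row indices are all distinct within one call, counts[rows, arr] += 1 would also work here and is usually faster than np.add.at.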
Regarding the real sizes:
The number of different arrays is really arbitrary. It's a parameter of the experiments, and I'd like the option to set it even to values like 10^6. The length of each array depends on the data set I'm working with. This could be 10000, or 100000, or even worse. However, splitting this into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above creates a wrong impression. Actually, the running time attributable to just the inner loop (for e in arr) in the scenario above is only 5 seconds, which is now OK for me, since it's negligible compared to the remaining running time. I will leave this question open for a moment anyway, since there might be an even faster method out there.

Numpy array subtraction: inconsistent values for large arrays

Here's a problem I came across today: I am trying to subtract the first row of a matrix from the (large) entire matrix. As a test, I made all rows equal. Here's an MWE:
import numpy as np
first = np.random.normal(size=10)
reference = np.repeat((first,), 10000, axis=0)
copy_a = np.copy(reference)
copy_a -= copy_a[0]
print(np.all(copy_a == 0))  # prints False
Oh wow - False! So I tried another thing:
copy_b = np.copy(reference)
copy_b -= reference[0]
np.all(copy_b == 0) # prints True
Examining the new copy_a array, I found that copy_a[0:818] are all zeros, copy_a[820:] are the original values, while copy_a[819] was only partly operated on.
In [115]: copy_a[819]
Out[115]:
array([ 0.        ,  0.        ,  0.57704706, -0.22270692, -1.83793342,
        0.58976187, -0.71014837,  1.80517635, -0.98758385, -0.65062774])
Looks like midway during the operation, numpy went back and looked at copy_a[0], found it is all zeros, and hence subtracted zeros from the rest of the array. I find this weird. Is this a bug, or is it an expected numpy result?
This issue has actually been reported multiple times to the numpy repository (see below). It is considered a bug, but it is very hard to fix without sacrificing performance (copying the input arrays), because correctly detecting whether two arrays share memory is difficult.
Therefore, for now, you'd better just make a copy of copy_a[0] as explained in @Torben's answer.
The essence of the issue is that you are modifying the array while iterating over it. It happens to work until copy_a[819] simply because 8192 (819×10+2) is the size of numpy's assign buffer.
https://github.com/numpy/numpy/issues/6119
https://github.com/numpy/numpy/issues/5241
https://github.com/numpy/numpy/issues/4802
https://github.com/numpy/numpy/issues/2705
https://github.com/numpy/numpy/issues/1683
The infix operator -= modifies the array in place, meaning that you are pulling the rug out from under your own feet. The effect that you see might have to do with internal buffering of results (i.e. the first "commit" happens after 818 rows).
The solution is to swap out the subtrahend into another array:
copy_a -= copy_a[0].copy()
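A quick check of that fix on the MWE above (copy_c is just a fresh name; nothing new beyond the one-line change):

copy_c = np.copy(reference)
copy_c -= copy_c[0].copy()       # the subtrahend no longer aliases the array being modified
print(np.all(copy_c == 0))       # True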

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1, such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer in order to match the mean of the un-rounded array as closely as possible? (In the above case it would be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.
Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
def rounding_point(l):
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    # this may require adjustment if there are duplicates in the input
    return 1.0 if ones_needed == len(l) else l[-ones_needed]
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
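A quick illustration of how the returned threshold would be used (the rounding step itself is my addition, not part of the answer):

vals = [0.1, 0.2, 0.8, 0.9]
t = rounding_point(vals)                      # 0.8 for this input
rounded = [1 if v >= t else 0 for v in vals]  # [0, 0, 1, 1]
# the mean of the rounded values is 0.5, matching the mean of the un-rounded array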
