Matrix multiplication and matrix powers using pandas - python

I have a square matrix of some probabilities, call it Phi, stored as a pandas Dataframe. I want to try multiplying this matrix by itself a large number of times to see what happens. I can easily 'square' the matrix with:
Phi_squared = Phi.dot(Phi)
I am wondering, say I want to try Phi^1000, is there any in-built functionality I can utilise in pandas or numpy? Or do I just need to use loops?

You can use np.linalg.matrix_power(Phi.values, 1000); note that Phi**1000 with Python's ** operator is applied elementwise on a DataFrame, not as a matrix power.
Unless you know the powers converge, the result will quickly overflow and you will end up with nothing useful.
Slowly increase the power N and check whether the result is converging; many elements go to 0 in this case. If it diverges, the entries keep growing until they overflow, so it is better to test from small N up to larger N.
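As a concrete illustration, here is a minimal sketch, assuming Phi is a square DataFrame of floats (the toy 3x3 matrix below is only an example), that computes a large matrix power in one call and wraps the result back into a DataFrame:
import numpy as np
import pandas as pd

# Toy 3x3 transition matrix; replace with your own Phi.
Phi = pd.DataFrame(np.full((3, 3), 1.0 / 3.0))

# np.linalg.matrix_power repeatedly multiplies the underlying ndarray,
# so no explicit Python loop is needed.
Phi_1000 = pd.DataFrame(np.linalg.matrix_power(Phi.values, 1000),
                        index=Phi.index, columns=Phi.columns)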

Related

Pairwise distance in very large datasets

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at a certain distance from each other.
The idea should be:
Start new_array with the first row from the data array
Compare new_array with the second row from the data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from the data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking of splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k loops it becomes really slow.
import numpy as np
from scipy.spatial.distance import pdist

xyTotal = np.random.random([5000000, 6])
tol = 0.5
ng = [xyTotal[0]]  # start with the first row
for i, z in enumerate(xyTotal[1:], start=1):
    if (pdist(np.vstack([np.array(ng), z])) > tol).all():
        ng.append(z)
Any suggestions for this problem?
EDIT
from sklearn.neighbors import BallTree

ktree = BallTree(xyTotal, leaf_size=40, metric='euclidean')
btsem = []
for i, j in enumerate(xyTotal):
    # keep the point only if it is its own single neighbour within tol
    if ktree.query_radius(j.reshape(1, -1), r=tol, count_only=True) == 1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near one another (i.e. in a little cluster), I don't know how to pick only one point and leave the others, since I will get the same metrics for all points in the cluster (they all have the same distance to each other).
The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n), where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time. However, this is tricky to implement. You need to add your ng points to a k-d tree; then you can search for the nearest neighbor and check whether its distance to the current point is greater than tol.
Note that you can find Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
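For illustration, here is a minimal sketch of that idea, assuming SciPy's cKDTree is acceptable (the batch size of 1000 and the variable names are arbitrary choices). Since SciPy's tree is static, it is rebuilt in batches over the accepted points, and points accepted since the last rebuild are checked directly:
import numpy as np
from scipy.spatial import cKDTree

xyTotal = np.random.random([5000000, 6])
tol = 0.5

accepted = [xyTotal[0]]
tree = cKDTree(np.array(accepted))
pending = 0  # points accepted since the tree was last rebuilt

for z in xyTotal[1:]:
    dist, _ = tree.query(z)  # nearest accepted point already in the tree
    if dist <= tol:
        continue
    # also check the points accepted after the last rebuild
    recent = np.array(accepted[len(accepted) - pending:]) if pending else None
    if recent is not None and (np.linalg.norm(recent - z, axis=1) <= tol).any():
        continue
    accepted.append(z)
    pending += 1
    if pending >= 1000:  # rebuild the tree in batches to amortise the cost
        tree = cKDTree(np.array(accepted))
        pending = 0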

Matrix exponentiation becomes too slow

I'm currently trying to compute matrix powers, and for that I use the well-known exponentiation-by-squaring algorithm.
def mat_mul(a, b):
    n = len(a)
    c = []
    for i in range(n):
        c.append([0] * n)
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
def mat_pow(a, n):
    if n <= 0:
        return None
    if n == 1:
        return a
    if n == 2:
        return mat_mul(a, a)
    t1 = mat_pow(a, n // 2)
    if n % 2 == 0:
        return mat_mul(t1, t1)
    return mat_mul(t1, mat_mul(a, t1))
The problem is that my algorithm is still too slow, and after some research I found out that, contrary to what I thought, the time of a matrix multiplication depends not only on the matrix size but also on the numbers in the matrix.
In fact, the numbers in my matrix become very large, so after some time the multiplication becomes much slower. Typically, I have M, a 13x13 matrix filled with random 0s and 1s, and I want to compute M^(10^8). The integers in the matrix can have hundreds of digits.
I'd like to know if there is a way to avoid that problem.
I've seen that I could use matrix diagonalization, but the problem is I can't use external libraries (like numpy). So the diagonalization algorithm seems a bit too complicated.
the time of matrix multiplication depends on the matrix size and on the numbers in the matrix.
Well, of course, you are multiplying integers of arbitrary size. CPUs have no support for multiplication of those, so it will be very slow, and become slower as the integers grow.
The integers in the matrix can have hundreds of digits. I'd like to know if there is a way to avoid that problem.
There are several ways:
Avoid integers; use floating point numbers and handle the error however your project needs. This will greatly increase the speed and, most importantly, the cost will no longer depend on the size of the numbers. The memory usage will also decrease greatly (a minimal sketch of this option follows the list).
Use a better algorithm. You already suggested this one, but it is one of the best ways to increase performance if the better algorithm gives you significantly better bounds.
Optimize it in a low-level systems language. This can give you some performance back, around an order of magnitude or so. Python is a very bad choice for high-performance computing unless you use specialized libraries, like numpy, that do the work for you.
Ideally, you should be doing all 3 if you actually need the performance.
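As a rough illustration of the first option, here is a minimal sketch (reusing the mat_mul defined above): the same exponentiation by squaring, but with the entries converted to floats, so each scalar multiplication costs constant time regardless of how large the values grow, at the price of losing exact integer results and eventually overflowing to inf.
def mat_pow_float(a, n):
    # work on a float copy so arbitrary-precision integers never appear
    a = [[float(x) for x in row] for row in a]
    result = None
    while n > 0:
        if n % 2 == 1:
            result = a if result is None else mat_mul(result, a)
        a = mat_mul(a, a)
        n //= 2
    return result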

Python matrix multiplication 3d Array

I tried to solve a PDE numerically, and in the course of this I faced the problem of a triple-nested for loop representing the 3 spatial dimensions. This construct is nested in another time loop, so you can imagine that the computation takes forever for sufficiently large node counts. The code block looks like this:
for jy in range(0, cy-1):
    for jx in range(0, cx-1):
        for jz in range(0, cz-1):
            T[n+1, jx, jy, jz] = (T[n, jx, jy, jz]
                + s*(T[n, jx-1, jy, jz] - 2*T[n, jx, jy, jz] + T[n, jx+1, jy, jz])
                + s*(T[n, jx, jy-1, jz] - 2*T[n, jx, jy, jz] + T[n, jx, jy+1, jz])
                + s*(T[n, jx, jy, jz-1] - 2*T[n, jx, jy, jz] + T[n, jx, jy, jz+1]))
It might look intimidating at first, but is quite easy. I have a 3 dimensional matrix representing a solid bulk material, where each point represents the current temperature. The iteratively calculated next temperature at each point is calculated taking into account each point next to that point - so 6 in total. In the case of a 1-dimensional solid the solution is just a simple matrix multiplication. Is there any chance to represent the 3-loop-system above in a simple matrix solution like in the 1D case?
Best regards!
With numpy you can easily do these kinds of array operations,
e.g. for a 3x3x3 array
import numpy as np
T = np.random.random((3, 3, 3))
T = T*T - 2*T  # ... etc.
First off, you need to be a bit more careful with your terminology. A "matrix" is a 2-dimensional array of numbers, so you are really talking about an array. NumPy, or better yet SciPy, has a data type called ndarray. You need to be very careful manipulating them, because although they are sometimes used to represent matrices, there are operations that can be performed on 2-D arrays that are not mathematically legal for matrices.
I strongly recommend you use @ and not * to perform multiplication of 1- or 2-D matrices, and be sure to add code to check that the operations you are doing are legal mathematically. As a trivial example, Python lets you add a 1 x n or an n x 1 vector to an n x n matrix, even though that is not mathematically correct. The reason it allows this is, as intimated above, that there is no true matrix type in Python.
It very well may be that you can reformulate your problem to use a 3-D array, and by experimentation find the particular operation you are trying to perform. Just keep in mind that the rules of linear algebra are only casually applied in Python.
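As one possible illustration of that reformulation, here is a sketch under the assumption that T has shape (time, cx, cy, cz) and that only interior points are updated (the loop version above implicitly wraps around at index -1, which this version does not reproduce); the triple loop is replaced by array slicing:
import numpy as np

# Small toy sizes purely for illustration.
cx = cy = cz = 20
steps = 10
s = 0.1

T = np.zeros((steps + 1, cx, cy, cz))
T[0, cx // 2, cy // 2, cz // 2] = 100.0  # one hot point as an initial condition

for n in range(steps):
    centre = T[n, 1:-1, 1:-1, 1:-1]
    T[n + 1, 1:-1, 1:-1, 1:-1] = centre + s * (
        T[n, :-2, 1:-1, 1:-1] - 2 * centre + T[n, 2:, 1:-1, 1:-1]      # x neighbours
        + T[n, 1:-1, :-2, 1:-1] - 2 * centre + T[n, 1:-1, 2:, 1:-1]    # y neighbours
        + T[n, 1:-1, 1:-1, :-2] - 2 * centre + T[n, 1:-1, 1:-1, 2:]    # z neighbours
    )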

Integer optimization/maximization in numpy

I need to estimate the size of a population by finding the value of n which maximises scipy.misc.comb(n, a)/n**b, where a and b are constants. n, a and b are all integers.
Obviously, I could just have a loop over range(SOME_HUGE_NUMBER), calculate the value for each n and break out of the loop once I pass the peak of the curve. But I wondered whether there was a more elegant way of doing this with (say) numpy/scipy, or some other elegant way of doing it in pure Python (e.g. an integer equivalent of Newton's method)?
As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
import scipy.misc as misc  # in modern SciPy this function lives in scipy.special as comb

nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = misc.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant, but it is rather fast for that range of n. The problem is that for n > 1484 the numerator already gets too large to be stored in a float, so this method fails with overflows. And this is not just a matter of numpy.ndarray not working with Python integers; even with them, you would not be able to compute:
misc.comb(10000, 1000, exact=True)/10000**1001
as the division of two numbers larger than the maximum a float in Python can hold (max_exp = 1024 on my system; see sys.float_info) would still have to produce a float result. You could not use your loop over a range in that case either. If you really want to do something like that, you will have to take more care numerically.
You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
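For completeness, here is a small numerical sketch of the gamma-function route mentioned above, assuming SciPy is available (the upper bound of 10000 is an arbitrary choice). It maximises log(comb(n, a)) - b*log(n) as a smooth function of a real n, which sidesteps the float overflow entirely:
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

a, b = 77, 100

def neg_log_objective(n):
    # log(comb(n, a)) written with gammaln so nothing overflows
    log_comb = gammaln(n + 1) - gammaln(a + 1) - gammaln(n - a + 1)
    return -(log_comb - b * np.log(n))

res = minimize_scalar(neg_log_objective, bounds=(a, 10000), method='bounded')
print(res.x)  # roughly 180.6, so the integer candidates to check are 180 and 181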

Numpy matrix exponentiation gives negative value

I wanted to use NumPy in a Fibonacci question because of its efficiency in matrix multiplication. You know that there is a method for finding Fibonacci numbers with the matrix [[1, 1], [1, 0]].
I wrote some very simple code, but after increasing n, the matrix starts to give negative numbers.
import numpy

def fib(n):
    return (numpy.matrix("1 1; 1 0") ** n).item(1)

print(fib(90))
# Gives -1581614984
What could be the reason for this?
Note: linalg.matrix_power also gives negative values.
Note 2: I tried numbers from 0 to 100. It starts to give negative values after 47. Is it a large-integer issue because NumPy is coded in C? If so, how could I solve this?
Edit: Using a regular Python list matrix with linalg.matrix_power also gave negative results. Also, let me add that not all results after 47 are negative; it occurs seemingly at random.
Edit 2: I tried the method @AlbertoGarcia-Raboso suggested. It resolved the negative-number problem; however, another issue occurred. It gives the answer as -5.168070885485832e+19 where I need -51680708854858323072L. So I tried using int(); it converted it to a long, but now the answer seems incorrect because of a loss of precision.
The reason you see negative values is that NumPy has defaulted to the np.int32 dtype for your matrix.
The maximum positive integer this dtype can represent is 2^31 - 1, which is 2147483647. Unfortunately, this is less than the 47th Fibonacci number, 2971215073. The resulting overflow causes the negative numbers to appear:
>>> np.int32(2971215073)
-1323752223
Using a bigger integer type (like np.int64) would fix this, but only temporarily: you'd still run into problems if you kept on asking for larger and larger Fibonacci numbers.
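For example (a small illustrative check, not part of the original answer): with np.int64 the crossover simply moves to the 93rd Fibonacci number, the first one that no longer fits in 64 bits.
import numpy as np

M = np.array([[1, 1], [1, 0]], dtype=np.int64)

# F(92) = 7540113804746346429 still fits in a signed 64-bit integer...
print(np.linalg.matrix_power(M, 92)[0, 1])
# ...but F(93) = 12200160415121876738 does not, so the result silently wraps around.
print(np.linalg.matrix_power(M, 93)[0, 1])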
The only sure fix is to use an unlimited-size integer type, such as Python's int type. To do this, modify your matrix to be of np.object type:
import numpy as np
def fib_2(n):
    # dtype=np.object makes the matrix hold Python ints (newer NumPy: dtype=object)
    return (np.matrix("1 1; 1 0", dtype=np.object) ** n).item(1)
The np.object type allows a matrix or array to hold any mix of native Python types. Essentially, instead of holding machine types, the matrix is now behaving like a Python list and simply consists of pointers to integer objects in memory. Python integers will be used in the calculation of the Fibonacci numbers now and overflow is not an issue.
>>> fib_2(300)
222232244629420445529739893461909967206666939096499764990979600
This flexibility comes at the cost of decreased performance: NumPy's speed originates from direct storage of integer/float types which can be manipulated by your hardware.
