scipy.sparse.linalg.eigs giving different results for consecutive calls

I am trying to compute the spectral radius of a sparse matrix in python. This is what I have:
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
spec_radius = max(abs(eigs(w)[0]))
where the values of w get scaled to be in the range of [-1,1]. However, running that command gives a different result every time:
>>> print max(abs(eigs(w)[0]))
4.51859016293e-05
>>> print max(abs(eigs(w)[0]))
4.02309443625e-06
>>> print max(abs(eigs(w)[0]))
3.7611221426e-05
What gives? I would have thought it would be the same every time. Am I misunderstanding how these commands work?

Sorry for responding to an old question here, but the other answer is not quite satisfactory.
The randomness is not inherent to the ARPACK algorithm itself, but rather comes from the initialization of the algorithm. From the scipy documentation, the initialization vector, v0, is random unless specified by the user. Sure enough, we can see this (note the setup is slightly different--entries of w are scaled to be in [0,1]):
import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
w = w/w.max()
If we do not specify v0, we get some (slight) randomness:
>>> print max(abs(eigs(w)[0]))
0.00024188777676476916
>>> print max(abs(eigs(w)[0]))
0.00028073646868200566
>>> print max(abs(eigs(w)[0]))
0.00025250058038424729
>>> print max(abs(eigs(w)[0]))
0.00018183677959035711
But if we specify the initialization, we always get the same answer:
>>> print numpy.all([max(abs(eigs(w, v0 = numpy.ones(10))[0])) == 0.00026363015600771211 for k in range(1000)])
True

You are applying the methods correctly, and they will give you the same results if the absolute value of the largest eigenvalue is significantly larger than 0. The reason for the outcome you observe lies in the iterative nature of the algorithm used to determine the eigenvalues. From the documentation:
"This function is a wrapper to the ARPACK [R209] SNEUPD, DNEUPD, CNEUPD, ZNEUPD, functions which use the Implicitly Restarted Arnoldi Method to find the eigenvalues and eigenvectors [R210]." In case you are interested in details of this algorithm, you can find an explanation e.g. here.
As with all numerical methods, you can determine the desired value only to a certain precision. For eigenvalues that are significantly different from 0 you will always obtain the same output; for values that are close to 0 you might obtain different values, as you observed in the example above.
You can try varying the parameters "maxiter" and "tol" that you pass to "eigs" (check the documentation cited above for details). "maxiter" is the maximum number of Arnoldi update iterations allowed; by increasing it you should get more accurate results. "tol" is the relative accuracy for eigenvalues and the stopping criterion of the algorithm.
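For example, a minimal sketch (not from the original answer, using the same kind of random sparse matrix as in the question) that fixes the start vector and tightens these parameters so that consecutive calls agree:

import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse

w = sparse.rand(10, 10, 0.1)

# Fix the start vector and tighten the tolerance so that consecutive calls
# return the same (and more accurate) largest-magnitude eigenvalue.
vals = eigs(w, k=1, which='LM',           # only the largest-magnitude eigenvalue
            v0=numpy.ones(w.shape[0]),    # deterministic start vector
            maxiter=10000, tol=1e-12,     # more iterations, tighter tolerance
            return_eigenvectors=False)
print(abs(vals).max())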

Related

Sum with Python's uncertainties giving a different result than expected

A friend of mine is evaluating data with Python's uncertainties package. I am her statistics consultant, and I have come across a weird result in her code.
sum(array) and sqrt(sum(unumpy.std_devs(array)**2)) yield different results, with the second one being the variance method as usually used in engineering.
Now, I know that the variance approach is only suited to cases where the error is small compared to the partial derivative (because of the Taylor series), which isn't the case here, but how does uncertainties handle this? And how can I reproduce what uncertainties does?
You forgot to square the standard error to make it the variance. This should work and be equal to the error of sum(array):
sqrt(sum(unumpy.std_devs(array)**2))
Then
import math
import random
import uncertainties as uc
from uncertainties import unumpy
a = [uc.ufloat(random.random(), random.random()) for _ in range(100)]
sa = unumpy.std_devs(sum(a))
sb = math.sqrt(sum(unumpy.std_devs(a)**2))
print(sa)
print(sb)
print(sa == sb)
will result in something like
5.793714811166615
5.793714811166615
True
This happens because the elements of my array are AffineScalarFunc objects (as opposed to Variable objects), and thus they store not only the value but also all the variables that the value depends on [1].
Now, my values are not fully independent (which wasn't clear at all at first sight*), and thus sum(array) also takes the off-diagonal elements of my covariance matrix into account, in accordance with this formula (sorry that the article is in German, but the English Wikipedia's formula isn't as intuitive), whereas sqrt(sum(unumpy.std_devs(array)**2)) obviously doesn't and just adds up the diagonal elements.
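In symbols: the variance of sum(array) is the sum over all i and j of Cov(x_i, x_j), whereas sqrt(sum(unumpy.std_devs(array)**2)) only adds up the diagonal terms Cov(x_i, x_i) = Var(x_i) and ignores the correlations.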
A way to reproduce what uncertainties does is:
from math import sqrt
from uncertainties import covariance_matrix

cov = covariance_matrix(array)   # full covariance matrix of the array elements
total = 0
for i in range(len(array)):
    for j in range(len(array)):
        total += cov[i][j]       # add both diagonal and off-diagonal terms
print(sqrt(total))
And then unumpy.std_devs(sum(array)) == sqrt(total) is True.
*Correlation is due to the use of data taken from the same interpolation (of measurements) and because the length of a measurement was calculated as the difference of two times (and the measurements were consecutive, so the times are correlated!)

Accuracy of deriving the CDF using integration

I have two ways of deriving the probability that a normally (say) distributed random variable falls within an interval. The first and most straightforward is the following:
import scipy.stats
print scipy.stats.norm.cdf(6) - scipy.stats.norm.cdf(5)
# 2.85664984223e-07
And the second is by integrating the pdf:
import scipy.integrate
print scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
# 2.85664984234e-07
The difference in this case is really tiny, but it doesn't mean it can't grow larger for other distributions or integration limits. Can you tell which is more accurate and why?
By the way, the first alternative seems to be at least 10 times faster, so if it is also more accurate (which would be my guess, since it is somewhat specialized), then it is perfect.
In this particular case, given those particular numbers, the quad approach will actually be more accurate. The CDF itself can be computed quickly and accurately, of course, but look at the actual numbers:
>>> scipy.stats.norm.cdf(6), scipy.stats.norm.cdf(5)
(0.9999999990134123, 0.99999971334842808)
When you're differencing two very similar quantities, you lose accuracy. Similar problems can be mitigated somewhat during integration if the coders are careful with their summations.
Anyway, we can check this against a high-resolution calculation using mpmath:
>>> via_cdf = scipy.stats.norm.cdf(6)-scipy.stats.norm.cdf(5)
>>> via_quad = scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
>>> import mpmath
>>> mpmath.mp.dps = 100
>>> def cdf(x): return 0.5 * (1 + mpmath.erf(x/mpmath.sqrt(2)))
>>> highres = cdf(6)-cdf(5)
>>> highres
mpf('0.0000002856649842341562135330514687422473118357532223619105443630157837185833042478210791954518847897468442097')
>>> float((highres - via_quad)/highres)
-2.3824773334590333e-16
>>> float((highres - via_cdf)/highres)
3.86659439572868e-11
The first calls an implementation of the cdf included in scipy.special; the latter actually does the integration. In general the cdf route is limited only by the accuracy of the cdf implementation itself rather than by numerical integration error, but as the numbers above show, the subtraction of two nearly equal values costs it several digits of accuracy here, so the quad result ends up closer to the high-precision value. In practice, unless you need results that are good to better than 6 decimal places, you're probably fine.
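As an aside, a third option that sidesteps the cancellation entirely is to work with the survival function sf = 1 - cdf, which scipy.stats evaluates directly for the upper tail; a small sketch:

import scipy.stats
# Both tail probabilities are small and of different magnitudes, so the
# subtraction loses essentially no relative accuracy.
p = scipy.stats.norm.sf(5) - scipy.stats.norm.sf(6)
print(p)   # should agree with the high-precision mpmath value above

This keeps the speed of the closed-form route while avoiding the loss of accuracy that comes from differencing two numbers that are both close to 1.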

numpy.linalg.svd not returning Sigma in descending order

I'm currently computing an SVD of a large matrix (an image, to be exact) using numpy.linalg's svd function. The documentation and examples I've found all seem to indicate that the returned sigma values are in descending order (implying the corresponding ordering of U and V^T).
However, in my testing the sigma values appear unordered. So my question is whether something is going wrong in my linalg call (highly unlikely, I know), or whether it simply returns the sigmas unordered.
A follow-up question is then what the best way is to sort the sigmas so that the ordering of U and V^T also reflects the change.
Since linalg.svd is just an interface to LAPACK's dgesdd routine, the singular values should come back in descending order.
>>> import numpy as np
>>> A = np.random.randn(2400,3600)
>>> U, s, V = np.linalg.svd(A, full_matrices=False)
>>> np.allclose(A, np.dot(U*s, V))
True
>>> (s[:-1] >= s[1:]).all()
True
If you get unordered results, check whether the decomposition is still correct, as in the example above. If not, you may have a LAPACK bug or (less likely) a NumPy bug.
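For the follow-up question, a minimal sketch (not part of the original answer) of how one could reorder an already computed SVD so the singular values are descending, permuting U and V consistently:

import numpy as np

A = np.random.randn(5, 7)
U, s, V = np.linalg.svd(A, full_matrices=False)

order = np.argsort(s)[::-1]     # indices that sort s in descending order
s_sorted = s[order]
U_sorted = U[:, order]          # reorder the columns of U ...
V_sorted = V[order, :]          # ... and the rows of V (i.e. V^T) to match

print(np.allclose(A, np.dot(U_sorted * s_sorted, V_sorted)))   # still reconstructs A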

Using scipy to perform discrete integration of the sample

I am trying to port some code from LabVIEW to Python.
In LabVIEW there is a function, "Integral x(t) VI", that takes a set of samples as input, performs a discrete integration of the samples, and returns a list of values (the areas under the curve) according to Simpson's rule.
I tried to find an equivalent function in scipy, e.g. scipy.integrate.simps, but those functions return the summed integral across the set of samples, as a float.
How do I get the list of integrated values as opposed to the sum of the integrated values?
Am I just looking at the problem the wrong way around?
I think you may be using scipy.integrate.simps slightly incorrectly. The area returned by scipy.integrate.simps is the total area under y (the first parameter passed). The second parameter is optional and gives the sample values for the x-axis (the actual x value for each of the y values), i.e.:
>>> import numpy as np
>>> import scipy.integrate
>>> a=np.array([1,1,1,1,1])
>>> scipy.integrate.simps(a)
4.0
>>> scipy.integrate.simps(a,np.array([0,10,20,30,40]))
40.0
I think you want the areas under the same curve between different limits? To do that, you pass in just the part of the curve you want, like this:
>>> a=np.array([0,1,1,1,1,10,10,10,10,0])
>>> scipy.integrate.simps(a)
44.916666666666671
>>> scipy.integrate.simps(a[:5])
3.6666666666666665
>>> scipy.integrate.simps(a[5:])
36.666666666666664
There is only one method in SciPy that does cumulative integration, scipy.integrate.cumtrapz(), and it does what you want as long as you don't specifically need Simpson's rule or another method. For those you can, as suggested, always write the loop yourself.
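For illustration, a small sketch of the cumulative trapezoidal integral on the sample array used above (note that newer SciPy releases rename the function to scipy.integrate.cumulative_trapezoid):

import numpy as np
from scipy.integrate import cumtrapz

y = np.array([0, 1, 1, 1, 1, 10, 10, 10, 10, 0], dtype=float)
# Running integral with unit spacing: one value per sample, starting at 0.
running = cumtrapz(y, initial=0)
print(running)   # 0.0, 0.5, 1.5, 2.5, 3.5, 9.0, 19.0, 29.0, 39.0, 44.0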

Calculate poisson probability percentage

When you use the POISSON function in Excel (or in OpenOffice Calc), it takes two arguments:
an integer
an 'average' number
and returns a float.
In Python (I tried RandomArray and NumPy) it returns an array of random Poisson-distributed numbers.
What I really want is the probability (as a percentage) that this event will occur (it is a constant number, and the array contains different numbers every time, so is it an average?).
for example:
print poisson(2.6,6)
returns [1 3 3 0 1 3] (and every time I run it, it's different).
The number I get from Calc/Excel is 3.19 (POISSON(6,2.6,0)*100).
Am I using the python's poisson wrong (no pun!) or am I missing something?
SciPy has what you want:
>>> scipy.stats.distributions
<module 'scipy.stats.distributions' from '/home/coventry/lib/python2.5/site-packages/scipy/stats/distributions.pyc'>
>>> scipy.stats.distributions.poisson.pmf(6, 2.6)
array(0.031867055625524499)
It's worth noting that it's pretty easy to calculate by hand, too.
It is easy to do by hand, but you can overflow doing it that way. You can do the exponent and factorial in a loop to avoid the overflow:
import math

def poisson_probability(actual, mean):
    # naive: math.exp(-mean) * mean**actual / factorial(actual)
    # iterative, to keep the components from getting too large or small:
    p = math.exp(-mean)
    for i in xrange(actual):
        p *= mean
        p /= i + 1
    return p
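Called with the numbers from the question, this should reproduce the spreadsheet value:

print poisson_probability(6, 2.6)   # roughly 0.0319, i.e. about 3.19 %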
This page explains why you get an array, and the meaning of the numbers in it, at least.
