A friend of mine is evaluating data with Python's uncertainties package. I am her statistics consultant, and I have come across a weird result in her code.
sum(array) and sqrt(sum(unumpy.std_devs(array)**2)) yield different results, with the second one being the variance method as usually used in engineering.
Now, I know that the variance approach is only suited for cases where the error is small compared to the partial derivative (because of the Taylor expansion), which isn't the case here, but how does uncertainties handle this? And how can I reproduce what uncertainties does?
You forgot to square the standard errors to turn them into variances. This should work and be equal to the error of sum(array):
sqrt(sum(unumpy.std_devs(array)**2))
Then, for example:
from uncertainties import unumpy
import uncertainties as uc
import random
import math
a = [uc.ufloat(random.random(), random.random()) for _ in range(100)]
sa = unumpy.std_devs(sum(a))
sb = math.sqrt(sum(unumpy.std_devs(a)**2))
print(sa)
print(sb)
print(sa == sb)
will output something like
5.793714811166615
5.793714811166615
True
This happens because the elements of my array are AffineScalarFunc objects (as opposed to Variables), and thus they store not only the value but also all the variables that the value depends on [1].
Now, my values are not fully independent (which wasn't clear at all at first sight*), and thus sum(array) also takes the off-diagonal elements of my covariance matrix into account, following the error-propagation formula for correlated variables, Var(sum_i x_i) = sum_i sum_j Cov(x_i, x_j) (sorry that the linked article is in German, but the English Wikipedia's formula isn't as intuitive), whereas sqrt(sum(unumpy.std_devs(array)**2)) obviously doesn't and just adds up the diagonal elements.
A way to reproduce what uncertainties does is:
from math import sqrt
from uncertainties import covariance_matrix

cov = covariance_matrix(array)  # full covariance matrix of the array's elements
total = 0
for i in range(len(array)):
    for j in range(len(array)):
        total += cov[i][j]
print(sqrt(total))
And then unumpy.std_devs(sum(array)) == sqrt(total) is True.
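Equivalently, the double loop just sums every entry of the covariance matrix; assuming numpy is available, a compact sketch (using the same array as above) is:
import numpy as np
from math import sqrt
from uncertainties import covariance_matrix

# total variance = sum of all covariance-matrix entries (diagonal + off-diagonal)
print(sqrt(np.sum(covariance_matrix(array))))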
*The correlation comes from using data taken from the same interpolation (of measurements) and from the fact that the length of a measurement was calculated as the difference of two times (measurements were consecutive, so the times are correlated!).
I am trying to compute the spectral radius of a sparse matrix in python. This is what I have:
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
spec_radius = max(abs(eigs(w)[0]))
where the values of w get scaled to be in the range of [-1,1]. However, running that command gives a different result every time:
>>> print max(abs(eigs(w)[0]))
4.51859016293e-05
>>> print max(abs(eigs(w)[0]))
4.02309443625e-06
>>> print max(abs(eigs(w)[0]))
3.7611221426e-05
What gives? I would have thought it would be the same every time. Am I misunderstanding how these commands work?
Sorry for responding to an old question here, but the other answer is not quite satisfactory.
The randomness is not part of the algorithm bundled with ARPACK, but rather comes from the initialization of the algorithm. According to the scipy documentation, the initialization vector, v0, is random unless specified by the user. Sure enough, we see this (note the setup is slightly different: entries of w are scaled to be in [0, 1]):
import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
w = w/w.max()
If we do not specify v0, we get some (slight) randomness:
>>> print max(abs(eigs(w)[0]))
0.00024188777676476916
>>> print max(abs(eigs(w)[0]))
0.00028073646868200566
>>> print max(abs(eigs(w)[0]))
0.00025250058038424729
>>> print max(abs(eigs(w)[0]))
0.00018183677959035711
but if we specify the initialization, we always get the same answer:
>>> print numpy.all([max(abs(eigs(w, v0 = numpy.ones(10))[0])) == 0.00026363015600771211 for k in range(1000)])
True
You are applying the methods correctly, and they will give you the same results if the absolute value of the largest eigenvalue is significantly larger than 0. The reason for the outcome you observe lies in the iterative nature of the algorithm used to determine the eigenvalues. From the documentation:
"This function is a wrapper to the ARPACK [R209] SNEUPD, DNEUPD, CNEUPD, ZNEUPD, functions which use the Implicitly Restarted Arnoldi Method to find the eigenvalues and eigenvectors [R210]." In case you are interested in details of this algorithm, you can find an explanation e.g. here.
As with all numerical methods, you can determine the desired value only to a certain precision. For eigenvalues that are significantly different from 0, you will always obtain the same output; for values that are close to 0, you might obtain different values, as you observed in the above example.
You can try varying the parameters "maxiter" and "tol" (check the above-cited documentation for details), which you can pass to "eigs". "Maxiter" is the maximum number of Arnoldi update iterations allowed; by increasing it you should get more accurate results. "Tol" is the relative accuracy for eigenvalues and the stopping criterion of the algorithm.
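For example (the numbers here are purely illustrative, not recommendations), both parameters are passed straight to eigs, and fixing v0 at the same time makes the run reproducible:
import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse

w = sparse.rand(10, 10, 0.1)
w = w / w.max()
# fixed start vector plus a tighter tolerance and more Arnoldi iterations
vals = eigs(w, v0=numpy.ones(10), maxiter=10000, tol=1e-10)[0]
print(max(abs(vals)))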
I have a problem that is equal parts trig and Python. I am plotting a cosine over time interval [0,t] whose frequency changes (slightly) according to another cosine function. So what I'd expect to see is a repeating pattern of higher-to-lower frequency that repeats over the duration of the window [0,t].
Instead what I'm seeing is that over time a low-freq motif emerges in the cosine plot and repeats over time, each time becoming lower and lower in freq until eventually the cosine doesn't even oscillate properly it just "wobbles", for lack of a better term.
I don't understand how this is emerging over the course of the window [0,t], because cosine is (obviously) periodic and so is the function modulating it. So how can "new" behavior emerge? The behavior should be identical across all periods of the modulatory cosine that tunes the frequency of the base cosine, right?
As a note, I'm technically using a modified cosine: instead of cos(wt) I'm using e^(cos(wt)) (called the von Mises equation or something similar).
Minimum needed Code:
import numpy as np
import matplotlib.pyplot as plt

cos_plot = []
for wind, pos_theta in zip(window, pos_theta_vec):  # window is a vector of time increments
    # for ref: DBFT(pos_theta) = (1/(2*np.pi))*np.cos(np.radians(pos_theta - base_pos))
    f = float(baserate + DBFT(pos_theta))  # DBFT() returns a value in [-0.15, 0.15] periodically, depending on pos_theta
    cos_plot.append(np.exp(np.cos(f*2*np.pi*wind)))
plt.plot(cos_plot)
plt.show()
What you are observing could be due to "aliasing", i.e. the emergence of low-frequency patterns caused by sampling a high-frequency function with a step that is too big.
[Illustration of aliasing, taken from the linked Wikipedia page]
If the issue is NOT aliasing, consider that any function shape between -1 and 1 can be obtained from cos(f(x)*x) simply by choosing f(x) appropriately.
To see this, take any function -1 <= g(x) <= 1 and set f(x) = arccos(g(x))/x.
To track down the problem, try plotting your "frequency" and see if anything really strange is present in it. Maybe you have a bug in DBFT.
In the interest of posterity, in case anyone ever needs an answer to this question:
I wanted a cosine whose frequency was a time-varying function freq(t). My mistake was simply evaluating this function at each time t, like this: A*cos(2*pi*freq(t)*t). Instead, you need to integrate freq(t) from 0 to t at each time point: y = cos(2*pi*integral(f(t)) + 2*pi*f0*t + phase). The term for this procedure is a frequency sweep or chirp (not identical terms, but similar, if you need to search Google/SO for answers).
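A minimal sketch of the fix (the base rate, the modulation standing in for DBFT, and the time axis are all assumptions made purely for illustration):
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 5000)                    # time axis
dt = t[1] - t[0]
baserate = 1.0                                  # assumed base frequency
freq_dev = 0.15 * np.cos(2 * np.pi * 0.2 * t)   # stand-in for DBFT(pos_theta)
inst_freq = baserate + freq_dev                 # instantaneous frequency freq(t)
phase = 2 * np.pi * np.cumsum(inst_freq) * dt   # running integral of freq(t)
y = np.exp(np.cos(phase))                       # the question's e^cos(...) waveform
plt.plot(t, y)
plt.show()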
Thanks to those who responded with help :)
SB
I have two ways of deriving the probability that a normally (say) distributed random variable falls within an interval. The first and most straightforward is the following:
import scipy.stats
print scipy.stats.norm.cdf(6) - scipy.stats.norm.cdf(5)
# 2.85664984223e-07
And the second is by integrating the pdf:
import scipy.integrate
print scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
# 2.85664984234e-07
The difference in this case is really tiny, but it doesn't mean it can't grow larger for other distributions or integration limits. Can you tell which is more accurate and why?
By the way, the first alternative seems to be at least 10 times faster, so if it is also more accurate (which would be my guess, since it is somewhat specialized), then it is perfect.
In this particular case, given those particular numbers, the quad approach will actually be more accurate. The CDF itself can be computed quickly and accurately, of course, but look at the actual numbers:
>>> scipy.stats.norm.cdf(6), scipy.stats.norm.cdf(5)
(0.9999999990134123, 0.99999971334842808)
When you're differencing two very similar quantities, you lose accuracy. Similar problems can be mitigated somewhat during integration if the coders are careful with their summations.
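As an aside (a standard mitigation, not something used in the comparison above): scipy.stats.norm also exposes the survival function sf = 1 - cdf, which lets you take the difference in the upper tail without subtracting two numbers that are both very close to 1:
import scipy.stats
# same probability as cdf(6) - cdf(5), but without the cancellation near 1
print(scipy.stats.norm.sf(5) - scipy.stats.norm.sf(6))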
Anyway, we can check this against a high-resolution calculation using mpmath:
>>> via_cdf = scipy.stats.norm.cdf(6)-scipy.stats.norm.cdf(5)
>>> via_quad = scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
>>> import mpmath
>>> mpmath.mp.dps = 100
>>> def cdf(x): return 0.5 * (1 + mpmath.erf(x/mpmath.sqrt(2)))
>>> highres = cdf(6)-cdf(5)
>>> highres
mpf('0.0000002856649842341562135330514687422473118357532223619105443630157837185833042478210791954518847897468442097')
>>> float((highres - via_quad)/highres)
-2.3824773334590333e-16
>>> float((highres - via_cdf)/highres)
3.86659439572868e-11
The former calls an implementation of the cdf included in scipy.special. The latter actually does the integration. The former is probably more accurate (as it is limited only by the computer's ability to evaluate the CDF and not by any errors introduced by numerical integration). In practice, unless you need results that are good to better than 6 decimal places, you're probably fine.
I need to invert a large, dense matrix, which I hoped to do with Scipy's gmres. Fortunately, the dense matrix A follows a pattern and I do not need to store the matrix in memory. The LinearOperator class allows us to construct an object which acts as the matrix for GMRES and can compute the matrix-vector product A*v directly. That is, we write a function mv(v) which takes as input a vector v and returns mv(v) = A*v. Then we can use the LinearOperator class to create A_LinOp = LinearOperator(shape=shape, matvec=mv). We can pass the linear operator to Scipy's gmres command to evaluate the matrix-vector products without ever having to fully load A into memory.
The documentation for the LinearOperator is found here: LinearOperator Documentation.
Here is my problem: to write the routine that computes the matrix-vector product mv(v) = A*v, I need another input vector C. The entries of A are of the form A[i,j] = f(C[i] - C[j]). So what I really want is for mv to take two inputs: one fixed vector C, and one variable input v for which we want to compute A*v.
MATLAB has a similar setup, where one would write x = gmres(@(v) mv(v,C), b), where b is the right-hand side of the problem Ax = b, mv is the function that takes as its variable input the vector v for which we want to compute A*v, and C is the fixed, known vector needed for the assembly of A.
My problem is that I can't figure out how to allow the LinearOperator class to accept two inputs, one variable and one "fixed" like I can in MATLAB.
Is there a way to do the analogous operation in SciPy? Alternatively, if anyone knows of a better way of inverting a large, dense matrix (50000, 50000) where the entries follow a pattern, I would greatly appreciate any suggestions.
Thanks!
EDIT: I should have stated this information up front. The matrix is actually (in block form) [A C; C^T 0], where A is N x N (N large), C is N x 3, the 0 is 3 x 3, and C^T is the transpose of C. This array C is the same array as the one mentioned above. The entries of A follow the pattern A[i,j] = f(C[i] - C[j]).
I wrote mv(v, C) to construct (A*v)[i] row by row for i = 0, ..., N, by computing sum_j f(C[i]-C[j])*v[j] (actually, I do numpy.dot(FC, v) where FC[j] = f(C[i]-C[j]), which works well), and then, at the end, doing the computations for the C^T rows. I was hoping to eventually replace the large for loop with a multiprocessing call to parallelize it, but that's a future thing to consider. I will also look into using Cython to speed up the computations.
This is very late, but if you're still interested...
Your A matrix must be very low rank, since it's a nonlinearly transformed version of a rank-2 matrix. Plus, it's symmetric. That means it's trivial to invert: get the truncated eigenvalue decomposition with, say, 5 eigenvalues: A = U*S*U', then invert that: A^-1 = U*S^-1*U'. S is diagonal, so this is inexpensive. You can get the truncated eigenvalue decomposition with eigh.
That takes care of A. Then for the rest: use the block matrix inversion formula. Looks nasty, but I will bet you 100,000,000 prussian francs that it's 50x faster than the direct method you were using.
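A rough sketch of the truncated-decomposition idea (the kernel f, the rank k, and the sizes are all assumptions made for illustration; scipy.sparse.linalg.eigsh computes the truncated eigendecomposition of a symmetric matrix):
import numpy as np
from scipy.sparse.linalg import eigsh

N, k = 1000, 5
C = np.random.rand(N, 3)
diffs = C[:, None, :] - C[None, :, :]
A = np.exp(-np.sum(diffs**2, axis=-1))      # A[i, j] = f(C[i] - C[j]), with a Gaussian f assumed here

S, U = eigsh(A, k=k)                        # k largest-magnitude eigenpairs of the symmetric A
A_inv_approx = U @ np.diag(1.0 / S) @ U.T   # inverse restricted to the retained subspace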
I faced the same situation (some years later than you) of trying to pass more than one argument to the function wrapped by LinearOperator, but for another problem. The solution I found was to use global variables, so the extra data does not have to be passed as an argument to the function.
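A minimal sketch of that idea (the kernel f and the sizes are illustrative assumptions; the fixed vector C is simply picked up from the enclosing module scope instead of being passed to matvec):
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

N = 500
C = np.random.rand(N)                       # the fixed 'extra' input, visible globally

def mv(v):
    # row-by-row product with A[i, j] = f(C[i] - C[j]); f = exp(-|.|) purely for illustration
    return np.array([np.exp(-np.abs(C[i] - C)) @ v for i in range(N)])

A_op = LinearOperator((N, N), matvec=mv)    # mv sees C without it being an argument
b = np.random.rand(N)
x, info = gmres(A_op, b)
print(info)                                 # 0 means GMRES converged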
I am trying to port code from LabVIEW to Python.
In LabVIEW there is a function, "Integral x(t) VI", that takes a set of samples as input, performs a discrete integration of the samples, and returns a list of values (the areas under the curve) according to Simpson's rule.
I tried to find an equivalent function in scipy, e.g. scipy.integrate.simps, but those functions return the summed integral across the set of samples, as a float.
How do I get the list of integrated values as opposed to the sum of the integrated values?
Am I just looking at the problem the wrong way around?
I think you may be using scipy.integrate.simps slightly incorrectly. The area returned by scipy.integrate.simps is the total area under y (the first parameter passed). The second parameter is optional and gives the sample points for the x-axis (the actual x values for each of the y values), i.e.:
>>> import numpy as np
>>> import scipy.integrate
>>> a=np.array([1,1,1,1,1])
>>> scipy.integrate.simps(a)
4.0
>>> scipy.integrate.simps(a,np.array([0,10,20,30,40]))
40.0
I think you want to return the areas under the same curve between different limits? To do that you pass the part of the curve you want, like this:
>>> a=np.array([0,1,1,1,1,10,10,10,10,0])
>>> scipy.integrate.simps(a)
44.916666666666671
>>> scipy.integrate.simps(a[:5])
3.6666666666666665
>>> scipy.integrate.simps(a[5:])
36.666666666666664
There is only one method in SciPy that does cumulative integration, scipy.integrate.cumtrapz(), and it does what you want as long as you don't specifically need Simpson's rule or another method. For those, you can, as suggested, always write the loop yourself.
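For example (using the sample data from the earlier answer purely for illustration; in newer SciPy versions the same function is named scipy.integrate.cumulative_trapezoid):
import numpy as np
from scipy.integrate import cumtrapz

y = np.array([0, 1, 1, 1, 1, 10, 10, 10, 10, 0], dtype=float)
x = np.arange(len(y))
running_area = cumtrapz(y, x, initial=0)   # running trapezoidal integral, same length as y
print(running_area)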