Using scipy to perform discrete integration of the sample - python

I am trying to port from LabVIEW to Python.
In LabVIEW there is a function "Integral x(t) VI" that takes a set of samples as input, performs a discrete integration of the samples, and returns a list of values (the areas under the curve) according to Simpson's rule.
I tried to find an equivalent function in SciPy, e.g. scipy.integrate.simps, but those functions return the summed integral across the set of samples as a single float.
How do I get the list of integrated values as opposed to the sum of the integrated values?
Am I just looking at the problem the wrong way around?

I think you may be using scipy.integrate.simps slightly incorrectly. The value returned by scipy.integrate.simps is the total area under y (the first parameter passed). The second parameter is optional and gives the sample positions on the x-axis (the actual x value for each of the y values), i.e.:
>>> import numpy as np
>>> import scipy.integrate
>>> a=np.array([1,1,1,1,1])
>>> scipy.integrate.simps(a)
4.0
>>> scipy.integrate.simps(a,np.array([0,10,20,30,40]))
40.0
I think you want to return the areas under the same curve between different limits? To do that you pass the part of the curve you want, like this:
>>> a=np.array([0,1,1,1,1,10,10,10,10,0])
>>> scipy.integrate.simps(a)
44.916666666666671
>>> scipy.integrate.simps(a[:5])
3.6666666666666665
>>> scipy.integrate.simps(a[5:])
36.666666666666664

There is only one function in SciPy that does cumulative integration, scipy.integrate.cumtrapz(), which does what you want as long as you don't specifically need Simpson's rule or another method. If you do, you can, as suggested, write the loop yourself; a sketch of both options follows.
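As a minimal sketch of both suggestions (note that newer SciPy releases rename cumtrapz and simps to cumulative_trapezoid and simpson):
import numpy as np
import scipy.integrate

y = np.array([0, 1, 1, 1, 1, 10, 10, 10, 10, 0], dtype=float)
x = np.arange(len(y), dtype=float)

# Trapezoidal cumulative integral: one running area per sample (starts at 0).
cum_trap = scipy.integrate.cumtrapz(y, x, initial=0)

# There is no built-in cumulative Simpson here, so integrate growing slices in a loop.
cum_simps = [0.0] + [scipy.integrate.simps(y[:i + 1], x[:i + 1])
                     for i in range(1, len(y))]
print(cum_trap)
print(cum_simps)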

Related

How to get cumulative distribution function from discrete numbers in python

I'm new to this topic, so this question may be dumb. I did some experiments; the results and their occurrences are listed below. I need to convert these discrete numbers into a probability distribution and a cumulative distribution (the x-axis is the result and the y-axis is the probability).
import pandas as pd
data = {'Result': [1, 2, 4, 6],
        'Occurrence': [2, 3, 4, 1],
        'Probability': [0.2, 0.3, 0.4, 0.1]}
df = pd.DataFrame(data)
Then I need to find the x corresponding to different probability levels in the cumulative distribution, say 50%, 60%, 80%, etc.
I did some research but cannot find the right Python package or function to achieve this. A package or function name would be good, and an example would be great. Thanks.
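A minimal sketch of one way to do this with pandas and NumPy (interpolating linearly between the discrete points is an assumption about what "the x corresponding to a probability level" should mean):
import numpy as np
import pandas as pd

data = {'Result': [1, 2, 4, 6],
        'Occurrence': [2, 3, 4, 1],
        'Probability': [0.2, 0.3, 0.4, 0.1]}
df = pd.DataFrame(data).sort_values('Result')

# Cumulative distribution: running sum of the probabilities (0.2, 0.5, 0.9, 1.0).
df['CDF'] = df['Probability'].cumsum()

# Result value for a given cumulative probability level, by linear interpolation.
for level in (0.5, 0.6, 0.8):
    print(level, np.interp(level, df['CDF'], df['Result']))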
Working with conditional, probability and cumulative distributions, you are going to have to use three different programming styles (lambda, functional and object-oriented) to represent conditional, probability and cumulative statistics.
Much like statistical maths, computational maths varies significantly in the scope of knowledge required.
For cumulative distributions it seems necessary to use lambda functions to represent the data.
Lambda functions allow you to create expressions for relationships.
Functions are just the sum of the steps that make two objects related, or perhaps the sum of conditions that make up a codomain.
You will need lambda functions to create anonymous functions that represent possible relationships while you deal with x and y; let a lambda represent the function definition in our case:
y = codomain
x = domain
f(x) = lambda function; we don't need to set the process just yet.
# 'lambda' is a keyword, not a module, so there is nothing to import
codomain = ''
anon_steps = 'unknown'

def my_function(domain):
    # named function: maps a domain value to a codomain value
    if anon_steps == domain:
        return 1
    return codomain

function_object = my_function('unknown')
# this is the functional relationship

# the same relationship expressed as an anonymous (lambda) function
variable = lambda domain: 1 if anon_steps == domain else codomain
In the above example, the lambda is the definition of the function and expresses the body of the function as a single expression.

Sum with Pythons uncertainties giving a different result than expected

A friend of mine is evaluating data with Python's uncertainties package. I am her statistics consultant, and I have come across a weird result in her code.
sum(array) and sqrt(sum(unumpy.std_devs(array)**2)) yield different results, the second one being the variance method as usually used in engineering.
Now, I know that the variance approach is only suited to cases where the error is small compared to the partial derivative (because of the Taylor series), which isn't the case here, but how does uncertainties handle this? And how can I reproduce what uncertainties does?
You forgot to square the standard error to make it the variance. This should work and be equal to the error of sum(array):
sqrt(sum(unumpy.std_devs(array)**2))
Then
import uncertainties as uc
from uncertainties import unumpy
import random
import math
a = [uc.ufloat(random.random(), random.random()) for _ in range(100)]
sa = unumpy.std_devs(sum(a))
sb = math.sqrt(sum(unumpy.std_devs(a)**2))
print(sa)
print(sb)
print(sa == sb)
Will result with something like
5.793714811166615
5.793714811166615
True
This happens because my array consists of AffineScalarFunc objects (as opposed to Variables), which store not only the value but also all the variables the value depends on [1].
Now, my values are not fully independent (which wasn't clear at all at first sight*), and thus sum(array) also takes the off-diagonal elements of my covariance matrix into account, in accordance with this formula (sorry that the article is in German, but the English Wikipedia's formula isn't as intuitive), whereas sqrt(sum(unumpy.std_devs(array)**2)) obviously doesn't and just adds up the diagonal elements.
A way to reproduce what uncertainties does is:
from math import sqrt
from uncertainties import covariance_matrix

cov = covariance_matrix(array)   # full covariance matrix, including off-diagonal terms
total = 0
for i in range(len(array)):
    for j in range(len(array)):
        total += cov[i][j]
print(sqrt(total))
And then unumpy.std_devs(sum(array)) == sqrt(total) is True.
*The correlation comes from using data taken from the same interpolation (of measurements) and from the fact that the length of a measurement was calculated as the difference of two times (the measurements were consecutive, so the times are correlated!).
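For completeness, a minimal sketch (not from the original exchange) of the correlation effect described above, using two fully correlated quantities:
import math
import uncertainties
from uncertainties import unumpy

a = uncertainties.ufloat(1.0, 0.1)
b = 2 * a                                       # b depends on a: fully correlated
arr = [a, b]

print(sum(arr).std_dev)                         # 0.3, includes the covariance term
print(math.sqrt(sum(unumpy.std_devs(arr)**2)))  # ~0.224, diagonal terms only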

python scipy.sparse.linalg.eigs giving different results for consecutive calls

I am trying to compute the spectral radius of a sparse matrix in python. This is what I have:
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
spec_radius = max(abs(eigs(w)[0]))
where the values of w get scaled to be in the range of [-1,1]. However, running that command gives a different result every time:
>>> print max(abs(eigs(w)[0]))
4.51859016293e-05
>>> print max(abs(eigs(w)[0]))
4.02309443625e-06
>>> print max(abs(eigs(w)[0]))
3.7611221426e-05
What gives? I would have thought it would be the same every time. Am I misunderstanding how these commands work?
Sorry for responding to an old question here, but the other answer is not quite satisfactory.
The randomness is not part of the ARPACK algorithm itself, but rather of its initialization. According to the scipy documentation, the starting vector v0 is random unless specified by the user. Sure enough, we see this (note that the setup is slightly different: the entries of w are scaled to be in [0,1]):
import numpy
from scipy.sparse.linalg import eigs
from scipy import sparse
w = sparse.rand(10, 10, 0.1)
w = w/w.max()
If we do not specify v0, we get some (slight) randomness:
>>> print max(abs(eigs(w)[0]))
0.00024188777676476916
>>> print max(abs(eigs(w)[0]))
0.00028073646868200566
>>> print max(abs(eigs(w)[0]))
0.00025250058038424729
>>> print max(abs(eigs(w)[0]))
0.00018183677959035711
but if we specify the initialization, we always get the same answer:
>>> print numpy.all([max(abs(eigs(w, v0 = numpy.ones(10))[0])) == 0.00026363015600771211 for k in range(1000)])
True
You apply the methods correctly and they will give you the same results if the absolute value of the largest eigenvalue is significantly larger than 0. The reason for the outcome you observe is based on the iterative nature of the algorithm that is used to determine the eigenvalues. From the documentation:
"This function is a wrapper to the ARPACK [R209] SNEUPD, DNEUPD, CNEUPD, ZNEUPD, functions which use the Implicitly Restarted Arnoldi Method to find the eigenvalues and eigenvectors [R210]." In case you are interested in details of this algorithm, you can find an explanation e.g. here.
As with all numerical methods, you can determine the desired value only to a certain precision. For eigenvalues that are significantly different from 0 you will always obtain the same output; for values that are close to 0 you might obtain different values, as you observed in the example above.
You can try to vary the parameters "maxiter" and "tol" (check the documentation cited above for details), which you can pass to "eigs". "maxiter" is the maximum number of Arnoldi update iterations allowed; by increasing it you should get more accurate results. "tol" is the relative accuracy for eigenvalues and the stopping criterion of the algorithm.
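For reference, a minimal sketch (not from the original answers) of fixing the initialization and tightening the tolerance so that repeated calls give the same spectral-radius estimate:
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigs

w = sparse.rand(1000, 1000, density=0.01, random_state=0)
w = w / w.max()   # scale the entries into [0, 1], as in the answer's setup

# A fixed starting vector v0 plus a tight tolerance gives reproducible results.
vals = eigs(w, k=6, v0=np.ones(w.shape[0]), maxiter=10000, tol=1e-12,
            return_eigenvectors=False)
print(max(abs(vals)))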

Accuracy of deriving the CDF using integration

I have two ways of deriving the probability of a normally (say) distributed random variable to be within an interval. The first and most straight-forward is the following:
import scipy.stats
print scipy.stats.norm.cdf(6) - scipy.stats.norm.cdf(5)
# 2.85664984223e-07
And the second is by integrating the pdf:
import scipy.integrate
print scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
# 2.85664984234e-07
The difference in this case is really tiny, but it doesn't mean it can't grow larger for other distributions or integration limits. Can you tell which is more accurate and why?
By the way, the first alternative seems to be at least 10 times faster, so if it is also more accurate (which would be my guess, since it is somewhat specialized), then it is perfect.
In this particular case, given those particular numbers, the quad approach will actually be more accurate. The CDF itself can be computed quickly and accurately, of course, but look at the actual numbers:
>>> scipy.stats.norm.cdf(6), scipy.stats.norm.cdf(5)
(0.9999999990134123, 0.99999971334842808)
When you're differencing two very similar quantities, you lose accuracy. Similar problems can be mitigated somewhat during integration if the coders are careful with their summations.
Anyway, we can check this against a high-resolution calculation using mpmath:
>>> via_cdf = scipy.stats.norm.cdf(6)-scipy.stats.norm.cdf(5)
>>> via_quad = scipy.integrate.quad(scipy.stats.norm.pdf, 5, 6)[0]
>>> import mpmath
>>> mpmath.mp.dps = 100
>>> def cdf(x): return 0.5 * (1 + mpmath.erf(x/mpmath.sqrt(2)))
>>> highres = cdf(6)-cdf(5)
>>> highres
mpf('0.0000002856649842341562135330514687422473118357532223619105443630157837185833042478210791954518847897468442097')
>>> float((highres - via_quad)/highres)
-2.3824773334590333e-16
>>> float((highres - via_cdf)/highres)
3.86659439572868e-11
The first calls an implementation of the cdf included in scipy.special; the latter actually does the integration. The cdf implementation itself is highly accurate, but differencing two values that are both so close to 1 throws most of that accuracy away, which is why the numerical integration comes out ahead in this case. In practice, unless you need results that are good to better than about 6 decimal places, you're probably fine with either.
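A small aside, not part of the original answer: for a tail interval like this one, the cancellation on the CDF side can also be avoided by differencing the survival function, whose two values here are roughly 1e-7 and 1e-9 rather than both being close to 1:
import scipy.stats

# P(5 < X < 6) via the upper tail; far fewer significant digits are lost.
print(scipy.stats.norm.sf(5) - scipy.stats.norm.sf(6))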

Scipy LinearOperator With Multiple Inputs

I need to invert a large, dense matrix, which I hoped to do with SciPy's gmres. Fortunately, the dense matrix A follows a pattern and I do not need to store it in memory. The LinearOperator class allows us to construct an object which acts as the matrix for GMRES and can compute the matrix-vector product A*v directly. That is, we write a function mv(v) which takes as input a vector v and returns mv(v) = A*v. Then we can use the LinearOperator class to create A_LinOp = LinearOperator(shape = shape, matvec = mv). We can pass the linear operator to SciPy's gmres command to evaluate the matrix-vector products without ever having to fully load A into memory.
The documentation for the LinearOperator is found here: LinearOperator Documentation.
Here is my problem: to write the routine to compute the matrix vector product mv(v) = A*v, I need another input vector C. The entries in A are of the form A[i,j] = f(C[i] - C[j]). So, what I really want is for mv to be of two inputs, one fixed vector input C, and one variable input v which we want to compute A*v.
MATLAB has a similar setup, where one would write x = gmres(@(v) mv(v,C), b), where b is the right-hand side of the problem Ax = b, mv is the function that takes the variable input v for which we want to compute A*v, and C is the fixed, known vector needed for the assembly of A.
My problem is that I can't figure out how to allow the LinearOperator class to accept two inputs, one variable and one "fixed" like I can in MATLAB.
Is there a way to do the analogous operation in SciPy? Alternatively, if anyone knows of a better way of inverting a large, dense matrix (50000, 50000) where the entries follow a pattern, I would greatly appreciate any suggestions.
Thanks!
EDIT: I should have stated this information up front. The matrix is actually (in block form) [A C; C^T 0], where A is N x N (N large), C is N x 3, the 0 is 3 x 3 and C^T is the transpose of C. This array C is the same array as the one mentioned above. The entries of A follow a pattern A[i,j] = f(C[i] - C[j]).
I wrote mv(v,C) to construct (A*v)[i] row by row for i = 0..N, by computing sum_j f(C[i]-C[j])*v[j] (actually, I do numpy.dot(FC, v), where FC[j] = f(C[i]-C[j]), which works well). Then, at the end, I do the computations for the C^T rows. I was hoping to eventually replace the large for loop with a multiprocessing call to parallelize it, but that's a future thing to consider. I will also look into using Cython to speed up the computations.
This is very late, but if you're still interested...
Your A matrix must be very low rank, since it's a nonlinearly transformed version of a rank-2 matrix. Plus it's symmetric. That means it's trivial to invert: get the truncated eigenvalue decomposition with, say, 5 eigenvalues, A = U*S*U', then invert that: A^-1 = U*S^-1*U'. S is diagonal, so this is inexpensive. You can get the truncated eigenvalue decomposition with eigh.
That takes care of A. Then for the rest: use the block matrix inversion formula. It looks nasty, but I will bet you 100,000,000 Prussian francs that it's 50x faster than the direct method you were using.
I faced the same situation (some years later than you) of needing to pass more than one argument to LinearOperator, though for another problem. The solution I found was to use global variables, so as to avoid passing the extra variables as arguments to the matvec function; a closure that captures the fixed data works as well, as sketched below.
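A minimal sketch of that closure variant; the kernel f, the sizes and the random C are placeholders, and this covers only the plain A*v case, not the [A C; C^T 0] block structure from the EDIT:
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def f(d):
    # hypothetical kernel applied to the differences C[i] - C
    return np.exp(-np.sum(d**2, axis=-1))

def make_operator(C):
    n = C.shape[0]
    def mv(v):
        # C is captured from the enclosing scope, so mv only needs v
        return np.array([np.dot(f(C[i] - C), v) for i in range(n)])
    return LinearOperator((n, n), matvec=mv)

C = np.random.rand(100, 3)
A_LinOp = make_operator(C)
b = np.random.rand(100)
x, info = gmres(A_LinOp, b)
print(info)   # 0 means GMRES converged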
