stats.rv_continuous slow when using custom pdf - python

Ultimately I am trying to visualise the copula between two PDFs which are estimated from data (both via a KDE). Suppose, for one of the KDEs, I have discrete x,y data stored in a tuple called data. I need to generate random variables with this distribution in order to perform the probability integral transform (and ultimately to obtain the uniform distribution). My methodology to generate random variables is as follows:
import scipy.stats as st
from scipy import interpolate, integrate
pdf1 = interpolate.interp1d(data[0], data[1])
class pdf1_class(st.rv_continuous):
    def _pdf(self,x):
        return pdf1(x)
pdf1_rv = pdf1_class(a = data[0][0], b= data[0][-1], name = 'pdf1_class')
pdf1_samples = pdf1_rv.rvs(size=10000)
However, this method is extremely slow. I also get the following warnings:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
IntegrationWarning: The occurrence of roundoff error is detected, which prevents
the requested tolerance from being achieved. The error may be
warnings.warn(msg, IntegrationWarning)
Is there a better way to generate the random variables?

As per the suggestion by @unutbu I implemented _cdf and _ppf, which makes the calculation of 10000 samples instantaneous. To do this I added the following to the above code:
# initial=0 keeps the cumulative integral the same length as data[0]
discrete_cdf1 = integrate.cumtrapz(y=data[1], x=data[0], initial=0)
cdf1 = interpolate.interp1d(data[0], discrete_cdf1)
ppf1 = interpolate.interp1d(discrete_cdf1, data[0])
I then add the following two methods to pdf1_class
    def _cdf(self,x):
        return cdf1(x)
    def _ppf(self,x):
        return ppf1(x)
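For reference, the same inverse-transform idea can be sketched without rv_continuous at all, using only NumPy. This is a minimal sketch, not the poster's code: make_sampler, grid and density are hypothetical names, and the triangular density is a made-up stand-in for the KDE output.

```python
import numpy as np

def make_sampler(x, pdf):
    """Build a sampler for a tabulated PDF via inverse-transform sampling."""
    x = np.asarray(x, dtype=float)
    pdf = np.asarray(pdf, dtype=float)
    # Cumulative trapezoid rule, starting at 0 so the CDF has the same length as x.
    steps = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # normalise so the CDF runs exactly from 0 to 1

    def sample(size, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform(0.0, 1.0, size)
        # Inverse CDF by linear interpolation of x as a function of cdf.
        return np.interp(u, cdf, x)

    return sample

# Hypothetical stand-in for the KDE output: a triangular density on [0, 2].
grid = np.linspace(0.0, 2.0, 201)
density = 1.0 - np.abs(grid - 1.0)

draw = make_sampler(grid, density)
samples = draw(10000, np.random.default_rng(0))
```

Because each draw is a single interpolation rather than a numerical integration plus root-finding, the cost no longer depends on the integrator's behaviour.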


What is the difference between sample() and rsample()?

When I sample from a distribution in PyTorch, both sample and rsample appear to give similar results:
import torch, seaborn as sns
x = torch.distributions.Normal(torch.tensor([0.0]), torch.tensor([1.0]))
When should I use sample(), and when should I use rsample()?
Using rsample allows for pathwise derivatives:
The other way to implement these stochastic/policy gradients would be to use the reparameterization trick from the rsample() method, where the parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. The reparameterized sample therefore becomes differentiable.
sample(): random sampling from the probability distribution. So, we cannot backpropagate, because it is random! (the computation graph is cut off).
See the source code of sample in torch.distributions.normal.Normal:
def sample(self, sample_shape=torch.Size()):
    shape = self._extended_shape(sample_shape)
    with torch.no_grad():
        return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
torch.normal returns a tensor of random numbers. Also, torch.no_grad() context prevents the computation graph from growing any further.
You see, we cannot backprop. The returned tensor of sample() contains just some numbers, not the whole computational graph.
So, what is rsample()?
By using rsample, we can backpropagate, because it keeps the computation graph alive.
How? By putting the randomness aside in a separate parameter. This is called the "reparameterization trick".
rsample: sampling using reparameterization trick.
There is eps in the source code:
def rsample(self, sample_shape=torch.Size()):
    shape = self._extended_shape(sample_shape)
    eps = _standard_normal(shape, dtype=self.loc.dtype, device=self.loc.device)
    return self.loc + eps * self.scale
    # `self.loc` is the mean and `self.scale` is the standard deviation.
eps is the separate parameter responsible for the randomness of the sampling.
Look at the return: mean + eps * standard deviation
eps does not depend on the parameters you want to differentiate with respect to.
So, now you can freely backpropagate(=differentiate) because eps does not change when the parameters change.
(If we change the parameters, the distribution of the reparameterized samples does change because self.loc and self.scale change, but the distribution of the eps does not change.)
Note that the randomness of the sampling comes from the random sampling of eps. There is no randomness in the computation graph itself: once eps has been sampled, its values are fixed and are treated as constants during backpropagation.
For example, in an implementation of the SAC (Soft Actor-Critic) algorithm in reinforcement learning, eps may consist of elements corresponding to a single minibatch of actions (and one action may consist of many elements).
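The difference can be checked directly. The following is a small sketch, assuming loc and scale are leaf tensors created with requires_grad=True (the variable names are illustrative only):

```python
import torch

# Parameters we want gradients for.
loc = torch.tensor([0.0], requires_grad=True)
scale = torch.tensor([1.0], requires_grad=True)
dist = torch.distributions.Normal(loc, scale)

s = dist.sample()   # drawn under torch.no_grad(): detached from loc and scale
r = dist.rsample()  # computed as loc + eps * scale: still attached to the graph

print(s.requires_grad)  # False
print(r.requires_grad)  # True

# Backpropagation works through the reparameterized sample only.
r.sum().backward()
print(loc.grad)    # tensor([1.]) since d(loc + eps * scale)/d(loc) = 1
print(scale.grad)  # equals the eps that was drawn for this sample
```

Calling backward() on a quantity computed from s instead would raise an error, because s carries no graph back to loc and scale.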

Internal working of scipy.integrate.ode

I'm using scipy.integrate.ode and would like to know what happens internally when I get the message UserWarning: zvode: Excess work done on this call. (Perhaps wrong MF.) 'Unexpected istate=%s' % istate))
This appears when I call ode.integrate(t1) for too large a t1, so I'm forced to use a for-loop and integrate my equation incrementally, which lowers the speed since the solver cannot use its adaptive step size very effectively. I already tried different methods and settings for the integrator. The maximal number of steps nsteps=100000 is already very large, but with this setting I still can't integrate up to 1000 in one call, which I would like to do.
The code I use is:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import ode
h_bar=0.658212 #reduced Planck's constant (meV*ps)
m0=0.00568563 #free electron mass (meV*ps**2/nm**2)
m_e=0.067*m0 #effective electron mass (meV*ps**2/nm**2)
m_h=0.45*m0 #effective hole mass (meV*ps**2/nm**2)
m_reduced=1/((1/m_e)+(1/m_h)) #reduced mass of electron and holes combined
kB=0.08617 #Boltzmann's constant (meV/K)
mu_e=-50 #initial chemical potential for electrons
mu_h=-100 #initial chemical potential for holes
k_array=np.arange(0,1.5,0.02) #a list of different k-values
n_k=len(k_array) #number of k-values
def derivative(t,y_list,Gamma,g,kappa,k_list,n_k):
    #initialize output vector
    return y_out
def dynamics(t_list,N_ini=1e-3, T=300, Gamma=1.36,kappa=0.02,g=0.095):
    #initial values
    t0=0 #initial time
    y_initial[0:n_k]=1/(1+np.exp(((h_bar*k_array)**2/(2*m_e)-mu_e)/(kB*T))) #Fermi-Dirac distributions
    t_list=t_list[1:] #remove t=0 from list (not feasible for integrator)
    r=ode(derivative).set_integrator('zvode',method='adams', atol=10**-6, rtol=10**-6,nsteps=100000) #define ode solver
    #create array for output (the +1 accounts for values at t0=0)
    #insert initial data in output array
    #perform integration for time steps given by t_list (the +1 accounts for the initial values already in the array)
    for i in range(len(t_list)):
        print(r't = %s' % t_list[i])
        if not (r.successful()):
            print('Integration not successful!!')
    return y_output
data=dynamics(t_list,N_ini=1e-3, T=300, Gamma=1.36,kappa=0.02,g=1.095)
The message means that the method reached the number of steps specified by nsteps parameter. Since you asked about internals, I looked into the Fortran source, which offers this explanation:
-1 means an excessive amount of work (more than MXSTEP steps) was done on this call, before completing the requested task, but the integration was otherwise successful as far as T. (MXSTEP is an optional input and is normally 500.)
The conditional statement that brings up the error is this "GO TO 500".
According to LutzL, for your ODE the solver chooses step size 2e-4, which means 5000000 steps to integrate up to 1000. Your options are:
try such a large value of nsteps (which translates to MXSTEP in the aforementioned Fortran routine)
relax the error tolerance (larger atol and rtol), so that the solver can take larger steps
run a for loop, as you already do.
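The effect of nsteps can be reproduced on a toy problem. This is a sketch under stated assumptions: the ODE y' = i*OMEGA*y, the value of OMEGA and the helper integrate_in_one_call are all made up for illustration and are unrelated to the equations in the question.

```python
import numpy as np
from scipy.integrate import ode

OMEGA = 50.0  # assumed oscillation frequency of the toy problem

def rhs(t, y):
    # Toy oscillator y' = i*OMEGA*y, whose exact solution is exp(i*OMEGA*t).
    return 1j * OMEGA * y

def integrate_in_one_call(t_end, nsteps):
    """Integrate from t=0 to t_end in a single call; return (y(t_end), success flag)."""
    solver = ode(rhs).set_integrator("zvode", method="adams",
                                     atol=1e-6, rtol=1e-6, nsteps=nsteps)
    solver.set_initial_value([1.0 + 0.0j], 0.0)
    solver.integrate(t_end)
    return solver.y[0], solver.successful()

# With few internal steps allowed, the call stops early and emits the
# "Excess work done" warning; with a generous limit the same call completes.
_, ok_small = integrate_in_one_call(100.0, nsteps=500)
y_end, ok_large = integrate_in_one_call(100.0, nsteps=1_000_000)
print(ok_small, ok_large)
```

nsteps limits the internal steps per integrate() call, which is why splitting the interval into a loop of shorter calls also avoids the warning.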

How can I do a least squares fit in python, using data that is only an upper limit?

I am trying to perform a least squares fit in python to a known function with three variables. I am able to complete this task for randomly generated data with errors, but the actual data that I need to fit includes some data points that are upper limits on the values. The function describes the flux as a function of wavelength, but in some cases the flux measured at the given wavelength is not an absolute value with an error but rather a maximum value for the flux, with the real value being anything below that down to zero.
Is there some way of telling the fitting task that some data points are upper limits? Additionally, I have to do this for a number of data sets, and the number of data points which could be upper limits is different for each one, so being able to do this automatically would be beneficial but not a necessity.
I apologise if any of this is unclear, I will endeavour to explain it more clearly if it is needed.
The code I am using to fit my data is included below.
import numpy as np
from scipy.optimize import leastsq
import math as math
import matplotlib.pyplot as plt
def f_all(x,p):
    return np.exp(p[0])/((x**(3+p[1]))*((np.exp(14404.5/((x*1000000)*p[2])))-1))
def residual(p,y,x,error):
    return err
p,cov,infodict,mesg,ier=leastsq(residual, p0, args = (flux, wavelength, errors), full_output=True)
print(p)
Scipy.optimize.leastsq is a convenient way to fit data, but the work underneath is the minimization of a function. Scipy.optimize contains many minimization functions, some of them able to handle constraints. Here I explain with fmin_slsqp, which I know; perhaps the others can do this as well (see the Scipy.optimize doc).
fmin_slsqp requires a function to minimize and an initial value for the parameters. The function to minimize is the sum of the squares of the residuals. For the initial values, I first perform a traditional leastsq fit and use its result as the starting point for the constrained minimization. There are then several ways to impose constraints (see doc); the simplest is the f_ieqcons parameter: it requires a function which returns an array whose values must all be non-negative (those are the constraints). Here the function returns non-negative values if, at every upper-limit point, the fit function lies below the point.
import numpy
import scipy.optimize as scimin
import matplotlib.pyplot as mpl
datax=numpy.array([1,2,3,4,5]) # data coordinates
constraintmaxx=numpy.array([0]) # list of maximum constraints
# least squares fit without constraints
def fitfunc(x,p): # model f(x)=a x^2+c
    a,c=p # assumed parameter order
    return c+a*x**2
def residuals(p): # array of residuals
    return datay-fitfunc(datax,p)
p0=[1,2] # initial parameters guess
pwithout,cov,infodict,mesg,ier=scimin.leastsq(residuals, p0,full_output=True) # traditional least squares fit
# least squares fit with constraints
def sum_residuals(p): # the function we want to minimize
    return sum(residuals(p)**2)
def constraints(p): # the constraints: all the values of the returned array will be >=0 at the end
    return constraintmaxy-fitfunc(constraintmaxx,p)
pwith=scimin.fmin_slsqp(sum_residuals,pwithout,f_ieqcons=constraints) # minimization with constraint
# plotting
ax.plot(constraintmaxx,constraintmaxy,ls="",marker="x",color="red",mew=2.0,label="Max points")
ax.plot(morex,fitfunc(morex,pwithout),color="blue",label="Fit without constraints")
ax.plot(morex,fitfunc(morex,pwith),color="red",label="Fit with constraints")
In this example I fit an imaginary sample of points on a parabola. Here is the result, without and with constraint (the red cross on left):
I hope this will do for your data sample; otherwise, please post one of your data files so that we can try with real data. I know my example does not take care of error bars on the data, but you can easily handle them by modifying the residuals function.
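As a self-contained variant of the same approach that also weights the residuals by the error bars, something like the following could be used. All numbers here are made up for illustration (they are not the data behind the figure), and the names x_meas, y_meas, y_err, x_lim and y_lim are hypothetical.

```python
import numpy as np
from scipy.optimize import leastsq, fmin_slsqp

# Illustrative, made-up measurements with error bars.
x_meas = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_meas = np.array([2.6, 3.9, 6.4, 10.1, 14.4])
y_err = np.full_like(y_meas, 0.3)
# Made-up upper limit: the true value at x = 0 lies somewhere below 1.5.
x_lim = np.array([0.0])
y_lim = np.array([1.5])

def model(x, p):
    a, c = p
    return a * x**2 + c

def weighted_residuals(p):
    # Residuals in units of the measurement error.
    return (y_meas - model(x_meas, p)) / y_err

def chi2(p):
    return np.sum(weighted_residuals(p) ** 2)

def upper_limit_slack(p):
    # SLSQP keeps every element of this array >= 0, i.e. model <= limit.
    return y_lim - model(x_lim, p)

# Unconstrained fit first, then use it as the starting point.
p_free, _ = leastsq(weighted_residuals, [1.0, 2.0])
p_limited = fmin_slsqp(chi2, p_free, f_ieqcons=upper_limit_slack, iprint=0)

print(model(x_lim, p_free), model(x_lim, p_limited))
```

Since the upper limits live in their own arrays, data sets with different numbers of limits only need different x_lim and y_lim; the fitting code stays the same.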

autocorrelation function of time series data with numpy

I have been trying to calculate an autocorrelation function, as defined in statistical mechanics, using numpy. Most of the documentation I found relates to functions like correlate and convolve. However, for a given random variable x these functions just seem to calculate the sum
ACF(dt) = sum_{t=0}^T [x(t)*x(t+dt)]
instead of the average
ACF(dt) = mean[x(t)*x(t+dt)]
so in fact for calculating an autocorrelation function one would need to do something like:
acf = np.correlate(x,x,mode='full')
acf_half = acf[acf.size // 2:]
ldata = len(acf_half)
acf = np.array([x/(ldata-i) for i,x in enumerate(acf_half)])
Of course we would need to subtract mean(x)**2 from the resulting acf to be correct.
Can anyone confirm that this is correct?
Generally speaking, the autocorrelation, correlation, etc. is the sum (integral). Sometimes it is normalized, but not averaged in the sense as you've written above. This is because they are defined in terms of the mathematical convolution operation, which is simply the integral that you've written as a sum above.
The brackets at the stat mech page indicate a thermal average, which is an ensemble or time average over the 'experiment' taking place many times at many different states at some temperature. This (the finite temperature) causes the fluctuations that give rise to the 'statistical' nature of the problem, and cause the decay of the correlation (loss of long range order). This simply means that you should find the autocorrelation of several datasets, and average those together, but do not take the mean of the function.
As far as I can tell, your code is attempting to weight the correlation at dt by the length of the overlap at lag dt, but I do not believe that this is correct.
With respect to the subtraction of <s>^2, that's in the case of the spin model, where <s> would be the mean spin (magnetization), so I believe you are correct in that you should use mean(x)**2.
As a side-note, I would suggest using mode='same' instead of 'full' so that the domain of your correlation matches the domain of your input without having to look at just one-half of the output (here the output is symmetric, so it doesn't really make a difference).
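To make the two normalisations discussed in this thread concrete, here is a small sketch that computes both from the same np.correlate output. The function names autocorr_sum and autocorr_mean are illustrative, and the sketch does not settle which convention a given application needs.

```python
import numpy as np

def autocorr_sum(x):
    """Raw sum over t of x(t)*x(t+dt) for lags dt = 0 .. N-1."""
    x = np.asarray(x, dtype=float)
    full = np.correlate(x, x, mode="full")
    # The 'full' output has length 2N-1; lag 0 sits at the centre.
    return full[full.size // 2:]

def autocorr_mean(x):
    """The same sums divided by the number of overlapping pairs, N - dt."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return autocorr_sum(x) / (n - np.arange(n))

example = [1.0, 2.0, 3.0, 4.0]
print(autocorr_sum(example))   # sums:  30, 20, 11, 4
print(autocorr_mean(example))  # means: 7.5, 6.667, 5.5, 4
```

Note that the divisor is based on the number of samples N, not on the length of the 'full' correlation output.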

Simple Automatic Classification of the (R-->R) Functions

Given data values of some real-time physical process (e.g. network traffic), the task is to find the name of the function which best matches the data.
I have a set of functions of type y=f(t) where y and t are real:
funcs = set([cos, tan, exp, log])
and a list of data values:
vals = [59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372]
What is the simplest possible method to find the function from the given set which generates the most similar values?
PS: t is increasing starting from some positive value by almost-equal intervals
Just write the error (the quadratic sum of the error at each point, for instance) for each function of the set and choose the function giving the minimum.
But you should still fit each function before choosing.
Scipy has functions for fitting data, but they use polynomials or splines. You can use one of Gauß' many discoveries, the method of least squares, to fit other functions.
I would try an approach based on fitting too. For each of the four test functions (f1-f4, see below), find the values of a and b that minimize the squared error.
f1(t) = a*cos(b*t)
f2(t) = a*tan(b*t)
f3(t) = a*exp(b*t)
f4(t) = a*log(b*t)
After fitting, the squared error of each of the four functions can be used to evaluate the goodness of fit (a low value means a good fit).
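One possible sketch of this fit-then-compare approach, using only NumPy: for every b on a grid, the best amplitude a has a closed-form least-squares solution, so no iterative optimizer is needed. The grid range for b, the helper name best_match and the choice t = 1, 2, ..., 5 are assumptions, since the question does not give the actual t values.

```python
import numpy as np

CANDIDATES = {"cos": np.cos, "tan": np.tan, "exp": np.exp, "log": np.log}

def best_match(t, y, b_grid=None):
    """Fit y ~ a*f(b*t) for each candidate f; return (best name, SSE per name)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    if b_grid is None:
        b_grid = np.linspace(0.05, 3.0, 60)  # assumed search range for b
    errors = {}
    for name, f in CANDIDATES.items():
        best = np.inf
        for b in b_grid:
            g = f(b * t)
            denom = g @ g
            if not np.isfinite(denom) or denom == 0.0:
                continue
            a = (g @ y) / denom  # closed-form least-squares amplitude for this b
            best = min(best, float(np.sum((y - a * g) ** 2)))
        errors[name] = best
    return min(errors, key=errors.get), errors

vals = [59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372]
t_assumed = np.arange(1, 6)  # assumption: equally spaced, starting at 1
name, errors = best_match(t_assumed, vals)
print(name, errors)
```

A finer or wider grid for b, or a proper nonlinear optimizer started from the best grid point, would refine the result.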
If fitting is not allowed at all, the four functions can be divided into two distinct subgroups: repeating functions (cos and tan) and strictly increasing functions (exp and log).
Strictly increasing functions can be identified by checking whether all the given values are increasing throughout the measuring interval.
In pseudo code an algorithm could be structured like
if(vals are strictly increasing):
    % Exp or log
    if(increasing more rapidly at the end of the interval):
        % exp detected
    else:
        % log detected
else:
    % tan or cos
    if(large changes in vals over a short period is detected):
        % tan detected
    else:
        % cos detected
Be aware that this method is not that stable and will be easy to trick into faulty conclusions.
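A runnable version of this heuristic might look as follows. It is only a sketch: the function name classify_by_shape and the jump_factor threshold (how much larger than the typical step a jump must be to count as "large") are assumptions, not part of the original answer.

```python
import math

def classify_by_shape(vals, jump_factor=5.0):
    """Guess 'exp', 'log', 'tan' or 'cos' from the shape of the values alone."""
    diffs = [b - a for a, b in zip(vals, vals[1:])]
    if all(d > 0 for d in diffs):
        # Strictly increasing: exp grows faster towards the end, log slower.
        return "exp" if diffs[-1] > diffs[0] else "log"
    # Not monotonic: tan shows occasional jumps far larger than a typical step.
    sizes = sorted(abs(d) for d in diffs)
    typical = sizes[len(sizes) // 2]  # median step size
    if typical > 0 and sizes[-1] > jump_factor * typical:
        return "tan"
    return "cos"

vals = [59874.141, 192754.791, 342413.392, 1102604.284, 3299017.372]
print(classify_by_shape(vals))  # exp
```

With only a handful of samples, or with noisy data, the checks above can easily misfire, which is the instability mentioned above.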
See Curve Fitting
