I've been trying to fit the GEV distribution to some annual maximum river discharge data using SciPy's stats.genextreme function, but I've found some odd behavior in the fit. Depending on how small your data values are (e.g., on the order of 1e-5 vs. 1e-1), the returned shape parameter can be dramatically different. For example:
import scipy
import numpy as np
from scipy.stats import genextreme as gev
from scipy.stats import gumbel_r as gumbel
#Set up arrays of values to fit curve to
sample=np.random.rand(1,30) #Random set of decimal values
smallVals = sample*1e-5 #Scale to smaller values
#If the above is not creating different values, this instance of random numbers has:
bugArr = np.array([[0.25322987, 0.81952358, 0.94497455, 0.36295543, 0.72272746, 0.49482558,0.65674877, 0.40876558, 0.64952248, 0.23171052, 0.24645658, 0.35359126,0.27578928, 0.24820775, 0.69789187, 0.98876361, 0.22104156,0.40019593,0.0756707, 0.12342556, 0.3601186, 0.54137089,0.43477705, 0.44622486,0.75483338, 0.69766687, 0.1508741, 0.75428996, 0.93706003, 0.1191987]])
bugArr_small = bugArr*1e-5
#This array of random numbers gives the same shape parameter regardless
fineArr = np.array([[0.7449611, 0.82376693, 0.32601009, 0.18544293, 0.56779629, 0.30495415,
0.04670362, 0.88106521, 0.34013959, 0.84598841, 0.24454428, 0.57981437,
0.57129427, 0.8857514, 0.96254429, 0.64174078, 0.33048637, 0.17124045,
0.11512589, 0.31884749, 0.48975204, 0.87988863, 0.86898236, 0.83513966,
0.05858769, 0.25889509, 0.13591874, 0.89106616, 0.66471263, 0.69786708]])
fineArr_small = fineArr*1e-5
#GEV fit for both arrays - shouldn't dramatically change distribution
gev_fit = gev.fit(sample)
gevSmall_fit = gev.fit(smallVals)
gevBug = gev.fit(bugArr)
gevSmallBug = gev.fit(bugArr_small)
gevFine = gev.fit(fineArr)
gevSmallFine = gev.fit(fineArr_small)
I get the following output for the GEV parameters estimated for the bugArr/bugArr_small and fineArr/fineArr_small:
Known bug array
Random values: (0.12118250540401079, 0.36692231766996053, 0.23142400358716353)
Random values scaled: (-0.8446554391074808, 3.0751769299431084e-06, 2.620390405092363e-06)
Known fine array
Random values: (0.6745399522587823, 0.47616297212022757, 0.34117425062278584)
Random values scaled: (0.6745399522587823, 4.761629721202293e-06, 3.411742506227867e-06)
Why would the shape parameter change so dramatically when the only difference in the data is a change in scaling? I would've expected the behavior to be consistent with the fineArr results (no change in the shape parameter, with the location and scale parameters scaling appropriately). I've repeated the test in Matlab, and the results there are in line with what I expected (i.e., no change in shape parameter).
I think I know why this might be happening. It is possible to pass initial shape parameter estimates when fitting; see the documentation for scipy.stats.rv_continuous.fit, which states: "Starting value(s) for any shape-characterizing arguments (those not provided will be determined by a call to _fitstart(data)). No default value." Here is some extremely ugly but functional code using my pyeq3 statistical distribution fitter, which internally tries several starting estimates, fits each one, and returns the parameters from the fit with the best nnlf. This example does not show the behavior you observe, and gives the same shape parameter regardless of scaling. You will need to install pyeq3 with "pip3 install pyeq3" to run this code. The pyeq3 code is designed for text input from a web interface on zunzun.com, so hold your nose - here is the example code:
import numpy as np
#Set up arrays of values to fit curve to
sample=np.random.rand(1,30) #Random set of decimal values
smallVals = sample*1e-5 #Scale to smaller values
#If the above is not creating different values, this instance of random numbers has:
bugArr = np.array([0.25322987, 0.81952358, 0.94497455, 0.36295543, 0.72272746, 0.49482558,0.65674877, 0.40876558, 0.64952248, 0.23171052, 0.24645658, 0.35359126,0.27578928, 0.24820775, 0.69789187, 0.98876361, 0.22104156,0.40019593,0.0756707, 0.12342556, 0.3601186, 0.54137089,0.43477705, 0.44622486,0.75483338, 0.69766687, 0.1508741, 0.75428996, 0.93706003, 0.1191987])
bugArr_small = bugArr*1e-5
#This array of random numbers gives the same shape parameter regardless
fineArr = np.array([0.7449611, 0.82376693, 0.32601009, 0.18544293, 0.56779629, 0.30495415,
0.04670362, 0.88106521, 0.34013959, 0.84598841, 0.24454428, 0.57981437,
0.57129427, 0.8857514, 0.96254429, 0.64174078, 0.33048637, 0.17124045,
0.11512589, 0.31884749, 0.48975204, 0.87988863, 0.86898236, 0.83513966,
0.05858769, 0.25889509, 0.13591874, 0.89106616, 0.66471263, 0.69786708])
fineArr_small = fineArr*1e-5
bugArr_str = ''
for i in range(len(bugArr)):
    bugArr_str += str(bugArr[i]) + '\n'
bugArr_small_str = ''
for i in range(len(bugArr_small)):
    bugArr_small_str += str(bugArr_small[i]) + '\n'
fineArr_str = ''
for i in range(len(fineArr)):
    fineArr_str += str(fineArr[i]) + '\n'
fineArr_small_str = ''
for i in range(len(fineArr_small)):
    fineArr_small_str += str(fineArr_small[i]) + '\n'
import pyeq3
simpleObject_bugArr = pyeq3.IModel.IModel()
simpleObject_bugArr._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(bugArr_str, simpleObject_bugArr, False)
solver = pyeq3.solverService()
result_bugArr = solver.SolveStatisticalDistribution('genextreme', simpleObject_bugArr.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_bugArr_small = pyeq3.IModel.IModel()
simpleObject_bugArr_small._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(bugArr_small_str, simpleObject_bugArr_small, False)
solver = pyeq3.solverService()
result_bugArr_small = solver.SolveStatisticalDistribution('genextreme', simpleObject_bugArr_small.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_fineArr = pyeq3.IModel.IModel()
simpleObject_fineArr._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(fineArr_str, simpleObject_fineArr, False)
solver = pyeq3.solverService()
result_fineArr = solver.SolveStatisticalDistribution('genextreme', simpleObject_fineArr.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
simpleObject_fineArr_small = pyeq3.IModel.IModel()
simpleObject_fineArr_small._dimensionality = 1
pyeq3.dataConvertorService().ConvertAndSortColumnarASCII(fineArr_small_str, simpleObject_fineArr_small, False)
solver = pyeq3.solverService()
result_fineArr_small = solver.SolveStatisticalDistribution('genextreme', simpleObject_fineArr_small.dataCache.allDataCacheDictionary['IndependentData'][0], 'nnlf')
print('ba',result_bugArr[1]['fittedParameters'])
print('ba_s',result_bugArr_small[1]['fittedParameters'])
print()
print('fa',result_fineArr[1]['fittedParameters'])
print('fa_s',result_fineArr_small[1]['fittedParameters'])
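Separately from pyeq3, here is a minimal sketch of the same idea using scipy alone: fit() accepts a starting guess for the shape parameter as a positional argument after the data, plus loc and scale keyword starting values. The value 0.1 below is just an arbitrary illustrative guess, and whether a seeded fit stays stable under rescaling is something you would need to verify on your own data.
import numpy as np
from scipy.stats import genextreme as gev

data = np.random.rand(30) * 1e-5
shape0 = 0.1  # arbitrary illustrative starting guess for the shape parameter
print(gev.fit(data))  # scipy chooses starting values via _fitstart
print(gev.fit(data, shape0, loc=data.mean(), scale=data.std()))  # user-supplied starting values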
Related
I am trying to solve a second-order ODE with solve_bvp. I have split the second-order ODE into a system of two first-order ODEs. I have a set of constants that changes with the x (mesh) value, so I am passing these as an array of shape (N,) into my function numdens. When I run solve_bvp I get the error that the returns have different shapes, namely (N,) and (N-1,), and thus cannot be broadcast into one array. But when I check each return manually outside of the function, it has shape (N,).
If I run the solver without my changing constants, I get a solution close to the correct one.
import numpy as np
from scipy.integrate import solve_bvp,odeint
import matplotlib.pyplot as plt
E_0 = 1 * 0.0000016021773 #erg: gcm^2/s^2
m_H = 1.6*10**(-24) #g
c = 3e11 #cm
sigma_c = 2*10**(-23)
n_0 = 1*10**(20) #1/cm^3
v_0 = (2*E_0/m_H)**(0.5) #cm/s
T = 10**7
b = 20.3
n_eq = b*T**3
n_s = 2.03*10**(19)
Q = 1
def velocity(v,x):
    dvdx = -sigma_c*n_0*v_0*((8*v_0*v-7*v**2-v_0**2)/(2*v*c))
    return dvdx
n_num = 100
x_num = np.linspace(-1*10**(6),3*10**(6), n_num)
sol_velo = odeint(velocity,0.999999999999*v_0,x_num)
sol_new = np.reshape(sol_velo,n_num)
def constants(v):
    D1 = (c*v/(3*n_0*v_0*sigma_c))
    D2 = ((v**2-8*v_0*v+v_0**2)/(6*v))
    D3 = sigma_c*n_0*v_0*((8*v_0*v-7*v**2-v_0**2)/(2*v*c))
    return D1,D2,D3
def numdens(x,y):
    v = sol_new
    D1,D2,D3 = constants(v)
    return np.vstack((y[1],(-D2*y[1]-D3*y[0]+Q*((1-y[0])/n_eq))/(D1)))
def bc_num(ya, yb):
    return np.array([ya[0]-n_s,yb[0]-n_eq])
y_num = np.array([np.linspace(n_s, n_eq, n_num),np.linspace(n_s, n_eq, n_num)])
sol_num = solve_bvp(numdens, bc_num, x_num, y_num)
plt.plot(sol_num.x, sol_num.y[0], label='$n(x)$')
plt.plot(x_num, sol_velo-v_0/7, label='$v(x)$')
plt.yscale('log')
plt.grid(alpha=0.5)
plt.legend(framealpha=1)
plt.show()
You need to take into account that the BVP solver uses an adaptive mesh. That is, after refining the initial guess on the initial grid, the solver identifies regions with overly large errors and creates new mesh nodes there. As far as I have seen, the opposite is not implemented, even though in some applications it might be sensible to reduce the number of mesh nodes on especially "nice" segments.
Thus what you are doing in the numdens function does not make sense; it has to work exactly like any other function that you would pass to an ODE solver. If I had to propose a quick fix, without knowing what the underlying problem is that you want to solve, I would change the assignment of v to
v = np.interp(x, x_num, sol_new)
as that should at least produce an array of the correct format.
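So a minimal sketch of the adjusted numdens, reusing x_num, sol_new, constants, Q and n_eq from the question, would be:
def numdens(x, y):
    # interpolate the precomputed velocity onto the solver's current (adaptive) mesh,
    # so the returned array always matches x in shape
    v = np.interp(x, x_num, sol_new)
    D1, D2, D3 = constants(v)
    return np.vstack((y[1], (-D2*y[1] - D3*y[0] + Q*((1 - y[0])/n_eq)) / D1))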
I'm trying to run a K-S test on some data. I have the code working, but I'm not sure I understand what's going on, and I also get an error when trying to set the loc. Essentially I get both the K-S statistic and the p-value, but I'm not sure I grasp them well enough to use the result.
I'm using the scipy.stats.ks_2samp function found here.
This is the code I am running
import numpy as np
from scipy import stats
np.random.seed(12345678) #fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
Which gives this:
K-S Statistics 0.04507948306145837
P-value 0.8362207851676332
In the examples I've seen, the loc is added in like this:
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
rvs2 = stats.norm.rvs(high_ni_sample[:,0], size=n2, loc=0.5, scale=scale)
If I do that however, I get this error:
Traceback (most recent call last):
File "<ipython-input-342-aa890a947919>", line 13, in <module>
rvs1 = stats.norm.rvs(low_ni_sample[:,0], size=n1, loc=0., scale=scale)
File "/home/kongstad/anaconda3/envs/tensorflow/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py", line 937, in rvs
args, loc, scale, size = self._parse_args_rvs(*args, **kwds)
TypeError: _parse_args_rvs() got multiple values for argument 'loc'
Here is a snapshot, showing the content of the two datasets being used.
low_ni_sample, high_ni_sample.
So my questions are:
Why can't I add a loc value, and what does it represent?
Changing the scale changes the result significantly; why, and what value should I go by?
How would I plot this out in such a way it makes sense?
After running Silma's suggestion I stumbled upon a new error.
import numpy as np
from scipy import stats
np.random.seed(12345678) #fix random seed to get the same result
n1 = len(low_ni_sample) # size of first sample
n2 = len(high_ni_sample) # size of second sample
# Scale is standard deviation
scale = 3
ndist = stats.norm(loc=0., scale=scale)
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
rvs2 = ndist.rvs(high_ni_sample[:,0],size=n2)
#rvs1 = stats.norm.rvs(low_ni_sample[:,2], size=n1, scale=scale)
#rvs2 = stats.norm.rvs(high_ni_sample[:,2], size=n2, scale=scale)
ksresult = stats.ks_2samp(rvs1, rvs2)
ks_val = ksresult[0]
p_val = ksresult[1]
print('K-S Statistics ' + str(ks_val))
print('P-value ' + str(p_val))
With this error message
rvs1 = ndist.rvs(low_ni_sample[:,0],size=n1)
TypeError: rvs() got multiple values for argument 'size'
The error comes from the fact that you should first create an instance of the normal distribution before using it:
ndist = stats.norm(loc=0., scale=scale)
then do
rvs1 = ndist.rvs(size=n1)
to generate n1 samples drawn from a normal distribution centered on 0 and with a standard deviation scale.
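Putting it together, a minimal sketch of the corrected call sequence (assuming n1 and n2 are defined from your samples as in your code):
import numpy as np
from scipy import stats

np.random.seed(12345678)
scale = 3
ndist = stats.norm(loc=0., scale=scale)  # frozen normal: mean 0, standard deviation 3
rvs1 = ndist.rvs(size=n1)
rvs2 = ndist.rvs(size=n2)
ks_stat, p_val = stats.ks_2samp(rvs1, rvs2)
print('K-S Statistics ' + str(ks_stat))
print('P-value ' + str(p_val))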
The location is therefore the mean of your distribution.
Changing the scale changes the standard deviation (and hence the variance) of your distribution, so you get more variability; this obviously impacts the KS test...
As for the plot, I'm not sure I see what you mean... if you want to plot the histograms, then do
import matplotlib.pyplot as plt
plt.hist(rvs1)
plt.show()
Or even better, install seaborn and use its distplot method, for instance with a KDE.
Overall I would advise you to try to read a little more on distributions and KS tests before you go any further, see for instance the wikipedia page.
EDIT
The code shown above is used to generate random samples from a normal distribution (which I assumed was your goal, to compare with your samples).
If what you want to do is directly compare your two sample data, then all you need is
ksresult = stats.ks_2samp(low_ni_sample[:,0], high_ni_sample[:,0])
again, this assumes that low_ni_sample[:,0] and high_ni_sample[:,0] are 1D arrays containing many measurements of the quantity of interest, cf. the ks_2samp documentation
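On the plotting question: one view that relates directly to the K-S statistic is to overlay the two empirical CDFs, since the statistic is the largest vertical gap between them. A minimal sketch, assuming the two 1D samples from above (the labels are purely illustrative):
import numpy as np
import matplotlib.pyplot as plt

def ecdf(a):
    # empirical CDF: sorted values against the cumulative fraction of points
    x = np.sort(a)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x1, y1 = ecdf(low_ni_sample[:, 0])
x2, y2 = ecdf(high_ni_sample[:, 0])
plt.step(x1, y1, label='low ni sample')
plt.step(x2, y2, label='high ni sample')
plt.ylabel('cumulative fraction')
plt.legend()
plt.show()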
I am currently trying to implement change point detection using this guide: http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb
It uses a switch statement to decide between the parameters of distributions for before and after the change point.
lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)
I am also trying to find a changepoint, but using data that is assumed to come from a multivariate distribution.
Here is my code:
tau = pm.Uniform("tau_", lower = x_data[0], upper = x_data[-1])
mus_1 = pm.Uniform("mus1", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_2 = pm.Uniform("mus2", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_ = pm.math.switch(tau > x_data, mus_1, mus_2)
I set the shape to 10 because the assumed distribution is a multivariate normal with 10 variables.
I assumed that the switch statement would assign the shape-10 random variable element-wise to the x_data (7919 points).
However, I get the following error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 7919, input[1].shape[0] = 10)
It seems like the switch statement only allows you to switch between one-dimensional random variables. How do I work around this?
I don't have access to the rest of your model, but I ran into this same issue and was able to develop a workaround by indexing into mus1 and mus2 inside the switch call. So, assuming you have some index array idx, the code would look as follows:
tau = pm.Uniform("tau_", lower = x_data[0], upper = x_data[-1])
mus_1 = pm.Uniform("mus1", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_2 = pm.Uniform("mus2", lower = min(y_data[0]), upper = max(y_data[0]), shape = 10)
mus_ = pm.math.switch(tau > x_data, mus_1[idx], mus_2[idx])
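Note that idx is not defined in the original question; it has to be an integer array of length len(x_data) saying which of the 10 components each observation belongs to. Purely as a hypothetical illustration, if the observations simply cycled through the 10 variables it could be built as:
import numpy as np
idx = np.arange(len(x_data)) % 10  # hypothetical mapping: observation i -> component i mod 10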
In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
The following pyMC3 code shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm
#Predefine values on two parameter Grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size= 20
gridlength = size*size
Grid = np.empty((gridlength,2+len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on grid.
        Grid[x*size+w,:]= np.array([x,w]+[(x**i + w**i) for i in idata])
# A function to find the nearest value in Grid and return its product with third variable z
def FindFromGrid(x,w,z):
    return Grid[int(x)*size+int(w),2:] * z
#Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror # ie. True x= 16, w= 12 and z= 3.6
with pm.Model() as model:
    #Priors
    x = pm.Uniform('x',lower=0,upper= size)
    w = pm.Uniform('w',lower=0,upper =size)
    z = pm.Uniform('z',lower=-5,upper =10)
    #Expected value
    y_hat = pm.Deterministic('y_hat',FindFromGrid(x,w,z))
    #Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like',mu= y_hat, sd=ysigmas, observed=ydata)
    # Inference...
    start = pm.find_MAP() # Find starting value by optimization
    step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False) # draw 1000 posterior samples using NUTS sampling
print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z':3.6})
fig.show()
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x,w,z) needs an integer, not a FreeRV.
Finding y_hat from a pre-calculated grid is important because my real model for y_hat cannot be expressed in an analytical form.
I have earlier tried to use OpenBUGS, but I found out here that it is not possible to do this in OpenBUGS. Is it possible in PyMC?
Update
Based on an example in the pyMC GitHub page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function:
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],otypes=[t.dvector])
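For reference, a sketch of what the decorated lookup might look like written against Theano directly (this assumes theano.tensor is imported as t and mirrors the snippet above; it is not tested code):
import theano.tensor as t
from theano.compile.ops import as_op

@as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
def FindFromGrid(x, w, z):
    # look up the nearest precomputed grid entry and scale it by z
    return Grid[int(x)*size + int(w), 2:] * z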
This seems to solve the above-mentioned issue, but I cannot use the NUTS sampler anymore since it needs gradients.
Metropolis does not seem to converge.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason it might not converge is that Metropolis() by default samples in the joint space, while Gibbs-within-Metropolis is often more effective (and was the default in pymc2). Having said that, I just merged https://github.com/pymc-devs/pymc/pull/587, which changes the default behavior of the Metropolis and Slice samplers to be non-blocked (i.e., Gibbs-style). Other samplers like NUTS, which are primarily designed to sample the joint space, still default to blocked. You can always set this explicitly with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
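For example, a sketch only (the blocked keyword follows the pull request mentioned above and may behave differently across pymc versions):
with model:
    step = pm.Metropolis(blocked=False)  # Gibbs-style updates, one variable at a time
    # step = pm.Slice()                  # alternative if Metropolis still struggles
    trace = pm.sample(5000, step=step, start=start)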
I'm trying to improve the speed of a function that calculates the normalized cross-correlation between a search image and a template image by using the anfft module, which provides Python bindings for the FFTW C library and seems to be ~2-3x quicker than scipy.fftpack for my purposes.
When I take the FFT of my template, I need the result to be padded to the same size as my search image so that I can convolve them. Using scipy.fftpack.fftn I would just use the shape parameter to do padding/truncation, but anfft.fftn is more minimalistic and doesn't do any zero-padding itself.
When I try to do the zero-padding myself, I get a very different result from what I get using shape. This example uses just scipy.fftpack, but I have the same problem with anfft:
import numpy as np
from scipy.fftpack import fftn
from scipy.misc import lena
img = lena()
temp = img[240:281,240:281]
def procrustes(a,target,padval=0):
    # Forces an array to a target size by either padding it with a constant or
    # truncating it
    b = np.ones(target,a.dtype)*padval
    aind = [slice(None,None)]*a.ndim
    bind = [slice(None,None)]*a.ndim
    for dd in xrange(a.ndim):
        if a.shape[dd] > target[dd]:
            diff = (a.shape[dd]-b.shape[dd])/2.
            aind[dd] = slice(np.floor(diff),a.shape[dd]-np.ceil(diff))
        elif a.shape[dd] < target[dd]:
            diff = (b.shape[dd]-a.shape[dd])/2.
            bind[dd] = slice(np.floor(diff),b.shape[dd]-np.ceil(diff))
    b[bind] = a[aind]
    return b
# using scipy.fftpack.fftn's shape parameter
F1 = fftn(temp,shape=img.shape)
# doing my own zero-padding
temp_padded = procrustes(temp,img.shape)
F2 = fftn(temp_padded)
# these results are quite different
np.allclose(F1,F2)
I suspect I'm probably making a very basic mistake, since I'm not overly familiar with the discrete Fourier transform.
Just do the inverse transform and you'll see that scipy does slightly different padding (only to top and right edges):
from scipy.fftpack import ifftn
import matplotlib.pyplot as plt
plt.imshow(ifftn(fftn(procrustes(temp,img.shape))).real)
plt.imshow(ifftn(fftn(temp,shape=img.shape)).real)
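A related check (a sketch only, assuming fftpack's shape argument pads by appending zeros at the trailing edge of each axis): pad the template yourself only at the trailing edges, rather than centring it as procrustes does, and compare against F1.
# hypothetical trailing-edge padding; np.pad appends zeros at the end of each axis
pad_width = [(0, si - st) for si, st in zip(img.shape, temp.shape)]
temp_trail = np.pad(temp, pad_width, mode='constant')
F3 = fftn(temp_trail)
print(np.allclose(F1, F3))  # expected True if the trailing-edge assumption holds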