scipy.stats.kde and scipy.stats.kstest - python

How can I use scipy.stats.kde.gaussian_kde and scipy.stats.kstest together in a consistent way?
For example, the code:
from numpy import inf
import scipy.stats
my_pdf = scipy.stats.kde.gaussian_kde(sample)
scipy.stats.kstest(sample, lambda x: my_pdf.integrate_box_1d(-inf, x))
Gives the following answer:
(0.5396735893479544, 0.0)
This cannot be right, because the sample obviously comes from the very distribution that was constructed from it.

First of all, the right test to use for testing if two samples may have come from the same distribution is the two-sample KS test, implemented in scipy.stats.ks_2samp, which directly compares the empirical CDFs. KDE is density estimation, which smooths out the CDF, and is therefore a bunch of unnecessary work that also makes your estimate worse, statistically speaking.
But the reason you're seeing this problem is that the signature of your CDF callable isn't quite right. kstest calls cdf(vals), where vals is the sorted array of samples, to get the CDF value for each of your samples. In your code, this ends up calling my_pdf.integrate_box_1d(-np.inf, samps), but integrate_box_1d wants both arguments to be scalars. With most arrays it would simply crash with a ValueError:
>>> my_pdf.integrate_box_1d(-np.inf, samp[:10])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-81d0253a33bf> in <module>()
----> 1 my_pdf.integrate_box_1d(-np.inf, samp[:10])
/Library/Python/2.7/site-packages/scipy-0.12.0.dev_ddd617d_20120725-py2.7-macosx-10.8-x86_64.egg/scipy/stats/kde.pyc in integrate_box_1d(self, low, high)
311
312 normalized_low = ravel((low - self.dataset) / stdev)
--> 313 normalized_high = ravel((high - self.dataset) / stdev)
314
315 value = np.mean(special.ndtr(normalized_high) - \
ValueError: operands could not be broadcast together with shapes (10) (1,1000)
But unfortunately, when the second argument is the full samp array, the shapes happen to broadcast, and then everything goes to hell: you get back a single meaningless number instead of one CDF value per sample. Presumably integrate_box_1d should check the shape of its arguments, but here's one way to do it correctly:
>>> my_cdf = lambda ary: np.array([my_pdf.integrate_box_1d(-np.inf, x) for x in ary])
>>> scipy.stats.kstest(sample, my_cdf)
(0.015597917205996903, 0.96809912578616597)
You could also use np.vectorize if you felt like it.
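For instance, a minimal sketch of the np.vectorize version (same idea as the list comprehension above, just a different wrapper):
>>> my_cdf = np.vectorize(lambda x: my_pdf.integrate_box_1d(-np.inf, x))
>>> scipy.stats.kstest(sample, my_cdf)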
(But again, you probably actually want to use ks_2samp.)
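If you do have a second sample to compare against, a minimal sketch of the two-sample test might look like this (the normal draws here are just stand-ins for your real data):
>>> import numpy as np
>>> import scipy.stats
>>> a = np.random.normal(size=1000)   # hypothetical first sample
>>> b = np.random.normal(size=1000)   # hypothetical second sample
>>> statistic, pvalue = scipy.stats.ks_2samp(a, b)   # compares the empirical CDFs directly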

Error message in Python with differentiation

I am computing these derivatives using a Monte Carlo approach for a generic call option. I am interested in the combined derivative (with respect to both S and sigma). Doing this with algorithmic differentiation, I get the error shown at the end of the post. What could be a possible solution? Just to explain something regarding the code, the attached formula is the one used to compute the "X" in the code below.
from jax import jit, grad, vmap
import jax.numpy as jnp
from jax import random
import numpy as np

Underlying_asset = jnp.linspace(1.1, 1.4, 100)
volatilities = jnp.linspace(0.5, 0.6, 100)

def second_derivative_mc(S, vol):
    N = 100
    j, T, q, r, k = 10000, 1., 0, 0, 1.
    S0 = jnp.array([S]).T              #(Nx1) vector of the underlying asset
    C = jnp.identity(N)*vol            #matrix of volatilities with 0 outside the diagonal
    e = jnp.array([jnp.full(j, 1.)])   #(1xj) vector of ones
    Rand = np.random.RandomState()
    Rand.seed(10)
    U = Rand.normal(0, 1, (N, j))      #random numbers for the Brownian motion
    sigma2 = jnp.array([vol**2]).T     #(Nx1) vector of variances
    first = jnp.dot(sigma2, e)         #first part of the equation
    second = jnp.dot(C, U)             #second part of the equation
    X = -0.5*first + jnp.sqrt(T)*second
    St = jnp.exp(X)*S0
    P = jnp.maximum(St-k, 0)
    payoff = jnp.average(P, axis=-1)*jnp.exp(-q*T)
    return payoff

greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset, volatilities)
This is the error message:
UnfilteredStackTrace                      Traceback (most recent call last)
<ipython-input-78-0cc1da97ae0c> in <module>()
     25
---> 26 greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)

18 frames
UnfilteredStackTrace: TypeError: Gradient only defined for scalar-output functions. Output had shape: (100,).

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/_src/api.py in _check_scalar(x)
    894     if isinstance(aval, ShapedArray):
    895       if aval.shape != ():
--> 896         raise TypeError(msg(f"had shape: {aval.shape}"))
    897     else:
    898       raise TypeError(msg(f"had abstract value {aval}"))

TypeError: Gradient only defined for scalar-output functions. Output had shape: (100,).
As the error message indicates, gradients can only be computed for functions that return a scalar. Your function returns a vector:
print(len(second_derivative_mc(1.1, 0.5)))
# 100
For vector-valued functions, you can compute the jacobian (which is similar to a multi-dimensional gradient). Is this what you had in mind?
from jax import jacobian
greek = vmap(jacobian(jacobian(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
Also, this is not what you asked about, but the function above will probably not work as you intend even if you solve the issue in the question. NumPy RandomState objects are stateful, and thus will generally not work correctly with JAX transforms like grad, jit, and vmap, which require side-effect-free code (see Stateful Computations In JAX). You might try using jax.random instead; see JAX: Random Numbers for more information.
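For example, here is a minimal sketch of what the random-number part could look like with jax.random (the key value 10 mirrors the seed in the question, and N and j are the shapes from the question's code):
from jax import random

N, j = 100, 10000
key = random.PRNGKey(10)        # explicit, functional PRNG state instead of a stateful RandomState
U = random.normal(key, (N, j))  # draws for the Brownian motion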

`ValueError: A value in x_new is above the interpolation range.` - what other reasons than not ascending values?

I get this error from scipy's interp1d function. Normally, this error would be generated if x were not monotonically increasing.
import numpy as np
import scipy.interpolate as spi

def refine(coarsex, coarsey, step):
    finex = np.arange(min(coarsex), max(coarsex)+step, step)
    intfunc = spi.interp1d(coarsex, coarsey, axis=0)
    finey = intfunc(finex)
    return finex, finey

for num, tfile in enumerate(files):
    tfile = tfile.dropna(how='any')
    x = np.array(tfile['col1'])
    y = np.array(tfile['col2'])
    finex, finey = refine(x, y, 0.01)
The code seems correct, because it worked successfully on 6 data files and threw the error for the 7th, so there must be something wrong with the data. But as far as I can tell, the data increase all the way down the file.
I am sorry for not providing an example, because I am not able to reproduce the error on a small one.
There are two things that could help me:
1. Some brainstorming: if the data are indeed monotonically increasing, what else could produce this error? Another hint, regarding the decimals, could be in this question, but I think my solution (the min and max of x) is robust enough to avoid it. Or isn't it?
2. Is it possible (and how?) to return the value of x_new and its index when throwing the ValueError: A value in x_new is above the interpolation range., so that I could actually see where in the file the problem is?
UPDATE
So the problem is that, for some reason, max(finex) is larger than max(coarsex) (one is .x39 and the other is .x4). I hoped rounding the original values to 2 significant digits would solve the problem, but it didn't: it displays fewer digits but still computes with the ones it doesn't display. What can I do about it?
If you are running SciPy v0.17.0 or newer, you can pass fill_value='extrapolate' to spi.interp1d, and it will extrapolate to accommodate these values of yours that lie outside the interpolation range. So define your interpolation function like so:
intfunc = spi.interp1d(coarsex, coarsey,axis=0, fill_value="extrapolate")
Be forewarned, however!
Depending on what your data look like and the type of interpolation you are performing, the extrapolated values can be erroneous. This is especially true if you have noisy or non-monotonic data. In your case you might be OK, because your x_new value is only slightly beyond your interpolation range.
Here's a simple demonstration of how this feature can work nicely but also give erroneous results.
import scipy.interpolate as spi
import numpy as np
x = np.linspace(0,1,100)
y = x + np.random.randint(-1,1,100)/100
x_new = np.linspace(0,1.1,100)
intfunc = spi.interp1d(x,y,fill_value="extrapolate")
y_interp = intfunc(x_new)
import matplotlib.pyplot as plt
plt.plot(x_new,y_interp,'r', label='interp/extrap')
plt.plot(x,y, 'b--', label='data')
plt.legend()
plt.show()
So the interpolated portion (in red) worked well, but the extrapolated portion clearly fails to follow the otherwise linear trend in this data because of the noise. So have some understanding of your data and proceed with caution.
A quick test of your finex calculation shows that it can (always?) get into the extrapolation region.
In [124]: coarsex=np.random.rand(100)
In [125]: max(coarsex)
Out[125]: 0.97393109991816473
In [126]: step=.01; finex=np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[126]: (0.98273730602114795, 0.97393109991816473)
In [127]: step=.001; finex=np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[127]: (0.97473730602114794, 0.97393109991816473)
Again it is a quick test, and may be missing some critical step or value.
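If you would rather not extrapolate at all, another option (a minimal sketch, not from the answers above) is to build finex with np.linspace, which pins both endpoints to the data range, so finex can never overshoot max(coarsex) the way np.arange(min(coarsex), max(coarsex)+step, step) can:
import numpy as np
import scipy.interpolate as spi

def refine(coarsex, coarsey, step):
    # choose the number of points so the spacing is roughly `step`,
    # with the endpoints fixed at min(coarsex) and max(coarsex)
    npts = int(round((max(coarsex) - min(coarsex)) / step)) + 1
    finex = np.linspace(min(coarsex), max(coarsex), npts)
    intfunc = spi.interp1d(coarsex, coarsey, axis=0)
    finey = intfunc(finex)
    return finex, finey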

python numpy data type error and extremely inefficient use of pyplot :(

[Using Windows 10 and Python 3.5 with the newest modules]
Hello!
I have two slightly different problems that belong together, because one is the buggy solution of the other. The first function here is extremely slow with more than 75,000 data points and does not work at all with 150,000, but it does exactly what I want.
#I call the functions like this:
plt.plot(logtime[:recmax-(degree*2-1)] - (logtime[0]-degree), smoothListTriangle(cpm, degree),
         color="green", linewidth=2, label="Smoothed n="+str(degree))
plt.plot(logtime[:recmax] - logtime[0], smoothListGaussian2(str(cpm), degree),
         color="lime", linewidth=5, label="")
#And cpm is always:
cpm = cpm.astype(int)  #Array of a big number of values
def smoothListTriangle(cpm, degree):  #Thank you Scott from swharden.com!
    weight = []
    window = degree*2-1
    smoothed = [0.0]*(len(cpm)-window)
    for x in range(1, 2*degree):
        weight.append(degree-abs(degree-x))
    w = np.array(weight)
    for i in range(len(smoothed)):
        smoothed[i] = sum(np.array(cpm[i:i+window])*w)/float(sum(w))
        #Very, VERY slow...
    return smoothed
The higher "degree" is, the longer it takes, but with a lower degree the result does not look good.
...
The second function here should be (way?) more efficient, but I can't resolve the data type error:
def smoothListGaussian2(myarray, degree):
    myarray = np.pad(myarray, (degree-1, degree-1), mode='edge')
    window = degree*2-1
    weight = np.arange(-degree+1, degree)/window
    weight = np.exp(-(16*weight**2))
    weight /= sum(weight)
    #weight = weight.astype(int)  #Does throw the "invalid literal" error
    smoothed = np.convolve(myarray, weight, mode='valid')
    return smoothed
#TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
I'm desperately trying to resolve this data type error with numpy. It's killing me! It seems to be the array "weight" that is float64, but converting it throws more errors, like:
ValueError: invalid literal for int() with base 10: '[31 31 33 ..., 48 49 51]'
So... I'm new to Python and use this to log data from my Geiger counter. Do you have any idea how to either make the first function WAY more efficient or resolve the error in the second? I'm at a loss here.
I found the scripts here: http://www.swharden.com/wp/2008-11-17-linear-data-smoothing-in-python/#comments (I found Scott's other triangle-smooth function on this site, but I couldn't get it to work either; it's more complicated.)
Note that the number of data points depends on the length of the measurement in seconds, and this length can very well be several days. I guess a million data points or more is not unusual.
Thank you!
I just had a revelation of some sort. All I had to do was convert "myarray" to float before convolving.
I had to do so many conversions to make the whole code work correctly, it's ridiculous! I thought this would be easy in Python, but no.. :(( Seems to me that C++ is better in that case.
def smoothListGaussian2(myarray, degree):
    myarray = np.pad(myarray, (degree - 1, degree - 1), mode='edge')
    window = degree * 2 - 1
    weight = np.arange(-degree + 1, degree) / window
    weight = np.exp(-(16 * weight ** 2))
    weight /= sum(weight)
    myarray = myarray.astype(float)
    smoothed = np.convolve(myarray, weight, mode='valid')
    return smoothed
Since this works now, I could test the speed, and it's pretty fast. I can't see a difference in speed between 40k and 150k data points anymore. Cool.
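For the speed question about the first function: the triangle smoothing can likely be rewritten with np.convolve in the same spirit as the Gaussian version. A minimal sketch (not from the original post; note that mode='valid' returns len(cpm) - window + 1 points, one more than the loop version):
import numpy as np

def smoothListTriangle2(cpm, degree):
    window = degree*2 - 1
    # triangular weights 1, 2, ..., degree, ..., 2, 1; they are symmetric,
    # so the flip that np.convolve applies internally changes nothing
    weight = degree - np.abs(degree - np.arange(1, 2*degree))
    weight = weight / weight.sum()
    return np.convolve(np.asarray(cpm, dtype=float), weight, mode='valid')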

Why does `scipy.interpolate.griddata` fail for readonly arrays?

I have some data which I try to interpolate using scipy.interpolate.griddata. In my use-case I marked some of the numpy arrays read-only, which apparently breaks the interpolation:
import numpy as np
from scipy import interpolate
x0 = 10 * np.random.randn(100, 2)
y0 = np.random.randn(100)
x1 = np.random.randn(3, 2)
x0.flags.writeable = False
# x1.flags.writeable = False
interpolate.griddata(x0, y0, x1)
yields the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-a6e09dbdd371> in <module>()
6 # x1.flags.writeable = False
7
----> 8 interpolate.griddata(x0, y0, x1)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.pyc in griddata(points, values, xi, method, fill_value, rescale)
216 ip = LinearNDInterpolator(points, values, fill_value=fill_value,
217 rescale=rescale)
--> 218 return ip(xi)
219 elif method == 'cubic' and ndim == 2:
220 ip = CloughTocher2DInterpolator(points, values, fill_value=fill_value,
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.NDInterpolatorBase.__call__ (scipy/interpolate/interpnd.c:3930)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator._evaluate_double (scipy/interpolate/interpnd.c:5267)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator._do_evaluate (scipy/interpolate/interpnd.c:6006)()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/interpnd.so in View.MemoryView.memoryview_cwrapper (scipy/interpolate/interpnd.c:17829)()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/interpnd.so in View.MemoryView.memoryview.__cinit__ (scipy/interpolate/interpnd.c:14104)()
ValueError: buffer source array is read-only
Clearly, the interpolation function doesn't like that the arrays are write-protected. However, I don't understand why it would need to write to them: I certainly don't expect my input to be mutated by a call to an interpolation function, and as far as I can tell the documentation doesn't mention this either. Why would the function behave like this?
Note that setting x1 readonly instead of x0 leads to a similar error.
The relevant code is written in Cython, and when Cython requests a memoryview of the input array, it always asks for a writeable one, even if you don't need it.
Since an array flagged as non-writeable will refuse to provide a writeable memoryview, the code fails, even though it didn't need to write to the array in the first place.
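As a practical workaround (a minimal sketch, not part of the answer above), you can hand griddata writeable copies of the read-only arrays, since copying an ndarray always yields a writeable one:
import numpy as np
from scipy import interpolate

x0 = 10 * np.random.randn(100, 2)
y0 = np.random.randn(100)
x1 = np.random.randn(3, 2)
x0.flags.writeable = False
x1.flags.writeable = False

# copies are writeable, so the Cython memoryview request succeeds
result = interpolate.griddata(x0.copy(), y0, x1.copy())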

sklearn.gaussian_process fit() not working with array sizes greater than 100

I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is larger than 100x100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using Python 2.6 and scikit-learn 0.10. Should I try Python 3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
  File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
    raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
ValueError: X and y must have the same number of rows. means that in your case XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same for my X and y.
In my case, the problem was in fact that I was passing several output features in my y. Gaussian processes here fit to a single output feature.
The "number of rows" error was misleading, and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a GP for each feature, as sketched below.
