I am computing these derivatives using the Monte Carlo approach for a generic call option. I am interested in the combined derivative (with respect to both S and sigma). Doing this with algorithmic differentiation, I get the error shown at the end of the post. What could be a possible solution? To explain part of the code: the attached formula is the one used to compute the "X" in the code below:
from jax import jit, grad, vmap
import jax.numpy as jnp
from jax import random
import numpy as np
Underlying_asset = jnp.linspace(1.1,1.4,100)
volatilities = jnp.linspace(0.5,0.6,100)
def second_derivative_mc(S, vol):
    N = 100
    j, T, q, r, k = 10000, 1., 0, 0, 1.
    S0 = jnp.array([S]).T              # (N x 1) vector of underlying asset prices
    C = jnp.identity(N) * vol          # matrix of volatilities with 0 outside the diagonal
    e = jnp.array([jnp.full(j, 1.)])   # (1 x j) vector of ones
    Rand = np.random.RandomState()
    Rand.seed(10)
    U = Rand.normal(0, 1, (N, j))      # random draws for the Brownian motion
    sigma2 = jnp.array([vol**2]).T     # (N x 1) vector of variances
    first = jnp.dot(sigma2, e)         # first part of the equation
    second = jnp.dot(C, U)             # second part of the equation
    X = -0.5 * first + jnp.sqrt(T) * second
    St = jnp.exp(X) * S0
    P = jnp.maximum(St - k, 0)
    payoff = jnp.average(P, axis=-1) * jnp.exp(-q * T)
    return payoff
greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset, volatilities)
This is the error message:
> UnfilteredStackTrace Traceback (most recent call
> last) <ipython-input-78-0cc1da97ae0c> in <module>()
> 25
> ---> 26 greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
>
> 18 frames UnfilteredStackTrace: TypeError: Gradient only defined for
> scalar-output functions. Output had shape: (100,).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
> TypeError Traceback (most recent call
> last) /usr/local/lib/python3.7/dist-packages/jax/_src/api.py in
> _check_scalar(x)
> 894 if isinstance(aval, ShapedArray):
> 895 if aval.shape != ():
> --> 896 raise TypeError(msg(f"had shape: {aval.shape}"))
> 897 else:
> 898 raise TypeError(msg(f"had abstract value {aval}"))
> TypeError: Gradient only defined for scalar-output functions. Output had shape: (100,).
As the error message indicates, gradients can only be computed for functions that return a scalar. Your function returns a vector:
print(len(second_derivative_mc(1.1, 0.5)))
# 100
For vector-valued functions, you can compute the jacobian (which is similar to a multi-dimensional gradient). Is this what you had in mind?
from jax import jacobian
greek = vmap(jacobian(jacobian(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
Also, this is not what you asked about, but the function above will probably not work as you intend even if you solve the issue in the question. Numpy RandomState objects are stateful, and thus will generally not work correctly with jax transforms like grad, jit, vmap, etc., which require side-effect-free code (see Stateful Computations In JAX). You might try using jax.random instead; see JAX: Random Numbers for more information.
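For example, here is a minimal sketch of the random draw done with jax.random instead (assuming N and j as defined in your function; the seed matches your code, and this is just one way to structure it):
from jax import random

key = random.PRNGKey(10)        # explicit, immutable PRNG state instead of a seeded RandomState
U = random.normal(key, (N, j))  # (N x j) standard normal draws, side-effect free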
I use the function leastsq from scipy.optimize to fit a sphere's center coordinates and radius from 3D coordinates.
So my code looks like this:
import numpy as np

def distance(pc, point):
    xc, yc, zc, rd = pc
    x, y, z = point
    return np.sqrt((xc-x)**2 + (yc-y)**2 + (zc-z)**2)

def sphere_params(coords):
    from scipy import optimize
    err = lambda pc, point: distance(pc, point) - pc[3]
    pc = [0, 0, 0, 1]
    pc, success = optimize.leastsq(err, pc[:], args=(coords,))
    return pc
(Built thanks to: How do I fit 3D data.)
I started working with the variable coords as a list of tuples (each tuple being an x,y,z coordinate):
>> coords
>> [(0,0,0),(0,0,1),(-1,0,0),(0.57,0.57,0.57),...,(1,0,0),(0,1,0)]
Which led me to an error:
>> pc = sphere_params(coords)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/michel/anaconda/lib/python2.7/site-packages/scipy/optimize/minpack.py", line 374, in leastsq
raise TypeError('Improper input: N=%s must not exceed M=%s' % (n, m))
TypeError: Improper input: N=4 must not exceed M=3
Where N is the number of parameters stored in pc, and M the number of data points. This makes it look like I haven't given enough data points, while my list coords actually contains 351 tuples versus 4 parameters in pc!
From what I read in minpack, the actual culprit seems to be this line (from _check_func()):
res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
Unless I'm mistaken, in my case it translates into
res = atleast_1d(distance(*((pc[:len(pc)],) + args)))
But I'm having a terrible time trying to understand what this means, along with the rest of the _check_func() function.
I ended up changing coords into an array before giving it as an argument to sphere_params(): coords = np.asarray(coords).T, and it started working just fine. I would really like to understand why the data format was giving me trouble, though!
Many thanks in advance for your answers!
EDIT: I notice my use of coords for the "distance" and "err" functions was really unwise and misleading; it wasn't like that in my original code, so it was not the core of the problem. It makes more sense now.
Your err function must take the full list of coords and return a full list of distances. leastsq will then take the list of errors, square and sum them, and minimize that squared sum.
There are also distance functions in scipy.spatial.distance, so I would recommend that:
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import leastsq

def distance_cdist(pc, coords):
    return cdist([pc], coords).squeeze()

def distance_norm(pc, points):
    """pc must be a shape (D+1,) array;
    points can be an (N, D) or (D,) array."""
    c = np.asarray(pc[:3])
    points = np.atleast_2d(points)
    return np.linalg.norm(points - c, axis=1)

def sphere_params(coords):
    err = lambda pc, coords: distance_cdist(pc[:3], coords) - pc[3]
    pc = [0, 0, 0, 1]
    pc, success = leastsq(err, pc, args=(coords,))
    return pc

coords = [(0,0,0), (0,0,1), (-1,0,0), (0.57,0.57,0.57), (1,0,0), (0,1,0)]
sphere_params(coords)
While I haven't used this function much, as best I can tell, coords is passed as-is to your distance function. At least it would be if the error checking allowed it. In fact, it is likely that the error checking tries to do just that, and raises an error if distance raises an error. So let's try that.
In [91]: coords=[(0,0,0),(0,0,1),(-1,0,0),(0.57,0.57,0.57),(1,0,0),(0,1,0)]
In [92]: distance([0,0,0,0],coords)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-92-113da104affb> in <module>()
----> 1 distance([0,0,0,0],coords)
<ipython-input-89-64c557cd95e0> in distance(pc, coords)
2
3 xc,yx,zx,rd = pc
----> 4 x ,y ,z = coords
5 return np.sqrt((xc-x)**2+(yc-y)**2+(zc-z)**2)
6
ValueError: too many values to unpack (expected 3)
So that's where the 3 comes from - your x, y, z = coords.
distance([0,0,0,0],np.array(coords))
same error.
distance([0,0,0,0],np.array(coords).T)
gets past that issue (3 rows that can be split into 3 variables), but raises another error: NameError: name 'yc' is not defined
That looks like a typo in the code you gave us (naughty, naughty!).
Correcting that:
In [97]: def distance(pc,coords):
   ....:     xc,yc,zc,rd = pc
   ....:     x ,y ,z = coords
   ....:     return np.sqrt((xc-x)**2+(yc-y)**2+(zc-z)**2)
   ....:
In [98]: distance([0,0,0,0],np.array(coords).T)
Out[98]: array([ 0. , 1. , 1. , 0.98726896, 1. , 1. ])
# and wrapping the array in a tuple, as `leastsq` does
In [102]: distance([0,0,0,0],*(np.array(coords).T,))
Out[102]: array([ 0. , 1. , 1. , 0.98726896, 1. , 1. ])
I get a 6-element array, one value for each 'point' in coords. Is that what you want?
Where did you get the idea that leastsq feeds your coords one tuple at a time to your lambda?
args : tuple
Any extra arguments to func are placed in this tuple.
In general with these optimize functions, if you want to perform the operation on a set of conditions, then you need to iterate over those conditions, calling the optimizer on each one. Or if you want to optimize over the whole set at once, then you need to write your function (err, etc.) to work with the whole set at once.
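For instance, here is a minimal sketch of that vectorized pattern (the names are hypothetical, and the follow-up below does essentially the same thing):
import numpy as np
from scipy.optimize import leastsq

def residuals(params, points):
    # one residual per data point: distance to the centre minus the radius
    center, radius = np.asarray(params[:3]), params[3]
    return np.linalg.norm(points - center, axis=1) - radius

points = np.random.rand(50, 3)  # (N, 3) array of sample points
fit, ier = leastsq(residuals, [0, 0, 0, 1], args=(points,))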
So here is what I came up with from the previous help:
import numpy as np
from scipy.optimize import leastsq

def a_dist(a, B):
    # works with - a: reference point - B: coordinate matrix
    return np.linalg.norm(a - B, axis=1)

def parametric(coords):
    err = lambda pc, point: a_dist(pc, point) - 18
    pc = [0, 0, 0]  # initial guess for the parameters
    pc, success = leastsq(err, pc[:], args=(coords,))
    return pc
It definitely works with both a list of tuples and an array of shape (N, 3):
>> cluster  # it's more than 6000 points, you won't have the same result
>> [(4, 30, 19), (3, 30, 19), (5, 30, 19), ..., (4, 30, 3), (4, 30, 35)]
>> parametric(cluster)
>> array([ -5.25734467, 20.73419249, 9.73428766])
>> np.asarray(cluster).shape
>> (6017, 3)
>> parametric(np.asarray(cluster))
>> array([ -5.25734467, 20.73419249, 9.73428766])
Combining this version with Askewchan's, i.e. having:
def sphere_params(coords):
    err = lambda pc, coords: distance_cdist(pc[:3], coords) - pc[3]
    pc = [0, 0, 0, 1]
    pc, success = leastsq(err, pc, args=(coords,))
    return pc
Also works fine. To be honest, I didn't take the time to try your solution. I definitely stopped taking the radius as a fit parameter, however; I found it not robust at all (even 6000 noisy data points were not enough to get the right curvature!).
When comparing to my first code, I'm still not quite sure what was wrong, though. I probably messed up with global/local variables, although I really don't recall using any "global" statement in any of my functions.
I'm trying to apply my own custom distance metric function when using a KNN regression model.
My dataset is a mixture of nominal, ordinal, numeric and binary types of fields.
Code:
from sklearn import neighbors

def cus_distance(array1, array2, **kwargs):
    # calculate the distance, return a float
    pass

knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance)
# train_data is a pandas DataFrame object
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
The last line will cause an exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-284-04520b227b8a> in <module>()
----> 1 knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
587 X, y = check_arrays(X, y, sparse_format="csr")
588 self._y = y
--> 589 return self._fit(X)
590
591
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
214 self._tree = BallTree(X, self.leaf_size,
215 metric=self.effective_metric_,
--> 216 **self.effective_metric_kwds_)
217 elif self._fit_method == 'kd_tree':
218 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
318
319 """
--> 320 return array(a, dtype, copy=False, order=order)
321
322 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: Unknown
I know this error is caused by string values in my dataset (the 'Unknown' is one of them).
This confuses me: in my understanding, the function cus_distance should take care of these string values, and the KNeighborsRegressor should just use the return value of my function.
Q:
* Is this the right way to use a custom-defined distance metric in KNN regression?
* If it is, why did I get this exception?
* If not, what is the right way?
The Ball Tree and KD Tree require floating point data, regardless of the metric used. If your data cannot be converted to floating point, then you will get this sort of error.
>>> import numpy as np
>>> data = [1, "Unknown", 2]
>>> np.asarray(data, dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
----> 1 np.asarray(data, dtype=float)
ValueError: could not convert string to float: Unknown
Thanks @jakevdp.
scikit-learn supports Brute Force, Ball Tree and KD Tree, and according to @jakevdp's answer, the only one I can use is the Brute Force algorithm, so my code changed to:
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance, algorithm='brute')
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
This time it doesn't raise an error anymore. Thanks jakevdp!
But a new problem came up when I tried to use this knn object:
knn.predict(check_data.ix[:, fields_list])
This causes the same error as in my question. So I looked into scikit-learn's source code and found the lines that cause the error:
elif callable(metric):
# Check matrices first (this is usually done by the metric).
X, Y = check_pairwise_arrays(X, Y)
n_x, n_y = X.shape[0], Y.shape[0]
The function check_pairwise_arrays tries to convert all values to float, so "Unknown" causes the error again.
I think this is kind of a bug: scikit-learn's built-in metrics don't support mixed-type datasets, and even though I wrote a custom metric function, this line still forces the dataset to be pure float.
And as the comment above this line says, the checking work is usually done by the metric itself, so I just commented out this line, reloaded the module, and my knn object works perfectly now :)
PS: I'm working on pushing this change to the official scikit-learn GitHub repo.
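As an aside, if your scikit-learn version accepts metric='precomputed' for nearest neighbors (an assumption, check your version), you can avoid patching the library by computing the distance matrices yourself, so the mixed-type data never reaches the float conversion. A rough sketch, with a hypothetical pairwise helper:
import numpy as np
from sklearn import neighbors

def pairwise(A, B, metric):
    # distances from every row of A to every row of B via the custom metric
    return np.array([[metric(a, b) for b in B] for a in A])

X_train = train_data.ix[:, fields_list].values
X_test = check_data.ix[:, fields_list].values
D_train = pairwise(X_train, X_train, cus_distance)  # (n_train, n_train)
D_test = pairwise(X_test, X_train, cus_distance)    # (n_test, n_train)

knn = neighbors.KNeighborsRegressor(weights='distance', metric='precomputed')
knn.fit(D_train, train_data['time_costs'])
knn.predict(D_test)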
How can I use scipy.stats.kde.gaussian_kde and scipy.stats.kstest together in a consistent way?
For example, the code:
from numpy import inf
import scipy.stats

# sample is a 1-D array of observations
my_pdf = scipy.stats.kde.gaussian_kde(sample)
scipy.stats.kstest(sample, lambda x: my_pdf.integrate_box_1d(-inf, x))
Gives the following answer:
(0.5396735893479544, 0.0)
This is not true, because the sample obviously belongs to the distribution that was constructed from this sample.
First of all, the right test to use for testing if two samples may have come from the same distribution is the two-sample KS test, implemented in scipy.stats.ks_2samp, which directly compares the empirical CDFs. KDE is density estimation, which smooths out the CDF, and is therefore a bunch of unnecessary work that also makes your estimate worse, statistically speaking.
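For reference, a minimal sketch of that test (here with two synthetic normal samples; in your case the second sample would be whatever data you want to compare against):
import numpy as np
from scipy.stats import ks_2samp

a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
stat, p_value = ks_2samp(a, b)  # compares the two empirical CDFs directly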
But the reason you're seeing this problem is that the signature for your CDF parameter isn't quite right. kstest calls cdf(vals) (source), where vals is the sorted samples, to get out the CDF value for each of your samples. In your code, this ends up calling my_pdf.integrate_box_1d(-np.inf, samps), but integrate_box_1d wants both arguments to be scalars. The signature is wrong, and if you tried this with most arrays it'd crash with a ValueError:
>>> my_pdf.integrate_box_1d(-np.inf, samp[:10])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-81d0253a33bf> in <module>()
----> 1 my_pdf.integrate_box_1d(-np.inf, samp[:10])
/Library/Python/2.7/site-packages/scipy-0.12.0.dev_ddd617d_20120725-py2.7-macosx-10.8-x86_64.egg/scipy/stats/kde.pyc in integrate_box_1d(self, low, high)
311
312 normalized_low = ravel((low - self.dataset) / stdev)
--> 313 normalized_high = ravel((high - self.dataset) / stdev)
314
315 value = np.mean(special.ndtr(normalized_high) - \
ValueError: operands could not be broadcast together with shapes (10) (1,1000)
but unfortunately, when the second argument is samp, it can broadcast just fine since the arrays are the same shape, and then everything goes to hell. Presumably integrate_box_1d should check the shape of its arguments, but here's one way to do it correctly:
>>> my_cdf = lambda ary: np.array([my_pdf.integrate_box_1d(-np.inf, x) for x in ary])
>>> scipy.stats.kstest(sample, my_cdf)
(0.015597917205996903, 0.96809912578616597)
You could also use np.vectorize if you felt like it.
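Something like this, which should behave the same as the list comprehension above:
my_cdf = np.vectorize(lambda x: my_pdf.integrate_box_1d(-np.inf, x))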
(But again, you probably actually want to use ks_2samp.)