I'm using Shogun to run MMD (quadratic) and compare two nonparametric distributions based on their samples (code below is for 1D, but I've also looked at 2D samples). In the toy problem shown below, I try to change the ratio between training and testing samples in the process of selecting an optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I've also used KSM_MEDIAN_HEURISTIC). It appears that any ratio other than 1 yields an error.
Am I allowed to change this ratio in this setting?
(I see that it is used at: http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there)
A concise version of my code (inspired by the notebook available at: http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):
import shogun as sg
import numpy as np
from scipy.stats import laplace, norm
n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))
mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)
mmd.set_train_test_ratio(1)
mmd.select_kernel()
mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)
print p_value
This exact version runs and prints p-values just fine.
If I change the argument passed to mmd.set_train_test_ratio() from 1 to 2, I get:
SystemErrorTraceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
25 kernel_width = mmd_kernel.get_width()
26
---> 27 statistic = mmd.compute_statistic()
28 p_value = mmd.compute_p_value(statistic)
29
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90
It gets worse if I use a value below 1. In addition to the following error, the Jupyter notebook kernel crashes every time (after which I need to rerun the entire notebook; the message says: "The kernel appears to have died. It will restart automatically.").
SystemErrorTraceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
20 mmd.set_train_test_ratio(0.5)
21
---> 22 mmd.select_kernel()
23
24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146
The complete code (in a Jupyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb
Please let me know if I am missing a step or need to try a different approach.
Side questions:
Both http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html and http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html show examples of using sg.GaussianKernel(10, <width>). I couldn't find more information about the first parameter beyond its name, cache size. How and when am I supposed to change it?
As mentioned in the referenced notebook, mmd.get_kernel_selection_strategy().get_name() returns only the generic name, specifically KernelSelectionStrategy. How can I obtain a more specific name for the selected strategy (e.g., KSM_MEDIAN_HEURISTIC) from an instance of the sg.QuadraticTimeMMD class?
Any relevant information or references will be greatly appreciated.
Shogun version: v6.1.3_2017-12-7_19:14
The train_test_ratio attribute is the ratio between the number of samples used in training and the number used in testing. When train_test_mode is turned on, it decides how many samples to fetch in each mode roughly as follows:
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples = m_num_samples / (train_test_ratio + 1)
It implicitly assumes divisibility. A train_test_ratio of 2 would therefore try to use 2/3 of the data for training and 1/3 for testing, which is problematic for your total of 220 samples: by that logic it sets num_training_samples = 146 and num_testing_samples = 73, which don't add up to 220. Similar issues arise with a train-test ratio of 0.5. If you use a train_test_ratio that splits the total number of samples exactly, these errors should go away, as illustrated below.
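A rough sketch of that integer split (my own arithmetic for illustration, not Shogun's actual code):

def split_counts(num_samples, train_test_ratio):
    # mirrors the two formulas above, with truncation to integers
    num_train = int(num_samples * train_test_ratio / (train_test_ratio + 1.0))
    num_test = int(num_samples / (train_test_ratio + 1.0))
    return num_train, num_test

split_counts(220, 1)    # (110, 110) -> adds up to 220, works
split_counts(220, 2)    # (146, 73)  -> 219 != 220, trips the internal size assertions
split_counts(220, 0.5)  # (73, 146)  -> 219 != 220, fails as well
split_counts(220, 10)   # (200, 20)  -> adds up to 220, should work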
I am not totally sure, but I think the cache only matters when you're using SVMLight with Shogun. Please check http://svmlight.joachims.org/ for details. From their page:
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
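If you do want to change it, the first constructor argument is that cache size in MB (the second is the kernel width, as in the examples you linked); a hedged illustration:

kernel = sg.GaussianKernel(100, 1.0)  # 100 MB kernel cache, Gaussian width 1.0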
There's no pretty-print for the kernel-selection strategy being used, but you could call mmd.get_kernel_selection_strategy().get_method(), which returns the enum value (of type EKernelSelectionMethod); that might be helpful. Since it's not documented yet in the Shogun API docs, here's the C++ equivalent you can refer to.
enum EKernelSelectionMethod
{
    KSM_MEDIAN_HEURISTIC,
    KSM_MAXIMIZE_MMD,
    KSM_MAXIMIZE_POWER,
    KSM_CROSS_VALIDATION,
    KSM_AUTO = KSM_MAXIMIZE_POWER
};
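From Python, a small helper along these lines could map that enum value back to a readable name (assuming the enum values are exposed on the shogun module, as the sg.KSM_MAXIMIZE_MMD constant used in your question suggests):

method = mmd.get_kernel_selection_strategy().get_method()
names = {
    sg.KSM_MEDIAN_HEURISTIC: "KSM_MEDIAN_HEURISTIC",
    sg.KSM_MAXIMIZE_MMD: "KSM_MAXIMIZE_MMD",
    sg.KSM_MAXIMIZE_POWER: "KSM_MAXIMIZE_POWER",
    sg.KSM_CROSS_VALIDATION: "KSM_CROSS_VALIDATION",
}
print names.get(method, "unknown")  # e.g. KSM_MAXIMIZE_MMD for the setup in the question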
Summary (from comments):
The bug does not show up in the latest code
The fix is in: https://github.com/shogun-toolbox/shogun/pull/4134
I'm trying to code my own logistic regression and compare different methods of maximizing the log-likelihood. Using the Newton-CG method, I get the error "ValueError: setting an array element with a sequence". Reading around, it seems this error arises when the function being minimized returns a non-scalar, but that is not the case here. I need the three methods below to give (approximately) the same result, but when running on my real data, one does not converge, another gives a worse log-likelihood than the initial guess, and the third does not run at all.
Why do I get the ValueError message and how can I fix it?
My code (with dummy data, the real data is ~100 measurements) is as follows:
import numpy as np
from numpy import linalg
import scipy
from scipy.optimize import minimize
def CalcLL(beta,xinlist,yinlist):
    LL=0.0
    ncol=len(beta)
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    for i in range(len(yinlist)):
        LL=LL+np.where(yinlist[i]==1,np.log(pi[i]),np.log(1-pi[i]))
    return -LL

def Jacobian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    Jac=np.transpose(np.matrix(yinlist-pi))*np.matrix(xinlist)
    return Jac

def Hessian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    W=FindW(pi)
    Hes=np.matrix(np.transpose(xinlist))*(np.matrix(W)*np.matrix(xinlist))
    return Hes

def FindPi(xinlist,beta):
    rows=np.shape(xinlist)[0]  # Number of rows in x_new
    cols=np.shape(xinlist)[1]  # Number of columns in x_new
    expon=np.dot(xinlist,beta)
    expon=np.array(expon).reshape(rows,1)
    pi=np.exp(expon)/(1+np.exp(expon))
    return pi

def FindW(pi):
    W=np.zeros(len(pi)*len(pi)).reshape(len(pi),len(pi))
    for i in range(len(pi)):
        W[i,i]=float(pi[i]*(1-pi[i]))
    return W

xinlist=np.matrix([[1,1],[0,1],[1,1],[1,1],[1,1],[0,1],[0,1],[1,1],[1,1],[0,1]])
yinlist=np.transpose(np.matrix([0,0,0,0,0,1,1,1,1,1]))
ncol=np.shape(xinlist)[1]

beta1=np.zeros(ncol).reshape(ncol,1)  # Initial guess for parameter values
limit=0.000001  # selfwritten Newton-Raphson method
iter_i=limit+1
while iter_i>limit:
    Hes=Hessian(beta1,xinlist,yinlist)
    Jac=np.transpose(Jacobian(beta1,xinlist,yinlist))
    root_diff=np.array(linalg.inv(Hes)*Jac)
    beta1=beta1+root_diff
    iter_i=np.sum(root_diff*root_diff)
print "When running self-written algorithm, the log-likelihood is",-CalcLL(beta1,xinlist,yinlist)

beta2=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta2,args=(xinlist,yinlist),method='Nelder-Mead',options={'xtol':1e-8,'disp':True,'maxiter':10000})
beta2=res.x
print "The log-likelihood using Nelder-Mead is", -CalcLL(beta2,xinlist,yinlist)

beta3=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
beta3=res.x
print "The log-likelihood using Newton-CG is", -CalcLL(beta3,xinlist,yinlist)
EDIT:
The error stack is as follows:
Traceback (most recent call last):
  File "MyLogisticRegression2.py", line 62, in <module>
    res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
  File "C:\Python27\lib\site-packages\scipy\optimize\_minimize.py", line 447, in minimize
    **options)
  File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 2393, in _minimize_newtoncg
    eta = numpy.min([0.5, numpy.sqrt(maggrad)])
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2393, in amin
    out=out, **kwargs)
  File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: setting an array element with a sequence
I found out that the problem arose from the beta arrays having shape (2,1) instead of (2,), and likewise for the Jacobian. Reshaping these two solved the problem.
Apparently the Newton-CG solver requires 1-D arrays for the Jacobian.
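An untested sketch of that fix, applied to the Newton-CG call from the question (Jacobian1d and Hessian2d are my own wrapper names, reusing the question's FindPi, Hessian, and CalcLL):

def Jacobian1d(beta, xinlist, yinlist):
    # gradient of -LL (the quantity CalcLL returns), flattened to shape (ncol,)
    pi = FindPi(xinlist, beta.reshape(len(beta), 1))
    grad = np.asarray(np.transpose(np.matrix(yinlist - pi)) * np.matrix(xinlist))
    return -grad.ravel()

def Hessian2d(beta, xinlist, yinlist):
    # plain ndarray of shape (ncol, ncol) instead of np.matrix
    return np.asarray(Hessian(beta, xinlist, yinlist))

beta3 = np.zeros(ncol)  # shape (ncol,) instead of (ncol, 1)
res = minimize(CalcLL, beta3, args=(xinlist, yinlist), method='Newton-CG',
               jac=Jacobian1d, hess=Hessian2d, options={'xtol': 1e-8, 'disp': True})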
I keep getting errors when I try to solve a system of three equations using the following code in Python 3:
import sympy
from sympy import Symbol, solve, nsolve
x = Symbol('x')
y = Symbol('y')
z = Symbol('z')
eq1 = x - y + 3
eq2 = x + y
eq3 = z - y
print(nsolve( (eq1, eq2, eq3), (x,y,z), (-50,50)))
Here is the error message:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/mpmath/calculus/optimization.py", line 928, in findroot
    fx = f(*x0)
TypeError: <lambda>() missing 1 required positional argument: '_Dummy_15'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "", line 12, in
  File "/usr/lib/python3/dist-packages/sympy/solvers/solvers.py", line 2498, in nsolve
    x = findroot(f, x0, J=J, **kwargs)
  File "/usr/lib/python3/dist-packages/mpmath/calculus/optimization.py", line 931, in findroot
    fx = f(x0[0])
TypeError: <lambda>() missing 2 required positional arguments: '_Dummy_14' and '_Dummy_15'
The strange thing is, the error message goes away if I only solve the first two equations, by changing the last line of the code to
print(nsolve( (eq1, eq2), (x,y), (-50,50)))
output:
exec(open('bug444.py').read())
[-1.5]
[ 1.5]
I'm baffled; your help is most appreciated!
A few pieces of additional info:
I'm using python3.4.0 + sympy 0.7.6-3 on ubuntu 14.04. I got the same error in python2
I could solve this system using
solve( [eq1,eq2,eq3], [x,y,z] )
but this system is just a toy example; in the actual applications the system is non-linear and I need higher precision, and I don't see how to adjust the precision for solve, whereas for nsolve I could use nsolve(... , prec=100)
THANKS!
In your print statement, you are missing your guess for z
print(nsolve((eq1, eq2, eq3), (x, y, z), (-50, 50)))
Try this (in most cases, using 1 for all the guesses is fine):
print(nsolve((eq1, eq2, eq3), (x, y, z), (1, 1, 1)))
Output:
[-1.5]
[ 1.5]
[ 1.5]
You can discard the initial guesses/dummies if you use linsolve:
>>> from sympy import linsolve
>>> print(linsolve((eq1, eq2, eq3), x,y,z))
{(-3/2, 3/2, 3/2)}
And then you can use nonlinsolve for your nonlinear problem set.
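For illustration, with a made-up nonlinear system (this needs a SymPy version that provides nonlinsolve):
>>> from sympy import nonlinsolve
>>> nonlinsolve([x**2 - 4, x + y], [x, y])
{(-2, 2), (2, -2)}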
The problem is that the number of variables must equal the number of guess values:
print(nsolve((eq1, eq2, eq3), (x,y,z), (-50,50,50)))
If you're using a numerical solver on a multidimensional problem, it needs to start from somewhere and follow a gradient toward the solution.
The guess vector is where you start.
If there are multiple local minima/maxima in the space, different guess vectors can lead to different outputs, and an unfortunate guess vector may not converge at all.
For a one-dimensional problem the guess vector is just x0.
For most functions you can easily write down, almost any vector will converge to the single global solution, so the guess vector (1, 1, 1) here is as good as (-50, 50, 50).
Just don't leave any of the guesses out.
Your code should be:
nsolve([eq1, eq2, eq3], [x,y,z], [1,1,1])
Your code was:
nsolve([eq1, eq2, eq3], [x,y,z], [1,1])
You were missing one guess value in the last argument.
The point is: if you are solving for n unknowns, you provide a guess for each one (n guesses in the last argument).
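For instance, tying this back to the precision requirement in the question, all n guesses go in the last argument and prec raises the working precision (a sketch using the question's equations):

nsolve([eq1, eq2, eq3], [x, y, z], [1, 1, 1], prec=100)  # 3 unknowns -> 3 guesses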
I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using python2.6 and scikit-learn-0.10. Should I try python3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
ValueError: X and y must have the same number of rows means that, in your case, XKrn.shape[0] must equal yKrn.shape[0]. You probably have an error in the code that generates the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same for my X and y.
In my case, the problem was that I was passing in multiple output features to fit against. This GaussianProcess fits a single output feature.
The "number of rows" error was misleading and stemmed from the fact that I wasn't using the package correctly: to fit multiple output features like this, you need a separate GP for each feature, as sketched below.
I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)

    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)

    return rv()

if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)
    print d.rvs(size=100) # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last):
  File "testDistributions.py", line 19, in <module>
    print d.rvs(size=100)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs
    vals = self._rvs(*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs
    Y = self._ppf(U,*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf
    return self.vecfunc(q,*args)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in __call__
    theout = self.thefunc(*newargs)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call
    return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq
    r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)

    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))

        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)

        def _pdf(self, x):
            return kernel.evaluate(x)

    return rv(name='kdedist', xa=-200, xb=200)
Specifically to your traceback:
rvs uses the inverse of the cdf, the ppf, to create random numbers. Since you are not specifying a ppf, it is calculated by a root-finding algorithm, brentq. brentq needs lower and upper bounds for where it should search for the value at which the function is zero (find x such that cdf(x) = q, where q is the quantile).
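For intuition, a rough sketch of that mechanism using the d instance from your code (this would hit the same brentq error until the limits discussed next are widened):

import numpy
u = numpy.random.uniform(size=100)  # quantiles q in (0, 1)
samples = d.ppf(u)                  # ppf inverts the cdf; rvs does this internally via brentq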
The defaults for those limits, xa and xb, are too small for your example. The following works for me with scipy 0.9.0; xa and xb can be set when creating the distribution instance:
def getDistribution(data):
    kernel = stats.gaussian_kde(data)

    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)

    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this; the easiest approach is to follow some examples (and ask on the mailing list).
edit: addition
pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, gaussian_kde has a resample method. Random samples can be generated by sampling from the data and adding Gaussian noise, so this will be faster than the generic rvs, which goes through the ppf. I would write a ._rvs method that just calls gaussian_kde's resample method.
precomputing ppf: I don't know of a general way to precompute the ppf. However, the way I thought of doing it (but have never tried) is to precompute the ppf at many points and then use linear interpolation to approximate it.
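An untested sketch of that idea, assuming kernel is the gaussian_kde instance and the support is roughly [-200, 200] as above:

import numpy

grid = numpy.linspace(-200, 200, 2001)
cdf_vals = numpy.array([kernel.integrate_box_1d(-numpy.Inf, g) for g in grid])

def approx_ppf(q):
    # linear interpolation of the inverse cdf on the precomputed grid
    return numpy.interp(q, cdf_vals, grid)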
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, sets the attribute self._size (the size of the requested array of random variables), and then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape arguments of .rvs work in the multivariate case. These distributions are designed for univariate distributions and might not fully work in the multivariate case, or might need some reshaping.