Shogun / quadratic MMD error caused by varying train_test_ratio - python

I'm using Shogun to run MMD (quadratic) and compare two nonparametric distributions based on their samples (code below is for 1D, but I've also looked at 2D samples). In the toy problem shown below, I try to change the ratio between training and testing samples in the process of selecting an optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I've also used KSM_MEDIAN_HEURISTIC). It appears that any ratio other than 1 yields an error.
Am I allowed to change this ratio in this setting?
(I see that it is used at: http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there)
A concise version of my code (inspired by the notebook available at: http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):
import shogun as sg
import numpy as np
from scipy.stats import laplace, norm
n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))
mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)
mmd.set_train_test_ratio(1)
mmd.select_kernel()
mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)
print p_value
This exact version runs and prints p-values just fine.
If I change the argument passed to mmd.set_train_test_ratio() from 1 to 2, I get:
SystemErrorTraceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
25 kernel_width = mmd_kernel.get_width()
26
---> 27 statistic = mmd.compute_statistic()
28 p_value = mmd.compute_p_value(statistic)
29
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90
It gets worse if I use a value below 1. In addition to the following error,
the Jupyter notebook kernel crashes every time (after which I need to rerun the entire notebook; the message says: "The kernel appears to have died. It will restart automatically.").
SystemErrorTraceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
20 mmd.set_train_test_ratio(0.5)
21
---> 22 mmd.select_kernel()
23
24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146
The complete code (in a Jupyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb
Please let me know if I am missing a step or need to try a different approach.
Side questions:
Both http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html and http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html show examples of using sg.GaussianKernel(10, <width>). I couldn't find more information about the first parameter beyond its name, cache size. How and when am I supposed to change it?
As mentioned in the referenced notebook, mmd.get_kernel_selection_strategy().get_name() returns only the generic name, specifically KernelSelectionStrategy. How can I obtain a more specific name for the selected strategy (e.g., KSM_MEDIAN_HEURISTIC) from an instance of the sg.QuadraticTimeMMD class?
Any relevant information or references will be greatly appreciated.
Shogun version: v6.1.3_2017-12-7_19:14

The train_test_ratio attribute is the ratio between the number of samples used for training and the number used for testing. When train_test_mode is turned on, it decides how many samples to fetch in each mode roughly as follows:
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples = m_num_samples / (train_test_ratio + 1)
It implicitly assumes divisibility. A train_test_ratio of 2 would therefore try to use two thirds of the data for training and one third for testing, which is problematic for your total of 220 samples: by this logic it sets num_training_samples = 146 and num_testing_samples = 73, which doesn't add up to 220. Similar issues arise with a ratio of 0.5. If you choose a train_test_ratio that splits the total number of samples evenly, I think these errors would go away.
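A quick way to sanity-check a candidate ratio before handing it to Shogun is to reproduce this arithmetic yourself. The sketch below only mirrors the formulas above (with integer truncation); it is not Shogun's actual implementation:
n = 220

def split_counts(n, ratio):
    # mirrors the formulas above; truncation is where the mismatch comes from
    n_train = int(n * ratio / (ratio + 1))
    n_test = int(n / (ratio + 1))
    return n_train, n_test

for ratio in (1, 2, 0.5, 10):
    n_train, n_test = split_counts(n, ratio)
    print(ratio, n_train, n_test, n_train + n_test == n)

# ratio=1 and ratio=10 split 220 evenly; ratio=2 and ratio=0.5 lose a sample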
I am not totally sure, but I think the kernel cache matters when you're using SVMLight with Shogun. Please check http://svmlight.joachims.org/ for details. From their page:
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
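In the Shogun snippets above, the cache size is simply the first constructor argument, so changing it is just a matter of passing a different value; for example (purely illustrative numbers, following the question's own sg.GaussianKernel usage):
kernel_small_cache = sg.GaussianKernel(10, 1.0)   # 10 MB kernel cache, width 1.0 (as in the question)
kernel_large_cache = sg.GaussianKernel(256, 1.0)  # larger cache; mainly relevant for SVM-style training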
There's no pretty-print for the kernel-selection strategy being used, but you could call mmd.get_kernel_selection_strategy().get_method(), which returns the enum value (of type EKernelSelectionMethod); that might be helpful. Since it's not documented yet in the Shogun API docs, here's the C++ definition of that enum:
enum EKernelSelectionMethod
{
    KSM_MEDIAN_HEURISTIC,
    KSM_MAXIMIZE_MMD,
    KSM_MAXIMIZE_POWER,
    KSM_CROSS_VALIDATION,
    KSM_AUTO = KSM_MAXIMIZE_POWER
};
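If you want a human-readable name on the Python side, one option is to build the mapping yourself from this enum. This is only a sketch: it assumes the Python bindings expose these members as sg.KSM_* constants (the question confirms KSM_MAXIMIZE_MMD and KSM_MEDIAN_HEURISTIC; the other two are assumptions):
method_names = {
    sg.KSM_MEDIAN_HEURISTIC: "KSM_MEDIAN_HEURISTIC",
    sg.KSM_MAXIMIZE_MMD: "KSM_MAXIMIZE_MMD",
    sg.KSM_MAXIMIZE_POWER: "KSM_MAXIMIZE_POWER",      # assumed to be exposed, like the two above
    sg.KSM_CROSS_VALIDATION: "KSM_CROSS_VALIDATION",  # assumed to be exposed, like the two above
}
method = mmd.get_kernel_selection_strategy().get_method()
print(method_names.get(method, "unknown method: {}".format(method)))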

Summary (from comments):
The bug does not show up in the latest code
Solution is in: https://github.com/shogun-toolbox/shogun/pull/4134

Related

pnoise2() function in noise module produces SegFault python 3.8

I am procedurally generating some maps for a game I am working on, and I am trying to use the Perlin noise function from the noise module.
Following some tutorials on the internet, I found out that I had to use it this way:
import random
import noise
from noise import pnoise2
factor = 15  # it doesn't need to be 15; anything greater than 1 or lower than -1 works, otherwise pnoise2 returns 0.0
oct = 1
seed = random.randint(0, 100)
print(pnoise2(x/factor, y/factor, octaves=oct, base=seed))
I can pass it the values x and y, but if factor meets the condition in the comment, the program throws a fatal error (a segfault) when calling pnoise2. What am I doing wrong?
I am also using pygame for the game, and the exact error I get is the following:
Fatal Python error: (pygame parachute) Segmentation Fault
Python runtime state: initialized
Current thread 0x00007fe3247f2740 (most recent call first):
File "scripts/world_manager.py", line 78 in generate
File "main.py", line 153 in reload_chunks
File "main.py", line 257 in update
File "main.py", line 207 in run
File "main.py", line 406 in <module>
Aborted
Here is the generate() function in world_manager.py:
# Generate a chunk at given coordinates using pnoise2 and adding it to the chunk list
def generate(self, chunkx, chunky):
    # print("Generating chunk at", chunkx, chunky)
    GRASS = "grass"
    MOUNTAIN = "mountain"
    EMPTY = "void"
    floor_void_diff = 0.3
    mountain = floor_void_diff + 0.2
    factor = 3
    floor = {}
    items = {}
    chunk = (chunkx, chunky)
    if chunk not in self.chunks:
        # print("Generating chunk at {}".format(chunk))
        for y in range(chunky * CHUNKSIZE, chunky * CHUNKSIZE + CHUNKSIZE):
            for x in range(chunkx * CHUNKSIZE, chunkx * CHUNKSIZE + CHUNKSIZE):
                i = pnoise2(x/factor, y/factor, base=int(self.seed))
                print(i)
                if i >= floor_void_diff:
                    if i > mountain:
                        floor.update({(x, y): MOUNTAIN})
                    else:
                        floor.update({(x, y): GRASS})
                        spawner = random.randint(-1, ITEM_SPAWN_RATIO)
                        random_item = random.randint(0, len(ITEM_LIST)-1)
                        if spawner == 0:
                            items.update({(x, y): ITEM_LIST[random_item]})
                elif i < floor_void_diff:
                    floor.update({(x, y): EMPTY})
                else:
                    floor.update({(x, y): EMPTY})
        self.chunks.update({chunk: {"floor": floor, "items": items}})
        self.unsaved += 1
CHUNKSIZE is defined in a settings.py file, and right now CHUNKSIZE=4
I have not found a way to solve this problem, but I did find an alternative. It turns out that the way I was using the noise module breaks it somehow, so I started looking for other Perlin noise functions and found a pretty good one at https://rosettacode.org/wiki/Perlin_noise#Python
It looks like you're passing seed in for what is actually the octaves parameter. I don't know how this causes a segmentation fault, but a large number would make the addition loop run a wild number of iterations. I don't know whether this library (if it's indeed https://github.com/caseman/noise) actually supports seeding, but maybe I'm missing where it's supported.
Further, why not use the snoise2 function? pnoise2 (actual Perlin noise) creates a lot of 45 and 90 degree bias (image comparison, Perlin on top). The "Perlin" algorithm has the iconic name, but it's probably rarely the best choice for noise anymore. The snoise2 (Simplex noise 2D) is more subtle about grid alignment. This library (indirect shameless plug) does generate seedable noise in the Simplex category, and the output is a bit higher quality than the one in the other lib I would say, but a downside might be that it's implemented directly in Python instead of wrapping a potentially much faster C implementation. You would also need to implement octave summation / fractal brownian motion yourself if you were after that effect. The other lib provides it with the octaves parameter.
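As a rough sketch of that suggestion, the call in generate() could be swapped for snoise2 from the same noise package; the values here are placeholders, and how (or whether) it supports seeding is left aside:
from noise import snoise2

factor = 15.0
octaves = 1

# drop-in replacement for the pnoise2 call; x and y are the tile coordinates from the loop
value = snoise2(x / factor, y / factor, octaves=octaves)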

statsmodels - printing summary of ARMA fit throws error

I want to fit an ARMA(p,q) model to simulated data, y, and check the effect of different estimation methods on the results. However, fitting the same model object twice, like so,
model = tsa.ARMA(y,(1,1))
results_mle = model.fit(trend='c', method='mle', disp=False)
results_css = model.fit(trend='c', method='css', disp=False)
and printing the results
print results_mle.summary()
print results_css.summary()
generates the following error
File "C:\Anaconda\lib\site-packages\statsmodels\tsa\arima_model.py", line 1572, in summary
smry.add_table_params(self, alpha=alpha, use_t=False)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 885, in add_table_params
use_t=use_t)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 475, in summary_params
exog_idx]
IndexError: index 3 is out of bounds for axis 0 with size 3
If, instead, I do this
model1 = tsa.ARMA(y,(1,1))
model2 = tsa.ARMA(y,(1,1))
result_mle = model1.fit(trend='c',method='css-mle',disp=False)
print result_mle.summary()
result_css = model2.fit(trend='c',method='css',disp=False)
print result_css.summary()
no error occurs. Is that expected behaviour, or a bug that should be fixed?
BTW, I generated the ARMA process as follows:
from __future__ import division
import statsmodels.tsa.api as tsa
import numpy as np
# generate arma
a = -0.7
b = -0.7
c = 2
s = 10
y1 = np.random.normal(c/(1-a),s*(1+(a+b)**2/(1-a**2)))
e = np.random.normal(0,s,(100,))
y = [y1]
for t in xrange(e.size-1):
    arma = c + a*y[-1] + e[t+1] + b*e[t]
    y.append(arma)
y = np.array(y)
You could report this as a bug, even though it looks like a consequence of the current design.
Some attributes of the model change when the estimation method is changed, which should in general be avoided. Since both results instances access the same model, the older one is inconsistent with it in this case.
http://www.statsmodels.org/dev/pitfalls.html#repeated-calls-to-fit-with-different-parameters
In general, statsmodels tries to keep all parameters that change the model in model.__init__ rather than as arguments to fit, and to attach the outcome of fit to the Results instance.
However, this is not followed everywhere, especially not in older models that gained new options along the way.
trend is an example that is supposed to go into the ARMA.__init__ because it is now handled together with the exog (which is an ARMAX model), but wasn't in pure ARMA. The estimation method belongs in fit and should not cause problems like these.
Aside: There is a helper function to simulate an ARMA process that uses scipy.signal.lfilter and should be much faster than an iteration loop in Python.
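For reference, a sketch of how that helper could be used to simulate the same process; arma_generate_sample lives in statsmodels.tsa.arima_process, the lag polynomials need the zero-lag coefficient first (so ar = [1, -a], ma = [1, b]), and the innovation-scale argument is passed positionally here because its keyword name (sigma vs. scale) differs across statsmodels versions:
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

a, b, c, s = -0.7, -0.7, 2, 10
ar = np.array([1, -a])   # AR polynomial: (1 - a*L)
ma = np.array([1, b])    # MA polynomial: (1 + b*L)

# zero-mean ARMA(1,1) noise, shifted by the process mean implied by the constant c
y = c / (1 - a) + arma_generate_sample(ar, ma, 100, s)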

scipy.stats.kde and scipy.stats.kstest

How can I use scipy.stats.kde.gaussian_kde and scipy.stats.kstest together in a consistent way?
For example, the code:
from numpy import inf
import scipy.stats
my_pdf = scipy.stats.kde.gaussian_kde(sample)
scipy.stats.kstest(sample, lambda x: my_pdf.integrate_box_1d(-inf, x))
Gives the following answer:
(0.5396735893479544, 0.0)
This cannot be right, because the sample obviously comes from the distribution that was estimated from this very sample.
First of all, the right test to use for testing if two samples may have come from the same distribution is the two-sample KS test, implemented in scipy.stats.ks_2samp, which directly compares the empirical CDFs. KDE is density estimation, which smooths out the CDF, and is therefore a bunch of unnecessary work that also makes your estimate worse, statistically speaking.
But the reason you're seeing this problem is that the signature for your CDF parameter isn't quite right. kstest calls cdf(vals) (source), where vals is the sorted samples, to get out the CDF value for each of your samples. In your code, this ends up calling my_pdf.integrate_box_1d(-np.inf, samps), but integrate_box_1d wants both arguments to be scalars. The signature is wrong, and if you tried this with most arrays it'd crash with a ValueError:
>>> my_pdf.integrate_box_1d(-np.inf, samp[:10])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-81d0253a33bf> in <module>()
----> 1 my_pdf.integrate_box_1d(-np.inf, samp[:10])
/Library/Python/2.7/site-packages/scipy-0.12.0.dev_ddd617d_20120725-py2.7-macosx-10.8-x86_64.egg/scipy/stats/kde.pyc in integrate_box_1d(self, low, high)
311
312 normalized_low = ravel((low - self.dataset) / stdev)
--> 313 normalized_high = ravel((high - self.dataset) / stdev)
314
315 value = np.mean(special.ndtr(normalized_high) - \
ValueError: operands could not be broadcast together with shapes (10) (1,1000)
but unfortunately, when the second argument is samp, it can broadcast just fine since the arrays are the same shape, and then everything goes to hell. Presumably integrate_box_1d should check the shape of its arguments, but here's one way to do it correctly:
>>> my_cdf = lambda ary: np.array([my_pdf.integrate_box_1d(-np.inf, x) for x in ary])
>>> scipy.stats.kstest(sample, my_cdf)
(0.015597917205996903, 0.96809912578616597)
You could also use np.vectorize if you felt like it.
(But again, you probably actually want to use ks_2samp.)
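For completeness, here is what the two-sample route looks like; the data below are made up, standing in for whatever two samples you actually want to compare:
import numpy as np
from scipy import stats

sample_a = np.random.normal(size=1000)   # placeholder data
sample_b = np.random.normal(size=1000)   # placeholder data

d_stat, p_value = stats.ks_2samp(sample_a, sample_b)
print(d_stat, p_value)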

sklearn.gaussian_process fit() not working with array sizes greater than 100

I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using python2.6 and scikit-learn-0.10. Should I try python3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
The error "ValueError: X and y must have the same number of rows." means that, in your case, XKrn.shape[0] should be equal to yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same in my X and y.
In my case, the problem was that I was passing in multiple output features to fit against. Gaussian processes fit a single output feature.
The "number of rows" error was misleading, and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a GP for each feature.

Why are LASSO in sklearn (python) and matlab statistical package different?

I am using LassoCV from sklearn to select the best model by cross-validation. I found that cross-validation gives different results in sklearn and in the MATLAB statistical toolbox.
I used matlab and replicate the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the MATLAB data and tried to replicate the figure with lasso_path from sklearn, and I got
Although there are some similarities between these two figures, there are also certain differences. As far as I understand, the parameter lambda in MATLAB and alpha in sklearn are the same, yet in these figures they appear to differ. Can somebody point out which is correct, or am I missing something? Furthermore, the coefficients obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
    X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictos')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have MATLAB, but be careful: the value obtained with cross-validation can be unstable, because it is influenced by how you subdivide the samples.
Even if you run the cross-validation twice in Python, you can obtain two different results.
Consider this example:
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
It's possible that alpha = lambda / n_samples,
where n_samples = X.shape[0] in scikit-learn.
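If that relationship holds, converting the lambda values exported from MATLAB to scikit-learn's scale is a one-liner (a sketch under that assumption; lambdas_matlab is a hypothetical name for the fitinfo.Lambda array saved from MATLAB):
n_samples = X.shape[0]                       # 200 in this example
alphas_sklearn = lambdas_matlab / n_samples  # lambdas_matlab: hypothetical array of MATLAB lambdas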
Another remark: your path is not as piecewise linear as it could/should be. Consider reducing tol and increasing max_iter.
hope this helps
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do a good job of normalizing the X matrix itself (even if you specify the parameter normalize=True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiply alpha by 2.
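A sketch of those two suggestions combined, assuming X is a pandas DataFrame and Y is the target from the question's code:
from sklearn.linear_model import LassoCV

X_std = (X - X.mean()) / X.std()                  # column-wise standardization of the DataFrame
model = LassoCV(cv=10, max_iter=1000).fit(X_std, Y)
alpha_matlab_scale = 2 * model.alpha_             # per the remark above about the factor of 2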
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
Mathworks have selected an example and decided to include it in their documentation
Your MATLAB code produces exactly the same result as the example.
The alternative does not match the result, and has provided inaccurate results in the past
This is my assumption:
The chance that MathWorks have chosen to put an incorrect example in their documentation is negligible compared to the chance that a reproduction of this example in an alternate way does not give the correct result.
The logical conclusion: Your matlab implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.
