How can I use a custom distance metric for KNeighborsRegressor? - python

I'm trying to apply my own custom distance metric function when using a KNN regression model.
My dataset is a mixture of nominal, ordinal, numeric, and binary fields.
Code:
def cus_distance(array1, array2, **kwargs):
    # calculate the distance, return a float
    pass

knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance)
# train_data is a pandas dataframe obj
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
The last line will cause an exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-284-04520b227b8a> in <module>()
----> 1 knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
587 X, y = check_arrays(X, y, sparse_format="csr")
588 self._y = y
--> 589 return self._fit(X)
590
591
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
214 self._tree = BallTree(X, self.leaf_size,
215 metric=self.effective_metric_,
--> 216 **self.effective_metric_kwds_)
217 elif self._fit_method == 'kd_tree':
218 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
318
319 """
--> 320 return array(a, dtype, copy=False, order=order)
321
322 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: Unknown
I know this error is caused by string values in my dataset (the 'Unknown' is one of them).
This confuses me: in my understanding, the function cus_distance should take care of these string values, and KNeighborsRegressor should just use my function's return value.
Q:
* Is this the right way to use a custom-defined distance metric in KNN regression?
* If it is, why did I get this exception?
* If not, what is the right way?

The Ball Tree and KD Tree require floating point data, regardless of the metric used. If your data cannot be converted to floating point, then you will get this sort of error.
>>> import numpy as np
>>> data = [1, "Unknown", 2]
>>> np.asarray(data, dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
----> 1 np.asarray(data, dtype=float)
ValueError: could not convert string to float: Unknown
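(An aside beyond the original answer: if you want to keep the tree-based algorithms, one option is to encode the string categories as numbers before fitting. A minimal sketch, assuming a hypothetical pandas DataFrame with one object column; the column names are made up.)
import pandas as pd

# hypothetical mixed-type frame; 'Unknown' is just another category label
df = pd.DataFrame({'color': ['red', 'Unknown', 'blue'],
                   'size': [1.0, 2.0, 3.0]})

# map each string category to an integer code so BallTree/KDTree can build
df['color'] = df['color'].astype('category').cat.codes
X = df.values.astype(float)  # now safe to pass to KNeighborsRegressor.fit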

Thanks @jakevdp.
scikit-learn supports Brute Force, Ball Tree, and KD Tree, and according to @jakevdp's answer, the only one I can use is the brute-force algorithm, so my code changed to:
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance, algorithm='brute')
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
This time it no longer raises an error. Thanks, jakevdp!
But a new problem appeared when I tried to use this knn object:
knn.predict(check_data.ix[:, fields_list])
This raises the same error as in my question. So I looked into scikit-learn's source code and found the lines that cause it:
elif callable(metric):
    # Check matrices first (this is usually done by the metric).
    X, Y = check_pairwise_arrays(X, Y)
    n_x, n_y = X.shape[0], Y.shape[0]
The function check_pairwise_arrays tries to convert all values to float, so "Unknown" causes the error again.
I think this is a kind of bug: scikit-learn's built-in metrics don't support mixed-type datasets, and even though I wrote a custom metric function, this line still forces the dataset to be pure float.
As the comment above the line says, the checking work is usually done by the metric itself, so I just commented out this line and reloaded the module, and my knn object works perfectly now :)
PS: I'm working on pushing this change to the official scikit-learn GitHub repo.
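(Also beyond the original thread: a minimal sketch of what a mixed-type cus_distance could look like, a simplified Gower-style distance. The column-index and range arguments are assumptions to be passed via the metric's kwargs; whether raw strings ever reach the callable depends on your scikit-learn version, as described above.)
def cus_distance(a, b, num_idx=(), cat_idx=(), num_range=None):
    # simplified Gower-style distance for mixed data:
    # scaled absolute difference for numeric columns,
    # 0/1 mismatch for categorical columns
    num_range = num_range or {}
    d = 0.0
    for j in num_idx:
        r = num_range.get(j, 1.0) or 1.0  # column range (max - min), for scaling
        d += abs(float(a[j]) - float(b[j])) / r
    for j in cat_idx:
        d += 0.0 if a[j] == b[j] else 1.0
    return d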

Related

SMOTE is giving array size / ValueError for all-categorical dataset

I am using SMOTE-NC for oversampling my categorical data. I have only 1 feature and 10500 samples.
While running the below code, I am getting the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
16 print(X_new.shape) # (10500, 1)
17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
81 )
82
---> 83 output = self._fit_resample(X, y)
84
85 y_ = (label_binarize(output[1], np.unique(y))
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
926
927 X_continuous = X[:, self.continuous_features_]
--> 928 X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
929 X_minority = _safe_indexing(
930 X_continuous, np.flatnonzero(y == class_minority)
~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
592 " a minimum of %d is required%s."
593 % (n_features, array.shape, ensure_min_features,
--> 594 context))
595
596 if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:
ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.
Code:
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=27, categorical_features=[0,])
X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())
print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)
X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence changing the shape of X_new
print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)
If I understand correctly, the shape of X_new should be (n_samples, n_features), which is 10500 x 1. I am not sure why the ValueError reports it as shape=(10500, 0).
Can someone please help me here?
I have reproduced your issue adapting the example in the docs for a single categorical feature in the data:
from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=1, n_redundant=0,
                           flip_y=0, n_features=1, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
# simulate the only column to be a categorical feature
X[:, 0] = RandomState(10).randint(0, 4, size=(1000))
X.shape
# (1000, 1)
sm = SMOTENC(random_state=42, categorical_features=[0,])  # same behavior with categorical_features=[0]
X_res, y_res = sm.fit_resample(X, y)
which gives the same error:
ValueError: Found array with 0 feature(s) (shape=(1000, 0)) while a minimum of 1 is required.
The reason is actually quite simple, but you have to dig a little into the original SMOTE paper; quoting from the relevant section (emphasis mine):
While our SMOTE approach currently does not handle data sets with all
nominal features, it was generalized to handle mixed datasets of
continuous and nominal features. We call this approach Synthetic
Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We
tested this approach on the Adult dataset from the UCI repository. The
SMOTE-NC algorithm is described below.
Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal
features differ between a sample and its potential nearest neighbors,
then this median is included in the Euclidean distance computation. We
use median to penalize the difference of nominal features by an amount
that is related to the typical difference in continuous feature
values.
Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being
identified (minority class sample) and the other feature vectors
(minority class samples) using the continuous feature space. For every
differing nominal feature between the considered feature vector and
its potential nearest-neighbor, include the median of the standard
deviations previously computed, in the Euclidean distance computation.
In other words, and although not stated explicitly, it is apparent that, in order for the algorithm to work, it needs at least one continuous feature. This is not the case here, so the algorithm rather unsurprisingly fails.
I guess that, internally, during step 1 (median computation), the algorithm temporarily removes all categorical features from the data; in doing so here, it is faced indeed with a shape of (1000, 0) (or (10500, 0) in your case), i.e. no data, hence the specific reference in the error message.
So, there is not any actual programming issue here to be remedied; what you are trying to do is simply impossible with the SMOTE-NC algorithm (notice that the very initials NC in the algorithm's name stand for Nominal-Continuous).
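(An addition beyond the original answer: for a dataset with no continuous features at all, plain random oversampling is a commonly suggested alternative, since it duplicates existing samples rather than interpolating between them. A minimal sketch with imbalanced-learn's RandomOverSampler, reusing the X_new/Y_new names from the question:)
from imblearn.over_sampling import RandomOverSampler

# X_new: shape (n_samples, 1), the single categorical column; Y_new: labels
ros = RandomOverSampler(random_state=27)
X_res, y_res = ros.fit_resample(X_new, Y_new)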

PyMC3, NUTS sampler, what's happening here?

Can someone point me to the docs that will explain what I'm seeing?
Pink stuff in a Jupyter notebook makes me think something is wrong.
Using PyMC3 (btw, it's an exercise for a class and I have no idea what I'm doing).
I plugged in the numbers, initially got an error about 0s on the diagonal, swapped alpha_est and rate_est to be 1/alpha_est and 1/rate_est (and stopped getting the error), but I still get the pink stuff.
This code came with the exercise:
# An initial guess for the gamma distribution's alpha and beta
# parameters can be made as described here:
# https://wiki.analytica.com/index.php?title=Gamma_distribution
alpha_est = np.mean(no_insurance)**2 / np.var(no_insurance)
beta_est = np.var(no_insurance) / np.mean(no_insurance)
# PyMC3 Gamma seems to use rate = 1/beta
rate_est = 1/beta_est
# Initial parameter estimates we'll use below
alpha_est, rate_est
And then the code I'm supposed to add:
Should the pink stuff make me nervous or do I just say "No errors, move on"?
=======
The "zero problem"
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 110, in run
self._start_loop()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
point, stats = self._compute_point()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-14-36f8e5cebbe5> in <module>
13 g = pm.Gamma('g', alpha=alpha_, beta=rate_, observed=no_insurance)
14
---> 15 trace = pm.sample(10000)
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
435 _print_step_hierarchy(step)
436 try:
--> 437 trace = _mp_sample(**sample_args)
438 except pickle.PickleError:
439 _log.warning("Could not pickle model, sampling singlethreaded.")
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, **kwargs)
967 try:
968 with sampler:
--> 969 for draw in sampler:
970 trace = traces[draw.chain - chain]
971 if (trace.supports_sampler_stats
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in __iter__(self)
391
392 while self._active:
--> 393 draw = ProcessAdapter.recv_draw(self._active)
394 proc, is_last, draw, tuning, stats, warns = draw
395 if self._progress is not None:
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
295 else:
296 error = RuntimeError("Chain %s failed." % proc.chain)
--> 297 raise error from old_error
298 elif msg[0] == "writing_done":
299 proc._readable = True
RuntimeError: Chain 0 failed.
is the "hint" in the instructions here telling me I should use 1/rate_est?
You are now going to create your own PyMC3 model!
Use an exponential prior for alpha. Call this stochastic variable alpha_.
Similarly, use an exponential prior for the rate ( 1/𝛽 ) parameter in PyMC3's Gamma.
Call this stochastic variable rate_ (but it will be supplied as pm.Gamma's beta parameter). Hint: to set up a prior with an exponential distribution for 𝑥 where you have an initial estimate for 𝑥 of 𝑥0 , use a scale parameter of 1/𝑥0 .
Create your Gamma distribution with your alpha_ and rate_ stochastic variables and the observed data.
Perform 10000 draws.
The zero problem could be because you are sampling zeros from the exponential distribution.
Ah:
rate_est is 0.00021265346963636103
rate_ci = np.percentile(trace['rate_'], [2.5, 97.5])
rate_ci = [0.00022031, 0.00028109]
1/rate_est is 4702.486170152818
I can believe I am sampling zeros if I use rate_est.
I have doubts about your 1/alpha step. See this discussion: https://discourse.pymc.io/t/help-with-fitting-gamma-distribution/2630
You could look here: https://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html cell[6]
I think you are okay with the sampler output. You can check your distributions by using traceplot.
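(An illustration beyond the original answer: a minimal sketch of the model the quoted instructions describe. With pm.Exponential, lam = 1/x0 gives the prior a mean of x0, matching the exercise's hint; no_insurance is the exercise's data array.)
import numpy as np
import pymc3 as pm

# initial estimates, as in the exercise's starter code
alpha_est = np.mean(no_insurance)**2 / np.var(no_insurance)
rate_est = np.mean(no_insurance) / np.var(no_insurance)

with pm.Model():
    # Exponential(lam) has mean 1/lam, so lam = 1/x0 centers the prior on x0
    alpha_ = pm.Exponential('alpha_', lam=1 / alpha_est)
    rate_ = pm.Exponential('rate_', lam=1 / rate_est)
    g = pm.Gamma('g', alpha=alpha_, beta=rate_, observed=no_insurance)
    trace = pm.sample(10000)

pm.traceplot(trace)  # sanity-check the posterior draws for alpha_ and rate_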

Why does `scipy.interpolate.griddata` fail for readonly arrays?

I have some data which I try to interpolate using scipy.interpolate.griddata. In my use-case I marked some of the numpy arrays read-only, which apparently breaks the interpolation:
import numpy as np
from scipy import interpolate
x0 = 10 * np.random.randn(100, 2)
y0 = np.random.randn(100)
x1 = np.random.randn(3, 2)
x0.flags.writeable = False
# x1.flags.writeable = False
interpolate.griddata(x0, y0, x1)
yields the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-a6e09dbdd371> in <module>()
6 # x1.flags.writeable = False
7
----> 8 interpolate.griddata(x0, y0, x1)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.pyc in griddata(points, values, xi, method, fill_value, rescale)
216 ip = LinearNDInterpolator(points, values, fill_value=fill_value,
217 rescale=rescale)
--> 218 return ip(xi)
219 elif method == 'cubic' and ndim == 2:
220 ip = CloughTocher2DInterpolator(points, values, fill_value=fill_value,
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.NDInterpolatorBase.__call__ (scipy/interpolate/interpnd.c:3930)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator._evaluate_double (scipy/interpolate/interpnd.c:5267)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator._do_evaluate (scipy/interpolate/interpnd.c:6006)()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/interpnd.so in View.MemoryView.memoryview_cwrapper (scipy/interpolate/interpnd.c:17829)()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/interpolate/interpnd.so in View.MemoryView.memoryview.__cinit__ (scipy/interpolate/interpnd.c:14104)()
ValueError: buffer source array is read-only
Clearly, the interpolation function doesn't like the arrays being write-protected. However, I don't understand why it would want to change them: I certainly don't expect my input to be mutated by a call to an interpolation function, and as far as I can tell the documentation doesn't mention it either. Why would the function behave like this?
Note that setting x1 readonly instead of x0 leads to a similar error.
The relevant code is written in Cython, and when Cython requests a memoryview of the input array, it always asks for a writeable one, even when it never writes to it.
Since an array flagged as non-writeable will refuse to provide a writeable memoryview, the code fails, even though it didn't need to write to the array in the first place.
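(A workaround, not part of the original answer: pass writable copies so the memoryview request can succeed. np.array copies by default, so this is enough:)
import numpy as np
from scipy import interpolate

x0 = 10 * np.random.randn(100, 2)
y0 = np.random.randn(100)
x1 = np.random.randn(3, 2)
x0.flags.writeable = False

# np.array(...) makes a fresh, writable copy, so the Cython code is satisfied
interpolate.griddata(np.array(x0), y0, np.array(x1))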

scipy.stats.kde and scipy.stats.kstest

How can I use scipy.stats.kde.gaussian_kde and scipy.stats.kstest in a conformal way?
For example, the code:
from numpy import inf
import scipy.stats
my_pdf = scipy.stats.kde.gaussian_kde(sample)
scipy.stats.kstest(sample, lambda x: my_pdf.integrate_box_1d(-inf, x))
Gives the following answer:
(0.5396735893479544, 0.0)
This cannot be right, because the sample obviously belongs to the distribution that was constructed from this very sample.
First of all, the right test to use for testing if two samples may have come from the same distribution is the two-sample KS test, implemented in scipy.stats.ks_2samp, which directly compares the empirical CDFs. KDE is density estimation, which smooths out the CDF, and is therefore a bunch of unnecessary work that also makes your estimate worse, statistically speaking.
But the reason you're seeing this problem is that the signature of your CDF parameter isn't quite right. kstest calls cdf(vals) (source), where vals is the array of sorted samples, to get the CDF value for each of your samples. In your code, this ends up calling my_pdf.integrate_box_1d(-np.inf, samp) with the whole sample array, but integrate_box_1d wants both arguments to be scalars. The signature is wrong, and with most arrays it would crash with a ValueError:
>>> my_pdf.integrate_box_1d(-np.inf, samp[:10])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-81d0253a33bf> in <module>()
----> 1 my_pdf.integrate_box_1d(-np.inf, samp[:10])
/Library/Python/2.7/site-packages/scipy-0.12.0.dev_ddd617d_20120725-py2.7-macosx-10.8-x86_64.egg/scipy/stats/kde.pyc in integrate_box_1d(self, low, high)
311
312 normalized_low = ravel((low - self.dataset) / stdev)
--> 313 normalized_high = ravel((high - self.dataset) / stdev)
314
315 value = np.mean(special.ndtr(normalized_high) - \
ValueError: operands could not be broadcast together with shapes (10) (1,1000)
but unfortunately, when the second argument is the full samp array, its shape (1000,) broadcasts against the KDE dataset's (1, 1000) just fine, and then everything goes to hell. Presumably integrate_box_1d should check the shape of its arguments, but here's one way to do it correctly:
>>> my_cdf = lambda ary: np.array([my_pdf.integrate_box_1d(-np.inf, x) for x in ary])
>>> scipy.stats.kstest(sample, my_cdf)
(0.015597917205996903, 0.96809912578616597)
You could also use np.vectorize if you felt like it.
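For instance, this is equivalent to the list comprehension above (same my_pdf as before):
>>> my_cdf = np.vectorize(lambda x: my_pdf.integrate_box_1d(-np.inf, x))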
(But again, you probably actually want to use ks_2samp.)
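(An added illustration of that suggestion: ks_2samp compares two samples' empirical CDFs directly, with no density estimation step. A minimal sketch on synthetic data:)
>>> import numpy as np
>>> from scipy import stats
>>> np.random.seed(0)
>>> a = np.random.randn(1000)  # two samples from the same distribution
>>> b = np.random.randn(1000)
>>> stats.ks_2samp(a, b)       # expect a large p-value here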

Python + GNU Plot: dealing with missing values

For clarity I have isolated my problem and used a small but complete snippet to describe it.
I have a bunch of data, but there are a lot of missing pieces. I want to ignore these (a break in the graph if it were a line graph). I have set "?" to be the symbol for missing data. Here is my snippet:
import math
import Gnuplot
gp = Gnuplot.Gnuplot(persist=1)
gp("set datafile missing '?'")
x = range(1000)
y = [math.sin(a) + math.cos(a) + math.tan(a) for a in x]
# Force a piece of missing data
y[4] = '?'
data = Gnuplot.Data(x, y, title='Plotting from Python')
gp.plot(data);
gp.hardcopy(filename="pyplot.png",terminal="png")
But it doesn't work:
> python missing_test.py
Traceback (most recent call last):
File "missing_test.py", line 8, in <module>
data = Gnuplot.Data(x, y, title='Plotting from Python')
File "/usr/lib/python2.6/dist-packages/Gnuplot/PlotItems.py", line 560, in Data
data = utils.float_array(data)
File "/usr/lib/python2.6/dist-packages/Gnuplot/utils.py", line 33, in float_array
return numpy.asarray(m, numpy.float32)
File "/usr/lib/python2.6/dist-packages/numpy/core/numeric.py", line 230, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
What's going wrong?
Gnuplot is calling numpy.asarray to convert your Python list into a numpy array.
Unfortunately, this command (with dtype=numpy.float32) is incompatible with a Python list that contains strings.
You can reproduce the error like this:
In [36]: np.asarray(['?',1.0,2.0],np.float32)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.6/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
228
229 """
--> 230 return array(a, dtype, copy=False, order=order)
231
232 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
Furthermore, the Gnuplot python module (version 1.7) docs say
There is no provision for missing data points in array data (which
gnuplot allows via the 'set missing' command).
I'm not sure if this has been fixed in version 1.8.
How married are you to gnuplot? Have you tried matplotlib?
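(To make that suggestion concrete, a minimal sketch of the same plot in matplotlib, which simply leaves a gap at NaN values; this is an addition, not part of the original answer:)
import math
import matplotlib.pyplot as plt

x = range(1000)
y = [math.sin(a) + math.cos(a) + math.tan(a) for a in x]
y[4] = float('nan')  # matplotlib breaks the line at NaN, giving a gap
plt.plot(x, y)
plt.savefig('pyplot.png')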
