Can someone point me to the docs that will explain what I'm seeing?
Pink stuff in a Jupyter notebook makes me think something is wrong.
Using PyMC3 (btw, it's an exercise for a class and I have no idea what I'm doing).
I plugged in the numbers, initially got an error about 0s on the diagonal, swapped alpha_est and rate_est to be 1/alpha_est and 1/rate_est (and stopped getting the error), but I still get the pink stuff.
This code came with the exercise:
# An initial guess for the gamma distribution's alpha and beta
# parameters can be made as described here:
# https://wiki.analytica.com/index.php?title=Gamma_distribution
alpha_est = np.mean(no_insurance)**2 / np.var(no_insurance)
beta_est = np.var(no_insurance) / np.mean(no_insurance)
# PyMC3 Gamma seems to use rate = 1/beta
rate_est = 1/beta_est
# Initial parameter estimates we'll use below
alpha_est, rate_est
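As a sanity check on the estimates above, here is a minimal NumPy sketch of the same method-of-moments formulas applied to synthetic data. The "true" parameters, the sample size and the seed are made up for illustration, and `data` merely stands in for `no_insurance`.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for this illustration
true_alpha, true_rate = 2.0, 0.5

rng = np.random.default_rng(0)
# NumPy's gamma sampler is parameterised by shape and scale (scale = 1/rate)
data = rng.gamma(shape=true_alpha, scale=1.0 / true_rate, size=200_000)

# Method of moments: mean = alpha * beta, variance = alpha * beta**2 (beta = scale)
alpha_est = np.mean(data) ** 2 / np.var(data)
beta_est = np.var(data) / np.mean(data)
rate_est = 1.0 / beta_est  # the quantity passed as `beta` to pm.Gamma

print(alpha_est, rate_est)
```

With enough samples the two printed values should land near the parameters used to generate the data, which supports treating `rate_est` (and not `beta_est`) as the starting point for the rate prior.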
And then the code I'm supposed to add:
Should the pink stuff make me nervous or do I just say "No errors, move on"?
=======
The "zero problem"
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 110, in run
self._start_loop()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
point, stats = self._compute_point()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-14-36f8e5cebbe5> in <module>
13 g = pm.Gamma('g', alpha=alpha_, beta=rate_, observed=no_insurance)
14
---> 15 trace = pm.sample(10000)
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
435 _print_step_hierarchy(step)
436 try:
--> 437 trace = _mp_sample(**sample_args)
438 except pickle.PickleError:
439 _log.warning("Could not pickle model, sampling singlethreaded.")
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, **kwargs)
967 try:
968 with sampler:
--> 969 for draw in sampler:
970 trace = traces[draw.chain - chain]
971 if (trace.supports_sampler_stats
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in __iter__(self)
391
392 while self._active:
--> 393 draw = ProcessAdapter.recv_draw(self._active)
394 proc, is_last, draw, tuning, stats, warns = draw
395 if self._progress is not None:
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
295 else:
296 error = RuntimeError("Chain %s failed." % proc.chain)
--> 297 raise error from old_error
298 elif msg[0] == "writing_done":
299 proc._readable = True
RuntimeError: Chain 0 failed.
Is the "hint" in the instructions here telling me I should use 1/rate_est?
You are now going to create your own PyMC3 model!
Use an exponential prior for alpha. Call this stochastic variable alpha_.
Similarly, use an exponential prior for the rate ( 1/𝛽 ) parameter in PyMC3's Gamma.
Call this stochastic variable rate_ (but it will be supplied as pm.Gamma's beta parameter). Hint: to set up a prior with an exponential distribution for 𝑥 where you have an initial estimate for 𝑥 of 𝑥0 , use a scale parameter of 1/𝑥0 .
Create your Gamma distribution with your alpha_ and rate_ stochastic variables and the observed data.
Perform 10000 draws.
The zero problem could be because you are sampling zeros from the exponential distribution.
Ah:
rate_est is 0.00021265346963636103
rate_ci = np.percentile(trace['rate_'], [2.5, 97.5])
rate_ci = [0.00022031, 0.00028109]
1/rate_est is 4702.486170152818
I can believe I am sampling zeros if I use rate_est.
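To see why the reciprocal matters, here is a small NumPy sketch. It assumes the exercise's "scale parameter" means the `lam` (rate) argument of `pm.Exponential`, whose mean is 1/lam. NumPy's sampler takes the scale instead, so the code converts. The sample size and seed are arbitrary.

```python
import numpy as np

rate_est = 0.00021265346963636103  # value reported above

rng = np.random.default_rng(0)

# Following the hint: lam = 1 / x0, so the prior mean 1/lam equals x0
lam_hint = 1.0 / rate_est
draws_hint = rng.exponential(scale=1.0 / lam_hint, size=100_000)

# Passing the estimate itself as lam puts the prior mean at 1/rate_est instead
lam_direct = rate_est
draws_direct = rng.exponential(scale=1.0 / lam_direct, size=100_000)

print(draws_hint.mean())    # close to rate_est
print(draws_direct.mean())  # thousands of times larger
```

Under that assumption, using the estimate directly centres the prior several orders of magnitude away from the value the data suggest, which would fit the sampler struggling at the start.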
I have doubts about your 1/alpha step. See this discussion: https://discourse.pymc.io/t/help-with-fitting-gamma-distribution/2630
The zero problem could be because you are sampling zeros from the exponential distribution.
You could look here: https://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html cell[6]
I think you are okay with the sampler output. You can check your distributions by using traceplot.
I have been trying to use the pymc3 package but have constantly been receiving errors. First off, when I import the pymc3 package, here is what happens:
import pymc3 as pm
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Afterwards, here is my code:
x = np.linspace(-5,5, 50) # Wavelength data. Here we have fifty points between -5 and 5
sigma = 1
mu = 0
A = 20
B = 100
# define underlying model -- Gaussian
y = A * np.exp( - (x - mu)**2 / (2 * sigma**2)) + B
y_noise = np.random.normal(0, 1, 50) # Let's add some noise
data = y+y_noise
# Set model
basic_model = pm.Model()
with basic_model:
    # Priors for unknown model parameters
    A = pm.Uniform("A", lower=0, upper=50)
    B = pm.Uniform("B", lower=0, upper=200)
    sigma = 1
    # Expected value of outcome
    y_m = A*np.exp(-(x)**2/2)+B
    # Likelihood of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=np.sqrt(data), observed=data)
# Now sample
with basic_model:
    # draw posterior samples
    trace = pm.sample_smc(100, parallel=True)
And here is the error output:
RemoteTraceback Traceback (most recent call
last) RemoteTraceback: """ Traceback (most recent call last): File
"/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py",
line 125, in worker
result = (True, func(*args, **kwds)) File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py",
line 51, in starmapstar
return list(itertools.starmap(args[0], args[1])) File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/sample_smc.py",
line 267, in sample_smc_int
smc.setup_kernel() File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/smc.py",
line 135, in setup_kernel
self.likelihood_logp_func = logp_forw([self.model.datalogpt], self.variables, shared) File
"/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/smc.py",
line 288, in logp_forw
f = theano_function([inarray0], out_list[0]) File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/init.py",
line 337, in function
fn = pfunc( File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/pfunc.py",
line 524, in pfunc
return orig_function( File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py",
line 1970, in orig_function
m = Maker( File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py",
line 1573, in init
self._check_unused_inputs(inputs, outputs, on_unused_input) File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py",
line 1745, in _check_unused_inputs
raise UnusedInputError(msg % (inputs.index(i), i.variable, err_msg)) theano.compile.function.types.UnusedInputError:
theano.function was asked to create a function computing outputs given
certain inputs, but the provided input variable at index 0 is not part
of the computational graph needed to compute the outputs: inarray. To
make this error into a warning, you can pass the parameter
on_unused_input='warn' to theano.function. To disable it completely,
use on_unused_input='ignore'. """
The above exception was the direct cause of the following exception:
UnusedInputError Traceback (most recent call
last) Input In [7], in <cell line: 2>()
1 # Now sample
2 with basic_model:
3 # draw posterior samples
----> 4 trace = pm.sample_smc(100, parallel=True)
File
~/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/sample_smc.py:196,
in sample_smc(draws, kernel, n_steps, start, tune_steps, p_acc_rate,
threshold, save_sim_data, save_log_pseudolikelihood, model,
random_seed, parallel, chains, cores)
194 loggers = [_log] + [None] * (chains - 1)
195 pool = mp.Pool(cores)
--> 196 results = pool.starmap(
197 sample_smc_int, [(*params, random_seed[i], i, loggers[i]) for i in range(chains)]
198 )
200 pool.close()
201 pool.join()
File
~/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py:372,
in Pool.starmap(self, func, iterable, chunksize)
366 def starmap(self, func, iterable, chunksize=None):
367 '''
368 Like map() method but the elements of the iterable are expected to
369 be iterables as well and will be unpacked as arguments. Hence
370 func and (a, b) becomes func(a, b).
371 '''
--> 372 return self._map_async(func, iterable, starmapstar, chunksize).get()
File
~/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py:771,
in ApplyResult.get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
UnusedInputError: theano.function was asked to create a function
computing outputs given certain inputs, but the provided input
variable at index 0 is not part of the computational graph needed to
compute the outputs: inarray. To make this error into a warning, you
can pass the parameter on_unused_input='warn' to theano.function. To
disable it completely, use on_unused_input='ignore'.
I am simply following a tutorial on Medium, so I don't think there is a problem with the code. I have a strong feeling that the problem arises from the way I installed the packages. I installed pymc3 in a conda environment by using these 3 commands:
conda install numpy scipy mkl
conda install theano pygpu
conda install pymc3
I have also tried installing pymc3 by following the guide from the developers on GitHub:
conda create -c conda-forge -n pymc3_env pymc3 theano-pymc mkl mkl-service
conda activate pymc3_env
I was able to replicate the issue in PyMC3. This appears to be something wrong with the SMC sampler when combined with multiprocessing. Changing to parallel=False will get the SMC sampler working. Changing to NUTS or Metropolis sampling will also work fine.
Please file a bug report on the PyMC repository. However, you may also want to try upgrading to PyMC v4 first.
I am computing these derivatives using the Monte Carlo approach for a generic call option. I am interested in this combined derivative (with respect to both S and sigma). When I do this with algorithmic differentiation, I get the error shown at the end of the page. What could be a possible solution? To explain the code, I am going to attach the formula used to compute the "X" in the code below:
import numpy as np
from jax import jit, grad, vmap
import jax.numpy as jnp
from jax import random
Underlying_asset = jnp.linspace(1.1,1.4,100)
volatilities = jnp.linspace(0.5,0.6,100)
def second_derivative_mc(S,vol):
    N = 100
    j,T,q,r,k = 10000,1.,0,0,1.
    S0 = jnp.array([S]).T #(Nx1) vector underlying asset
    C = jnp.identity(N)*vol #matrix of volatilities with 0 outside diagonal
    e = jnp.array([jnp.full(j,1.)])#(1xj) vector of "1"
    Rand = np.random.RandomState()
    Rand.seed(10)
    U= Rand.normal(0,1,(N,j)) #Random number for Brownian Motion
    sigma2 = jnp.array([vol**2]).T #Vector of variance Nx1
    first = jnp.dot(sigma2,e) #First part equation
    second = jnp.dot(C,U) #Second part equation
    X = -0.5*first+jnp.sqrt(T)*second
    St = jnp.exp(X)*S0
    P = jnp.maximum(St-k,0)
    payoff = jnp.average(P, axis=-1)*jnp.exp(-q*T)
    return payoff
greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
This is the error message:
> UnfilteredStackTrace Traceback (most recent call
> last) <ipython-input-78-0cc1da97ae0c> in <module>()
> 25
> ---> 26 greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
>
> 18 frames UnfilteredStackTrace: TypeError: Gradient only defined for
> scalar-output functions. Output had shape: (100,).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
> TypeError Traceback (most recent call
> last) /usr/local/lib/python3.7/dist-packages/jax/_src/api.py in
> _check_scalar(x)
> 894 if isinstance(aval, ShapedArray):
> 895 if aval.shape != ():
> --> 896 raise TypeError(msg(f"had shape: {aval.shape}"))
> 897 else:
> 898 raise TypeError(msg(f"had abstract value {aval}"))
> TypeError: Gradient only defined for scalar-output functions. Output had shape: (100,).
As the error message indicates, gradients can only be computed for functions that return a scalar. Your function returns a vector:
print(len(second_derivative_mc(1.1, 0.5)))
# 100
For vector-valued functions, you can compute the jacobian (which is similar to a multi-dimensional gradient). Is this what you had in mind?
from jax import jacobian
greek = vmap(jacobian(jacobian(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
Also, this is not what you asked about, but the function above will probably not work as you intend even if you solve the issue in the question. Numpy RandomState objects are stateful, and thus will generally not work correctly with jax transforms like grad, jit, vmap, etc., which require side-effect-free code (see Stateful Computations In JAX). You might try using jax.random instead; see JAX: Random Numbers for more information.
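If what you actually want is one mixed derivative per (S, vol) pair, another option is to write the pricing function for a single scalar pair so that it returns a scalar, and let vmap handle the batching. The sketch below is one way to do that with jax.random. The name `mc_call_price`, the single-asset simplification and the defaults are my assumptions, not your original model, and q and r are dropped because both were 0.

```python
import jax
import jax.numpy as jnp
from jax import grad, vmap

def mc_call_price(S, vol, key, n_paths=10_000, T=1.0, k=1.0):
    """Monte Carlo call price for ONE scalar spot S and ONE scalar volatility."""
    # Explicit PRNG key instead of a stateful NumPy RandomState
    U = jax.random.normal(key, (n_paths,))
    X = -0.5 * vol**2 * T + jnp.sqrt(T) * vol * U
    St = S * jnp.exp(X)
    # Averaging over paths gives a scalar, which is what grad requires
    return jnp.mean(jnp.maximum(St - k, 0.0))

key = jax.random.PRNGKey(10)
underlying = jnp.linspace(1.1, 1.4, 100)
vols = jnp.linspace(0.5, 0.6, 100)

# Differentiate with respect to vol (argnums=1), then with respect to S (argnums=0)
mixed = grad(grad(mc_call_price, argnums=1), argnums=0)

# Map over the paired inputs and share the same key (in_axes=None) across them
greek = vmap(mixed, in_axes=(0, 0, None))(underlying, vols, key)
print(greek.shape)
```

Keep in mind that the payoff has a kink at the strike, so second-order derivatives taken path by path like this can be biased. Treat it as an illustration of the mechanics rather than a validated estimator.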
I am trying to use the optuna library in Python to optimise parameters for recommender system models. The models are custom and look like standard fit-predict sklearn models (with get/set params methods).
What I do: a simple objective function that samples two parameters from a uniform int distribution, sets these params on the model, gets predictions from the model (there is no fit stage, as it is a simple model that uses the params only at the predict stage) and calculates a metric.
What I get: the first trial runs normally; it samples params and prints results to the log. But on the second and later trials I get strange errors (see the code below) that I can't solve or google. When I run the study with just 1 trial, everything is okay.
What I tried: rearranging parts of the objective function, putting the fit stage inside, calculating simpler metrics. Nothing helps.
Here is my objective function:
# getting train, test
# fitting model
self.model = SomeRecommender()
self.model.fit(train, some_other_params)
def objective(trial: optuna.Trial):
    # save study
    if path is not None:
        joblib.dump(study, some_path)
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # setting params to model
    params = {'alpha': alpha,
              'beta': beta}
    self.model.set_params(**params)
    # getting predict
    recs = self.model.predict(some_other_params)
    # metric computing
    metric_result = Metrics.hit_rate_at_k(recs, test, k=k)
    return metric_result
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
That's what I get on three trials:
[I 2019-10-01 12:53:59,019] Finished trial#0 resulted in value: 0.1. Current best value is 0.1 with parameters: {'alpha': 59.6135986324444, 'beta': 40.714559720597585}.
[W 2019-10-01 13:39:58,140] Setting status of trial#1 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
[W 2019-10-01 13:39:58,206] Setting status of trial#2 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
I can't understand where the problem is and why the first trial works. Please help.
Thank you!
Your code seems to have no problems.
I ran a simplified version of your code (see below), and it worked well in my environment:
import optuna
def objective(trial: optuna.Trial):
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # evaluating params
    return alpha + beta
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
Could you tell me about your environment in order to investigate the problem? (e.g., OS, Python version, Python interpreter (CPython, PyPy, IronPython or Jython), Optuna version)
> why the first trial is working.
This error is raised by optuna/samplers/tpe/sampler.py#558, and this line is only executed when the number of completed trials in the study is greater than zero.
BTW, you might be able to avoid this problem by using RandomSampler as follows:
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(direction='maximize', sampler=sampler)
Note that RandomSampler tends to optimize less well than TPESampler, which is Optuna's default sampler.
I'm trying to code my own logistic regression and compare different methods of maximizing the log-likelihood. Using the Newton-CG method, I get the error message "ValueError: setting an array element with a sequence". Reading around, it seems this error arises if the function to be minimized returns a non-scalar, but that is not the case here. I need the three methods given below to give (approximately) the same result, but when running on my real data, one does not converge, another gives a worse LL than the initial guess, and the third does not run at all.
Why do I get the ValueError message and how can I fix it?
My code (with dummy data, the real data is ~100 measurements) is as follows:
import numpy as np
from numpy import linalg
import scipy
from scipy.optimize import minimize
def CalcLL(beta,xinlist,yinlist):
    LL=0.0
    ncol=len(beta)
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    for i in range(len(yinlist)):
        LL=LL+np.where(yinlist[i]==1,np.log(pi[i]),np.log(1-pi[i]))
    return -LL
def Jacobian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    Jac=np.transpose(np.matrix(yinlist-pi))*np.matrix(xinlist)
    return Jac
def Hessian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    W=FindW(pi)
    Hes=np.matrix(np.transpose(xinlist))*(np.matrix(W)*np.matrix(xinlist))
    return Hes
def FindPi(xinlist,beta):
    rows=np.shape(xinlist)[0]# Number of rows in x_new
    cols=np.shape(xinlist)[1]# Number of columns in x_new
    expon=np.dot(xinlist,beta)
    expon=np.array(expon).reshape(rows,1)
    pi=np.exp(expon)/(1+np.exp(expon))
    return pi
def FindW(pi):
    W=np.zeros(len(pi)*len(pi)).reshape(len(pi),len(pi))
    for i in range(len(pi)):
        W[i,i]=float(pi[i]*(1-pi[i]))
    return W
xinlist=np.matrix([[1,1],[0,1],[1,1],[1,1],[1,1],[0,1],[0,1],[1,1],[1,1],[0,1]])
yinlist=np.transpose(np.matrix([0,0,0,0,0,1,1,1,1,1]))
ncol=np.shape(xinlist)[1]
beta1=np.zeros(ncol).reshape(ncol,1) # Initial guess for parameter values
limit=0.000001 # selfwritten Newton-Raphson method
iter_i=limit+1
while iter_i>limit:
    Hes=Hessian(beta1,xinlist,yinlist)
    Jac=np.transpose(Jacobian(beta1,xinlist,yinlist))
    root_diff=np.array(linalg.inv(Hes)*Jac)
    beta1=beta1+root_diff
    iter_i=np.sum(root_diff*root_diff)
print "When running self-written algorithm, the log-likelihood is",-CalcLL(beta1,xinlist,yinlist)
beta2=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta2,args=(xinlist,yinlist),method='Nelder-Mead',options={'xtol':1e-8,'disp':True,'maxiter':10000})
beta2=res.x
print "The log-likelihood using Nelder-Mead is", -CalcLL(beta2,xinlist,yinlist)
beta3=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
beta3=res.x
print "The log-likelihood using Newton-CG is", -CalcLL(beta3,xinlist,yinlist)
EDIT:
The errorstack is as follows:
Traceback (most recent call last):
File "MyLogisticRegression2.py", line 62, in
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
File C:\Python27\lib\site-packages\scipy\optimize_minimize.py, line 447, in minimize **options)
File C:\Python27\lib\site-packages\scipy\optimize\optimize.py, line 2393, in _minimize_newtoncg eta=numpy.min([0.5, numpy.sqrt(maggrad)])
File C:\Python27\lib\site-packages\numpy\core\fromnumeric.py, line 2393, in amin out=out, **kwargs)
File C:\Python27\lib\site-packages\numpy\core_methods.py, line 29, in _amin return umr_minimum(a,axis,None,out,keepdims)
ValueError: setting an array element with a sequence
I found out the problem arose from the beta arrays having shape (2,1) instead of (2,), and likewise for the Jacobian. Reshaping these two solved the problem.
Apparently the Newton-CG solver requires a 1-D array for the Jacobian.
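For anyone landing here later, this is a minimal sketch of the same fit written with plain 1-D arrays throughout, using the dummy data from the question. The function names are mine. The gradient and Hessian are those of the negative log-likelihood (the function being minimized), and the Hessian is passed as a callable rather than a precomputed matrix.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    z = X @ beta
    # -sum(y*log(p) + (1-y)*log(1-p)) rewritten with a stable log(1 + exp(z))
    return np.sum(np.logaddexp(0.0, z) - y * z)

def gradient(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)  # shape (n_features,), not (n_features, 1)

def hessian(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (X * (p * (1.0 - p))[:, None])  # shape (n_features, n_features)

# Dummy data from the question, as plain ndarrays instead of np.matrix
X = np.array([[1, 1], [0, 1], [1, 1], [1, 1], [1, 1],
              [0, 1], [0, 1], [1, 1], [1, 1], [0, 1]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

beta0 = np.zeros(X.shape[1])  # 1-D initial guess, shape (2,)
res = minimize(neg_log_likelihood, beta0, args=(X, y), method='Newton-CG',
               jac=gradient, hess=hessian, options={'xtol': 1e-8})
print(res.x, -res.fun)
```

Because `minimize` works on the negative log-likelihood here, the sign of the gradient has to match that function and not the log-likelihood itself.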
I'm trying to apply my own custom distance metric function when using a KNN regression model.
My dataset is a mixture of nominal, ordinal, numeric and binary fields.
Code:
def cus_distance(array1, array2, **kwargs):
    # calculate the distance, return a float
    pass
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance)
# train_data is a pandas dataframe obj
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
The last line will cause an exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-284-04520b227b8a> in <module>()
----> 1 knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
587 X, y = check_arrays(X, y, sparse_format="csr")
588 self._y = y
--> 589 return self._fit(X)
590
591
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
214 self._tree = BallTree(X, self.leaf_size,
215 metric=self.effective_metric_,
--> 216 **self.effective_metric_kwds_)
217 elif self._fit_method == 'kd_tree':
218 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
318
319 """
--> 320 return array(a, dtype, copy=False, order=order)
321
322 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: Unknown
I know this error is caused by string values ('Unknown' is one of them) in my dataset.
This confuses me: in my understanding, the function cus_distance should take care of these str values, and KNeighborsRegressor should just use the return value of my function.
Q:
* Is this the right way to use a custom defined distance metric in KNN Regression?
* If it is, why do I get this exception?
* If not, what is the right way?
The Ball Tree and KD Tree require floating point data, regardless of the metric used. If your data cannot be converted to floating point, then you will get this sort of error.
>>> import numpy as np
>>> data = [1, "Unknown", 2]
>>> np.asarray(data, dtype=float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
----> 1 np.asarray(data, dtype=float)
ValueError: could not convert string to float: Unknown
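One way around this that does not require touching scikit-learn is to map the string categories to numeric codes before fitting, and have the custom metric compare the codes. Below is a minimal NumPy sketch of that idea. The toy rows, the column layout and the particular distance (0/1 mismatch plus absolute difference) are invented for illustration, not taken from the question's data.

```python
import numpy as np

# Hypothetical mixed-type rows: (nominal category, numeric value)
raw = [("red", 1.0), ("Unknown", 2.5), ("blue", 0.5), ("red", 3.0)]

# Give every distinct category an integer code so the table becomes purely numeric
categories, codes = np.unique([row[0] for row in raw], return_inverse=True)
X = np.column_stack([codes.astype(float), [row[1] for row in raw]])

def cus_distance(a, b):
    """0/1 mismatch on the coded nominal column plus |difference| on the numeric one."""
    nominal = 0.0 if a[0] == b[0] else 1.0
    numeric = abs(a[1] - b[1])
    return nominal + numeric

print(X.dtype)                     # the array now converts to float without error
print(cus_distance(X[0], X[1]))    # different category, numeric gap of 1.5
```

An array like `X` passes the float conversion shown above, so a callable of this shape could then be tried as the `metric` argument, with the codes treated as labels and never as magnitudes.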
Thanks jakevdp.
scikit-learn supports Brute Force, Ball Tree and KD Tree, and according to jakevdp's answer, the only one I can use is the Brute Force algorithm, so my code changes to:
knn = neighbors.KNeighborsRegressor(weights='distance', metric=cus_distance, algorithm='brute')
knn.fit(train_data.ix[:, fields_list], train_data['time_costs'])
This time it no longer raises an error. Thanks jakevdp!
But a new problem came up when I tried to use this knn object:
knn.predict(check_data.ix[:, fields_list])
This causes the same error as in my question. So I looked into scikit-learn's source code and found that this line causes the error:
elif callable(metric):
    # Check matrices first (this is usually done by the metric).
    X, Y = check_pairwise_arrays(X, Y)
    n_x, n_y = X.shape[0], Y.shape[0]
The function check_pairwise_arrays tries to convert all values to float, so "Unknown" causes the error again.
I think this is a kind of bug: scikit-learn's built-in metrics don't support mixed-type datasets, so I wrote a custom metric function, but this line still forces the dataset to be purely float.
And as the comment above this line says, the checking should be done by the custom metric, so I just commented out this line and reloaded the module, and my knn object works perfectly now :)
PS: I'm working on pushing this change to the official scikit-learn GitHub repo.