I recently tried to use the scipy.odr package to conduct a regression analysis. Whenever I try to load a list of data whose elements depend on a function, a ValueError is raised:
ValueError: x could not be made into a suitable array
I have been using the same approach to make fits with scipy's leastsq and curve_fit routines without problems.
Any idea of what to change and how to proceed? Thanks a lot...
Here I include a minimal working example:
from scipy import odr
from functools import partial
import numpy as np
import matplotlib.pyplot as plt
### with select=0, myModel is given a list of elements that are functions of the fit parameters
### this results in the error message: ValueError: x could not be made into a suitable array
### with select=1, the function temp is excluded and a fit is generated
### what do I have to do in order to run the program successfully using select=0?
## choose here!
select=1
pfit=[1.0,1.0]
q0=[1,2,3,4,5]
q1=[3,8,10,19,27]
def temp(par, val):
    p1, p2 = par
    temp_out = p1*val**p2
    return temp_out

def fodr(a, x):
    if select == 0:
        fitf = np.array([xi(a) for xi in x])
    else:
        fitf = a[0]*x**a[1]
    return fitf
# define model
myModel = odr.Model(fodr)
# load data
damy = q1
if select == 0:
    damx = []
    for el in q0:
        elm = partial(temp, val=el)
        damx.append(elm)
    #damx = [el(pfit) for el in damx]  # check that the function temp works fine
    #print damx
else:
    damx = q0
myData = odr.Data(damx, damy)
myOdr = odr.ODR(myData, myModel , beta0=pfit, maxit=100, ifixb=[1,1])
out = myOdr.run()
out.pprint()
Edit:
@Robert: Thanks for your reply. I am using scipy version '0.14.0'. Using select==0 in my minimal example, I get the following traceback:
Traceback (most recent call last):
File "scipy-odr.py", line 48, in <module>
out = myOdr.run()
File "/home/tg/anaconda/lib/python2.7/site-packages/scipy/odr/odrpack.py", line 1061, in run
self.output = Output(odr(*args, **kwds))
ValueError: x could not be made into a suitable array
In short, your code does not work because damx is now a list of functools.partial objects.
scipy.odr is a thin wrapper around the Fortran orthogonal distance regression library ODRPACK; both xdata and ydata have to be numerical, since they will be converted to a Fortran type under the hood. It doesn't know what to do with a list of functools.partial objects, hence the error.
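For what it's worth, a minimal sketch of the numeric route, which just reuses the check the question already has commented out (whether evaluating at the fixed pfit values is acceptable depends on what the fit is meant to do):

if select == 0:
    # evaluate each functools.partial at fixed parameters so that
    # damx becomes a plain numeric list before it reaches odr.Data
    damx = [el(pfit) for el in damx]
myData = odr.Data(damx, damy)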
I'm working on solving the following ODE with PyMC3:
def production( y, t, p ):
    return p[0]*getBeam( t ) - p[1]*y[0]
getBeam( t ) is my time-dependent coefficient. The coefficients are given by an array of data that is indexed by time as follows:
def getBeam( t ):
    nBeam = I[int(t/10)]*pow( 10, -6 )/q_e
    return nBeam
I have successfully implemented it using scipy.integrate.odeint, but I am having a hard time doing it with pymc3.ode. In fact, using the following:
ode_model = DifferentialEquation(func=production, times=x, n_states=1, n_theta=3, t0=0)

with pm.Model() as model:
    a = pm.Uniform( "S-Factor", lower=0.01, upper=100 )
    ode_solution = ode_model(y0=[0], theta=[a, Yield, lambd])
Unsurprisingly, I get the error TypeError: __trunc__ returned non-Integral (type TensorVariable), since t is a TensorVariable and therefore cannot be used to index the array in which the coefficients are stored.
Is there a way to overcome this difficulty? I thought about using theano.function, but I cannot get it working since, unfortunately, the coefficients cannot be expressed by any analytical function: they are just stored in the array I, whose index represents the time variable t.
Thanks
Since you already have a working implementation with scipy.integrate.odeint, you could use theano.compile.ops.as_op, though it comes with some inconveniences (see how to fit a method belonging to an instance with pymc3? and How to write a custom Deterministic or Stochastic in pymc3 with theano.op?)
Using your exact definitions for production and getBeam, the following code seems to work for me:
from scipy.integrate import odeint
from theano.compile.ops import as_op
import theano.tensor as tt
import pymc3 as pm
def ScipySolveODE(a):
    return odeint(production, y0=[0], t=x, args=([a, Yield, lambd],)).flatten()

@as_op(itypes=[tt.dscalar], otypes=[tt.dvector])
def TheanoSolveODE(a):
    return ScipySolveODE(a)

with pm.Model() as model:
    a = pm.Uniform( "S-Factor", lower=0.01, upper=100 )
    ode_solution = TheanoSolveODE(a)
Sorry I know this is more of a workaround than an actual solution...
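One caveat, based on how as_op behaves (so treat this as an assumption): an op created with as_op provides Theano with no gradient, so gradient-based samplers such as NUTS cannot be used with it. A minimal sketch of sampling with a gradient-free step method instead:

with model:
    # as_op gives Theano no gradient information, so fall back
    # to a gradient-free sampler such as Metropolis
    trace = pm.sample(1000, step=pm.Metropolis())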
I am working on .wav signals using Python 3.5, trying to extract MFCCs, MFCC deltas, MFCC delta-deltas, and other signal features. But an error is raised only for the MFCC delta:
Traceback (most recent call last):
mfcc_delta = librosa.feature.delta(mfcc)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\librosa\feature\utils.py", line 116, in delta
**kwargs)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 337, in savgol_filter
coeffs = savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 139, in savgol_coeffs
coeffs, _, _, _ = lstsq(A, y)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\linalg\basic.py", line 1226, in lstsq
% (-info, lapack_driver))
ValueError: illegal value in 4-th argument of internal None
I am working on the following code:
import numpy as np
import librosa
from scipy import signal
from scipy.signal import butter, filtfilt
import scipy.stats

def preprocess_cough(x, fs, cutoff=6000, normalize=True, filter_=True, downsample=True):
    # Preprocess data
    fs_downsample = cutoff  # assumption: the otherwise unused cutoff parameter is the target rate
    fs_new = fs
    if len(x.shape) > 1:
        x = np.mean(x, axis=1)  # convert to mono
    if normalize:
        x = x/(np.max(np.abs(x)) + 1e-17)  # normalize to the range -1 to 1
    if filter_:
        b, a = butter(4, fs_downsample/fs, btype='lowpass')  # 4th-order Butterworth lowpass filter
        x = filtfilt(b, a, x)
    if downsample:
        x = signal.decimate(x, int(fs/fs_downsample))  # downsample (decimate applies an anti-aliasing filter)
        fs_new = fs_downsample
    return np.float32(x), fs_new
audio_data = 'F:/test/'
files = librosa.util.find_files(audio_data, ext=['wav'])
myFile = files[0]  # assumption: process the first file found
x, fs = librosa.load(myFile, sr=48000)
arr, f = preprocess_cough(x, fs)
mfcc = librosa.feature.mfcc(y=arr, sr=f, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
When I remove the MFCC calculations and compute the other wav signal features, the error does not appear. I have also tried removing the n_mfcc=13 parameter, but the error is still raised.
A sample of the output and the shape of the mfcc variable:
[-3.86701782e+02 -4.14421021e+02 -4.67373749e+02 -4.76989105e+02
-4.23713501e+02 -3.71329285e+02 -3.47003693e+02 -3.19309082e+02
-3.29547089e+02 -3.32584625e+02 -2.78399109e+02 -2.43284348e+02
-2.47878128e+02 -2.59308533e+02 -2.71102844e+02 -2.87314514e+02
-2.58869965e+02 -6.01125565e+01 1.66160011e+01 -8.58060551e+00
-8.49179382e+01 -9.29880371e+01 -9.96001358e+01 -1.04499428e+02
-3.65511665e+01 -3.82106819e+01 -8.69802475e+01 -1.22267052e+02
-1.70187592e+02 -2.35996841e+02 -2.96493286e+02 -3.39086365e+02
-3.59514771e+02]
and the shape is (13,33)
Can anyone help me, please?
Thanks in advance
Somewhat similarly to the issue raised in this question, the problem is related to the intricacies of the underlying numerical operations that librosa defers to SciPy. SciPy depends on a LAPACK library being installed, so first I would check whether you have it installed.
Also, you may want to debug the script step by step, stepping into SciPy to examine the actual values that percolate from librosa.feature.delta to scipy.signal.savgol_filter; cross-checking them against the documentation may tell you the reason.
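A quick way to check both suggestions (a sketch; scipy.show_config is the standard introspection call, and the finite-value check is only a guess at a common trigger for LAPACK errors):

import numpy as np
import scipy

# show which BLAS/LAPACK libraries SciPy was built against
scipy.show_config()

# a NaN or Inf in the MFCC matrix would also break the least-squares
# call inside savgol_filter
print(np.isfinite(mfcc).all())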
I have a problem using orthogonal distance regression with a function that uses the error function math.erf(). To be more clear, it seems I have a problem with the variables that should be fitted.
import numpy as np
from scipy import odr
import math

def CDF(x, a, b, c):
    # definition of the error function used to fit the data
    return c/2*(1 + math.erf((x - a)/(b*math.sqrt(2))))

# to make things a little shorter I excluded the parts in which the data is read and prepared
guess = Ergebnis_popt[1]  # using the values for a, b, c found by curve_fit without errors
guess_a = guess[0]
guess_b = guess[1]
guess_c = guess[2]

xerr = Verschiebung_fit[2]  # determination of the x errors; the y errors are included in the prepared data
xerr = xerr/xerr
xerr = xerr/200

# building the model for the odr fit; as the erf function seems to only handle single values, I use np.vectorize
cfd = odr.Model(np.vectorize(CDF, excluded=['a','b','c']), extra_args=['a','b','c'])

# definition of the data with errors for the fit
mydata = odr.RealData(Verschiebung_fit[2], y=y_values_fit[2], sx=xerr, sy=y_values_fit[4])

# basis for the odr fit
odr = odr.ODR(mydata, cfd, beta0=[guess_a, guess_b, guess_c])
myoutput = odr.run()
myoutput.pprint()
All this results in an error:
File "C:\Users\Public\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 2785, in _get_ufunc_and_otypes
outputs = func(*inputs)
File "C:\Users\Public\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 2750, in func
return self.pyfunc(*the_args, **kwargs)
TypeError: CDF() takes 4 positional arguments but 5 were given
I am a novice Python user. I guess there is a problem with how the variables to be fitted are passed, but I cannot figure out where this problem occurs.
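For reference, a minimal sketch of the calling convention odr.Model expects, which may be where the extra positional argument comes from (my reading of the documentation, not a tested fix): the model function receives the packed parameter vector first and the data array second, and since scipy.special.erf is already vectorized, np.vectorize and extra_args should not be needed:

from scipy import special

def CDF_odr(beta, x):
    # beta packs the fit parameters a, b, c; x is the data array
    a, b, c = beta
    return c/2*(1 + special.erf((x - a)/(b*np.sqrt(2))))

cfd = odr.Model(CDF_odr)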
I'm trying to code my own logistic regression and compare different methods of maximizing the log-likelihood. Using the Newton-CG method, I get the error message "ValueError: setting an array element with a sequence". Reading around, it seems this error arises if the function to be minimized returns a non-scalar, but that is not the case here. I need the three methods given below to give (approximately) the same result, but when running on my real data, one does not converge, another gives a worse LL than the initial guess, and the third does not run at all.
Why do I get the ValueError message and how can I fix it?
My code (with dummy data; the real data is ~100 measurements) is as follows:
import numpy as np
from numpy import linalg
import scipy
from scipy.optimize import minimize
def CalcLL(beta,xinlist,yinlist):
    LL=0.0
    ncol=len(beta)
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    for i in range(len(yinlist)):
        LL=LL+np.where(yinlist[i]==1,np.log(pi[i]),np.log(1-pi[i]))
    return -LL

def Jacobian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    Jac=np.transpose(np.matrix(yinlist-pi))*np.matrix(xinlist)
    return Jac

def Hessian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    W=FindW(pi)
    Hes=np.matrix(np.transpose(xinlist))*(np.matrix(W)*np.matrix(xinlist))
    return Hes

def FindPi(xinlist,beta):
    rows=np.shape(xinlist)[0] # number of rows in x_new
    cols=np.shape(xinlist)[1] # number of columns in x_new
    expon=np.dot(xinlist,beta)
    expon=np.array(expon).reshape(rows,1)
    pi=np.exp(expon)/(1+np.exp(expon))
    return pi

def FindW(pi):
    W=np.zeros(len(pi)*len(pi)).reshape(len(pi),len(pi))
    for i in range(len(pi)):
        W[i,i]=float(pi[i]*(1-pi[i]))
    return W

xinlist=np.matrix([[1,1],[0,1],[1,1],[1,1],[1,1],[0,1],[0,1],[1,1],[1,1],[0,1]])
yinlist=np.transpose(np.matrix([0,0,0,0,0,1,1,1,1,1]))
ncol=np.shape(xinlist)[1]

beta1=np.zeros(ncol).reshape(ncol,1) # initial guess for parameter values
limit=0.000001 # self-written Newton-Raphson method
iter_i=limit+1
while iter_i>limit:
    Hes=Hessian(beta1,xinlist,yinlist)
    Jac=np.transpose(Jacobian(beta1,xinlist,yinlist))
    root_diff=np.array(linalg.inv(Hes)*Jac)
    beta1=beta1+root_diff
    iter_i=np.sum(root_diff*root_diff)
print "When running the self-written algorithm, the log-likelihood is",-CalcLL(beta1,xinlist,yinlist)

beta2=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta2,args=(xinlist,yinlist),method='Nelder-Mead',options={'xtol':1e-8,'disp':True,'maxiter':10000})
beta2=res.x
print "The log-likelihood using Nelder-Mead is",-CalcLL(beta2,xinlist,yinlist)

beta3=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
beta3=res.x
print "The log-likelihood using Newton-CG is",-CalcLL(beta3,xinlist,yinlist)
EDIT:
The error stack is as follows:
Traceback (most recent call last):
  File "MyLogisticRegression2.py", line 62, in <module>
    res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
  File "C:\Python27\lib\site-packages\scipy\optimize\_minimize.py", line 447, in minimize
    **options)
  File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 2393, in _minimize_newtoncg
    eta = numpy.min([0.5, numpy.sqrt(maggrad)])
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2393, in amin
    out=out, **kwargs)
  File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: setting an array element with a sequence
I found out that the problem arose from the beta arrays having shape (2,1) instead of (2,), and likewise for the Jacobian. Reshaping these two solved the problem.
Apparently the Newton-CG solver needs 1-d arrays for the Jacobian.
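A sketch of the fix described above (it also passes the Hessian function itself rather than the precomputed matrix Hes, which looks like a second problem in the original call):

beta3 = np.zeros(ncol)  # shape (2,) instead of (2,1)

def Jacobian1d(beta, xinlist, yinlist):
    # same Jacobian as before, flattened to 1-d for Newton-CG
    return np.asarray(Jacobian(beta, xinlist, yinlist)).ravel()

res = minimize(CalcLL, beta3, args=(xinlist, yinlist), method='Newton-CG',
               jac=Jacobian1d, hess=Hessian, options={'xtol': 1e-8, 'disp': True})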
I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv()

if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)
    print d.rvs(size=100)  # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last):
  File "testDistributions.py", line 19, in <module>
    print d.rvs(size=100)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs
    vals = self._rvs(*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs
    Y = self._ppf(U,*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf
    return self.vecfunc(q,*args)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in __call__
    theout = self.thefunc(*newargs)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call
    return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq
    r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)
Specifically, regarding your traceback:
rvs uses the inverse of the cdf, the ppf, to create random numbers. Since you are not specifying a ppf, it is calculated by a root-finding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at which the function is zero (find x such that cdf(x) = q, where q is the quantile).
The defaults for the limits, xa and xb, are too small in your example. The following works for me with scipy 0.9.0; xa and xb can be set when creating the distribution instance:
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this; the easiest approach is to follow some examples (and to ask on the mailing list).
edit: addition
pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random samples can be generated by sampling from the data and adding Gaussian noise, so this will be faster than the generic rvs, which uses the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.
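For example, drawing directly from the KDE is a one-liner (assuming the kernel instance from above):

samples = kernel.resample(100).flatten()  # 100 draws from the KDE, flattened to 1-d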
precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.
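A rough sketch of that interpolation idea (untested, as said; the grid limits and resolution are arbitrary, and kernel is the gaussian_kde instance from above):

grid = numpy.linspace(-200, 200, 1001)  # support grid, matching xa/xb above
cdf_vals = numpy.array([kernel.integrate_box_1d(-numpy.Inf, g) for g in grid])

def approx_ppf(q):
    # invert the cdf by linear interpolation on the precomputed grid
    return numpy.interp(q, cdf_vals, grid)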
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, sets the attribute self._size (the size of the requested array of random variables), and then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape arguments of the .rvs method work in the multivariate case. These classes are designed for univariate distributions and might not fully work in the multivariate case, or might need some reshaping.