Instability in fitting data using Scipy Optimize library - python

I am attempting to fit a function to a set of data I have. The function in question is:
x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)
I have tried to fit my data (included at the bottom) to this using two different methods but have found that whatever I do the fit for B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit using curve_fit the error is almost two orders of magnitude higher than the value. Does anyone have some suggestions on what I should do to decrease the error?
import numpy as np
import scipy.optimize as spo

def modelFun(t, A, B):
    return -B + np.sqrt(A*B*(t - t0) + np.power(x0 + B, 2))

def errorFun(k, time, data):
    A = k[0]
    B = k[1]
    return np.sum((data - modelFun(time, A, B))**2)

data = np.genfromtxt('testdata.csv', delimiter=',', skip_header=1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]

minErrOut = spo.minimize(errorFun, [1, 1000], args=(time, xt), bounds=((0, None), (0, None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun, time, xt, p0=[1, 1000], method='dogbox', bounds=([-np.inf, 0], np.inf))

print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))
Datafile:
Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796

TL;DR
Your dataset is essentially linear and lacks observations at larger timescales. You can therefore capture A, which is proportional to the slope, while the model needs to keep B large (and potentially unbounded) to suppress the square-root trend.
This can be confirmed by developing the Taylor series of your model and analyzing the MSE surface associated with the regression.
In a nutshell, for this kind of dataset and the given model: accept A, don't trust B.
MCVE
First, let's reproduce your problem:
import io
import numpy as np
import pandas as pd
from scipy import optimize
stream = io.StringIO("""Time,x
201,2.67662
204,3.28159
...
273,14.27662
274,14.40796""")
data = pd.read_csv(stream)
# Origin Shift:
data = data.sub(data.iloc[0,:])
data = data.set_index("Time")
# Simplified model (after the origin shift, t0 = x0 = 0):
def model(t, A, B):
    return -B + np.sqrt(A*B*t + np.power(B, 2))

# NLLS fit:
parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
# parameters: array([3.23405915e-01, 1.59960168e+05])
# covariance: array([[ 3.65068730e-07, -3.93410484e+01],
#                    [-3.93410484e+01,  9.77198860e+12]])
The resulting fit is reasonable, but as you noticed the fitted parameters differ by several orders of magnitude, which can prevent the optimization from performing properly.
Notice that your dataset is quite linear. The observed effect is not surprising and is inherent to the chosen model: the B parameter must be several orders of magnitude bigger than A to keep the linear behaviour.
This claim is supported by analyzing the first terms of the Taylor series of the model.
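Expanding the square root for A*t/B << 1 (i.e. B large) gives, as a quick sketch:
x(t) = -B + B*sqrt(1 + A*t/B) ≈ A/2 * t - A^2/(8*B) * t^2
Fitting this truncated expansion directly: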
def taylor(t, A, B):
    return A/2*t - A**2/B*t**2/8

parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
parameters
# array([3.23396685e-01, 1.05237134e+09])
Unsurprisingly, the slope of your linear dataset can be captured, while the parameter B can be arbitrarily large and will cause floating-point arithmetic errors during optimization (hence the minimize warning below that you got).
Analyzing Error Surface
The problem can be reformulated as a minimization problem:
def objective(beta, t, x):
    return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))

result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))
#       fun: 0.6594398116927569
#  hess_inv: array([[8.03062155e-06, 2.94644208e-04],
#                   [2.94644208e-04, 1.14979735e-02]])
#       jac: array([2.07304955e-04, 6.40749931e-07])
#   message: 'Desired error not necessarily achieved due to precision loss.'
#      nfev: 389
#       nit: 50
#      njev: 126
#    status: 2
#   success: False
#         x: array([3.24090627e-01, 2.11891188e+03])
If we plot the MSE associated with your dataset, we get the following surface:
We have a canyon that is narrow along the A axis but appears unbounded, at least over the first few decades, along the B axis. This supports the observations in your post and comments, and it gives a technical insight into why B cannot be fitted properly.
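For reference, a minimal sketch of how such a surface can be produced, assuming the data, model and objective defined above (the grid ranges are illustrative):
import matplotlib.pyplot as plt

A_grid = np.linspace(0.1, 0.6, 100)   # around the recovered slope
B_grid = np.logspace(0, 6, 100)       # B spans many decades
mse = np.array([[objective((a, b), data.index, data["x"].values) for a in A_grid] for b in B_grid])
plt.pcolormesh(A_grid, B_grid, np.log10(mse), shading="auto")
plt.yscale("log")
plt.xlabel("A")
plt.ylabel("B")
plt.colorbar(label="log10(MSE)")
plt.show()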
Performing the same operation on a synthetic dataset:
t = np.linspace(0, 1000, 100)
x = model(t, 0.35, 20)
data = pd.DataFrame(x, index=t, columns=["x"])
so that the square-root shape appears in addition to the initial linear trend:
result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)
#       fun: 1.9284246829733202e-10
#  hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
#                   [-4.42855253e-03,  4.59219063e-01]])
#       jac: array([ 4.35726463e-03, -2.19158602e-05])
#   message: 'Desired error not necessarily achieved due to precision loss.'
#      nfev: 402
#       nit: 94
#      njev: 130
#    status: 2
#   success: False
#         x: array([ 0.34999987, 20.000013  ])
This version of the problem has the following MSE surface:
It shows a convex valley around the known solution, which explains why both parameters can be fitted when the acquisition covers sufficiently large times.
Notice the valley is strongly stretched, meaning that in this scenario the problem would benefit from normalization (parameter scaling).
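One way to apply such scaling, as a sketch only: with method='trf', curve_fit passes extra keyword arguments on to least_squares, so an x_scale giving rough characteristic magnitudes of A and B can be supplied (the values below are illustrative, for the synthetic dataset above):
parameters, covariance = optimize.curve_fit(
    model, data.index, data["x"].values,
    p0=(1, 100), method="trf",
    x_scale=(0.1, 10),   # rough magnitudes of A and B
)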

Related

Fitting data with an exponential law

I'd like to fit some data with an exponential function. I used scipy.optimize.curve_fit because I already used it for other fits. This time, there is an issue and I can't figure out what's wrong.
Here is what the data looks like when plotted (data.png); as you can see, it seems to follow an exponential law.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
data = np.array([
0., 1.93468444, 3.69735865, 5.38185988, 6.02549022,
6.69199075, 7.72316694, 8.08913061, 8.84570241, 8.69711608,
8.80038144, 9.78951087, 9.68486674, 10.06175145, 10.44039495,
10.0481156 , 9.76656204, 9.88581457, 9.81805445, 10.42432252,
10.41102239, 11.2911395 , 9.64866184, 9.98072231, 10.83644694,
10.24748571, 10.81333209, 10.75949899, 10.90367328, 10.42446764,
10.51441017, 10.73047737, 10.8159758 , 10.51013538, 10.02862504,
9.76352112, 10.64829309, 10.6293347 , 10.67752596, 10.34801542,
10.53158576, 10.92883362, 10.67002314, 10.37015825, 10.74876349,
10.12821343, 10.8974205 , 10.1591103 , 10.588377 , 11.92134556,
10.309095 , 11.1174362 , 10.72654524, 10.60890374, 10.37456491,
10.05935346, 11.21295863, 11.09013951, 10.60862773, 11.2558922 ,
11.24660234, 10.35981557, 10.81284365, 10.96113067, 10.22716439,
9.8394873 , 10.01892084, 10.38237311, 10.04920671, 10.87782442,
10.42438756, 10.05614503, 10.5446946 , 9.99974368, 10.76930547,
10.22164072, 10.36942999, 10.89888302, 10.47035428, 10.58157374,
11.12615892, 11.30866718, 10.33215937, 10.46723351, 10.54072701,
11.45027197, 10.45895588, 10.34176601, 10.78405493, 10.43964778,
10.34047484, 10.25099046, 11.05847515, 10.27408195, 10.27529163,
10.16568845, 10.86451738, 10.73205291, 10.73300649, 10.49463959,
10.03729782
])
t = np.linspace(0, 100, len(data))  # time array

def expo(x, a, b, c):  # exponential function for fitting
    return a * np.exp(b * x) + c

fig1, ax1 = plt.subplots()
ax1.plot(t, data, ".", label="data")
coefs = curve_fit(expo, t, data)[0]  # fitting
ax1.plot(t, expo(t, coefs[0], coefs[1], coefs[2]), "-", label="fit")
ax1.legend()
plt.show()
The problem is that curve_fit() returns very big or very small coefficients a,b and c while it should return something more like a = -10.5, b = -0.2, c = 10.5
The fitting process works by finding a local minimum of a loss function.
If the problem is unconstrained, there may be several such local minima,
each giving different values of parameters, and you may get a different one
than the one that you are expecting.
If you have a guess what the parameters should be, you can provide it to narrow the search:
# with an initial guess for values of a, b, c
coefs = curve_fit(expo, t, data, p0=[-10, -1, 10])[0]
The coefficients it produces are:
array([-10.48815244, -0.2091102 , 10.56699883])
Alternatively, you can specify bounds for the parameters:
# with lower and upper bounds for a, b, c
coefs = curve_fit(expo, t, data, bounds=([-20, -2, 0], [-10, 2, 20]))[0]
This gives the same results as above.
Probably a non-linear regression algorithm is implemented in your software.
"Guessed" initial values of the parameters are required to start the iterative process. If the user provides none, some initial values are evaluated by the software. That is often a cause of failure, because the computed initial values may be too far from the correct ones.
Some good initial values can be found by using a linear regression method, which doesn't require initial values. See the calculation below.
The result is:
If the accuracy of the above result is not sufficient according to some specified fitting criteria, a non-linear regression is necessary. In that case the above values of the parameters $a,b,c$ can be used as initial values to start the iterative calculation.
Note: the principle of the method which linearizes the non-linear regression as shown above is explained in: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
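For completeness, here is a minimal sketch of that linearization applied to the t and data arrays from the question. It is my reading of the linked reference, so treat the details (the trapezoidal cumulative integral and the two least-squares steps) as an assumption rather than the reference's exact procedure:
S = np.zeros_like(data)
S[1:] = np.cumsum(0.5 * (data[1:] + data[:-1]) * np.diff(t))   # cumulative integral of the data
# First linear regression: data - data[0] ≈ A1*(t - t[0]) + B1*S, where B1 estimates b
A1, B1 = np.linalg.lstsq(np.column_stack([t - t[0], S]), data - data[0], rcond=None)[0]
b0 = B1
# Second linear regression: data ≈ c0 + a0*exp(b0*t)
c0, a0 = np.linalg.lstsq(np.column_stack([np.ones_like(t), np.exp(b0 * t)]), data, rcond=None)[0]
print(a0, b0, c0)   # usable as p0 for curve_fit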
Here is what I tried: I used a negative b in np.exp.
def expo(x, a, b, c):
    return a * np.exp(-b * x) + c
>>> [-10.4881516   0.20911016  10.5669989 ]
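Presumably the fit that produced that output was something along these lines (a sketch; the initial guess is illustrative, not from the original answer):
coefs = curve_fit(expo, t, data, p0=[-10, 1, 10])[0]
print(coefs)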

scipy.integrate.ode gives up on integration

I am encountering a strange problem with scipy.integrate.ode. Here is a minimal working example:
import sys
import time
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from scipy.integrate import ode, complex_ode
def fun(t):
    return np.exp( -t**2 / 2. )

def ode_fun(t, y):
    a, b = y
    f = fun(t)
    c = np.conjugate(b)
    dt_a = -2j*f*c + 2j*f*b
    dt_b = 1j*f*a
    return [dt_a, dt_b]
t_range = np.linspace(-10., 10., 10000)
init_cond = [-1, 0]
trajectory = np.empty((len(t_range), len(init_cond)), dtype=np.complex128)
### setup ###
r = ode(ode_fun).set_integrator('zvode', method='adams', with_jacobian=False)
r.set_initial_value(init_cond, t_range[0])
dt = t_range[1] - t_range[0]
### integration ###
for i, t_i in enumerate(t_range):
    trajectory[i,:] = r.integrate(r.t+dt)
a_traj = trajectory[:,0]
b_traj = trajectory[:,1]
fun_traj = fun(t_range)
### plot ###
plt.figure(figsize=(10,5))
plt.subplot(121, title='ODE solution')
plt.plot(t_range, np.real(a_traj))
plt.subplot(122, title='Input')
plt.plot(t_range, fun_traj)
plt.show()
This code works correctly; in the output figure, the right panel shows the input and the left panel the ODE solution for the first variable (the ODE depends explicitly on the input variable).
So in principle my code is working. What is strange is that if I simply replace the integration range
t_range = np.linspace(-10., 10., 10000)
by
t_range = np.linspace(-20., 20., 10000)
I get an output in which the integrator has apparently just given up on the integration and left my solution as a constant. Why does this happen? How can I fix it?
Some things I've tested: it is clearly not a resolution problem; the integration steps are already really small. Instead, it seems that the integrator does not even bother calling the ODE function anymore after a few steps. I've tested that by including a print statement in ode_fun().
My current suspicion is that the integrator decided that my function is constant after it did not change significantly during the first few integration steps. Do I maybe have to set some tolerance levels somewhere?
Any help appreciated!
"My current suspicion is that the integrator decided that my function is constant after it did not change significantly during the first few integration steps." Your suspicion is correct.
ODE solvers typically have an internal step size that is adaptively adjusted based on error estimates computed by the solver. These step sizes are independent of the times for which the output is requested; the output at the requested times is computed using interpolation of the solution at the points computed at the internal steps.
When you start your solver at t = -20, apparently the input changes so slowly that the solver's internal step size becomes large enough that by the time the solver gets near t = 0, the solver skips right over the input pulse.
You can limit the internal step size with the option max_step of the set_integrator method. If I set max_step to 2.0 (for example),
r = ode(ode_fun).set_integrator('zvode', method='adams', with_jacobian=False,
                                max_step=2.0)
I get the output that you expect.
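For reference, the newer scipy.integrate.solve_ivp interface exposes the same knob. A minimal sketch, under the assumption that the complex-valued system is passed with a complex y0 (which solve_ivp requires to work in the complex domain):
from scipy.integrate import solve_ivp

sol = solve_ivp(ode_fun, (t_range[0], t_range[-1]),
                np.array(init_cond, dtype=complex),  # complex dtype so the solver integrates in the complex domain
                t_eval=t_range, max_step=2.0)        # cap the internal step so the pulse is not skipped
a_traj = sol.y[0]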

Numerical Stability of Forward Substitution in Python

I am implementing some basic linear equation solvers in Python.
I have currently implemented forward and backward substitution for triangular systems of equations (so very straightforward to solve!), but the precision of the solutions becomes very poor even with systems of about 50 equations (50x50 coefficient matrix).
The following code performs the forward/backward substitution:
FORWARD_SUBSTITUTION = 1
BACKWARD_SUBSTITUTION = 2

def solve_triang_subst(A: np.ndarray, b: np.ndarray,
                       substitution=FORWARD_SUBSTITUTION) -> np.ndarray:
    """Solves a triangular system via forward or backward substitution.

    A must be triangular. FORWARD_SUBSTITUTION means A should be
    lower-triangular, BACKWARD_SUBSTITUTION means A should be upper-triangular.
    """
    rows = len(A)
    x = np.zeros(rows, dtype=A.dtype)
    row_sequence = reversed(range(rows)) if substitution == BACKWARD_SUBSTITUTION else range(rows)
    for row in row_sequence:
        delta = b[row] - np.dot(A[row], x)
        cur_x = delta / A[row][row]
        x[row] = cur_x
    return x
I am using numpy and 64-bit floats.
Simple Testing Tool
I have set up a simple test suite which generates coefficient matrices and x vectors, computes b, and then uses forward or backward substitution to recover the x, comparing it to its known value for validity.
The following code performs these checks:
import numpy as np
import scipy.linalg as sp_la

RANDOM_SEED = 1984
np.random.seed(RANDOM_SEED)

def check(sol: np.ndarray, x_gt: np.ndarray, description: str) -> None:
    if not np.allclose(sol, x_gt, rtol=0.1):
        print("Found inaccurate solution:")
        print(sol)
        print("Ground truth (not achieved...):")
        print(x_gt)
        raise ValueError("{} did not work!".format(description))

def fuzz_test_solving():
    N_ITERATIONS = 100
    refine_result = True
    for mode in [FORWARD_SUBSTITUTION, BACKWARD_SUBSTITUTION]:
        print("Starting mode {}".format(mode))
        for iteration in range(N_ITERATIONS):
            N = np.random.randint(3, 50)
            A = np.random.uniform(0.0, 1.0, [N, N]).astype(np.float64)
            if mode == BACKWARD_SUBSTITUTION:
                A = np.triu(A)
            elif mode == FORWARD_SUBSTITUTION:
                A = np.tril(A)
            else:
                raise ValueError()
            x_gt = np.random.uniform(0.0, 1.0, N).astype(np.float64)
            b = np.dot(A, x_gt)
            x_est = solve_triang_subst(A, b, substitution=mode,
                                       refine_result=refine_result)
            # TODO report error and count, don't throw!
            # Keep track of error norm!!
            check(x_est, x_gt,
                  "Mode {} custom triang iteration {}".format(mode, iteration))

if __name__ == '__main__':
    fuzz_test_solving()
Note that the maximum size of a test matrix is 49x49. Even in this case, the system cannot always compute decent solutions, and fails by more than a margin of 0.1. Here's an example of such a failure (this is doing backward substitution, so the biggest error is in the 0th coefficient; all the test data are sampled uniformly from [0, 1[):
Solution found with Mode 2 custom triang iteration 24:
[ 0.27876067 0.55200497 0.49499509 0.3259397 0.62420183 0.47041149
0.63557676 0.41155446 0.47191956 0.74385864 0.03002819 0.4700286
0.37989592 0.56527691 0.15072607 0.05659282 0.52587574 0.82252197
0.65662833 0.50250729 0.74139748 0.10852731 0.27864265 0.42981232
0.16327331 0.74097937 0.24411709 0.96934199 0.890266 0.9183985
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
Ground truth (not achieved...)
[ 0.20881608 0.71009766 0.44735271 0.31169033 0.63982328 0.49075813
0.59669585 0.43844108 0.47764942 0.72222069 0.03497499 0.4707452
0.37679884 0.56439738 0.15120397 0.05635977 0.52616387 0.82230625
0.65670245 0.50251426 0.74139956 0.10845974 0.27864289 0.42981226
0.1632732 0.74097939 0.24411707 0.96934199 0.89026601 0.91839849
0.14842446 0.51806495 0.36966843 0.18227989 0.85399593 0.89615663
0.39819336 0.90445931 0.21430972 0.61212349 0.85205597 0.66758689
0.1793689 0.38067267 0.39104614 0.6765885 0.4118123 ]
I have also implemented the iterative refinement method described in Section 2.5 of [0], and while it did help a little, the results are still poor for larger matrices.
MATLAB Sanity Check
I also did this experiment in MATLAB, and even there, once there are more than 100 equations, the estimation error shoots up exponentially.
Here is the MATLAB code I used for this experiment:
err_norms = [];
range = 1:3:120;
for size=range
    A = rand(size, size);
    A = tril(A);
    x_gt = rand(size, 1);
    b = A * x_gt;
    x_sol = A\b;
    err_norms = [err_norms, norm(x_gt - x_sol)];
end
plot(range, err_norms);
set(gca, 'YScale', 'log')
And here is the resulting plot:
Main Question
My question is: Is this normal behavior, seeing as there is essentially no structure in the problem, given that I randomly generate the A matrix and x?
What about solving linear systems of 100s of equations for various practical applications? Are these limitations simply an accepted fact, and e.g., optimization algorithms are just naturally robust to these issues? Or am I missing some important facets of this problem?
[0]: Press, William H. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007.
There are no limitations. This is a very fruitful exercise through which we all come to realize that writing linear solvers is not that easy, and that's why LAPACK or its cousins in other languages are almost always used with full confidence.
You are being hit by almost-singular matrices, and because you are using MATLAB's backslash you don't see that MATLAB switches to a least-squares solution behind the scenes when near-singularity is hit. If you change A\b to linsolve(A,b), which restricts the solver to square systems, you'll probably see lots of warnings on your console.
I didn't test it because I don't have a license anymore, but written blindly, this should show you the condition numbers of the matrices at each step.
err_norms = [];
range = 1:3:120;
for i=1:40
    size = range(i);
    A = rand(size, size);
    A = tril(A);
    x_gt = rand(size, 1);
    b = A * x_gt;
    x_sol = linsolve(A,b);
    err_norms = [err_norms, norm(x_gt - x_sol)];
    zzz(i) = rcond(A);
end
semilogy(range, err_norms);
figure,semilogy(range,zzz);
Note that because you are drawing entries from a uniform distribution, it becomes more and more likely to hit matrices that are ill-conditioned (with respect to inversion), since the rows have a higher probability of being nearly rank-deficient. That's why the error grows and grows. Sprinkle in an identity matrix times a scalar and all errors should come back to eps*n levels.
But best of all, leave this to expert algorithms that have been tested over decades. It is really not that trivial to write any of these. You can read the Fortran codes; for example, dtrsm solves the triangular system.
On the Python side, you can use scipy.linalg.solve_triangular which uses ?trtrs routines from LAPACK.
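A minimal sketch of that route, echoing the remarks above (the matrix size and the scaled identity added for conditioning are illustrative):
import numpy as np
from scipy.linalg import solve_triangular

np.random.seed(0)
N = 49
A = np.tril(np.random.uniform(0.0, 1.0, (N, N)))  # same kind of random lower-triangular system
x_gt = np.random.uniform(0.0, 1.0, N)
b = A @ x_gt

x_lapack = solve_triangular(A, b, lower=True)      # LAPACK ?trtrs under the hood
print("cond(A) =", np.linalg.cond(A))
print("max abs error:", np.max(np.abs(x_lapack - x_gt)))

# Adding a scaled identity pushes the diagonal away from zero and improves conditioning:
print("cond(A + 5*I) =", np.linalg.cond(A + 5.0 * np.eye(N)))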

How to do Scipy curve fitting with error bars and obtain standard errors on fitting parameters?

I am trying to fit my data points. The fit without errors does not look that good, so now I am trying to fit the data taking the error at each point into account. My fit function is below:
def fit_func(x, a, b, c):
    return np.log10(a*x**b + c)
then my data points are below:
r = [ 0.00528039,0.00721161,0.00873037,0.01108928,0.01413011,0.01790143,0.02263833, 0.02886089,0.03663713,0.04659512,0.05921978,0.07540126,0.09593949, 0.12190075,0.15501736,0.19713563,0.25041524,0.3185025,0.40514023,0.51507869, 0.65489938,0.83278859,1.05865016,1.34624082]
logf = [-1.1020581079659384, -1.3966927245616112, -1.4571368537041418, -1.5032694247562564, -1.8534775558300272, -2.2715812166948304, -2.2627690390113862, -2.5275290780299331, -3.3798813619309365, -6.0, -2.6270989211307034, -2.6549656159564918, -2.9366845162570079, -3.0955026428779604, -3.2649261507250289, -3.2837123017838366, -3.0493752067042856, -3.3133647996463229, -3.0865051494299243, -3.1347499415910169, -3.1433062918466632, -3.1747394718538979, -3.1797597345585245, -3.1913094832146616]
Because my data, logf, is on a log scale, the error bar for each data point is not symmetric. The upper and lower error bars are below:
upper = [0.070648916083227764, 0.44346256268274886, 0.11928131794776076, 0.094260899008089094, 0.14357124858039971, 0.27236750587684311, 0.18877122991380402, 0.28707938182603066, 0.72011863806906318, 0, 0.16813325716948757, 0.13624929595316049, 0.21847915642008875, 0.25456116079315372, 0.31078368240910148, 0.23178227464741452, 0.09158189214515966, 0.14020538489677881, 0.059482730164901909, 0.051786777740678414, 0.041126467609954531, 0.034394612910981337, 0.027206248503368613, 0.021847333685597548]
lower = [0.06074797748043137, 0.21479225959441428, 0.093479845697059583, 0.077406149968278104, 0.1077175009766278, 0.16610073183912188, 0.13114254113054535, 0.17133966123838595, 0.57498950902908286, 2.9786837094190934, 0.12090437578535695, 0.10355760401838676, 0.14467588244034646, 0.15942693835964539, 0.17929440903034921, 0.15031667827534712, 0.075592499975030591, 0.10581886912443572, 0.05230849287772843, 0.04626422871423852, 0.03756658820680725, 0.03186944137872727, 0.025601929615431285, 0.02080073540367966]
I have the fitting as:
popt, pcov = optimize.curve_fit(fit_func, r, logf,sigma=[lower,upper])
logf_fit = fit_func(r,*popt)
But this is wrong. How can I do the curve fitting with scipy so that it includes the upper and lower errors? And how can I get the fitting errors of the fitting parameters a, b, c?
You can use scipy.optimize.leastsq with custom weights:
import scipy.optimize as optimize
import numpy as np

# redefine the lists as arrays
x = np.array(r)
y = np.array(logf)
errup = np.array(upper)
errlow = np.array(lower)

# model function
def fit_func(x, a, b, c):
    return np.log10(a*x**b + c)

# residual function with asymmetric weights
def my_error(V):
    a, b, c = V
    yfit = fit_func(x, a, b, c)
    weight = np.ones_like(yfit)
    weight[yfit > y] = errup[yfit > y]     # if the fit point is above the measurement, use the upper error
    weight[yfit <= y] = errlow[yfit <= y]  # else use the lower error
    return (yfit - y) / weight             # leastsq squares and sums these residuals internally

answer = optimize.leastsq(my_error, x0=[0.0001, -1, 0.0006])
a, b, c = answer[0]
print(a, b, c)
It works, but it is very sensitive to the initial values, since there is a log which can go into the wrong domain (negative numbers) and then it fails. Here I find a=9.14464745425e-06, b=-1.75179880756, c=0.00066720486385, which is pretty close to the data.
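To also get standard errors on the fitted parameters (the second part of the question), one option is sketched below: symmetrise the asymmetric errors, for example by averaging the upper and lower bars, pass them to curve_fit as sigma, and read the parameter uncertainties off the diagonal of the covariance matrix. The averaging choice is an assumption on my part, not part of the answer above:
sigma = 0.5 * (errup + errlow)   # crude symmetric error bar per point (nonzero everywhere for this data)
popt, pcov = optimize.curve_fit(fit_func, x, y, p0=[0.0001, -1, 0.0006],
                                sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))    # standard errors of a, b, c
print(popt, perr)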

using scipy.optimize.fmin_bfgs for logistic regression classifier leads divide by zero error

I'm trying to implement my binary logistic regression classifier and I decided to use scipy.optimize.fmin_bfgs to minimize the objective function (the negative log-likelihood):
L(w) = -sum_i [ y_i * log(sigma(z_i)) + (1 - y_i) * log(1 - sigma(z_i)) ]
The gradient of this objective function is calculated as:
dL/dw = sum_i (sigma(z_i) - y_i) * x_i
where sigma(z) = 1 / (1 + exp(-z)) is the sigmoid and z_i is the linear combination of the weight vector with the i-th (intercept-augmented) feature vector x_i.
Now I have my LogisticRegression class which has the sigmoid function to calculate the sigmoid, the loglikelihood function to calculate the loglikelihood, and gradient to calculate the gradient. Finally, I have my learn_classifier method which calls the optimize.fmin_bfgs function to find the best weight vector. My training data set consists of 2013 tuples and each tuple has 113 attributes where the first attribute is the outcome (taking either one or zero). Here is my code:
from features_reader import FeaturesReader
import numpy as np
from scipy import optimize
from scipy.optimize import check_grad

class LogisticRegression:
    def __init__(self, features_reader = FeaturesReader()):
        features_reader.read_features()
        fHeight = len(features_reader.feature_data)
        fWidth = len(features_reader.feature_data[0])
        tHeight = len(features_reader.test_data)
        tWidth = len(features_reader.test_data[0])
        self.training_data = np.zeros((fHeight, fWidth))
        self.testing_data = np.zeros((tHeight, tWidth))
        print 'training data size: ', self.training_data.shape
        print 'testing data size: ', self.testing_data.shape
        for index, item in enumerate(features_reader.feature_data):
            self.training_data[index, 0] = item['outcome']
            self.training_data[index, 1:] = np.array([value for key, value in item.items() if key!='outcome'])

    def sigmoid(self, v_x, v_weight):
        return 1.0/(1.0 + np.exp(-np.dot(v_x, v_weight[1:])+v_weight[0]))

    def loglikelihood(self, v_weight, v_x, v_y):
        return -1*np.sum(v_y*np.log(self.sigmoid(v_x, v_weight)) + (1-v_y)*(np.log(1-self.sigmoid(v_x, v_weight))))

    def gradient(self, v_weight, v_x, v_y):
        gradient = np.zeros(v_weight.shape[0])
        for row, y in zip(v_x, v_y):
            new_row = np.ones(1+row.shape[0])
            new_row[1:] = row
            y_prime = self.sigmoid(new_row[1:], v_weight)
            gradient += (y_prime-y)*new_row
        return gradient

    def learn_classifier(self):
        result = optimize.fmin_bfgs(f=self.loglikelihood,
                                    x0=np.zeros(self.training_data.shape[1]),
                                    fprime=self.gradient,
                                    args=(self.training_data[:,1:], self.training_data[:,0]))
        return result

def main():
    features_reader = FeaturesReader(filename = 'features.csv', features_file = 'train_filter1.arff')
    logistic_regression = LogisticRegression(features_reader)
    result = logistic_regression.learn_classifier()
    print result

if __name__ == "__main__":
    main()
The FeaturesReader class is the parser that reads the csv file, which I did not paste here. But I'm pretty sure the init function correctly parses the csv into a 2-D numpy array that represents the training data. This 2-D array has shape (2013, 113), where the first column is the training output. When I run the learn_classifier function it gives these warnings and terminates:
training data size: (2013, 113)
testing data size: (4700, 113)
logistic_regression.py:26: RuntimeWarning: overflow encountered in exp
return 1.0/(1.0 + np.exp(-np.dot(v_x, v_weight[1:])+v_weight[0]))
logistic_regression.py:30: RuntimeWarning: divide by zero encountered in log
return -1*np.sum(v_y*np.log(self.sigmoid(v_x, v_weight)) + (1-v_y)*(np.log(1-self.sigmoid(v_x, v_weight))))
logistic_regression.py:30: RuntimeWarning: invalid value encountered in multiply
return -1*np.sum(v_y*np.log(self.sigmoid(v_x, v_weight)) + (1-v_y)*(np.log(1-self.sigmoid(v_x, v_weight))))
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: nan
Iterations: 1
Function evaluations: 32
Gradient evaluations: 32
So I got these three errors: 1. divide by zero encountered in log, 2. overflow encountered in exp, 3. invalid value encountered in multiply. And the algorithm terminates after the first iteration, which is abnormal. I have no clue why these are happening. Do you think I did something wrong in calculating the loglikelihood and gradient? Where else could these errors originate? To be more specific, in my loglikelihood function the v_weight parameter is assumed to be 1D (shape = 113), my v_x is 2D with shape (2013, 112) (because I do not count the outcome column), and my v_y is 1D with shape (2013).
I've been dealing with this problem myself. The call to np.exp can very quickly create numbers large enough to overflow a float64. Even using numpy's longdouble datatype (which may provide more resolution, depending on your CPU) I ran into numerical issues.
I found that, at least in some cases, feature scaling and normalization solve the numerical problems. If you have n features, for each feature subtract its mean and then divide by its standard deviation. The numpy code is:
for col in range(X.shape[1]):       # one feature per column
    X[:, col] -= X[:, col].mean()
    X[:, col] /= X[:, col].std()
There's probably a more vectorized way to do it without the explicit loop. : )
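For instance, a vectorized equivalent could look like this (a sketch, assuming X holds one sample per row and one feature per column):
X = (X - X.mean(axis=0)) / X.std(axis=0)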
You can also look at my toy implementation of it here.
Alternatively, you could look at this technique, which describes taking the log of the exponential in order to prevent overflow.
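As a minimal sketch of that idea applied to the negative log-likelihood above (the function name is illustrative, not from the original post): np.logaddexp(0, z) computes log(1 + exp(z)) without overflow, and both log(sigmoid(z)) = -log(1 + exp(-z)) and log(1 - sigmoid(z)) = -log(1 + exp(z)) can be written with it.
import numpy as np

def stable_neg_loglikelihood(v_weight, v_x, v_y):
    # same linear term as in sigmoid() above: z = x . w[1:] - w[0]
    z = np.dot(v_x, v_weight[1:]) - v_weight[0]
    # negative log-likelihood, computed overflow-free
    return np.sum(v_y * np.logaddexp(0, -z) + (1 - v_y) * np.logaddexp(0, z))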
