I have two variables x and y which I am trying to fit using curve_fit from scipy.optimize.
The equation that fits the data is a simple power law of the form y=a(x^b). The fit looks good when I set the x and y axes to log scale, i.e. ax.set_xscale('log') and ax.set_yscale('log').
Here is the code:
def fitfunc(x,p1,p2):
    y = p1*(x**p2)
    return y
popt_1,pcov_1 = curve_fit(fitfunc,x,y,p0=(1.0,1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = (ngal_mstar_1) - fitfunc(x,p1_1,p1_2)
xi_sq_1 = sum(residuals1**2) #The chi-square value
curve_y_1 = fitfunc(x,p1_1,p1_2) #This is the fit line seen in the graph
fig = plt.figure(figsize=(14,12))
ax1 = fig.add_subplot(111)
ax1.scatter(x,y,c='r')
ax1.plot(y,curve_y_1,'y.',linewidth=1)
ax1.legend(loc='best',shadow=True,scatterpoints=1)
ax1.set_xscale('log') #Scale is set to log
ax1.set_yscale('log') #SCale is set to log
plt.show()
When I use true log-log values for x and y, the power-law fit becomes y=10^(a+b*log(x)), i.e. the right-hand side is raised as a power of 10 because the logarithm is base 10. Now both my x and y values are log(x) and log(y).
The fit for the above does not seem to be good. Here is the code I have used:
def fitfunc(x,p1,p2):
    y = 10**(p1+(p2*x))
    return y
popt_1,pcov_1 = curve_fit(fitfunc,np.log10(x),np.log10(y),p0=(1.0,1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = (y) - fitfunc((x),p1_1,p1_2)
xi_sq_1 = sum(residuals1**2)
curve_y_1 = fitfunc(np.log10(x),p1_1,p1_2) #The fit line uses log(x) here itself
fig = plt.figure(figsize=(14,12))
ax1 = fig.add_subplot(111)
ax1.scatter(np.log10(x),np.log10(y),c='r')
ax1.plot(np.log10(y),curve_y_1,'y.',linewidth=1)
plt.show()
THE ONLY DIFFERENCE BETWEEN THE TWO PLOTS IS THE FITTING EQUATIONS, AND FOR THE SECOND PLOT THE VALUES HAVE BEEN LOGGED INDEPENDENTLY. Am I doing something wrong here? I want a log(x) vs log(y) plot and the corresponding fit parameters (slope and intercept).
Your transformation of the power-law model to log-log is wrong, i.e. your second fit actually fits a different model: it compares log(y) with 10^(a+b*log(x)), so the right-hand side is exponentiated again while the left-hand side is already logged. Take your original model y=a*(x^b) and apply the logarithm on both sides; you will get log(y) = log(a) + b*log(x). Thus, your model in log-scale should simply read y' = a' + b*x', where the primes indicate variables in log-scale. The model is now a linear function, the well-known result that all power laws become linear functions in log-log.
That said, you can still expect some small differences in the two versions of your fit, since curve_fit will optimise the least-squares problem. Therefore, in log scale, the fit will minimise the relative error between the fit and the data, while in linear scale, the fit will minimise the absolute error. Thus, in order to decide which way is actually the better domain for your fit, you will have to estimate the error in your data. The data you show certainly does not have a constant uncertainty in log-scale, so on linear scale your fit might be more faithful. If details about the error in each data-point are known, then you could consider using the sigma parameter. If that one is used properly, there should not be much difference in the two approaches. In that case, I would prefer the log-scale fitting, as the model is simpler and therefore likely to be more numerically stable.
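As a rough illustration of the log-log approach described above, the sketch below fits a straight line to log10(x) and log10(y) with curve_fit and converts the intercept back to the power-law prefactor. The data are synthetic (a = 2.5 and b = 1.7 with multiplicative noise are assumptions made only for this demonstration), and the names line, a_fit and b_fit are illustrative, not taken from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(log_x, log_a, b):
    # Power law in log-log form: log10(y) = log10(a) + b*log10(x)
    return log_a + b * log_x

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(0)
x = np.logspace(0, 3, 30)
y = 2.5 * x**1.7 * 10**rng.normal(0.0, 0.02, x.size)

# Fit the linear model to the logged data
popt, pcov = curve_fit(line, np.log10(x), np.log10(y), p0=(1.0, 1.0))
log_a_fit, b_fit = popt
a_fit = 10**log_a_fit  # convert the intercept back to the prefactor a

print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")
```

Here b_fit is the slope of the log-log plot and log_a_fit its intercept, which are the two parameters the question asks for.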
I am trying to fit a curve with the curve_fit function in SciPy. Changing the initial values of the model changes the quality of the fit, but I am not able to find the best fit through my data. Here is what my fit looks like:
My question is how I can improve this fit and what the best way of selecting the initial values of the model is.
I have attached the raw data to which I want to fit an exponential curve.
This is the data which I am using
y = [ 338.52656636 337.43934446 348.25434126 308.42768639 279.24436171
269.85992004 279.24436171 249.25992615 239.53215125 219.96215705
220.41993469 220.30549028 220.30549028 195.07049776 180.364391
171.20883816 180.24994659 180.13550218 180.47883541 209.89104892
220.19104587 180.02105777 595.45426801 324.50712607 150.60884426
170.97994934 171.20883816 170.75106052 170.75106052 159.76439711
140.88106937 150.37995544 140.88106937 1620.70451979 140.42329173
150.37995544 140.53773614 284.68047121 1146.84743797 170.97994934
150.60884426 145.74495682 141.10995819 121.53996399 121.19663076
131.38218329 170.40772729 140.42329173 140.82384716 145.5732902
140.30884732 121.53996399 700.39979247 2783.74584185 131.26773888
140.76662496 140.53773614 121.76885281 126.23218482 130.69551683]
and here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# exponential decay model with amplitude Amax and time constant tau
def expDecay(t, Amax, tau):
    return Amax/tau*np.exp(-t/tau)
Amax = []
Tau = []
ydata = np.asarray(y)
x = np.arange(len(ydata))
xdata = x
popt, pcov = curve_fit(expDecay, xdata, ydata,
                       p0=(10000, 5),
                       bounds=([0., 2.], [10000., 30]),)
Amax.append(popt[0])
Tau.append(popt[1])
plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.');
plt.plot(ydata);
plt.ylim([0, 500])
plt.show()
The deviation is due to the outliers. After eliminating them:
Note about eliminating the outliers.
Since the definition of an outlier is subjective, software able to do this will probably be more or less interactive. I built my own very rudimentary software. The principle is:
A first nonlinear regression is done with all the points. With the function and parameters obtained, y is computed for each point, and the absolute differences between the computed y and the y values from the data file are compared. The point that lies furthest away is eliminated.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on until a specified criterion is reached. That is the subjective part.
With your data (60 points), point no. 54 was eliminated first, then point no. 34, then no. 39, and so on. The process stops after eliminating 6 points. Eliminating more points doesn't improve the LMSE much.
The curve above is the result of the last nonlinear regression with the 54 remaining points.
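The elimination procedure described above could be sketched in Python roughly as follows. This is not the answerer's software: the model with an offset (decay), the helper fit_dropping_outliers, the fixed number of points to drop, the bounds and the synthetic data are all assumptions chosen for illustration. The stopping rule in particular is the subjective part and is replaced here by a fixed count.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, amp, tau, offset):
    # Hypothetical model: exponential decay towards a constant offset
    return amp * np.exp(-t / tau) + offset

def fit_dropping_outliers(t, y, n_drop, p0, bounds):
    """Refit n_drop times, each time discarding the point furthest from the curve."""
    keep = np.ones(t.size, dtype=bool)
    removed = []
    for _ in range(n_drop):
        popt, _ = curve_fit(decay, t[keep], y[keep], p0=p0, bounds=bounds)
        resid = np.abs(y - decay(t, *popt))
        resid[~keep] = -np.inf  # ignore points that were already removed
        worst = int(np.argmax(resid))
        keep[worst] = False
        removed.append(worst)
    # Final regression on the remaining points
    popt, _ = curve_fit(decay, t[keep], y[keep], p0=p0, bounds=bounds)
    return popt, removed

# Synthetic data with three injected outliers (assumed values)
rng = np.random.default_rng(0)
t = np.arange(60, dtype=float)
y = decay(t, 200.0, 10.0, 130.0) + rng.normal(0.0, 5.0, t.size)
y[[22, 33, 53]] += [400.0, 1500.0, 2600.0]

popt, removed = fit_dropping_outliers(
    t, y, n_drop=3, p0=(200.0, 10.0, 100.0),
    bounds=([0.0, 1.0, 0.0], [np.inf, 100.0, np.inf]))
print("removed indices:", removed)
print("fitted parameters:", popt)
```

In practice n_drop would be replaced by whatever criterion is judged appropriate, for example stopping when removing another point no longer reduces the residual noticeably.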
I'm doing a curve fit in python using scipy.curve_fit, and the fit itself looks great, however the parameters that are generated don't make sense.
The equation is (ax)^b + cx, but with the params python finds a = -c and b = 1, so the whole equation just equals 0 for every value of x.
here is the plot and my code.
Plot: https://i.stack.imgur.com/fBfg7.png
# experimental data
xdata = cfu_u
ydata = OD_u
# x-values to plot for curve fit
min_cfu = 0.1
max_cfu = 9.1
x_vec = pow(10,np.arange(min_cfu,max_cfu,0.1))
# exponential function
def func(x,a, b, c):
    return (a*x)**b + c*x
# curve fit
popt, pcov = curve_fit(func, xdata, ydata)
# plot experimental data and fitted curve
plt.plot(x_vec, func(x_vec, *popt), label = 'curve fit',color='slateblue',linewidth = 2.2)
plt.plot(cfu_u,OD_u,'-',label = 'experimental data',marker='.',markersize=8,color='deepskyblue',linewidth = 1.4)
plt.legend(loc='upper left',fontsize=12)
plt.ylabel("Y",fontsize=12)
plt.xlabel("X",fontsize=12)
plt.xscale("log")
plt.gcf().set_size_inches(7, 5)
plt.show()
print(popt)
[ 1.44930871e+03 1.00000000e+00 -1.44930871e+03]
How can I find the actual parameters?
edit: here is the actual experimental raw data I used: https://pastebin.com/CR2BCJji
The chosen function model is :
y(x)=(ax)^b+cx
To understand the difficulty, one first has to compare the behaviour of the function with the data over the range of the lowest values of X.
We see that y(x)=0 is an acceptable fit for the points over a large range (at least 6 decades), considering the scatter. These are the majority of the experimental points (18 points of 27). The model reduces to y(x)=0 only if b=1, which gives y(x)=(a+c)x, together with a+c=0. At first sight Python seems to give b=1 and c=-a, but we have to look more carefully.
Of course the function y(x)=0 does not fit the 9 points at larger X.
This suggests that the fit of the whole set of points is an extension of the above fit, with parameter values different from b=1 and a+c=0 but close to them, so that the fit remains good on the above 18 points.
Conclusion: the actual values of the parameters found by Python are certainly very close to b=1, with a close to 1.44930871e+03 and c close to -1.44930871e+03.
The calculation inside Python is carried out in double precision (about 16 significant digits), but the display shows only 9 digits. This is not sufficient to see that b might be different from 1 and that c might be different from -a. This suggests that the issue might only be a matter of displaying enough digits.
Yes, the fit by Python looks great. This is a fine performance from the mathematical viewpoint. But the physical significance is doubtful when so many digits are essential to the fit over the whole range.
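To check whether the displayed values hide small differences, the parameters can be printed with more digits. The sketch below uses a made-up popt array in which b differs from 1 only in the tenth decimal place; the numbers are hypothetical and are not the actual fit result from the question.

```python
import numpy as np

# Hypothetical fit result: b differs from 1 only in the 10th decimal place
popt = np.array([1449.30871, 1.0000000001, -1449.30871])

print(popt)                                  # default display (precision=8)
print(np.array2string(popt, precision=17))   # display with up to 17 digits
for name, value in zip("abc", popt):
    print(name, repr(float(value)))          # shortest exact round-trip form

# Direct checks that do not depend on the display at all
b_is_exactly_one = (popt[1] == 1.0)
a_plus_c = popt[0] + popt[2]
print(b_is_exactly_one, a_plus_c)
```

Applied to the real popt, the same prints and the two direct checks would show whether b is exactly 1 and whether a + c is exactly 0.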
I have added the Excel plot from which I get the exponential equation; I am trying to reproduce this curve fit in Python.
My fitted equation does not predict the y data close to the empirical data I have provided: the prediction gives f(-25) = 5.30e-11, while the empirical value is f(-25) = 5.3e-13.
How can I improve the code so that it predicts values close to the empirical data, or have I made mistakes in my code?
python fitted plot
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.optimize as optimize
import scipy.stats as stats
def f(x,A,B):
    return A * np.exp((-B) * (x))
y_data= [2.156e-05, 1.85e-07, 1.02e-10 , 1.268e-11, 5.352e-13]
x= [-28.8, -27.4, -26 , -25.5, -25]
p, pcov = optimize.curve_fit(f, x, y_data, p0=[10**(-59),4], maxfev=5000)
plt.figure()
plt.plot(x, y_data, 'ko', label="Empirical BER")
plt.plot(x, f(x, *p ), 'g-', label="Fitted BER" )
plt.title(" BER ")
plt.xlabel('Power Rx (dB)')
plt.ylabel('')
plt.legend()
plt.grid()
plt.yscale("log")
plt.show()
Since you are plotting the data with a log-plot, your view of the data and fit is emphasizing the "tiny" compared to the "small". Fitting uses the sum of the squares of the misfit to determine the best fit. A misfit of a few percent of the data with a y-value of ~2e-5 would completely swamp a misfit of a factor of 10 or even 100 for the data with a y-value of 1.e-11. Your plot is consistent with that.
There are two possible routes to a better fit:
a) if you have uncertainties in the y-values, use those. It's quite possible that the uncertainty in the data with y~2e-5 is much larger than the uncertainty in the data with y~1.e-11, and scaling by the uncertainty so that the minimization is of the sum-of-squares of (data-model)/uncertainty will help fit the low-value data better. OTOH, if the errors are constant, plotting those uncertainties might show that the fit you have is actually not that bad -- the misfit where y~1.e-11 is only 1.e-10.
b) realize that you are assessing the fit quality by plotting the log of the data, and embrace that observation so that you fit the log(data) to log(model). Conveniently for a simple exponential function, the log of that model is linear, so you could do linear regression of the log of your data.
Bonus round: recognize that options a) and b) are related. Since a fit minimizes Sum[ ((data-model)/uncertainty)**2 ], not providing values for the uncertainty is effectively saying that the data has the same uncertainty (=1.0 in fact) for all values of x and y. If you fit the log of the model to the log of the data, minimizing Sum[ (log(data) - log(model))**2 ] is effectively saying that the uncertainty in log(data) is the same for all values of x and y.
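Option b) can be sketched with the five data points from the question: taking the natural log of y = A*exp(-B*x) gives ln(y) = ln(A) - B*x, which is a straight line in x. The use of np.polyfit and the variable names below are one possible choice, not the only way to do the regression.

```python
import numpy as np

# Data from the question
x = np.array([-28.8, -27.4, -26.0, -25.5, -25.0])
y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])

# Linear regression on the logged data: ln(y) = ln(A) - B*x
slope, intercept = np.polyfit(x, np.log(y_data), 1)
B = -slope
A = np.exp(intercept)

# Evaluate the fitted model and compare with the data as a ratio,
# which is the natural measure of misfit on a log-scaled plot
y_fit = A * np.exp(-B * x)
ratio = y_fit / y_data
print("A =", A, "B =", B)
print("fit/data ratio per point:", ratio)
```

Because the misfit is now measured in log space, each point contributes according to its relative rather than absolute deviation, so the smallest values are no longer swamped by the largest one.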
Given an undirected NetworkX Graph graph, I want to check if it is scale free.
To do this, as I understand, I need to find the degree k of each node, and the frequency of that degree P(k) within the entire network. This should represent a power law curve due to the relationship between the frequency of degrees and the degrees themselves.
Plotting my calculations for P(k) and k displays a power curve as expected, but when I double log it, a straight line is not plotted.
The following plots were obtained with 1000 nodes.
Code as follows:
import math
import numpy as np
import matplotlib.pyplot as plt
# count how many nodes have each degree
k = []
Pk = []
for node in list(graph.nodes()):
    degree = graph.degree(nbunch=node)
    try:
        pos = k.index(degree)
    except ValueError as e:
        k.append(degree)
        Pk.append(1)
    else:
        Pk[pos] += 1
# get a double log representation
logk = []
logPk = []
for i in range(len(k)):
    logk.append(math.log10(k[i]))
    logPk.append(math.log10(Pk[i]))
order = np.argsort(logk)
logk_array = np.array(logk)[order]
logPk_array = np.array(logPk)[order]
plt.plot(logk_array, logPk_array, ".")
m, c = np.polyfit(logk_array, logPk_array, 1)
plt.plot(logk_array, m*logk_array + c, "-")
The m is supposed to represent the scaling coefficient, and if it's between 2 and 3 then the network ought to be scale free.
The graphs are obtained by calling the NetworkX's scale_free_graph method, and then using that as input for the Graph constructor.
Update
As per request from @Joel, below are the plots for 10000 nodes.
Additionally, the exact code that generates the graph is as follows:
graph = networkx.Graph(networkx.scale_free_graph(num_of_nodes))
As we can see, a significant amount of the values do seem to form a straight-line, but the network seems to have a strange tail in its double log form.
Have you tried the powerlaw module in Python?
It's pretty straightforward.
First, create a degree distribution variable from your network:
degree_sequence = sorted([d for n, d in G.degree()], reverse=True) # used for degree distribution and powerlaw test
Then fit the data to powerlaw and other distributions:
import powerlaw # Power laws are probability distributions with the form: p(x) ∝ x^(-α)
fit = powerlaw.Fit(degree_sequence)
Take into account that powerlaw automatically finds the optimal xmin value by creating a power law fit starting from each unique value in the dataset, then selecting the one that results in the minimal Kolmogorov-Smirnov distance, D, between the data and the fit. If you want to include all your data, you can define the xmin value as follows:
fit = powerlaw.Fit(degree_sequence, xmin=1)
Then you can plot:
fig2 = fit.plot_pdf(color='b', linewidth=2)
fit.power_law.plot_pdf(color='g', linestyle='--', ax=fig2)
which will produce an output like this:
powerlaw fit
On the other hand, the data may not follow a power law but some other distribution such as a lognormal; you can check this with fit.distribution_compare:
R, p = fit.distribution_compare('power_law', 'exponential', normalized_ratio=True)
print (R, p)
where R is the log-likelihood ratio between the two candidate distributions. This number will be positive if the data is more likely in the first distribution, but you should also check that p < 0.05 before trusting the sign.
Finally, once you have chosen an xmin for your distribution, you can plot a comparison between some common degree distributions for social networks:
plt.figure(figsize=(10, 6))
fit.distribution_compare('power_law', 'lognormal')
fig4 = fit.plot_ccdf(linewidth=3, color='black')
fit.power_law.plot_ccdf(ax=fig4, color='r', linestyle='--') #powerlaw
fit.lognormal.plot_ccdf(ax=fig4, color='g', linestyle='--') #lognormal
fit.stretched_exponential.plot_ccdf(ax=fig4, color='b', linestyle='--') #stretched_exponential
lognormal vs powerlaw vs stretched exponential
Finally, take into account that power-law distributions in networks are currently under discussion; strongly scale-free networks seem to be empirically rare:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6399239/
Part of your problem is that you aren't including the missing degrees in fitting your line. There are a small number of large degree nodes, which you're including in your line, but you're ignoring the fact that many of the large degrees don't exist. Your largest degrees are somewhere in the 1000-2000 range, but there are only 2 observations. So really, for such large values, I'm expecting that the probability that a random node has such a large degree is about 2/(1000*N) (or really, it's probably even less than that). But in your fit, you're treating them as if the probability of those two specific degrees is 2/N, and you're ignoring the other degrees.
The simple fix is to only use the smaller degrees in your fit.
The more robust way is to fit the complementary cumulative distribution. Instead of plotting P(K=k), plot P(K>=k) and try to fit that (noting that if the probability that P(K=k) is a powerlaw, then the probability that P(K>=k) is also, but with a different exponent - check it).
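A minimal sketch of the complementary cumulative approach is shown below. It uses a synthetic degree sequence drawn from NumPy's Zipf sampler as a stand-in for the real network (the exponent 2.5, the sample size and the 1e-3 tail cutoff are all assumptions), and a plain least-squares line in log-log space, which gives only a rough estimate of the exponent.

```python
import numpy as np

# Synthetic stand-in for a degree sequence (assumed exponent and size)
rng = np.random.default_rng(0)
alpha = 2.5
degrees = rng.zipf(alpha, size=50_000)

# Empirical P(K >= k) evaluated at each observed degree k
ks, counts = np.unique(degrees, return_counts=True)
ccdf = counts[::-1].cumsum()[::-1] / degrees.size

# Drop the sparsely sampled tail (arbitrary cutoff) and fit a line in log-log
mask = ccdf >= 1e-3
slope, intercept = np.polyfit(np.log10(ks[mask]), np.log10(ccdf[mask]), 1)

# If P(K=k) ~ k^(-alpha), then P(K>=k) ~ k^(-(alpha-1)), so alpha ~ 1 - slope
print(f"CCDF slope = {slope:.2f}, estimated exponent of P(K=k) = {1 - slope:.2f}")
```

For a real graph, degrees would be replaced by the list of node degrees, for example [d for n, d in graph.degree()].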
Fitting a line to these points directly is problematic, because the points are not evenly distributed along the x-axis. A least-squares line fit gives more weight to the part of the domain that contains more points.
You could redistribute the observations over the x-axis using the function np.interp, like this:
logk_interp = np.linspace(np.min(logk_array),np.max(logk_array),1000)
logPk_interp = np.interp(logk_interp, logk_array, logPk_array)
plt.plot(logk_array, logPk_array,".")
m, c = np.polyfit(logk_interp, logPk_interp, 1)
plt.plot(logk_interp, m*logk_interp + c, "-")
Suppose 'h' is a function of x, y, z and t, and it gives us a simulated graph line (t,h). At the same time we also have an observed graph (observed values of h against t). How can I reduce the difference between the observed (t,h) and simulated (t,h) graphs by optimizing the values of x, y and z? I want to change the simulated graph so that it matches the observed graph more and more closely, in MATLAB/Python. In the literature I have read that people have done the same thing with the Levenberg-Marquardt algorithm, but I don't know how to do it.
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need a starting point (some implementations can choose one randomly); in the case of lsqcurvefit this is what x0 is for. If you have
h = @(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> @(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(@(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt,pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, h_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but the initial values of all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
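For a complete, runnable version of the Python recipe, here is a sketch with a made-up model. The function h below (an exponential decay plus offset), the "true" parameters and the synthetic measurements are assumptions standing in for the real simulation and observations. Passing method="lm" asks curve_fit for the Levenberg-Marquardt algorithm mentioned in the question, which is also what it uses by default when no bounds are given.

```python
import numpy as np
from scipy.optimize import curve_fit

def h(x, y, z, t):
    # Hypothetical model standing in for the real simulation
    return x * np.exp(-y * t) + z

# Synthetic "observations" generated from assumed true parameters
rng = np.random.default_rng(1)
t_meas = np.linspace(0.0, 5.0, 50)
true_params = (2.0, 1.3, 0.5)
h_meas = h(*true_params, t_meas) + rng.normal(0.0, 0.01, t_meas.size)

# curve_fit expects the independent variable first, so wrap h in a lambda
popt, pcov = curve_fit(lambda t, x, y, z: h(x, y, z, t), t_meas, h_meas,
                       p0=[1.0, 1.0, 1.0], method="lm")
x_fit, y_fit, z_fit = popt
h_fit = h(x_fit, y_fit, z_fit, t_meas)  # simulated curve at the optimum

print("fitted x, y, z:", popt)
print("parameter standard errors:", np.sqrt(np.diag(pcov)))
```

The square roots of the diagonal of pcov give rough standard errors for the fitted parameters, which helps judge how well each of x, y and z is constrained by the observations.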