Theil-Sen regression with sklearn on a log-log scale - python
I'm trying to plot my data on a log-log scale, using Theil-Sen regression for the best-fit line. However, when I work out my regression line on the log-log scale, it comes out parallel to my x=y line, which I don't think is correct.
[Figure: normal scale for X and y]
[Figure: log-log scale for X and y]
I found a related solution by chaooder for Linear Regression on a semi-log scale to be somewhat helpful. So currently, my regression line would go from being:
y = ax + c on a linear scale to y = 10^(a log(x) + c) on my log-log scale. But I can't see how that has a solution, as I cannot calculate a.
Here's the data:
index,x,y
0,0.22,0.26
1,0.39,0.1
2,0.4,0.17
3,0.56,0.41
4,0.57,0.12
5,0.62,0.54
6,0.78,0.99
7,0.79,0.35
8,0.8,0.33
9,0.83,0.91
10,0.95,0.81
11,1.08,0.23
12,1.34,0.11
13,1.34,0.44
14,1.35,0.11
15,1.58,0.24
16,1.66,0.71
17,2.11,0.54
18,2.13,0.42
19,2.19,1.72
20,2.25,2.16
21,2.39,0.95
22,2.4,0.16
23,2.73,0.92
24,2.87,1.1
25,2.96,0.27
26,3.12,1.66
27,3.26,0.06
28,3.28,0.68
29,3.34,0.7
30,3.38,1.14
31,3.39,1.81
32,3.41,0.19
33,3.49,1.4
34,3.52,1.57
35,3.6,0.99
36,3.64,1.28
37,3.65,1.68
38,3.89,1.66
39,3.93,1.64
40,4.01,1.04
41,4.07,0.32
42,4.22,0.68
43,4.52,0.57
44,4.53,0.59
45,4.56,0.7
46,4.6,1.15
47,4.62,1.31
48,4.68,1.09
49,5.03,0.48
50,5.06,0.7
51,5.31,0.62
52,5.41,0.21
53,5.45,2.06
54,6.0,0.72
55,6.06,0.36
56,6.64,1.41
57,6.74,0.59
58,6.96,0.95
59,7.01,1.13
60,7.14,1.56
61,7.14,2.82
62,7.19,1.49
63,7.21,0.88
64,7.23,1.31
65,7.55,0.76
66,7.72,0.5
67,7.75,1.65
68,7.77,1.48
69,7.9,1.8
70,7.95,0.68
71,8.03,1.12
72,8.09,2.61
73,8.86,1.71
74,9.31,0.23
75,9.5,2.35
76,9.62,1.84
77,9.91,0.56
78,9.95,1.67
79,10.4,1.15
80,10.8,0.88
81,11.28,1.8
82,11.31,1.58
83,11.43,1.0
84,12.38,2.83
85,13.38,1.45
86,13.9,1.99
87,30.3,1.99
And my current code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
from sklearn.linear_model import TheilSenRegressor
for log in [True, False]:
    fig,ax = plt.subplots()
    data.plot.scatter(ax=ax,
                      x='x',
                      y='y',
                      loglog=log)
    vmin = np.amin(data[['x','y']].min().values)*0.8
    vmax = np.amax(data[['x','y']].max().values)*1.25
    ax.set_xlim(vmin,vmax)
    ax.set_ylim(vmin,vmax)
    ax.yaxis.set_minor_locator(AutoMinorLocator())
    ax.xaxis.set_minor_locator(AutoMinorLocator())
    # best fit (Theil-Sen) line
    X = data.x.values[:,np.newaxis]
    y = data.y.values
    if log:
        X = np.log10(X)
        y = np.log10(y)
    if len(y) > 0:
        estimator = TheilSenRegressor(fit_intercept=False) # intentionally set intercept to 0
        estimator.fit(X=X,y=y)
        y0 = y[0]
        x0 = X[0]
        y_pred = estimator.predict(np.array([vmin,vmax]).reshape(2,1))
        # y_pred = np.power(10,(estimator.predict(X)))
        gradient = (y_pred[1] - y_pred[0]) / (vmax - vmin)
        intercept = y_pred[1] - gradient * vmax
        print(f'gradient: {gradient} \n intercept: {intercept}')
        # Theil-Sen regression line
        ax.plot([vmin,vmax],y_pred,color='red',lw=1,zorder=1,label='Best fit')
    # 1:1 ratio line (black, dashed)
    ax.plot([vmin,vmax],[vmin,vmax],lw=1,color='black',ls='--',alpha=0.6,zorder=1,
            label='1:1 correlation')
    if log:
        ax.set_xscale('log');ax.set_yscale('log')
        ax.set_title('log-log scale')
        fig.savefig('TS_regression_loglog.png')
    else:
        ax.set_title('normal scale')
        fig.savefig('TS_regression_normalscale.png')
If you fit on the log-log scale, the inputs you pass to predict() must also be on the log scale, and the predictions must be transformed back (10**y_pred) before you plot them. These are the lines where the scales are inconsistent:
y_pred = estimator.predict(np.array([vmin,vmax]).reshape(2,1))
[..]
ax.plot([vmin,vmax],y_pred,color='red',lw=1,zorder=1,label='Best fit')
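To see why both sides need the same scale, it may help to write the fitted model out. This is a short derivation, assuming base-10 logs as in the question's code, with a the slope and c the intercept fitted in log space:

```latex
\begin{align}
\log_{10} y &= a \log_{10} x + c \\
y &= 10^{\,a \log_{10} x + c} = 10^{c}\, x^{a}
\end{align}
```

So a is simply the coefficient the estimator fits on the log-transformed data. With fit_intercept=False, c = 0 and the curve reduces to y = x^a, which is a straight line of slope a on log-log axes and is parallel to the 1:1 line only when a = 1.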
First define some of the variables from your code. Note that you should take the gradient and intercept directly from the fitted estimator:
vmin = np.amin(data[['x','y']].min().values)*0.8
vmax = np.amax(data[['x','y']].max().values)*1.25
X = data.x.values[:,np.newaxis]
y = data.y.values
With a slight modification to your code:
fig,ax = plt.subplots()
data.plot.scatter(ax=ax,x='x',y='y')
estimator = TheilSenRegressor(fit_intercept=False) # intentionally set intercept to 0
estimator.fit(X=np.log10(X),y=np.log10(y))
gradient = estimator.coef_[0]
intercept = estimator.intercept_
print([gradient,intercept])
y_pred = estimator.predict(np.log10([vmin,vmax]).reshape(2,1))
ax.plot([vmin,vmax],10**(y_pred),color='red',lw=1,zorder=1,label='Best fit')
ax.plot([vmin,vmax],[vmin,vmax],lw=1,color='black',ls='--',alpha=0.6,zorder=1,
label='1:1 correlation')
ax.set_xscale('log')
ax.set_yscale('log')
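The same idea can be checked without sklearn. The sketch below is a numpy-only illustration, not sklearn's TheilSenRegressor implementation: it uses the classic Theil-Sen estimator (the median of all pairwise slopes) on log10-transformed data, then maps the line back with 10**. The function name theil_sen_loglog and the synthetic power-law data are assumptions made up for this example.

```python
import numpy as np

def theil_sen_loglog(x, y):
    """Fit log10(y) = a*log10(x) + c with the classic Theil-Sen estimator."""
    lx, ly = np.log10(x), np.log10(y)
    i, j = np.triu_indices(len(lx), k=1)    # all index pairs with i < j
    dx = lx[j] - lx[i]
    keep = dx != 0                          # skip pairs with identical x
    a = np.median((ly[j] - ly[i])[keep] / dx[keep])
    c = np.median(ly - a * lx)              # median residual as the intercept
    return a, c

# Synthetic power law y = 2 * x**0.5 (hypothetical data, not the question's)
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
y = 2.0 * x ** 0.5
a, c = theil_sen_loglog(x, y)

# Predict in log space, then transform back before plotting on log axes
x_line = np.array([x.min(), x.max()])
y_line = 10 ** (a * np.log10(x_line) + c)
```

Here x_line and y_line are both in original units, so they can be passed straight to ax.plot on axes whose scale is set to 'log'.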
Related
Calculating R-square of a slope of specific part of a graph
As is tradition, I want to say that I am pretty new to Python. I have a set of x and y values in a csv file, and my y values are pretty noisy. So far, I have managed to use a filter (scipy.signal.savgol_filter) to remove the noise, plot my graph, and get a linear regression of my data, which shows a linear trend. This part is important, because my question is about the linear fit to one part of the data. Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.signal
import os
from scipy.signal import savgol_coeffs
from sklearn.metrics import r2_score
from scipy.linalg import lstsq

plt.rc('lines',linewidth=1)
plt.rc('axes', labelsize=16)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('legend', fontsize=10)

# define material parameters
deprate = 5.90E-07 #deposition rate unit: cm/s
Ms = 180000 # Si substrate modulus, unit: MPa
hs = 0.03 # Si substrate thickness, unit: cm
stressfac = Ms*hs**2/6 # stress prefactor, unit: MPa

def fit_slope(hfilm, curvature, h0, h1):
    # use least square fitting to find the slope of the film_thickness vs curvature curve btw
    # thickness = h0 and h1.
    # return fitting parameter p. Least square fitting line is y = p[1]*x + p[0]
    xdata = hfilm[ (hfilm>h0) & (hfilm<h1)]
    ydata = curvature[(hfilm>h0) & (hfilm<h1)]
    A = xdata[:, np.newaxis] ** [0,1]
    p, *_ = lstsq(A, ydata)
    return p

def read_MOSSdata(filename, deprate):
    data = pd.read_csv(filename, sep=r'\s*,\s*', engine='python')
    time = data['time [s]'][~data['time [s]'].isna()].to_numpy()
    curvature = data['Curvature'][~data['time [s]'].isna()] # curvature unit: 1/cm
    hfilm = time * deprate # film thickness unit: cm
    return time, hfilm, curvature

filename = (r'C:\Users\yavuz\01-0722-2.csv')
time, hfilm, curvature = read_MOSSdata(filename, deprate)
h0 = 0.00005
h1 = 0.00008
xdata = np.linspace(h0, h1, 500)
yhat = scipy.signal.savgol_filter(curvature, 21,1)
p = fit_slope(hfilm, yhat, h0, h1)
plt.plot(hfilm, curvature)
plt.plot(hfilm, yhat, color='red', label = 'filtered data')
plt.plot(xdata, p[1]*xdata + p[0], color='green', linewidth=4, label = 'linear fitting')
plt.xlabel("Film thickness (cm)")
plt.ylabel("Curvature(1/cm)")
print(f'fitted stress = {-p[1]*stressfac} MPa')
plt.legend(loc=0)
My question is: how do I calculate the R-squared value of this slope on my graph? I tried R-squared calculators such as sklearn.metrics, but the problem is that I am limiting my x values to get the slope over a window, and all of the code I tried fails with "expected x and y to have same length". I would add the csv file, but there does not seem to be an option for that. Thanks a lot for the help!
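One way to avoid the "same length" error is to apply the same boolean mask to both arrays before scoring, so the fitted and observed values cover the same window. Below is a minimal numpy-only sketch; the helper name windowed_r2 and the synthetic data are assumptions standing in for the real curvature file.

```python
import numpy as np

def windowed_r2(x, y, p, x0, x1):
    """R^2 of the line y = p[1]*x + p[0] over the window x0 < x < x1."""
    mask = (x > x0) & (x < x1)           # one mask, applied to both x and y
    xw, yw = x[mask], y[mask]
    y_fit = p[1] * xw + p[0]
    ss_res = np.sum((yw - y_fit) ** 2)   # residual sum of squares
    ss_tot = np.sum((yw - yw.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data: a noiseless line, so R^2 should be 1
x = np.linspace(0.0, 1.0, 101)
y = 3.0 * x + 1.0
p = np.array([1.0, 3.0])                 # [intercept, slope], as in fit_slope
r2 = windowed_r2(x, y, p, 0.2, 0.6)
```

sklearn.metrics.r2_score(yw, y_fit) should give the same number, provided both arguments come from identically masked arrays.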
What could be causing incorrect 2-D interpolation in SciPy?
I have a rectilinear (not regular) grid of data (x,y,V) where V is the value at the position (x,y). I would like to use this data source to interpolate my results so that I can fill in the gaps and plot the interpolated values (inside the range) in the future. (Also I need functionality of griddata to check arbitrary values inside the range). I looked at the documentation at SciPy and here. Here is what I tried, and the result: It clearly doesn't match the data.
# INTERPOLATION ATTEMPT?
from scipy.interpolate import Rbf
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

edges = np.linspace(-0.05, 0.05, 100)
centers = edges[:-1] + np.diff(edges[:2])[0] / 2.
XI, YI = np.meshgrid(centers, centers)

# use RBF
rbf = Rbf(x, y, z, epsilon=2)
ZI = rbf(XI, YI)

# plot the result
plt.subplots(1,figsize=(12,8))
X_edges, Y_edges = np.meshgrid(edges, edges)
lims = dict(cmap='viridis')
plt.pcolormesh(X_edges, Y_edges, ZI, shading='flat', **lims)
plt.scatter(x, y, 200, z, edgecolor='w', lw=0.1, **lims)
#decoration
plt.title('RBF interpolation?')
plt.xlim(-0.05, 0.05)
plt.ylim(-0.05, 0.05)
plt.colorbar()
plt.show()
For reference, here is my data (extracted), it has a circular pattern that I need interpolation to recognize.
#DATA
experiment1raw = np.array([
    [0,40,1,11.08,8.53,78.10,2.29],
    [24,-32,2,16.52,11.09,69.03,3.37],
    [8,-32,4,14.27,10.68,71.86,3.19],
    [-8,-32,6,10.86,9.74,76.69,2.72],
    [-24,-32,8,6.72,12.74,77.08,3.45],
    [32,-24,9,18.49,13.67,64.32,3.52],
    [-32,-24,17,6.72,12.74,77.08,3.45],
    [16,-16,20,13.41,21.33,59.92,5.34],
    [0,-16,22,12.16,14.67,69.04,4.12],
    [-16,-16,24,9.07,13.37,74.20,3.36],
    [32,-8,27,19.35,17.88,57.86,4.91],
    [-32,-8,35,6.72,12.74,77.08,3.45],
    [40,0,36,19.25,20.36,54.97,5.42],
    [16,0,39,13.41,21.33,59.952,5.34],
    [0,0,41,10.81,19.55,64.37,5.27],
    [-16,0,43,8.21,17.83,69.34,4.62],
    [-40,0,46,5.76,13.43,77.23,3.59],
    [32,8,47,15.95,23.61,54.34,6.10],
    [-32,8,55,5.97,19.09,70.19,4.75],
    [16,16,58,11.27,26.03,56.36,6.34],
    [0,16,60,9.19,24.94,60.06,5.79],
    [-16,16,62,7.10,22.75,64.57,5.58],
    [32,24,65,12.39,29.19,51.17,7.26],
    [-32,24,73,5.40,24.55,64.33,5.72],
    [24,32,74,10.03,31.28,50.96,7.73],
    [8,32,76,8.68,30.06,54.34,6.92],
    [-8,32,78,6.88,28.78,57.84,6.49],
    [-24,32,80,5.83,26.70,61.00,6.46],
    [0,-40,81,7.03,31.55,54.40,7.01],
])

#Atomic Percentages are set here
Cr1 = experiment1raw[:,3]
Mn1 = experiment1raw[:,4]
Fe1 = experiment1raw[:,5]
Co1 = experiment1raw[:,6]

#COORDINATE VALUES IN PRE-T
x_pret = experiment1raw[:,0]/1000
y_pret = experiment1raw[:,1]/1000

#important translation
x = -y_pret
y = -x_pret
z = Cr1
You set a large epsilon in Rbf. Your best bet is to leave it at the default and let SciPy calculate an appropriate value. See the implementation here. So, with the default epsilon:
rbf = Rbf(x, y, z)
I got a pretty good interpolation for your data (subjective opinion).
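To illustrate why epsilon matters: Rbf's default basis is the multiquadric sqrt((r/epsilon)**2 + 1), and when epsilon is left unset SciPy estimates it from the spacing of the nodes. The sketch below uses made-up nodes on the same ±0.04 scale as the question (not the real data) to show how small the default is compared with epsilon=2.

```python
import numpy as np
from scipy.interpolate import Rbf

# Made-up nodes on a coarse grid spanning +/-0.04, like the question's coordinates
gx, gy = np.meshgrid(np.linspace(-0.04, 0.04, 6), np.linspace(-0.04, 0.04, 6))
x, y = gx.ravel(), gy.ravel()
z = np.hypot(x, y)                       # a simple radially symmetric field

rbf_default = Rbf(x, y, z)               # epsilon estimated from node spacing
print("default epsilon:", rbf_default.epsilon)
print("ratio to epsilon=2:", 2 / rbf_default.epsilon)

# With smooth=0 (the default) the interpolant passes through the nodes
z_at_nodes = rbf_default(x, y)
```

With epsilon orders of magnitude larger than the node spacing, every basis function is nearly flat across the data, which is a plausible reason the original plot failed to follow the measurements.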
Asymmetric Gaussian Fit in Python
I'm trying to fit an asymmetric Gaussian to this data: http://ge.tt/99iNaL53 (csv file). I have tried to use a skewed Gaussian model from lmfit, and also a spline, but I'm not able to get the Gaussian model to fit well and the splines are not what I'm looking for (I don't want the spline to fit the data exactly as shown below, and altering the level of smoothing isn't helping). Here is code using the above data that produces the plot below. The second figure is an example of what I'm trying to achieve with the goal of reading the rise and decay time from the fit.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline
from scipy.interpolate import UnivariateSpline
from lmfit.models import SkewedGaussianModel

data = np.loadtxt('data.csv', delimiter=',')
x = data[:,0]
y = data[:,1]

# Skewed Gaussian fit
model = SkewedGaussianModel()
params = model.make_params(amplitude=400, center=3, sigma=7, gamma=1)
result = model.fit(y, params, x=x)

# Cubic Spline
cs = CubicSpline(x, y)
x_range = np.arange(x[0], x[-1], 0.1)

# Univariate Spline
us = UnivariateSpline(x, y, k = 1)

# Univariate Spline (smoothed)
us2 = UnivariateSpline(x, y, k = 5)

plt.scatter(x, y, marker = '^', color = 'k', linewidth = 0.5, s = 10, label = 'data')
plt.plot(x_range, cs(x_range), label = 'Cubic Spline')
plt.plot(x_range, us(x_range), label = 'Univariate Spline, k = 1')
plt.plot(x_range, us2(x_range), label = 'Univariate Spline, k = 5')
plt.plot(x, result.best_fit, color = 'red', label = 'Skewed Gaussian Attempt')
plt.xlabel('x')
plt.ylabel('y')
plt.yscale('log')
plt.ylim(1,500)
plt.legend()
plt.show()
Is there a question here? I don't see one, actually. That result from lmfit is the best fit to a skewed Gaussian model. You've chosen to plot the result on a log-scale. That completely changes the view of the quality of the fit or what is not fit well. It seems like you're expecting a better fit, but not too good. Well, it looks like your data is not perfectly represented by a single skewed Gaussian. It seems like you were not expecting it to be. You could try different forms for the model function, say a skewed Lorentzian or something. But your data has that low x shoulder that definitely does not look like your uncited figure.
I wrote something for J. Chem. Ed. [1] that involved fitting asymmetric Gaussian functions to data. You can find the core repo here [2], but below is a snippet showing how I went about fitting a data set, where x = data[:,0] and y = data[:,1], to the type of function you're working with:
import numpy as np
from scipy.optimize import leastsq
from scipy.special import erf

initials = [6.5, 13, 1, 0] # initial guess

def asymGaussian(x, p):
    amp = (p[0] / (p[2] * np.sqrt(2 * np.pi)))
    spread = np.exp((-(x - p[1]) ** 2.0) / (2 * p[2] ** 2.0))
    skew = (1 + erf((p[3] * (x - p[1])) / (p[2] * np.sqrt(2))))
    return amp * spread * skew

def residuals(p,y,x):
    return y - asymGaussian(x, p)

# executes least-squares regression analysis to optimize initial parameters
cnsts = leastsq(
    residuals,
    initials,
    args=(
        data[:,1], # y value
        data[:,0]  # x value
    ))[0]

y = asymGaussian(data[:,0], cnsts)
Finally, just plot y against data[:,0]. Hope this helps!
[1] https://pubs.acs.org/doi/10.1021/acs.jchemed.9b00818
[2] https://github.com/1mikegrn/pyGC
Plotting bars hist and PDF line (via kdeplot)
I'm trying to plot a bar histogram of interest rates and overlay a PDF line on it. I looked for solutions and found a way with kdeplot. The result is pretty strange: the kdeplot line is much higher than the histogram bars, and I don't know how to fix it.
[Figure: after applying kdeplot]
[Figure: before applying kdeplot]
Here is the code that I'm using:
df=pd.read_excel('interestrate.xlsx')
k=0.0005
bin_steps = np.arange(start = df['Interest rate Real'].min(), stop = df['Interest rate Real'].max(), step = k)
ax = df['Interest rate Real'].hist(bins = bin_steps, figsize=[10,5])
ax1 = df['Interest rate Real']
vals = ax.get_xticks()
ax.set_xticklabels(['{:,.2%}'.format(x) for x in vals])
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
ax.set_title("PDF for Real Interest Rate")
#sns.kdeplot(ax1)
The following code snippet should set you in the right direction (just insert your data):
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

y = np.random.randn(1000) # your data goes here
plt.hist(y,50, density=True)
mn, mx = plt.xlim()
plt.xlim(mn, mx)
x = np.linspace(mn, mx, 301)
kde = st.gaussian_kde(y)
plt.plot(x, kde.pdf(x));
Alternatively with seaborn:
import seaborn as sns
plt.hist(y,50, density=True)
sns.kdeplot(y);
or as simple as:
sns.distplot(y)
Note that distplot is deprecated in recent seaborn releases; sns.histplot(y, kde=True, stat='density') is the closest replacement.
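The reason density=True fixes the mismatch is that a KDE integrates to 1, while a raw histogram's bar heights are counts. Here is a small numpy-only check, using random stand-in data rather than the interest-rate file:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)                # stand-in for the real data

counts, edges = np.histogram(y, bins=50)
density, _ = np.histogram(y, bins=50, density=True)
widths = np.diff(edges)

area_counts = np.sum(counts * widths)    # scales with sample size and bin width
area_density = np.sum(density * widths)  # normalized to 1, like a PDF

print(area_counts, area_density)
```

Only the density-normalized bars share a vertical scale with the KDE curve, so the two can be drawn on the same axes.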
plot individual peaks after gaussian curve fitting with python-lmfit
From this piece of code I can print the final fit with "out.best_fit", what I would like to do now, is to plot each of the peaks as individual gaussian curves, instead of all of them merged in one single curve.
from pylab import *
from lmfit import minimize, Parameters, report_errors
from lmfit.models import GaussianModel, LinearModel, SkewedGaussianModel
from scipy.interpolate import interp1d
from numpy import *

fit_data = interp1d(x_data, y_data)
mod = LinearModel()
pars = mod.make_params(slope=0.0, intercept=0.0)
pars['slope'].set(vary=False)
pars['intercept'].set(vary=False)
x_peak = [278.35, 334.6, 375]
y_peak = [fit_data(x) for x in x_peak]
i = 0
for x,y in zip(x_peak, y_peak):
    sigma = 1.0
    A = y*sqrt(2.0*pi)*sigma
    prefix = 'g' + str(i) + '_'
    peak = GaussianModel(prefix=prefix)
    pars.update(peak.make_params(center=x, sigma=1.0, amplitude=A))
    pars[prefix+'center'].set(min=x-20.0, max=x+20.0)
    pars[prefix+'amplitude'].set(min=0.0)
    mod = mod + peak
    i += 1

out = mod.fit(y_data, pars, x=x_data)
plt.figure(1)
plt.plot(x_data, y_data)
plt.figure(1)
plt.plot(x_data, out.best_fit, '--')
[Figure: plot of the global fit]
I think you want to do this after your fit:
components = out.eval_components(x=x_data)
for model_name, model_value in components.items():
    plt.plot(x_data, model_value)

# or more simply, if you prefer:
plt.plot(x_data, components['g0_'])
plt.plot(x_data, components['g1_'])
...
That is, ModelResult.eval_components() for a composite model will return a dictionary with keys that are the prefixes of the component models, and values that are the calculated model for that component.