The line of best fit doesn't match the scatter plot - python

Below is my scatter plot with a regression line. Just by looking at how the markers are distributed on the plot, I feel like the line does not fit them correctly. From what I see, it should be more of a straight diagonal line instead of a curve. Here is the code producing the plot:
import numpy
import matplotlib.pyplot as plt

# One scatter point per list entry, with the marker size based on len(clearModSet[i])
for i in range(len(linkKarmaList)):
    plt.scatter(commentKarmaList[i], linkKarmaList[i], marker="o",
                s=len(clearModSet[i]) * 1.0 * 0.9)

x = numpy.asarray(commentKarmaList)
y = numpy.asarray(linkKarmaList)

# Fit a first-degree polynomial (straight line) and plot it over the unique x values
plt.plot(numpy.unique(x), numpy.poly1d(numpy.polyfit(x, y, 1))(numpy.unique(x)))

plt.xlabel('Comment Karma')
plt.ylabel('Link Karma')
plt.title('Link and comment Karma of most popular Forums on reddit')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.show()
Am I interpreting that correctly? What am I missing?

You're trying to fit a straight line y = a*x + b, which does not look like a straight line in log-space. Instead, you should fit a function that is a straight line in log-space.
This comes down to log(y) = a * log(x) + b
Which we can then rewrite to log(y) = log(x^a) + b
If we then raise 10 to the power of both sides, we find:
y = x^a * 10^b, or just y = C * x^a, where C (= 10^b) and a are the fitting parameters and x and y are your data.
This is the function that makes a straight line in log-log space, which is the function you should try to fit against your data.
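A minimal sketch of that fit, assuming commentKarmaList and linkKarmaList contain only positive values (otherwise the logs are undefined): fit a straight line to the base-10 logs with numpy.polyfit, then transform back to the original scale for plotting.

import numpy
import matplotlib.pyplot as plt

x = numpy.asarray(commentKarmaList, dtype=float)
y = numpy.asarray(linkKarmaList, dtype=float)

# Fit log10(y) = a * log10(x) + b, i.e. y = 10**b * x**a
a, b = numpy.polyfit(numpy.log10(x), numpy.log10(y), 1)

# Evaluate the fitted power law on sorted x values and plot in the original units
xs = numpy.unique(x)
plt.plot(xs, 10**b * xs**a, label='power-law fit')

plt.xscale('log')
plt.yscale('log')
plt.legend()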

From what you show, I'd say the scatter looks more or less like a line in the log-log plot. The problem is that you're fitting against the raw (untransformed) values and only then displaying the result on log-log axes, which is why the straight fit shows up as a curve.

Related

y-axis range for plotting line of best fit is way too small

I have x and y dataframes and when I plot a scatterplot I get a pretty good result as shown below:
But when I fit the data into a regression model and plot the line of best fit, the line appears to have much higher values and it's squeezing the y-axis into a clustered mess.
How do I make the y-axis have a normal range?
Here's my code:
x = df["Year"]
y = df["Top speed"]
reg_prep = LinearRegression()
mod_reg = reg_prep.fit(x.to_numpy().reshape((-1, 1)), y.to_numpy())
plt.scatter(x, y)
b0 = mod_reg.intercept_
b1 = mod_reg.coef_[0]
yfit = [b0 + b1 * xi for xi in x]
plt.plot(x, np.array(yfit).reshape(-1, 1))
plt.show()
Try using the fitted model's predictions directly (in your code the fitted regressor is mod_reg, and predict expects the same 2-D shape used for fitting):
plt.plot(x, mod_reg.predict(x.to_numpy().reshape(-1, 1)))
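Putting that together, a minimal sketch assuming df has the "Year" and "Top speed" columns from the question:

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = df["Year"].to_numpy().reshape(-1, 1)   # scikit-learn expects a 2-D feature matrix
y = df["Top speed"].to_numpy()

mod_reg = LinearRegression().fit(X, y)

plt.scatter(df["Year"], y)
plt.plot(df["Year"], mod_reg.predict(X))   # line of best fit from the fitted model
plt.show()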

Using SciPy to interpolate data into a quadratic fit

I have a set of data in which, when plotted, most points congregate towards the left of the x axis:
plt.plot(x, y, marker='o')
plt.title('Original')
plt.show()
ORIGINAL GRAPH
I want to use scipy to interpolate the data and then try to fit a quadratic line to it. I am avoiding simply fitting a quadratic curve without interpolation, since that would bias the resulting curve towards the mass of data at one extreme end of the x axis. I tried this using
f = interp1d(x, y, kind='quadratic')
# Array with points in between min(x) and max(x) for interpolation
x_interp = np.linspace(min(x), max(x), num=np.size(x))
# Plot graph with interpolation
plt.plot(x_interp, f(x_interp), marker='o')
plt.title('Interpolated')
plt.show()
and got INTERPOLATED GRAPH.
However, what I intend to get is something like this:
EXPECTED GRAPH
What am I doing wrong?
My values for x can be found here and values for y here.
Thank you!
Solution 1
I'm pretty sure this does what you want. It fits a second degree (quadratic) polynomial to your data, then plots that function on an evenly spaced array of x values ranging from the minimum to the maximum of your original x data.
new_x = np.linspace(min(x), max(x), num=np.size(x))
coefs = np.polyfit(x,y,2)
new_line = np.polyval(coefs, new_x)
Plotting it returns:
plt.scatter(x,y)
plt.scatter(new_x,new_line,c='g', marker='^', s=5)
plt.xlim(min(x)-0.00001,max(x)+0.00001)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
If that wasn't what you meant...
However, from your question, it seems like you might be trying to force all your original y-values onto evenly spaced x-values (if that's not your intention, let me know, and I'll just delete this part).
This is also possible, there are lots of ways to do this, but I've done it here in pandas:
import pandas as pd
xy_df=pd.DataFrame({'x_orig': x, 'y_orig': y})
sorted_x_y=xy_df.sort_values('x_orig')
sorted_x_y['new_x'] = np.linspace(min(x), max(x), np.size(x))
plt.figure(figsize=[5,5])
plt.scatter(sorted_x_y['new_x'], sorted_x_y['y_orig'])
plt.xlim(min(x)-0.00001,max(x)+0.00001)
plt.xticks(rotation=90)
plt.tight_layout()
Which looks pretty different from your original data... which is why I think it might not be exactly what you're looking for.

Matplotlib negative axis

I want to fit a straight line y = mx + c to my data points, but in log form. For this purpose I am using the curve_fit module. My simple code is
from numpy import log10
from scipy.optimize import curve_fit
import pylab as pl

def func(x, m, c):
    return x*m + c

x = log10(xdata)
y = log10(ydata)
err = log10(error)

coeff, var = curve_fit(func, x, y, sigma=err)
yfit = func(x, coeff[0], coeff[1])

pl.plot(x, y, 'ro')
pl.plot(x, yfit, 'k-')
pl.show()
This plot gives me negative numbers on the y axis, as my y values are in mV. Is there any way to use the original xdata and ydata (in mV) on the plots while still fitting in log space?
Plot the back-transformed variables instead:
pl.plot(10**x, 10**yfit, 'k-')
and maybe display the plot on a log scale:
pl.yscale('log')
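A minimal sketch of that suggestion, reusing the xdata, ydata, and error arrays assumed by the question: fit in log10-space as before, then plot in the original mV units on log-scaled axes.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, m, c):
    return m * x + c

# Fit the straight line in log10-space, as in the question
logx, logy = np.log10(xdata), np.log10(ydata)
coeff, var = curve_fit(func, logx, logy, sigma=np.log10(error))
logyfit = func(logx, *coeff)

# Back-transform and plot in the original units, with log-scaled axes
plt.plot(xdata, ydata, 'ro')
plt.plot(10**logx, 10**logyfit, 'k-')
plt.xscale('log')
plt.yscale('log')
plt.show()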

How to transform nonlinear model to linear?

I'm analyzing a dataset, and I know that the data should follow a power model:
y = a*x**b
I transformed it to linear by taking logarithms:
ln(y) = ln(a) + b* ln(x)
However, a problem arose when adding a trend line to the plot:
x_ln = np.log(wetarea_x)            # natural log, matching the ln model above
y_ln = np.log(arcgis_wtrshd_x)
slope, intercept, r_value, p_value, std_err = scipy.stats.mstats.linregress(x_ln, y_ln)
yy = np.exp(intercept) * wetarea_x**slope
plt.scatter(wetarea_x, arcgis_wtrshd_x, color='blue')
plt.plot(wetarea_x, yy, color='green')
This is what I get with this code.
How do I modify the code so that the trend line on the plot is correct?
The strange green plot is what you get when you do a line plot in matplotlib with the x values unsorted. It is a line plot, but it connects (x, y) pairs that jump left and right (in your specific case, it looks like they jump back to near the x origin), which produces these strange patterns.
You don't have this problem with the blue plot, because it's a scatter plot.
Try calling the plot after sorting both arrays according to the indices of the first using numpy.argsort, say
wetarea_x[np.argsort(wetarea_x)]
and
yy[np.argsort(wetarea_x)]
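Put together, a minimal sketch of that fix, reusing yy from the snippet above:

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(wetarea_x)          # indices that sort the x values

plt.scatter(wetarea_x, arcgis_wtrshd_x, color='blue')
plt.plot(wetarea_x[order], yy[order], color='green')   # trend line drawn left to right
plt.show()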

Python Curve Fitting

I'm trying to use Python to fit a curve to a set of points. Essentially the points look like this.
The blue curve indicates the data entered (in this case 4 points), with the green being a curve fit using np.polyfit and np.poly1d. What I essentially want is a curve fit that looks very similar to the blue line but with a smoother change in gradient at points 1 and 2 (meaning I don't require the line to pass through these points).
What would be the best way to do this? The line looks like an arctangent, is there any way to specify an arctangent fit?
I realise this is a bit of a rubbish question but I want to get away without specifying more points. Any help would be greatly appreciated.
It seems that you might be after interpolation between points rather than fitting a polynomial. References: Spline Interpolation with Python and Fitting polynomials to data.
However, in either case here is a code snippet that should get you started:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

x = np.array([0, 5, 10, 15, 20, 30, 40, 50])
y = np.array([0, 0, 0, 12, 40, 40, 40, 40])

# Polynomial fit (you can change the degree as you see fit)
coeffs = np.polyfit(x, y, deg=4)
poly = np.poly1d(coeffs)
yp = np.polyval(poly, x)

# Cubic interpolation on an evenly spaced grid
interpLength = 10
new_x = np.linspace(x.min(), x.max(), interpLength)
new_y = interp1d(x, y, kind='cubic')(new_x)

plt.plot(x, y, '.', x, yp, '-', new_x, new_y, '--')
plt.show()
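If you specifically want the arctangent shape mentioned in the question, a minimal sketch with scipy.optimize.curve_fit; the model function and starting guesses below are assumptions for illustration, not part of the original answer:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def atan_model(x, a, b, x0, c):
    # a scales the height, b the steepness, x0 the centre, c the vertical offset
    return a * np.arctan(b * (x - x0)) + c

x = np.array([0, 5, 10, 15, 20, 30, 40, 50], dtype=float)
y = np.array([0, 0, 0, 12, 40, 40, 40, 40], dtype=float)

params, cov = curve_fit(atan_model, x, y, p0=[20, 0.5, 17, 20])

xs = np.linspace(x.min(), x.max(), 200)
plt.plot(x, y, '.', xs, atan_model(xs, *params), '-')
plt.show()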
