I am trying to plot a Linear Regression onto a scatterplot in Python.
In R I would simply do the following:
# Run OLS linear regression
fit_1 <- lm(medv ~ lstat)
plot(medv ~ lstat)
abline(fit_1, col = "red")
I have been looking at different solutions in Python, but I can't seem to get any of them to work.
My script is:
# Plot data (lstat on x, medv on y, matching the R formula)
Boston.plot(kind='scatter', x='lstat', y='medv', color="black")
plt.show()

# Run linear regression (statsmodels formula interface)
import statsmodels.formula.api as smf
fit_1 = smf.ols(formula='medv ~ lstat', data=Boston).fit()

# Show summary
print(fit_1.summary())

# Plot regression line
# Insert code here
It can be done quite simply. In the code below, I use sklearn to fit the model and predict the values.

import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

# assuming the Boston frame from the question
X = Boston[['lstat']]
y = Boston['medv']

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

plt.plot(X, y, 'o')
# overlay the fitted regression line
plt.plot(X, predictions, '-')
plt.show()
Try this:
plt.plot(Boston.lstat, fit_1.fittedvalues, 'r')
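For context, a minimal end-to-end sketch of this approach, assuming the Boston frame and the formula-API fit from the question:

import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

fit_1 = smf.ols(formula='medv ~ lstat', data=Boston).fit()
plt.scatter(Boston.lstat, Boston.medv, color='black')
# fittedvalues are aligned row-for-row with Boston.lstat
plt.plot(Boston.lstat, fit_1.fittedvalues, 'r')
plt.show()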
I saw this on Statology, and it helped me a lot:

import numpy as np
import matplotlib.pyplot as plt

def abline(slope, intercept):
    # draw a line with the given slope and intercept across the current axes
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--')
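As a quick usage sketch, assuming the statsmodels fit_1 from the question (its params are indexed by term name):

Boston.plot(kind='scatter', x='lstat', y='medv', color='black')
abline(fit_1.params['lstat'], fit_1.params['Intercept'])
plt.show()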
Is it possible to find the peak (vertex?) values (x,y) of a polynomial regression line that was computed using Matplotlib?
I've included my basic setup below (of course with fuller data sets), as well as a screenshot of the actual regression line question.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
degree=6
setX={'Fixation Duration': {0:1,1:2,2:3}}
setY={'Fixation Occurrences': {0:1,1:2,2:3}}
X_gall=pd.DataFrame.from_dict(setX)
Y_gall=pd.DataFrame.from_dict(setY)
X_seqGall = np.linspace(X_gall.min(),X_gall.max(),300).reshape(-1,1)
polyregGall=make_pipeline(PolynomialFeatures(degree),LinearRegression())
polyregGall.fit(X_gall,Y_gall)
plt.scatter(X_gall,Y_gall, c="#1E4174", s=100.0, alpha=0.4)
plt.plot(X_seqGall,polyregGall.predict(X_seqGall),color="#1E4174", linewidth=4)
plt.show()
I would like to find the x,y values indicated by the red arrows in the screenshot.
You can find the maximum from the underlying plot data.
First, let's change your plotting commands to explicitly define the axes:
fig, ax = plt.subplots(figsize=(6,4))
_ = ax.scatter(X_gall,Y_gall, c="#1E4174", s=100.0, alpha=0.4)
poly = ax.plot(X_seqGall,polyregGall.predict(X_seqGall),color="#1E4174", linewidth=4)
plt.show()
Now you can access the line data:
lines = poly[0].axes.lines
for line in lines:
    max_y = np.max(line.get_ydata())
    print(f"Maximum y is: {max_y}")
    x_of_max_y = line.get_xdata()[np.argmax(line.get_ydata())]
    print(f"x value of maximum y is: {x_of_max_y}")
Output:
Maximum y is: 3.1515605364361114
x value of maximum y is: 2.8127090301003346
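Note that this peak is only as precise as the 300-point X_seqGall grid. A denser grid narrows it down further; a sketch reusing the names from the question:

X_fine = np.linspace(X_gall.values.min(), X_gall.values.max(), 3000).reshape(-1, 1)
y_fine = polyregGall.predict(X_fine).ravel()
idx = y_fine.argmax()
print(f"peak near x = {X_fine.ravel()[idx]:.4f}, y = {y_fine[idx]:.4f}")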
I have never been great with Python plotting concepts, and apparently I'm still missing something.
Here is my code.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

data = pd.read_csv('C:\\Users\\path_here\\wine.csv')

# fit DBSCAN on the feature space
model = DBSCAN(eps=0.9, min_samples=10).fit(data)

# slice out the two columns to plot
array_flavanoids = data.iloc[:, 2]
array_colorintensity = data.iloc[:, 3]

# scatter plot, colored by cluster label
colors = model.labels_
plt.scatter(array_flavanoids, array_colorintensity, c=colors, marker='o')
plt.xlabel('Concentration of flavanoids', fontsize=16)
plt.ylabel('Color intensity', fontsize=16)
plt.title('Concentration of flavanoids vs Color intensity', fontsize=20)
plt.show()
Here is my result.
I am expecting the outliers to be in a different color than the non-outliers. So, something like this.
Maybe one color for outliers and another for non-outliers. I am just trying to learn the concept in this exercise. I am trying to follow the example from this link.
https://towardsdatascience.com/outlier-detection-python-cd22e6a12098
I am using this data source.
https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
I am testing different data sets.
I got this to work.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def dbscan(X, eps, min_samples):
    # standardize features before clustering
    ss = StandardScaler()
    X = ss.fit_transform(X)
    db = DBSCAN(eps=eps, min_samples=min_samples)
    # color each point by its cluster label (noise points get label -1)
    y_pred = db.fit_predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='Paired')
    plt.title("DBSCAN")
    plt.show()

dbscan(data, eps=.5, min_samples=5)
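If you specifically want just two colors, one for outliers and one for everything else, here is a sketch building on the same idea (DBSCAN labels noise points with -1; the column choice is illustrative):

labels = DBSCAN(eps=.5, min_samples=5).fit_predict(StandardScaler().fit_transform(data))
noise = labels == -1
plt.scatter(data.iloc[~noise, 0], data.iloc[~noise, 1], c='steelblue', label='clustered')
plt.scatter(data.iloc[noise, 0], data.iloc[noise, 1], c='red', label='outliers')
plt.legend()
plt.show()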
I found this to be a great resource.
https://medium.com/@plog397/functions-to-plot-kmeans-hierarchical-and-dbscan-clustering-c4146ed69744
I'm trying to make a small program that will plot a graph with a best-fit line and predict the COST value based on an inputted SIZE value.
I always get this error, and I do not know what it means:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
No handles with labels found to put in legend.
This is the graph that I get (red), and I think the curve should look like the green curve that I have drawn.
And finally, the program makes the prediction only after I close the graph window.
What am I doing wrong?
This is the code:
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
size = [[1],[2],[3],[4],[5],[7],[9],[10],[11],[13]]
cost = [[10],[22],[35],[48],[60],[80],[92],[111],[118],[133]]

def predict(size, cost, x):
    dates = np.reshape(size, (len(size), 1))
    svr_poly = SVR(kernel="poly", C=1e3, degree=2)
    svr_poly.fit(size, cost)
    plt.scatter(size, cost, color="blue")
    plt.plot(cost, svr_poly.predict(cost), color="red")
    plt.xlabel("Size")
    plt.ylabel("Cost")
    plt.title("prediction")
    plt.legend()
    plt.show()

predictedcost = predict(size, cost, 7)
print(predictedcost)
I found the answer to this problem by fitting the quadratic with np.polyfit instead of SVR; if you are interested, check it out:
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1,2,3,5,6,7,4,7,8,9,5,10,11,7,6,6,10,11,11,12,13,13,14])
Y = np.array([2,3,5,8,11,14,9,19,15,19,15,16,14,7,13,13,14,13,23,25,26,27,33])

# fit a second-degree polynomial and unpack its coefficients
koeficienti_polinom = np.polyfit(X, Y, 2)
a = koeficienti_polinom[0]
b = koeficienti_polinom[1]
c = koeficienti_polinom[2]

# evaluate the fitted curve on a smooth grid
xval = np.linspace(np.min(X), np.max(X))
regression = a * xval**2 + b * xval + c

# predict the value for a user-supplied input
predX = float(input("Enter: "))
predY = a * predX**2 + b * predX + c

plt.scatter(X, Y, s=20, color="blue")
plt.scatter(predX, predY, color="red")
plt.plot(xval, regression, color="black", linewidth=1)
print("Quadratic prediction: ", round(predY, 2))
plt.show()
I've been stuck on this issue for two days now. I have some data points I put in a scatter plot and get this:
Which is nice, but now I also want to add a regression line, so I had a look at this example from sklearn and changed the code to this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
degrees = [3, 4, 5]
X = combined[['WPI score']]
y = combined[['CPI score']]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)

    X_test = X  # np.linspace(0, 1, len(combined))
    plt.plot(X, pipeline.predict(X_test), label="Model")
    plt.scatter(X, y, label="CPI-WPI")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))

plt.savefig(pic_path + 'multi.png', bbox_inches='tight')
plt.show()
which has the following output:
Note that X and y are both DataFrames of size (151, 1). I can post the content of X and y too, if necessary.
What I want is a nice smooth line, but I can't seem to figure out how to do this.
[Edit]
The question here is: How do I get a single smooth, curvy polynomial line instead of multiple ones with seemingly random pattern.
[Edit 2]
The problem is, when I use the linspace like this:
X_test = np.linspace(1, 4, 151)
X_test = X_test[:, np.newaxis]
I get an even more random pattern:
The trick was to set up the code as follows:

X_test = np.linspace(min(X['WPI score']), max(X['WPI score']), X.shape[0])
X_test = X_test[:, np.newaxis]
plt.plot(X_test, pipeline.predict(X_test), label="Model")

This yields the following result (a much nicer, single smooth line):
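The underlying reason for the jagged output, as I understand it, is that plt.plot connects points in the order they appear, and the raw X column isn't sorted. An equivalent fix is to sort before plotting; a sketch using the same names:

order = np.argsort(X.values.ravel())
X_sorted = X.values[order]
plt.plot(X_sorted, pipeline.predict(X_sorted), label="Model")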
from astropy.io import ascii
from matplotlib.pyplot import plot, axis, xlabel, ylabel, title

gh_data = ascii.read('http://dept.astro.lsa.umich.edu/~ericbell/data/GHOSTS/M81/ngc3031-field15.newphoto_radec')
ra = gh_data['col5'][:]
dec = gh_data['col6'][:]
f606 = gh_data['col3'][:]
f814 = gh_data['col4'][:]

plot(f606 - f814, f814, 'bo', alpha=0.15)
axis([-1, 2.5, 27, 23])
xlabel('F606W-F814W')
ylabel('F814W')
title('Field 14')
The data set is imported and organized into different columns, I am trying to overlay a line of best fit, or linear regression over the scatterplot created, but I cannot figure out how. Thanks in advance.
As @rayryeng pointed out, your code just plots the data, but doesn't actually compute any regression results to plot. Here's one way of doing it:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame({"y": np.arange(1, 11) + np.random.rand(10),
                     "x": np.arange(1, 11) + np.random.rand(10)})
Use statsmodels OLS method to fit a regression line, and params to extract the coefficient on the single regressor:
beta_1 = sm.OLS(data.y, data.x).fit().params['x']  # no intercept here, since no constant was added
Produce a scatterplot and add a regression line:
fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(range(1, 11), [i * beta_1 for i in range(1, 11)], label="best fit")
ax.legend(loc="best")
plt.show()
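One design note: sm.OLS as used here fits a line through the origin, because statsmodels does not add an intercept automatically. If you want one, sm.add_constant is the usual route; a sketch (not part of the original answer):

X_const = sm.add_constant(data.x)                       # adds a 'const' column of ones
fit = sm.OLS(data.y, X_const).fit()
intercept, slope = fit.params['const'], fit.params['x']

fig, ax = plt.subplots()
ax.scatter(data.x, data.y)
ax.plot(data.x, intercept + slope * data.x, label="with intercept")
ax.legend(loc="best")
plt.show()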