Python sklearn polynomial regression

I've been stuck on this issue for two days now. I have some data points that I put in a scatter plot, and the result looks fine.
But now I also want to add a regression line, so I had a look at this example from sklearn and changed the code to this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
degrees = [3, 4, 5]
X = combined[['WPI score']]
y = combined[['CPI score']]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)

    X_test = X  # np.linspace(0, 1, len(combined))
    plt.plot(X, pipeline.predict(X_test), label="Model")
    plt.scatter(X, y, label="CPI-WPI")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))

plt.savefig(pic_path + 'multi.png', bbox_inches='tight')
plt.show()
which gives the following output:
Note that X and y are both DataFrames of shape (151, 1). I can post the contents of X and y too, if necessary.
What I want is a nice smooth line, but I can't figure out how to get one.
[Edit]
The question here is: how do I get a single smooth, curved polynomial line instead of multiple lines in a seemingly random pattern?
[Edit 2]
The problem is that when I use linspace like this:
X_test = np.linspace(1, 4, 151)
X_test = X_test[:, np.newaxis]
I get an even more random pattern:

The trick was to change the code as follows:
X_test = np.linspace(min(X['GPI score']), max(X['GPI score']), X.shape[0])
X_test = X_test[:, np.newaxis]
plt.plot(X_test, pipeline.predict(X_test), label="Model")
which yields the following result (a much nicer, single smooth line):
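For reference, here is a minimal, self-contained sketch of the same fix with synthetic stand-in data (the real combined DataFrame is not shown here): the important part is that the points passed to plt.plot are evenly spaced, sorted, and span the feature's observed range.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the WPI/CPI data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(1, 4, 151))[:, np.newaxis]
y = 0.5 * X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=1.0, size=151)

pipeline = Pipeline([("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
                     ("linear_regression", LinearRegression())])
pipeline.fit(X, y)

# Evenly spaced, sorted test points over the observed range give one smooth curve
X_test = np.linspace(X.min(), X.max(), 200)[:, np.newaxis]

plt.scatter(X, y, label="data")
plt.plot(X_test, pipeline.predict(X_test), color="red", label="Model")
plt.legend(loc="best")
plt.show()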

Related

How to fit more complex functions with sklearn?

I used sklearn in Python to fit polynomial functions:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_reg_model = LinearRegression()
poly_features = poly.fit_transform(xvalues.reshape(-1, 1))
poly_reg_model.fit(poly_features, y_values)
final_predicted = poly_reg_model.predict(poly_features)
...
Instead of only using x^n terms, I want to include a (1-x^2)^(1/2) term in the fit function.
Is this possible with sklearn?
I tried to define a feature that includes more complex terms, but I failed to achieve this.
I'm not sure whether it is possible within scikit-learn; after all, a polynomial fit is, mathematically speaking, constrained to polynomial terms. If you want to fit a formula with some unknown parameters, you can use scipy.optimize.curve_fit. First, let us generate some dummy data with noise:
import numpy as np
from matplotlib import pyplot as plt
def f(x):
    return (1 - x**2)**(1/2)
xvalues = np.linspace(-1, 1, 30)
yvalues = [f(x) + np.random.randint(-10, 10)/100 for x in xvalues]
Then, we set up our function to be optimized:
from scipy.optimize import curve_fit
def f_opt(x, a, b):
    return (a - x**2)**(1/b)
popt, pcov = curve_fit(f_opt, xvalues, yvalues)
You can, of course, modify this function to make it more flexible. Finally, we plot the results:
plt.scatter(xvalues, yvalues, label='data')
plt.plot(xvalues, f_opt(xvalues, *popt), 'r-', label='fit')
plt.legend()
Now you can use f_opt(new_x, *popt) to predict new points (alternatively, you can print the values and hard-code them). popt holds the fitted parameters of f_opt, i.e. everything except x; for more details check the documentation I've linked.
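If you do want to stay inside scikit-learn, one alternative (not part of the answer above) is to build the design matrix yourself and keep LinearRegression. This only works when the unknown coefficients enter linearly, i.e. as weights in front of fixed basis functions such as sqrt(1 - x^2); it cannot fit a parameter like the exponent b above. A sketch with hypothetical data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data on (-1, 1); replace with your own xvalues / y_values
xvalues = np.linspace(-0.99, 0.99, 30)
y_values = np.sqrt(1 - xvalues**2) + np.random.normal(scale=0.05, size=xvalues.size)

# Custom basis: [x, x^2, sqrt(1 - x^2)] instead of plain polynomial features
features = np.column_stack([xvalues, xvalues**2, np.sqrt(1 - xvalues**2)])

model = LinearRegression()
model.fit(features, y_values)
predicted = model.predict(features)

curve_fit remains the better tool when the parameters appear non-linearly in the formula.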

Polynomial Regression plot not showing correctly

I run this code for polynomial regression using sklearn, but my plot is not what I was expecting. As you can see, I'm not getting a smooth line; it's jumping from one point to another. From my understanding I have to sort X, but when I do that, all I get is an empty plot with a single straight line.
import operator
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.formula.api as smf
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols=['Age', 'Annual Income (k$)', 'Spending Score (1-100)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=3, max_iter=100)
y_kmeans= kmeans.fit_predict(x)
mydict = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)
df0 = df[df.index.isin(mydict[0].tolist())]
X = df0[['Age', 'Annual Income (k$)']]
Y = df0['Spending Score (1-100)']
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, Y)
y_poly_pred = model.predict(X_poly)
r2 = r2_score(Y,y_poly_pred)
print(r2)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept = False))
model.fit(X,Y)
plt.scatter(X.iloc[:, 1], Y, color='red')
plt.plot(X, Y, color='blue')
plt.xlabel('Age. Annual income')
plt.ylabel('Spending Score')
plt.show()
TL;DR: the data is not linearly dependent.
The graph got so messy because you plotted X (the training data) against Y (the actual target values) instead of the model's predictions, and because the data itself is noisy and not really linearly dependent.
I suggest you:
split the data into train and test sets, and after you train the model, check the error on the test set; maybe create two plots, one with the model's predictions for the test data and one with the actual values for the test data (see the sketch after the snippet below),
and change the plot code to this:
Y_pred = model.predict(X)      # predictions from the fitted pipeline
plt.scatter(X.iloc[:, 1], Y)   # scatter one feature column against the target
plt.plot(X.iloc[:, 1], Y_pred, color='red')
plt.show()
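A minimal sketch of that suggestion, assuming X and Y are the Age/Income and Spending Score frames built in the question (the split ratio and plot layout are only illustrative):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Hold out part of the data to check the error on unseen points
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

print("Test MSE:", mean_squared_error(Y_test, Y_pred))

# One plot for the model's predictions, one for the actual test values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_test.iloc[:, 1], Y_pred, color='red')
ax1.set_title('Predicted (test set)')
ax2.scatter(X_test.iloc[:, 1], Y_test, color='blue')
ax2.set_title('Actual (test set)')
plt.show()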

Linear regression plot is really bad

observations that are different from each other, so I ran the regression again, but only for one cluster. It also came out wrong. What exactly is wrong here? I should also point out that I am still new to this (linear regression etc.), so my understanding of it all is still limited. How can I fix this plot? If possible, please also explain why it is wrong.
Code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(0)
# ... clustering and X / Y construction as in the previous question ...
kmeans.cluster_centers_
kmeans.labels_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
plt.scatter(X.iloc[:, 1], Y)
plt.show()
You're performing multiple linear regression, since you have 2 input features ('Age', 'Annual Income (k$)') that try to predict the output feature ('Spending Score (1-100)'). You need to plot this data in 3D, in order to properly visualize the regression.
Even though I can't test your code without the data, something like this should work (after training the model):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.iloc[:, 0], X.iloc[:, 1], Y)
ax.plot(X.iloc[:, 0], X.iloc[:, 1], y_pred, color='red')  # y_pred: predictions from the trained model
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
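If the raw 3-D point cloud is still hard to read, another option (not part of the original answer) is to predict on a grid and draw the fitted surface. A sketch, assuming a fitted pipeline called model like the one built in the previous question:

import numpy as np
import matplotlib.pyplot as plt

# Grid covering the observed range of both features
age_grid, income_grid = np.meshgrid(
    np.linspace(X.iloc[:, 0].min(), X.iloc[:, 0].max(), 30),
    np.linspace(X.iloc[:, 1].min(), X.iloc[:, 1].max(), 30))
grid_points = np.column_stack([age_grid.ravel(), income_grid.ravel()])
z_pred = model.predict(grid_points).reshape(age_grid.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.iloc[:, 0], X.iloc[:, 1], Y, color='red')
ax.plot_surface(age_grid, income_grid, z_pred, alpha=0.4)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
plt.show()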

Why is this array compared to two values instead of one?

I'm having trouble understanding what this bit of code means. Why is y.values compared to a tuple with two values, when the shape of the array is a single row (650,)?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
df = pd.read_csv('readonly/mushrooms.csv')
df2 = pd.get_dummies(df)
df3 = df2.sample(frac=0.08)
X = df3.iloc[:,2:]
y = df3.iloc[:,1]
pca = PCA(n_components=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(pca, y, random_state=0)
plt.figure(dpi=120)
plt.scatter(pca[y.values==0,0], pca[y.values==0,1], alpha=0.5, label='Edible', s=2) # < --- why is y.values compared to two values instead of one???
plt.scatter(pca[y.values==1,0], pca[y.values==1,1], alpha=0.5, label='Poisonous', s=2)
plt.legend()
plt.title('Mushroom Data Set\nFirst Two Principal Components')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.gca().set_aspect('equal')
Actually, y.values is not being compared to two values. In pca[y.values==0, 0], the expression y.values == 0 builds a boolean mask that selects rows, and the 0 after the comma is a column index that picks the first principal component of those rows.
Please see the fit_transform documentation: it outputs a 2-dimensional array, which is why two indices (rows, then columns) are used.
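A tiny sketch with made-up numbers showing the same indexing pattern (boolean row mask plus column index):

import numpy as np

pca = np.array([[0.1, 0.2],
                [0.3, 0.4],
                [0.5, 0.6]])   # shape (3, 2): one row per sample, two components
labels = np.array([0, 1, 0])   # stand-in for y.values

mask = labels == 0             # boolean mask over the rows, shape (3,)
print(pca[mask, 0])            # first component of the rows where labels == 0 -> [0.1 0.5]
print(pca[mask, 1])            # second component of those same rows -> [0.2 0.6]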

R abline() equivalent in Python

I am trying to plot a Linear Regression onto a scatterplot in Python.
In R I would simply do the following:
Run OLS Linear Regression
fit_1 <- lm(medv ~ lstat)
plot(medv ~ lstat)
abline(fit_1, col = "red")
I have been looking at different solutions in Python, but I can't seem to get any of them to work.
My script is:
Plot Data
Boston.plot(kind='scatter', x='medv', y='lstat', color = "black")
plt.show()
Run Linear Regression
fit_1 = sm.ols(formula='medv ~ lstat', data= Boston).fit()
Show Summary
fit_1.summary()
Plot Regression Line
Insert code here
It can be done quite simply. In the code below, I use sklearn to fit the model and predict the values.
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
model = LinearRegression()
model.fit(X,y)
predictions = model.predict(X)
plt.plot(X,y,'o')
# change here
plt.plot(X, predictions, '-')
plt.show()
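For the data in the question, X and y would presumably be built from the columns in the formula medv ~ lstat (these names come from the question, not from this answer):

X = Boston[['lstat']]   # predictor as a two-dimensional frame, as sklearn expects
y = Boston['medv']      # response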
Try this:
plt.plot(Boston.lstat, fit_1.fittedvalues, 'r')
I saw this on Statology, and it helped me a lot:
import numpy as np
import matplotlib.pyplot as plt

def abline(slope, intercept):
    """Plot a line across the current axes from a slope and intercept, like R's abline()."""
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--')
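With the statsmodels formula fit from the question, the helper could then be used roughly like this (a hedged sketch: it assumes fit_1 came from smf.ols('medv ~ lstat', ...), whose params Series is indexed by 'Intercept' and 'lstat', and it puts lstat on the x-axis so the line matches the fit):

Boston.plot(kind='scatter', x='lstat', y='medv', color='black')
abline(fit_1.params['lstat'], fit_1.params['Intercept'])  # slope, then intercept
plt.show()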
