I am trying to fit a multiple linear regression model:
Y = c + a1*X1 + a2*X2 + a3*X3 + a4*X4 + a5*X5 + a6*X6
If my model had only 3 variables, I would have used a 3D plot.
How can I plot this? I basically want to see what the best fit line looks like. Or should I draw multiple scatter plots and look at the effect of each individual variable,
Y = a1*X1 with all others set to zero, and see the best fit line for each?
What is the best approach for such models? I know it is not possible to visualize higher dimensions, so I want to know what the best approach would be. I am desperate to see the best fit line.
I found this post, which was helpful, and followed it:
https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Based on the suggestions there,
I am currently just drawing scatter plots of the dependent variable vs. the 1st independent variable, then vs. the 2nd independent variable, and so on. I may not be able to see the best fit line for the complete model, but I can see how the dependent variable depends on each individual variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# `train` (features) and `y` (SalePrice) are assumed to be loaded already
train_copy = train[['OverallQual', 'AllSF', 'GrLivArea', 'GarageCars']]
train_copy = pd.get_dummies(train_copy)
train_copy = train_copy.fillna(0)
linear_regr_test = LinearRegression()
fig, axes = plt.subplots(1, len(train_copy.columns), sharey=True, constrained_layout=True, figsize=(30, 15))
for i, e in enumerate(train_copy.columns):
    x_col = train_copy[e].values[:, np.newaxis]  # one predictor as a 2D column
    # Refit a separate one-variable regression for each predictor
    linear_regr_test.fit(x_col, y.values)
    axes[i].set_title("Best fit line")
    axes[i].set_xlabel(str(e))
    axes[i].set_ylabel('SalePrice')
    axes[i].scatter(x_col, y, color='g')
    axes[i].plot(x_col, linear_regr_test.predict(x_col), color='k')
You can use Seaborn's regplot function, and use the predicted and actual data for comparison. It is not the same as plotting a best fit line, but it shows you how well the model works.
sns.regplot(x=y_test, y=y_predict, ci=None, color="b")
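To make the predicted-vs-actual idea concrete for a six-variable model, here is a minimal sketch. The data is synthetic and the coefficients are made up purely for illustration (they are assumptions, not the asker's data); points close to the dashed diagonal are well predicted by the full model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): six predictors, one response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_coefs = np.array([3.0, -2.0, 1.5, 0.5, 0.0, 4.0])  # made-up values
y = 10.0 + X @ true_coefs + rng.normal(scale=2.0, size=200)

# Fit the full model on all six predictors at once.
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Actual on the x axis, predicted on the y axis.
fig, ax = plt.subplots()
ax.scatter(y, y_pred, alpha=0.5)

# Dashed diagonal: where predictions would equal the actual values.
lims = [min(y.min(), y_pred.min()), max(y.max(), y_pred.max())]
ax.plot(lims, lims, color="k", linestyle="--", label="perfect prediction")
ax.set_xlabel("Actual Y")
ax.set_ylabel("Predicted Y")
ax.legend()
# plt.show()  # uncomment to display the figure
```

This collapses the six-dimensional fit into a single 2D picture, so it shows overall fit quality rather than the effect of any one variable.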
You could try to visualize how well your model is performing by comparing actual and predicted values.
Assuming that our actual values are stored in Y and the predicted ones in Y_, we could plot and compare the two distributions.
import seaborn as sns
# kdeplot draws the same density curves as distplot(hist=False), which is deprecated in recent seaborn
ax1 = sns.kdeplot(Y, color="r", label="Actual Value")
sns.kdeplot(Y_, color="b", label="Fitted Values", ax=ax1)
ax1.legend()
I have a sample time-series dataframe:
df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996',
'1997','1998','1999','2000'],
'count':[96,184,148,154,160,149,124,274,322,301,300]})
I want a linear regression line with a confidence interval band around it. I managed to plot the regression line, but I am finding it difficult to add the confidence interval band. Here is the snippet of my code for the linear regression plot:
import matplotlib.pyplot as plt
from matplotlib import ticker
from sklearn.linear_model import LinearRegression
X = df.date_ordinal.values.reshape(-1,1)
y = df['count'].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X, y)
predictions = reg.predict(X.reshape(-1, 1))
fig, ax = plt.subplots()
plt.scatter(X, y, color ='blue',alpha=0.5)
plt.plot(X, predictions,alpha=0.5, color = 'black',label = r'$N$'+ '= {:.2f}t + {:.2e}\n'.format(reg.coef_[0][0],reg.intercept_[0]))
plt.ylabel('count($N$)');
plt.xlabel(r'Year(t)');
plt.legend()
formatter = ticker.ScalarFormatter(useMathText=True)
formatter.set_scientific(True)
formatter.set_powerlimits((-1,1))
ax.yaxis.set_major_formatter(formatter)
plt.xticks(ticks = df.date_ordinal[::5], labels = df.index.year[::5])
plt.grid()
plt.show()
plt.clf()
This gives me a nice linear regression plot for the time series.
Problem & Desired output
However, I also need a confidence interval band around the regression line.
Help on this issue would be highly appreciated.
The problem you are running into is that the class you are using, sklearn.linear_model.LinearRegression, does not provide a simple way to obtain the confidence interval.
If you absolutely want to use sklearn.linear_model.LinearRegression, you will have to compute the confidence interval yourself. One popular approach is bootstrapping, as was done in this previous answer.
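As a rough illustration of the bootstrapping route, here is a minimal sketch using the data from the question. It resamples the rows with replacement, refits the model each time, and takes percentiles of the refitted lines. The number of resamples (1000) and the 95% level are assumptions chosen for the example, and this is a plain percentile bootstrap, not necessarily what the linked answer does.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data from the question, with the years converted to numbers.
years = np.arange(1990, 2001, dtype=float).reshape(-1, 1)
counts = np.array([96, 184, 148, 154, 160, 149, 124, 274, 322, 301, 300],
                  dtype=float)

rng = np.random.default_rng(0)
n_boot = 1000  # number of bootstrap resamples (assumption)
boot_preds = np.empty((n_boot, len(years)))

for b in range(n_boot):
    # Resample rows with replacement and refit the regression.
    idx = rng.integers(0, len(years), size=len(years))
    fit = LinearRegression().fit(years[idx], counts[idx])
    # Evaluate every refitted line on the original x grid.
    boot_preds[b] = fit.predict(years)

# Pointwise 95% percentile band across the refitted lines.
lower, upper = np.percentile(boot_preds, [2.5, 97.5], axis=0)
center = LinearRegression().fit(years, counts).predict(years)

fig, ax = plt.subplots()
ax.scatter(years.ravel(), counts, color="blue", alpha=0.5)
ax.plot(years.ravel(), center, color="black")
ax.fill_between(years.ravel(), lower, upper, color="black", alpha=0.2)
# plt.show()  # uncomment to display the figure
```

With only 11 points the band is fairly noisy, so treat it as a visual guide rather than an exact interval.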
However, the way I interpret your question is that you are looking for a way to do this quickly inside a plot command, similar to the screenshot you attached. If your goal is purely visualization, then you can simply use the seaborn package, which is also where your example image comes from.
import seaborn as sns
sns.lmplot(x='year', y='count', data=df, fit_reg=True, ci=95, n_boot=1000)
Here I have spelled out the three parameters of interest, fit_reg, ci, and n_boot, with their default values. Refer to the documentation for a full description.
Seaborn computes this band by bootstrapping (hence n_boot) and relies on the statsmodels package only for some of its fits. So if you want something in between pure visualization and writing the confidence interval function from scratch yourself, I would refer you to statsmodels instead. Specifically, look at the documentation for calculating a confidence interval of an ordinary least squares (OLS) linear regression.
The following code should give you a starting point for using statsmodels in your example:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':['1990','1991','1992','1993','1994','1995','1996','1997','1998','1999','2000'],
'count':[96,184,148,154,160,149,124,274,322,301,300]})
df['year'] = df['year'].astype(float)
X = sm.add_constant(df['year'].values)
ols_model = sm.OLS(df['count'].values, X)
est = ols_model.fit()
out = est.conf_int(alpha=0.05, cols=None)
fig, ax = plt.subplots()
df.plot(x='year',y='count',linestyle='None',marker='s', ax=ax)
y_pred = est.predict(X)
x_pred = df.year.values
ax.plot(x_pred,y_pred)
pred = est.get_prediction(X).summary_frame()
ax.plot(x_pred,pred['mean_ci_lower'],linestyle='--',color='blue')
ax.plot(x_pred,pred['mean_ci_upper'],linestyle='--',color='blue')
# Alternative way to plot
def line(x, b=0, m=1):
    return m * x + b
ax.plot(x_pred, line(x_pred, est.params[0], est.params[1]), color='blue')
This produces your desired output, and all the underlying values remain accessible via standard statsmodels functions.
I have a DataFrame df_data with two columns, 'Egy' and 'fx', that I plot in this way:
plot_1 = df_data.plot(x="Egy", y="fx", color="red", ax=ax1, linewidth=0.85)
plot_1.set_xscale('log')
plt.show()
But then I want to smooth this curve using spline like this:
from scipy.interpolate import spline
import numpy as np
x_new = np.linspace(df_data['Egy'].min(), df_data['Egy'].max(),500)
f = spline(df_data['Egy'], df_data['fx'],x_new)
# ax1.plot returns a list of lines, so set the scale on the axes itself
ax1.plot(x_new, f, color="black", linewidth=0.85)
ax1.set_xscale('log')
plt.show()
And the plot I get is this (ignore the blue scatter points).
There are a lot of "peaks" in the smooth curve, mainly at lower x. How can I smooth this curve properly?
When I follow busybear's suggestion to use np.logspace instead of np.linspace, I get the following picture, which is not very satisfactory either.
You have your x values linearly scaled with np.linspace but your plot is log scaled. You could try np.geomspace for your x values or plot without the log scale.
Interpolating with spline only works well for data that is already smooth. What you need is to regularize the data first and interpolate afterwards. This will help to smooth out the bumps. Regularization is an advanced topic, and it would not be appropriate to discuss it in detail here.
Update: for regularization using machine learning, you might look into the scikit-learn library for Python.
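One possible reading of "regularize, then interpolate" is a smoothing spline fitted in log-x space. The sketch below uses scipy.interpolate.UnivariateSpline on synthetic data; the data, the noise level, and the choice of the smoothing factor s are all assumptions for illustration, and s would need tuning for real measurements.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for the 'Egy' / 'fx' columns (assumption).
rng = np.random.default_rng(0)
egy = np.geomspace(1e-2, 1e3, 60)           # log-spaced x values
noise_sd = 0.1
fx = np.sin(np.log10(egy)) + rng.normal(scale=noise_sd, size=egy.size)

# Work in log10(x) so the spline sees evenly spread points,
# matching how the curve is displayed on a log-scaled axis.
log_egy = np.log10(egy)

# s > 0 makes this a smoothing (not interpolating) spline;
# n * noise_variance is a common starting guess for s.
smoother = UnivariateSpline(log_egy, fx, k=3, s=egy.size * noise_sd**2)

# Evaluate on a dense log-spaced grid for plotting.
egy_new = np.geomspace(egy.min(), egy.max(), 500)
fx_smooth = smoother(np.log10(egy_new))
```

Plotting egy_new against fx_smooth on an axes with set_xscale('log') should then give a curve without the spurious peaks that exact interpolation through noisy points produces.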
I'm facing a silly problem while plotting a graph from a regression function calculated using scikit-learn. After constructing the function, I need to plot a graph that shows X and Y from the original data together with the points calculated by my function. The problem is that my function is not a straight line: despite being linear, it uses a Fourier series to give the right shape to my curve, and when I try to plot the line using:
ax.plot(df['GDPercapita'], modelp1.predict(df1), color='k')
I get a graph like this:
(image: Plot)
But the true graph is supposed to be a line following those black points:
(image: Dots to be connected)
I'm generating the graph using the following code:
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp1.predict(df1),color='k') #this line is changed to get the first pic.
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show(block=True)
Does anyone have an idea about what to do?
POST DISCUSSION EDIT:
Ok, so first things first:
The data can be downloaded at: http://www.est.ufmg.br/~marcosop/est171-ML/dados/worldDevelopmentIndicators.csv
I had to generate new features using a Fourier expansion of the normalized values of GDPercapita, in order to run an exhaustive search over the regression functions used to predict LifeExpectancy. I found that the number of parameters p that gives the best regression function is p=22.
Now I have to generate a polynomial function from the prediction points of the regression function with p=22, to show how the best regression function compares to a polynomial function of degree 22.
To generate the prediction I use the following code:
from sklearn import linear_model
modelp22 = linear_model.LinearRegression()
modelp22.fit(xp22,y_train)
df22 = df[p22]
fig, ax = plt.subplots()
ax.scatter(df['GDPercapita'], df['LifeExpectancy'], edgecolors=(0, 0, 0))
ax.scatter(df['GDPercapita'], modelp22.predict(df22),color='k')
ax.set_xlabel('GDPercapita')
ax.set_ylabel('LifeExpectancy')
plt.show(block=True)
Now I need to use the prediction points to create a polynomial function and plot a graph with: the original data (first scatter), the prediction points (second scatter), and the polynomial function (a curve), to show their visual relation.
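A likely cause of the zigzag in the first picture is that ax.plot connects points in row order, while the rows are not sorted by GDPercapita. The sketch below shows the usual remedy, sorting x and the predictions together before drawing the line. The arrays are synthetic stand-ins (assumptions) for df['GDPercapita'] and the model's predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumption): unsorted x values and a smooth
# prediction for each of them, as model.predict(...) would return.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y_pred = np.sin(2 * np.pi * x)

# Indices that put x in ascending order; reuse them for y_pred
# so each prediction stays paired with its own x value.
order = np.argsort(x)

fig, ax = plt.subplots()
ax.scatter(x, y_pred, color="k")               # the "dots to be connected"
ax.plot(x[order], y_pred[order], color="r")    # line through them, left to right
ax.set_xlabel("GDPercapita")
ax.set_ylabel("LifeExpectancy")
# plt.show()  # uncomment to display the figure
```

With a DataFrame, sorting the frame by the x column before predicting and plotting achieves the same thing.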
I have 2 datasets (df3 and df4 which respectively hold information for total head and efficiency) with a common independent variable (flow rate).
I am looking to plot both of them in the same graph but the dependent variables have different y-axes. I initially used lmplot() for the polynomial order functionality but this was unsuccessful in having both plots appear in one window. I would like assistance with combining both my scatter plot and regression plots into one plot which shows the overlap between the datasets.
I have used the following approach to generate my charts:
ax2.scatter(df3['Flow_Rate_(KG/S)'], df3['Efficiency_%'], color='pink')
ax2.scatter(df4['Flow_Rate_(KG/S)'], df4['Total Head'], color='teal')
plt.show()
The reason why it is important for the lines to be plotted against each other is that to monitor pump performance, we need to have both the total head (M) and efficiency % of the pump to understand the relationship and subsequent degradation of performance.
The only other way I could think of is to write the polynomial functions as equations to be put into arguments in the plot function and have them drawn out as such. I haven't yet tried this but thought I'd ask if there are any other alternatives before I head down this pathway.
Let me try to rephrase the problem: You have two datasets with common independent values, but different dependent values (f(x) and g(x), respectively). You want to plot them both in the same graph; however, the dependent values have totally different ranges. Therefore you want to have two different y axes, one for each dataset. The data should be plotted as a scatter plot and a regression line should be shown for each of them; you are more interested in seeing the regression line than in knowing or calculating the regression curve itself. Hence you tried to use seaborn's lmplot, but you were unable to get both datasets into the same graph.
In case the above is the problem you want to solve, the answer could be the following.
lmplot essentially plots a regplot to an axes grid. Because you don't need that axes grid here, using a regplot may make more sense. You may then create an axes and a twin axes and plot one regplot to each of them.
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df1 = pd.DataFrame({"x": np.sort(np.random.rand(30)),
"f": np.sort(np.random.rayleigh(size=30))})
df2 = pd.DataFrame({"x": np.sort(np.random.rand(30)),
"g": 500-0.1*np.sort(np.random.rayleigh(20,size=30))**2})
fig, ax = plt.subplots()
ax2 = ax.twinx()
sns.regplot(x="x", y="f", data=df1, order=2, ax=ax)
sns.regplot(x="x", y="g", data=df2, order=2, ax=ax2)
ax2.legend(handles=[a.lines[0] for a in [ax,ax2]],
labels=["f", "g"])
plt.show()
So I implemented Power Iteration Clustering in Spark (built-in) on the dataset I have. I got the model using this:
model = PowerIterationClustering.train(similarities, 2, 10)
When I do
model.assignments().collect()
I get all the values.
Now I want to draw a scatter plot of this model using Matplotlib.
But I can't work out how to do it.
I gather that x and y in the code below are the id and cluster from the model:
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
But I don't understand how to use it. What should area and colors be?
You first need to convert each Assignment into a plain tuple, then collect. The output will be:
(<id int>, <cluster int>)
instead of
Assignment(id=..,cluster=...)
You can do this with:
model.assignments().map(lambda asm: (asm[0], asm[1])).collect()
You can then extract the x and y from the resulting list of tuples.
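For the plotting step itself, here is a minimal sketch that starts from a list of (id, cluster) tuples. The list is hypothetical sample data standing in for the collected result, and the marker size and colormap are arbitrary choices: s is just the marker area in points squared, and c can be the cluster labels so each cluster gets its own color.

```python
import matplotlib.pyplot as plt

# Hypothetical stand-in for the collected (id, cluster) tuples.
assignments = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0), (5, 1)]

# Unzip the list of pairs into two parallel tuples.
ids, clusters = zip(*assignments)

fig, ax = plt.subplots()
# s: constant marker area; c: color each point by its cluster label.
ax.scatter(ids, clusters, s=40, c=clusters, cmap="viridis", alpha=0.5)
ax.set_xlabel("id")
ax.set_ylabel("cluster")
# plt.show()  # uncomment to display the figure
```

Since the model only returns an id and a cluster label per point, a more informative picture usually plots the original features of each id on the axes and uses the cluster label only for the color.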