I am doing linear regression with multiple variables. To get the thetas (coefficients) I used NumPy's least-squares solver numpy.linalg.lstsq. In my data I have n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for this feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
Getting theta coefficients with numpy.linalg.lstsq:
thetas = np.linalg.lstsq(X, y)[0]
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is index of Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Training Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
Output plot:
I expected to get a single regression line that fits the data, but instead I got a strange polyline (broken line). What am I doing wrong? The scatter works right, but the plot does not. I pass two arguments to the plot function:
1) Testing area data (100 area data examples)
2) Predictions of price based on 100 training examples that include area data
Update:
After sorting x I got this plot, with a curve:
I was expecting to get a straight line fitting all my data with least-squares error, but instead I got a curve. Aren't linear regression and the numpy.linalg.lstsq tool supposed to return a straight fitted line instead of a curve?
Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction will also (linearly) depend on those features.
If you redo your training with X = data.iloc[:,120] (only considering the area feature) you should receive a straight line when you plot the results.
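A minimal sketch of what that area-only refit could look like, reusing the file name and column index from the question (rcond=None just silences NumPy's warning about the default rcond):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('DB2.csv', header=None, delimiter=";")   # same file as in the question
data.insert(0, 'Ones', 1)
area = data.iloc[:, 120]                      # Area column, index taken from the question
price = data.iloc[:, data.shape[1] - 1]

# design matrix with just an intercept and the Area feature
X_area = np.column_stack([np.ones(len(area)), area])
thetas, *_ = np.linalg.lstsq(X_area, price, rcond=None)

# two end points are enough to draw the fitted straight line
x_line = np.array([area.min(), area.max()])
y_line = thetas[0] + thetas[1] * x_line

fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(area, price, color='r', label='Training Data')
ax.plot(x_line, y_line, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
plt.show()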
Related
I'm working with a dataframe with two columns and 40 rows (X, y). I have fitted curves of different degrees to this data in a loop. To visualize this process, I need to plot the data and the fitted curve at the end of each iteration. I couldn't manage it. Would you please help me with it?
p.s. the code is as below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

error_list = []
bias_list = []
variance_list = []
degrees = [1, 3, 7, 11, 16, 20]

for degree in degrees:
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    lin_regressor = LinearRegression()
    polynomial_regression = Pipeline([("poly_features", polybig_features),
                                      ("lin_regressor", lin_regressor)])
    polynomial_regression.fit(X, y)
    y_predict = polynomial_regression.predict(X)
    error = mean_squared_error(y, y_predict)
    error_list.append(error)
    bias_list.append(abs(mean_squared_error(y, y_predict) - np.var(y_predict)))
    variance_list.append(np.var(y_predict))
    plt.plot(X, y_predict)
The result is not what I want: I want each fitted curve to be shown in its own separate plot.
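One common way to get a separate figure per degree, sketched under the assumption that X is a (40, 1) NumPy array and y a length-40 array: create a new figure inside the loop and plot against a sorted copy of X so the curve is drawn from left to right.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = [1, 3, 7, 11, 16, 20]
x_plot = np.sort(X, axis=0)           # sorted copy so the curve is drawn left to right

for degree in degrees:
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lin_regressor", LinearRegression()),
    ])
    model.fit(X, y)

    plt.figure()                      # a new, separate figure for this degree
    plt.scatter(X, y, color="black", label="data")
    plt.plot(x_plot, model.predict(x_plot), color="blue", label=f"degree {degree}")
    plt.legend()
    plt.show()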
I am missing part of my dataset: the position of the tennis ball in the video for each frame. The missing part is where the player hits the ball and the ball goes up and comes back down to the second player, following a curved path.
I have created the curve using the polynomial regression method, as shown in the image.
The curve is fitted to the ten points before the missing data and the ten points after.
Now, how can I generate a sequence of points (the missing data) from the curve that I have created, using Python?
The missing data points:
([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181[missing data]908,906,901,900,898,893,888,883,878,879])
([221,216,213,212,209,205,200,195,195,195[missing data]212,222,235,235,249,263,276,292,303,303])
This is the Code that I use to create the curve:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181,908,906,901,900,898,893,888,883,878,879])
y = np.array([221,216,213,212,209,205,200,195,195,195,212,222,235,235,249,263,276,292,303,303])
model = np.poly1d(np.polyfit(x,y,3))
line = np.linspace(np.min(x), np.max(x), num=100)
plt.scatter(x, y)
plt.plot(line, model(line))
plt.show()
Your model was obtained using np.polyfit:
fitted_parameters = np.polyfit(x,y,3)
You can use np.polyval to make a prediction:
x = 1050
prediction = np.polyval(fitted_parameters, x)
# The prediction value for x = 1050 is y = 8.64
So it is just a matter of using np.linspace to obtain an evenly distributed set of x values and np.polyval to obtain the curve (the y values that are missing).
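For example, a small sketch along those lines, reusing x and y from the question's code (the endpoints 1181 and 908 are the last known x before the gap and the first known x after it; the number of missing frames is a placeholder):
import numpy as np

# x and y are the twenty known points from the question's code
fitted_parameters = np.polyfit(x, y, 3)

n_missing = 20                                     # placeholder: number of missing frames
missing_x = np.linspace(1181, 908, num=n_missing)  # from last known x before the gap to first known x after it
missing_y = np.polyval(fitted_parameters, missing_x)

missing_points = list(zip(missing_x.round().astype(int), missing_y.round().astype(int)))
print(missing_points)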
My daily result data looks like the plot below.
And my daily prediction looks like the plot below on most days.
I found a mild trend: the Y-axis values of my prediction where the X-axis is between 0 and 3000 always need to be amplified much more.
How could I find a function that brings the prediction data closer to the result data? I imagine I could do it if I Fourier transformed both arrays, but I also guess there would be a simpler way to do it.
I assume what you want to get as output is a kind of 'best fit' scaling function to match your output to your prediction. One straightforward approach would be to just calculate the difference of your output and prediction and then apply a smoothing algorithm of your choice to get a scaling function. (e.g. a Savitzky Golay Filter)
Minimal example below:
import numpy as np
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
data = np.sin(x) + np.random.normal(0, 0.2, 100) # a noisy sine function
prediction = x # line with slope 1, a bad initial fit
deviation = prediction - data
fit = savgol_filter(deviation, window_length=71, polyorder=2)
plt.plot(x, prediction-fit)
plt.plot(x, data)
Play around with window_length and polyorder to find a suitable degree of smoothing for your dataset.
I am relatively new to python. I am trying to do a multivariate linear regression and plot scatter plots and the line of best fit using one feature at a time.
This is my code:
import matplotlib.pyplot as plt
from sklearn import linear_model

Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
And this is the graph that I'm getting:
I have tried searching a lot, but to no avail. I want to understand why this is not showing a line of best fit and why it is instead connecting all the points on the scatter plot.
Thank you!
Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Anything else is not possible. In your code:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)
y_pred=regr.predict(x_test)
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Use the right variables to plot the line, i.e.:
plt.plot(x_test,y_pred)
Plot the graph between the values that you use for testing and the predictions that you get from them, i.e.:
y_pred=regr.predict(x_test)
Also, your model must be trained on the same feature; otherwise you will get a straight line, but the results will be unexpected.
This is multivariate data, so you need to plot the pairwise lines:
http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix.
or change the model to use a single, linearly dependent feature, which will change the model completely:
Train=df.loc[:650]
valid=df.loc[651:]
x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]
x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()
regr=linear_model.LinearRegression()
regr.fit(x_train[['lag_7']],y_train)   # double brackets keep the single feature 2-D for sklearn
y_pred=regr.predict(x_test[['lag_7']])
plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)
plt.show()
Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given (it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.)
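As a quick illustration of that ordering effect, a sketch that reuses x_test['lag_7'] and y_pred from the question: sorting the points by their x value before calling plt.plot removes the back-and-forth segments (although, with a multivariate model, the points still will not fall on one straight line, as discussed below).
import numpy as np
import matplotlib.pyplot as plt

# sort the predicted points by their x value before drawing the line
order = np.argsort(x_test['lag_7'].to_numpy())
x_sorted = x_test['lag_7'].to_numpy()[order]
y_sorted = np.asarray(y_pred)[order]

plt.scatter(x_test['lag_7'], y_pred, color='black')
plt.plot(x_sorted, y_sorted, color='blue', linewidth=3)
plt.show()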
Concerning the more general question about how to do multivariate regression and plot the results, I have two remarks:
Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.
I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.
import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View
# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])
You did not provide your data, so I need to generate some:
func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])
Now, let us find the best-fitting line one feature at a time (1D linear regression):
linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))
linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))
linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))
As you can see, in this example, none of the one-feature linear regressions are able to very accurately predict the output.
Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.
full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())
As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.
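For readers without OpenTURNS, a rough sketch of the same actual-vs-fitted view using scikit-learn and matplotlib (assuming df has the column names from the question):
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

features = ['lag_7', 'rolling_mean', 'expanding_mean']
full = df[features + ['sales']].dropna()

model = LinearRegression().fit(full[features], full['sales'])
fitted = model.predict(full[features])

# actual vs. fitted: a good multivariate fit hugs the diagonal
plt.scatter(full['sales'], fitted, color='black')
lims = [full['sales'].min(), full['sales'].max()]
plt.plot(lims, lims, color='blue')               # y = x reference line
plt.xlabel('actual sales')
plt.ylabel('fitted sales')
plt.show()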
I am trying to fit a cosine squared to a data array from an optics interferometry intensity measurement. Unfortunately, the fit returns amplitudes and periods that are way off. Only once did I get a more reasonable fit, by selecting the first 200 data points from the array (and with some other selections). Those fit parameters were used as initial guesses to extend the fit to the entire array, which gave back a plot similar to the image.
import csv
import numpy as np
import matplotlib.pyplot as plt
import scipy as sy
from numpy import genfromtxt
from scipy.optimize import curve_fit
# reads the data from the csv file
csvfile = "/home/pi/Desktop/molecularpolOutput_No2.csv"
csv = genfromtxt ('molecularpolOutput_No2.csv', delimiter=",")
# defines the data as variables
pressure = csv[100:200,2]
intensity = csv[100:200,3]
temperature = csv[:,1]
pi = 3.14
P = pressure
# defines the function and initial fit parameters
def func(P, T, a, b, c):
    return a*np.cos((2*pi*P)/T+b)**2+c
p0 = np.array([2200, 45, 4000, 85])
# fits the function
coeffs, pcov = curve_fit(func, pressure, intensity, p0)
I = func(P, coeffs[0], coeffs[1], coeffs[2], coeffs[3])
print('period =', coeffs[0], 'Pa')
# plots the data and the function
fig = plt.figure(figsize=(10, 3), dpi=100)
plt.plot(pressure, intensity, linestyle="none", marker=".")
plt.plot(pressure, I)
plt.xlabel('Pressure (Pa)')
plt.ylabel('Relative intensity')
plt.title('interference intensity plot of Newtons rings ')
plt.show()
I would expect the fit to be correct for both a large and a small data array. However, as the figures show, extending the array messes with both the amplitude and the period. The fit that looks OK also gives values for the period comparable to other experiments. The data generated by the photoresistor is not precisely linear, but I assume this should not be a problem for curve_fit. Is there something I can change in the code to get the fit working? I already tried this code: How do I fit a sine curve to my data with pylab and numpy?
update
A least square curve fit in Matlab gives the same problem. Should I try another method to fit the curve or is it the data that causes the problem?
Matlab Code:
%% Opens excel file
filename = 'vpnat_1.xlsx';
Pr = xlsread(filename,'D1:D500');
I = xlsread(filename, 'E1:E500');
P = Pr;
% defines figure size relative to screen
scrsz = get(groot,'ScreenSize');
figure('Position',[1 scrsz(4)/2 scrsz(3)/2 scrsz(4)/4])
%% fit & plots
hold on
scatter(P,I,'.'); % scatter plot
%% defines parameter guesses
Im = mean(I);
Iu = max(I);
Il = min(I);
Ia = Iu-Il;
Ip = 2000;
Id = -4000;
a_0 = [Ia; Ip; Id; Im]; % initial guesses
fun = @(a,P) a(1).*(cos((2*pi*P)./a(2)+a(3)).^2)+a(4); % defines function
fcn = @(a) sum((fun(a,P)-I).^2); % sum of squared residuals to minimise
s = fminsearch(fcn, a_0);
plot(P,fun(s,P)) % plots fitted function
hold off
I solved the problem by using Matlab. It appears that the parameters were too poorly defined for curve_fit in Python to find a least-squares solution within its given bounds (a constraint on the number of iterations?).
Matlab appeared to accept a larger margin of error in the initial parameters and therefore found a fit for all selections of data. Using the fit parameters from Matlab as initial parameters in Python returns a proper fit. The problem in Python could be avoided by computing the parameter guesses from the data to get a better start.
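A sketch of what computing those guesses could look like in Python, mirroring the Matlab guesses above and reusing the pressure and intensity arrays from the question (the period and phase values are placeholders):
import numpy as np
from scipy.optimize import curve_fit

def func(P, T, a, b, c):
    # cosine-squared model from the question: a*cos(2*pi*P/T + b)**2 + c
    return a * np.cos((2 * np.pi * P) / T + b) ** 2 + c

# data-driven starting values, in the spirit of the Matlab guesses
a_guess = intensity.max() - intensity.min()      # amplitude from the data
c_guess = intensity.mean()                       # vertical offset from the data
T_guess = 2000.0                                 # rough period in Pa (placeholder)
b_guess = 0.0                                    # phase (placeholder)

coeffs, pcov = curve_fit(func, pressure, intensity, p0=[T_guess, a_guess, b_guess, c_guess])
print('period =', coeffs[0], 'Pa')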