Plotting data at the end of each iteration - Python

I'm working with a dataframe of two columns (X, y) and 40 rows. In a loop, I have fitted curves of different degrees to this data. To visualize the process, I need to plot the data and the fitted curve at the end of each iteration, but I haven't managed to do it. Would you please help me?
P.S. The code is below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

error_list = []
bias_list = []
variance_list = []
degrees = [1, 3, 7, 11, 16, 20]
for degree in degrees:
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    lin_regressor = LinearRegression()
    polynomial_regression = Pipeline([("poly_features", polybig_features),
                                      ("lin_regressor", lin_regressor)])
    polynomial_regression.fit(X, y)
    y_predict = polynomial_regression.predict(X)
    error = mean_squared_error(y, y_predict)
    error_list.append(error)
    bias_list.append(abs(error - np.var(y_predict)))
    variance_list.append(np.var(y_predict))
    plt.plot(X, y_predict)
The result is not what I want: all the curves end up on the same axes. I want each fitted curve to be shown in its own plot.
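A minimal sketch of one way to do this (assuming X is a 2-D array with a single column, as the pipeline above requires): inside the loop, open a fresh figure and sort the x values, since plot() connects points in the order they are given. For example, replace the final plt.plot(X, y_predict) with:
    x_flat = np.asarray(X).ravel()
    order = np.argsort(x_flat)                 # plot() connects points in order
    plt.figure()                               # a fresh figure per iteration
    plt.scatter(x_flat, y, s=15, label="data")
    plt.plot(x_flat[order], y_predict[order], "r-",
             label="degree {}".format(degree))
    plt.legend()
    plt.show()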


I cannot understand why the test and predicted y plot for my regression model looks like that

I am working on a regression model (decision tree) on multidimensional data with 16 features. The model's r2_score is 0.97, yet the plot of y test against y predict looks completely wrong: the ranges along x do not even match.
Would you please tell me what the problem is?
I have also tried fitting the model on a single dimension to check the x range in the diagram, but that just lowers the score noticeably, and the diagram is still odd!
Matplotlib's plot function draws a single line by connecting the points in the order they are given. The reason you are seeing a mess is that the points are not ordered along the x-axis.
In a regression model you have a function f(x) -> R, where f is your decision tree and x lives in the 16-dimensional feature space. You cannot order your 16-dimensional x along a single x-axis.
Instead, what you can do is plot the ground truth and predicted values for each index as a scatter plot:
import numpy as np
import matplotlib.pyplot as plt

# Here, I'm assuming y_DT_5 is either a 1D array or a column vector.
# If necessary, change the argument of np.arange accordingly to get the number of values
idxs = np.arange(len(y_DT_5))
plt.figure(figsize=(16, 4))
plt.scatter(x=idxs, y=y_DT_5, marker='x')  # plot each predicted value as a separate point
plt.scatter(x=idxs, y=y_test, marker='.')  # plot each ground-truth value as a separate point
If your model works, the 2 points plotted at each index should be close along the y-axis.
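A common alternative view (a hedged sketch using the same y_test and y_DT_5 arrays) is a predicted-vs-actual scatter; for a good model the points hug the y = x diagonal:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_DT_5, marker='.')
lo = min(np.min(y_test), np.min(y_DT_5))
hi = max(np.max(y_test), np.max(y_DT_5))
plt.plot([lo, hi], [lo, hi], 'r--')  # y = x reference: perfect predictions fall here
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.show()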

How can we generate a sequence of numbers from a polynomial regression curve?

Part of my dataset is missing: the position of the tennis ball in the video for each frame. The missing part is where one player hits the ball and it goes up and comes down to the second player, tracing a curved path.
I have created the curve using polynomial regression, as shown in the image.
The curve is fitted to the ten points before the missing data and the ten points after.
Now, how can I generate a sequence of points (the missing data) from the curve I have created, using Python?
The missing data points:
([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181[missing data]908,906,901,900,898,893,888,883,878,879])
([221,216,213,212,209,205,200,195,195,195[missing data]212,222,235,235,249,263,276,292,303,303])
This is the code I use to create the curve:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181,908,906,901,900,898,893,888,883,878,879])
y = np.array([221,216,213,212,209,205,200,195,195,195,212,222,235,235,249,263,276,292,303,303])
model = np.poly1d(np.polyfit(x,y,3))
line = np.linspace(np.min(x), np.max(x), num=100)
plt.scatter(x, y)
plt.plot(line, model(line))
plt.show()
Your model was obtained using np.polyfit:
fitted_parameters = np.polyfit(x,y,3)
You can use np.polyval to make a prediction:
x = 1050
prediction = np.polyval(fitted_parameters, x)
# The prediction value for x = 1050 is y = 8.64
So it is just a matter of using np.linspace to obtain an evenly spaced set of x values across the gap, and np.polyval to obtain the corresponding (missing) y values of the curve.
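For instance, a minimal sketch, assuming the gap runs from the last known point before it (x = 1181) to the first known point after it (x = 908), as in the arrays above:
import numpy as np

# x, y are the known points from the question
x = np.array([1189,1188,1186,1187,1184,1183,1182,1181,1181,1181,908,906,901,900,898,893,888,883,878,879])
y = np.array([221,216,213,212,209,205,200,195,195,195,212,222,235,235,249,263,276,292,303,303])

fitted_parameters = np.polyfit(x, y, 3)

# Evenly spaced x values across the gap (x = 1181 down to x = 908),
# then evaluate the fitted cubic there to recover the missing ys.
missing_x = np.linspace(1181, 908, num=30)
missing_y = np.polyval(fitted_parameters, missing_x)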

I want to fit a single curve to my dataset of 5 points, but it does not seem to work with sklearn's PolynomialFeatures

I want an input (x) of 5 points and an output (y) of the same size. After that, I should fit a curved line to the dataset. Finally, I should use matplotlib to draw the curved line and the points, in order to show a non-linear regression.
I want to fit a single curve to my dataset of 5 points, but it does not seem to work. It is simple, but I'm new to sklearn. Do you know what is wrong with my code?
Here is the code:
# here is the dataset of 5 points
x = np.random.normal(size=5)
y = 2.2*x - 1.1
y = y + np.random.normal(scale=3, size=y.shape)
x = x.reshape(-1, 1)
# I use the PolynomialFeatures module because I want extra dimensions
preproc = PolynomialFeatures(degree=4)
x_poly = preproc.fit_transform(x)
# in this part I want to make 100 points to feed to the polynomial, so that I can draw a curve
x_line = np.linspace(-2, 2, 100)
x_line = x_line.reshape(-1, 1)
# at this point I made y_hat in order to have the predicted y values
poly_line = PolynomialFeatures(degree=4)
x_feats = poly_line.fit_transform(x_line)
y_hat = LinearRegression().fit(x_feats, y).predict(x_feats)
plt.plot(y_hat, y_line, "r")
plt.plot(x, y, "b.")
First of all, you have a linear regression problem. As joostblack and Arya commented, your equation is y = 2.2x - 1.1, which is linear. Why do you need polynomial features?
Anyway, if you need to do this task because you have been asked to, here is code that works:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=5)
y = 2.2*x - 1.1
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 2, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
As we said in the comments, it is "silly" to fit a linear problem with a degree-4 polynomial, because the solution will always reduce to the linear regression. It can be useful if you have a relation like y = x**3 + x - 2 instead (which, as you can see, is not linear):
np.random.seed(0)
x = np.random.normal(size=5)
y = x**3 + x - 2
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 3, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Two final comments. First, you have to understand the difference between linear regression and polynomial regression, and when each is useful. Second, I used numpy, not sklearn, to solve your problem; it is simpler for this case, so be aware of that.
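For completeness, here is a hedged sketch of the sklearn version the question was aiming for. The key fixes (under the same setup as the question) are fitting the regression on the polynomial features of the 5 training points rather than of the 100 line points, reusing the fitted transformer for the line, and plotting x_line against y_hat:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x = np.random.normal(size=5).reshape(-1, 1)
y = 2.2*x.ravel() - 1.1 + np.random.normal(scale=3, size=5)

preproc = PolynomialFeatures(degree=4)
x_poly = preproc.fit_transform(x)            # features of the 5 training points
model = LinearRegression().fit(x_poly, y)    # fit on the training data

x_line = np.linspace(-2, 2, 100).reshape(-1, 1)
y_hat = model.predict(preproc.transform(x_line))  # reuse the same transformer

plt.plot(x_line, y_hat, "r")
plt.plot(x, y, "b.")
plt.show()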

How to fit multiple curves to a single scatter plot of data?

I have data from several distinct curves, and I want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate it.
I know that each of the individual curves belongs to the family A/x + B. At the moment I cut out each curve by hand and curve-fit it, but I would like to automate this process and have the computer separate the curves and fit them. I attempted to use machine learning but didn't know where to start or which packages to use. I am using Python, but I can also use C++; in fact I hope to port it to C++ in the end. Where do you think I should start? Is it worth using unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values considerably larger than the rest. I would simply take the first N values with the largest Y-axis values, fit them to an exponential decay curve (or the other curve family you mention), then take the points that best fit that curve and leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want, which falsifies the information and is very bad.
Your best bet is to fit a single curve to all the points, if you cannot isolate them into separate curves with external information.
But...
We do know some information: a valid function must produce only one output for a given input.
If the X-axis is discrete, this means you can build a lookup table of outputs for each input. That lets you count how many curves are associated with a specific X-value (which could be a time unit); in other words, you need external information to separate the points locally. You can then reorder the points at each X in increasing Y-value, and you have your separate curves defined at discrete points, as sketched below.
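Here is a minimal sketch of that lookup-table idea, assuming a hypothetical samples list of (x, y) pairs and exactly one sample from each curve at every discrete x value (i.e. the curves never cross):
from collections import defaultdict

samples = [(1, 10.0), (1, 3.0), (2, 8.0), (2, 2.5)]  # hypothetical (x, y) data

# Lookup table: each discrete x maps to all y values observed there.
lookup = defaultdict(list)
for x, y in samples:
    lookup[x].append(y)

# Reorder the ys at each x; the i-th smallest y is assigned to curve i.
curves = defaultdict(list)
for x in sorted(lookup):
    for i, y in enumerate(sorted(lookup[x])):
        curves[i].append((x, y))

print(dict(curves))  # {0: [(1, 3.0), (2, 2.5)], 1: [(1, 10.0), (2, 8.0)]}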
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements under the assumption that the (X, Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using something like unum numbers, you might be able to keep enough information in the decimals that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation just to get enough accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()

xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0  # initialize
whichList = -1  # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line:  # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0:  # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)

##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b

parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label
    plt.show()
    plt.close('all')  # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea: create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations, then "classify" each data point to the closest such approximation.
This is done as follows:
The approximations are analytical, using two equations I derived for the model y = A/(x + B):

A = y1*y2*(x1 - x2) / (y2 - y1)
B = A/y1 - x1

where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally among the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2N points. If these assumptions are false, you can still manually classify the first 2 points of each real curve, and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points to each point's closest approximation, where "closest" means the one with the smallest error.
Edit: a stronger approach for the initial approximation could be to calculate A and B for a couple of pairs of points and use their mean A and B as the approximation, and possibly even run k-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt

# You should probably edit this variable
NUM_OF_CURVES = 4

# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')

# clustering of the first 2*NUM_OF_CURVES points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES, -1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES, -1).T

# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]

# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)

# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:, i]-data[i]).argmin()].append((x_of_data[i], data[i]))

# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
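Once the clusters are separated, each one can be fitted individually; for example, a hedged sketch using scipy.optimize.curve_fit with the same A/(x + B) model:
from scipy.optimize import curve_fit

def func(x, a, b):
    return a / (x + b)

# Fit each separated cluster on its own; clusters[i] has X in row 0, Y in row 1.
for c in clusters:
    params, _ = curve_fit(func, c[0], c[1])
    print(params)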
Example:
raw series:
separated series:

Strange plot after linear regression using Numpy's least squares

I am doing linear regression with multiple variables. To get the thetas (coefficients) I used Numpy's least-squares tool, numpy.linalg.lstsq. My data has n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for that feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
Getting theta coefficients with numpy.linalg.lstsq:
thetas = np.linalg.lstsq(X, y)[0]
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is the index of the Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Traning Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
Output plot:
I expected to get a single regression line that fits the data, but instead I got this strange polyline (broken line). What am I doing wrong? The scatter works right, but the plot does not. I pass two arguments to the plot function:
1) Testing area data (100 area examples)
2) Price predictions for those 100 examples, which include the area data
Update: after sorting x I got this plot, with a curve:
I was expecting a straight line fitting all my data with least-squared error, but I got a curve instead. Aren't linear regression and the numpy.linalg.lstsq tool supposed to return a straight fitted line rather than a curve?
Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction will also (linearly) depend on those features.
If you redo your training with X = data.iloc[:,120] (only considering the area feature) you should receive a straight line when you plot the results.
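A hedged sketch of that refit, assuming the same column layout as the question (column 0 is the inserted 'Ones' intercept and column 120 is the area):
# Refit using only the intercept column and the area feature.
X_area = data.iloc[:, [0, 120]].values
thetas_area = np.linalg.lstsq(X_area, y, rcond=None)[0]

# Sort by area so the prediction is drawn as one straight segment.
order = np.argsort(X_area[:, 1])
line_x = X_area[order, 1]
line_y = X_area[order] @ thetas_area

fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(X_area[:, 1], y, color='r', label='Training Data')
ax.plot(line_x, line_y, 'b', label='Prediction')
ax.legend(loc=2)
plt.show()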
