Part of my dataset is missing: the position of the tennis ball in the video for each frame. The missing part is where one player hits the ball and it goes up and comes back down to the second player, following a curved path.
I have created the curve using polynomial regression, as shown in the image.
The curve is fitted to the ten points before the gap and the ten points after it.
Now, how can I generate the sequence of missing points from the curve I have created, using Python?
The missing data points:
([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181[missing data]908,906,901,900,898,893,888,883,878,879])
([221,216,213,212,209,205,200,195,195,195[missing data]212,222,235,235,249,263,276,292,303,303])
This is the Code that I use to create the curve:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181,908,906,901,900,898,893,888,883,878,879])
y = np.array([221,216,213,212,209,205,200,195,195,195,212,222,235,235,249,263,276,292,303,303])
model = np.poly1d(np.polyfit(x,y,3))
line = np.linspace(np.min(x), np.max(x), num=100)
plt.scatter(x, y)
plt.plot(line, model(line))
plt.show()
Your model was obtained using np.polyfit:
fitted_parameters = np.polyfit(x,y,3)
You can use np.polyval to make a prediction:
x = 1050
prediction = np.polyval(fitted_parameters, x)
# The prediction value for x = 1050 is y = 8.64
So it is just a matter of using np.linspace to obtain an evenly spaced set of x values inside the gap, and np.polyval to obtain the corresponding y values that are missing.
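For example, a minimal sketch of that, reusing the arrays from the question; the number of missing frames is not given, so num_missing below is a placeholder:
import numpy as np

# The known ball positions around the gap, copied from the question
x_known = np.array([1189,1188,1186,1187,1184,1183,1182,1181,1181,1181,
                    908,906,901,900,898,893,888,883,878,879])
y_known = np.array([221,216,213,212,209,205,200,195,195,195,
                    212,222,235,235,249,263,276,292,303,303])
fitted_parameters = np.polyfit(x_known, y_known, 3)

# Assumption: the number of missing frames is unknown, so pick a count
num_missing = 20
# Evenly spaced x values inside the gap (1181 is the last x before the gap,
# 908 the first x after it); the known endpoints are excluded
missing_x = np.linspace(1181, 908, num=num_missing + 2)[1:-1]
missing_y = np.polyval(fitted_parameters, missing_x)

missing_points = list(zip(np.round(missing_x).astype(int),
                          np.round(missing_y).astype(int)))
print(missing_points)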
I've managed to plot the decision boundary of a support vector machine in 2D and 3D. Now, I'd like to plot the normal vector of it as well, but in a way that works not only in 2D / 3D but also in higher-dimensional spaces. At the moment, I'm simply calculating the normal vector by computing the slope of it with m1 * m2 = -1.
Going deeper into the mathematics behind SVMs, I've found out that there's the w-vector which is perpendicular to the decision boundary. I'm using the LinearSVC implementation of sklearn to train the classifier. As far as I know, the w-vector is given by the coef_[0] attribute, but plotting this vector doesn't give the result I was expecting.
Is there a general way to compute the normal vector of a SVM decision boundary, which not only works in 2D / 3D but also in high-dimensional spaces?
What I'm trying to achieve is to navigate inside an n-dimensional space gradually from one class to another. Since it's not possible to visualize a high-dimensional space, I'd like to validate everything first in 2D/3D to gain a better understanding.
I have a data set of labeled fashion item images. First, I extracted 2048-dimensional feature vectors using a CNN (ResNet-50). Then I perform PCA to reduce the dimensionality of the vectors. Before that, I performed some data cleaning and filtering.
import pandas as pd
from sklearn.decomposition import PCA

num_feature_dimensions = 2 # Set the number of embedding dimensions
pca = PCA(n_components = num_feature_dimensions)
embs_compressed = pca.fit_transform(df_embs_filtered)
df_embs_filtered_compressed = pd.DataFrame(embs_compressed)
df_embs_filtered_compressed
After that, I train the SVM with the uncompressed embeddings as X and the season feature as y (binary problem, either winter or summer).
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = df_embs_filtered
y = df_filtered["season"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svm_clf = LinearSVC(C=1, max_iter=100000)
svm_clf.fit(X_scaled, y)
The last step is to visualize the embedding space (in case of 2D / 3D) with the decision boundary and an orthogonal axis. It should be possible for a user to navigate over that orthogonal axis to go from one class to the other. So, I'm creating a marker for the user and utilizing an ipywidgets FloatSlider which updates the position. Then, depending on the user's position, it'll show the image of the nearest neighbor embedding.
This is the whole code for creating the scatter plot, computing the decision boundary and its orthogonal axis, and the FloatSlider for 2D. I left out some snippets which I think aren't relevant to the question.
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import AppLayout, FloatSlider
from matplotlib.offsetbox import (AnnotationBbox, OffsetImage, TextArea)
plt.ioff()
fig, ax = plt.subplots(figsize=(15,7))
fig.canvas.header_visible = False
fig.canvas.layout.min_height = '400px'
# Create Scatterplot of filtered dataset colored by season feature
sns.scatterplot(x="x", y="y",
                hue="season",
                data=df_filtered,
                legend="full",
                alpha=0.8)
# Computes the decision boundary of a trained classifier
db_xx, db_yy = calc_svm_decision_boundary(svm_clf, -35, 35)
# Rotate the decision boundary 90° to get perpendicular axis
neg_yy = np.negative(db_yy)
neg_slope = -1 / -svm_clf.coef_[0][0]
bias = svm_clf.intercept_[0] / svm_clf.coef_[0][1]
ortho_db_yy = neg_slope * db_xx - bias
# Plot the axes
plt.plot(db_xx, db_yy, "k-", linewidth=1)
plt.plot(db_xx, ortho_db_yy, "r-", linewidth=1)
#plt.plot(neg_yy, db_xx, "g-", linewidth=2)
# Choose a random starting position and initialize user marker on that position
rand_idx = random.choice(range(len(db_xx)))
x = db_xx[rand_idx]
y = ortho_db_yy[rand_idx]
user_marker, user_positon = create_user_marker(x, y)
# Compute the nearest neighbour and annotate it with its respective image
nearest_neighbour, nearest_neighbour_pos = get_nearest_neighbour(user_positon, df_filtered)
annotate_nearest_neighbour(nearest_neighbour, nearest_neighbour_pos, ax, df_filtered)
plt.title('Nearest Embedding: {} with season: {}, pos: {}'.format(nearest_neighbour, df_filtered.loc[df_filtered['id'] == nearest_neighbour].season.values[0], user_positon))
# Create Slider to interact with the plot
slider = FloatSlider(
    orientation="horizontal",
    description="x-Position:",
    value=user_positon[0],
    min=min(db_xx),
    max=max(db_xx)
)
slider.layout.margin = '0px 30% 0px 30%'
slider.layout.width = '25%'
slider.observe(update_user_position_2D, names='value')
AppLayout(
    center=fig.canvas,
    footer=slider,
    pane_heights=[0, 6, 1]
)
def calc_svm_decision_boundary(svm_clf, xmin, xmax):
    """Compute the decision boundary"""
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]
    xx = np.linspace(xmin, xmax, 200)
    yy = -w[0]/w[1] * xx - b/w[1]
    return xx, yy
This results in the following plot:
This approach to computing the orthogonal axis works in 2D, but I'm looking for a general approach that works regardless of the dimensionality of the space. As you can see in the plot, for the season feature there's no well-fitting separating hyperplane in this low-dimensional space. My hypothesis is that there's a hyperplane in a higher-dimensional space which separates the classes well.
Now, I thought that I could use the w-vector, which is perpendicular to the decision boundary, to compute an orthogonal axis in any space. Is that possible, or do I have an error in my reasoning?
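For reference, a hedged sketch of the idea in that last paragraph: for a linear SVM, coef_[0] is the normal vector w of the hyperplane w·x + b = 0 in the (scaled) feature space, so it can serve as the navigation axis in any number of dimensions. The helper names below are illustrative, not part of the question's code:
import numpy as np

w = svm_clf.coef_[0]            # normal vector of the separating hyperplane
b = svm_clf.intercept_[0]
w_unit = w / np.linalg.norm(w)  # unit direction orthogonal to the boundary

def signed_distance(point):
    """Signed distance of an n-dimensional point from the decision boundary."""
    return (w @ point + b) / np.linalg.norm(w)

def navigate(start, t):
    """Move t units from `start` along the axis orthogonal to the boundary."""
    return start + t * w_unit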
I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves is from the family A/x + B. As of now I cut out each of the curves by hand and curve fit them, but I would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning but didn't know where to start or what packages to use. I am using Python, but can also use C++; in fact, I hope to port it to C++ by the end. Where do you think I should start? Is it worth it to use unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest. I would simply take the N points with the largest Y-axis values and fit them to an exponential decay curve (or the other curve you mention). You can then take the points that best fit that curve and leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs for each input. This allows you to count how many curves are associated with each specific X-value (which could be a time unit). In other words, you have to have external information to separate points locally. You can then reorder the points in increasing Y-value, and now you have your separate curves defined at discrete points, as sketched below.
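A hedged sketch of that lookup-table idea, assuming the X values are discrete and every curve contributes exactly one point per X value (the function name and the (x, y)-pair layout are my own):
import numpy as np
from collections import defaultdict

def separate_by_rank(points):
    """points: iterable of (x, y) pairs. Returns one (n, 2) array per curve."""
    by_x = defaultdict(list)
    for x, y in points:
        by_x[x].append(y)
    n_curves = max(len(ys) for ys in by_x.values())
    curves = [[] for _ in range(n_curves)]
    for x in sorted(by_x):
        # the k-th smallest y at this x goes to curve k
        for rank, y in enumerate(sorted(by_x[x])):
            curves[rank].append((x, y))
    return [np.array(c) for c in curves]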
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)

##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b

parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea:
Create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations, then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical, using these two equations I derived for the model y = A/(x + B):
A = (y1*y2*(x2 - x1)) / (y1 - y2)
B = A/y1 - x1
where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points, I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first two points of each real curve are all smaller or all bigger than the first two points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2*N points. If these assumptions are false, you can still manually classify the first two points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points by assigning each point to its closest approximation, where "closest" means with the smallest error.
Edit: A stronger approach for the initial approximation could be to calculate A and B for a couple of pairs of points and use their mean A and B as the approximation, and possibly even to do K-means on these points/approximations.
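A hedged sketch of that edit (the helper name is mine; xs and ys are assumed to be the first few points already assigned to one curve, and the formulas are the same ones used in the code below):
import numpy as np

def mean_AB(xs, ys):
    """Estimate A and B of y = A/(x + B) from consecutive point pairs and average."""
    As, Bs = [], []
    for i in range(len(xs) - 1):
        x1, x2, y1, y2 = xs[i], xs[i + 1], ys[i], ys[i + 1]
        A = (y1 * y2 * (x2 - x1)) / (y1 - y2)
        B = A / y1 - x1
        As.append(A)
        Bs.append(B)
    return np.mean(As), np.mean(Bs)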
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:
I am doing linear regression with multiple variables. To get thetas (coefficients) I used Numpy's least-squares numpy.linalg.lstsq tool. In my data I have n = 143 features and m = 13000 training examples. I want to plot house prices against area and show fitting line for this feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
Getting theta coefficients with numpy.linalg.lstsq:
thetas = np.linalg.lstsq(X, y)[0]
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is index of Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Traning Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
Output plot:
I expected to get a single regression line that fits the data, but instead I got this strange polyline (broken line). What am I doing wrong? The scatter works correctly, but the plot does not. To the plot function I pass two arguments:
1) Testing area data (100 area data examples)
2) Predictions of price based on 100 training examples that include area data
Update:
After sorting x I got this plot with curve:
I was expecting to get a straight line fitting all my data with least-squared error, but instead got a curve. Aren't linear regression and the numpy.linalg.lstsq tool supposed to return a straight fitted line instead of a curve?
Your result is linear in a 143 dimensional space. ;) Since your X contains many more features than just the area the prediction will also (linearly) depend on those features.
If you redo your training with X = data.iloc[:,120] (only considering the area feature) you should receive a straight line when you plot the results.
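A minimal sketch of that suggestion, reusing data and y from the question and keeping a column of ones for the intercept; the column index 120 for Area is carried over from the question, and sorting by area just keeps the line plot from zig-zagging:
import numpy as np
import matplotlib.pyplot as plt

area = data.iloc[:, 120].values                       # the Area feature only
price = y.values.ravel()
X_area = np.column_stack([np.ones(len(area)), area])  # intercept + area
theta, *_ = np.linalg.lstsq(X_area, price, rcond=None)

order = np.argsort(area)                              # draw the line left to right
plt.scatter(area, price, color='r', label='Training Data')
plt.plot(area[order], X_area.dot(theta)[order], 'b', label='Prediction')
plt.legend(loc=2)
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()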
I am working on a project in which I am trying to model the movement of an object in a kymograph. In order to do so, I fit a curve to each line of pixels in an image, and append the location of the vertex to approximately model the location of the object in the image. Below is a sample image.
As you can see, early in the time series (at the top of the image) the position of the object is nicely focused and easily modeled with a Gaussian curve. However, closer to the end of the time series (at the bottom of the image), the peak is much more diffuse. I suspect that the data at the bottom of the image will be fit much more closely by a curve modeling a Poisson distribution (image below, right) while the data at the top/middle of the image will be fit much more closely by a Gaussian or polynomial curve (image below, left).
Is there any way to, for each line of pixels, fit more than one curve to the same data and then score each for a least-squares fit? This way, I could (hopefully) switch models midway through an image to accommodate changing behaviors of the object that I am trying to track. My current code is below:
from PIL import Image
def populateData(picture) :
    """Open an image and populate a list of lists with the grayscale value"""
    im = Image.open(picture)
    size = im.size
    width = size[0]
    height = size[1]
    allPixels = list(im.getdata())
    pixelList = [allPixels[width*i :
                           width * (i+1)] for i in range(height)]
    return(pixelList)
rawData = populateData("testTop.tif")
import numpy as np
from scipy.optimize import curve_fit
def findVertex(listOfRows) :
    xList = []
    for row in listOfRows :
        x = np.arange(len(row))
        ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
        p, _ = curve_fit(ffunc, x, row, p0=[100,5,2])
        x0 = p[1]
        xList.append(x0)
    xArray = np.array(xList)
    return(xArray)

xValues = findVertex(rawData)

def buildRows(listOfRows) :
    yArray = np.arange(len(listOfRows))
    return(yArray)
yValues = buildRows(rawData)
from matplotlib import pyplot as plt
from scipy import ndimage
image = ndimage.imread("testTop.tif",flatten=True)
fig = plt.figure()
axes = fig.add_subplot(111)
axes.imshow(image)
axes.plot(xValues, yValues, 'k-')
axes.set_title('testLine')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('time')
plt.show()
EDIT:
This is the file I used as an input (testTop.tif)
You will need to work out some form of goodness of fit between the fit and your data, for example the sum of the squared differences between your current fit (a Gaussian) and your data, divided by the variance.
sumerrsq = 0.
for i in range(yValues.shape[0]):
    sumerrsq += np.power(yValues[i] - xValues[i],2)
goodfit = np.sqrt(sumerrsq/var)
I think you can use the second output from curve_fit (the covariance) to get the variance:
p, pcov = curve_fit(ffunc, x, row, p0=[100,5,2])
var = np.diag(pcov)
You can then check the value of goodfit and if it is not sufficient, switch to a different distribution. In using a different distribution, you may need to use a different estimation of error (this assumes the errors are normally distributed).
Note, without the data (and not being sure what array was which) I couldn't test any of this code...
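To your question about fitting more than one curve to the same row and scoring each, here is a hedged sketch: fit two candidate models to a pixel row and keep the one with the smaller residual sum of squares. The Lorentzian is only a stand-in second model, not the Poisson-like curve you mention, and the names are illustrative:
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, x0, s):
    return a * np.exp(-0.5 * (x - x0)**2 / s**2)

def lorentzian(x, a, x0, g):
    # stand-in second model; swap in whatever distribution suits your data
    return a * g**2 / ((x - x0)**2 + g**2)

def best_fit_center(row):
    """Fit both models to one pixel row and return the centre of the better one."""
    row = np.asarray(row, dtype=float)
    x = np.arange(len(row))
    best_rss, best_center = np.inf, None
    for model, p0 in [(gaussian, [100, 5, 2]), (lorentzian, [100, 5, 2])]:
        try:
            p, _ = curve_fit(model, x, row, p0=p0)
        except RuntimeError:   # fit failed to converge
            continue
        rss = np.sum((row - model(x, *p))**2)
        if rss < best_rss:
            best_rss, best_center = rss, p[1]
    return best_center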
According to the curve_fit docs:
To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
So if that's the value you're trying to compare, then you could take that second returned value from curve_fit (the one you are currently assigning to _), use it to calculate perr as above, and compare the error between multiple curves.
I would suggest you work with a 2D fit model. A 1d Gaussian distribution is the basis but the mean and variance depend on position and time. You then would fit the model against the 2d image data.
In case you want to stay with your approach, it looks like it's just the starting value for mean and variance which you need to tweak in order to get a better fit for the lines with large times.
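For example, a hedged sketch of data-driven starting values in place of the fixed p0=[100, 5, 2] from the question (the one-tenth-of-width initial sigma is just a guess):
import numpy as np
from scipy.optimize import curve_fit

def fit_row(row):
    """Gaussian fit of one pixel row with starting values taken from the data."""
    row = np.asarray(row, dtype=float)
    x = np.arange(len(row))
    ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
    p0 = [row.max(),                    # amplitude from the brightest pixel
          float(np.argmax(row)),        # mean near the brightest pixel
          max(2.0, len(row) / 10.0)]    # wider initial sigma for diffuse rows
    return curve_fit(ffunc, x, row, p0=p0)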
To your question, you can model any score function you want, so you could do something like:
def score(x,y):
    if x < 10:
        return x**2 - y
    else:
        return x - y
So in order to work with two different models in different ranges, follow this example.
Are there any algorithms that will return the equation of a straight line from a set of 3D data points? I can find plenty of sources which will give the equation of a line from 2D data sets, but none in 3D.
Thanks.
If you are trying to predict one value from the other two, then you should use lstsq with the a argument as your independent variables (plus a column of 1's to estimate an intercept) and b as your dependent variable.
If, on the other hand, you just want to get the best fitting line to the data, i.e. the line which, if you projected the data onto it, would minimize the squared distance between the real point and its projection, then what you want is the first principal component.
One way to define it is the line whose direction vector is the eigenvector of the covariance matrix corresponding to the largest eigenvalue, that passes through the mean of your data. That said, eig(cov(data)) is a really bad way to calculate it, since it does a lot of needless computation and copying and is potentially less accurate than using svd. See below:
import numpy as np
# Generate some data that lies along a line
x = np.mgrid[-2:5:120j]
y = np.mgrid[1:9:120j]
z = np.mgrid[-5:3:120j]
data = np.concatenate((x[:, np.newaxis],
y[:, np.newaxis],
z[:, np.newaxis]),
axis=1)
# Perturb with some Gaussian noise
data += np.random.normal(size=data.shape) * 0.4
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# I use -7, 7 since the spread of the data is roughly 14
# and we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-7:7:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
Here's what it looks like:
If your data is fairly well behaved, then it should be sufficient to find the least-squares sum of the component distances. You can fit a linear regression of z against x (independently of y), and then again of z against y (independently of x).
Following the documentation example:
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x,y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
If you want to minimize the actual orthogonal distances from the line to the points in 3-space (which I'm not sure is even referred to as linear regression), then I would build a function that computes the RSS and use a scipy.optimize minimization function to solve it.
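A hedged sketch of that last idea, reusing pts from the code above and parameterizing the line as a point p plus a direction d (the function and variable names are mine):
import numpy as np
from scipy.optimize import minimize

def orthogonal_rss(params, pts):
    """Sum of squared orthogonal distances from the line (p, d) to the points."""
    p, d = params[:3], params[3:]
    d = d / np.linalg.norm(d)            # keep the direction a unit vector
    diffs = pts - p
    # squared orthogonal distance = |diff|^2 - (diff . d)^2
    return np.sum(np.sum(diffs**2, axis=1) - (diffs @ d)**2)

# initial guess: mean of the cloud plus the rough direction from first to last point
d0 = pts[-1] - pts[0]
x0 = np.concatenate([pts.mean(axis=0), d0 / np.linalg.norm(d0)])
res = minimize(orthogonal_rss, x0, args=(pts,))
point, direction = res.x[:3], res.x[3:] / np.linalg.norm(res.x[3:])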