How to plot svm hyperplane with only one feature - python

I have a dataset with one feature and I'm using scikit-learn to train a support vector classifier. I'd like to visualize the results, but I'm a little perplexed about how to plot the scatter. I'm getting my hyperplane by doing the following:
slope = clf.coef_[0][0]
intercept = clf.intercept_[0]
Which gives me y = -.01x + 2.5
I'm assuming this is my hyperplane. I can't seem to figure out how to plot my data around this with only one feature. What would I use for my y-axis?

It's an interesting problem. On the surface it's very simple — one feature means one dimension, hence the hyperplane has to be 0-dimensional, i.e. a point. Yet what scikit-learn gives you is a line. So the question is really how to turn this line into a point.
I've spent about an hour looking for answers in the documentation for scikit-learn, but there is simply nothing on 1-d SVM classifiers (probably because they are not practical). So I decided to play with the sample code below to see if I can figure out the answer:
import numpy as np
from sklearn import svm
n_samples = 100
X = np.concatenate([np.random.normal(0,0.1,n_samples), np.random.normal(10,0.1,n_samples)]).reshape(-1,1)
y = np.array([0]*n_samples+[1]*n_samples)
clf = svm.LinearSVC(max_iter = 10000)
clf.fit(X,y)
slope = clf.coef_
intercept = clf.intercept_
print(slope, intercept)
print(-intercept/slope)
X is the array of samples such that the first 100 points are sampled from N(0,0.1), and the next 100 points are sampled from N(10,0.1). y is the array of labels (100 of class '0' and 100 of class '1'). Intuitively it's clear the hyperplane should be halfway between 0 and 10.
Once you fit the classifier, you find the intercept is about -0.96, which is nowhere near where the 0-d hyperplane (i.e. a point) should be. However, if you set y = 0 and back-calculate x, it comes out pretty close to 5. Now try changing the means of the distributions that make up X, and you will find that the answer is always -intercept/slope. That's your 0-d hyperplane (point) for the classifier.
So to visualize, you merely need to plot your data on a number line (use different colours for the classes), and then plot the boundary obtained by dividing the negative intercept by the slope. I'm not sure how to plot a number line, but you can always resort to a scatter plot with all y coordinates set to 0.
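To make that concrete, here is a minimal plotting sketch (assuming the clf, X and y from the sample code above, with matplotlib available):
import matplotlib.pyplot as plt

# the 0-d "hyperplane" is the decision-boundary point on the number line
boundary = -clf.intercept_[0] / clf.coef_[0][0]

# plot the samples on a number line (all y coordinates set to 0), coloured by class,
# and mark the boundary with a vertical dashed line
plt.scatter(X.ravel(), np.zeros_like(X.ravel()), c=y, cmap='bwr', s=20)
plt.axvline(boundary, color='k', linestyle='--', label='boundary = -intercept/slope')
plt.yticks([])
plt.legend()
plt.show()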

Related

Why does the test-vs-predicted y plot for my regression model look like this?

I am working on a regression model (decision tree) on multidimensional data with 16 features. The model's r2_score is 0.97, yet the plot of the test y against the predicted y looks completely wrong: even the range of x is not the same.
Would you please tell me what the problem is?
I have also tried fitting the model on a single feature to check the x range in the diagram, but that obviously just lowers the score, and the diagram is still odd!
Matplotlib's plot function draws a single line by connecting the points in the order they are given. The reason you are seeing a mess is that the points are not ordered along the x-axis.
In a regression model you have a function f(x) -> R, where f is your decision tree and x lives in a 16-dimensional space. You cannot order x, which has 16 dimensions, along the x-axis.
Instead, what you can do is plot the ground truth and predicted values for each index as a scatter plot:
import numpy as np
import matplotlib.pyplot as plt

# Here, I'm assuming y_DT_5 is either a 1D array or a column vector.
# If necessary, change the argument of np.arange accordingly to get the number of values
idxs = np.arange(len(y_DT_5))
plt.figure(figsize=(16, 4))
plt.scatter(x=idxs, y=y_DT_5, marker='x')  # plot each predicted value as a separate point
plt.scatter(x=idxs, y=y_test, marker='.')  # plot each ground-truth value as a separate point
If your model works, the 2 points plotted at each index should be close along the y-axis.
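Another quick check (my own suggestion, not part of the answer above) is a parity plot of predicted against true values; if the model is good, the points hug the diagonal:
# parity plot: predicted vs. true values (assumes y_test and y_DT_5 are NumPy arrays)
plt.figure(figsize=(5, 5))
plt.scatter(y_test, y_DT_5, marker='.')
lims = [min(y_test.min(), y_DT_5.min()), max(y_test.max(), y_DT_5.max())]
plt.plot(lims, lims, 'k--')  # perfect-prediction line
plt.xlabel('y_test (true)')
plt.ylabel('y_DT_5 (predicted)')
plt.show()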

I want to fit a single curve to my dataset of 5 points, but it does not seem to work with sklearn PolynomialFeatures

I want to have an input (x) of 5 points and an output (y) of the same size. After that, I should fit a curved line to the dataset. Finally, I should use matplotlib to draw the curved line and the points, in order to show a non-linear regression.
I want to fit a single curve to my dataset of 5 points, but it does not seem to work. It is simple, but I'm new to sklearn. Do you know what is wrong with my code?
Here is the code:
#here is the dataset of 5 points
x=np.random.normal(size=5)
y=2.2*x-1.1
y=y+np.random.normal(scale=3,size=y.shape)
x=x.reshape(-1,1)
# I use the PolynomialFeatures module because I want extra dimensions
preproc=PolynomialFeatures(degree=4)
x_poly=preproc.fit_transform(x)
# in this part I want to make 100 points to feed to the polynomial, so that afterwards I can draw a curve
x_line=np.linspace(-2,2,100)
x_line=x_line.reshape(-1,1)
# at this point I made y_hat in order to have the predicted y values
poly_line=PolynomialFeatures(degree=4)
x_feats=poly_line.fit_transform(x_line)
y_hat=LinearRegression().fit(x_feats,y).predict(x_feats)
plt.plot(y_hat,y_line,"r")
plt.plot(x,y,"b.")
First of all, you have a linear regression problem. As joostblack and Arya commented, your equation is y = 2.2x - 1.1, which is linear. Why do you need polynomial features?
Anyway, if you need to do this because you have been asked to, here is code that can work:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=5)
y = 2.2*x - 1.1
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 2, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
As we commented, it is "silly" to fit a linear problem with a degree-4 polynomial, because the solution will always be essentially a linear regression. It can be useful if you have a different relation, such as y = x**3 + x - 2 (which is not linear, as you can see):
np.random.seed(0)
x = np.random.normal(size=5)
y = x**3 + x - 2
mymodel = np.poly1d(np.polyfit(x, y, 4))
myline = np.linspace(-2, 3, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Two final comments. First, you have to understand the difference between linear regression and polynomial regression, and when each is useful. Second, I used numpy, not sklearn, to solve your problem; it is simpler for this case, so be aware of that.
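For reference, here is a sketch of how the original sklearn approach could be made to work: the model has to be fitted on the transformed training points (x_poly) and only then used to predict on the transformed line (x_feats); the plot call also had its arguments mixed up:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.random.normal(size=5)
y = 2.2*x - 1.1 + np.random.normal(scale=3, size=5)
x = x.reshape(-1, 1)

preproc = PolynomialFeatures(degree=4)
x_poly = preproc.fit_transform(x)          # transform the 5 training points

model = LinearRegression().fit(x_poly, y)  # fit on the training points, not on the line

x_line = np.linspace(-2, 2, 100).reshape(-1, 1)
x_feats = preproc.transform(x_line)        # reuse the same transformer for the line
y_hat = model.predict(x_feats)

plt.plot(x_line, y_hat, "r")               # (x, y) order, not (y, x)
plt.plot(x, y, "b.")
plt.show()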

How to fit multiple curves to a single scatter plot of data?

I have data from distinct curves and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves belongs to the family A/x + B. At the moment I cut out each of the curves by hand and curve-fit them, but I would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning but didn't know where to start or which packages to use. I am using Python, but can also use C++; in fact I hope to port it to C++ by the end. Where do you think I should start? Is it worth using unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that best fit that curve and leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs given each input. This allows you to count how many curves are associated with a specific X-value (which could be a time unit). In other words, you need external information to separate the points locally. You can then reorder the points at each X in increasing Y-value, and now you have your separate curves defined at discrete points.
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
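As a rough illustration of that lookup-table idea (a sketch under the assumption that the x values repeat across curves and each curve contributes exactly one sample per x):
import numpy as np
from collections import defaultdict

def separate_by_rank(x, y):
    """Group samples by x, then assign the k-th smallest y at each x to curve k."""
    by_x = defaultdict(list)
    for xi, yi in zip(x, y):
        by_x[xi].append(yi)
    n_curves = max(len(v) for v in by_x.values())
    curves = [[] for _ in range(n_curves)]
    for xi in sorted(by_x):
        for k, yi in enumerate(sorted(by_x[xi])):
            curves[k].append((xi, yi))
    # each element: (n, 2) array of (x, y) pairs for one curve, transposed to (2, n)
    return [np.array(c).T for c in curves]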
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)
##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b

parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea:
Create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations, then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical, using these two equations I derived for the model y = A/(x + B) (the same form used in the code below):
A = (y1*y2*(x1 - x2)) / (y2 - y1)
B = A/y1 - x1
where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2*N points. If these assumptions are false, you can still manually classify the first 2 points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points to each point's closest approximation, where "closest" means having the smallest error.
Edit: A stronger approach for the initial approximation could be to calculate A and B for several pairs of points and use their mean A and B as the approximation, possibly even running K-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:
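As a follow-up sketch (not part of the original answer), each separated cluster could then be fitted individually with scipy's curve_fit, reusing the variable names from the code above:
from scipy.optimize import curve_fit

def model(x, A, B):
    # same family as above: y = A / (x + B)
    return A / (x + B)

fitted_params = []
for c in clusters:
    # c has shape (2, n): row 0 holds the X coordinates, row 1 the Y coordinates
    params, _ = curve_fit(model, c[0], c[1], p0=[1.0, 1.0])
    fitted_params.append(params)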

Python - Kriging (Gaussian Process) in scikit_learn

I am considering using this method to interpolate some 3D points I have. As an input I have atmospheric concentrations of a gas at various elevations over an area. The data I have appears as values every few feet of vertical elevation for several tens of feet, but horizontally separated by many hundreds of feet (so 'columns' of tightly packed values).
The assumption is that values vary in the vertical direction significantly more than in the horizontal direction at any given point in time.
I want to perform 3D kriging with that assumption accounted for (as a parameter I can adjust or that is statistically defined - either/or).
I believe the scikit learn module can do this. If it can, my question is how do I create a discrete cell output? That is, output into a 3D grid of data with dimensions of, say, 50 x 50 x 1 feet. Ideally, I would like an output of [x_location, y_location, value] with separation of those (or similar) distances.
Unfortunately I don't have a lot of time to play around with it, so I'm just hoping to figure out if this is possible in Python before delving into it. Thanks!
Yes, you can definitely do that in scikit_learn.
In fact, it is a basic feature of kriging/Gaussian process regression that you can use anisotropic covariance kernels.
As stated in the manual (cited below), you can either set the parameters of the covariance yourself or estimate them, and you can choose to have all parameters equal or all different.
theta0 : double array_like, optional
An array with shape (n_features, ) or (1, ). The parameters in the
autocorrelation model. If thetaL and thetaU are also specified, theta0
is considered as the starting point for the maximum likelihood
estimation of the best set of parameters. Default assumes isotropic
autocorrelation model with theta0 = 1e-1.
In the 2d case, something like this should work:
import numpy as np
from sklearn.gaussian_process import GaussianProcess

x = np.arange(1, 51)
y = np.arange(1, 51)
X, Y = np.meshgrid(x, y)                   # prediction grid

points = np.column_stack([obs_x, obs_y])   # coordinates of your observations
values = obs_data                          # replace with your observed data

gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1., nugget=0.001)
gp.fit(points, values)

XY_pairs = np.column_stack([X.flatten(), Y.flatten()])
predicted = gp.predict(XY_pairs).reshape(X.shape)
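If you are on a newer scikit-learn version where GaussianProcess has been replaced by GaussianProcessRegressor, anisotropy is expressed through a per-dimension length_scale in the kernel. A rough sketch for the 3D case (obs_xyz and obs_data are placeholders for your observed coordinates and values; the length scales and grid spacings are only illustrative):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# obs_xyz: (n_samples, 3) array of [x, y, z] coordinates; obs_data: (n_samples,) values.
# Anisotropic RBF: long horizontal length scales, short vertical one (initial guesses,
# refined during fitting via maximum likelihood).
kernel = RBF(length_scale=[500.0, 500.0, 5.0]) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(obs_xyz, obs_data)

# predict on a regular 3D grid, e.g. 50 x 50 ft horizontally and 1 ft vertically
xs = np.arange(0, 1000, 50)
ys = np.arange(0, 1000, 50)
zs = np.arange(0, 30, 1)
Xg, Yg, Zg = np.meshgrid(xs, ys, zs, indexing='ij')
grid = np.column_stack([Xg.ravel(), Yg.ravel(), Zg.ravel()])
predicted = gp.predict(grid).reshape(Xg.shape)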

Finding error range for peak point using polynomial fitting

I have data for a spectral line which forms a noisy U-shaped curve.
I want to fit a curve and find the x, y values of the minimum point.
I fitted a polynomial to it using polyfit, then found the minimum point on the fitted curve.
NB: The original curve is not symmetric (the left side is slightly steeper than the right), therefore the minimum of the original is slightly to the left of the minimum of the fitted curve.
How do I find the X and Y errors for this point?
Here are the bones of my code:
import pylab, numpy
x = [... linear list of floats ...]
y = [... list of floats ...]  # produces a noisy U-shaped curve
fit = numpy.polyfit(x, y, 3)    # fit a third-order polynomial
fit2 = numpy.polyval(fit, x)
miny = # min y value on fitted curve
minx = # corresponding x value, not the actual min(x)
pylab.plot(x, y, 'k-')
pylab.plot(x, fit2, 'r-')
pylab.plot(minx, miny, 'ro')
pylab.show()
Now that I have the original [x, y], the fitted curve [x, fit2] and the minimum point on the fitted curve [minx, miny], how do I find the error range for this single point?
Thanks.
Since numpy 1.7, polyfit has the option cov=True, which returns the covariance matrix as an additional output. From this, using Gaussian error propagation, you can get the error of the minimum. But what kind of spectrum is it? Very often there are model shapes to fit, so that there is no need for a polynomial fit.
You might also want to look at scipy.optimize.curve_fit
PS: What makes you think that the true value is to the left of your fitted value? That would be the case if your fit function were symmetric and applied to an asymmetric peak. A third-order polynomial, however, should be able to accommodate asymmetry.
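To make the cov=True suggestion concrete, here is a rough sketch (my own, not part of the answer) that propagates the coefficient covariance to the minimum by sampling plausible coefficient sets and recording where each sampled cubic has its minimum; x and y are assumed to be the arrays from the question:
import numpy as np

coeffs, cov = np.polyfit(x, y, 3, cov=True)

# Monte Carlo propagation: draw coefficient sets consistent with the fit covariance
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(coeffs, cov, size=2000)

xgrid = np.linspace(min(x), max(x), 2000)
min_x, min_y = [], []
for c in samples:
    yy = np.polyval(c, xgrid)
    i = np.argmin(yy)
    min_x.append(xgrid[i])
    min_y.append(yy[i])

# the spread of the sampled minima gives approximate 1-sigma errors on (minx, miny)
print('x_min = %.4g +/- %.2g' % (np.mean(min_x), np.std(min_x)))
print('y_min = %.4g +/- %.2g' % (np.mean(min_y), np.std(min_y)))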
