I've managed to plot the decision boundary of a support vector machine in 2D and 3D. Now, I'd like to plot its normal vector as well, but in a way that works not only in 2D / 3D but also in higher-dimensional spaces. At the moment, I'm simply calculating the normal vector by computing its slope via m1 * m2 = -1.
Going deeper into the mathematics behind SVMs, I've found out that there's the w-vector which is perpendicular to the decision boundary. I'm using the LinearSVC implementation of sklearn to train the classifier. As far as I know, the w-vector is given by the coef_[0] attribute, but plotting this vector doesn't give the result I was expecting.
Is there a general way to compute the normal vector of a SVM decision boundary, which not only works in 2D / 3D but also in high-dimensional spaces?
What I'm trying to achieve is to navigate inside an n-dimensional space gradually from one class to another. Since it's not possible to visualize a high-dimensional space, I'd like to validate everything first in 2D/3D to gain a better understanding.
I have a data set of labeled fashion item images. First, I extracted 2048-dimensional feature vectors using a CNN (ResNet-50). Then, I perform PCA to reduce the dimensionality of the vectors. Before that, I performed some data cleaning and filtering.
num_feature_dimensions = 2 # Set the number of embedding dimensions
pca = PCA(n_components = num_feature_dimensions)
embs_compressed = pca.fit_transform(df_embs_filtered)
df_embs_filtered_compressed = pd.DataFrame(embs_compressed)
df_embs_filtered_compressed
After that, I train the SVM with the uncompressed embeddings as X and the season feature as y (binary problem, either winter or summer).
X = df_embs_filtered
y = df_filtered["season"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svm_clf = LinearSVC(C=1, max_iter=100000)
svm_clf.fit(X_scaled, y)
The last step is to visualize the embedding space (in case of 2D / 3D) with the decision boundary and an orthogonal axis. It should be possible for a user to navigate over that orthogonal axis to go from one class to the other. So, I'm creating a marker for the user and utilizing an ipywidgets FloatSlider which updates the position. Then, depending on the user's position, it'll show the image of the nearest neighbor embedding.
This is the whole code for creating the scatter plot, computing the decision boundary and its orthogonal axis, and the FloatSlider for 2D. I left out some snippets which I think aren't relevant to the question.
from ipywidgets import AppLayout, FloatSlider
from matplotlib.offsetbox import (AnnotationBbox, OffsetImage, TextArea)
plt.ioff()
fig, ax = plt.subplots(figsize=(15,7))
fig.canvas.header_visible = False
fig.canvas.layout.min_height = '400px'
# Create Scatterplot of filtered dataset colored by season feature
sns.scatterplot(x="x", y="y",
                hue="season",
                data=df_filtered,
                legend="full",
                alpha=0.8)
# Computes the decision boundary of a trained classifier
db_xx, db_yy = calc_svm_decision_boundary(svm_clf, -35, 35)
# Rotate the decision boundary 90° to get perpendicular axis
neg_yy = np.negative(db_yy)
neg_slope = -1 / -svm_clf.coef_[0][0]
bias = svm_clf.intercept_[0] / svm_clf.coef_[0][1]
ortho_db_yy = neg_slope * db_xx - bias
# Plot the axes
plt.plot(db_xx, db_yy, "k-", linewidth=1)
plt.plot(db_xx, ortho_db_yy, "r-", linewidth=1)
#plt.plot(neg_yy, db_xx, "g-", linewidth=2)
# Choose a random starting position and initialize user marker on that position
rand_idx = random.choice(range(len(db_xx)))
x = db_xx[rand_idx]
y = ortho_db_yy[rand_idx]
user_marker, user_positon = create_user_marker(x, y)
# Compute the nearest neighbour and annotate it with its respective image
nearest_neighbour, nearest_neighbour_pos = get_nearest_neighbour(user_positon, df_filtered)
annotate_nearest_neighbour(nearest_neighbour, nearest_neighbour_pos, ax, df_filtered)
plt.title('Nearest Embedding: {} with season: {}, pos: {}'.format(nearest_neighbour, df_filtered.loc[df_filtered['id'] == nearest_neighbour].season.values[0], user_positon))
# Create Slider to interact with the plot
slider = FloatSlider(
    orientation="horizontal",
    description="x-Position:",
    value=user_positon[0],
    min=min(db_xx),
    max=max(db_xx)
)
slider.layout.margin = '0px 30% 0px 30%'
slider.layout.width = '25%'
slider.observe(update_user_position_2D, names='value')
AppLayout(
    center=fig.canvas,
    footer=slider,
    pane_heights=[0, 6, 1]
)
def calc_svm_decision_boundary(svm_clf, xmin, xmax):
"""Compute the decision boundary"""
w = svm_clf.coef_[0]
b = svm_clf.intercept_[0]
xx = np.linspace(xmin, xmax, 200)
yy = -w[0]/w[1] * xx - b/w[1]
return xx, yy
This results in the following plot:
This approach to computing the orthogonal axis works in 2D, but I'm looking for a general approach that works regardless of the dimensionality of the space. As you can see in the plot, there is no well-fitting separating hyperplane for the season feature in this low-dimensional space. My hypothesis is that there is a hyperplane in a higher-dimensional space which separates the classes well.
Now, I thought that I could use the w-vector, which is perpendicular to the decision boundary, to compute an orthogonal axis in any space. Is that possible, or do I have an error in reasoning?
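To make the idea concrete, this is roughly the kind of navigation I have in mind in n dimensions (just a sketch on my side, reusing svm_clf and X_scaled from the code above; the normalization by the length of w is my own addition):
import numpy as np

# Sketch: use w as the normal of the separating hyperplane and move along it.
# Assumes svm_clf (fitted LinearSVC) and X_scaled from the code above.
w = svm_clf.coef_[0]                    # normal vector of the hyperplane, shape (n_features,)
b = svm_clf.intercept_[0]
w_unit = w / np.linalg.norm(w)          # unit normal, valid in any number of dimensions

start = X_scaled[0]                     # an arbitrary starting point in the scaled feature space
t = 2.5                                 # slider value: signed step size along the normal
new_position = start + t * w_unit       # moving towards (or away from) the other class

# signed distance of the new position to the decision boundary
signed_distance = (new_position @ w + b) / np.linalg.norm(w)
The nearest-neighbour lookup would then be done with new_position instead of the 2D slider position.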
I'm missing a part of my dataset: the position of the tennis ball in the video for each frame. The missing part is where the player hits the ball and it goes up and comes back down to the second player, following a curved path.
I have created the curve using a polynomial regression method, as shown in the image.
The curve is fitted to the ten points before the missing data and the ten points after it.
Now, how can I generate a sequence of points (the missing data) from the curve that I have created, using Python?
The missing data points:
([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181[missing data]908,906,901,900,898,893,888,883,878,879])
([221,216,213,212,209,205,200,195,195,195[missing data]212,222,235,235,249,263,276,292,303,303])
This is the Code that I use to create the curve:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1189,1188, 1186,1187,1184,1183,1182,1181,1181,1181,908,906,901,900,898,893,888,883,878,879])
y = np.array([221,216,213,212,209,205,200,195,195,195,212,222,235,235,249,263,276,292,303,303])
model = np.poly1d(np.polyfit(x,y,3))
line = np.linspace(np.min(x), np.max(x), num=100)
plt.scatter(x, y)
plt.plot(line, model(line))
plt.show()
Your model was obtained using np.polyfit:
fitted_parameters = np.polyfit(x,y,3)
You can use np.polyval to make a prediction:
x = 1050
prediction = np.polyval(fitted_parameters, x)
# The prediction value for x = 1050 is y = 8.64
So it is just a matter of using np.linspace to obtain an evenly distributed set of x values across the gap and np.polyval to obtain the corresponding missing y values.
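For example, a minimal sketch of that (the gap boundaries 1181 and 908 and the number of generated points are assumptions based on the data shown in the question):
import numpy as np

x = np.array([1189, 1188, 1186, 1187, 1184, 1183, 1182, 1181, 1181, 1181,
              908, 906, 901, 900, 898, 893, 888, 883, 878, 879])
y = np.array([221, 216, 213, 212, 209, 205, 200, 195, 195, 195,
              212, 222, 235, 235, 249, 263, 276, 292, 303, 303])

fitted_parameters = np.polyfit(x, y, 3)

# generate x values inside the gap (between the last known x on each side)
missing_x = np.linspace(1181, 908, num=50)                 # number of points is arbitrary here
missing_y = np.polyval(fitted_parameters, missing_x)
missing_points = list(zip(missing_x, missing_y))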
I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves belongs to the family A/x+B. At the moment I cut out each of the curves by hand and curve fit them, but I would like to automate this process and have the computer separate these curves and fit them. I attempted to use machine learning, but didn't know where to start or what packages to use. I am using Python, but can also use C++; in fact I hope to port it to C++ in the end. Where do you think I should start? Is it worth using unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data
Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N-values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that most fit that curve and then leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit to, if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-axis is discrete, this means you can create a lookup table of outputs for each input. This allows you to count how many curves are associated with a specific X-value (which could be a time unit). In other words, you need external information to separate points locally. You can then reorder the points at each X-value in increasing Y-value, and you have your separate curves defined as discrete points.
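Here is a minimal sketch of that lookup-table idea (my own illustration, not the poster's code; it assumes every curve has exactly one sample at every discrete X-value and that the curves never cross):
import random
from collections import defaultdict

# hypothetical mixed data: three A/x + B curves sampled on the same discrete x grid
xs = list(range(1, 11))
true_params = [(5.0, 0.1), (20.0, 0.5), (60.0, 1.0)]           # (A, B) pairs
points = [(x, a / x + b) for a, b in true_params for x in xs]
random.shuffle(points)                                         # mix everything together

# lookup table: all y values observed at each discrete x
table = defaultdict(list)
for x, y in points:
    table[x].append(y)

# at each x, sort the y values; the i-th smallest y is assigned to curve i
n_curves = len(table[xs[0]])
separated = [[] for _ in range(n_curves)]
for x in sorted(table):
    for i, y in enumerate(sorted(table[x])):
        separated[i].append((x, y))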
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.
I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
    if not line: # allow for blank lines in data file
        continue
    spl = line.split()
    x = float(spl[0])
    y = float(spl[1])
    if y > previousY + 50.0: # this separator must be greater than max noise
        whichList += 1
    previousY = y
    xLists[whichList].append(x)
    yLists[whichList].append(y)
##########################################################
# curve fitting section
def func(x, a, b):
    return a / x + b
parameterLists = []
for curveIndex in range(len(xLists)):
    # these are the same as the scipy defaults
    initialParameters = numpy.array([1.0, 1.0])
    xData = numpy.array(xLists[curveIndex], dtype=float)
    yData = numpy.array(yLists[curveIndex], dtype=float)
    # curve fit the test data
    fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
    parameterLists.append(fittedParameters)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    for curveIndex in range(len(xLists)):
        # first the raw data as a scatter plot
        axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
        # create data for each fitted equation plot
        xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
        yModel = func(xModel, *parameterLists[curveIndex])
        # now the model as a line plot
        axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
The idea:
create N naive, easy-to-calculate, sufficiently precise (for clustering) approximations, then "classify" each data point to the closest such approximation.
This is done like this:
The approximations are analytical, using these two equations I derived for the model y = A/(x + B):
A = y1*y2*(x1 - x2) / (y2 - y1)
B = A/y1 - x1
where (x1, y1) and (x2, y2) are the coordinates of two points on the curve.
To get these two points I assumed that (1) the first points (according to the x-axis) are distributed equally between the different real curves, and (2) the first 2 points of each real curve are either all smaller or all bigger than the first 2 points of every other real curve. Thus sorting them and dividing them into N groups will successfully cluster the first 2*N points. If these assumptions don't hold, you can still manually classify the first 2 points of each real curve, and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster the rest of the points by assigning each point to its closest approximation, where "closest" means the one with the smallest error.
Edit: A stronger approach for the initial approximation could be to calculate A and B for several pairs of points and use their mean A and B as the approximation, and perhaps even to run K-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
    f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)

# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
    raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))

# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
    clusters.append(np.array(list(zip(*raw_clusters[i]))))
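For completeness, the separated clusters can then be plotted like this (a quick sketch of my own, reusing the clusters list and plt from the code above):
# quick check: scatter each recovered cluster in its own color
for i in range(len(clusters)):
    plt.scatter(clusters[i][0], clusters[i][1], s=10, label='curve %d' % i)
plt.legend()
plt.show()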
Example:
raw series:
separated series:
I'm trying to write a particle mesh N-body simulation. In such a simulation the potential field is found by solving Poisson's equation using Fourier transforms. I have been following a presentation by Andrey Kravtsov (http://astro.uchicago.edu/~andrey/talks/PM/pm.pdf), but slide 15 has me confused. So far, I have assigned densities to a 3d grid from particle positions, and Fourier transformed the density grid. The next step is to calculate Green's function in Fourier space and multiply it with the Fourier transformed density grid, and afterwards applying an inverse Fourier transform to real space to obtain the potential grid. Through trial and error I traced the part that wasn't working correctly to the potential calculation, and specifically the k-space vector.
So, to calculate Green's function in Fourier space I need the Fourier axes, usually called the k-space vectors k_x, k_y, k_z. According to the slide they should be 2*pi*(k,l,m)/N_g for integer components k,l,m, where N_g is the number of grid cells. So far I've tried these components running from 0, +1, +2, ..., N_g, from -N_particle/2, ..., +N_particle/2, and several other variations. The only thing that has produced reasonable results (I can see a cluster in a density slice projected on the corresponding potential field slice) has been using numpy.fft.fftfreq in Python with specific values of the resolution/sample spacing. However, any resolution I chose (such as L/N_g, N_p/N_g, 2*pi/N_g, etc.) did not scale properly with the box size L, the number of grid cells or the number of particles, and no longer worked for e.g. a larger number of grid cells.
My question is:
How do I define my k-space vectors (i.e. the Fourier axes in reciprocal space) for a simulation with, along one direction, box size L, number of grid cells N_g and number of particles N_p?
I should add that the particle positions and velocities are all in code units as defined in the first few slides.
Minimum working example:
#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
M = 30 #Number of particles in 1 direction
Mn = 90 #Number of grid cells in 1 direction
Lx = 10 #grid physical size
u = np.random.random(M*M*M)
v = np.random.random(M*M*M)
w = np.random.random(M*M*M)
#Have purposefully taken smaller cube, to show potential works
planex = M*u
planey = M*v
planez = M*w
#Create a new grid
grid = np.zeros([Mn,Mn,Mn], dtype='cfloat')
#cell center coordinates
x_c = np.floor(planex).astype(int)%Mn
y_c = np.floor(planey).astype(int)%Mn
z_c = np.floor(planez).astype(int)%Mn
#in terms of the average density of the universe, doesnt matter for the
#example
mass = 1.
#Update the grid
grid[z_c,y_c,x_c] += mass
fig = plt.figure()
ax = fig.add_subplot(111)
plt.imshow(grid[:,:,2].real)
plt.show()
#FFT the grid
grid = np.fft.fftn(grid)
#The resolution and the k-space vectors are the parts I am unsure about
resolution = np.pi*2/(M/Mn)
resolution = Lx/Mn
#Define the k-space vectors
k_x = np.fft.fftfreq(Mn, resolution)
k_y = np.fft.fftfreq(Mn, resolution)
k_z = np.fft.fftfreq(Mn, resolution)
kz, ky, kx = np.meshgrid(k_z, k_y, k_x)
Omega_0 = 0.27
a = 0.3
#Calculate Greens function
k_squared = np.sin(kz/2)**2 + np.sin(ky/2)**2 + np.sin(kx/2)**2
Greens = -3*Omega_0/8/a*np.divide(1, k_squared, where=k_squared!=0)
#Multiply the grids in Fourier space
grid = Greens*grid
#IFFT to real space
potentials = np.fft.ifftn(grid)
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
plt.imshow(potentials[:,:,0].real)
plt.show()
A large value for the resolution makes the velocities explode, while a small value gives very small velocities. So what is the right resolution?
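To make the convention I mean explicit, this is how I would write the slide's 2*pi*(k,l,m)/N_g directly (only a sketch of that reading, reusing Mn and Lx from the example above):
# Sketch of the slide's convention: k-components equal to 2*pi*(l, m, n)/N_g,
# where l, m, n are the integer FFT frequencies returned by fftfreq with unit spacing.
k_1d = 2 * np.pi * np.fft.fftfreq(Mn)            # dimensionless, roughly in [-pi, pi)
kz, ky, kx = np.meshgrid(k_1d, k_1d, k_1d, indexing='ij')  # 'ij' keeps the axis order of the grid

# for physical wavenumbers one would instead use the cell size Lx/Mn as the sample spacing:
# k_phys = 2 * np.pi * np.fft.fftfreq(Mn, d=Lx/Mn)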
This is my first time asking on Stack overflow, please let me know if I'm doing something wrong.
Best, R.
Is there a function in Python that plots the Bayes decision boundary if we input a function to it? I know there is one in MATLAB, but I'm searching for one in Python. I know that one way to achieve this is to iterate over the points, but I am searching for a built-in function.
I have bivariate sample points on the axis, and I want to plot the decision boundary in order to classify them.
Going off the guess of Chris in the comments above, I'm assuming you want to cluster points according to a Gaussian Mixture model - a reasonable method if the underlying distribution is a linear combination of Gaussian distributed samples. Below I've shown an example using numpy to create a sample data set, sklearn for its Gaussian Mixture modeling, and pylab to show the results.
import numpy as np
from pylab import *
from sklearn import mixture
# Create some sample data
def G(mu, cov, pts):
    return np.random.multivariate_normal(mu, cov, pts)
# Three multivariate Gaussians with means and cov listed below
MU = [[5,3], [0,0], [-2,3]]
COV = [[[4,2],[0,1]], [[1,0],[0,1]], [[1,2],[2,1]]]
A = [G(mu,cov,500) for mu,cov in zip(MU,COV)]
PTS = np.concatenate(A) # Join them together
# Use a Gaussian Mixture model to fit
g = mixture.GaussianMixture(n_components=len(A))
g.fit(PTS)
# Returns an index list of which cluster they belong to
C = g.predict(PTS)
# Plot the original points
X,Y = map(array, zip(*PTS))
subplot(211)
scatter(X,Y)
# Plot the points and color according to the cluster
subplot(212)
color_mask = ['k','b','g']
for n in range(len(A)):
    idx = (C == n)
    scatter(X[idx], Y[idx], color=color_mask[n])
show()
See the sklearn.mixture example page for more detailed information on the classification methods.
Are there any algorithms that will return the equation of a straight line from a set of 3D data points? I can find plenty of sources which will give the equation of a line from 2D data sets, but none in 3D.
Thanks.
If you are trying to predict one value from the other two, then you should use lstsq with the a argument as your independent variables (plus a column of 1's to estimate an intercept) and b as your dependent variable.
If, on the other hand, you just want to get the best fitting line to the data, i.e. the line which, if you projected the data onto it, would minimize the squared distance between the real point and its projection, then what you want is the first principal component.
One way to define it is the line whose direction vector is the eigenvector of the covariance matrix corresponding to the largest eigenvalue, that passes through the mean of your data. That said, eig(cov(data)) is a really bad way to calculate it, since it does a lot of needless computation and copying and is potentially less accurate than using svd. See below:
import numpy as np
# Generate some data that lies along a line
x = np.mgrid[-2:5:120j]
y = np.mgrid[1:9:120j]
z = np.mgrid[-5:3:120j]
data = np.concatenate((x[:, np.newaxis],
                       y[:, np.newaxis],
                       z[:, np.newaxis]),
                      axis=1)
# Perturb with some Gaussian noise
data += np.random.normal(size=data.shape) * 0.4
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# I use -7, 7 since the spread of the data is roughly 14
# and we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-7:7:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
Here's what it looks like:
If your data is fairly well behaved, then it should be sufficient to minimize the least-squares sum of the component distances. You can fit a linear regression of z against x, and then again of z against y.
Following the documentation example:
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x,y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
If you want to minimize the actual orthogonal distances from the line to the points in 3-space (which I'm not sure is even referred to as linear regression), then I would build a function that computes the RSS and use a scipy.optimize minimization function to solve it.
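For example (a rough sketch only; parametrizing the line by a point p and a direction d is just one possible choice):
import numpy as np
from scipy.optimize import minimize

def orthogonal_rss(params, points):
    """Sum of squared orthogonal distances from the points to the line (p, d)."""
    p, d = params[:3], params[3:]
    d = d / np.linalg.norm(d)                  # unit direction of the line
    diffs = points - p                         # vectors from p to each point
    proj = diffs @ d                           # scalar projections onto the line
    residuals = diffs - np.outer(proj, d)      # components orthogonal to the line
    return np.sum(residuals ** 2)

# example data: a noisy 3D point cloud (placeholder)
pts = np.add.accumulate(np.random.random((10, 3)))

# initial guess: the centroid and an arbitrary direction
x0 = np.concatenate([pts.mean(axis=0), np.array([1.0, 1.0, 1.0])])
result = minimize(orthogonal_rss, x0, args=(pts,))
p_fit = result.x[:3]
d_fit = result.x[3:] / np.linalg.norm(result.x[3:])
This should give essentially the same direction as the SVD-based answer above, since both minimize the orthogonal distances.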