4D Heatmap for PDP - python

Right now I'm working on a custom implementation of a PDP interaction for 3 features at the same time, and I have a problem with visualizing the data. Some time ago I thought it would be nice to represent a 3D heatmap of the features' interaction, where the surface shape represents the PDP interaction for 2 of the 3 features (x&y and x&z), and the heatmap texture layer represents the interaction of the last two features (z&y).
Right now I have a working (though barebones) implementation of the PDP heatmap for 2 features.
import logging

import numpy as np
import seaborn as sb


def pdp_custom_3D(estimated_model, X, y, n_splits, target_name_1, target_name_2,
                  prefit=True, X_train=None, y_train=None):
    if not prefit:
        try:
            estimated_model.fit(X_train, y_train)
        except Exception:
            logging.warning("Estimated model must have a fit method.")

    # Grid boundaries and step sizes for both features
    x_max_1 = X[target_name_1].max()
    x_min_1 = X[target_name_1].min()
    x_max_2 = X[target_name_2].max()
    x_min_2 = X[target_name_2].min()
    step_1 = abs(x_max_1 - x_min_1) / n_splits
    step_2 = abs(x_max_2 - x_min_2) / n_splits

    feature_list_1 = [x_min_1 + i * step_1 for i in range(n_splits + 1)]
    feature_list_2 = [x_min_2 + j * step_2 for j in range(n_splits + 1)]

    X_copy = X.copy()
    PDP_result = np.zeros((n_splits + 1, n_splits + 1))

    # For every grid cell, fix both features for the whole dataset
    # and average the model's predictions
    for i in range(n_splits + 1):
        X_copy[target_name_1] = feature_list_1[i]
        for j in range(n_splits + 1):
            X_copy[target_name_2] = feature_list_2[j]
            PDP_result[i][j] = estimated_model.predict(X_copy).mean()

    sb.heatmap(PDP_result)
And it gets me a reasonably nice heatmap.
Is there any way I can get a 3D heatmap with the same data structures? Three axes would be the selected features, and the fourth dimension would be the mean of the predictions the model gives back; it is represented as a heat signature here, and I want a surface as a second representation of the prediction (a minimal sketch of one possible approach follows at the end of this post).
Somewhat like this
Thank you all and have a nice day!
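For what it's worth, one way to get such a figure with matplotlib is to draw one 2-feature PDP grid as a 3D surface and pass a second grid through plot_surface's facecolors, so the surface height and the color carry two different quantities. A minimal sketch with synthetic stand-in grids (the real PDP grids computed above would take their place):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# Synthetic stand-ins for two PDP grids:
# surface height plays the role of PDP(x, y), color the role of PDP(z, y)
x = np.linspace(0, 1, 21)
y = np.linspace(0, 1, 21)
xx, yy = np.meshgrid(x, y, indexing="ij")
surface = np.sin(3 * xx) * np.cos(3 * yy)   # stand-in for PDP(x, y)
texture = np.cos(3 * xx + 3 * yy)           # stand-in for PDP(z, y)

# Map the texture grid to RGBA colors and hand it to plot_surface,
# so shape and color encode two different interactions
norm = (texture - texture.min()) / np.ptp(texture)
colors = cm.viridis(norm)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(xx, yy, surface, facecolors=colors, shade=False)

mappable = cm.ScalarMappable(cmap="viridis")
mappable.set_array(texture)
fig.colorbar(mappable, ax=ax, label="texture layer")
plt.show()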

Create 3D Streamtube plot in Plotly

Aim
I would like to create a 3D Streamtube Plot with Plotly.
Here is a cross-section of the vector field in the middle of the plot to give you an idea of what it looks like:
The final vector field should have rotational symmetry.
My Attempt
Download the data here: https://filebin.net/x6ywfuo6v4851v74
Run the code below:
Code:
import plotly.graph_objs as go
import plotly.express as px
import pandas as pd
import numpy as np
import plotly.io as pio

pio.renderers.default = 'browser'

# Import data to pandas
df = pd.read_csv("data.csv")

# Plot
X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(0, 1, 101)

# Points from which the streamtubes should originate
xpos, ypos = np.meshgrid(X[::5], Y, indexing="xy")
xpos = xpos.reshape(1, -1)[0]
ypos = ypos.reshape(1, -1)[0]

starting_points = px.scatter_3d(
    x=xpos,
    y=ypos,
    z=[-500]*len(xpos)
)
starting_points.show()

# Streamtube plot
data_plot = [go.Streamtube(
    x=df['x'],
    y=df['y'],
    z=df['z'],
    u=df['u'],
    v=df['v'],
    w=df['w'],
    starts=dict(  # Determines the streamtubes' starting positions.
        x=xpos,
        y=ypos,
        z=[-500]*len(xpos)
    ),
    # sizeref = 0.3,
    colorscale='jet',
    showscale=True,
    maxdisplayed=300  # Maximum number of segments displayed in a streamtube.
)]

fig = go.Figure(data=data_plot)
fig.show()
The initial points (starting points) of the streamtubes seem to be nicely defined:
...but the resulting 3D streamtube plot is very weird:
Edit
I tried normalizing the vector field, but the result is still not satisfactory:
import plotly.graph_objs as go
import pandas as pd
import numpy as np
import plotly.io as pio

pio.renderers.default = 'browser'

# Import data to pandas
df = pd.read_csv("data.csv")

# NORMALIZE VECTOR FIELD -> between [0,1]
df["u"] = (df["u"] - df["u"].min()) / (df["u"].max() - df["u"].min())
df["v"] = (df["v"] - df["v"].min()) / (df["v"].max() - df["v"].min())
df["w"] = (df["w"] - df["w"].min()) / (df["w"].max() - df["w"].min())

# Plot
X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(0, 1, 101)

# Points from which the streamtubes should originate
xpos, ypos = np.meshgrid(X[::5], Y, indexing="xy")
xpos = xpos.reshape(1, -1)[0]
ypos = ypos.reshape(1, -1)[0]

# Streamtube plot
data_plot = [go.Streamtube(
    x=df['x'],
    y=df['y'],
    z=df['z'],
    u=df['u'],
    v=df['v'],
    w=df['w'],
    starts=dict(  # Determines the streamtubes' starting positions.
        x=xpos,
        y=ypos,
        z=[0]*len(xpos)
    ),
    # sizeref = 0.3,
    colorscale='jet',
    showscale=True,
    maxdisplayed=300  # Maximum number of segments displayed in a streamtube.
)]

fig = go.Figure(data=data_plot)
fig.show()
Data
As for the data itself:
It is created from 10 slices (in the y-direction). For each slice (y), [u,v,w] was computed on a regular 101x101 xz mesh. The whole thing was then assembled into the dataframe you can download, which has 101x101x10 data points.
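For reference, a hypothetical sketch of how such a dataframe could be assembled from per-slice grids (same column names as the download; the field values here are toy stand-ins):

import numpy as np
import pandas as pd

X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(-500, 200, 101)

rows = []
for y in Y:                                    # one xz slice per y value
    xx, zz = np.meshgrid(X, Z, indexing="ij")  # regular 101x101 mesh
    u, v, w = np.cos(xx), 0 * xx, np.sin(zz)   # toy field components
    rows.append(pd.DataFrame({
        "x": xx.ravel(), "y": y, "z": zz.ravel(),
        "u": u.ravel(), "v": v.ravel(), "w": w.ravel(),
    }))
df = pd.concat(rows, ignore_index=True)        # 101 x 101 x 10 rows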
Edit 2
It may be that I am wrongly converting my original data (download here: https://filebin.net/tlgkz3fy1h3j6h5o) into the format suitable for plotly, so I was wondering if you know how this can be done correctly.
Here is some code that visualizes the data correctly as a 3D vector plot:
# %%
import pickle
import numpy as np
import matplotlib.pyplot as plt

# Import full data
with open("full_data.pickle", 'rb') as handle:
    full_data = pickle.load(handle)

# Axes
X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(-500, 200, 101)

# Initialize lists of all fields
DX = []
DY = []
DZ = []

for cross_section in list(full_data["cross_sections"].keys()):
    # Extract field components in x, y, and z
    dx, dy, dz = full_data["cross_sections"][cross_section]
    # Convert them to numpy immediately
    dx = np.array(dx)
    dy = np.array(dy)
    dz = np.array(dz)
    # Append
    DX.append(dx)
    DY.append(dy)
    DZ.append(dz)

# Convert to numpy
DX = np.array(DX)
DY = np.array(DY)
DZ = np.array(DZ)

# Create 3D quiver plot with color gradient
# Source: https://stackoverflow.com/questions/65254887/how-to-plot-with-matplotlib-a-3d-quiver-plot-with-color-gradient-for-length-giv
def plot_3d_quiver(x, y, z, u, v, w):
    # Compute the length (magnitude) of each vector
    c = np.sqrt(np.abs(v) ** 2 + np.abs(u) ** 2 + np.abs(w) ** 2)
    c = (c.ravel() - c.min()) / c.ptp()
    # Repeat for each body line and two head lines
    c = np.concatenate((c, np.repeat(c, 2)))
    # Colormap
    c = plt.cm.jet(c)

    fig = plt.figure(dpi=300)
    ax = fig.add_subplot(projection='3d')
    ax.quiver(x, y, z, u, v, w, colors=c, length=0.2, arrow_length_ratio=0.7)
    plt.gca().invert_zaxis()
    plt.show()

# Create mesh
xi, yi, zi = np.meshgrid(X, Y, Z, indexing='xy')

skip_every = 5
skip_slice = 2
skip3D = (slice(None, None, skip_slice), slice(None, None, skip_every),
          slice(None, None, skip_every))

# Source: https://stackoverflow.com/questions/68690442/python-plotting-3d-vector-field
plot_3d_quiver(xi[skip3D], yi[skip3D], zi[skip3D]/1000, DX[skip3D], DY[skip3D],
               np.moveaxis(DZ[skip3D], 2, 1))
As you can see, there are some long downward vectors in the middle of the 3D space, which are not shown in the plotly tubes.
Edit 3
Using the code from the answer, I get this:
This is a huge improvement. It looks almost perfect and is in accordance with what I expect.
A few more questions:
Is there a way to also show some tubes at the lower part of the plot?
Is there a way to flip the z-axis, so that the tubes come down from -z to +z (as shown in the cross-section streamline plot)?
How does the data need to be structured to be organized correctly for the plotly plot? I ask because of the use of np.moveaxis().
I have rewritten my answer to reflect the history of the conversation in a more disciplined manner.
The situation is:
len(np.unique(df['x']))
>>> 101
compared with:
len(np.unique(df['y']))
>>> 10
It seems the data in the y-direction is much coarser than in the x-direction!
But in the z-direction the situation is even worse, because the range of the data is much larger than that of x and y:
df.min()
>>> x 0.000000
y 0.000000
z -500.000000
u -0.369106
v -0.259156
w -0.517652
df.max()
>>> x 1.000000
y 1.000000
z 200.000000
u 0.368312
v 0.238271
w 1.257869
The solution to the ill-formed dataset comprises three steps:
1. Normalize the vector field and the sample points in each direction.
2. Either reduce the data density in the x and z directions or increase the density of data along the y-axis (this step is optional but generally recommended).
3. After making a plot based on the new data, change the axis ticks back to the real values (see the sketch below).
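Step 3 can be done in plotly by plotting in normalized coordinates and relabelling the ticks with the original physical values; a minimal sketch (tick counts chosen arbitrarily):

import numpy as np
import plotly.graph_objects as go

fig = go.Figure()  # ... add the Streamtube trace built on the normalized data

tick_pos = np.linspace(-0.5, 0.2, 5)   # tick positions in normalized plot units
tick_lab = np.linspace(-500, 200, 5)   # the real z values they stand for
fig.update_layout(scene=dict(zaxis=dict(
    tickmode="array",
    tickvals=tick_pos,
    ticktext=[f"{v:.0f}" for v in tick_lab],
)))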
To normalize a vector field in this situation, which is apparently an engineering one, it's important to maintain the relative lengths of the vectors at every spatial point, which can be done this way:
import numpy as np

# NORMALIZE VECTOR FIELD -> between [0,1]
# Scale each vector by its (min-max normalized) magnitude,
# so relative lengths are preserved
np_df = np.array([u, v, w])
vecf_norm = np.linalg.norm(np_df, 2, axis=0)
max_norm = np.max(vecf_norm)
min_norm = np.min(vecf_norm)
u = u * (vecf_norm - min_norm) / (max_norm - min_norm)
v = v * (vecf_norm - min_norm) / (max_norm - min_norm)
w = w * (vecf_norm - min_norm) / (max_norm - min_norm)
As you will see at the end, this formulation will be used to enhance the resulting tube-plot.
Please let me add some important details about using dimensionless data for engineering data visualisation:
First of all, if this vector field results from any sort of differential equation, it is highly recommended to reformulate your P.D.E. into a dimensionless form before attempting to solve it numerically.
If the vector field is the result of an already dimensionless differential equation, you need to plot it using dimensionless data (including the geometry and the u, v, w values).
Please consider that plotly uses the local divergence values to determine the local diameter of the tubes; when changing the vector field (and the geometry), we are changing the divergence as well.
I tried to mix your initial and second codes to get this:
import plotly.graph_objs as go
import numpy as np
import plotly.io as pio
import pickle

pio.renderers.default = 'browser'

# Import full data
with open("full_data.pickle", 'rb') as handle:
    full_data = pickle.load(handle)

# Axes
X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(-0.5, 0.2, 101)

xpos, ypos = np.meshgrid(X[::5], Y, indexing="ij")
xpos = np.ravel(xpos)
ypos = np.ravel(ypos)

# Initialize lists of all fields
DX = []
DY = []
DZ = []

for cross_section in list(full_data["cross_sections"]):
    # Extract field components in x, y, and z and convert them to numpy
    dx, dy, dz = full_data["cross_sections"][cross_section]
    DX.append(np.array(dx))
    DY.append(np.array(dy))
    DZ.append(np.array(dz))

# Convert to numpy and reorder the axes
move_i = [0, 1, 2]
move_e = [1, 2, 0]
DX = np.moveaxis(np.array(DX), move_i, move_e)
DY = np.moveaxis(np.array(DY), move_i, move_e)
DZ = np.moveaxis(np.array(DZ), move_i, move_e)

# Create mesh
xi, yi, zi = np.meshgrid(X, Y, Z, indexing="ij")

data_plot = [go.Streamtube(
    x=np.ravel(xi),
    y=np.ravel(yi),
    z=np.ravel(zi),
    u=np.ravel(DX),
    v=np.ravel(DY),
    w=np.ravel(DZ),
    starts=dict(  # Determines the streamtubes' starting positions.
        x=xpos,
        y=ypos,
        z=np.array([-0.5]*len(xpos))
    ),
    # sizeref = 0.3,
    colorscale='jet',
    showscale=True,
    maxdisplayed=300  # Maximum number of segments displayed in a streamtube.
)]

fig = go.Figure(data=data_plot)
fig.show()
In this code I have removed the skipping, because I suspect the problem originates there. The resulting plot, which you have added to your question, looks similar to the 2D plot in your question, but it needs more work to give a better result.
So, using what has been said already, in addition to the info below:
Yes, tubes start from the start points, so you need to define start points where you expect to see tubes. However, the start points need to be geometrically inside the space defined by the sample points; otherwise plotly may be forced to extrapolate the data (I'm not sure about this), which results in distorted and unexpected output. This means you can define start points on both the upper and lower planes of the field, to ensure that you have tubes emitted on both planes (see the sketch below). Sometimes the vectors are there but you cannot see them, because they are drawn too thin to see; this happens when their local divergences are too low. If you normalize this vector field by the rules mentioned earlier, it may give you a better result.
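A hypothetical sketch of seeding tubes on both planes (grid spacing as in the code above):

import numpy as np

X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
xpos, ypos = np.meshgrid(X[::5], Y, indexing="ij")
xpos, ypos = np.ravel(xpos), np.ravel(ypos)

# One set of seeds on the lower plane (z = -0.5), a second on the upper (z = 0.2)
xs = np.concatenate([xpos, xpos])
ys = np.concatenate([ypos, ypos])
zs = np.concatenate([np.full(len(xpos), -0.5), np.full(len(xpos), 0.2)])

starts = dict(x=xs, y=ys, z=zs)  # pass as the starts argument of go.Streamtube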
According to plotly documentation:
You can tell plotly's automatic axis range calculation logic to reverse the direction of an axis by setting the autorange axis property to "reversed"
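For a 3D scene, that looks something like this:

import plotly.graph_objects as go

fig = go.Figure()  # ... with the streamtube trace added
# Flip the z-axis so the tubes appear to run from -z down to +z
fig.update_layout(scene=dict(zaxis=dict(autorange="reversed")))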
plotly reads the data point by point, so the order of points doesn't really matter; but in your case, the issue happened because the data became corrupted while omitting some of the sample points, i.e. some of the x, y, z and some of the u, v, w values lost their correct locations, which resulted in an entirely different, unexpected data set.
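If the data does need thinning, the same slicing must be applied to all six arrays, so that every (x, y, z) keeps its (u, v, w); a toy sketch:

import numpy as np

# Toy regular grid and field
X, Y, Z = np.meshgrid(np.linspace(0, 1, 11),
                      np.linspace(0, 1, 11),
                      np.linspace(0, 1, 11), indexing="ij")
U, V, W = np.cos(X), np.sin(Y), np.full_like(Z, 0.1)

# One slice tuple applied to all six arrays keeps positions and vector
# components paired, and the result is still a regular grid
keep = (slice(None, None, 2),) * 3
x, y, z, u, v, w = (A[keep].ravel() for A in (X, Y, Z, U, V, W))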
I have tried to normalize the (u,v,w) vector field (using the formulation provided earlier):
import plotly.graph_objs as go
import numpy as np
import plotly.io as pio
import pickle

pio.renderers.default = 'browser'

# Import full data
with open("full_data.pickle", 'rb') as handle:
    full_data = pickle.load(handle)

# Axes
X = np.linspace(0, 1, 101)
Y = np.linspace(0, 1, 10)
Z = np.linspace(-0.5, 0.2, 101)

xpos, ypos = np.meshgrid(X[::5], Y, indexing="ij")
xpos = np.ravel(xpos)
ypos = np.ravel(ypos)

# Initialize lists of all fields
DX = []
DY = []
DZ = []

for cross_section in list(full_data["cross_sections"]):
    # Extract field components in x, y, and z and convert them to numpy
    dx, dy, dz = full_data["cross_sections"][cross_section]
    DX.append(np.array(dx))
    DY.append(np.array(dy))
    DZ.append(np.array(dz))

# Convert to numpy and reorder the axes
move_i = [0, 1, 2]
move_e = [1, 2, 0]
DX = np.moveaxis(np.array(DX), move_i, move_e)
DY = np.moveaxis(np.array(DY), move_i, move_e)
DZ = np.moveaxis(np.array(DZ), move_i, move_e)

# Normalize the vector field while preserving relative vector lengths
u1 = np.ravel(DX)
v1 = np.ravel(DY)
w1 = np.ravel(DZ)
np_df = np.array([u1, v1, w1])
vecf_norm = np.linalg.norm(np_df, 2, axis=0)
max_norm = np.max(vecf_norm)
min_norm = np.min(vecf_norm)
u2 = u1 * (vecf_norm - min_norm) / (max_norm - min_norm)
v2 = v1 * (vecf_norm - min_norm) / (max_norm - min_norm)
w2 = w1 * (vecf_norm - min_norm) / (max_norm - min_norm)

# Create mesh
xi, yi, zi = np.meshgrid(X, Y, Z, indexing="ij")

data_plot = [go.Streamtube(
    x=np.ravel(xi),
    y=np.ravel(yi),
    z=np.ravel(zi),
    u=u2,
    v=v2,
    w=w2,
    starts=dict(  # Determines the streamtubes' starting positions.
        x=xpos,
        y=ypos,
        z=np.array([-0.5]*len(xpos))
    ),
    # sizeref = 0.3,
    colorscale='jet',
    showscale=True,
    maxdisplayed=300  # Maximum number of segments displayed in a streamtube.
)]

fig = go.Figure(data=data_plot)
fig.show()
and get a better plot:

Kalman filter in python 2-D

As shown in this picture, my predicted points follow the GPS track, which has noisy points, and that is not desired. Instead, I want my filter to predict points that follow the road rather than the green area.
I tried to implement a Kalman filter on noisy GPS data to remove the jumping points and to predict missing data if the GPS signal is lost. The data contain latitude and longitude. After adjusting the parameters, I can see that my predicted values are very much the same as my measurements, which does not solve the actual problem I am trying to address. I am still at the learning stage, so I am not sure whether the parameter selection is wrong or the problem lies within my Python code. I'm using QGIS to visualize the Actual and Prediction values and compare them with my real GPS data.
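For background, how closely the filter tracks the raw measurements is set by the balance between the process noise Q and the measurement noise R: the Kalman gain K = P_est H^T (H P_est H^T + R)^-1 shrinks as R grows relative to Q, so the filter leans on its motion model instead of chasing every noisy GPS fix. A sketch with hypothetical values:

import numpy as np

# Hypothetical noise settings for a 2-D (latitude, longitude) filter
Q = np.eye(2) * 1e-6   # process noise: how much the true state may drift per step
R = np.eye(2) * 1e-3   # measurement noise: how noisy each GPS fix is assumed to be
# With R >> Q the gain is small and the track is smooth;
# with R << Q the estimate follows every measurement.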
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:/Users/mun/Desktop/Research/Ny mappe/GPS_track.csv')

lat = np.array([df.latitude])
long = np.array([df.longitude])

# Length of the arrays; they should always have the same length
lng = len(lat[0])
print(lng)

# Pair every latitude with its longitude
coord1 = [list(i) for i in zip(lat[0], long[0])]


class Kalman:
    def __init__(self, ndim):
        self.ndim = ndim
        self.Sigma_x = np.eye(ndim) * 1e-4  # Process noise (Q)
        self.A = np.eye(ndim)               # Transition matrix that predicts the state for the next time step (A)
        self.H = np.eye(ndim)               # Observation matrix (H)
        self.mu_hat = 0                     # State vector (X)
        self.cov = np.eye(ndim) * 0.01      # Process covariance (P)
        self.R = .001                       # Sensor noise covariance / measurement error (R)

    def update(self, obs):
        # Make prediction
        self.mu_hat_est = np.dot(self.A, self.mu_hat)
        self.cov_est = np.dot(self.A, np.dot(self.cov, self.A.T)) + self.Sigma_x
        # Update estimate
        self.error_mu = obs - np.dot(self.H, self.mu_hat_est)
        self.error_cov = np.dot(self.H, np.dot(self.cov, self.H.T)) + self.R
        self.K = np.dot(np.dot(self.cov_est, self.H.T), np.linalg.inv(self.error_cov))
        self.mu_hat = self.mu_hat_est + np.dot(self.K, self.error_mu)
        if self.ndim > 1:
            self.cov = np.dot(np.eye(self.ndim) - np.dot(self.K, self.H), self.cov_est)
        else:
            self.cov = (1 - self.K) * self.cov_est


if __name__ == "__main__":
    # ***** 1d *****
    ndim = 1
    nsteps = 3
    k = Kalman(ndim)
    mu_init = np.array([54.907134])
    cov_init = 0.001 * np.ones((ndim))
    obs = np.random.normal(mu_init, cov_init, (ndim, nsteps))
    for t in range(ndim, nsteps):
        k.update(obs[:, t])
        print("Actual: ", obs[:, t], "Prediction: ", k.mu_hat_est)

    # ***** 2d (lat, long) *****
    coord_output = []
    for coordinate in coord1:
        temp_list = []
        ndim = 2
        nsteps = 100
        k = Kalman(ndim)
        mu_init = np.array(coordinate)
        cov_init = 0.0001 * np.ones((ndim))
        obs = np.zeros((ndim, nsteps))
        for t in range(nsteps):
            obs[:, t] = np.random.normal(mu_init, cov_init)
        for t in range(ndim, nsteps):
            k.update(obs[:, t])
            print("Actual: ", obs[:, t], "Prediction: ", k.mu_hat_est[0])
        temp_list.append(obs[:, t])
        temp_list.append(k.mu_hat_est[0])
        coord_output.append(temp_list)

    df2 = pd.DataFrame(coord_output)
    Actual = df2[0]
    Prediction = df2[1]
    Actual_df = pd.DataFrame(Actual)
    Prediction_df = pd.DataFrame(Prediction)

    Actual_coord = pd.DataFrame(Actual_df[0].to_list(),
                                columns=['latitude', 'longitude'])
    Actual_coord.to_csv('C:/Users/mun/Desktop/Research/Ny mappe/Actual_noise.csv')

    Prediction_coord = pd.DataFrame(Prediction_df[1].to_list(),
                                    columns=['latitude', 'longitude'])
    Prediction_coord.to_csv('C:/Users/mun/Desktop/Research/Ny mappe/Prediction_noise.csv')

    Actual_coord.plot(kind='scatter', x='longitude', y='latitude', color='red')
    plt.show()
    Prediction_coord.plot(kind='scatter', x='longitude', y='latitude', color='green')
    plt.show()

how to create a proper sigmoid curve?

I'm trying to use logistic regression on the popularity of hit songs on Spotify from 2010-2019, based on their duration and durability, with data collected from a .csv file. Since the popularity value of each song is numerical, I have converted it to a binary value: if the popularity of a song is 70 or less, I replace its value with 0, and with 1 otherwise.
The sigmoid curve I plot currently comes out as a straight line. In the context of this code, I am still not sure how to add a proper sigmoid curve instead of just the straight line. Is there anything I need to add to my code in order to show both a solid sigmoid curve and the log of the curve in the same graph? It would be deeply appreciated if someone could help me with the final step.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('top10s [SubtitleTools.com] (2).csv')

BPM = np.array(df.bpm)
Energy = np.array(df.nrgy)
Dance = np.array(df.dnce)
dB = np.array(df.dB)
Live = np.array(df.live)
Valence = np.array(df.val)
Acous = np.array(df.acous)
Speech = np.array(df.spch)

# Binarize popularity: 0 if popu <= 70, 1 otherwise
df.loc[df['popu'] <= 70, 'popu'] = 0
df.loc[df['popu'] > 70, 'popu'] = 1

def Logistic_Regression(X, y, iterations, alpha):
    ones = np.ones((X.shape[0], ))
    X = np.vstack((ones, X))
    X = X.T
    b = np.zeros(X.shape[1])
    for i in range(iterations):
        z = np.dot(X, b)
        p_hat = sigmoid(z)
        gradient = np.dot(X.T, (y - p_hat)) / y.size
        b = b + alpha * gradient
        if i % 1000 == 0:
            print('LL, i ', log_likelihood(X, y, b), i)
    return b

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, b):
    z = np.dot(X, b)
    LL = np.sum(y * z - np.log(1 + np.exp(z)))
    return LL

def LR1():
    Dur = np.array(df.dur)
    Pop = np.array([int(i) for i in df.popu])
    plt.figure(figsize=(10, 8))
    colormap = np.array(['r', 'b'])
    plt.scatter(Dur, Pop, c=colormap[Pop], alpha=.4)
    b = Logistic_Regression(Dur, Pop, iterations=8000, alpha=0.00005)
    print('Done')
    p_hat = sigmoid(np.dot(Dur, b[1]) + b[0])
    idxDur = np.argsort(Dur)
    plt.plot(Dur[idxDur], p_hat[idxDur])
    plt.show()

LR1()
My dataset:
CSV File
My Current Graph
What I want to have:
Shape of the sigmoid I want
At first glance, your Logistic_Regression initialization seems very wrong.
I think you packed X as [X, 1] and then try to learn W = [weight, bias], which should be [1, 0] to start with.
Note the 1 is the vector [1, 1, 1, ...] with length equal to the feature vector length.
try something like this:
x_range = np.linspace(Dur.min(), Dur.max(), 100)
p_hat = sigmoid(np.dot(x_range, b[1]) + b[0])
plt.plot(x_range, p_hat)
plt.show()

How to plot perceptron decision boundary and data set in python

I wrote a multilayer perceptron, using three layers (0, 1, 2). I want to plot the decision boundary and the data set (eight features long) that I classified, using Python.
How do I plot it on the screen, using one of the Python libraries?
Weight function -> matrix[3][8]
Sample x -> vector[8]
import sys
import random

import numpy as np


#-- Trains the decision boundary, and tests it. --#
def perceptron(x, y):
    m = len(x)
    d = len(x[0])
    eta = 0.1
    w = [[0 for k in range(d)] for j in range(3)]
    T = 2500
    for t in range(0, T):
        i = random.randint(0, m - 1)
        v = [float(j) for j in x[i]]
        y_hat = np.argmax(np.dot(w, v))
        if y_hat != y[i]:
            w[y[i]] = np.add(w[y[i]], np.array(v) * eta)
            w[y_hat] = np.subtract(w[y_hat], np.array(v) * eta)
    w_perceptron = w

    #-- Test the decision boundary that we trained. --#
    #-- Prints the loss weight function. --#
    M_perceptron = 0
    for t in range(0, m):
        y_hat = np.argmax(np.dot(w_perceptron, x[t]))
        if y[t] != y_hat:
            M_perceptron = M_perceptron + 1
    return float(M_perceptron) / m


def main():
    y = []
    x = [[]]
    x = readTrain_X(sys.argv[1], x)  # Reads the training data set.
    readTrain_Y(sys.argv[2], y)      # Reads the correct labels for the training set.
    print(perceptron(x, y))
You cannot plot 8 features; there is no way to visualize an 8D space. But what you can do is perform dimensionality reduction to 2D using PCA/t-SNE for visualization. If you can reduce it to 2D, you can then create a grid of values and use the probabilities returned by the model to visualize the decision boundary, as in the sketch below.
Reference: Link
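A minimal sketch of that idea, using a synthetic 8-feature dataset and scikit-learn's Perceptron as stand-ins for the custom model above (class predictions are used instead of probabilities):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import Perceptron

# Synthetic stand-in for an 8-feature dataset with 3 classes
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)
X = rng.normal(size=(300, 8)) + y[:, None]

# Reduce to 2D, train in the reduced space, then paint a grid of predictions
X2 = PCA(n_components=2).fit_transform(X)
clf = Perceptron().fit(X2, y)

xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 300),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)
plt.show()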

How to draw decision boundary in SVM sklearn data in python?

I am reading email data from a training set and creating train_matrix, train_labels and test_labels. Now how do I display the decision boundary using matplotlib in Python? I am using sklearn's svm. There are online examples for pre-given data sets such as iris, but the plot fails on custom data. Here is my code.
Error:
Traceback (most recent call last):
  File "classifier-plot.py", line 115, in <module>
    Z = Z.reshape(xx.shape)
ValueError: cannot reshape array of size 260 into shape (150,1750)
Code:
import os
import numpy as np
from collections import Counter
from sklearn import svm
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    list_to_remove = list(dictionary.keys())
    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    count = 0
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:
                    words = line.split()
                    for word in words:
                        wordID = 0
                        for i, d in enumerate(dictionary):
                            if d[0] == word:
                                wordID = i
                        features_matrix[docID, wordID] = words.count(word)
        train_labels[docID] = 0
        filepathTokens = fil.split('/')
        lastToken = filepathTokens[len(filepathTokens) - 1]
        if lastToken.startswith("spmsg"):
            train_labels[docID] = 1
            count = count + 1
        docID = docID + 1
    return features_matrix, train_labels

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"

dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

model = svm.SVC(kernel="rbf", C=10000)

print("Training model.")
features_matrix = features_matrix[:len(features_matrix)//10]
labels = labels[:len(labels)//10]

# train model
model.fit(features_matrix, labels)

predicted_labels = model.predict(test_feature_matrix)

print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

##----------------
h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
X = features_matrix
y = labels
svc = model.fit(X, y)
# svm.SVC(kernel='linear', C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = y[:].min() - 1, y[:].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel']

Z = predicted_labels  # svc.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[0])
plt.show()
In the tutorial that you were following Z is computed by applying the classifier to a set of feature vectors generated to form a regular NxM grid. This makes the plot smooth.
When you replaced
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
with
Z = predicted_labels
you replaced this regular grid with the predictions taken on your dataset. The next line failed with an error since it could not reshape an array of size len(files) to an NxM matrix. There is no reason len(files) = NxM.
There is a reason why you could not follow the tutorial directly. Your data dimension is 3000, so your decision boundary would be a 2999-dimensional hyperplane in a 3000-dimensional space. This is not easy to visualize.
In the tutorial the dimension is 4 and it is reduced to 2 for visualization.
The best way to reduce the dimension of your data depends on the data. In the tutorial we just pick the first two components of the 4-dimensional vector.
Another option that works well in many cases is to use Principal Component Analysis to reduce the dimension of data.
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(features_matrix, labels)
reduced_matrix = pca.fit_transform(features_matrix, labels)
model.fit(reduced_matrix, labels)
Such a model can be used for 2D visualization. You can just follow the tutorial directly and define
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
A complete but not a very impressive example
We do not have access to your email data, so for illustration we could just use random data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.decomposition import PCA

# initialize algorithms and data with random
model = svm.SVC(gamma=0.001, C=100.0)
pca = PCA(n_components=2)
rng = np.random.RandomState(0)
U = rng.rand(200, 2000)
v = (rng.rand(200)*2).astype('int')

pca.fit(U, v)
U2 = pca.fit_transform(U, v)
model.fit(U2, v)

# generate grid for plotting
h = 0.2
x_min, x_max = U2[:, 0].min() - 1, U2[:, 0].max() + 1
y_min, y_max = U2[:, 1].min() - 1, U2[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(x_min, x_max, h),
    np.arange(y_min, y_max, h))

# create decision boundary plot
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(U2[:, 0], U2[:, 1], c=v)
plt.show()
This would produce a decision boundary that does not look very impressive.
Indeed, the first two principal components capture just about 1% of the information contained in the data:
>>> print(pca.explained_variance_ratio_)
[ 0.00841935 0.00831764]
If you now introduce just a little bit of carefully disguised asymmetry, you will already see an effect.
Modify the data to introduce shifts at just one coordinate, randomly selected for each feature:
random_shifts = (rng.rand(2000)*200).astype('int')
for i in range(len(v)):  # one randomly selected coordinate per sample
    if v[i] == 1:
        U[i, random_shifts[i]] += 5.0
Applying PCA again, you would get a somewhat more informative picture.
Note that here the first two principal components already explain about 5% of the variance and the red part of the picture contains many more red points than blue ones.
