I am trying to run a comparison of my oversampling and undersampling algorithms.
This is my y numpy array:
[0. 0. 0. ... 0. 0. 0.]
There are 1's and 0's here.
Percentage of 0s: 99.57113470676805
Percentage of 1s: 0.4288652932319543
This is my X numpy array:
[[ 9.99139870e+00 6.87505736e-01 8.18184694e-01 5.79211424e-03
7.07254165e-02 -4.96940863e-02]
[ 1.45842820e-02 8.90971353e-01 5.40819886e-02 4.78689597e-03
-7.58403812e-01 1.25082521e-01]
[ 1.45743243e-02 8.77439954e-01 3.24491931e-02 4.73968535e-03
-5.17675263e-02 -5.86812372e-02]
...
[ 1.81681846e-03 2.17873637e+00 7.85498395e-01 5.44274803e-04
-4.03230077e-02 2.36304861e-02]
[ 1.81637248e-03 2.22724182e+00 7.85498395e-01 5.74896405e-04
2.43415000e-01 -2.68917605e-02]
[ 1.81600743e-03 2.29634509e+00 7.85498395e-01 5.93269365e-04
1.17457969e-01 1.15348925e-03]]
As you can see above, X has 6 features, but the error complains that only 2 are being passed to the classifier. I don't know where to fix this so that the graph works.
This is what I am trying to measure:
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

samplers = [SMOTE(random_state=0), SMOTEENN(random_state=0), SMOTETomek(random_state=0)]
fig, axs = plt.subplots(3, 2, figsize=(15, 25))
for ax, sampler in zip(axs, samplers):
    clf = make_pipeline(sampler, LinearSVC()).fit(X, y)
    plot_decision_function(X, y, clf, ax[0])
    plot_resampling(X, y, sampler, ax[1])
fig.tight_layout()
plt.show()
import numpy as np
import seaborn as sns
from collections import Counter

def plot_decision_function(X, y, clf, ax):
    """Plot the decision function of the classifier and the original data."""
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
    )
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor="k")
    ax.set_title(f"Resampling using {clf[0].__class__.__name__}")

def plot_resampling(X, y, sampler, ax):
    """Plot the resampled dataset using the sampler."""
    X_res, y_res = sampler.fit_resample(X, y)
    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor="k")
    sns.despine(ax=ax, offset=10)
    ax.set_title(f"Decision function for {sampler.__class__.__name__}")
    return Counter(y_res)
The error that I am getting is simple, but I cannot find a way to fix it:
ValueError: X has 2 features per sample; expecting 10
ValueError Traceback (most recent call last)
in <module>
9 for ax, sampler in zip(axs, samplers):
10 clf = make_pipeline(sampler, LinearSVC()).fit(X, y)
---> 11 plot_decision_function(X, y, clf, ax[0])
12 plot_resampling(X, y, sampler, ax[1])
13 fig.tight_layout()
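A minimal sketch of one possible fix, assuming the goal is simply to make the 2-D plotting helpers work: fit and plot on only two of the columns, so that predict() receives the same number of features as the meshgrid supplies. The X_2d name and the choice of the first two columns are illustrative, not part of the original code.
# Sketch under the assumption above: reuse samplers and axs from the snippet
# and feed the pipeline the same two columns the plotting helpers grid over.
X_2d = X[:, :2]  # illustrative choice of two features
for ax, sampler in zip(axs, samplers):
    clf = make_pipeline(sampler, LinearSVC()).fit(X_2d, y)
    plot_decision_function(X_2d, y, clf, ax[0])
    plot_resampling(X_2d, y, sampler, ax[1])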
I am working my way through the online text "Applied Machine Learning in Python" at https://amueller.github.io/aml/01-ml-workflow/02-supervised-learning.html
Currently, I am working through the chapter on "Supervised Learning". The following snippet of code occurs toward the end of the chapter:
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, n_neighbors in zip(axes.ravel(), [3, 5, 11, 33]):
    ax.set_title(f"n_neighbors={n_neighbors}")
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train[['mean compactness', 'worst concave points']], y_train)
    ax.scatter(X_train['mean compactness'], X_train['worst concave points'], c=y_train, cmap='bwr', s=2)
    plot_2d_classification(clf, np.array(X_train[['mean compactness', 'worst concave points']]), ax=ax, alpha=.4, cmap='bwr')
    ax.set_aspect("equal")
    ax.set_xlim(0.05, 0.17)
    ax.set_ylim(0.06, 0.2)
When I copy and paste it into Jupyter Notebook, it returns the following error:
NameError Traceback (most recent call last)
Input In [24], in <cell line: 2>()
4 clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train[['mean compactness', 'worst concave points']], y_train)
5 ax.scatter(X_train['mean compactness'], X_train['worst concave points'], c=y_train, cmap='bwr', s=2)
6 plot_2d_classification(clf, np.array(X_train[['mean compactness', 'worst concave points']]), ax=ax, alpha=.4, cmap='bwr')
7 ax.set_aspect("equal")
8 ax.set_xlim(0.05, 0.17)
NameError: name 'plot_2d_classification' is not defined
It is supposed to return a set of four plots, one per n_neighbors value.
I have done a Google search using the term "plot_2d_classification" and received a single page of links, none of which provide any insight.
I found the following two files by A. Mueller:
plot_2d_separator.py: https://github.com/amueller/mglearn/blob/master/mglearn/plot_2d_separator.py
which requires
plot_helpers.py: https://github.com/amueller/mglearn/blob/master/mglearn/plot_helpers.py
Cutting and pasting the snippet of code above produces additional errors, so none of the three sets of code runs successfully.
Any suggestions?
The error indicates that you didn't define the function plot_2d_classification.
Just copy the function plot_2d_classification from plot_2d_separator.py and make a small modification. The full code:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def plot_2d_classification(classifier, X, fill=False, ax=None, eps=None,
                           alpha=1, cm=None):
    # multiclass
    if eps is None:
        eps = X.std() / 2.
    if ax is None:
        ax = plt.gca()
    x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps
    y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps
    xx = np.linspace(x_min, x_max, 1000)
    yy = np.linspace(y_min, y_max, 1000)
    X1, X2 = np.meshgrid(xx, yy)
    X_grid = np.c_[X1.ravel(), X2.ravel()]
    decision_values = classifier.predict(X_grid)
    ax.imshow(decision_values.reshape(X1.shape),
              extent=(x_min, x_max, y_min, y_max),
              aspect='auto', origin='lower', alpha=alpha, cmap=cm)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())

cancer = load_breast_cancer(as_frame=True)
cancer_df = cancer.frame
data_train, data_test = train_test_split(cancer_df)
X_train = data_train.drop(columns='target')
y_train = data_train.target
X_test = data_test.drop(columns='target')
y_test = data_test.target

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, n_neighbors in zip(axes.ravel(), [3, 5, 11, 33]):
    ax.set_title(f"n_neighbors={n_neighbors}")
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train[['mean compactness', 'worst concave points']], y_train)
    ax.scatter(X_train['mean compactness'], X_train['worst concave points'], c=y_train, cmap='bwr', s=2)
    plot_2d_classification(clf, np.array(X_train[['mean compactness', 'worst concave points']]), ax=ax, alpha=.4, cm='bwr')
    ax.set_aspect("equal")
    ax.set_xlim(0.05, 0.17)
    ax.set_ylim(0.06, 0.2)
Then run the code and you'll get the figure.
With generated data, I am trying to plot the 3D decision boundary of QDA in 3D space. I used the sklearn library to fit the QDA model, but I couldn't plot its 3D decision boundary.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.datasets import make_classification
# Generate a random dataset for classification
X, y = make_classification(n_features=3, n_informative=2, n_redundant=0, n_repeated=0, random_state=0)
# Create and fit a QDA classifier
qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)
# Plot the decision boundary of the QDA classifier
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
z_min, z_max = X[:, 2].min() - 1, X[:, 2].max() + 1
xx, yy, zz = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1),
                         np.arange(z_min, z_max, 0.1))
X_grid = np.c_[xx.ravel(), yy.ravel(), zz.ravel()]
Z = qda.predict(X_grid)
Z = Z.reshape(xx.shape)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.Paired)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
ax.plot_surface(xx, yy, zz, facecolors=plt.cm.Paired(Z), alpha=0.2)
plt.show()
The above code says zz must be 2D instead of 3D, but I don't really get why it has to be 2D.
In the end, I want to see a figure with a scatter plot of the binary classes and the QDA decision boundary over the data points.
With respect to z, this also needs to be a 2D array, since Axes3D.plot_surface maps each element of the 2D array z onto the 2D grid defined by x and y.
Hence, when you use your own x, y and z, make sure that you use numpy.meshgrid for x and y and then define z = f(x, y), i.e. a 2D array of heights over that grid.
This may also help: Link
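A minimal sketch along those lines, assuming a 2-D slice of the decision regions is acceptable: fix the third feature at a constant so the prediction over the grid becomes a 2-D array that contourf can draw. It reuses qda, X and y from the question; freezing the third feature at its mean is an illustrative choice.
# Sketch under the assumption above: 2-D slice of the QDA decision regions
# with the third feature frozen at its mean.
import numpy as np
import matplotlib.pyplot as plt
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
z_fixed = np.full(xx.ravel().shape, X[:, 2].mean())  # freeze the third feature
Z = qda.predict(np.c_[xx.ravel(), yy.ravel(), z_fixed]).reshape(xx.shape)
fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolor='k')
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
ax.set_title('QDA decision regions, feature 2 fixed at its mean')
plt.show()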
I got the sklearn SVM classifier to work. I simply classify between 2 options, 0 or 1,
using feature vectors. It works fine.
I want to visualize it on a page using graphs.
The problem is that my feature vector is 512 items long, so it is hard to show on an x,y graph.
Is there any way to visualize the classification hyperplane for a long feature vector like this (512 features)?
You cannot visualize the decision surface for a large number of features, because there are too many dimensions and there is no way to visualize an N-dimensional surface.
However, you can use 2 features and plot nice decision surfaces as follows.
I have also written an article about this here:
https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35
Case 1: 2D plot for 2 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

model = svm.SVC(kernel='linear')
clf = model.fit(X, y)

fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
Case 2: 3D plot for 3 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from mpl_toolkits.mplot3d import Axes3D
iris = datasets.load_iris()
X = iris.data[:, :3] # we only take the first three features.
Y = iris.target
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x such that
# np.dot(clf.coef_[0], x) + clf.intercept_[0] = 0.
# Solve for w3 (z):
z = lambda x, y: (-clf.intercept_[0] - clf.coef_[0][0]*x - clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
I am pretty new to machine learning, so I still don't understand how I can visualize the boundary between 2 classes in a bag-of-words case.
I found the following example for plotting the data:
plot a document tfidf 2D graph
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
X = pipeline.fit_transform(newsgroups_train.data).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
plt.show()
In my project I use SVC estimator
clf = SVC(random_state=241, kernel = 'linear')
clf.fit(X,newsgroups_train.target)
I have tried to use the example
http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
but it didn't work in the text classification case.
So how can I add the border of two classes to this plot?
Thank you!
The problem is that you need to select only 2 features in order to create the 2-dimensional decision surface plot. I will provide 2 examples: the first uses the iris data and the second uses your data.
I have also written an article about this here:
https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35
In both cases, I select only 2 features in order to create the plot.
Example 1 using iris data:
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

model = svm.SVC(kernel='linear')
clf = model.fit(X, y)

fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
RESULTS
Example 2 using your data:
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])
X = pipeline.fit_transform(newsgroups_train.data).todense()

# Select ONLY 2 features
X = np.array(X)
X = X[:, [0, 1]]
y = newsgroups_train.target

def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

model = svm.SVC(kernel='linear')
clf = model.fit(X, y)

fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
RESULTS
Important note:
In the second case, the plot is not nice since we randomly selected only 2 features to create it. One way to improve it is the following: you could use a univariate ranking method (e.g. the ANOVA F-value test) to find the top 2 features out of the 22464 that you initially have, as sketched below. Then, using these top 2 features, you could create a nice separating surface plot.
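A minimal sketch of that ranking step, assuming SelectKBest with the ANOVA F-value (f_classif) is an acceptable way to do it; the X_full name and k=2 are illustrative, not taken from the original answer.
# Sketch under the assumption above: rank the full tf-idf matrix (before the
# [:, [0, 1]] slice in Example 2) and keep the two highest-scoring features.
from sklearn.feature_selection import SelectKBest, f_classif
X_full = np.asarray(pipeline.fit_transform(newsgroups_train.data).todense())
selector = SelectKBest(score_func=f_classif, k=2)
X_top2 = selector.fit_transform(X_full, y)
# X_top2 can then replace the randomly chosen columns and be passed to
# make_meshgrid / plot_contours exactly as in Example 2.
clf = svm.SVC(kernel='linear').fit(X_top2, y)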
How do I color the decision boundaries for a k-Nearest Neighbor classifier as seen here:
I've got the data for the 3 classes successfully plotted out using scatter (left picture).
Image source: http://cs231n.github.io/classification/
To plot decision boundaries you need to make a meshgrid. You can use np.meshgrid to do this. np.meshgrid requires the min and max values of X and Y and a mesh step size parameter. It is sometimes prudent to make the minimum values a bit lower than the minimum of x and y, and the max values a bit higher.
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
You then feed your classifier the meshgrid like so: Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]). You need to reshape the output of this to the same shape as your original meshgrid: Z = Z.reshape(xx.shape). Finally, when you are making your plot, call plt.pcolormesh(xx, yy, Z, cmap=cmap_light); this will make the decision boundaries visible in your plot.
Below is a complete example to achieve this found at http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
This results in two graphs being output, one per weighting scheme.
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
If I take X as a 3-dimensional dataset (i.e. keep three features), what would change in the following code?
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()
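A minimal sketch of one possible change, assuming you still want a 2-D plot: after fitting clf on the 3-feature X, pad the grid with a constant value for the third feature so that predict() receives three values per sample. Fixing it at the mean is an illustrative choice, not the only one.
# Sketch under the assumption above: reuse xx, yy, clf, X, y and the color
# maps from the snippet, and slice the third feature at its mean.
third = np.full(xx.ravel().shape, X[:, 2].mean())
Z = clf.predict(np.c_[xx.ravel(), yy.ravel(), third])
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.show()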