Using Python I'm trying to plot a sine wave and a random distribution, then show where their ratio is greater than or equal to 3.
I think I'm 90% of the way there but keep getting the error message 'x and y must be the same size' when I try to plot it. I've been racking my brains but can't figure out what I'm missing.
Any help or pointers gratefully received.
import numpy as np
import math
import matplotlib.pyplot as plt
r = 2*math.pi
dev = 0.1
x = np.array(np.arange(0, r, dev))
y1 = np.array(np.sin(x))
y2 = np.array(np.random.normal(loc=0, scale=0.1, size=63))
mask = y1//y2 >= 3
fit = np.array(x[mask])
print(fit)
plt.plot(x, y1)
plt.scatter(x, fit)
plt.scatter(x, y2, marker=".")
plt.show()
Insert this line into your code, just before the point of error:
print(len(x), len(fit))
Output:
63 28
You explicitly removed elements from your sequence, and then expected the arrays to be the same size. You still have 63 x values, but now only 28 y values. Since you didn't trace the problem and explain what you intend for this scatter plot, I have no way of knowing what a "fix" might be. Perhaps make a list of points (x-y pairs), and then filter that for the appropriate y1/y2 ratio?
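That suggestion might look something like this (a minimal sketch, assuming the x, y1 and y2 arrays from your code; note it uses true division /, since you asked for a ratio):
pairs = [(xi, y1i) for xi, y1i, y2i in zip(x, y1, y2) if y1i / y2i >= 3]
fit_x = [p[0] for p in pairs]
fit_y = [p[1] for p in pairs]
plt.scatter(fit_x, fit_y)  # the two lists have equal length by construction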
Not sure if this is what you want, but this will scatter dots on the sine curve corresponding to your mask.
import numpy as np
import math
import matplotlib.pyplot as plt
r = 2*math.pi
dev = 0.1
x = np.arange(0, r, dev)                  # 63 sample points in [0, 2*pi)
y1 = np.sin(x)
y2 = np.random.normal(loc=0, scale=0.1, size=63)
mask = y1//y2 >= 3                        # note: // is floor division; use / if you want the true ratio
fit_x = x[mask]                           # filter x and y1 with the same mask...
fit_y = y1[mask]                          # ...so both have the same length
plt.plot(x, y1)
plt.scatter(fit_x, fit_y)
plt.scatter(x, y2, marker=".")
plt.show()
In your line plt.scatter(x, fit) you are trying to scatter your x-values against your fit-values. However, fit only has as many entries as the mask lets through (25 in my run; the exact count varies because y2 is random), while x is of size 63 (as are y1 and y2, by the way; that's why that part works).
mask is basically an array of False/True values. That means x[mask] creates an array containing only the values of x where the mask is True, which seems to be what you want. But you can only scatter it against something of the same length, such as np.sin(fit) (equivalently y1[mask]); otherwise the sizes are incompatible.
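To see the sizes involved, here is a quick sketch with the arrays from the question:
print(mask.size)                      # 63: one True/False flag per x value
print(x[mask].size, y1[mask].size)    # equal: both filtered by the same mask
# x[mask] and y1[mask] line up element for element, so they can be scattered
# against each other; x (still 63 values) and x[mask] cannot.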
"""## Splitting the dataset into the Training set and Test set"""
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
"""## Training the Simple Linear Regression model on the Training set"""
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
"""## Predicting the Test set results"""
y_pred = regressor.predict(X_test)
"""## Visualising the Training set results"""
plt.scatter(X_train, y_train, color = 'green')
plt.plot(X_train, regressor.predict(X_train), color = 'yellow')
plt.title('Doctor visits(Training set)')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
"""## Visualising the Test set results"""
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Doctor visits (Test set)')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Up until yesterday I had no problem plotting a particular plot, but since earlier today I'm getting several plots instead of just the one.
The code below is meant to generate just one plot, but somehow I'm getting a lot of different ones.
How can I fix this?
Update: I was able to get the plot on Jupyter online just fine, I think the problem might be something related to Cloudera Data Science Workbench, any ideas?
data: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
The code I'm using to plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
df = pd.read_csv('http://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data',
                 header=None)
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.20,
                     stratify=y, random_state=1)
pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2',
                                           random_state=1))
train_sizes, train_scores, test_scores = \
    learning_curve(estimator=pipe_lr, X=X_train,
                   y=y_train,
                   train_sizes=np.linspace(0.1, 1.0, 10),
                   cv=10, n_jobs=1)
train_mean = np.mean(train_scores, axis = 1)
train_std = np.std(train_scores, axis = 1)
test_mean = np.mean(test_scores, axis = 1)
test_std = np.std(test_scores, axis = 1)
plt.plot(train_sizes, train_mean, color='blue',
         marker='o', markersize=5,
         label='training accuracy')
plt.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.5, color='blue')
plt.plot(train_sizes, test_mean, color='green',
         linestyle='--', marker='s', markersize=5,
         label='validation accuracy')
plt.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')
plt.grid()
plt.xlabel("Number of training samples")
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.85, 1.025])
plt.show()
Using Jupyter Online I am able to get the plot I want
Do I need to reset something in matplotlib in CDSW?
Cloudera Data Science Workbench processes each line of code individually (unlike notebooks that process code per-cell). This means if your plot requires multiple commands, you will see incomplete plots in the workbench as each line is processed.
To get around this behavior, wrap all your plotting commands in one Python function. Cloudera Data Science Workbench will then process the function as a whole, and not as individual lines. You should then see your plots as expected.
The above information is from https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_visualize_report.html
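For example, the plotting section of the code above could be wrapped like this (a sketch reusing the train_sizes, train_mean and related arrays already computed earlier in the script):
def plot_learning_curve():
    # All plotting commands live in one function, so CDSW processes them together
    plt.plot(train_sizes, train_mean, color='blue', marker='o',
             markersize=5, label='training accuracy')
    plt.fill_between(train_sizes, train_mean + train_std,
                     train_mean - train_std, alpha=0.5, color='blue')
    plt.plot(train_sizes, test_mean, color='green', linestyle='--',
             marker='s', markersize=5, label='validation accuracy')
    plt.fill_between(train_sizes, test_mean + test_std,
                     test_mean - test_std, alpha=0.15, color='green')
    plt.grid()
    plt.xlabel('Number of training samples')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.ylim([0.85, 1.025])
    plt.show()

plot_learning_curve()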
You probably need to be more explicit about the figure and axes you're plotting on:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)  # note: add_axes expects a rect; add_subplot takes (nrows, ncols, index)
ax.plot(...)
ax.fill_between(...)
et cetera
I got an sklearn SVM classifier working. I simply classify between 2 options, 0 or 1,
using feature vectors. It works fine.
I want to visualize it on a page using graphs.
The problem is that my vectors are 512 items long, so they are hard to show on an x-y graph.
Is there any way to visualize the classification hyperplane for a long feature vector like 512?
You cannot visualize the decision surface for a lot of features. This is because there would be too many dimensions, and there is no way to visualize an N-dimensional surface.
However, you can use 2 features and plot nice decision surfaces as follows.
I have also written an article about this here:
https://towardsdatascience.com/support-vector-machines-svm-clearly-explained-a-python-tutorial-for-classification-problems-29c539f3ad8?source=friends_link&sk=80f72ab272550d76a0cc3730d7c8af35
Case 1: 2D plot for 2 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
model = svm.SVC(kernel='linear')
clf = model.fit(X, y)
fig, ax = plt.subplots()
# title for the plots
title = ('Decision surface of linear SVC ')
# Set-up grid for plotting.
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_ylabel('y label here')
ax.set_xlabel('x label here')
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
ax.legend()
plt.show()
Case 2: 3D plot for 3 features and using the iris dataset
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from mpl_toolkits.mplot3d import Axes3D
iris = datasets.load_iris()
X = iris.data[:, :3] # we only take the first three features.
Y = iris.target
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x so that np.dot(svc.coef_[0], x) + b = 0.
# Solve for w3 (z)
z = lambda x,y: (-clf.intercept_[0]-clf.coef_[0][0]*x -clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
I'm trying to plot the predicted mean from a Gaussian process regression as a 3-D contour. I've followed the threads Plot 3D Contour from an Image using extent with Matplotlib
and mplot3d example code: contour3d_demo3.py. Here is my code:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
x_train = np.array([[0,0],[2,2],[3,3]])
y_train = np.array([[200,321,417]])
xvalues = np.array([0,1,2,3])
yvalues = np.array([0,1,2,3])
a,b = np.meshgrid(xvalues,yvalues)
positions = np.vstack([a.ravel(), b.ravel()])
x_test = (np.array(positions)).T
kernel = C(1.0, (1e-3, 1e3)) * RBF(10)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(x_train, y_train)
y_pred_test = gp.predict(x_test)
fig = plt.figure()
ax = fig.add_subplot(projection = '3d')
x=y=np.arange(0,3,1)
X, Y = np.meshgrid(x,y)
Z = y_pred_test
cset = ax.contour(X, Y, Z, cmap=cm.coolwarm)
ax.clabel(cset, fontsize=9, inline=1)
plt.show()
After running the above code, I get the following error on the console:
I want the x- and y-axes as the 2-D plane and the predicted values on the z-axis. The sample plot is as follows:
What is wrong with my code?
Thank you!
The specific error you've mentioned comes from your y_train, which might be a typo. It should be:
y_train_ : array-like, shape = (n_samples, [n_output_dims])
According to your x_train, you have 3 samples. So your y_train should have shape (3, 1) rather than (1, 3).
You also have other bugs in the plotting part:
add_subplot should have a position before projection = '3d'.
Z should have the same shape as X and Y for contour plot.
Because of 2, your x and y should match xvalues and yvalues.
Taken together, you might need to make the following changes:
...
y_train = np.array([200,321,417])
...
ax = fig.add_subplot(111, projection = '3d')
x=y=np.arange(0,4,1)
...
Z = y_pred_test.reshape(X.shape)
...
Just to mention two things:
The plot you will get after these changes won't match the figure you've shown. The figure in your question is a surface plot instead of a contour plot. You can use ax.plot_surface to get that type of plot.
I think you already know this. But just in case: your plot won't be as smooth as your sample plot, since your np.meshgrid is sparse.
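Putting it all together, the corrected script might look like this (a sketch with the changes above applied; the meshgrid is built directly from xvalues and yvalues so that Z automatically matches X and Y in shape):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
x_train = np.array([[0, 0], [2, 2], [3, 3]])
y_train = np.array([200, 321, 417])         # shape (3,): one target per sample
xvalues = np.array([0, 1, 2, 3])
yvalues = np.array([0, 1, 2, 3])
X, Y = np.meshgrid(xvalues, yvalues)
x_test = np.vstack([X.ravel(), Y.ravel()]).T
kernel = C(1.0, (1e-3, 1e3)) * RBF(10)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(x_train, y_train)
y_pred_test = gp.predict(x_test)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # position argument before projection
Z = y_pred_test.reshape(X.shape)            # Z must have the same shape as X and Y
cset = ax.contour(X, Y, Z, cmap=cm.coolwarm)
ax.clabel(cset, fontsize=9, inline=1)
# For a surface plot like the sample figure, use instead:
# ax.plot_surface(X, Y, Z, cmap=cm.coolwarm)
plt.show()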
I just started to learn about polynomial regression, and I was trying to create a plot for polynomial regression, but the plot is wrong.
My code is like this
#Linear regression
from sklearn import linear_model
clf = linear_model.LinearRegression()
x = data.loc[:, ['col1']]
y = data.loc[:, ['col2']]
clf.fit(x, y)
plt.scatter(x, y)
plt.plot(x, clf.predict(x))
plt.show()
This code is for the linear regression model, and it produces this plot.
and the code for polynomial regression is here
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
poly_x = poly_reg.fit_transform(x)
clf.fit(poly_x, y)
plt.scatter(x, y)
plt.plot(x, clf.predict(poly_x))
The plot comes out wrong, like this.
I just started learning about this and was trying to copy what a tutorial does, so my understanding is still shaky. How can I fix this plot? I would also appreciate good resources for understanding the concept.
You need to first sort the values in X and y according to the values in X.
# This is your data
x = data.loc[:, ['col1']]
y = data.loc[:, ['col2']]
# This is what you need to do.
# argsort() will return the indices of the sorting order
inds = x.values.ravel().argsort() # Here I am assuming that x has a single feature
x = x.values.ravel()[inds].reshape(-1,1)
y = y.values[inds]
# Then continue your code.
poly_reg = PolynomialFeatures(degree=2)
poly_x = poly_reg.fit_transform(x)
clf.fit(poly_x, y)
plt.scatter(x, y)
plt.plot(x, clf.predict(poly_x))
plt.show()
Try using a pipeline to fit and predict. For example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_model = make_pipeline(PolynomialFeatures(7),
                           LinearRegression())
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
xfit = np.linspace(0, 10, 1000)
poly_model.fit(x.reshape(-1, 1), y)
yfit = poly_model.predict(xfit.reshape(-1, 1))
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.show()
From the above example you should be able to see how it works; in particular, it is very important to use .reshape(-1, 1) when you have a single feature column.
See if this helps you understand...
I am trying to do it all within sklearn. Here I am trying to generate an unbalanced classification set, run a logistic regression, plot the data points, and plot the decision boundary line.
In order to plot the decision boundary line, I first get the coefficients:
coef = clf.best_estimator_.coef_
intercept = clf.best_estimator_.intercept_
And then I construct the line:
x1 = np.linspace(-8, 10, 100)
x2 = -(coef[0][0] * x1 + intercept[0]) / coef[0][1]
plt.plot(x1, x2, color='#414e8a', linewidth=2)
However, the line doesn't plot because x2 is all inf: coef[0][1] is equal to 0. This is the problem I am having. Why is the second coefficient 0?
Full code below:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold, train_test_split
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%pylab inline
pylab.rcParams['figure.figsize'] = (12, 6)
plt.style.use('fivethirtyeight')
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Generate data with two classes
X, y = make_classification(class_sep=1.2, weights=[0.1, 0.9], n_informative=3, n_redundant=1, n_features=5, n_clusters_per_class=1, n_samples=10000, flip_y=0, random_state=10)
pca = PCA(n_components=2)
X = pca.fit_transform(X)
y = y.astype('str')
y[y=='1'] ='L'
y[y=='0'] ='S'
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
X_1, X_2 = X_train[y_train=='S'], X_train[y_train=='L']
# Fit a Logistic Regression model
clf_base = LogisticRegression()
grid = {'C': 10.0 ** np.arange(-2, 3),'penalty': ['l1', 'l2']}
cv = KFold(X_train.shape[0], n_folds=5, shuffle=True, random_state=0)
clf = GridSearchCV(clf_base, grid, cv=cv, n_jobs=8, scoring='f1_macro')
clf.fit(X_train, y_train)
# Get coefficients
coef = clf.best_estimator_.coef_
intercept = clf.best_estimator_.intercept_
# Create separation line
x1 = np.linspace(-8, 10, 100)
x2 = -(coef[0][0] * x1 + intercept[0]) / coef[0][1]
plt.scatter(X_1[:,0], X_1[:,1], color='#1abc9c')
plt.scatter(X_2[:,0], X_2[:,1], color='#e67e22')
x_coords = np.concatenate([X_1[:,0],X_2[:,0]])
y_coords = np.concatenate([X_1[:,1],X_2[:,1]])
plt.axis([min(x_coords), max(x_coords), min(y_coords), max(y_coords)])
plt.title("Original Dataset - Fitted Logistic Regression")
plt.plot(x1, x2, color='#414e8a', linewidth=2)
plt.show()
print(coef)
As you can see the second term in coef is 0.
What am I doing wrong here?
Thank you!
EDIT
It seems like the grid search parameters are leading to the second coefficient being zero. For example:
When I set the grid parameter to:
grid = {'C': 10.0 ** np.arange(-2, 3),'penalty': ['l1', 'l2'],'class_weight': ['balanced']}
This gives me two non-zero coefficients.
When I remove the class_weight parameter:
grid = {'C': 10.0 ** np.arange(-2, 3),'penalty': ['l1', 'l2']}
This gives me a zero for the second element in coef.
Hope that simplifies the problem. Does anyone out there have an idea? Thank you!
You got a zero coefficient because you are using strong L1 regularization, which removes all the not-so-useful features from the model.
You can see it with clf.best_params_ - it equals {'C': 0.01, 'penalty': 'l1'}. Switch to 'l2' penalty, and you will have all coefficients non-zero.
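For instance, keeping the question's setup but restricting the grid to L2 (a sketch reusing clf_base and cv from the code above):
grid = {'C': 10.0 ** np.arange(-2, 3), 'penalty': ['l2']}
clf = GridSearchCV(clf_base, grid, cv=cv, n_jobs=8, scoring='f1_macro')
clf.fit(X_train, y_train)
print(clf.best_estimator_.coef_)  # both coefficients should now be non-zero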
If you want to plot an arbitrary line of the form Ax+By+C=0, you can use this function:
import matplotlib.pyplot as plt
import numpy as np
def plot_normal_line(A, B, C, ax=None, **kwargs):
    """Plot the line Ax + By + C = 0."""
    if ax is None:
        ax = plt.gca()
    if A == 0 and B == 0:
        raise Exception('A or B should be non-zero')
    if B == 0:
        # plot a vertical line
        ax.vlines(-C / A, *ax.get_ylim(), **kwargs)
    else:
        # plot y as a function of x
        x = np.array(ax.get_xlim())
        y = (A*x + C) / -B
        ax.plot(x, y, **kwargs)
Then the command plot_normal_line(*coef[0], intercept[0]) will draw your decision boundary.
However, since your dataset is imbalanced, for nearly all of the points the most probable class is the second one (the orange). So the decision boundary for 50% probability (the thick black line) lies to the left of the scatter: