ConfusionMatrixDisplay (Scikit-Learn) plot labels out of range - python

The following code plots a confusion matrix:
from sklearn.metrics import ConfusionMatrixDisplay
confusion_matrix = confusion_matrix(y_true, y_pred)
target_names = ["aaaaa", "bbbbbb", "ccccccc", "dddddddd", "eeeeeeeeee", "ffffffff", "ggggggggg"]
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix, display_labels=target_names)
disp.plot(cmap=plt.cm.Blues, xticks_rotation=45)
plt.savefig("conf.png")
There are two problems with this plot.
The y-axis label is cut off (True Label). The x label is cut off too.
The names are to long for the x-axis.
To solve the first problem I tried to use poof(bbox_inches='tight') which is unfortunately not available for sklearn.
In the second case I tried the following solution for 2. which lead to a completely distorted plot.
All in all I'm struggeling with both problems.

I think the easiest way would be to switch into tight_layout and add pad_inches= something.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from numpy.random import default_rng
rand = default_rng()
y_true = rand.integers(low=0, high=7, size=500)
y_pred = rand.integers(low=0, high=7, size=500)
confusion_matrix = confusion_matrix(y_true, y_pred)
target_names = ["aaaaa", "bbbbbb", "ccccccc", "dddddddd", "eeeeeeeeee", "ffffffff", "ggggggggg"]
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix, display_labels=target_names)
disp.plot(cmap=plt.cm.Blues, xticks_rotation=45)
plt.tight_layout()
plt.savefig("conf.png", pad_inches=5)
Result:

Related

Too many lines and curves on the polynomial graph

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.preprocessing import PolynomialFeatures
df = pd.read_csv("C:\\Users\\MONSTER\\Desktop\\dosyalar\\datasets\\Auto.csv")
x = df["horsepower"].to_numpy()
y = df["mpg"].to_numpy()
x = x.reshape(-1,1)
poly = PolynomialFeatures(degree = 5)
X_poly = poly.fit_transform(x)
poly.fit(X_poly,y)
lr = LinearRegression()
lr.fit(X_poly, y)
y_pred = lr.predict(X_poly)
plt.scatter(x,y,color="blue",marker=".")
plt.plot(x,y_pred,color="red")
I have tried to draw a polynomial regression curve but I couldn't manage it. Someone told me to sorting values before plotting via "numpy.argsort" but nothing has changed. How can I fix it?
probably scatter is better for you:
plt.scatter(x,y_pred,color="red")
Or with argsort as mentioned:
orders = np.argsort(x.ravel())
plt.plot(x[orders], y[orders], color='red')

Linear regression plot is really bad

observations that are different from each other so i run regression again but for only one cluster.But it also came out wrong What exactly is wrong here? I'll also have to point out that i am still new to this (linerear regression etc.) so my understanding of all this is still bad. How can i fix this plot and please if it's possible try to explain why it's wrong.
Code :
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(0)
kmeans.cluster_centers_
kmeans.labels_
n,y_test = train_test_split(X, Y, test_size = 0.4, random_state = 0)
plt.scatter(X.iloc[:, 1], Y)
plt.show()
You're performing multiple linear regression, since you have 2 input features ('Age', 'Annual Income (k$)') that try to predict the output feature ('Spending Score (1-100)'). You need to plot this data in 3D, in order to properly visualize the regression.
Even though I can't test your code without the data, something like this should work (after training the model):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.iloc[:, 0], X.iloc[:, 1], Y)
ax.plot(X.iloc[:, 0], X.iloc[:, 1], y_pred, color='red')
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')

How to plot a ROC curve by varying a parameter in a pandas dataframe

I am trying to plot multiple ROC curves on a plot by varying a variable in a cell in a pandas dataframe.
So in a particular row, if the total is above a certain threshold then it will be classified as an invoice. I want to be plotting the different curves on different thresholds of total.
This is the code that I have so far that measures basic metrics and is an attempt to plot the ROC curve but I have been unsuccessful so far.
import os
import pandas as pd
from sklearn import datasets, metrics, model_selection, svm
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv("test_results.csv", header = 0)
true_array = list(df["actual"].to_numpy())
predicted_array = list(df["predicted"].to_numpy())
accuracy = accuracy_score(true_array, predicted_array)
precision, recall, fscore, support = score(true_array, predicted_array, average = None, labels = ['invoice', 'non-invoice'])
print("Labels: \t invoice", "non-invoice")
print('Accuracy: \t {}'.format(accuracy))
print('Precision: \t {}'.format(precision))
print('Recall: \t {}'.format(recall))
print('Fscore: \t {}'.format(fscore))
skplt.metrics.plot_roc_curve(true_array, predicted_array)
plt.show()
The error I am getting is
fpr[i], tpr[i], _ = roc_curve(y_true, probas[:, i],
IndexError: too many indices for array
Any help would be appreciated..
The following documentation mentions that skplt.metrics.plot_roc_curve takes ground truth (correct) target values and prediction probabilities for each class returned by a classifier. So you should change 2nd input prediction_array.
https://scikit-plot.readthedocs.io/en/stable/metrics.html?highlight=roc#scikitplot.metrics.plot_roc

My Python linear regression has incomplete line

I am trying to plot a linear regression. But the line is incomeplete. Below is my Python code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
features = [[2],[4],[8],[5]]
labels = [320, 610, 1190, 726]
plt.scatter(features,labels,color="black")
plt.xlabel("Number of room")
plt.ylabel('price')
clf = linear_model.LinearRegression()
clf=clf.fit(features,labels)
result = clf.predict([[11],[9]])
plt.plot([[11],[9]], result, color='blue', linewidth=3)
print(result)
plt.show()
And the picture of the plot is given below:
From the picture, you can see that the line is incomplete. Please help me to solve this problem. Make the line complete with all the other values.

Translate cross_validation algorithm to model_selection

In 2016, I ran a lasso regression model using the code below:
#Import required packages
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pylab as plt
import matplotlib.pyplot as plp
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4, random_state=123)
#%
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
#%
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
#regcoef.to_csv('variable+regresscoef.csv')
#%%
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
#%
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
#%
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
#%
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
Now I want to run it again and got the following warning:
DeprecationWarning: This module was deprecated in version 0.18 in
favor of the model_selection module into which all the refactored
classes and functions are moved.
How can I rewrite this code using model_selection ?
Only thing I can see here that used cross_validation module earlier is train_test_split.
So just change your import from:
from sklearn.cross_validation import train_test_split
to:
from sklearn.model_selection import train_test_split
and you are good to go.

Categories