I have a function that creates a random forest classifier from scikit-learn, fits it to my data, and should then display the accuracy, kappa and confusion matrix. Everything works except the confusion matrix: I do not get any error, but it is not displayed.
I have tried calling print(cm) and it works, but the output is not in the usual pandas DataFrame style, which is what I am looking for.
Here's the code:
def rf_clf(X, y, test_size = 0.3, random_state = 42):
    """This function splits the data into train and test and fits it in a random forest classifier
    to the data provided, analysing its errors (Accuracy and Kappa). Also as this is classification,
    the function will output a confusion matrix"""
    #Split data in train and test, as well as predictors (X) and targets, (y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
    #import random forest classifier
    base_model = RandomForestClassifier(random_state=random_state)
    #Train the model
    base_model.fit(X_train,y_train)
    #make predictions on test set
    y_pred=base_model.predict(X_test)
    #Print Accuracy and Kappa
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Kappa:",metrics.cohen_kappa_score(y_test, y_pred))
    #create confusion matrix
    labs = [y_test[i][0] for i in range(len(y_test))]
    cm = pd.DataFrame(confusion_matrix(labs, y_pred))
    cm #here is the issue. Kinda works with print(cm)
Import metrics from sklearn at the beginning.
from sklearn import metrics
Use this when you want to show the confusion matrix.
# Get and show the confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
With this you should see the confusion matrix as raw text.
If you want to show the confusion matrix with colours, do it this way instead:
Imports:
from sklearn.metrics import confusion_matrix
import pandas as pd
import seaborn as sns; sns.set()
Use it like this:
# class_names is assumed to be a list of the class labels, in the same order as the matrix rows
cm = confusion_matrix(y_test, y_pred)
cmat_df = pd.DataFrame(cm, index=class_names, columns=class_names)
ax = sns.heatmap(cmat_df, square=True, annot=True, cbar=False)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
Hope for the best!
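A likely reason the bare `cm` line shows nothing is that a notebook only auto-displays the value of the last expression in a cell, not an expression inside a function body. A minimal sketch, assuming it is acceptable for the function to return the DataFrame; the helper name `confusion_df` and the toy labels are made up for illustration:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_df(y_true, y_pred):
    """Return the confusion matrix as a labelled DataFrame (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # Rows are the true classes, columns are the predicted classes
    return pd.DataFrame(cm, index=labels, columns=labels)


# Toy labels, invented for this example
y_true = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog"]

cm_df = confusion_df(y_true, y_pred)
print(cm_df)
```

In a notebook, ending the cell with `cm_df`, or calling `IPython.display.display(cm_df)` inside the function, should give the usual table rendering instead of the plain-text one.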
I would like to plot y_test against the predictions in a scatter plot.
I am using logistic regression as the model.
from sklearn.linear_model import LogisticRegression
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Spam'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=27)
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
pred_log = lr.predict(X_test)
I have tried the following:
## Plot the model
plt.scatter(y_test, pred_log)
plt.xlabel("True Values")
plt.ylabel("Predictions")
and I got a plot with only four points, which I do not think is what I should expect.
y_test has shape (250,), and pred_log also has shape (250,).
Am I plotting the wrong variables, or are they the right ones?
I have no idea what the plot with those four values means. I would have expected more dots in the plot, but maybe I am wrong.
Please let me know if you need more info. Thanks
LogisticRegression is a classification algorithm. For binary classification it predicts a class of either 0 or 1, so a scatter plot of y_test against the predictions can only show the four combinations (0, 0), (0, 1), (1, 0) and (1, 1), with all the samples stacked on top of each other. A scatter plot is therefore not a useful way to visualize classification results. To see how the model performs, consider a confusion matrix instead.
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, pred_log)
sns.heatmap(cm, annot=True)
The confusion matrix shows how many labels were predicted correctly and how many were not. From it you can calculate how accurate the model is, as well as other metrics such as precision, recall and the F1 score.
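As a rough illustration of how those metrics follow from the matrix, here is a small sketch for the binary case. The counts are invented, and the layout assumed is scikit-learn's, with true classes in rows and predicted classes in columns:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[50, 10],
               [5, 35]])

# For a binary problem the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all samples predicted correctly
precision = tp / (tp + fp)           # of the predicted positives, how many are real
recall = tp / (tp + fn)              # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In practice `sklearn.metrics.classification_report(y_test, pred_log)` should report the same quantities without the manual arithmetic.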
I am working on a lecture exercise, but I do not understand how to change the feature transformation from linear to Gaussian. I have searched for information but have not found anything suitable, and I am still new to Python. Below is a Python script I wrote with linear regression, and I want to transform the features with a Gaussian. Please help me write the script for the feature transformation.
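import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
# data is assumed to be a pandas DataFrame with 'time' and 'Cases' columns
df1=data[['time', 'Cases']]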
from sklearn.utils import shuffle
df1 = shuffle(df1,random_state=0)
X=np.array(df1[['time']])
y=np.array(df1[['Cases']])
X_train = X[:-10]
X_test = X[-10:]
y_train = y[:-10]
y_test = y[-10:]
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.scatter(X_train, y_train, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
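The question does not say exactly which Gaussian transformation is meant. One common reading is to replace the single input with Gaussian (radial basis) features and keep the linear model on top. A minimal sketch under that assumption, on synthetic data; `gaussian_features`, the number of centres and the width are illustrative choices, not something taken from the script above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def gaussian_features(x, centers, width):
    """Map a 1-D input to Gaussian bumps exp(-(x - c)^2 / (2 * width^2))."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-((x - centers.reshape(1, -1)) ** 2) / (2.0 * width ** 2))


# Synthetic stand-in for the 'time' and 'Cases' columns
rng = np.random.default_rng(0)
time = np.linspace(0, 10, 100)
cases = np.sin(time) + 0.1 * rng.normal(size=time.size)

centers = np.linspace(time.min(), time.max(), 10)  # assumed: 10 evenly spaced centres
width = centers[1] - centers[0]                    # assumed: width equal to the spacing

Phi = gaussian_features(time, centers, width)      # one column per Gaussian centre
model = LinearRegression().fit(Phi, cases)
r2 = model.score(Phi, cases)
print("R^2 on the training data:", r2)
```

The model is still linear in its coefficients; only the inputs change. New points have to go through the same `gaussian_features` call, with the same centres and width, before `model.predict`.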
I have to fit 40 time series in a vector autoregressive (VAR) model. The large number of variables suggests using a variable selection method. I would like to use LASSO, but I am using statsmodels for the fitting, and the only way I can find to apply LASSO with that library is for a linear regression model. Can someone help?
You can try fit_regularized. It is called the same way as when you fit an OLS model, and setting L1_wt to 1 makes it a lasso:
sm.OLS(..,..).fit_regularized(alpha=..,L1_wt=1)
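We can illustrate this with an example. First load the Boston dataset (note that load_boston was removed in scikit-learn 1.2, so this needs an older version or a different regression dataset):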
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
import numpy as np
import statsmodels.api as sm
scaler = StandardScaler()
data = load_boston()
data_scaled = scaler.fit_transform(data.data)
X_train, X_test, y_train, y_test = train_test_split(data_scaled, data.target, test_size=0.33, random_state=42)
The loop below shows that the two libraries behave similarly. Either way, you need to tune the shrinkage parameter alpha in your model:
alphas = [0.0001,0.001, 0.01, 0.1,0.2, 0.5, 1]
mse_sklearn = []
mse_sm = []
for a in alphas:
    clf = linear_model.Lasso(alpha=a)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    mse_sklearn.append(mean_squared_error(y_test, y_pred))
    mdl = sm.OLS(y_train,sm.add_constant(X_train)).fit_regularized(alpha=a,L1_wt=1)
    y_pred = mdl.predict(sm.add_constant(X_test))
    mse_sm.append(mean_squared_error(y_test, y_pred))
Visualize the results:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(alphas,mse_sm,label="sm")
ax.plot(alphas,mse_sklearn,label="sklearn")
ax.legend()
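The comparison above is for a plain regression. One way to carry the idea over to a VAR, sketched here as an assumption rather than as anything statsmodels provides, is to build the lagged regressors by hand and run one lasso per equation. The sketch swaps in scikit-learn's `Lasso` for the fitting; `lagged_design`, the synthetic data and the value of `alpha` are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso


def lagged_design(series, n_lags):
    """Turn a (T, k) array into lagged regressors (T - n_lags, k * n_lags) and targets."""
    T = series.shape[0]
    # Block for lag L holds the values at time t - L for every target time t
    blocks = [series[n_lags - lag:T - lag] for lag in range(1, n_lags + 1)]
    return np.hstack(blocks), series[n_lags:]


# Synthetic data: 5 noise series, where series 0 also depends on lag 1 of series 1
rng = np.random.default_rng(0)
T, k, n_lags = 500, 5, 2
data = rng.normal(size=(T, k))
for t in range(1, T):
    data[t, 0] += 0.8 * data[t - 1, 1]

X, Y = lagged_design(data, n_lags)

# One lasso regression per equation of the VAR; alpha=0.05 is an arbitrary choice
coefs = np.vstack([Lasso(alpha=0.05).fit(X, Y[:, j]).coef_ for j in range(k)])
print(np.round(coefs, 2))  # row j holds the lag coefficients of equation j
```

Coefficients shrunk to exactly zero mark lagged variables the lasso dropped from that equation. With 40 series, `alpha` would normally be chosen by cross-validation that respects the time ordering.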
I have been battling a problem with my MSE while predicting with regression. I have encountered the same problem with the different regression models I have tried to build.
The problem is that my MSE is humongous: 83661743.99 to be exact. My R squared is 0.91, which does not seem problematic.
I manually implemented the cost function and gradient descent while doing the coursework in Andrew Ng's Stanford ML classes, and there I got a reasonable cost function value. When I try to implement it with the scikit-learn library, though, the MSE is something else. I don't know what I have done wrong and I need some help checking it.
Here is the link to the dataset I used: https://www.kaggle.com/farhanmd29/50-startups
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')
#checking the level of correlations between the predictors and response
sns.heatmap(df.corr(), annot=True)
#Splitting the predictors from the response
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values
#Encoding the Categorical values
label_encoder_X = LabelEncoder()
X[:,3] = label_encoder_X.fit_transform(X[:,3])
#Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
#splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
model = LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_test)
#Cost Function
mse = mean_squared_error(y_test,pred)
mse
You scaled only the features with StandardScaler; the target y keeps its original units, so its values can be humongous. As desertnaut said, MSE is not scaled, so it can be huge simply because the target values are big. You can try normalizing the target with a MinMaxScaler to bring it into the range [0, 1], which makes the MSE correspondingly small.
I have come to understand the error of my ways. The MSE is 1/n (n being the number of samples) multiplied by the sum of the squared differences between the actual and predicted responses. Hence the reported error is in squared units of the response, not in the units I expected. What I should have looked at was the RMSE, which is the square root of the MSE. My predictions were off as well, and that was because I scaled my values: un-scaled X values gave me much better predictions. I will have to look into this more, as I do not understand why.
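To make the units argument concrete, here is a small sketch. The first value is the MSE reported in the question; the three targets and predictions further down are invented purely to show how rescaling the response changes the MSE:

```python
import numpy as np

mse = 83661743.99        # MSE reported in the question, in squared units of y
rmse = np.sqrt(mse)      # back in the units of y
print("RMSE:", round(rmse, 2))

# Made-up targets and predictions in dollars
y_true = np.array([100000.0, 150000.0, 200000.0])
y_pred = np.array([110000.0, 140000.0, 195000.0])

mse_dollars = np.mean((y_true - y_pred) ** 2)
# The same errors with y expressed in thousands of dollars
mse_thousands = np.mean((y_true / 1000 - y_pred / 1000) ** 2)

print("MSE in dollars^2:", mse_dollars)
print("MSE in (thousand dollars)^2:", mse_thousands)
```

Dividing the response by 1000 divides the MSE by a million while the quality of the fit stays the same, which is why a large MSE alone says little without the scale of y.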
How can I show the important features that contribute to the SVM model, along with the feature names?
My code is shown below.
First I imported the modules:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
Then I divided my data into features and target:
y = df_new[['numeric values']]
X = df_new.drop('numeric values', axis=1).values
Then I set up the pipeline:
steps = [('scalar', StandardScaler()),
         ('SVM', SVC(kernel='linear'))]
pipeline = Pipeline(steps)
Then I specified the hyperparameter space:
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}
I created the train and test sets:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=21)
Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline,param_grid = parameters,cv=5)
Fit to the training set
cv.fit(X_train,y_train.values.ravel())
Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
feature_importances = cv.best_estimator_.feature_importances_
The error message I get:
'Pipeline' object has no attribute 'feature_importances_'
What I understood is this: suppose you are building a model with 100 features and you want to know which features are more important and which are less. Is that the case?
If so, try univariate feature selection. It is a very basic method, and you can experiment with it on your data before moving on to more advanced methods. Sample code is provided by scikit-learn itself, and you can modify it to suit your requirements.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
###############################################################################
# import some data to play with
# The iris dataset
iris = datasets.load_iris()
# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target
###############################################################################
plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])
###############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function: the 10% most significant features
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
###############################################################################
# Compare to the weights of an SVM
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight', color='r')
clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)
svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()
plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection', color='b')
plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()
Code reference:
http://scikit-learn.org/0.15/auto_examples/plot_feature_selection.html
Note:
For each feature, this example plots the p-values from univariate feature selection alongside the corresponding weights of a linear SVM. The univariate selection picks out the informative features, and these are also the ones with larger SVM weights.
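As for the error in the question itself: a `Pipeline` has no `feature_importances_`, and neither does `SVC`. With `kernel='linear'`, though, the fitted `SVC` exposes `coef_`, and the step can be reached through `named_steps`. A minimal sketch on the iris data, assuming the mean absolute weight is an acceptable importance score (that aggregation is my choice, not something the library prescribes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("SVM", SVC(kernel="linear"))])
cv = GridSearchCV(pipeline, param_grid={"SVM__C": [1, 10, 100]}, cv=5)
cv.fit(X, y)

# Reach into the best pipeline for the SVC step; coef_ exists only for kernel="linear"
svc = cv.best_estimator_.named_steps["SVM"]
# One row of weights per one-vs-one classifier; average their magnitudes per feature
importance = np.abs(svc.coef_).mean(axis=0)

for name, score in sorted(zip(feature_names, importance), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```

Because the features are standardized before the SVM, the weight magnitudes are at least comparable across features. For non-linear kernels `coef_` is not available, and something like `sklearn.inspection.permutation_importance` would be needed instead.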