SMOTE function not working in make_pipeline - python

I want to apply cross-validation and over-sampling at the same time.
I get the following error from this code:
from sklearn.pipeline import Pipeline, make_pipeline
imba_pipeline = make_pipeline(SMOTE(random_state=42),
LogisticRegression(C=3.4))
cross_val_score(imba_pipeline, X_train_tf, y_train, scoring='f1_weighted', cv=kf)
All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1,
out_step='deprecated', random_state=42, ratio=None,
sampling_strategy='auto', svm_estimator='deprecated')' (type ) doesn't
PS. I get the same error using imblearn.over_sampling.RandomOverSampler rather than SMOTE.

Import make_pipeline from imblearn.pipeline, not from sklearn.pipeline. The scikit-learn make_pipeline requires every intermediate step to implement both fit and transform, but SMOTE is a sampler: it implements fit_resample and has no transform method. The same applies to RandomOverSampler.

How to mlflow-autolog a sklearn ConfusionMatrixDisplay?

I'm trying to log the plot of a confusion matrix generated with scikit-learn for a test set, using mlflow's support for scikit-learn.
For this, I tried something resembling the code below (I'm using mlflow hosted on Databricks, and sklearn==1.0.1):
import sklearn.datasets
import pandas as pd
import numpy as np
import mlflow
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/name.surname/plotcm")
data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data = np.c_[data['data'], data['target']])\
.rename({0:'text', 1:'class'}, axis = 'columns')
train, test = train_test_split(df)
my_pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('classifier', SGDClassifier(loss='modified_huber')),
])
mlflow.sklearn.autolog()
from sklearn.metrics import ConfusionMatrixDisplay # should I import this after the call to `.autolog()`?
my_pipeline.fit(train['text'].values, train['class'].values)
cm = ConfusionMatrixDisplay.from_predictions(
y_true=test["class"], y_pred=my_pipeline.predict(test["text"])
)
While the confusion matrix for the training set is saved in my mlflow run, no PNG file is created in the mlflow frontend for the test set.
If I try to add
cm.figure_.savefig('test_confusion_matrix.png')
mlflow.log_artifact('test_confusion_matrix.png')
that does the job, but requires explicitly logging the artifact.
Is there an idiomatic/proper way to autolog the confusion matrix computed using a test set after my_pipeline.fit()?
The proper way to do this is to use mlflow.log_figure, a fluent API introduced in MLflow 1.13.0. You can read the documentation here. This code will do the job:
mlflow.log_figure(cm.figure_, 'test_confusion_matrix.png')
This function saves the figure to a temporary file and then calls log_artifact on that path, much like what you did by hand.

NotFittedError: This KNeighborsClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator

How do I fix this error?
from sklearn.datasets import load_iris
import os
import math
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import csv
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
iris=load_iris()
print(iris.keys())
print(iris['target'].shape)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(iris['data'],iris['target'],random_state=0)
print(y_train.shape)
print(X_train.shape)
iris_dataframe=pd.DataFrame(X_train,columns=iris.feature_names)
knn=KNeighborsClassifier(n_neighbors=1)
y_pred = knn.predict(X_train)  # error: NotFittedError: This KNeighborsClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
What is causing this error?
Please help, I don't know how to fix this.
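You'll want to start by fitting your k-nearest-neighbors model on the training data: knn.fit(X_train, y_train). (KNeighborsClassifier is a supervised classifier, not K-means; fitting stores the labeled training samples rather than computing cluster centers.)
You can then use it to classify your testing data, where each prediction comes from the nearest stored training samples: y_pred = knn.predict(X_test).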
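Put together, the fix can be sketched as below. It reuses the iris split from the question; the accuracy variable is an addition for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)

knn = KNeighborsClassifier(n_neighbors=1)

# fit() must be called before predict(); skipping it raises NotFittedError.
knn.fit(X_train, y_train)

# Predict on held-out data, not on the data the model was fitted on.
y_pred = knn.predict(X_test)

# Fraction of test samples classified correctly.
accuracy = knn.score(X_test, y_test)
print(accuracy)
```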

GridSearchCV example from the scikit-learn user guide gives an error

I was trying to run the code from the grid search section of the scikit-learn user guide, but it gives an error, which surprised me.
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X,y=make_moons()
calibrated_forest=CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=10))
paramgrid={'base_estimator_max_depth':[2,4,6,8]}
search=GridSearchCV(calibrated_forest,paramgrid,cv=5)
search.fit(X,y)
Error message as below:
ValueError: Invalid parameter base_estimator_max_depth for estimator CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=10)). Check the list of available parameters with `estimator.get_params().keys()`.
I tried with the Iris dataset, which gave the same error as above.
Then I used the make_moons dataset X, y and ran the random forest classifier directly, as below.
clf = RandomForestClassifier(n_estimators=10, max_depth=2)
cross_val_score(clf, X, y, cv=5)
Got the output as below.
array([0.8 , 0.8 , 0.9 , 0.95, 0.95])
This looks strange, and I'm not sure what is happening or where I am going wrong. Any help would be appreciated.
Note the double underscore __ between base_estimator and the parameter name; nested parameters are addressed as <component>__<parameter>:
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X,y=make_moons()
calibrated_forest=CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=10))
paramgrid={'base_estimator__max_depth':[2,4,6,8]}
search=GridSearchCV(calibrated_forest,paramgrid,cv=5)
search.fit(X,y)
GridSearchCV(cv=5,
estimator=CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_estimators=10)),
param_grid={'base_estimator__max_depth': [2, 4, 6, 8]})
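The naming rule can be checked against get_params() before running the search, as the error message suggests. The sketch below is an illustration, not the user guide's code: CalibratedClassifierCV renamed base_estimator to estimator in newer scikit-learn releases, so the variable inner (a name introduced here) picks whichever keyword the installed version accepts.

```python
import inspect

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_moons(random_state=0)

# Pick the keyword the installed version accepts for the wrapped estimator
# (assumption: it is either `estimator` or the older `base_estimator`).
init_params = inspect.signature(CalibratedClassifierCV).parameters
inner = "estimator" if "estimator" in init_params else "base_estimator"

forest = RandomForestClassifier(n_estimators=10, random_state=0)
calibrated_forest = CalibratedClassifierCV(**{inner: forest})

# Nested parameters are spelled <component>__<parameter>, with two underscores.
param_name = inner + "__max_depth"
valid_names = calibrated_forest.get_params().keys()
print(param_name in valid_names)

param_grid = {param_name: [2, 4, 6, 8]}
search = GridSearchCV(calibrated_forest, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

With a single underscore, the name is not among the keys returned by get_params(), which is exactly what the ValueError reports.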

Perceptron in Python

I'm using the sklearn library and have a question about the attribute n_iter_. When I run the code I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried n_iter but got the same error, or maybe I am misspelling the attribute. This is not all of the code; if you need more information, let me know.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn= Perceptron(n_iter_=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model doesn't have n_iter_ as a parameter. It has the following parameters with similar names:
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
From your code, it looks like you intended to use max_iter.
So use:
ppn=Perceptron(max_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
Note:
You should first upgrade your scikit-learn installation using
pip install --upgrade scikit-learn
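In older scikit-learn versions, the constructor parameter given in the documentation is n_iter, not n_iter_. A trailing underscore is the naming convention scikit-learn uses for attributes set during fitting, not for constructor arguments.
So this should work on those older versions: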
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn=Perceptron(n_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which Scikit-learn version you have installed. You can do that by executing
python -c "import sklearn;print(sklearn.__version__)"
in the terminal or environment whose Python interpreter runs your code.
The Perceptron constructor parameter n_iter was replaced by max_iter: max_iter was added and n_iter deprecated in version 0.19, and n_iter was removed in 0.21. The best way to keep up is to go to the documentation or source code for your installed version and read the parameter list, e.g.
documentation: Perceptron docs v0.23
source code: Perceptron v0.23 code
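To make the parameter versus attribute distinction from the answers above concrete, here is a small sketch. It assumes a scikit-learn version that accepts max_iter, and it uses the iris data with a stratified split as a stand-in for the asker's X_train_std and y_train, which are not shown.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption: the asker's own dataset is not shown).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# n_iter_ is not a constructor argument, so this raises TypeError.
try:
    Perceptron(n_iter_=40)
    rejected = False
except TypeError:
    rejected = True
print(rejected)

# max_iter caps the number of passes over the training data.
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

# After fitting, n_iter_ reports how many passes were actually run.
print(ppn.n_iter_)
print(ppn.score(X_test_std, y_test))
```

If the model stops only because it hit max_iter, scikit-learn may emit a ConvergenceWarning; that is a warning, not an error.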

Python 2.7.13: RandomForestClassifier could not be imported from the other file

In my project I am trying to move the different classifiers I have implemented into separate files (to make the code less messy).
Unfortunately, when I import the file containing the RandomForestClassifier, the program neither finishes executing nor shows an error. The code is provided here:
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
def train_rfc(X,y):
    n_estimators = [100]
    min_samples_split = [2]
    min_samples_leaf = [1]
    bootstrap = [True]
    parameters = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf,
                  'min_samples_split': min_samples_split}
    clf = GridSearchCV(RandomForestClassifier(verbose=1,n_jobs=-1), cv=4, param_grid=parameters)
    print "=="
    clf.fit(X, y)
    print "++"
    return clf
rfc_clf = train_rfc(corpus,data["target"])
print ("Accuracy of RF on CV sets :{}".format(rfc_clf.best_score_))
As far as I was able to figure out, the problem is most likely somewhere in the random forest, in the clf.fit(X, y) call to be exact, since the program does nothing further once it reaches this point.
I had no problem running the other, similarly implemented SVM classifier that also used GridSearchCV, and was also inside of a function.
I would really appreciate any help on that.
