Prediction techniques using scikit-learn (Polynomial regression) - python

I am trying to run a first example using sklearn:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
X = [[0.44, 0.68], [0.99, 0.23]]
vector = [109.85, 155.72]
predict= [0.49, 0.18]
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
predict_ = poly.fit_transform(predict)
clf = linear_model.LinearRegression()
clf.fit(X_, vector)
print clf.predict(predict_)
But I get these errors:
/usr/lib/python2.7/dist-packages/scipy/sparse/csgraph/__init__.py:148:
RuntimeWarning: numpy.dtype size changed, may indicate binary
incompatibility
from ._shortest_path import shortest_path, floyd_warshall, dijkstra,\
/usr/lib/python2.7/dist-packages/scipy/sparse/csgraph/_validation.py:5:
RuntimeWarning: numpy.dtype size changed, may indicate binary
incompatibility
File "hi.py", line 1, in <module>
from sklearn.preprocessing import PolynomialFeatures
ImportError: cannot import name PolynomialFeatures
python -V --> 2.7.6
Please, how can I deal with these errors?
Bests.

First check which sklearn version you have installed:
import sklearn
print('Version {}.'.format(sklearn.__version__))
For me it shows:
Version 0.17.1.
Then check (in the PolynomialFeatures documentation) which version introduced PolynomialFeatures, and upgrade if yours is older. If your version is 0.14.1 or below, you will get this error. For details on how to upgrade, see the question "Not able to import PolynomialFeatures, make_pipeline in Scikit-learn" and the official installation guide: http://scikit-learn.org/stable/install.html
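Once the import works, the example from the question should run with two small changes. The following is a minimal sketch, assuming Python 3 and a scikit-learn release that already ships PolynomialFeatures: the sample to predict is passed as a 2-D array (one row per sample), and it is expanded with transform() so that it reuses the feature layout learned from the training data. The variable names x_new, X_poly and model are chosen here for illustration.

```python
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

print("scikit-learn version:", sklearn.__version__)

X = np.array([[0.44, 0.68], [0.99, 0.23]])
y = np.array([109.85, 155.72])
# A single sample must still be 2-D: shape (1, n_features).
x_new = np.array([[0.49, 0.18]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)      # learn the feature layout on the training data
x_new_poly = poly.transform(x_new)  # reuse that layout for the new sample

model = LinearRegression()
model.fit(X_poly, y)
prediction = model.predict(x_new_poly)
print(prediction)
```

With only two training rows and six polynomial features the fit is underdetermined, so the predicted value is only a smoke test that the pipeline runs, not a meaningful estimate.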

Related

How to mlflow-autolog a sklearn ConfusionMatrixDisplay?

I'm trying to log the plot of a confusion matrix generated with scikit-learn for a test set using mlflow's support for scikit-learn.
For this, I tried something that resembles the code below (I'm using MLflow hosted on Databricks, and sklearn==1.0.1):
import sklearn.datasets
import pandas as pd
import numpy as np
import mlflow
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/name.surname/plotcm")
data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data = np.c_[data['data'], data['target']])\
.rename({0:'text', 1:'class'}, axis = 'columns')
train, test = train_test_split(df)
my_pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('classifier', SGDClassifier(loss='modified_huber')),
])
mlflow.sklearn.autolog()
from sklearn.metrics import ConfusionMatrixDisplay # should I import this after the call to `.autolog()`?
my_pipeline.fit(train['text'].values, train['class'].values)
cm = ConfusionMatrixDisplay.from_predictions(
y_true=test["class"], y_pred=my_pipeline.predict(test["text"])
)
While the confusion matrix for the training set is saved in my MLflow run, no PNG file is created in the MLflow frontend for the test set.
If I try to add
cm.figure_.savefig('test_confusion_matrix.png')
mlflow.log_artifact('test_confusion_matrix.png')
that does the job, but requires explicitly logging the artifact.
Is there an idiomatic/proper way to autolog the confusion matrix computed using a test set after my_pipeline.fit()?
The proper way to do this is to use mlflow.log_figure, a fluent API introduced in MLflow 1.13.0 and described in the MLflow documentation. This code will do the job:
mlflow.log_figure(cm.figure_, 'test_confusion_matrix.png')
This function implicitly stores the image and then calls log_artifact on that path, much like what you did.

print(sklearn.__version__) NameError: name 'sklearn' is not defined

from sklearn.datasets import make_blobs
# Generate our dataset
dataset = make_blobs(n_samples=200,centers=4,n_features=2,cluster_std=1.6,random_state=50)
points = dataset[0]
## print(dataset)
from sklearn.cluster import KMeans
print(sklearn.__version__)
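Isn't it possible to check the sklearn version with print(sklearn.__version__)? Unfortunately, I get an error saying that name 'sklearn' is not defined.
You need to import sklearn itself as well. The from ... import statements above bind only the imported names (make_blobs, KMeans), not the name sklearn.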
import sklearn
print(sklearn.__version__)
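The name-binding behaviour behind this NameError can be checked without scikit-learn at all. The sketch below is only an illustration: it uses the standard-library package collections as a stand-in for sklearn and runs each import statement in a fresh, empty namespace to show which names end up defined.

```python
# Run each import in a fresh, empty namespace to see which names it binds.
from_import_ns = {}
exec("from collections import OrderedDict", from_import_ns)

plain_import_ns = {}
exec("import collections", plain_import_ns)

# "from package import name" binds only `name`, not `package`.
print("collections" in from_import_ns)
print("OrderedDict" in from_import_ns)

# "import package" binds the package name itself.
print("collections" in plain_import_ns)
```

The same applies to from sklearn.datasets import make_blobs: make_blobs becomes available, while sklearn stays undefined until import sklearn is executed.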

Perceptron in Python

I'm using the sklearn library and have a question about the attribute n_iter_. When I run the code I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried n_iter and got the same error, so maybe I am misspelling the attribute. This is not the whole code; let me know if you need more information.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn= Perceptron(n_iter_=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model does not have n_iter_ as a parameter. It has the following parameters with similar names:
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
From your code, it looks like you intended to use max_iter.
So write:
ppn=Perceptron(max_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
Note:
You should first upgrade scikit-learn using
pip install --upgrade scikit-learn
The parameter given in the documentation is n_iter, not n_iter_.
So on a scikit-learn version that still accepts n_iter, this should work:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn=Perceptron(n_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which Scikit-learn version you have installed. You can do that by executing
python -c "import sklearn;print(sklearn.__version__)"
in a terminal, using the same Python environment that runs your code.
The Perceptron constructor parameters changed from n_iter to max_iter in version 0.20. The best way to keep up is to read the documentation or source code for the version you have installed, e.g.:
documentation: Perceptron docs v0.23
source code: Perceptron v0.23 code
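To tie the answers together, here is a minimal sketch assuming a scikit-learn release that accepts max_iter. The synthetic dataset from make_classification is an assumption made only so that the example is self-contained; the question's X_train_std and y_train come from code that was not shown. It illustrates that max_iter is the constructor parameter, while n_iter_ (with the trailing underscore) is an attribute that exists only after fit().

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed here only so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Standardize using statistics from the training split only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# max_iter is the upper bound on epochs, passed to the constructor.
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)

# n_iter_ is set by fit() and reports how many epochs actually ran.
print("epochs actually run:", ppn.n_iter_)
print("test accuracy:", ppn.score(X_test_std, y_test))

# Passing the fitted-attribute name to the constructor is rejected.
rejected = False
try:
    Perceptron(n_iter_=40)
except TypeError as exc:
    rejected = True
    print("TypeError:", exc)
```

In scikit-learn, names ending in an underscore are by convention fitted attributes, which is why n_iter_ can be read after training but never passed to __init__.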

catboost shows very bad result on a toy dataset

Today I tried out the CatBoost library recently published by Yandex, but it shows very poor results even on a toy dataset. I've tried to find the root of the problem, but due to the lack of proper documentation and discussion about the library I can't figure out what's going on. Please help me =)
I'm using Anaconda 3 x64 with Python 3.6.
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, make_scorer
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
X,y = make_classification( n_classes=2
,n_clusters_per_class=2
,n_features=10
,n_informative=4
,n_repeated=2
,shuffle=True
,random_state=564
,n_samples=10000
)
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.8)
cb = CatBoostClassifier(depth=3,custom_loss=
['Accuracy','AUC'],
logging_level='Silent',
iterations=500,
od_type='Iter',
od_wait=20)
cb.fit(X_train,y_train,eval_set=(X_test,y_test),plot=True,use_best_model=True)
pred = cb.predict_proba(X_test)[:,1]
# roc_curve returns (fpr, tpr, thresholds), in that order
fpr,tpr,_=roc_curve(y_score=pred,y_true=y_test)
#just to show the difference
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier().fit(X_train,y_train)
pred_gbc = gbc.predict_proba(X_test)[:,1]
fpr_gbc,tpr_gbc,_=roc_curve(y_score=pred_gbc,y_true=y_test)
plt.plot(fpr,tpr,color='orange')
plt.plot(fpr_gbc,tpr_gbc,color='red')
plt.show()
This was a bug in CatBoost, and it was fixed in version 0.6.1. Make sure you are using the latest version.

scikit-learn GridSearchCV doesn't work as samples increase

The following script runs fine on my machine with n_samples=1000, but dies (no error, just stops working) with n_samples=10000. This only happens using the Anaconda python distribution (numpy 1.8.1) but is fine with Enthought's (numpy 1.9.2). Any ideas what would be causing this?
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np
X, y = datasets.make_classification(n_samples=10000, n_features=50,
n_informative=35, n_redundant=10,
random_state=1984)
lr = LogisticRegression(random_state=1984)
param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer, cv=kf, verbose=100,
n_jobs=-1)
gs.fit(X, y)
Note: I'm using sklearn 0.16.1 in both distributions and am using OS X.
I've noticed that upgrading to numpy version 1.9.2 with Enthought distribution (by updating manually) breaks the grid search. I haven't had any luck downgrading Anaconda numpy version to 1.8.1 though.
Are you on Windows? If so, you need to protect the code with
if __name__ == "__main__":
    do_stuff()
Otherwise multiprocessing will not work.
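As a rough sketch of that guard applied to the question's script: the version below assumes a recent scikit-learn, where GridSearchCV and KFold live in sklearn.model_selection and the log-loss scorer is selected with the string "neg_log_loss". The function name run_search, the smaller n_samples and the max_iter=1000 setting are choices made here for illustration, not part of the original script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold


def run_search(n_jobs):
    """Build the data and run the grid search with the given parallelism."""
    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=35, n_redundant=10,
                               random_state=1984)
    param_grid = {"C": np.logspace(-1, 2, 4, base=2)}
    cv = KFold(n_splits=5, shuffle=True, random_state=1984)
    search = GridSearchCV(LogisticRegression(random_state=1984, max_iter=1000),
                          param_grid, scoring="neg_log_loss", cv=cv,
                          n_jobs=n_jobs)
    search.fit(X, y)
    return search


# Worker processes may re-import this module; the guard keeps them from
# starting the search again on import.
if __name__ == "__main__":
    search = run_search(n_jobs=-1)
    print(search.best_params_)
```

Keeping all the work inside a function means that importing the file has no side effects, which is what process-based parallelism needs on platforms that start workers by re-importing the main module.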
Per Andreas's comment, the problem seems to be with multithreading in the linear algebra library. I solved it with the following command in the terminal:
export VECLIB_MAXIMUM_THREADS=1
My (weak) understanding is that this limits the linear algebra library's use of multiple threads and leaves the parallelism to multiprocessing.
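The same setting can probably be applied from inside the script instead of the shell. This is a minimal sketch, under the assumption that the linear algebra library reads the variable when it is first loaded, so the assignment has to happen before NumPy is imported. VECLIB_MAXIMUM_THREADS is the variable named above; NumPy builds linked against other libraries may need a different one, such as OMP_NUM_THREADS or OPENBLAS_NUM_THREADS.

```python
import os

# Assumption: this must run before NumPy (and the linear algebra library
# behind it) is imported for the first time in this process.
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (deliberately imported after the env change)

# A small matrix product that goes through the linear algebra backend.
a = np.ones((100, 100))
product = a @ a
print(product[0, 0])
```

Child processes started afterwards inherit the environment, so the limit should also apply to the workers that GridSearchCV launches with n_jobs=-1.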
