I'm using the sklearn library and I have a question about the attribute n_iter_. When executing the code below I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried using n_iter but I get the same error, so maybe I am misspelling the attribute. This is not all the code; if you need more information, let me know.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn = Perceptron(n_iter_=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model doesn't have n_iter_ as a parameter. It has the following parameters with similar names.
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
By looking at your code it looks like you intended to use max_iter.
So do
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
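As an aside, n_iter_ (with the trailing underscore) does exist on Perceptron, but as a fitted attribute rather than a constructor argument, so it can only be read after calling fit. A minimal sketch, assuming the X_train_std and y_train from your code:
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
print(ppn.n_iter_)  # the actual number of epochs the fit ran for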
Note:
You should first upgrade your scikit-learn using
pip install --upgrade scikit-learn
In older scikit-learn versions (before the parameter was renamed to max_iter), the parameter given in the documentation is n_iter and not n_iter_.
So this should work:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn = Perceptron(n_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which scikit-learn version you have installed. You can do that by executing
python -c "import sklearn; print(sklearn.__version__)"
in the terminal/environment that runs the Python interpreter executing your code.
The Perceptron constructor parameters changed from n_iter to max_iter in version 0.20. The best way to keep up is to head to the documentation or source code of the version you actually have installed and read the parameters, e.g.
documentation: perceptron docs v.0.23
source code: perceptron v.0.23 code
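If you need a snippet that runs under either version, a minimal sketch (my own suggestion, not from the docs) is to inspect the constructor and use whichever parameter name exists:
import inspect
from sklearn.linear_model import Perceptron

params = inspect.signature(Perceptron.__init__).parameters
if "max_iter" in params:  # releases where the parameter was renamed
    ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
else:  # older releases still expose n_iter
    ppn = Perceptron(n_iter=40, eta0=0.1, random_state=1)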
I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.
My objective function converts the outputs to a Spark DataFrame using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created; I'd prefer not to have to rewrite this).
However, this causes an error when attempting to use HyperOpt with SparkTrials, as the SparkContext used to create the DataFrame "should only be created or accessed on the driver". Is there any way I can create a Spark DataFrame in my objective function here?
For a reproducible example:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
from pyspark.sql import SparkSession
# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line.
import mlflow
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target
def objective(C):
    # Create a support vector classifier model
    clf = SVC(C)
    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
search_space = hp.lognormal('C', 0, 1.0)
algo=tpe.suggest
# THIS WORKS (It's not using SparkTrials)
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16)
from hyperopt import SparkTrials
spark_trials = SparkTrials()
# THIS FAILS
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)
I have tried looking at this question, but it is solving a different problem and I can't see an obvious way to apply it to my situation:
How can I get the current SparkSession in any place of the codes?
I think the short answer is that it's not possible. The SparkContext can only exist on the driver node, and creating a new instance inside a task would be a kind of nesting; see this related question:
Nesting parallelizations in Spark? What's the right approach?
I solved my problem in the end by rewriting the transformations in pandas, which then worked.
If the transformations are too big for a single node then you'd probably have to pre-compute them and let hyperopt choose which version as part of the optimisation.
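For illustration, a rough sketch of the pandas rewrite mentioned above, applied to the reproducible example from the question; the tiny pandas DataFrame here is only a stand-in for whatever preprocessing previously needed a Spark DataFrame:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import STATUS_OK

iris = load_iris()
X, y = iris.data, iris.target

def objective(C):
    clf = SVC(C)
    # Do the per-trial data handling in pandas on the worker instead of
    # creating a SparkSession / Spark DataFrame there.
    pdf = pd.DataFrame([('Alice', 1)], columns=['name', 'value'])
    accuracy = cross_val_score(clf, X, y).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}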
I am running GridSearchCV on a small dataset, whose size is (13380, 37), and the code is as below:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.svm import SVC
parameters = {'kernel': ('linear', 'rbf', 'poly'), 'C': [1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'gamma': [1e-7, 1e-6, 1e-4, 1e-3, 1e-2]}
svc = SVC()
clf = GridSearchCV(svc, parameters, n_jobs=38)
search = clf.fit(X_train,y_train)
search.best_params_
It has been running for more than a day, but with the same parameters on the iris dataset it gives the result in about a minute. The data is standardized and I am using multiprocessing too. Am I missing anything here?
I think the problem is with the n_jobs value you indicated. Change it from 38 to -1 and that should do the job fast enough.
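For reference, a minimal sketch of that change applied to the snippet from the question (same svc and parameters as above); the verbose argument is optional but lets you see that the search is actually making progress:
clf = GridSearchCV(svc, parameters, n_jobs=-1, verbose=2)  # use all available cores
search = clf.fit(X_train, y_train)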
In my project I am trying to separate the different classifier implementations into separate files (in order to make the code less messy).
Unfortunately, when I import the file containing the RandomForestClassifier, the program neither stops its execution nor shows an error. The code is provided here:
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
def train_rfc(X, y):
    n_estimators = [100]
    min_samples_split = [2]
    min_samples_leaf = [1]
    bootstrap = [True]
    parameters = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf,
                  'min_samples_split': min_samples_split}
    clf = GridSearchCV(RandomForestClassifier(verbose=1, n_jobs=-1), cv=4, param_grid=parameters)
    print "=="
    clf.fit(X, y)
    print "++"
    return clf
rfc_clf = train_rfc(corpus,data["target"])
print ("Accuracy of RF on CV sets :{}".format(rfc_clf.best_score_))
As far as I was able to figure out, the problem is most likely somewhere in the RandomForest, in the clf.fit(X, y) call to be exact, since the program does not do anything as soon as it reaches this point.
I had no problem running the other, similarly implemented SVM classifier that also used GridSearchCV, and was also inside of a function.
I would really appreciate any help on that.
I have the following code:
from xgboost import XGBClassifier
print(df_train.shape)
print(df_train_labels.shape)
clf = clf.fit(df_train, df_train_labels, verbose=True)
print("after fit")
Here df_train and df_train_labels are pandas objects which I read from a CSV.
The above code prints:
(1460, 7)
(1460,)
However, nothing else is printed for 10 minutes, which means the code is stuck at clf.fit.
As you can see, there are only 1460 examples, so I'm assuming the algorithm shouldn't spend a long time on this.
Furthermore, since I passed verbose=True, I would have expected the model to print some output, but that is not happening.
Any idea why there is no output printed and why XGBClassifier takes such a long time?
Looking at the documentation:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
It looks like the parameter for printing the progress is called verbosity.
Set it to anything from 0-3 (3 for debug).
from xgboost import XGBClassifier
model = XGBClassifier()
# verbosity levels: 0 (silent), 1 (warning), 2 (info), 3 (debug)
setattr(model, 'verbosity', 2)
You should first initialize the model you want to use, in your case XGBClassifier().
A common pipeline to do so is:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(df_train, df_train_labels)
y_pred = model.predict(df_test)
The following script runs fine on my machine with n_samples=1000, but dies (no error, it just stops working) with n_samples=10000. This only happens using the Anaconda Python distribution (numpy 1.8.1) but is fine with Enthought's (numpy 1.9.2). Any ideas what could be causing this?
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np
X, y = datasets.make_classification(n_samples=10000, n_features=50,
                                    n_informative=35, n_redundant=10,
                                    random_state=1984)
lr = LogisticRegression(random_state=1984)
param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer, cv=kf, verbose=100,
                  n_jobs=-1)
gs.fit(X, y)
Note: I'm using sklearn 0.16.1 in both distributions and am using OS X.
I've noticed that upgrading to numpy version 1.9.2 with Enthought distribution (by updating manually) breaks the grid search. I haven't had any luck downgrading Anaconda numpy version to 1.8.1 though.
Are you on Windows? If so, you need to protect the code with
if __name__ == "__main__":
    do_stuff()
Otherwise multiprocessing will not work.
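As a rough illustration, assuming the same imports and objects defined in the script above, the guard would wrap the part that triggers the parallel work:
# With n_jobs=-1 the search spawns worker processes that re-import this
# module, so the fit must only run in the main process.
if __name__ == "__main__":
    gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer,
                      cv=kf, verbose=100, n_jobs=-1)
    gs.fit(X, y)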
Per Andreas's comment, the problem seems to be with multithreading in the linear algebra library. I solved it with the following command in the terminal:
export VECLIB_MAXIMUM_THREADS=1
My (weak) understanding is that this limits the linear algebra library's use of multiple threads and lets multiprocessing handle multithreading as it wants.
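If you prefer to set it from Python rather than the shell, a hedged equivalent is to export the variable from the script itself; it should be set before numpy/scikit-learn are imported, since the threading limit is picked up when the library loads:
import os
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # limit Apple Accelerate/vecLib to one thread on OS X

# import numpy / scikit-learn only after the variable is set
from sklearn.linear_model import LogisticRegression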