Why does the accuracy of the model change? - python

I'm kinda new to ML, and there is something I wonder about. When I use 'random_state=10' the variables stay the same and the accuracy of the model doesn't change, which is fine. But when I don't use it, the variables change and so does the accuracy. The variables are different now, but they still come from the same data frame, so I thought the accuracy would stay the same. Is that how things work in ML, or am I missing something? Here is my code.
X = df[["Mileage", "Age(yrs)"]]
y = df["Sell Price($)"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(X_train, y_train)
clf.predict(X_test)
>>>array([ 38014.9266005 , 14240.40458389, 33695.58936258, 29870.44475795])
y_test
>>>3 40000
8 12000
1 34000
4 31500
clf.score(X_test,y_test)
>>>0.97343231831177046

The random_state you mention is an argument of train_test_split. The split is random: every time you call the function without random_state, the rows are shuffled differently, so the train and test sets contain different samples, the model is fitted on different data, and the score changes. Passing a fixed random_state (any integer works; 42 is just a common convention) makes the split reproducible, so the train and test data are split in exactly the same way on every run. This is useful when you want somebody else to reproduce your results or want to keep the same split every time. I suggest you set random_state to a fixed value such as 42.
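As a minimal sketch of the difference (assuming the same df as in the question), comparing a fixed split with an unfixed one:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df[["Mileage", "Age(yrs)"]]  # assumes the same df as in the question
y = df["Sell Price($)"]

# Fixed seed: the same rows end up in the test set on every run,
# so the fitted coefficients and the score are reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
score_fixed = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# No seed: a different shuffle on each run, hence a (slightly) different score.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
score_random = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

print(score_fixed, score_random)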

Related

How do you create a spark dataframe on a worker node when using HyperOpt and SparkTrials?

I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.
My objective function converts the outputs to a Spark dataframe using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created - I'd prefer not to have to rewrite this).
However, this causes an error when attempting to use HyperOpt and SparkTrials, as the SparkContext used to create the dataframe "should only be created or accessed on the driver". Is there any way I can create a Spark dataframe in my objective function here?
For a reproducible example:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
from pyspark.sql import SparkSession
# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line.
import mlflow
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
search_space = hp.lognormal('C', 0, 1.0)
algo = tpe.suggest

# THIS WORKS (it's not using SparkTrials)
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16)

from hyperopt import SparkTrials
spark_trials = SparkTrials()

# THIS FAILS
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)
I have tried looking at this, but it is solving a different problem - I can't see an obvious way to apply it to my situation.
How can I get the current SparkSession in any place of the codes?
I think the short answer is that it's not possible. The SparkContext can only exist on the driver node, and creating a new instance inside the objective function would be a form of nested parallelism; see this related question:
Nesting parallelizations in Spark? What's the right approach?
In the end I solved my problem by rewriting the transformations in pandas, which then worked.
If the transformations are too big for a single node, then you'd probably have to pre-compute them and let hyperopt choose which pre-computed version to use as part of the optimisation.
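For illustration, a minimal sketch of that workaround; preprocess_pandas is a hypothetical stand-in for the preprocessing mentioned above, and the only point is that nothing inside the objective touches Spark:
import pandas as pd
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

def preprocess_pandas(X):
    # Hypothetical replacement for the Spark-based preprocessing:
    # everything stays in pandas/numpy, so no SparkSession is needed on the workers.
    return pd.DataFrame(X).fillna(0).values

def objective(C):
    clf = SVC(C=C)
    X_prep = preprocess_pandas(X)
    accuracy = cross_val_score(clf, X_prep, y).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}

# With no Spark objects created inside objective, SparkTrials can ship it to the workers.
argmin = fmin(
    fn=objective,
    space=hp.lognormal('C', 0, 1.0),
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials())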

CatBoost -- suppressing iteration results in a grid search

I am trying to use the CatBoost classifier. With it I perform a grid search using the randomized_search() method. Unfortunately, the method prints iteration results to stdout for every tree built, for every model tried.
There is a parameter that is supposed to control this: verbose. Ideally verbose could be set to False to inhibit all stdout prints, or set to an integer specifying an interval between the models that are reported (models, not trees).
Do you know how to control this? I get millions of lines in my log files...
This question is somewhat related to How to suppress CatBoost iteration results?, but that one relates to the fit() method, which has logging_level and silent parameters as well. Another method, the cv() cross-validation, responds to logging_level='Silent', cutting out all output.
Setting both logging_level='Silent' when instantiating the model and verbose=False when running the random search should suppress all outputs.
import catboost
from sklearn.datasets import make_classification
from scipy import stats

# generate some data
X, y = make_classification(n_features=10)

# instantiate the model with logging_level='Silent'
model = catboost.CatBoostClassifier(iterations=1000, logging_level='Silent')
pool = catboost.Pool(X, y)

parameters = {
    'learning_rate': stats.uniform(0.01, 0.1),
    'depth': stats.binom(n=10, p=0.2)
}

# run random search with verbose=False
randomized_search_results = model.randomized_search(
    parameters,
    pool,
    n_iter=10,
    shuffle=False,
    plot=False,
    verbose=False,
)
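For what it's worth, randomized_search returns a plain dict; on the CatBoost versions I've used the best parameters sit under a 'params' key, but treat that key name as an assumption to verify against your version:
# assumed key name; check against your CatBoost version
print(randomized_search_results['params'])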

Perceptron in Python

I'm using the sklearn library and have a question about the attribute n_iter_. When executing the code I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried using n_iter but I get the same error, or maybe I am misspelling the attribute. This is not all the code; if you need more information, let me know.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn= Perceptron(n_iter_=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model doesn't have n_iter_ as a parameter. It has the following parameters with similar names.
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
By looking at your code it looks like you intended to use max_iter.
So do
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
Note:
You should first upgrade your scikit-learn using
pip install --upgrade scikit-learn
In older scikit-learn versions, the parameter given in the documentation is n_iter and not n_iter_ (it was later replaced by max_iter).
So on those versions this should work:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn = Perceptron(n_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which Scikit-learn version you have installed. You can do that by executing
python -c "import sklearn;print(sklearn.__version__)"
on your terminal/environment to which you have the python that executes your code.
The Perceptron's initialisation parameters changed from n_iter to max_iter around version 0.20 (n_iter was deprecated and later removed). The best way to keep up is to head to the documentation or source code of the version you have installed and read the parameters, e.g.:
documentation: perceptron docs v0.23
source code: perceptron v0.23 code
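A small sketch tying the answers together: check the installed version from Python and use max_iter, which is the parameter name on current releases (X_train_std and y_train are assumed to exist as in the question):
import sklearn
from sklearn.linear_model import Perceptron

print(sklearn.__version__)  # e.g. '0.23.2'

# max_iter is the current parameter name; on very old releases it was n_iter.
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)  # assumes X_train_std and y_train from the question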

Python : GridSearchCV is taking too long to execute

I am running GridSearchCV on a small dataset, whose size is (13380, 37), and the code is as below:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

parameters = {'kernel': ('linear', 'rbf', 'poly'),
              'C': [1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'gamma': [1e-7, 1e-6, 1e-4, 1e-3, 1e-2]}
svc = SVC()
clf = GridSearchCV(svc, parameters, n_jobs=38)
search = clf.fit(X_train, y_train)
search.best_params_
It has been running for more than a day, but with the same parameters on the iris dataset it gives the result in about a minute. The data is standardized and I am using multiprocessing too. Am I missing anything here?
I think the problem is with the n_jobs value you indicated. Change it from 38 to -1, which uses all available cores, and that should do the job fast enough.
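As a sketch of that suggestion, with verbose added so progress is at least visible (assuming the same X_train and y_train as in the question):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf', 'poly'),
              'C': [1.5, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'gamma': [1e-7, 1e-6, 1e-4, 1e-3, 1e-2]}

# n_jobs=-1 uses all available cores; verbose=2 prints one line per fit,
# which shows whether the search is actually making progress.
clf = GridSearchCV(SVC(), parameters, n_jobs=-1, verbose=2)
search = clf.fit(X_train, y_train)
print(search.best_params_)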

XGBClassifier is slow and does not print any output despite passing verbose parameter

I have the following code:
from xgboost import XGBClassifier
print(df_train.shape)
print(df_train_labels.shape)
clf = clf.fit(df_train, df_train_labels, verbose=True)
print("after fit")
Here df_train and df_train_labels are pandas DataFrames which I read from a CSV.
The above code prints:
(1460, 7)
(1460,)
However, nothing else is printed for 10 minutes, which means the code is stuck at clf.fit.
As you can see, there are only 1460 examples, so I'm assuming the algorithm shouldn't spend a long time on this.
Furthermore, since I passed verbose=True, I would have expected the model to print some output, but that is not happening.
Any idea why there is no output printed and why XGBClassifier takes such a long time?
Looking at the documentation:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
it looks like the parameter for printing the progress is called verbosity.
Set it to anything from 0 to 3 (3 for debug).
from xgboost import XGBClassifier
model = XGBClassifier(verbosity=2)
You should first initialize the model you want to use, in your case XGBClassifier().
A common pipeline to do so is:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(df_train, df_train_labels)
y_pred = model.predict(df_test)
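A sketch combining the two answers: in my experience the sklearn wrapper only prints per-round output when fit is given an eval_set to report on, so something like the following (the split and the n_estimators value are illustrative, and df_test is the test frame from the answer above):
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a small validation set purely so there is something to report on each round.
X_tr, X_val, y_tr, y_val = train_test_split(df_train, df_train_labels, test_size=0.2)

model = XGBClassifier(n_estimators=100, verbosity=1)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=True)  # prints one line per boosting round
y_pred = model.predict(df_test)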
