predict_proba gives different results in similar environments - python

So I'm training a model (LightGBM) in Google Colab and then serializing it into a pickle file (*.pkl).
I then load this file (with read_pickle) locally in a script, using the Spyder (Anaconda) IDE.
To validate that I get the same results, I compared the predict_proba output between the Spyder (Anaconda) IDE and Google Colab. Unfortunately, they differ by a significant margin!
This is despite having the same versions in both environments:
python=3.9.13
pandas=1.4.4
numpy=1.21.5
lightgbm=3.2.1
scikit-learn=1.0.2
I use the Spyder IDE because I have a Windows machine, and because the software that communicates with the script is also Windows based. I never had any problems with this setup until recently, when I had to reinstall Anaconda; that's when I noticed the difference.
How can I get predict_proba to give the same results in both environments?
At this point I have triple-checked that I am using the exact same input data and the exact same pickle file in both environments.
When I first noticed the discrepancy, I equalized the python, pandas, numpy, lightgbm and scikit-learn versions and retrained the model. It didn't help; predict_proba still gives different results in the two environments.
Here is basically how I train the model:
import pandas as pd
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
#...read feature file, label file, drop some columns etc.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, labelss, test_size=0.2,
                                                random_state=52, shuffle=False)
eval_setb = [(Xtest, ytest)]
modelb_opt = LGBMClassifier(n_estimators=110000, learning_rate=0.001, max_depth=20,
                            num_leaves=46, colsample_bytree=0.012217664054347373,
                            lambda_l1=2.0236823308340988e-07, lambda_l2=6.127039442996004e-06,
                            min_data_in_leaf=335, feature_fraction=0.7647946800740761,
                            bagging_fraction=0.46500934520303194, objective="binary",
                            class_weight='balanced')
modelb_opt.fit(Xtrain, ytrain, eval_set=eval_setb,
               early_stopping_rounds=1000, eval_metric='logloss')
pd.to_pickle(modelb_opt, "C:/path/to/file/file_name.pkl")
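One way to narrow this down is to compare more than predict_proba. Here is a small diagnostic sketch to run in both environments (the path comes from the question; the input frame X is a placeholder): it hashes the input bytes and prints the raw booster margin, so you can tell whether the data or the trees diverge.
import hashlib
import pandas as pd

model = pd.read_pickle("C:/path/to/file/file_name.pkl")
X = ...  # placeholder: the identical input DataFrame used in both environments

# Hash the raw input bytes to confirm both environments see identical data.
print(hashlib.md5(X.to_numpy().tobytes()).hexdigest())

# Raw margin from the underlying booster, before the sigmoid is applied.
print(model.booster_.predict(X, raw_score=True)[:5])
If the hashes match but the raw scores differ, the divergence is inside LightGBM itself (for example, a differently compiled wheel), not in the data pipeline.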
Update:
I re-trained the model locally on my PC in the Spyder (Anaconda) IDE. The predict_proba results in Spyder didn't change: whether I load a model trained locally or one trained in Google Colab, I get the same faulty predict_proba output.

Related

model.features_names_in not working for decision trees in Anaconda Jupyter Notebook

Is there any way I can get the feature names of the decision tree model, as defined below, using sklearn or any other package in the Anaconda Jupyter Notebook? I have been working on this issue for a long time now, but have not been able to find an exact walkthrough.
(The original post links four code snippets: the snippet with the error, the latest code for preparing the model, the code for running the model, and the code for visualizing the DT.)
You just need to use a Pandas dataframe with the feature names when you train your decision tree model instead of a Numpy array with no feature names.
Scikit-learn decision tree documentation
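A minimal sketch (the column names and data are made up): fit on a DataFrame and the fitted estimator records the column names in feature_names_in_, available in scikit-learn >= 1.0.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Column names come from the DataFrame, not from a bare numpy array.
X = pd.DataFrame({'age': [25, 40, 31, 58], 'income': [30, 80, 52, 64]})
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.feature_names_in_)  # -> ['age' 'income']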

Is there a way to know which sklearn version was used to train a pickle model?

I was given a pickle file with a trained gradient boosting model that was trained by someone else on another machine. I realised that I could not load this pickle file on my machine using
import pickle

with open('gb_model.pickle', 'rb') as f:
    gbmodel = pickle.load(f)
My current version is scikit-learn==0.24.2. I got the error ModuleNotFoundError: No module named 'sklearn.ensemble.gradient_boosting'. I then tried installing other versions of sklearn, but I keep getting other sklearn-related errors. I also tried using joblib but got the same result:
from joblib import load
clf = load('gb_model.pickle')
I realised I need to load the pickled file with the same sklearn version it was trained with. I saw here that one can check the version after loading it, but it seems like I can't even load the pickle file. Is there another way of doing this? I want to end up being able to load the pickled model. According to the official documentation, there should ideally be metadata saved along with the pickled model, but I was not provided with this. Is there a way to obtain it from the pickle file alone?
If you trained the model with sklearn version 0.18 or higher, then try:
import pickle
clf = pickle.load(open('gb_model.pickle', 'rb'))
clf.__getstate__()['_sklearn_version']
However, the error itself is the clue: in scikit-learn 0.24.2 there is no module called gradient_boosting inside sklearn.ensemble any more. That module did exist in versions before 0.22, when it was made private (sklearn.ensemble._gb) and the old import path was later removed; the public name is now the class sklearn.ensemble.GradientBoostingClassifier. So the pickle was most likely created with scikit-learn < 0.22, and installing such a version (for example 0.21.3) in a separate environment should let you load it and read the metadata as above.
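If you can't easily install old versions just to probe, here is a rough sketch that scans the raw pickle bytes for the _sklearn_version marker without unpickling anything (a heuristic: it assumes the version string is stored, which sklearn >= 0.18 pickles do).
import re

with open('gb_model.pickle', 'rb') as f:
    raw = f.read()

# The version string is serialized shortly after the b'_sklearn_version' key.
match = re.search(rb'_sklearn_version.{1,10}?(\d+\.\d+(?:\.\d+)?)', raw, re.DOTALL)
print(match.group(1).decode() if match else 'version marker not found')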

Amazon sagemaker. SKlearn estimator vs Tensorflow estimator - why requirements_file is not present in one of them?

I am looking at the definitions of two estimators, SKLearn and TensorFlow, in Amazon SageMaker:
SKLearn
Tensorflow
class sagemaker.sklearn.estimator.SKLearn(entry_point, framework_version='0.20.0', source_dir=None, hyperparameters=None, py_version='py3', image_name=None, **kwargs)
class sagemaker.tensorflow.estimator.TensorFlow(training_steps=None, evaluation_steps=None, checkpoint_path=None, py_version='py2', framework_version=None, model_dir=None, requirements_file='', image_name=None, script_mode=False, distributions=None, **kwargs)
TensorFlow has a requirements_file parameter, while SKLearn does not. Is there a reason why? How can I add a requirements.txt to the SKLearn estimator?
I had a similar use case. While this issue states that they will be supporting it, I found that if you keep a requirements.txt file beside your entry-point file (in the same source_dir), SageMaker installs the required dependencies.
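Roughly like this (a sketch: the script name, role ARN, S3 path and instance type are placeholders; in newer SDK versions train_instance_type is spelled instance_type):
from sagemaker.sklearn.estimator import SKLearn

# src/ contains train.py AND requirements.txt; the container pip-installs
# the requirements before invoking the entry point.
estimator = SKLearn(entry_point='train.py',
                    source_dir='src/',
                    framework_version='0.20.0',
                    role='arn:aws:iam::123456789012:role/SageMakerRole',
                    train_instance_type='ml.m5.large')
estimator.fit({'train': 's3://my-bucket/train/'})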

Feature Selection in PySpark

I am working on a machine learning data set of shape 1,456,354 x 53 and want to do feature selection on it. I know how to do feature selection in Python using the following code.
import numpy as np
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how to perform recursive feature selection in PySpark.
I tried to import the sklearn libraries in PySpark, but it gave me a "sklearn module not found" error. I am running PySpark on a Google Dataproc cluster.
Could someone please help me achieve this in PySpark?
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark (see the sketch after this list). Here's a good post discussing how to do this.
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
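For the UDF route in the second option, a hedged sketch (Spark 2.3+; local_model, df and feature_cols are placeholders for your trained sklearn model, your Spark DataFrame and its feature column names):
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# Ship the locally trained sklearn model to every executor once.
broadcast_model = spark.sparkContext.broadcast(local_model)

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict_udf(*cols):
    features = pd.concat(cols, axis=1)
    return pd.Series(broadcast_model.value.predict(features))

scored = df.withColumn('prediction', predict_udf(*[df[c] for c in feature_cols]))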
We can try the following feature selection methods in PySpark (a Chi-Squared example follows the references):
Chi-Squared selector
Random forest selector
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
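A minimal sketch of the Chi-Squared selector with the Spark ML API (the toy data and numTopFeatures value are illustrative, adapted from the Spark docs linked above):
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ['features', 'label'])

# Keep the two features most associated with the label by a chi-squared test.
selector = ChiSqSelector(numTopFeatures=2, featuresCol='features',
                         outputCol='selectedFeatures', labelCol='label')
selector.fit(df).transform(df).select('selectedFeatures').show()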
I suggest a stepwise regression model: with it you can easily find the important features and use only those in your logistic regression. Stepwise regression works on correlation, but it has variations.
The link below will help you implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
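As one concrete alternative (a sketch, not the code from the linked post): scikit-learn >= 0.24 ships SequentialFeatureSelector, which does forward stepwise selection; df and arrythmia are the names from the question.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)

# Forward stepwise: greedily add the feature that most improves the CV score.
sfs = SequentialFeatureSelector(logreg, n_features_to_select=28,
                                direction='forward')
sfs.fit(df.values, arrythmia.values)
print(np.array(df.columns)[sfs.get_support()])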

how to change parameters of fasttext api in a python script

We have fastText commands that can be run at the command prompt.
I have cloned the GitHub repository. For example, to change the parameters of the network for supervised learning, the command I used is:
./fasttext supervised -input FT_Race_data.txt -output race_model -lr 0.4 -epoch 30 -loss hs
I am changing lr, epoch and loss, and I can train and fetch the required output.
To do the same from a Python script, I installed the fasttext library and tried:
classifier = fasttext.supervised('FT_Race_data.txt','race_model')
The model gets trained, but the results are not good; in this case I didn't define any parameters. So I tried:
classifier = fasttext.supervised('FT_Race_data.txt','race_model', 0.4, 30, 'hs')
The program runs with no error but doesn't give any result. So I tried:
classifier = fasttext.supervised(input = 'FT_Race_data.txt',output ='race_model', lr = 0.4,epoch= 30,loss = 'hs')
It gives an error that fasttext takes only two arguments.
How can I change the parameters in a Python script, as at the command prompt, to fine-tune the supervised learning?
For future reference: from the discussions here, it seems that pip install fasttext doesn't install the full feature set available in the repo.
So until the latest features are included in https://pypi.python.org/pypi/fasttext, to get Python bindings that can train models and set parameters, follow the installation procedure outlined here:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
Then, using train_supervised, a function which returns a model object, you can set the different parameters as in the following example from that repo:
fastText.train_supervised(input, lr=0.1, dim=100, ws=5, epoch=5, minCount=1, minCountLabel=0, minn=0, maxn=0, neg=5, wordNgrams=1, loss='softmax', bucket=2000000, thread=12, lrUpdateRate=100, t=0.0001, label='__label__', verbose=2, pretrainedVectors='')
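For example, the original command-line call from the question would translate roughly to the sketch below (file names come from the question; the module was named fastText in the repo at the time, while newer releases spell it fasttext):
import fastText

# Mirrors: ./fasttext supervised -input FT_Race_data.txt -output race_model
#          -lr 0.4 -epoch 30 -loss hs
model = fastText.train_supervised('FT_Race_data.txt', lr=0.4, epoch=30, loss='hs')
model.save_model('race_model.bin')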
