Save classifier for future prediction - python

I am new to machine learning with Python and I am trying to build a Sentiment Analyzer, for which I am using this dataset and this tutorial. Everything is working fine on the test data, but I'm trying to save my classifier for future use. I'm doing this using pickle, saving it as:
import pickle as pkl

sentiment_analyzer = open("Sentiment_Analyzer.pkl", "wb")
pkl.dump(classifier_linear, sentiment_analyzer)
sentiment_analyzer.close()
Later, I'm extracting my saved analyzer by doing this
model_pkl = open("Sentiment_Analyzer.pkl", "rb")
model = pkl.load(model_pkl)
But I'm unable to understand how to call the predict method on my extracted model classifier.

You need to save the vectorizer too, just the same way you are pickling the model. Then during future use, load both the vectorizer and classifier, transform the new X using the loaded vectorizer and then call predict() on classifier.
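A minimal sketch of that workflow, assuming a TfidfVectorizer and a linear SVM (the variable names train_texts and train_labels and the classifier choice are assumptions, not the asker's exact code):

import pickle as pkl
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Training time: fit the vectorizer and the classifier
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
classifier_linear = LinearSVC()
classifier_linear.fit(X_train, train_labels)

# Pickle both objects, not just the classifier
with open("Vectorizer.pkl", "wb") as f:
    pkl.dump(vectorizer, f)
with open("Sentiment_Analyzer.pkl", "wb") as f:
    pkl.dump(classifier_linear, f)

# Future use: load both, transform the new texts with the loaded
# vectorizer, then call predict() on the loaded classifier
with open("Vectorizer.pkl", "rb") as f:
    vectorizer = pkl.load(f)
with open("Sentiment_Analyzer.pkl", "rb") as f:
    model = pkl.load(f)

new_X = vectorizer.transform(["this movie was great"])
print(model.predict(new_X))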

Related

Inverted prediction after using pickle-loaded sklearn model

I'm building a statistical algorithm. I preprocess the data using PCA (sklearn.decomposition.PCA) and then apply a classification model (an MLP, for example, from sklearn.neural_network.MLPClassifier) to predict the category. I first fit the model and test it, and it works well. Then I save the model using the pickle module:
with open(path+'/methodes/PCA_fitted_model.sav', 'wb') as file:
    pickle.dump(pca, file)
with open(path+'/methodes/MLP_fitted_model.sav', 'wb') as file:
    pickle.dump(mlp, file)
I have a problem when I reload the model to predict the category of new data. I know the true category (it's test data), and the predicted category is the exact opposite of the true one (binary classification). I've checked that the preprocessing of the data with PCA is correct. Is this due to pickle, or is it something else?
I reload the classifier using:
with open(path+'/classifier/MLP_trained_model.sav', 'rb') as file:
    MLP = pickle.load(file)
and then make the prediction using:
prediction = MLP.predict(pca_data)
where pca_data is the data after the PCA preprocessing.
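For reference, a minimal round-trip of the save/reload workflow described above (file paths and variable names are assumptions); the essential point is that the same fitted PCA object is pickled, reloaded, and used to transform the test data before the reloaded MLP predicts:

import pickle
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# Fit both objects (X_train, y_train assumed to exist)
pca = PCA(n_components=10).fit(X_train)
mlp = MLPClassifier().fit(pca.transform(X_train), y_train)

# Save the fitted PCA and MLP
with open('PCA_fitted_model.sav', 'wb') as file:
    pickle.dump(pca, file)
with open('MLP_fitted_model.sav', 'wb') as file:
    pickle.dump(mlp, file)

# Reload and predict on new data (X_test assumed to exist)
with open('PCA_fitted_model.sav', 'rb') as file:
    pca = pickle.load(file)
with open('MLP_fitted_model.sav', 'rb') as file:
    MLP = pickle.load(file)

pca_data = pca.transform(X_test)
prediction = MLP.predict(pca_data)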

IsolationForest is always predicting 1

I am working on a project to detect out-of-domain text input with the help of IsolationForest and tf-idf features. Following is my work in summarized form:
TRAINING
On tfidf:
Fit and transform in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it during test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
Save the model using joblib.
TESTING:
Load all of the saved models.
Get the tf-idf transformed features of the current out-of-domain input text after replicating all the steps (transformations only) from the training stage.
Predict if it is out-of-domain or not, using the saved IsolationForest model.
But what I have found is that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the output of the tf-idf transformer, to check whether the tf-idf module is responsible for this, but no matter which random vector I provide I always get 1 as output from the IsolationForest. Also note that tf-idf has a lot of features (tokens); in my case the count is 48015.
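A condensed sketch of the training and testing steps described above (variable names, file names, and the contamination value are assumptions):

import pickle
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import IsolationForest

# TRAINING (in_domain_texts assumed to exist)
count_vect = CountVectorizer()
counts = count_vect.fit_transform(in_domain_texts)
tfidf = TfidfTransformer().fit(counts)
X_train = tfidf.transform(counts)

with open('vocabulary.pkl', 'wb') as f:
    pickle.dump(count_vect.vocabulary_, f)
with open('tfidf_transformer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

iso = IsolationForest(contamination=0.1).fit(X_train)
joblib.dump(iso, 'isolation_forest.joblib')

# TESTING (new_texts assumed to exist)
with open('vocabulary.pkl', 'rb') as f:
    vocab = pickle.load(f)
with open('tfidf_transformer.pkl', 'rb') as f:
    tfidf = pickle.load(f)
iso = joblib.load('isolation_forest.joblib')

counts_new = CountVectorizer(vocabulary=vocab).transform(new_texts)
X_new = tfidf.transform(counts_new)
print(iso.predict(X_new))  # 1 = inlier (in-domain), -1 = outlier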

How to evaluate a PMML file using python

I have a PMML file generated by Python containing a random forest classifier, and I need to test the model again in Python. Kindly let me know how to import the PMML file back into Python so that I can test the model on a new dataset.
I have tried using the titanium package, but it failed because of a PMML version issue.
The expected output is the predicted values of the model, so that I can verify the model's accuracy.
You could use PyPMML to load PMML in Python and then make predictions on a new dataset, e.g.
from pypmml import Model
model = Model.fromFile('the/pmml/file/path')
result = model.predict(data)
The data can be a dict, a JSON string, or a pandas Series or DataFrame.
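For example, with a pandas DataFrame (the file path and column names below are hypothetical; the columns must match the fields declared in the PMML file):

import pandas as pd
from pypmml import Model

model = Model.fromFile('random_forest.pmml')
data = pd.DataFrame({'feature_1': [1.2], 'feature_2': [3.4]})
result = model.predict(data)
print(result)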

How to train a trained model with new examples in scikit-learn?

I'm working on a machine learning classification task in which I have trained many models with different algorithms in scikit-learn, and RandomForestClassifier performed the best. Now I want to train the model further with new examples, but if I call the fit method on the same model with new examples, it will train the model from scratch and erase the old parameters.
So, how can I continue training the trained model with new examples in scikit-learn?
I read online about pickling and unpickling the model, but I don't see how that would help.
You should use incremental learning and estimators implementing the partial_fit API.
RandomForestClassifier has a warm_start flag. Note that this will not give the same results as training on both sets at once.
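A minimal sketch of the warm_start approach (dataset names are assumptions): keep the fitted forest, increase n_estimators, and call fit again so the additional trees are grown on the new examples while the old trees are kept.

from sklearn.ensemble import RandomForestClassifier

# Initial training (X_old, y_old assumed to exist)
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X_old, y_old)

# Later: grow 50 more trees, fitted on the new examples only
clf.n_estimators += 50
clf.fit(X_new, y_new)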
Append the new data to your existing dataset and train over the whole thing. You might want to reserve some of the new data for your test set.

Predicting new non-standardized data with a classifier trained on standardized data

I have some data with, say, L features. I have standardized them using StandardScaler() by doing a fit_transform on X_train. Then, while predicting, I did clf.predict(scaler.transform(X_test)). So far so good... Now, if I want to pickle the model for later reuse, how would I go about predicting on new data in the future with this saved model? The new (future) data will not be standardized, and I didn't pickle the scaler.
Is there anything else that I have to do before pickling the model the way I am doing it right now (to be able to predict on non-standardized data)?
reddit post: https://redd.it/4iekc9
Thanks. :)
To solve this problem you should use a pipeline: the first stage is the scaler and the second is your model. Then you can pickle the whole pipeline and use it directly on your new data.
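A minimal sketch of that approach (the classifier choice and file name are assumptions):

import pickle
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scaling and classification bundled into a single estimator
pipe = make_pipeline(StandardScaler(), LinearSVC())
pipe.fit(X_train, y_train)  # X_train, y_train assumed to exist

with open('pipeline.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# Later: the loaded pipeline scales the raw input itself before predicting
with open('pipeline.pkl', 'rb') as f:
    pipe = pickle.load(f)
print(pipe.predict(X_future))  # X_future is raw, non-standardized data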
