I'm building a statistical pipeline. I preprocess the data with PCA (sklearn.decomposition.PCA) and then apply a classification model (an MLP, for example, from sklearn.neural_network.MLPClassifier) to predict the category. I first fit the model and test it, and it works well. Then I save both fitted objects using the pickle module:
with open(path + '/methodes/PCA_fitted_model.sav', 'wb') as file:
    pickle.dump(pca, file)  # the with block closes the file automatically
with open(path + '/methodes/MLP_fitted_model.sav', 'wb') as file:
    pickle.dump(mlp, file)
I have a problem when I reload the model to predict the category of new data. I know the true category (it's test data), and the predicted category is the exact opposite of the true one (binary classification). I've checked that the PCA preprocessing of the data is correct. Is this due to pickle, or is it something else?
I reload the classifier using:
with open(path + '/classifier/MLP_trained_model.sav', 'rb') as file:
    MLP = pickle.load(file)
and then run the prediction using:
prediction = MLP.predict(pca_data)
where pca_data is the data after the PCA preprocessing.
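For reference, here is a minimal sketch of the full save/reload round trip, using the variable names from the question (new_data is an assumed placeholder for the fresh samples). Note that it reloads the transformer and the classifier from the same files they were saved to, and only calls transform, never fit, at prediction time:

import pickle

# Save the fitted transformer and classifier (paths as in the question).
with open(path + '/methodes/PCA_fitted_model.sav', 'wb') as f:
    pickle.dump(pca, f)
with open(path + '/methodes/MLP_fitted_model.sav', 'wb') as f:
    pickle.dump(mlp, f)

# Reload both from the same files and apply them in the training order.
with open(path + '/methodes/PCA_fitted_model.sav', 'rb') as f:
    pca_loaded = pickle.load(f)
with open(path + '/methodes/MLP_fitted_model.sav', 'rb') as f:
    mlp_loaded = pickle.load(f)

pca_data = pca_loaded.transform(new_data)  # transform only; do not refit
prediction = mlp_loaded.predict(pca_data)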
I am working on a project to detect out-of-domain text input with the help of IsolationForest and tf-idf features. Here is my work in summarized form:
TRAINING
On tfidf:
Fit and transform the in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time use.
On IsolationForest:
Train an IsolationForest() novelty detector on the transformed in-domain dataset.
Save the model using joblib.
TESTING:
Load all of the saved models.
Get the tf-idf features of the current (out-of-domain) input text by replicating the same transformation steps as in training.
Predict if it is out-of-domain or not, using the saved IsolationForest model.
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the shape of the tf-idf transformer's output, to check whether the tf-idf module was responsible, but no matter which random vector I provide I always get 1 as the output. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
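For reference, a minimal sketch of the pipeline described above, with illustrative names (docs_train, doc_test, the filenames, and the contamination value are assumptions, not from the original post). One relevant API detail: IsolationForest.predict returns +1 for inliers and -1 for outliers.

import pickle
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import IsolationForest

# --- training ---
cv = CountVectorizer()
counts = cv.fit_transform(docs_train)            # docs_train: in-domain corpus (assumed)
tfidf = TfidfTransformer().fit(counts)
X_train = tfidf.transform(counts)
iso = IsolationForest(contamination=0.05).fit(X_train)  # contamination value is illustrative

with open('vocab.pkl', 'wb') as f:
    pickle.dump(cv.vocabulary_, f)
with open('tfidf.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
joblib.dump(iso, 'iso.joblib')

# --- testing ---
with open('vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
with open('tfidf.pkl', 'rb') as f:
    tfidf = pickle.load(f)
iso = joblib.load('iso.joblib')

counts_test = CountVectorizer(vocabulary=vocab).transform([doc_test])
X_test = tfidf.transform(counts_test)
print(iso.predict(X_test))  # +1 = inlier (in-domain), -1 = outlier (out-of-domain)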
I need a scikit-learn classifier that can be trained periodically by fitting new data to the already-trained model, while retaining the classes it learned to fit in previous runs.
I have tried using warm_start on LogisticRegression and RandomForestClassifier, but I can't figure out the configuration needed to retain the classes from the previous fitting. With a retraining tool I have developed in Python, a previously saved sklearn model can be loaded with the pickle module and fitted again.
# Train file:
import pickle
from sklearn.linear_model import LogisticRegression

algorithm = LogisticRegression(warm_start=True)
# Fit data; (X, y) shapes are (100, 2008), (100,). No y value repeats.
algorithm.fit(X, y)
print(len(algorithm.classes_))  # Expected 100
# Save the trained object
with open("logreg.alg", "wb") as f:
    pickle.dump(algorithm, f)
# Re-train file:
import pickle

# Load the trained object
with open("logreg.alg", "rb") as f:
    algorithm = pickle.load(f)
# Fit new data; (X, y) is again (100, 2008), (100,). No y value was trained before or repeats.
algorithm.fit(X, y)
print(len(algorithm.classes_))  # Expected 200, actual 100
# Save the trained object
with open("logreg.alg", "wb") as f:
    pickle.dump(algorithm, f)
I expect the number of classes to increase after each re-fitting, but the algorithm seems to reset its values on each run. For example, the first fit sets classes "0" through "99"; now I want to fit again on classes "100" through "199" to obtain a trained model covering classes "0" through "199".
Am I doing something wrong, or am I misinterpreting the "warm_start" parameter? I would love to use logistic regression but am open to other classifiers.
Thanks in advance!
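For comparison, the incremental-learning pattern scikit-learn supports directly is partial_fit, which requires declaring every class that will ever appear on the first call. A minimal sketch of that pattern, assuming all 200 labels are known up front (SGDClassifier, the batch variables, and the label ranges are illustrative, not from the original post):

import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.arange(200)  # every class that will ever appear, declared up front

clf = SGDClassifier(loss="log_loss")  # logistic-regression-style linear model ("log" in older sklearn)
clf.partial_fit(X_batch1, y_batch1, classes=all_classes)  # first batch: classes 0-99
clf.partial_fit(X_batch2, y_batch2)                       # later batch: classes 100-199
print(len(clf.classes_))                                  # 200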
I am trying to update my already-trained model so that it can correct the errors it is making. To do that, I am partially fitting it on the data it labeled incorrectly, together with the correct labels.
I have saved my Bayesian model in a file like this:
from sklearn.naive_bayes import MultinomialNB
import joblib

model1 = MultinomialNB()  # naive Bayes model
model1.partial_fit(features_matrix, label_matrix, classes=[0, 1, 2])
filename = 'trained_NBmodel.pkl'  # saving the trained model
joblib.dump(model1, filename)
and then loading it in another file like this:
import joblib

loaded_model = joblib.load('trained_NBmodel.pkl')
loaded_model.partial_fit(new_features_matrix, new_label_matrix)
filename = 'trained_NBmodel.pkl'  # saving the updated model
joblib.dump(loaded_model, filename)
Now it should use the updated model, and if new_features_matrix is passed to predict, it should predict new_label_matrix with high accuracy, but it does not: it gives the same labels as before refitting. Could it be that I trained my initial model on similar data so many times with different labels that it cannot learn from this smaller amount of data?
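One way to probe that hypothesis is to give the correction batch more influence: MultinomialNB.partial_fit accepts a sample_weight argument, so the new examples can be made to outweigh the counts accumulated by earlier fits. A sketch under that assumption (the weight factor of 10 is purely illustrative):

import numpy as np
import joblib

loaded_model = joblib.load('trained_NBmodel.pkl')
# Up-weight the correction batch so it outweighs the counts accumulated
# over many earlier partial_fit calls (the factor 10 is an assumption).
weights = np.full(len(new_label_matrix), 10.0)
loaded_model.partial_fit(new_features_matrix, new_label_matrix, sample_weight=weights)
joblib.dump(loaded_model, 'trained_NBmodel.pkl')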
I have been using the nltk package and trained a model using Naive Bayes. I saved the model to a file using the pickle package. Now I wonder how I can use this model on a random text that is not in the dataset, and have the model tell me which category the sentence belongs to.
For example, my idea is that I have a sentence: "Ronaldo have scored 2 goals against Egypt", I pass it to the model file, and it returns the category "sport".
Just saving the model will not help. You should also save your vectorizer (TfidfVectorizer, CountVectorizer, or whatever you used to fit the training data); you can save it the same way using pickle. Also save any other models you used to preprocess the training data, such as normalization/scaling transformers. For the test data, repeat the same steps: load the pickled models you saved, transform the test data into the same format the training data had when you built the model, and then you will be able to classify.
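A sketch of that full round trip, mapped onto scikit-learn objects for concreteness (the question used nltk's classifier; train_texts, train_labels, and the .pkl filenames here are assumptions):

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fit the vectorizer and the classifier on the training texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# Pickle BOTH objects, not just the classifier.
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

# At test time, load both and transform (never fit) the new sentence.
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)

X_new = vectorizer.transform(["Ronaldo have scored 2 goals against Egypt"])
print(clf.predict(X_new))  # e.g. ['sport'], given suitably labeled training data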
I am new to machine learning with Python and am trying to build a sentiment analyzer, using this dataset and this tutorial. Everything works fine on the test data, but I want to save my classifier for future use. I'm doing this with pickle, saving it as
with open("Sentiment_Analyzer.pkl", "wb") as sentiment_analyzer:
    pkl.dump(classifier_linear, sentiment_analyzer)
Later, I'm extracting my saved analyzer by doing this:
with open("Sentiment_Analyzer.pkl", "rb") as model_pkl:
    model = pkl.load(model_pkl)
But I'm unable to understand how to call the predict method on my extracted model classifier.
You need to save the vectorizer too, the same way you are pickling the model. Then, for future use, load both the vectorizer and the classifier, transform the new X using the loaded vectorizer, and call predict() on the classifier.
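Concretely, assuming the vectorizer was pickled alongside the classifier (the vectorizer filename and the sample sentence below are assumptions), the load-and-predict step would look like:

import pickle as pkl

with open("Vectorizer.pkl", "rb") as f:  # hypothetical companion file saved at training time
    vectorizer = pkl.load(f)
with open("Sentiment_Analyzer.pkl", "rb") as f:
    model = pkl.load(f)

X_new = vectorizer.transform(["I really enjoyed this movie"])  # illustrative input
print(model.predict(X_new))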