How to fit new classes to a scikit-learn classifier - Python

I need a scikit-learn classifier that can be trained periodically by fitting new data to an already trained model while retaining the classes it learned in previous fittings.
I have tried using warm_start on LogisticRegression and RandomForestClassifier, but I can't figure out the configuration needed to retain the classes from the previous fitting. Using a retraining tool I have developed in Python, a previously saved scikit-learn model can be loaded with the pickle module and fit with new data.
#Train file:
import pickle
from sklearn.linear_model import LogisticRegression

algorithm = LogisticRegression(warm_start=True)
#Fit data; (X, y) shapes are (100, 2008) and (100,); no y value repeats
algorithm.fit(X, y)
print(len(algorithm.classes_))  #Expected 100
#Save trained object
with open("logreg.alg", "wb") as f:
    pickle.dump(algorithm, f)

#Re-train file:
import pickle

#Load trained object
with open("logreg.alg", "rb") as f:
    algorithm = pickle.load(f)
#Fit new data; (X, y) again is (100, 2008), (100,); no y value was trained before or repeats
algorithm.fit(X, y)
print(len(algorithm.classes_))  #Expected 200, actual 100
#Save trained object
with open("logreg.alg", "wb") as f:
    pickle.dump(algorithm, f)
I am expecting the number of classes to increase after each re-fitting, but it seems the algorithm resets its classes on each run. For example, the first fitting sets classes "0" through "99"; I then want to fit again with classes "100" through "199" and end up with a trained model covering classes "0" through "199".
Am I doing something wrong, or am I misinterpreting the warm_start parameter? I would love to use LogisticRegression but am open to other classifiers.
Thanks in advance!
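For reference, warm_start only reuses the previous solution as the starting point for a full refit; fit() always recomputes classes_ from the y it is given. The pattern scikit-learn does support for this use case is an estimator with partial_fit, told up front about every class it will ever see. A minimal sketch, where SGDClassifier and the names X1, y1, X2, y2 are assumptions mirroring the shapes above:

import numpy as np
from sklearn.linear_model import SGDClassifier

# Declare all 200 classes on the first call; later calls may cover any subset
all_classes = np.arange(200)

algorithm = SGDClassifier(loss='log_loss')  # 'log' in older scikit-learn versions

# First session: classes 0-99
algorithm.partial_fit(X1, y1, classes=all_classes)
print(len(algorithm.classes_))  # 200: the full class list is retained

# Later session (e.g. after pickling and reloading): classes 100-199
algorithm.partial_fit(X2, y2)
print(len(algorithm.classes_))  # Still 200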

Related

Inverted prediction after using a pickle-loaded sklearn model

I'm building a statistical algorithm. I preprocess the data using PCA (sklearn.decomposition.PCA) and then apply a classification model (an MLP, for example, from sklearn.neural_network.MLPClassifier) to predict the category. I first fit the model and test it; it works well. Then I save the models using the pickle module:
with open(path+'/methodes/PCA_fitted_model.sav','wb') as file:
    pickle.dump(pca, file)
with open(path+'/methodes/MLP_fitted_model.sav','wb') as file:
    pickle.dump(mlp, file)
I have a problem when I reload the model to predict the category of new data. I know its category (it's test data), and the predicted category is the exact opposite of the true category (binary classification). I've checked that the PCA preprocessing of the data is correct. Is this due to pickle, or is it something else?
I reload the classifier using:
with open(path+'/classifier/MLP_trained_model.sav','rb') as file:
    MLP = pickle.load(file)
and then predict using:
prediction = MLP.predict(pca_data)
where pca_data is the data after the PCA preprocessing.
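One way to rule out a train/test preprocessing mismatch is to pickle the PCA step and the classifier together as a single Pipeline, so the exact fitted transform is always applied before predict(). This is a sketch, not the asker's code; X_train, y_train, X_new and the file name pipeline.sav are assumptions:

import pickle
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Fit PCA and the MLP as one object so they can never get out of sync
pipeline = Pipeline([("pca", PCA()), ("mlp", MLPClassifier())])
pipeline.fit(X_train, y_train)

# Save and reload the whole pipeline as a single pickle file
with open("pipeline.sav", "wb") as f:
    pickle.dump(pipeline, f)
with open("pipeline.sav", "rb") as f:
    loaded = pickle.load(f)

# predict() applies the fitted PCA transform, then the MLP
prediction = loaded.predict(X_new)  # X_new is raw, untransformed data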

IsolationForest is always predicting 1

I am working on a project to detect out-of-domain text input with the help of IsolationForest and tf-idf features. Here is my work in summarized form:
TRAINING
On tfidf:
Fit and transform the in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
Take the transformed in-domain dataset and train an IsolationForest() novelty detector on it.
Save the model using joblib.
TESTING:
Load all of the saved models.
Get the tf-idf transformed features of the current out-of-domain input text by replicating all the steps (transformations only) from training.
Predict if it is out-of-domain or not, using the saved IsolationForest model.
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the output of the tf-idf transformer, to check whether the tf-idf module is responsible, but no matter which random vector I provide I always get 1 as output from the IsolationForest. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
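For reference, IsolationForest.predict() returns 1 for inliers and -1 for outliers, and the threshold between them is governed by the contamination parameter, so an over-permissive threshold makes everything come back as 1. A minimal sketch of the pipeline described above; the contamination value and the variable names are assumptions:

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Training: fit the vectorizers on in-domain text only
count_vec = CountVectorizer()
counts = count_vec.fit_transform(in_domain_texts)  # in_domain_texts: list of str
tfidf = TfidfTransformer().fit(counts)
X_train = tfidf.transform(counts)

# contamination is the expected fraction of outliers among the training data;
# with too permissive a threshold, every input is classified as an inlier (1)
iso = IsolationForest(contamination=0.1, random_state=0).fit(X_train)

# Testing: replicate the transformations, then predict
X_test = tfidf.transform(count_vec.transform(test_texts))
print(iso.predict(X_test))            # 1 = inlier, -1 = outlier
print(iso.decision_function(X_test))  # lower scores are more anomalous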

Loaded model gives different predictions than the saved model

I am trying to save a model and load it into a different session, but I am getting prediction inconsistencies, and I would appreciate any help that can be offered. Here is what I did.
First, after running the model, I used this code to save it:
from sklearn.externals import joblib
joblib.dump(clf, "models.pkl")
and then to load the file in a different colaboratory notebook, I used the function
from sklearn.externals import joblib
loaded_model = joblib.load('models.pkl')
then this is the code I used to process a single image for testing:
img_toArray = cv2.imread("/content/ESD/ESD/folder1/img1.png")
new_array = cv2.resize(img_toArray, (220, 220))
new_array = np.array(new_array).reshape(1,145200)
but this results in an output of array([4]) for every image I test, and I am not sure why.
I have also tried reloading the entire dataset, separating the labels from the features (the images), and using train_test_split to dedicate 90% of the dataset to testing, and when I run the features (images) through this block of code:
loaded_model.predict(np.array(xTest[whatEverNumber]).reshape(1,145200))
I get the right predictions. So I am confused about what I am doing wrong, because in both examples I am processing the images in basically the same way and then running them through the same prediction method. I would appreciate any help in figuring out what I did wrong.
Extra information that may prove beneficial: I am using Colaboratory, and my model is an sklearn SVM that runs through a cross_validation_predict and finally an SVM fit function.
Thank you in advance!
Is loaded_model always trained with the same data? You might be encountering this problem because your fitted model is trained on different chunks (folds) of your dataset and you are fitting/saving only the last iteration; hence, each time you test it, the model has learned from different data (given by each fold) and returns different predictions. This applies if model fitting happens inside your cross-validation loop. May I ask what type of train-test split you used? Shuffled?
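If that is the case, one remedy is to use cross-validation only for evaluation and fit the model that actually gets saved on the full training set. A sketch under stated assumptions (X_train, y_train and the SVC settings are placeholders; recent scikit-learn imports joblib directly rather than from sklearn.externals):

import joblib
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

clf = SVC()

# Use cross-validation only to estimate performance...
cv_predictions = cross_val_predict(clf, X_train, y_train, cv=5)

# ...then fit the model that will be saved on the full training set, so the
# saved model does not depend on whichever fold happened to come last
clf.fit(X_train, y_train)
joblib.dump(clf, "models.pkl")

# In the other session, the loaded model is now the full-data fit
loaded_model = joblib.load("models.pkl")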

How to load previously saved model and expand the model with new training data using scikit-learn

I'm using scikit-learn, where I've saved a logistic regression model with unigrams as features, trained on training set 1. Is it possible to load this model and then expand it with new data instances from a second training set (training set 2)? If yes, how can this be done? The reason for doing this is that I'm using two different approaches for each of the training sets (the first approach involves feature corruption/regularization, and the second involves self-training).
I've added some simple example code for clarity:
from sklearn.linear_model import LogisticRegression as log
from sklearn.feature_extraction.text import CountVectorizer as cv
import pickle
trainText1 # Training set 1 text instances
trainLabel1 # Training set 1 labels
trainText2 # Training set 2 text instances
trainLabel2 # Training set 2 labels
clf = log()
# Count vectorizer used by the logistic regression classifier
vec = cv()
# Fit count vectorizer with training text data from training set 1
vec.fit(trainText1)
# Transform the text of training set 1 into vectors
trainVec1 = vec.transform(trainText1)
# Fit the logistic regression classifier on the vectorized training set 1
clf.fit(trainVec1, trainLabel1)
# Saving logistic regression model from training set 1
modelFileSave = open('modelFromTrainingSet1', 'wb')
pickle.dump(clf, modelFileSave)
modelFileSave.close()
# Loading logistic regression model from training set 1
modelFileLoad = open('modelFromTrainingSet1', 'rb')
clf = pickle.load(modelFileLoad)
# I'm unsure how to continue from here....
LogisticRegression internally uses the liblinear solver, which does not support incremental fitting. Instead you could use SGDClassifier(loss='log'), which has a partial_fit method that can be used for this, although in practice the other hyperparameters are different; be careful to grid search their optimal values. Read the SGDClassifier documentation for the meaning of those hyperparameters.
CountVectorizer does not support incremental fitting either. You would have to reuse the vectorizer fitted on training set #1 to transform set #2. That means any token from set #2 not already seen in set #1 will be completely ignored, which might not be what you expect.
To mitigate this you can use HashingVectorizer, which is stateless, at the cost of not knowing what the features mean. Read the documentation for more details.
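Putting those two suggestions together, a sketch of how the example above could continue (the loss name varies by scikit-learn version, and combining the two label sets up front is an assumption required by partial_fit):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer()  # stateless: no fit step, identical across training sets

# partial_fit must be told about every class it will ever see on the first call
all_classes = np.unique(np.concatenate([trainLabel1, trainLabel2]))

clf = SGDClassifier(loss='log_loss')  # 'log' in older scikit-learn versions
clf.partial_fit(vec.transform(trainText1), trainLabel1, classes=all_classes)

# Later, possibly after pickling and reloading clf, continue training
clf.partial_fit(vec.transform(trainText2), trainLabel2)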

Out-of-core training of Scikit's LinearSVC Classifier

How do you train Scikit's LinearSVC on a dataset too big or impractical to fit into memory? I'm trying to use it to classify documents, and I have a few thousand tagged example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory and I'm forced to kill it before my system becomes totally unresponsive.
Is it possible to format my training data as a single file and feed it into LinearSVC with a filename instead of having to call the fit() method?
I found this guide, but it only really covers classification, and assumes training is done incrementally, something LinearSVC doesn't support.
As far as I know, non-incremental implementations like LinearSVC need the entire dataset to train on. Unless you create an incremental version of it, you might be unable to use LinearSVC.
There are classifiers in scikit-learn that can be used incrementally, just like in the guide you found, which uses an SGDClassifier. SGDClassifier has a partial_fit method which allows you to train it in batches. A couple of other classifiers support incremental learning as well, such as Multinomial Naive Bayes and Bernoulli Naive Bayes.
You can stream the file with a generator function like this:

import csv

def lineGenerator():
    with open(INPUT_FILENAMES_TITLE[0], 'r') as f1:
        title_reader = csv.reader(f1)
        for line in title_reader:
            yield line[0]

Note, however, that LinearSVC's fit() cannot consume a generator directly: it needs the full numeric feature matrix (and the labels) at once. The generator is useful for feeding text to an incremental classifier in mini-batches via partial_fit, as sketched below. This assumes INPUT_FILENAMES_TITLE[0] is your filename.
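A sketch of that mini-batch pattern under stated assumptions: each CSV row is (text, label), the label set and batch size are hypothetical, and HashingVectorizer/SGDClassifier stand in for LinearSVC since they support out-of-core use:

import csv
from itertools import islice

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def rowGenerator(filename):
    # Assumes each CSV row is (text, label); adapt to your file layout
    with open(filename, 'r') as f:
        for row in csv.reader(f):
            yield row[0], row[1]

vec = HashingVectorizer()               # stateless, so no full pass over the data
clf = SGDClassifier()                   # linear model trained with SGD; supports partial_fit
all_classes = np.array(['pos', 'neg'])  # hypothetical label set, declared up front

rows = rowGenerator(INPUT_FILENAMES_TITLE[0])
while True:
    batch = list(islice(rows, 1000))    # 1000 rows per mini-batch (arbitrary)
    if not batch:
        break
    texts, labels = zip(*batch)
    clf.partial_fit(vec.transform(texts), labels, classes=all_classes)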
