I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using a random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, its words are not necessarily the same as the words in the training set that I used to build my RF. This is a problem because I end up with a different number of features in my training set than in my test set (the training set has fewer columns than the test set).
####### Convert bag of words to TFIDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print(tfidf_matrix.shape)
## number of features = 421
####### Train Random Forest Model
clf = RandomForestClassifier(max_depth=None,min_samples_split=1, random_state=1,n_jobs=-1)
####### k-fold cross validation
scores = cross_val_score(clf, tfidf_matrix.toarray(),labels,cv=7,n_jobs=-1)
print(scores.mean())
### this is the new data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
### number of features = 619
clf.fit(tfidf_matrix.toarray(),labels)
clf.predict(new_tfidf.toarray())
How can I go about creating a working RF model for classification that will incorporate new features (words) that weren't seen in the training?
Do not call fit_transform on the unseen data, only transform! That will keep the dictionary from the training set.
You cannot introduce new features into the test set that were not part of your training set. The model is trained on a specific dictionary of terms, and that same dictionary of terms must be used across training, validation, testing, and production. Furthermore, the indices of the words in your feature vector cannot change either.
You should create one large matrix from all of your data and then split its rows into your train and test sets. This guarantees that train and test share the same feature set.
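The advice to call only transform on unseen data can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the documents, labels, and variable names other than tfidf_vectorizer and clf are made up, and the forest parameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the asker's `data`, `labels`, and `new_X` (made up for illustration).
train_docs = [
    "cheap flights to rome",
    "rome hotel deals",
    "python list sorting",
    "sorting a python dict",
]
train_labels = ["travel", "travel", "code", "code"]
new_docs = ["hotel deals in paris", "sorting tuples in python"]

tfidf_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and the IDF weights from the training text only.
train_matrix = tfidf_vectorizer.fit_transform(train_docs)

clf = RandomForestClassifier(n_estimators=50, random_state=1)
clf.fit(train_matrix, train_labels)

# transform reuses the learned vocabulary, so words that never appeared in
# training ("paris", "tuples", "in") are simply ignored.
new_matrix = tfidf_vectorizer.transform(new_docs)

# Both matrices have the same number of columns, so predict works.
print(train_matrix.shape, new_matrix.shape)
predictions = clf.predict(new_matrix)
print(predictions)
```

Because the column count is fixed by the training vocabulary, new words cannot add features; they only matter if the vectorizer and the model are refitted on data that contains them.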
I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis in python. However, here's what I don't understand: It seems to me that the data they use is already labeled? So, how do I use the training I did on the labeled data to then apply to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question may seem stupid, but I don't have a lot of experience with machine learning and nltk ...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
#chat_data = chat_data[:180]
#chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer (max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
#features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data, because otherwise I got a ValueError telling me that it can't convert from string to float.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is exactly the problem that supervised ML tries to solve: given labeled data as inputs of the form (sample, label), a model tries to discover the general patterns that exist in the data. Hopefully, these patterns will be useful for predicting the labels of unseen, unlabeled data.
For example, in a sentiment-analysis problem (sad, happy), a model might discover patterns like these during training:
The presence of one or more of these words means sad:
("misery", "sad", "displaced people", "homeless", ...)
The presence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
When a new text document is given, we search for these patterns inside it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training. Instead, we hold out a small portion of it (separate from the training set) to validate the model and verify that it discovered truly general patterns, not ones tailored specifically to the training data.
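The workflow described above (train on labeled data, validate on a held-out portion, then label new text) might look like the sketch below. The texts and the names labeled_texts, labels, and unlabeled_texts are made-up stand-ins for df1 and df2. The key assumption is that the unlabeled text is passed through the same fitted vectorizer with transform rather than fit_transform.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up labeled data (stand-in for df1) and unlabeled data (stand-in for df2).
labeled_texts = [
    "what a delightful wedding",
    "we win again, so happy",
    "a delightful day with friends",
    "happy about the big win",
    "such misery everywhere",
    "sad news about homeless families",
    "homeless and sad tonight",
    "misery after the storm",
]
labels = ["happy"] * 4 + ["sad"] * 4
unlabeled_texts = ["a sad and homeless man", "delightful win for the team"]

# Split the raw text first, so the vectorizer only ever sees training text.
train_texts, valid_texts, y_train, y_valid = train_test_split(
    labeled_texts, labels, test_size=0.25, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF here
X_valid = vectorizer.transform(valid_texts)      # reuse them for validation

text_classifier = RandomForestClassifier(n_estimators=50, random_state=0)
text_classifier.fit(X_train, y_train)
print(accuracy_score(y_valid, text_classifier.predict(X_valid)))

# Unlabeled data goes through the same fitted vectorizer: transform only.
X_unlabeled = vectorizer.transform(unlabeled_texts)
unlabeled_predictions = text_classifier.predict(X_unlabeled)
print(unlabeled_predictions)
```

The validation accuracy on such a tiny toy set is not meaningful; the point is only the order of the fit_transform and transform calls.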
I am trying to train an NLP model on one set, save the vocab and the model, then apply it to a separate validation set. The code is running, but how can I be sure it is working as I expect?
In other words, I have saved a vocab and model from the training set, then I created the TfidfVectorizer with the saved vocabulary, and finally I use "fit_transform" on the new validation notes.
Is this applying only the trained vocab and model? Is it not "learning" anything new from the validation set?
Training, then load the vocab and model and apply to the validation set:
train_vector = tfidf_vectorizer.fit_transform(training_notes)
pickle.dump(tfidf_vectorizer.vocabulary_, open('./vocab/' + '_vocab.pkl', 'wb'))
X_train = train_vector.toarray()
y_train = np.array(train_data['ref_std'])
model.fit(X_train, y_train)
dump(model, './model/' + '.joblib')
train_prediction = model.predict(X_train)
vocab = pickle.load(open('./vocab/' + '_vocab.pkl', 'rb'))
tfidf_vectorizer = TfidfVectorizer(vocabulary = vocab)
valid_vector = tfidf_vectorizer.fit_transform(validation_notes)
X_valid = valid_vector.toarray()
y_valid = np.array(validation_data['ref_std'])
model = load('./model/' + '.joblib')
valid_prediction = model.predict(X_valid)
Answering your questions:
Is this applying only the trained vocab and model?
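As stated by @G. Anderson in a comment on your question, when you call "fit" (or "fit_transform"), you refit the TF-IDF dictionary to your new data, which gives new weights to the words (I assume you know what TF-IDF is). A saved vocabulary alone does not carry those weights, so save the fitted vectorizer itself during training, then reload it and call only "transform":
# during training: save the fitted vectorizer (vocabulary and IDF weights); the file name is an assumed example
pickle.dump(tfidf_vectorizer, open('./vocab/' + '_vectorizer.pkl', 'wb'))
# during validation: reload it and only transform
tfidf_vectorizer = pickle.load(open('./vocab/' + '_vectorizer.pkl', 'rb'))
valid_vector = tfidf_vectorizer.transform(validation_notes)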
Assuming that you apply the corrections above, the second question can be answered:
Is it not "learning" anything new from the validation set?
No, you're just validating it. You reuse the same TF-IDF vectorization because you want to encode the new entries based on your original data, which gives you a fixed set of weights for the words your model relies on. If you keep changing your TF-IDF dictionary, you'll get different weights (they may average out over a lot of data, but I assume that is not the case here).
So, once you have a model and a TF-IDF calculation, everything is fixed; nothing more is learnt unless you log data to further enhance the model.
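One way to check that nothing is relearned from the validation notes is to compare vectorizers directly. The sketch below uses made-up notes and an in-memory pickle round trip instead of files. It assumes the default TfidfVectorizer settings (use_idf=True), under which a vectorizer rebuilt from the vocabulary alone has no IDF weights until it is fitted.

```python
import pickle

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up notes standing in for training_notes and validation_notes.
training_notes = [
    "patient reports chest pain",
    "no chest pain today",
    "patient denies fever",
]
validation_notes = ["fever and chest pain", "patient reports cough"]

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(training_notes)

# Serialise the whole fitted vectorizer (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(tfidf_vectorizer))

# The restored vectorizer produces exactly the same validation features.
valid_original = tfidf_vectorizer.transform(validation_notes).toarray()
valid_restored = restored.transform(validation_notes).toarray()
same_features = bool((valid_original == valid_restored).all())

# A vectorizer rebuilt from the vocabulary alone has no IDF weights yet,
# so calling transform on it is expected to raise NotFittedError.
vocab_only = TfidfVectorizer(vocabulary=tfidf_vectorizer.vocabulary_)
try:
    vocab_only.transform(validation_notes)
    vocab_only_works = True
except NotFittedError:
    vocab_only_works = False

# Refitting on the validation notes keeps the columns but relearns the weights.
refit = TfidfVectorizer(vocabulary=tfidf_vectorizer.vocabulary_)
refit.fit_transform(validation_notes)
idf_changed = not bool((refit.idf_ == tfidf_vectorizer.idf_).all())

print(same_features, vocab_only_works, idf_changed)
```

If idf_changed is true for your own data, fit_transform on the validation set was quietly replacing the training weights even though the matrix shape stayed the same.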
I'm working on a Bag of Words problem with a dataset which has two columns - summary and solution. I'm using KNN for it. The train dataset has 91 columns and the test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print( vectorizer.fit_transform(dataset[0]).todense() )
print( vectorizer.vocabulary_ )
I trained it.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now, I'm testing it.
y_pred = classifier.predict(test_bow_set)
Here, I'm getting below error when I test it:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
**ValueError: query data dimension must match training data dimension**
I guess you are fitting the vectorizer on the test data again instead of using the transform function.
Make sure you are doing the following.
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer fitted on the training data (and all the other preprocessing elements, actually) and only use transform afterwards. Fitting again can change the vocabulary, and with it the columns of the matrix, which profoundly changes the results.
train_bow_set = vectorizer.transform(dataset[0]).todense()
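Put together, the fix suggested in these answers could look like the sketch below. The summaries, solutions, and the names train_summaries, train_solutions, and test_summaries are made up for illustration; only the fit_transform and transform pattern is the point.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Made-up "summary" and "solution" columns (stand-ins for dataset[0] and dataset[1]).
train_summaries = [
    "printer not working",
    "cannot print document",
    "password reset needed",
    "forgot my password",
    "screen is flickering",
    "monitor flickers at startup",
]
train_solutions = ["printer", "printer", "account", "account", "display", "display"]
test_summaries = ["printer prints blank pages", "need a new password"]

vectorizer = CountVectorizer()
# Fit once, on the training summaries only.
train_bow_set = vectorizer.fit_transform(train_summaries)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, train_solutions)

# transform maps the test summaries onto the training vocabulary,
# so the query dimension matches the training dimension.
test_bow_set = vectorizer.transform(test_summaries)
y_pred = classifier.predict(test_bow_set)

print(train_bow_set.shape, test_bow_set.shape)
print(y_pred)
```

With matching column counts, the "query data dimension must match training data dimension" error should no longer occur.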
In the mlxtend library, there is an ensemble-learning meta-classifier for stacking called "StackingClassifier".
Here is an example of a StackingClassifier function call:
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
What is meta_classifier here? What is it used for?
What is stacking?
Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs -- meta-features -- of the individual classification models in the ensemble.
Source: StackingClassifier-mlxtend
So the meta_classifier parameter lets us choose the classifier that is fitted on the outputs of the individual models.
Example:
Assume that you have used 3 binary classification models, say LogisticRegression, DT & KNN, for stacking. Let's say 0, 0, 1 are the classes predicted by the models. Now we need a classifier that turns these predicted values into one final class, for example by something like majority voting. That classifier is the meta_classifier, and in this example it would likely pick 0 as the predicted class. Unlike a fixed voting rule, though, the meta_classifier is itself trained, so it learns how much to trust each model.
You can extend this to predicted probabilities as well.
Refer to the mlxtend API documentation for more info.
The meta-classifier is the one that takes in all the predicted values of your models. In your example you have three classifiers clf1, clf2, clf3; let's say clf1 is naive Bayes, clf2 is a random forest, and clf3 is an SVM. For every data point x_i in your dataset, all three models run and compute h_1(x_i), h_2(x_i), h_3(x_i), where h_1, h_2, h_3 are the functions learned by clf1, clf2, clf3. These three models give three predicted y_i values, and they can run in parallel. A model is then trained on these predicted values; it is known as the meta-classifier, and in your case it is logistic regression.
So for a new query point x_q, the prediction is calculated as y_q = h'(h_1(x_q), h_2(x_q), h_3(x_q)), where h' (h dash) is the function learned by the meta-classifier.
The advantage of a meta-classifier or ensemble model is this: suppose clf1 gives an accuracy of 90%, clf2 gives 92%, and clf3 gives 93%. The final model trained with the meta-classifier can then often reach an accuracy above 93%, although this is not guaranteed. Stacking classifiers are used extensively in Kaggle competitions.
meta_classifier is simply the classifier that makes the final prediction by using the predictions of all the other classifiers as features. So, it takes the classes predicted by the various classifiers and picks the final one as the result that you need.
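To make the mechanics concrete, here is a hand-rolled sketch of stacking that uses plain scikit-learn estimators instead of mlxtend's StackingClassifier. The synthetic data, the helper meta_features, and the choice of base models are assumptions for illustration. Following the description quoted above, the meta-classifier is fitted on the base models' predictions for the training set; fitting it on cross-validated predictions is a common alternative that reduces overfitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary classification data (made up for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, y_train = X[:150], y[:150]
X_query = X[150:]

# h_1, h_2, h_3: the individual classifiers, each trained on the full training set.
base_models = [
    GaussianNB(),
    RandomForestClassifier(n_estimators=50, random_state=0),
    SVC(random_state=0),
]
for model in base_models:
    model.fit(X_train, y_train)


def meta_features(models, data):
    """Stack each model's predicted class into one column per model."""
    return np.column_stack([m.predict(data) for m in models])


# h': the meta-classifier, trained on the base models' predictions.
meta_classifier = LogisticRegression()
meta_classifier.fit(meta_features(base_models, X_train), y_train)

# y_q = h'(h_1(x_q), h_2(x_q), h_3(x_q)) for every query point.
query_meta = meta_features(base_models, X_query)
y_query = meta_classifier.predict(query_meta)
print(query_meta[:5])
print(y_query[:5])
```

Each row of query_meta holds the three base predictions for one query point, which is exactly what the meta_classifier receives as its input features.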
I am working on a program where I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I would like to obtain the classifier's accuracy and then classify the unlabeled data. My problem is that I am testing it with 2 classifiers (LDA and QDA). With the first one I obtain an accuracy of 81%, and when I classify the unlabeled data (39 objects) it classifies everything correctly. However, when I use QDA I obtain an accuracy of 93.74%, and when it classifies the unlabeled data (the same 39 objects) it labels 3 of them with the wrong group. Can someone help me find my errors?
My code:
#"listaTrain" has a list of dictionaries which are the labeled data and will be used for
# training and Cross-Validation
#"listaLabels" has a list of the train labels
#"listaClasificar" has a list of dictionaries which are the unlabeled data
# which I want to label
#"clasificador" is my classifier
X=vec.fit_transform(listaTrain) #I transform the dictionaries to
#a format that sklearn can use
X=preprocessing.scale(X.toarray()) #I scale the values
clasificador.fit(X, listaLabels) #I train the classifier with the train data and
# the train labels
n_samples = X.shape[0]
cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
#I make Cross-Validation dividing the X's data (40% for training and 60% for testing)
scores = cross_validation.cross_val_score(clasificador, X, listaLabels, cv=cv)
#I obtain the Cross-validation accuracy
scores.mean() #I obtain the accuracy mean (here is where i obtain 81% and 93%)
testX=vec.transform(listaClasificar) #I transform the dictionaries to a
#format that sklearn can use
testX=preprocessing.scale(testX.toarray()) #I scale the values
predicted=clasificador.predict(testX) #I predict the labels of the unlabeled data
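One thing that may be worth checking in the code above, offered as an assumption rather than a diagnosis: preprocessing.scale standardises each array with its own mean and standard deviation, so the unlabeled data ends up on a different scale than the training data. The sketch below uses made-up dictionaries and feature names ("age", "pain"), and it assumes vec is a DictVectorizer. It shows the difference between that and reusing a StandardScaler fitted on the training data.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler, scale

# Made-up dictionaries standing in for listaTrain and listaClasificar.
listaTrain = [
    {"age": 40, "pain": 3},
    {"age": 55, "pain": 7},
    {"age": 62, "pain": 8},
    {"age": 35, "pain": 2},
]
listaClasificar = [{"age": 58, "pain": 9}, {"age": 60, "pain": 8}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(listaTrain)

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

testX = vec.transform(listaClasificar)
# Consistent: the unlabeled data is scaled with the training statistics.
testX_consistent = scaler.transform(testX)
# Independent: the unlabeled data is scaled with its own statistics.
testX_independent = scale(testX)

print(testX_consistent)
print(testX_independent)
print(np.allclose(testX_consistent, testX_independent))
```

If the two results differ noticeably on the real data, fitting the scaler once and reusing it for the unlabeled objects keeps them in the feature space the classifier was trained on.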