Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to build a sentiment analysis classifier in Python. However, here's what I don't understand: it seems that the data they use is already labeled. So, how do I take the training I did on the labeled data and apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning or NLTK ...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:,1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because I got a ValueError telling me it can't convert from string to float otherwise.

how do I take the training I did on the labeled data and apply it to unlabeled data?
This is exactly the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns will hopefully be useful for predicting the labels of unseen unlabeled data.
For example, in a (sad, happy) sentiment-analysis problem, the patterns a model might discover during training are:
Existence of one or more of these words means sad:
("misery", 'sad', 'displaced people', 'homeless'...)
Existence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
When a new textual document is given, we search for these patterns inside it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training; instead we hold out a small portion of the dataset (separate from the training set) to validate our model and verify that it discovered really generic patterns, not ones tailored specifically to the training data.
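Concretely, for the question above: fit the vectorizer and the classifier once on the labeled data, then only transform (never fit_transform) the unlabeled text before calling predict, so that the new samples land in the same feature space. A minimal sketch along the lines of the question's code, assuming chat_data, chat_labels, and the unlabeled CSV exist as in the question:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Fit ONCE, on the labeled data
vectorizer = TfidfVectorizer(max_features=2500)
X_train = vectorizer.fit_transform(chat_data).toarray()
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, chat_labels)

# Predict on the unlabeled data: transform only, so the columns
# line up with the features the classifier was trained on
unlabeled = pd.read_csv("chat.csv").iloc[:, 1].values.astype('U')
X_new = vectorizer.transform(unlabeled).toarray()
predictions = text_classifier.predict(X_new)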

Related

I get one set of feature importances after LOOCV. Is it the result of all LOOCV model sets or of one mean model? (random forest)

I'm not used to LOOCV yet, and I've been curious about the problem in the title.
I did leave-one-out cross-validation for my random forest model.
My code is like this:
y_tests, y_preds = [], []  # accumulated across all folds
for train_index, test_index in loo.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model = RandomForestRegressor()
    model.fit(x_train, y_train.values.ravel())
    y_pred = model.predict(x_test)
    #y_pred = [np.round(x) for x in y_pred]
    y_tests += y_test.values.tolist()[0]
    y_preds += list(y_pred)
rr = metrics.r2_score(y_tests, y_preds)
ms_error = metrics.mean_squared_error(y_tests, y_preds)**0.5
After that, I wanted to get the feature importances of my model like this:
features = x.columns
sorted_idx = model.feature_importances_.argsort()
The result is pretty different from what I expected.
In the LOOCV process, my computer built many different models, each using a different train/test split of my original data, so the number of models is literally the same as the length of the original data.
So I was thinking the feature importances should be as numerous as the length of the original data, because the test set of each LOOCV iteration is just one sample.
But the result was not multiple importances, just one.
It was only one feature importance, calculated as if LOOCV had never been added to my code.
Then why did I get only one importance? I want to understand the reason for it.
Thank you for reading my question.
I want to know why I got only one feature importance even though LOOCV was added to my code.
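For reference: in the loop above, model is re-assigned on every fold, so model.feature_importances_ read after the loop comes from the last fold's model only, which is why a single importance vector appears. A minimal sketch (following the question's loop and variable names) that keeps one importance vector per fold and then averages them:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
importances = []  # one importance vector per LOOCV fold
for train_index, test_index in loo.split(x):
    x_train, y_train = x.iloc[train_index], y.iloc[train_index]
    model = RandomForestRegressor()
    model.fit(x_train, y_train.values.ravel())
    importances.append(model.feature_importances_)  # saved before the next fold overwrites model

importances = np.array(importances)         # shape: (n_samples, n_features) for LOOCV
mean_importance = importances.mean(axis=0)  # aggregate importance across all folds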

KNN query data dimension must match training data dimension

I'm working on a bag-of-words problem with a dataset that has two columns: summary and solution. I'm using KNN for it. The train dataset has 91 columns and the test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print( vectorizer.fit_transform(dataset[0]).todense() )
print( vectorizer.vocabulary_ )
I trained it.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now, I'm testing it.
y_pred = classifier.predict(test_bow_set)
This is the error I get when I test it:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
**ValueError: query data dimension must match training data dimension**
I guess you are fitting the vectorizer on the test data again instead of using the transform function.
Make sure you are doing the following:
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer from training (all the preprocessing elements, actually) and only use transform. Fitting again will profoundly change the results:
train_bow_set = vectorizer.transform(dataset[0]).todense()
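A small self-contained sketch of the correct pattern (toy texts and labels made up for illustration); note that transform keeps the test matrix in the dimensions of the training vocabulary:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["server crashed overnight", "password reset request", "disk full on server"]
train_labels = ["outage", "account", "outage"]
test_texts = ["another password problem"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit: learns the vocabulary
X_test = vectorizer.transform(test_texts)        # transform only: reuses that vocabulary

print(X_train.shape[1] == X_test.shape[1])       # True: dimensions match

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, train_labels)
print(clf.predict(X_test))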

Add features to Multinomial Naive Bayes classifier - Python

Using MultinomialNB() from scikit-learn in Python, I want to classify documents not only by the word features in the documents but also by sentiment dictionaries (meaning just word lists, not the Python dict type).
Suppose these are the documents to train on:
train_data = ['i hate who you welcome for','i adore him with all my heart','i can not forget his warmest welcome for me','please forget all these things! this house smells really weird','his experience helps a lot to complete all these tedious things', 'just ok', 'nothing+special today']
train_labels = ['Nega','Posi','Posi','Nega','Posi','Other','Other']
psentidict = ['welcome','adore','helps','complete','fantastic']
nsentidict = ['hate','weird','tedious','forget','abhor']
osentidict = ['ok','nothing+special']
I can train on the lists like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', naive_bayes.MultinomialNB(alpha=1.0))])
text_clf = text_clf.fit(train_data, train_labels)
Even though I trained the data by counting all tokens according to their corresponding labels, I want to use my sentiment dictionaries as additional classifying features.
This is because with features trained from the dictionaries, it becomes possible to predict OOV (out-of-vocabulary) tokens. With only clumsy Laplace smoothing (alpha = 1.0), overall accuracy would be severely limited.
test_data = 'it is fantastic'
predicted_labels = text_clf.predict([test_data])
With the dictionary features added, it would be possible to predict the sentence above even though every single token is outside the training documents.
How do I add the features of psentidict, nsentidict, and osentidict to the Multinomial Naive Bayes classifier? (Training them like documents could distort the measurement, so I think it is better to find another way.)
I believe there is no other way to include those features in your Multinomial Naive Bayes model. This is simply because you want to associate some sort of label with the features (say, 'positive' for the values in psentidict, and so on), and that can only be achieved by training your model on such pairs of features and labels. What you can do is improve the model by creating sentences with those features rather than using the words directly. For example, for the word 'hate' you could instead use 'I hate you with all my heart' labeled as 'negative', instead of using only the pair 'hate':'negative'. So you should create more such examples for your dataset.
Hope this link helps.
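A minimal sketch of that augmentation idea, using the question's dictionaries (the template sentences are made up for illustration): each dictionary word is wrapped in a short sentence and appended to the training data with the matching label, so a token like 'fantastic' is no longer out of vocabulary:
# Turn the dictionary words into short labeled training sentences
aug_data = (["i feel %s about it" % w for w in psentidict]
            + ["i feel %s about it" % w for w in nsentidict]
            + ["it was %s" % w for w in osentidict])
aug_labels = (['Posi'] * len(psentidict)
              + ['Nega'] * len(nsentidict)
              + ['Other'] * len(osentidict))

# Refit the same pipeline on the augmented data
text_clf = text_clf.fit(train_data + aug_data, train_labels + aug_labels)
print(text_clf.predict(['it is fantastic']))  # 'fantastic' is now in the vocabulary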

CountVectorizer not working for test string in sklearn

I've been working on sentiment analysis using sklearn. I have a csv file of 3000-odd reviews and I am training my classifier on 60% of those reviews.
When I try to give a custom review for the classifier to predict the label using CountVectorizer.transform(), it throws the following error:
Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 864, in transform
raise ValueError("Vocabulary wasn't fitted or is empty!")
ValueError: Vocabulary wasn't fitted or is empty!
Please help me. This is the code for fitting the training set:
def preprocess():
    data, target = load_file()
    count_vectorizer = CountVectorizer(binary='true', min_df=1)
    data = count_vectorizer.fit_transform(data)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    return tfidf_data
And this is the code for predicting the sentiment of a custom review:
def customQuestionScorer(question, clf):
    X_new_tfidf = vectorizer.transform([question]).toarray()
    print(clf.predict(X_new_tfidf))

q = "I really like this movie"
customQuestionScorer(q, classifier)
I don't see a classifier here; you are using only transformers (CountVectorizer, TfidfTransformer). To get predictions, you must train a classifier on the output of TfidfTransformer.
It's also not clear whether you are using the same CountVectorizer and TfidfTransformer (fitted earlier on the training set) to transform the test-set texts, or new ones. To provide correct input to a previously fitted classifier, you have to feed it output from the previously fitted transformers, not new ones.
Look here for a good example of text processing: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#example-model-selection-grid-search-text-feature-extraction-py
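Along those lines, a minimal sketch of the fix (load_file is assumed from the asker's project, and MultinomialNB is a hypothetical classifier choice): return the fitted transformers from preprocess and reuse exactly those objects when scoring a custom review:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

def preprocess():
    data, target = load_file()  # load_file as in the asker's project (assumed)
    count_vectorizer = CountVectorizer(binary='true', min_df=1)
    counts = count_vectorizer.fit_transform(data)
    tfidf_transformer = TfidfTransformer(use_idf=False)
    tfidf_data = tfidf_transformer.fit_transform(counts)
    # Return the fitted transformers so prediction can reuse them
    return tfidf_data, target, count_vectorizer, tfidf_transformer

tfidf_data, target, count_vectorizer, tfidf_transformer = preprocess()
classifier = MultinomialNB().fit(tfidf_data, target)  # hypothetical classifier choice

def customQuestionScorer(question, clf):
    counts = count_vectorizer.transform([question])  # transform, NOT fit_transform
    X_new_tfidf = tfidf_transformer.transform(counts)
    print(clf.predict(X_new_tfidf))

customQuestionScorer("I really like this movie", classifier)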

Why should we perform K-fold cross-validation on the test set?

I was working on a k-nearest neighbours problem set. I couldn't understand why they are performing K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performed on the entire test data, rather than doing a cross-validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
    kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
    scores = []
    for _, test_index in kFolds:
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless'?
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators, but it is useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
1. Training your estimator fits some internal parameters that you need not ever see directly. You optimize these by training on the training set.
2. Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them on a different subset of your data; call it the validation set.
3. Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases, partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters, and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one such strategy.
Another use for splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data?'. The hope is that this is more representative of what you might see as real-world performance on novel data. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
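For illustration, a minimal sketch of that idea using the current sklearn module names (KFold from sklearn.model_selection rather than the deprecated sklearn.cross_validation), assuming clf, X_test, and Y_test from the question's code:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def compute_test_scores(test_x, test_y, clf, cv=5):
    scores = []
    for _, test_index in KFold(n_splits=cv).split(test_x):
        # Score the already-fitted estimator on this slice of the test set
        scores.append(accuracy_score(test_y[test_index], clf.predict(test_x[test_index])))
    return np.array(scores)

scores = compute_test_scores(X_test, Y_test, clf)
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))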
If you make a program that adapts to its input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
To see whether you have made a good or a bad model, you need to test it on some other data that is not what you used to build the model. This is why you separate your data into 2 parts.
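A tiny sketch of that point (a deliberately overfit k=1 neighbour model on iris): the score on the data the model adapted to is optimistic, while the held-out score is the honest estimate:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))  # optimistic: model adapted to this data
print("test accuracy:", knn.score(X_test, y_test))     # honest estimate on unseen data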
