I have used TF-IDF to extract features from a sentiment-annotated dataset, and I have used the extracted features to train an ML model with the random forest algorithm. Is it possible for me to now input a sentence into the model and have it return what it believes the sentiment is?
I would need to take that sentence and convert it to TF-IDF values for my model to understand it.
Do I need to recalculate TF-IDF values for the entire dataset in order to get the values for this new sentence?
Does anyone know a way of doing this, preferably in Python?
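A minimal sketch of how this usually works with scikit-learn, assuming you keep the TfidfVectorizer that was fitted on the training corpus (the texts and labels below are placeholders): a new sentence only needs the vectorizer's transform() method, which reuses the vocabulary and IDF weights already learned, so nothing has to be recalculated for the whole dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder annotated data; substitute your own dataset
train_texts = ["I loved this film", "Terrible and boring", "What a great read", "Awful service"]
train_labels = ["pos", "neg", "pos", "neg"]

# Fit the vectorizer once on the training corpus and keep it
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, train_labels)

# A new sentence only needs transform(); the IDF weights learned from the
# training corpus are reused, so the full dataset is not recalculated
new_sentence = ["The plot was wonderful but the ending fell flat"]
print(clf.predict(vectorizer.transform(new_sentence)))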
I have data with two important columns, Product Name and Product Category, and I want to classify a search term into a category. The approach (in Python, using Sklearn & DaskML) to create a classifier was:
Clean Product Name column for stopwords, numbers, etc.
Create a 90%/10% train-test split
Convert text to vectors using OneHotEncoder
Create classifier (Naive Bayes) on the training data
Test the classifier
I realized that the OneHotEncoder (or any encoder) converts the text to numbers by creating a matrix that takes into account where and how many times a word occurs.
Q1. Do I need to convert from words to vectors before the train-test split or after it?
Q2. When I search for new words (which may not be in the training text already), how will I classify them? If I encode the search term on its own, it will be unrelated to the encoder fitted on the training data. Can anybody help me with an approach so that I can classify a search term into a category even if the term doesn't exist in the training data?
Q1. Do I need to convert from words to vectors before the train-test split?
Answer: Every algorithm takes its input as some numeric representation, so you have to convert from words to vectors; there is no alternative to this. Apart from OneHotEncoder, there are other approaches like CountVectorizer and TfidfVectorizer, which are recommended instead of one-hot encoding. You can read more about them here.
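As a rough sketch of that recommendation (the product names and categories below are made up): split first, fit the vectorizer on the training split only, and reuse the same fitted vectorizer to transform the test split and any new search term. Words the vectorizer has never seen are simply dropped at transform time, so an unseen term still produces a vector the classifier can score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up product names and categories
names = ["red cotton shirt", "wireless optical mouse", "blue denim jeans", "usb mechanical keyboard"]
categories = ["apparel", "electronics", "apparel", "electronics"]

# Split first, then vectorize
X_train, X_test, y_train, y_test = train_test_split(names, categories, test_size=0.1, random_state=0)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)   # fit the vocabulary on the training text only
X_test_vec = vectorizer.transform(X_test)         # reuse that vocabulary for the test split

clf = MultinomialNB().fit(X_train_vec, y_train)
print(clf.score(X_test_vec, y_test))

# A new search term goes through the same fitted vectorizer; words it has
# never seen are ignored rather than breaking the encoding
print(clf.predict(vectorizer.transform(["green shirt"])))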
I am trying to solve an NLP multilabel classification problem. I have a huge number of documents that should be classified into 29 categories.
My approach, after cleaning up the text (stop-word removal, tokenizing, etc.), was the following:
To create the feature matrix I looked at the frequency distribution of the terms in each document, created a table of these terms (with duplicates removed), and calculated the term frequency of each word in its corresponding text (tf). Eventually I ended up with around 1000 terms and their respective frequencies in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score of around 0.58, but it is not improving at all, and I need to get to 0.62.
Am I handling the problem correctly?
Do I need to use a tf-idf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give common words more importance than necessary. Use the tf-idf method instead, which gives importance to words that are rare and unique to a particular document in the dataset.
Also, before selecting the K best features, train on the whole set of features and then use feature importances to pick the best ones.
You can also try tree-based classifiers or XGBoost for a better model, though SVC is also a very good classifier.
Try using Naive Bayes as a baseline F1 score and then improve your results with other classifiers with the help of grid search.
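A hedged sketch of those suggestions put together, with placeholder documents standing in for the 29-category data: TfidfVectorizer replaces the hand-built tf features, OneVsRestClassifier wraps a LinearSVC for the multilabel setup, and a small grid search tunes a couple of knobs that often move the F1 score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Placeholder documents and their (possibly multiple) category labels
docs = [
    "contract dispute over unpaid invoice",
    "new vaccine trial shows promising results",
    "court rules on patent infringement case",
    "hospital reports rise in flu admissions",
    "appeal filed against the court ruling",
    "doctors recommend the new flu vaccine",
]
labels = [["legal"], ["health"], ["legal"], ["health"], ["legal"], ["health"]]
y = MultiLabelBinarizer().fit_transform(labels)   # one binary column per category

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# Small grid over knobs that often move the F1 score
params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, params, scoring="f1_micro", cv=3)
search.fit(docs, y)
print(search.best_params_, search.best_score_)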
I am new to Python and NLTK. I have a model created for sentiment analysis of surveys in NLTK (NaiveBayesClassifier). To improve the accuracy, I wanted to add a dictionary containing lists of positive and negative statements to the model. Is there any module in NLTK for this, and are there any additional features that can improve my model?
You can have a look at some public sentiment lexicons, which would provide you with a corpus of positive and negative words.
One of them can be found at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Since you haven't specified any details about your current model, I'm assuming you are using a very basic Naive Bayes classifier. If you are using unigrams (words) to vectorize your text right now, then you can consider using bigrams or trigrams for generating the feature vectors. This would basically enable you to use the contextual information of the words to a certain extent.
If you are currently using a bag-of-words model like tf-idf to convert your text to vectors, then you can consider using word embeddings instead. Bag of words doesn't consider the contextual information of the words, whereas word embeddings are able to capitalise on that.
You could use something like gensim, which trains neural word embeddings (word2vec) to convert words to vectors. Have a look at: https://radimrehurek.com/gensim/models/word2vec.html
Furthermore, you can always try a LinearSVC classifier or a logistic regression classifier and choose whichever one gives the best accuracy.
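For the bigram suggestion, a small illustrative sketch (the survey snippets are invented): ngram_range controls whether unigrams, bigrams, or trigrams become features, and the same pipeline can be rerun with LinearSVC or LogisticRegression to compare accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented survey responses with sentiment labels
texts = ["not good at all", "really good service", "not bad actually", "bad experience overall"]
labels = ["neg", "pos", "pos", "neg"]

for clf in (LinearSVC(), LogisticRegression(max_iter=1000)):
    # (1, 2) keeps unigrams and bigrams, so phrases like "not good" and
    # "not bad" become features of their own instead of isolated words
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["not good service"]))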
You can download one from NLTK, for example:
import nltk
nltk.download('opinion_lexicon')  # one-time download of the lexicon
from nltk.corpus import opinion_lexicon
pos_list = set(opinion_lexicon.positive())
neg_list = set(opinion_lexicon.negative())
I'm using an SVM to predict the label from the title of a book. However, I want to give more weight to some predefined features. For example, if the title of the book contains words like fairy or Alice, I want to label it as a children's book. I'm using a word n-gram SVM. Please suggest how to achieve this using sklearn.
You can add one more feature to your training data: if the name of the book contains one of your predefined words, set it to one, otherwise zero.
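A rough sketch of that idea (the keyword list and book titles are made up): the word n-gram features from the vectorizer are kept, and one extra binary column is appended that is 1 whenever the title contains a predefined keyword.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical keyword list that should push a title towards "children"
keywords = {"fairy", "alice", "wonderland"}

titles = ["alice in wonderland", "a brief history of time",
          "the fairy garden", "principles of economics"]
labels = ["children", "science", "children", "non-fiction"]

def keyword_flag(texts):
    # One extra column: 1 if any predefined keyword appears in the title
    return np.array([[int(any(k in t.lower().split() for k in keywords))] for t in texts])

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = hstack([vectorizer.fit_transform(titles), keyword_flag(titles)])  # n-grams + indicator column

clf = LinearSVC().fit(X, labels)

new_title = ["the fairy queen"]
X_new = hstack([vectorizer.transform(new_title), keyword_flag(new_title)])
print(clf.predict(X_new))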
I have a text dataset in which I have manually classified each record as one of two possible classes. I created a TF-IDF on the corpus (minus English stopwords), trained/tested a Random Forest classifier, evaluated the model, and applied the model to a larger corpus of text. All is good so far, but how can I find out more about my model, i.e., which words are "important" to the model?
The trained RF has a feature_importances_ attribute once it is fitted (no special constructor arguments are needed for it). The feature importances tell you which features (data matrix columns) are influential. To get back to the words, go to the tf-idf vectorizer and use its vocabulary_ attribute (note the trailing underscore), which is a dict from words to column indices.
For an explanation of the vocabulary_ attribute, see this post: sklearn : TFIDF Transformer : How to get tf-idf values of given words in document
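Putting the two attributes together, a small self-contained sketch (placeholder corpus): invert vocabulary_ to map column indices back to words, then rank the columns by feature_importances_.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus with two manually assigned classes
texts = ["refund not received yet", "great product fast shipping",
         "still waiting for my refund", "excellent quality and shipping"]
labels = [0, 1, 0, 1]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# vocabulary_ maps word -> column index; invert it to map columns back to words
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Rank columns by the forest's impurity-based importances
for idx in np.argsort(clf.feature_importances_)[::-1][:10]:
    print(index_to_word[idx], round(clf.feature_importances_[idx], 3))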