Sentiment Classification from own Text Data using NLTK - python

What I am going to ask may sound very similar to the post Sentiment analysis with NLTK python for sentences using sample data or webservice? , But I am done with Parsing and Tokenization of sentences from text. My question is
Whatever examples till now I have seen in NLTK movie review example seems to be most similar to my problem, But for movie_review the training text is already in a form as it has two folders pos and neg and text are stored there. How can I do that classification for my huge text, Do I read data manually and store them into two folders. Does that make the corpus. After that can I work with them just like movie_review data in example?
2.If the answer to the above question is yes, is there any way to speed up that task by any tool. For example I want to work with only the texts which has "Monty Python" in there content. And then I classify them manually and then store them in pos and neg folder. Does that work?
Please help me

Yes, you need a training corpus to train a classifier. Or you need some other way to detect sentiment.
To create a training corpus, you can classify by hand, you can have others classify it for you (mechanical turk is popular for this), or you can do corpus bootstrapping. For sentiment, that could involve creating 2 lists of keywords, positive words and negative words. Using those, you can create an initial training corpus, correct it by hand, then train a classifier. This is an iterative process, and the key thing to remember is "garbage in, garbage out". In other words, if your training corpus is wrong, you can't expect your classifier to be right.

Related

How to get sentiment tag using word2vec

I'm working on word2vec model in order to analysis a corpus of newspaper.
I have a csv which contains some newspaper like tital, journal, and the content of the article.
I know how to train my model in order to get most similar words and their context.
However, I want to do a sentiment analysis on that. I found some ressources in order to do that but in all the test or train dataframe in the examples, there is already a column sentiment (0 or 1). Do you if it's possible to classify automaticaly texts by sentiment ? I mean put 0 or 1 to each text. I search but i don't find any references about that in the word2vec or doc2vec documentation...
Thanks for advance !
Both Word2Vec & Doc2Vec are just ways to turn words or lists-of-words into 'dense' vectors. Alone, they won't tell you sentiment.
When you have a text and want to deduce which categories it belongs to, that's called 'text classification'. Specifically, if you have just two categories (like 'positive-sentiment' vs 'negative-sentiment', or 'spam' vs 'not-spam'), that's called 'binary classification'.
The output of a Word2Vec or Doc2Vec model might be helpful in that task, but mainly as input to some other chosen 'classifier' algorithm. And, such algorithms require some 'labeled examples' of each kind of text - where you supply the right answer – in order to work. So, you will likely have to go through your corpus of newspaper articles & mark a bunch of them with the answer you want.
You should start by working through some examples that use scikit-learn, the most popular Python library with text-classification tools, even without any Word2Vec or Doc2Vec features, at first. For example, in its docs is an intro:
"Working With Text Data"
Only after you've set up some basic code using generic preprocess/feature-extraction/training/evaluation steps, and reviewed some actual results, should you then consider if adding some features based on Word2Vec or Doc2Vec might help.

Text Preprocessing for classification - Machine Learning

what are important steps for preprocess our Twitter texts to classify between binary classes. what I did is that I removed hashtag and keep it without hashtag, I also used some regular expression to remove special char, these are two function I used.
def removeusername(tweet):
return " ".join(word.strip() for word in re.split('#|_', tweet))
def removingSpecialchar(text):
return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text).split())
what are other things to preprocess textdata. I have also used nltk stopword corpus to remove all stop words form the tokenize words.
I used NaiveBayes classifer in textblob to train data and I am getting 94% accuracy on training data and 82% on testing data. I want to know is there any other method to get good accuracies. By the way I am new in this Machine Learning field, I have a limited idea about all of it!
Well then you can start by play with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words). And also do the same with words that appear in only one tweet (misspelled words for example). Sklearn CountVectorizer allow to do this in an easy way have a look min_df and max_df parameters.
Since you are working with tweets you can also think in URL strings. Try to obtain some valuable information from links, there are lots of different options from simple stuff based on regular expressions that retrieve the domain name of the page to more complex NLP based methods that study the link content. Once more it's up to you!
I would also have a look at pronouns (if you are using sklearn) since by default replaces all of them to the keyword -PRON- . This is a classic solution that simplifies things but might end in a loss of information.
For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
Then a second step preprocessing is possible:
Construct a TFIDF matrix.
Construct or load pretrained wordEmbedding (Word2Vec, Fasttext, ...).
Then you can load result of the second steps into your model.
These are just the most common "method", many others exists.
I will let you check each one of these methods by yourself, but it is a good base.
There are no compulsory steps. For example, it is very common to remove stop words (also called functional words) such as "yes" , "no" , "with". But - in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field , so the most important advice is to build a pipeline that run as quickly as possible, to define your goal, and to train with different parameters.
Before you move on, you need to make sure you training set is proper. What are you training for ? is your set clean (e.g the positive has only positives)? how do you define accuracy and why?
Now, the situation you described seems like a case of over-fitting. Why? because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but relatively small training dataset - so the model is fitted best for the specific train set but fails to generalize.
Now, you did not specify the how large is your dataset, so I'm guessing between 50 and 500 tweets, which is too small given the English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features, for example you can remove uncommon words, names - anything words that appears only small number of times
(3) Using a better classifier (Bayes is rather weak for NLP). Try SVM, or Deep Learning.
(4) Try regularization techniques

scikit-learn: classifying texts using custom labels

I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words. Some that should pretty much guarantee that a particular text is classed as "positive" while others guarantee that it's classed as "negative". And lastly, is there a way to imply that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
There are many ways to do this like Tf-Idf (Term frequency - Inverse Document Frequency), Count Vectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec.
Among all of above mentioned methods, Word2Vec is the best method. You can use a pre-trained model by Google for Word2Vec, available on:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors

Machine learning with naive bayes on non english words

I use text blob library of python, and the Naive bayes classifier of text blob. I have learned that it uses nltk naive bayes classifier. Here is the question: My input sentences are non-english (Turkish). Will it be possible? I don't know how it works. But I tried 10 training data, and it seems to work. I wonder how it works, this naive babes classifier of nltk, on non-English data. What are the disadvantages?
Although a classifier trained for English is unlikely to work on other languages, it sounds like you are using textblob to train a classifier for your text domain. Nothing rules out using data from another language, so the real question is whether you are getting acceptable performance. The first thing you should do is test your classifier on a few hundred new sentences (not the ones you trained it on!). If you're happy, that's the end of the story. If not, read on.
What makes or breaks any classifier is the selection of features to train it with. The NLTK's classifiers require a "feature extraction" function that converts a sentence into a dictionary of features. According to its tutorial, textblob provides some kind of "bag of words" feature function by default. Presumably that's the one you're using, but you can easily plug in your own feature function.
This is where language-specific resources come in: Many classifiers use a "stopword list" to discard common words like and and the. Obviously, this list must be language-specific. And as #JustinBarber wrote in a comment, languages with lots of morphology (like Turkish) have more word forms, which may limit the effectiveness of word-based classification. You may see improvement if you "stem" or lemmatize your words; both procedures transform different inflected word forms to a common form.
Going further afield, you didn't say what your classifier is for but it's possible that you could write a custom recognizer for some text properties, and plug them in as features. E.g., in case you're doing sentiment analysis, some languages (including English) have grammatical constructions that indicate high emotion.
For more, read a few chapters of the NLTK book, especially the chapter on classification.

NLTK - Multi-labeled Classification

I am using NLTK, to classify documents - having 1 label each, with there being 10 type of documents.
For text extraction, I am cleaning text (punctuation removal, html tag removal, lowercasing), removing nltk.corpus.stopwords, as well as my own collection of stopwords.
For my document feature I am looking across all 50k documents, and gathering the top 2k words, by frequency (frequency_words) then for each document identifying which words in the document that are also in the global frequency_words.
I am then passing in each document as hashmap of {word: boolean} into the nltk.NaiveBayesClassifier(...) I have a 20:80 test-training ratio in regards to the total number of documents.
The issues I am having:
Is this classifier by NLTK, suitable to multi labelled data? - all examples I have seen are more about 2-class classification, such as whether something is declared as a positive or negative.
The documents are such that they should have a set of key skills in - unfortunately I haven't got a corpus where these skills lie. So I have taken an approach with the understanding, a word count per document would not be a good document extractor - is this correct? Each document has been written by individuals, so I need to leave way for individual variation in the document. I am aware SkLearn MBNaiveBayes which deals with word count.
Is there an alternative library I should be using, or variation of this algorithm?
Thanks!
Terminology: Documents are to be classified into 10 different classes which makes it a multi-class classification problem. Along with that if you want to classify documents with multiple labels then you can call it as multi-class multi-label classification.
For the issues which you are facing,
nltk.NaiveBayesClassifier() is a out-of-box multi-class classifier. So yes you can use this to solve this problem. As per the multi-labelled data, if your labels are a,b,c,d,e,f,g,h,i,j then you have to define label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.
Feature extraction is the hardest part of Classification (Machine learning). I recommend you to look into different algorithms to understand and select the one best suits for your data(without looking at your data, it is tough to recommend which algorithm/implementation to use)
There are many different libraries out there for classification. I personally used scikit-learn and i can say it was good out-of-box classifier.
Note: Using scikit-learn, i was able to achieve results within a week, given data set was huge and other setbacks.

Categories