I am new to Python and NLTK. I have built a model for sentiment analysis of survey responses in NLTK (NaiveBayesClassifier). To improve its accuracy, I want to add a dictionary containing lists of positive and negative statements to the model. Is there a module in NLTK for this, and are there any additional features that could improve my model?
You can have a look at some public sentiment lexicons which would provide you a corpus of positive and negative words.
One of them can be found at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Since you haven't specified any details about your current model, I'm assuming you are using a very basic Naive Bayes classifier. If you are using unigrams (single words) to vectorize your text right now, you can consider using bigrams or trigrams to generate the feature vectors. This would let you use the contextual information of the words to a certain extent.
If you are currently using a bag-of-words model such as TF-IDF to convert your text to vectors, you can consider using word embeddings instead. A bag-of-words model ignores the context in which words appear, whereas word embeddings are able to capitalise on it.
You could use something like gensim, which trains shallow neural models such as word2vec to convert words to vectors. Have a look at: https://radimrehurek.com/gensim/models/word2vec.html
Furthermore, you can always try a LinearSVC classifier or a logistic regression classifier and choose whichever one gives the best accuracy.
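To make the n-gram suggestion above concrete, here is a minimal sketch of a feature function that adds bigrams to the usual unigram features. The name ngram_features and the "contains(...)" key format are my own assumptions for illustration; the returned dictionary is the kind of feature set an NLTK classifier is trained on.

```python
def ngram_features(tokens, n_max=2):
    """Build a feature dict holding every n-gram of length 1..n_max (hypothetical helper)."""
    features = {}
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            features["contains(%s)" % gram] = True
    return features

tokens = "the service was not good".lower().split()
features = ngram_features(tokens)
print(sorted(features))
```

With bigrams, "not good" becomes a feature of its own, which a unigram model cannot distinguish from "good" appearing anywhere in the text.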
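You can download one from NLTK, like this:
import nltk
from nltk.corpus import opinion_lexicon
nltk.download('opinion_lexicon')  # one-time download of the lexicon data
pos_list = set(opinion_lexicon.positive())  # positive opinion words
neg_list = set(opinion_lexicon.negative())  # negative opinion words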
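One possible way to feed such word lists into the classifier is to turn lexicon hits into extra features. The sketch below is only an illustration: lexicon_features is a hypothetical helper, and the two small sets stand in for pos_list and neg_list so that it runs without downloading anything. Counts are capped because NLTK's Naive Bayes treats feature values as categories.

```python
def lexicon_features(tokens, pos_words, neg_words):
    """Count lexicon hits and bucket them into a few discrete values."""
    pos_hits = sum(1 for t in tokens if t in pos_words)
    neg_hits = sum(1 for t in tokens if t in neg_words)
    return {
        "pos_hits": min(pos_hits, 3),  # cap so the feature takes few distinct values
        "neg_hits": min(neg_hits, 3),
        "more_pos_than_neg": pos_hits > neg_hits,
    }

# Tiny stand-in sets (assumption); in practice pass pos_list and neg_list.
pos_words = {"good", "great", "helpful"}
neg_words = {"bad", "slow", "rude"}

tokens = "the staff were helpful but the checkout was slow and rude".split()
features = lexicon_features(tokens, pos_words, neg_words)
print(features)
```

These entries can be merged into the same dictionary as your existing word features before training.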
I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words.
I am trying to do a supervised learning with these documents.
My approach would be:
Vectorize each document in the corpus. Say we have 2347 docs.
I would then have 2347 rows with labels, i.e. Like as 1 and Dislike as 0.
Train any supervised ML classification model on this dataset of 2347 rows.
How do I vectorize the documents and create such a dataset?
One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
There are other alternatives to doc2vec mentioned here. Try the Average of Word2Vec vectors with TF-IDF approach as well.
Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec, such as case normalization, stopword removal, punctuation removal, etc. Which steps you need really depends on your dataset. Find out more here
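The cleaning steps named above could look roughly like this. It is a minimal sketch using only the standard library; the stopword set is a tiny illustrative assumption, and a fuller list (for example from NLTK's stopwords corpus) would be used in practice.

```python
import string

# Tiny illustrative stopword list (assumption), not a complete one.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "it", "to", "of"}

def clean_text(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

cleaned = clean_text("The plot was THIN, and the ending... predictable!")
print(cleaned)
```

The resulting token lists are what you would pass on to doc2vec or word2vec training.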
I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are
The proportion of uppercase words
Slang words present or not
Emoticons present or not
Language of the text
The sentiment of the text - this is a whole new topic altogether though
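A few of the features in the list above can be computed with plain Python. This is only a sketch: the emoticon pattern and the slang set are deliberately small assumptions, and handcrafted_features is a hypothetical name.

```python
import re

# Matches simple western emoticons such as :) ;-( =D (illustrative, not exhaustive).
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
SLANG = {"lol", "omg", "meh"}  # tiny stand-in list (assumption)

def handcrafted_features(text):
    words = text.split()
    # Ignore one-letter words such as "I" when counting shouted words.
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "upper_ratio": len(shouted) / len(words) if words else 0.0,
        "has_emoticon": bool(EMOTICON_RE.search(text)),
        "has_slang": any(w.lower().strip(".,!?") in SLANG for w in words),
    }

feats = handcrafted_features("I LOVED this book :) Would read again!")
print(feats)
```

Such values can be appended to the document vector as extra columns before training.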
I hope this was helpful...
I'm interested in using tf-idf with the FastText library, but have not found a logical way to handle the n-grams. I have already used tf-idf with SpaCy vectors, for which I found several examples like these:
http://dsgeek.com/2018/02/19/tfidf_vectors.html
https://www.aclweb.org/anthology/P16-1089
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
But for the FastText library it is not that clear to me, since its granularity isn't that intuitive. E.g.:
For a general word2vec approach I will have one vector for each word, so I can count the term frequency for that vector and scale its values accordingly.
But for fastText the same word will have several n-grams.
"Listen to the latest news summary" will have n-grams generated by a sliding window, like:
lis ist ste ten tot het...
These n-grams are handled internally by the model so when I try:
model["Listen to the latest news summary"]
I get the final vector directly, so what I have thought of is to split the text into n-grams before feeding the model, like:
model['lis']
model['ist']
model['ten']
And compute the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to vector n-grams like these?
I would let FastText deal with the trigrams, but keep building the tfidf-weighted embeddings at the word level.
That is, you send
model["Listen"]
model["to"]
model["the"]
...
to FastText, and then use your old code to get the tf-idf weights.
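As a rough illustration of that word-level approach, the sketch below computes a tf-idf weighted average of word vectors. Everything here is an assumption for demonstration: toy_vectors is a plain dictionary standing in for the model[word] lookups, the IDF uses add-one smoothing, and the helper names are hypothetical.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(tokenized_docs)
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

def weighted_doc_vector(tokens, lookup, idf, dim):
    """Average the word vectors of a document, weighting each by tf * idf."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in Counter(tokens).items():
        if word not in lookup or word not in idf:
            continue  # skip words without a vector or an IDF value
        weight = tf * idf[word]
        for k in range(dim):
            total[k] += weight * lookup[word][k]
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total

# Stand-in for model["listen"], model["news"], ... (toy 2-d vectors).
toy_vectors = {"listen": [1.0, 0.0], "news": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [["listen", "the", "news"], ["the", "news"], ["the", "summary"]]

idf = idf_weights(docs)
doc_vec = weighted_doc_vector(docs[0], toy_vectors, idf, dim=2)
print(doc_vec)
```

Here the rare word "listen" pulls the document vector towards its own vector more than the common word "the" does, which is the point of the weighting.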
In any case, it would be good to know whether FastText itself considers word boundaries when processing a sentence, or whether it truly treats the sentence only as a sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.
You are talking about the fasttext tokenization step (not the fasttext embeddings), which is a (3,6) character n-gram tokenization and is compatible with tf-idf. The full step can be computed outside of fasttext quite easily; see "Calculate TF-IDF using sklearn for n-grams in python".
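For illustration, character n-grams of that kind can be generated per word in a few lines. This is a sketch under assumptions: char_ngrams is a hypothetical helper, and the < and > boundary markers follow the description in the fastText paper rather than anything verified against the library's internals.

```python
from collections import Counter

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a single word, with < and > marking its boundaries."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

trigrams = char_ngrams("listen", n_min=3, n_max=3)
print(trigrams)

# Term counts over n-grams, which is the input a tf-idf computation needs.
sentence = "listen to the latest news summary"
counts = Counter(g for w in sentence.split() for g in char_ngrams(w, 3, 3))
print(counts.most_common(3))
```

Note that these n-grams stay inside each word, unlike a window sliding across the whole sentence.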
As far as I understand your question, you are confusing word embedding methods (such as word2vec and many others) with Tf-Idf:
Basically, word embedding methods are unsupervised models for generating word vectors. The word vectors generated by this kind of model are now very popular in NLP tasks. This is because a word embedding representation of a word captures more information about the word than a one-hot representation does, since the former captures the semantic similarity of that word to other words, whereas the latter is equidistant from all other words. FastText is another way to implement word embeddings (recently open-sourced by Facebook researchers).
Tf-idf, instead, is a scoring scheme for words, that is, a measure of how important a word is to a document.
From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).
Unlike Word2Vec, which learns a vector representation of the entire word, FastText learns a representation for each n-gram of the word, as you have already seen. So the overall word embedding is the sum of the n-gram representations. Because a FastText model has more n-grams than words, it often performs better than Word2Vec and allows rare words to be represented appropriately.
From my standpoint, in general it does not make sense to use FastText (or any word embedding method) together with Tf-Idf. But if you want to use Tf-Idf with FastText, you must sum all the n-grams that compose your word and use this representation to calculate the Tf-Idf.
I'm trying to train an NER model using spaCy to identify locations, (person) names, and organisations. I'm trying to understand how spaCy recognises entities in text and I've not been able to find an answer. From this issue on Github and this example, it appears that spaCy uses a number of features present in the text such as POS tags, prefixes, suffixes, and other character and word-based features in the text to train an Averaged Perceptron.
However, nowhere in the code does it appear that spaCy uses the GloVe embeddings (although each word in the sentence/document appears to have them, if present in the GloVe corpus).
My questions are -
Are these used in the NER system now?
If I were to switch out the word vectors to a different set, should I expect performance to change in a meaningful way?
Where in the code can I find out how (if at all) spaCy is using the word vectors?
I've tried looking through the Cython code, but I'm not able to understand whether the labelling system uses word embeddings.
spaCy does use word embeddings for its NER model, which is a multilayer CNN. There's quite a nice video by Matthew Honnibal, the creator of spaCy, about how its NER works here. All three English models use GloVe vectors trained on Common Crawl, but the smaller models "prune" the number of vectors by mapping similar words to the same vector.
It's quite doable to add custom vectors. There's an overview of the process in the spaCy docs, plus some example code on Github.
I use Python's TextBlob library and its Naive Bayes classifier. I have learned that it uses the NLTK Naive Bayes classifier. Here is the question: my input sentences are non-English (Turkish). Will that work? I don't know how it works, but I tried it with 10 training examples and it seems to work. I wonder how NLTK's Naive Bayes classifier works on non-English data. What are the disadvantages?
Although a classifier trained for English is unlikely to work on other languages, it sounds like you are using textblob to train a classifier for your text domain. Nothing rules out using data from another language, so the real question is whether you are getting acceptable performance. The first thing you should do is test your classifier on a few hundred new sentences (not the ones you trained it on!). If you're happy, that's the end of the story. If not, read on.
What makes or breaks any classifier is the selection of features to train it with. The NLTK's classifiers require a "feature extraction" function that converts a sentence into a dictionary of features. According to its tutorial, textblob provides some kind of "bag of words" feature function by default. Presumably that's the one you're using, but you can easily plug in your own feature function.
This is where language-specific resources come in: many classifiers use a "stopword list" to discard common words like "and" and "the". Obviously, this list must be language-specific. And as @JustinBarber wrote in a comment, languages with lots of morphology (like Turkish) have more word forms, which may limit the effectiveness of word-based classification. You may see improvement if you "stem" or lemmatize your words; both procedures transform different inflected word forms to a common form.
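A custom feature function with a language-specific stopword list might look like the sketch below. The stopword set is a tiny illustrative assumption, far from a complete Turkish list, and turkish_extractor is a hypothetical name. No stemming is attempted here.

```python
# Tiny illustrative Turkish stopword list (assumption, not complete).
TURKISH_STOPWORDS = {"ve", "bir", "bu", "da", "de", "ile"}

def turkish_extractor(document):
    """Bag-of-words features for one sentence, skipping stopwords."""
    # Caveat: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı").
    tokens = document.lower().split()
    return {"contains(%s)" % t: True for t in tokens if t not in TURKISH_STOPWORDS}

features = turkish_extractor("Bu film çok güzel ve eğlenceli")
print(features)
```

Going by the textblob tutorial, a function of this shape can be plugged in as NaiveBayesClassifier(train, feature_extractor=turkish_extractor).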
Going further afield, you didn't say what your classifier is for but it's possible that you could write a custom recognizer for some text properties, and plug them in as features. E.g., in case you're doing sentiment analysis, some languages (including English) have grammatical constructions that indicate high emotion.
For more, read a few chapters of the NLTK book, especially the chapter on classification.
What I am going to ask may sound very similar to the post "Sentiment analysis with NLTK python for sentences using sample data or webservice?", but I am already done with parsing and tokenizing the sentences from my text. My questions are:
1. Of the examples I have seen so far in NLTK, the movie review example seems most similar to my problem. But for movie_review the training text is already organised: there are two folders, pos and neg, and the texts are stored there. How can I do that classification for my huge text? Do I read the data manually and store it in two folders? Does that make the corpus? After that, can I work with it just like the movie_review data in the example?
2. If the answer to the above question is yes, is there any tool to speed up that task? For example, I want to work only with the texts that have "Monty Python" in their content, classify them manually, and then store them in the pos and neg folders. Does that work?
Please help me
Yes, you need a training corpus to train a classifier. Or you need some other way to detect sentiment.
To create a training corpus, you can classify by hand, you can have others classify it for you (Mechanical Turk is popular for this), or you can do corpus bootstrapping. For sentiment, that could involve creating 2 lists of keywords, positive words and negative words. Using those, you can create an initial training corpus, correct it by hand, then train a classifier. This is an iterative process, and the key thing to remember is "garbage in, garbage out". In other words, if your training corpus is wrong, you can't expect your classifier to be right.
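The bootstrapping step described above could be sketched like this. The keyword lists, the example texts and the bootstrap_label helper are all made up for illustration; ties are left unlabelled so that they end up in the pile you correct by hand.

```python
POSITIVE = {"great", "funny", "brilliant"}  # hypothetical keyword lists
NEGATIVE = {"boring", "awful", "dull"}

def bootstrap_label(text):
    """Return 'pos', 'neg', or None when the keywords do not decide."""
    tokens = set(text.lower().split())
    pos, neg = len(tokens & POSITIVE), len(tokens & NEGATIVE)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return None  # ambiguous: leave for manual labelling

texts = [
    "monty python is brilliant and funny",
    "a dull and boring sketch",
    "funny in places but awful overall",
]
labelled = [(t, bootstrap_label(t)) for t in texts]
needs_review = [t for t, label in labelled if label is None]
print(labelled)
print(needs_review)
```

The automatically labelled texts form the initial corpus, and the needs_review list is where manual effort goes first.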