NLTK - Multi-labeled Classification - python

I am using NLTK, to classify documents - having 1 label each, with there being 10 type of documents.
For text extraction, I am cleaning text (punctuation removal, html tag removal, lowercasing), removing nltk.corpus.stopwords, as well as my own collection of stopwords.
For my document feature I am looking across all 50k documents, and gathering the top 2k words, by frequency (frequency_words) then for each document identifying which words in the document that are also in the global frequency_words.
I am then passing in each document as hashmap of {word: boolean} into the nltk.NaiveBayesClassifier(...) I have a 20:80 test-training ratio in regards to the total number of documents.
The issues I am having:
Is this classifier by NLTK, suitable to multi labelled data? - all examples I have seen are more about 2-class classification, such as whether something is declared as a positive or negative.
The documents are such that they should have a set of key skills in - unfortunately I haven't got a corpus where these skills lie. So I have taken an approach with the understanding, a word count per document would not be a good document extractor - is this correct? Each document has been written by individuals, so I need to leave way for individual variation in the document. I am aware SkLearn MBNaiveBayes which deals with word count.
Is there an alternative library I should be using, or variation of this algorithm?
Thanks!

Terminology: Documents are to be classified into 10 different classes which makes it a multi-class classification problem. Along with that if you want to classify documents with multiple labels then you can call it as multi-class multi-label classification.
For the issues which you are facing,
nltk.NaiveBayesClassifier() is a out-of-box multi-class classifier. So yes you can use this to solve this problem. As per the multi-labelled data, if your labels are a,b,c,d,e,f,g,h,i,j then you have to define label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.
Feature extraction is the hardest part of Classification (Machine learning). I recommend you to look into different algorithms to understand and select the one best suits for your data(without looking at your data, it is tough to recommend which algorithm/implementation to use)
There are many different libraries out there for classification. I personally used scikit-learn and i can say it was good out-of-box classifier.
Note: Using scikit-learn, i was able to achieve results within a week, given data set was huge and other setbacks.

Related

Gensim topic modelling with suggested initial inputs?

I'm doing am LDA topic model on a medium sized corpus using gensim in python.
We already know roughly some of the topics we're expecting. In particular, we know that a particular topic definitely exists within the corpus and we want the model to find that topic for us so that we can extract the elements of the corpus that fall under that topic.
Is there a way of manually setting the initial conditions of one of your topics in gensim to give the model a shove in the 'right' direction?
The idea would be to take a handful of known examples of the target topics and set the probabilities of each words to their frequency within the known examples. Or something in the neighborhood of that idea.
Thanks in advance for your help!
As LDA is traditionally an unsupervised method, it's more common to let it tell you what topics it finds by its rules, then see which (if any) of those match your preconceptions.
Gensim has no way to pre-seed an LDA model/session with biases towards finding/defining certain topics.
You might use your conceptions of a topic that "should" exist, or certain documents that "should" be together, to tune your choice of other parameters to ensure final results best meet that goal, or to postprocess the LDA results with labeling/combinations to match your desired groupings.
But also, if one topic is of preeminent importance, or has your best set of labeled training examples, you may want to consider training a binary classifier to predict whether documents are in that topic, or not. Or, as your ideas of preferable topics, with labeled examples, grows, a multi-label classifier to assign documents to topics.
Classifiers are the more appropriate tool when you want a system to deduce known categories, though of course hybrid approaches can also be useful. For example, LDA runs may help suggest new categories, and the outputs of an LDA run could be added as features to assist downstream supervised classifiers. Or documents decorated with extra tokens from supervised classification could be analyzed by downstream LDA.
(In fact, simply decorating documents that are in a desired known category with an extra synthetic token representing that category might be a interesting way to bias an LDA toward reflecting those categories, but you'd want a rigorous evaluation process, for deciding whether such a hack was overall improving your true end goals or not.)

How to get sentiment tag using word2vec

I'm working on word2vec model in order to analysis a corpus of newspaper.
I have a csv which contains some newspaper like tital, journal, and the content of the article.
I know how to train my model in order to get most similar words and their context.
However, I want to do a sentiment analysis on that. I found some ressources in order to do that but in all the test or train dataframe in the examples, there is already a column sentiment (0 or 1). Do you if it's possible to classify automaticaly texts by sentiment ? I mean put 0 or 1 to each text. I search but i don't find any references about that in the word2vec or doc2vec documentation...
Thanks for advance !
Both Word2Vec & Doc2Vec are just ways to turn words or lists-of-words into 'dense' vectors. Alone, they won't tell you sentiment.
When you have a text and want to deduce which categories it belongs to, that's called 'text classification'. Specifically, if you have just two categories (like 'positive-sentiment' vs 'negative-sentiment', or 'spam' vs 'not-spam'), that's called 'binary classification'.
The output of a Word2Vec or Doc2Vec model might be helpful in that task, but mainly as input to some other chosen 'classifier' algorithm. And, such algorithms require some 'labeled examples' of each kind of text - where you supply the right answer – in order to work. So, you will likely have to go through your corpus of newspaper articles & mark a bunch of them with the answer you want.
You should start by working through some examples that use scikit-learn, the most popular Python library with text-classification tools, even without any Word2Vec or Doc2Vec features, at first. For example, in its docs is an intro:
"Working With Text Data"
Only after you've set up some basic code using generic preprocess/feature-extraction/training/evaluation steps, and reviewed some actual results, should you then consider if adding some features based on Word2Vec or Doc2Vec might help.

nlp multilabel classification tf vs tfidf

I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem was, after cleaning up the text, stop word removal, tokenizing etc., is to do the following:
To create the features matrix I looked at the frequency distribution of the terms of each document, I then created a table of these terms (where duplicate terms are removed), I then calculated the term frequency for each word in its corresponding text (tf). So, eventually I ended up with around a 1000 terms and their respected frequency in each document.
I then used selectKbest to narrow them down to around 490. and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score around 0.58 but it is not improving at all and I need to get 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
Tf method can give importance to common words more than necessary rather use Tfidf method which gives importance to words that are rare and unique in the particular document in the dataset.
Also before selecting Kbest rather train on the whole set of features and then use feature importance to get the best features.
You can also try using Tree Classifiers or XGB ones to better model but SVC is also very good classifier.
Try using Naive Bayes as the minimum standard of f1 score and try improving your results on other classifiers with the help of grid search.

scikit-learn: classifying texts using custom labels

I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words. Some that should pretty much guarantee that a particular text is classed as "positive" while others guarantee that it's classed as "negative". And lastly, is there a way to imply that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
There are many ways to do this like Tf-Idf (Term frequency - Inverse Document Frequency), Count Vectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec.
Among all of above mentioned methods, Word2Vec is the best method. You can use a pre-trained model by Google for Word2Vec, available on:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors

Sentiment Classification from own Text Data using NLTK

What I am going to ask may sound very similar to the post Sentiment analysis with NLTK python for sentences using sample data or webservice? , But I am done with Parsing and Tokenization of sentences from text. My question is
Whatever examples till now I have seen in NLTK movie review example seems to be most similar to my problem, But for movie_review the training text is already in a form as it has two folders pos and neg and text are stored there. How can I do that classification for my huge text, Do I read data manually and store them into two folders. Does that make the corpus. After that can I work with them just like movie_review data in example?
2.If the answer to the above question is yes, is there any way to speed up that task by any tool. For example I want to work with only the texts which has "Monty Python" in there content. And then I classify them manually and then store them in pos and neg folder. Does that work?
Please help me
Yes, you need a training corpus to train a classifier. Or you need some other way to detect sentiment.
To create a training corpus, you can classify by hand, you can have others classify it for you (mechanical turk is popular for this), or you can do corpus bootstrapping. For sentiment, that could involve creating 2 lists of keywords, positive words and negative words. Using those, you can create an initial training corpus, correct it by hand, then train a classifier. This is an iterative process, and the key thing to remember is "garbage in, garbage out". In other words, if your training corpus is wrong, you can't expect your classifier to be right.

Categories