I have a large training set of words labeled pos and neg, which I use to classify texts. I used TextBlob (according to this tutorial) for the classification. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn, but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words: some should pretty much guarantee that a text is classified as "positive", while others should guarantee that it's classified as "negative". And lastly, is there a way to indicate that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
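Stitched together, those three steps might look roughly like this (just a sketch; `texts` and `labels` are placeholders for your own pos/neg training data):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. Turn the raw texts into numerical feature vectors
vectorizer = HashingVectorizer(n_features=2**16)
X = vectorizer.transform(texts)          # texts: list of strings (placeholder)

# 2. Hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)   # labels: "pos"/"neg" per text

# 3. Train and evaluate a classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```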
There are many ways to do this, such as TF-IDF (Term Frequency - Inverse Document Frequency), CountVectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec.
Of all the methods mentioned above, Word2Vec is the best. You can use Google's pre-trained Word2Vec model, available at:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
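If you do go with the pre-trained Google News vectors, gensim can load them roughly like this (the file path is a placeholder for wherever you download the archive to):

```python
from gensim.models import KeyedVectors

# large download; the path below is a placeholder
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

print(model["good"].shape)                      # each word maps to a 300-d vector
print(model.most_similar("excellent", topn=5))  # nearest neighbours in vector space
```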
I'm working on a word2vec model in order to analyze a corpus of newspaper articles.
I have a CSV which contains newspaper data such as the title, the journal, and the content of the article.
I know how to train my model in order to get the most similar words and their context.
However, I want to do sentiment analysis on that corpus. I found some resources on how to do it, but in all the examples the train or test dataframe already has a sentiment column (0 or 1). Do you know if it's possible to classify texts by sentiment automatically? I mean, assign 0 or 1 to each text. I searched but couldn't find any references about that in the word2vec or doc2vec documentation...
Thanks in advance!
Both Word2Vec & Doc2Vec are just ways to turn words or lists-of-words into 'dense' vectors. Alone, they won't tell you sentiment.
When you have a text and want to deduce which categories it belongs to, that's called 'text classification'. Specifically, if you have just two categories (like 'positive-sentiment' vs 'negative-sentiment', or 'spam' vs 'not-spam'), that's called 'binary classification'.
The output of a Word2Vec or Doc2Vec model might be helpful in that task, but mainly as input to some other chosen 'classifier' algorithm. And such algorithms require some 'labeled examples' of each kind of text - where you supply the right answer - in order to work. So, you will likely have to go through your corpus of newspaper articles & mark a bunch of them with the answer you want.
You should start by working through some examples that use scikit-learn, the most popular Python library with text-classification tools, without any Word2Vec or Doc2Vec features at first. For example, its docs include an intro:
"Working With Text Data"
Only after you've set up some basic code using generic preprocess/feature-extraction/training/evaluation steps, and reviewed some actual results, should you then consider if adding some features based on Word2Vec or Doc2Vec might help.
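For orientation, a condensed version of the kind of pipeline that tutorial builds might look like this (assuming you have already hand-labeled some articles; `articles` and `labels` are placeholders for your own data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    articles, labels, test_size=0.2, random_state=0)

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),            # bag-of-words / TF-IDF features
    ("clf", SGDClassifier(loss="hinge")),    # a simple linear classifier
])
text_clf.fit(X_train, y_train)
print("test accuracy:", text_clf.score(X_test, y_test))
```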
I have training data with two columns:
1. 'Sentences'
2. 'Relevant_text' (text in this column is a subset of the text in the 'Sentences' column)
I tried training an RNN with LSTM, directly treating 'Sentences' as input and 'Relevant_text' as output, but the results were disappointing.
I want to know how to approach this type of problem. Does this kind of problem have a name? Which models should I explore?
If the target text is a subset of the input text, then, I believe, this problem can be solved as a tagging problem: make your neural network predict, for each word, whether it is "relevant" or not.
On the one hand, the problem of taking a text and selecting the subset of it that best reflects its meaning is called extractive summarization, and it has lots of solutions, from the well-known unsupervised TextRank algorithm to complex BERT-based neural models.
On the other hand, technically your problem is just binary token-wise classification: you label each token (word or other symbol) of your input text as "relevant" or not, and train any neural network architecture which is good for tagging on this data. Specifically, I would look into architectures for POS tagging, because they are very well studied. Typically, it is BiLSTM, maybe with a CRF head. More modern models are based on pretrained contextual word embeddings, such as BERT (maybe, you won't even need to fine tune them - just use it as a feature extractor, and add a BiLSTM on top). If you want a more lightweight model, you can consider a CNN over pretrained and fixed word embeddings.
One final parameter you should spend time playing with is the threshold for classifying a word as relevant - maybe the default of 0.5 is not the best choice. Maybe, instead of keeping all the tokens with a probability-of-being-relevant higher than 0.5, you would like to keep the top k tokens, where k is fixed or is some percentage of the whole text.
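As a rough sketch of that tagging setup (here in Keras; the sizes are illustrative and the random arrays stand in for your tokenized sentences and per-token relevance labels):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, max_len = 20000, 100

# Dummy data so the sketch runs end to end; replace with your padded token ids
# (X) and your 0/1 per-token relevance labels (y).
X = np.random.randint(1, vocab_size, size=(256, max_len))
y = np.random.randint(0, 2, size=(256, max_len, 1))

model = models.Sequential([
    layers.Embedding(vocab_size, 128, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2, batch_size=32)

probs = model.predict(X)[..., 0]   # per-token probability of being "relevant"
relevant = probs > 0.3             # e.g. a lowered threshold instead of 0.5
```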
Of course, more specific recommendations would be dataset-specific, so if you could share your dataset, it would be a great help.
I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words.
I am trying to do a supervised learning with these documents.
My approach would be:
Vectorize each document in the corpus. Say we have 2347 docs.
I can have 2347 rows with labels viz. Like as 1 and Dislike as 0.
Train any supervised ML classification model on the above dataset of 2347 rows.
How do I vectorize the documents and create such a dataset?
One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
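A minimal sketch of that idea with gensim and scikit-learn (assuming `documents` is your list of 2347 texts and `labels` the matching 1/0 Liked/Disliked values; the hyperparameters are illustrative, not tuned):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

tagged = [TaggedDocument(words=doc.lower().split(), tags=[i])
          for i, doc in enumerate(documents)]

d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)

# one fixed-length vector per document, in the same order as `labels`
X = [d2v.infer_vector(td.words) for td in tagged]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)   # labels: 1 = Liked, 0 = Disliked
```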
There are other alternatives to doc2vec mentioned here. Try the Average of Word2Vec vectors with TF-IDF approach as well.
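A sketch of that second approach, averaging word vectors weighted by their IDF values (here `w2v` is any word-vector model, e.g. one trained with gensim or the pre-trained Google vectors, and `documents` is a placeholder list of strings):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer().fit(documents)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(doc, dim=300):
    words = [w for w in doc.lower().split() if w in w2v and w in idf]
    if not words:
        return np.zeros(dim)
    # each word's vector is weighted by how informative (rare) the word is
    return np.mean([w2v[w] * idf[w] for w in words], axis=0)

X = np.vstack([doc_vector(d) for d in documents])
```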
Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec: steps like case normalization, stopword removal, and punctuation removal. What's appropriate really depends on your dataset. Find out more here
I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are (see the sketch after this list):
The proportion of uppercase words
Slang words present or not
Emoticons present or not
Language of the text
The sentiment of the text - this is a whole new topic altogether though
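As a rough illustration of such hand-crafted features (the slang set and emoticon pattern below are toy placeholders you would want to extend):

```python
import re

SLANG = {"lol", "omg", "idk", "smh"}                      # toy list
EMOTICON_RE = re.compile(r"[:;=8][-o*']?[()\[\]dDpP/]")   # toy pattern

def extra_features(text):
    words = text.split()
    n = max(len(words), 1)
    return {
        "upper_ratio": sum(w.isupper() for w in words) / n,
        "has_slang": any(w.lower() in SLANG for w in words),
        "has_emoticon": bool(EMOTICON_RE.search(text)),
    }

print(extra_features("OMG this movie was great :)"))
# {'upper_ratio': 0.1666..., 'has_slang': True, 'has_emoticon': True}
```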
I hope this was helpful...
I am new to machine learning. I am trying to build a classifier that classifies text as having a URL or not having a URL. The data is not labelled. I just have textual data. I don't know how to proceed with it. Any help or examples are appreciated.
Since it's text, you can use the bag-of-words technique to create vectors.
You can use cosine similarity to cluster texts of a common type.
Then use a classifier, whose choice would depend on the number of clusters.
This way you have a labeled training set.
If you have two clusters, a binary classifier like logistic regression would work.
If you have multiple classes, you need to train a model based on multinomial logistic regression,
or train multiple logistic models using the one-vs-rest technique.
Lastly, you can test your model using k-fold cross validation.
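Wired together, those steps might look roughly like this (note that I use plain KMeans on length-normalized bag-of-words vectors as a stand-in for cosine-similarity clustering, and `texts` is a placeholder for your unlabeled data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# bag-of-words vectors, length-normalized so Euclidean distance ~ cosine distance
X = normalize(CountVectorizer().fit_transform(texts))

# pseudo-labels from clustering; inspect the clusters before trusting them
pseudo_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, pseudo_labels, cv=5)   # k-fold cross validation
print(scores.mean())
```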
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than me.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!
I am using NLTK to classify documents, each having 1 label, with 10 types of documents in total.
For text extraction, I am cleaning the text (punctuation removal, HTML tag removal, lowercasing) and removing nltk.corpus.stopwords as well as my own collection of stopwords.
For my document features, I look across all 50k documents and gather the top 2k words by frequency (frequency_words); then, for each document, I identify which of its words also appear in the global frequency_words.
I then pass each document as a hashmap of {word: boolean} into nltk.NaiveBayesClassifier(...). I use a 20:80 test-to-training ratio with regard to the total number of documents.
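In code, the gist of my current setup looks roughly like this (simplified; `documents` is a placeholder for my list of (word_list, label) pairs after the cleaning step above):

```python
import nltk
from nltk import FreqDist

# top 2k words by frequency across the whole corpus
all_words = FreqDist(w for words, label in documents for w in words)
frequency_words = [w for w, _ in all_words.most_common(2000)]

def doc_features(words):
    word_set = set(words)
    return {w: (w in word_set) for w in frequency_words}

featuresets = [(doc_features(words), label) for words, label in documents]
cutoff = int(len(featuresets) * 0.2)                    # 20:80 test:train split
test_set, train_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
```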
The issues I am having:
Is this NLTK classifier suitable for multi-labelled data? All the examples I have seen are about 2-class classification, such as whether something is positive or negative.
The documents should each contain a set of key skills - unfortunately I don't have a corpus listing these skills. So I have taken this approach on the understanding that a plain word count per document would not be a good feature extractor - is this correct? Each document has been written by a different individual, so I need to leave room for individual variation between documents. I am aware of scikit-learn's multinomial Naive Bayes, which deals with word counts.
Is there an alternative library I should be using, or a variation of this algorithm?
Thanks!
Terminology: documents are to be classified into 10 different classes, which makes this a multi-class classification problem. On top of that, if you want to assign multiple labels to a document, you can call it multi-class multi-label classification.
For the issues you are facing:
nltk.NaiveBayesClassifier() is an out-of-the-box multi-class classifier, so yes, you can use it to solve this problem. As for multi-labelled data: if your labels are a,b,c,d,e,f,g,h,i,j, then you would represent a particular document carrying only label 'b' as '0,1,0,0,0,0,0,0,0,0'.
Feature extraction is the hardest part of classification (machine learning). I recommend looking into different algorithms to understand them and to select the one that best suits your data (without looking at your data, it is tough to recommend which algorithm/implementation to use).
There are many different libraries out there for classification. I personally used scikit-learn and can say it provides good out-of-the-box classifiers.
Note: using scikit-learn, I was able to achieve results within a week, even though the data set was huge and there were other setbacks.