Identifying multiple categories and associated sentiment within text - python

If you have a corpus of text, how can you identify all the categories (from a list of pre-defined categories) and the associated sentiment (positive/negative writing) with it?
I will be doing this in Python but at this stage I am not necessarily looking for a language specific solution.
Let's look at this question with an example to try and clarify what I am asking.
If I have a whole corpus of reviews for products e.g.:
Microsoft's Xbox One offers impressive graphics and a solid list of exclusive 2015 titles. The Microsoft console currently edges ahead of the PS4 with a better selection of media apps. The console's fall-2015 dashboard update is a noticeable improvement. The console has backward compatibility with around 100 Xbox 360 titles, and that list is poised to grow. The Xbox One's new interface is still more convoluted than the PS4's. In general, the PS4 delivers slightly better installation times, graphics and performance on cross-platform games. The Xbox One also lags behind the PS4 in its selection of indie games. The Kinect's legacy is still a blemish. While the PS4 remains our overall preferred choice in the game console race, the Xbox One's significant course corrections and solid exclusives make it a compelling alternative.
And I have a list of pre-defined categories e.g. :
Game Play
Game Selection
I could take my big corpus of reviews and break them down by sentence. For each sentence in my training data I can hand tag them with the appropriate categories. The problem is that there could be various categories in 1 sentence.
If it was 1 category per sentence then any classification algorithm from scikit-learn would do the trick. When working with multi-classes I could use something like multi-label classification.
Adding in the sentiment is the trickier part. Identifying sentiment in a sentence is a fairly simple task, but if there is a mix of sentiment on different labels it becomes harder.
The example sentence "The Xbox One has a good selection of games but the performance is worse than the PS4". We can identify two of our pre-defined categories (game selection, performance) but we have positive sentiment towards game selection and a negative sentiment towards performance.
What would be a way to identify all categories in text (from our pre-defined list) with their associated sentiment?

One simple method is to break your training set into minimal sentences using a parser and use that as the input for labelling and sentiment classification.
Your example sentence:
The Xbox One has a good selection of games but the performance is worse than the PS4
Using the Stanford Parser, take S tags that don't have child S tags (and thus are minimal sentences) and put the tokens back together. For the above sentence that would give you these:
The Xbox One has a good selection of games
the performance is worse than the PS4
Sentiment within an S tag should be consistent most of the time. If sentences like The XBox has good games and terrible graphics are common in your dataset you may need to break it down to NP tags but that seems unlikely.
Regarding labelling, as you mentioned any multi-label classification method should work.
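A rough sketch of the split-then-tag idea, under heavy assumptions: instead of the Stanford Parser it splits on a few contrastive conjunctions, and instead of trained classifiers it uses tiny hand-made keyword lists. The names `CATEGORY_KEYWORDS`, `POSITIVE`, `NEGATIVE`, `split_clauses`, `tag_clause` and `analyse` are all made up for illustration.

```python
import re

# Hypothetical keyword lists; real ones would come from hand-tagged training data.
CATEGORY_KEYWORDS = {
    "game selection": {"selection", "titles", "exclusives"},
    "performance": {"performance", "graphics", "installation"},
}
POSITIVE = {"good", "better", "solid", "impressive"}
NEGATIVE = {"worse", "bad", "terrible", "lags"}


def split_clauses(sentence):
    """Naive stand-in for a parser: split on contrastive conjunctions."""
    parts = re.split(r",?\s+(?:but|however|although)\s+", sentence, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


def tag_clause(clause):
    """Return (categories, sentiment) for one minimal clause."""
    words = set(re.findall(r"[a-z']+", clause.lower()))
    categories = sorted(c for c, kws in CATEGORY_KEYWORDS.items() if words & kws)
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return categories, sentiment


def analyse(sentence):
    """Map each category found in the sentence to the sentiment of its clause."""
    results = {}
    for clause in split_clauses(sentence):
        categories, sentiment = tag_clause(clause)
        for category in categories:
            results[category] = sentiment
    return results


print(analyse("The Xbox One has a good selection of games "
              "but the performance is worse than the PS4"))
```

With real data you would swap `split_clauses` for the parser output and `tag_clause` for a multi-label classifier plus a sentiment classifier; the per-clause structure stays the same.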
For more sophisticated methods, there's a lot of research on joint topic-sentiment models - a search for "topic sentiment model" turns up a lot of papers and code. Here's sample training data from a paper introducing a Hidden Topic Sentiment Model that looks right up your alley. Note how in the first sentence with labels there are two topics.
Hope that helps!

The only approach I could think of would consist of a set of steps.
1) Use some library to extract entities from text and their relationships. For example, check this article:
By parsing each text you may figure out which entities you have in each text and which chunks of text are related to the entity.
2) Use NLTK's sentiment extraction to analyze chunks specifically related to each entity and obtain their sentiment. That gives you the sentiment of each entity.
3) After that you need to come up with a way to map the entities you may encounter in text to what you call 'topics'. Unfortunately, I don't see a way to automate it, since you clearly do not define topics conventionally, through word frequency (like in topic modelling algorithms - LDA, NMF etc).


Exclude values existing in a list that contains words like

I have a list of merchant categories:
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() any of such merchant categories.
However, I cannot guarantee that the value in my dataframe has the long name as in the list of merchant category. It could be that my dataframe value is just air conditioning.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep), & negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches to both representing/vectorizing your text, and then learning to make classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragments) representations. Then try NaiveBayes or SVC classifiers (and others if you need to experiment for possibly-better results).
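To give a feel for why character n-grams help with shortened names like "air conditioning", here is a minimal sketch. It is not one of the Scikit-Learn classifiers mentioned above; it is a plain-Python stand-in that labels a string by its most similar labelled example, using Jaccard overlap of character trigrams. The examples and the names `char_ngrams`, `jaccard` and `predict_exclude` are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical labelled examples: True = exclude, False = keep.
EXAMPLES = [
    ("Air Conditioning, Heating and Plumbing Contractors", True),
    ("Electrical Contractors", True),
    ("Grocery Stores and Supermarkets", False),
    ("Book Stores", False),
]


def predict_exclude(text):
    """Return (label, similarity) taken from the most similar labelled example."""
    grams = char_ngrams(text)
    return max(
        ((label, jaccard(grams, char_ngrams(example))) for example, label in EXAMPLES),
        key=lambda pair: pair[1],
    )


label, similarity = predict_exclude("air conditioning")
print(label, round(similarity, 2))
```

The similarity value doubles as a crude confidence score, so low values can be routed to a human instead of being trusted.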
Some of these will even report a sort of 'confidence' in their predictions - so you could potentially accept the strong predictions, but highlight the weak predictions for human review. When a human then looks at, and definitively rules on, a new 'category' string (because it was highlighted as an iffy prediction, or noticed as an error), you can then improve the overall system by:
adding that to the known set that are automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance at getting other new similar strings correct
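The review loop described above could be wired up roughly like this. It is only a sketch: `predict` is any callable you supply that returns a `(label, confidence)` pair, and the `0.5` threshold, `triage` and `record_ruling` are hypothetical names and values, not part of any library.

```python
def triage(text, known, predict, threshold=0.5):
    """Decide how to handle one category string.

    known     -- dict of lower-cased strings a human already ruled on (True = exclude)
    predict   -- callable returning (label, confidence) for unseen strings
    threshold -- assumed cut-off below which a human should review the prediction
    """
    key = text.strip().lower()
    if key in known:
        return known[key], "exact match"
    label, confidence = predict(text)
    if confidence >= threshold:
        return label, "auto"
    return label, "needs review"


def record_ruling(text, label, known, training_examples):
    """Store a human decision for exact lookup and for the next re-training run."""
    known[text.strip().lower()] = label
    training_examples.append((text, label))
```

After each batch of rulings you would re-train the classifier on `training_examples`, so that similar new strings are more likely to clear the threshold on their own.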
(I know this is a very high-level answer, but once you've worked through some attempts based on other intro tutorials, and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with Spacy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri- grams in the corpus that I
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo State of the art results for your problem!
It's called zero-shot learning.
State-of-the-art NLP models for text classification without annotated data.
For Code and details read the blog -
Let me know if it works for you or for any other help.
VADER tool is perfect with sentiment analysis and NLP based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time tri-grams make good sense for these use cases.
Using spaCy would be a good choice: its rule-based matching engine and pipeline components not only help you find the terms and phrases you are searching for but, unlike regular expressions, also give you access to the tokens in a text and the relationships between them.
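For the corpus-level part of the proposed workflow (counting bi- and tri-grams across all records, per rating split), plain Python may already be enough to get started. The sketch below is an assumption-laden illustration: the stopword list, the known phrases, the sample feedback and the names `tokenize`, `ngrams` and `top_ngrams` are all invented here.

```python
import re
from collections import Counter

# Hypothetical inputs; replace with your own lists and data.
STOPWORDS = {"the", "a", "an", "to", "of", "is", "was", "in", "and", "my", "me", "i", "it"}
KNOWN_PHRASES = ["hov lane", "carpool lane", "out of my way"]


def tokenize(text):
    """Lower-case, glue known phrases into single tokens, drop stopwords."""
    text = text.lower()
    for phrase in KNOWN_PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [w for w in re.findall(r"[a-z_']+", text) if w not in STOPWORDS]


def ngrams(tokens, n):
    """Yield consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(records, n=2, k=3):
    """Count n-grams over the whole corpus and return the k most common."""
    counts = Counter()
    for text in records:
        counts.update(" ".join(gram) for gram in ngrams(tokenize(text), n))
    return counts.most_common(k)


feedback = [
    (5, "Loved the HOV lane routing, saved so much time"),
    (4, "HOV lane routing saved me time again"),
    (2, "The detour time was way too long"),
    (1, "Detour time too long, took me out of my way"),
]
good = [text for stars, text in feedback if stars >= 4]
bad = [text for stars, text in feedback if stars <= 3]

print(top_ngrams(good, n=2, k=2))
print(top_ngrams(bad, n=2, k=2))
```

Frequent n-grams found this way can be appended to `KNOWN_PHRASES` and the corpus re-tokenized, which covers the "re-tokenize using the found n-grams" step.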

Ways of obtaining a similarity metric between two full text documents?

So imagine I have three text documents, for example (say, three randomly generated texts).
Document 1:
"Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be...."
Document 2:
"Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket..."
If I want to obtain in python (using libraries) a metric on how similar these 2 documents are to a third one (in other words, which one of the 2 documents is more similar to a third one) , what would be the best way to proceed?
edit: I have seen other questions whose answers compare individual sentences to other sentences, but I am not interested in that, as I want to compare a full text (consisting of related sentences) against another full text and obtain a number (which, for example, may be bigger than the number obtained by comparing with a different document that is less similar to the target one)
There is no simple answer to this question, as similarity measures will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options for comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to obtain full-document similarity. How to aggregate also depends on your particular task. A simple but often well-performing approach is to compute the average of the sentence similarities between the 2 (or more) documents.
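As a minimal sketch of the averaging idea, assuming bag-of-words cosine similarity as the sentence-level measure (any of the measures from the post could be swapped in) and a naive regex sentence splitter. The function names are made up for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word-count vector of a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sentences(text):
    """Very naive sentence splitter on ., ! and ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def document_similarity(doc_a, doc_b):
    """Average cosine similarity over all sentence pairs of the two documents."""
    pairs = [(bag_of_words(x), bag_of_words(y))
             for x in sentences(doc_a) for y in sentences(doc_b)]
    return sum(cosine(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0


target = "The cat sat on the mat. The dog barked."
doc1 = "A cat sat on a mat. A dog barked loudly."
doc2 = "Stock prices fell sharply. Investors were worried."
print(document_similarity(doc1, target), document_similarity(doc2, target))
```

Whichever document gets the larger number is the one closer to the target under this measure. Averaging over all pairs dilutes the score for long documents, so averaging each sentence's best match instead is a common variation.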
Other useful links on this topic include:
Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
You could try the Simphile NLP text similarity library (disclosure: I'm the author). It offers several language agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has their advantages, but all work well on full document comparison:
pip install simphile
This example shows Jaccard, but is exactly the same with Euclidian or Compression:
from simphile import jaccard_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization and generating clusters and label those clusters. But as said earlier I have predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow. Appreciate any help.
Alright, from what I understand, I think there are multiple ways to approach this case.
There will be trade-offs, and the accuracy may vary, because of a well-known fact and observation:
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it. I am assuming you're just getting general tweets, which can be about anything.
One thing you can do is build a dictionary for each class you have
(i.e Music => pop , jazz , rap , instruments ...)
which will contain relevant words to that class. You can use NLTK for python or Stanford NLP for other languages.
You can start with extracting
Go see these NLP lexical semantics slides. They will surely clear up some of the concepts.
Once you have a dictionary for each class, compare them against the tweets you have. Label each tweet with the class it is most similar to (you can rank the classes by how many words from each dictionary occur in the tweet). This gives you labeled tweets.
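A small sketch of that dictionary lookup, with made-up word lists (`CLASS_DICTIONARIES`) and a made-up function name (`label_tweet`); in practice the dictionaries would be much larger and built with some care.

```python
import re

# Hypothetical per-class dictionaries.
CLASS_DICTIONARIES = {
    "music": {"pop", "jazz", "rap", "guitar", "concert"},
    "sports": {"match", "goal", "league", "team", "score"},
    "health": {"doctor", "diet", "fitness", "flu", "vaccine"},
}


def label_tweet(tweet):
    """Return the class whose dictionary words occur most often, or None if no word matches."""
    words = re.findall(r"[a-z']+", tweet.lower())
    scores = {
        label: sum(word in vocab for word in words)
        for label, vocab in CLASS_DICTIONARIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for tweet in ["Great jazz concert last night", "Our team scored a late goal", "Hello world"]:
    print(tweet, "->", label_tweet(tweet))
```

Tweets that come back as `None` are the ones the dictionaries cannot decide on; those are candidates for the similarity-based step or for manual labelling.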
Now the question is the accuracy! It depends on the data and on how varied your classes are. This may be overkill, but it may come close to what you want.
Furthermore, you can label a set of tweets this way and use cosine similarity to identify the classes of other tweets. This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
The real struggle will be the machine learning part and how you manage that.
Actually, this seems like a typical use case for semi-supervised learning. There are plenty of methods to use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate the model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as #Shoaib suggested, but then you will have to come up with a heuristic for dealing with clusters with mixed labeling. Furthermore, solving an optimization problem that is not related to the task (labeling) will obviously not be as good as actually using this knowledge.
You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.
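One simple heuristic for naming clusters from a few labelled examples is a majority vote inside each cluster. The sketch below assumes the clustering has already been done by some other tool and only shows the voting step; `label_clusters` and `propagate` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, known_labels):
    """Name each cluster by the majority label among its hand-labelled members.

    cluster_ids  -- list with the cluster id of every sample
    known_labels -- dict mapping sample index -> hand-assigned label
    """
    votes = defaultdict(Counter)
    for index, label in known_labels.items():
        votes[cluster_ids[index]][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def propagate(cluster_ids, known_labels):
    """Give every sample the name of its cluster (None if the cluster has no labelled member)."""
    names = label_clusters(cluster_ids, known_labels)
    return [names.get(cluster) for cluster in cluster_ids]


cluster_ids = [0, 0, 1, 1, 1, 2]
known_labels = {0: "sports", 2: "health", 3: "health", 4: "sports"}
print(propagate(cluster_ids, known_labels))
```

Clusters where the vote is close are the mixed-label case mentioned above, and are worth inspecting by hand rather than trusting the majority.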

How to programmatically classify a list of objects

I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.
Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability it appears in a spam email, vs a non-spam email (ie count occurrences in spam, vs non spam)
Now you have a model: the probability of an email being spam, given that it contains a word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to be an OK assumption), and multiply the probabilities of all words in the email against each other.
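The steps above can be sketched in plain Python roughly as follows. Two things here are my own additions rather than part of the description: add-one smoothing (so unseen words do not zero out a class), and summing log-probabilities instead of multiplying raw probabilities (to avoid underflow). Stopword removal is left out for brevity, and the class name `NaiveBayes` and the sample emails are invented.

```python
import math
import re
from collections import Counter


def words_in(text):
    """Unique lower-cased words of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


class NaiveBayes:
    def fit(self, texts, labels):
        """Count, per class, how many documents contain each word."""
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(words_in(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        """Return the class with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            # log P(class) + sum of log P(word | class), with add-one smoothing
            score = math.log(n_docs / total_docs)
            for word in words_in(text) & self.vocab:
                score += math.log((self.word_counts[label][word] + 1) / (n_docs + 2))
            scores[label] = score
        return max(scores, key=scores.get)


texts = ["win free money now", "free prize claim now",
         "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("claim your free money"))
```

For the app-store case the classes would be Biology, English, Math and so on, and the texts would be app names or descriptions.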
You usually divide data into training and testing, so you'll have a set of emails you train your model on, and then a set of labeled stuff you test against where you calculate precision and recall.
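Precision and recall on the held-out test set can be computed by hand in a few lines. A minimal sketch, with `precision_recall` being an illustrative helper name and the label lists invented:

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall(y_true, y_pred))
```

Precision tells you how many of the items flagged as the class really belong to it; recall tells you how many of the true members were found.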
I'd highly recommend playing around with NLTK, a Python machine learning and NLP library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.
Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.
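A sketch of such a keyword parser, including the flag for apps that need a manual check. Only the Biology keywords come from the example above; the other keyword lists and the name `classify_app` are placeholders.

```python
import re

# "cell", "life", "grow" are from the example above; the rest are invented placeholders.
CATEGORY_KEYWORDS = {
    "Biology": {"cell", "life", "grow"},
    "Math": {"algebra", "fractions", "geometry"},
    "English": {"grammar", "vocabulary", "spelling"},
}


def classify_app(description):
    """Return (matching categories, needs_review).

    needs_review is True when zero or several categories match,
    i.e. the cases that should be checked by hand.
    """
    words = set(re.findall(r"[a-z]+", description.lower()))
    matches = sorted(c for c, keywords in CATEGORY_KEYWORDS.items() if words & keywords)
    return matches, len(matches) != 1


for text in ["Watch a cell grow and divide",
             "Grammar and geometry drills",
             "Flashcards for everything"]:
    print(text, "->", classify_app(text))
```

Matching whole words keeps this simple but misses plurals such as "cells", so in practice you would add variants to the lists or stem the words first.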
