How to extract sub topic sentences of a review using python & NLTK?

Is there an efficient way to extract the sub-topic statements of a review using Python and the NLTK library? For example, a user review of a mobile phone could be "This phone's battery is good but display is a bullshit".
I want to extract the two features above, like
"Battery is good"
"display is a bullshit"
The purpose is to develop a rating system for products with respect to their features.
The polarity analysis part is done.
But extracting the features of a review is difficult for me. I found a way to extract features using POS tag patterns with regular expressions, using a pattern like
<NN.?><VB.?>?<JJ.?>
as a sub-topic. The problem is that a review can contain many such patterns, depending on how each user phrases the description.
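A minimal sketch of how such a tag pattern picks out sub-topics. To keep it runnable without a tagger model, the tokens of a shortened variant of the example review are tagged by hand with Penn Treebank tags, and the tag pattern is translated into a plain regular expression over the tag sequence. The helper name `match_tag_pattern` is made up for this example. NLTK's RegexpParser applies chunk grammars of this kind to tagged sentences.

```python
import re

# Hand-tagged tokens (Penn Treebank tags); in practice a POS tagger would produce these.
tagged = [
    ("This", "DT"), ("phone", "NN"), ("'s", "POS"), ("battery", "NN"),
    ("is", "VBZ"), ("good", "JJ"), ("but", "CC"), ("display", "NN"),
    ("is", "VBZ"), ("bad", "JJ"),
]

# <NN.?><VB.?>?<JJ.?> rewritten as a regex over a string of <TAG> items.
TAG_PATTERN = re.compile(r"<NN[^<>]?>(?:<VB[^<>]?>)?<JJ[^<>]?>")

def match_tag_pattern(tagged_tokens):
    """Return the token spans whose tag sequence matches TAG_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    spans = []
    for m in TAG_PATTERN.finditer(tag_string):
        start = tag_string[:m.start()].count("<")  # index of the first matched token
        length = m.group().count("<")              # number of matched tokens
        words = [word for word, _ in tagged_tokens[start:start + length]]
        spans.append(" ".join(words))
    return spans

print(match_tag_pattern(tagged))
```

Each additional phrasing ("good battery", "battery lasts long") needs its own pattern, which is exactly the scaling problem described here.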
Is there any way to solve my problem efficiently?
Thank you!

The question you posed is multi-faceted and not straightforward to answer.
Conceptually, you may want to go through the following steps:
Identify the names of the phone features (and maybe create an ontology based on these features).
Create lists of synonyms for the feature names (and similarly for evaluative phrases, e.g. nice, bad, sucks, etc.).
Use one of the NLTK taggers to tag the reviews.
Create rules for extracting the features and their evaluations (the information extraction part). I am not sure if NLTK can directly support you with this.
Evaluate and refine the approach.
Or: create a larger annotated corpus and train a deep learning model on it using TensorFlow, Theano, or a similar framework.
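As a rough illustration of the lexicon and rule steps, the sketch below pairs a small feature lexicon with clause splitting. The lexicon entries, the list of clause separators, and the function name are all assumptions made for this example, not part of NLTK; a real system would need a much larger lexicon and proper tagging.

```python
import re

# Hypothetical lexicon: canonical feature name -> surface forms seen in reviews.
FEATURE_SYNONYMS = {
    "battery": {"battery", "batteries", "charge"},
    "display": {"display", "screen"},
}

# Assumed clause boundaries: punctuation and a few common conjunctions.
CLAUSE_SPLIT = re.compile(
    r"\s*(?:[,;.]|\bbut\b|\band\b|\bhowever\b|\balthough\b)\s*",
    re.IGNORECASE,
)

def extract_feature_clauses(review):
    """Return (feature, clause) pairs for each clause that mentions a known feature."""
    pairs = []
    for clause in CLAUSE_SPLIT.split(review):
        words = set(re.findall(r"[a-z]+", clause.lower()))
        for feature, forms in FEATURE_SYNONYMS.items():
            if words & forms:
                pairs.append((feature, clause.strip()))
    return pairs

review = "This phone's battery is good but display is bad"
for feature, clause in extract_feature_clauses(review):
    print(feature, "->", clause)
```

Each extracted clause can then be passed to the existing polarity step to produce a per-feature rating.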

Related

Huggingface NER with custom data

I have CSV data as below.
token     label
0.45"     length
1-12      size
2.6"      length
8-9-78    size
6mm       length
Whenever I get text like the following
6mm 8-9-78 silver head
I should be able to say length = 6mm and size = 8-9-78. I'm new to the NLP world and I'm trying to solve this using Huggingface NER. I have gone through various articles, but I don't understand how to train with my own data. Which model/tokeniser should I use? Or should I build my own? Any help would be appreciated.

I would maybe look at spaCy's pattern matching + NER to start. The pattern matching rules spaCy provides are really powerful, especially when combined with its statistical NER models. You can even use the patterns you develop to create your own custom NER model. This will give you a good idea of where you still have gaps or complexity that might require something else, like Huggingface.
If you are willing to pay, you can also use Prodigy, which provides a nice UI for human-in-the-loop annotation.
Adding REGEX entities to SpaCy's Matcher
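Before training anything, the rule-based idea can be tried with plain regular expressions. The sketch below is not spaCy's Matcher; it uses only the standard library, and the two patterns are assumptions inferred from the five sample rows in the question (a number followed by `"` or `mm` is a length, hyphen-separated numbers are a size).

```python
import re

# Assumed patterns, inferred from the sample rows only.
LABEL_PATTERNS = [
    ("length", re.compile(r'\d+(?:\.\d+)?(?:"|mm)')),
    ("size", re.compile(r"\d+(?:-\d+)+")),
]

def label_tokens(text):
    """Map each label to the whitespace-separated tokens that fully match its pattern."""
    labels = {}
    for token in text.split():
        for label, pattern in LABEL_PATTERNS:
            if pattern.fullmatch(token):
                labels.setdefault(label, []).append(token)
                break  # first matching label wins
    return labels

print(label_tokens("6mm 8-9-78 silver head"))
```

Tokens that match no pattern ("silver", "head") are the gaps where a statistical model would have to take over.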

I had two options: one was spaCy (as suggested by @scarpacci) and the other was Spark NLP. I opted for Spark NLP and found a solution: I formatted the data in CoNLL format and trained a model using Spark NLP's NerDLApproach with GloVe word embeddings.

Unsure of how to get started with using NLP for analyzing user feedback

I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating from 1 to 5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <=3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc, while respecting the known n-grams during tokenization (ie, "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!

Bingo! There are state-of-the-art results for your problem.
It's called zero-shot learning:
state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you or if you need any other help.

The VADER tool works well for sentiment analysis and similar NLP applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
spaCy would be a good choice: its rule-based matching engines and components not only help you find the terms and phrases you are searching for, but also give you access to the tokens in a text and their relationships, which regular expressions do not.
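The corpus-level part of the workflow (merging known phrases, counting n-grams across all documents, splitting by rating) can be sketched with the standard library alone. The stopword list, the feedback records, and the function names below are illustrative assumptions, and topic discovery is not covered.

```python
import re
from collections import Counter

# Illustrative values only: real stopword and phrase lists would be much longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "was", "i", "my"}
KNOWN_NGRAMS = ["hov lane", "carpool lane", "detour time", "out of my way"]

def tokenize(text):
    """Lowercase, merge known phrases into single tokens, drop punctuation and stopwords."""
    text = text.lower()
    for phrase in KNOWN_NGRAMS:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [t for t in re.findall(r"[a-z_]+", text) if t not in STOPWORDS]

def ngram_counts(docs, n):
    """Count n-grams across the whole corpus, never crossing document boundaries."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical feedback records: (stars, text).
feedback = [
    (5, "Loved using the HOV lane, short pickup wait"),
    (4, "Short pickup wait and the carpool lane saved time"),
    (2, "Too much detour time, driver went out of my way"),
    (1, "Too much detour time again"),
]

high = [text for stars, text in feedback if stars >= 4]
low = [text for stars, text in feedback if stars <= 3]

high_bigrams = ngram_counts(high, 2)
low_bigrams = ngram_counts(low, 2)
print(high_bigrams.most_common(3))
print(low_bigrams.most_common(3))
```

Frequent n-grams found this way can be appended to `KNOWN_NGRAMS` and the corpus re-tokenized, which covers the re-tokenization step. Note that removing stopwords before counting makes some non-adjacent words look adjacent.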

Extracting sentences belonging to different categories from text using NLTK

I am new to Natural Language Processing (NLP) and I came across a problem where, from a given text, I have to extract sentences belonging to different categories, like
1) sentences related to commitments (like sentences including 'will', 'shall', etc.)
2) sentences related to cost or budget
3) and so on...
I need to know which features of NLTK I should use to implement this. How easy is it to add more categories to extract more subjective information?
Any examples would be even more helpful.

You are looking for text classification. NLTK covers tokenization, stemming, word counts, etc., and its nltk.classify module provides basic classifiers, but you have to build the features and the training pipeline yourself.
An alternative Python library is spaCy, which covers similar preprocessing and also lets you train and use a text classifier to identify sentences that belong to a certain category. I suggest going through the use cases at: spaCy Usage Examples
For identifying sentences with a commitment, you can run a dependency parse on each sentence and look for "will" or "shall" as a modal auxiliary attached to the main verb.
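As a baseline before training a classifier, a keyword rule per category is easy to sketch and easy to extend. The code below is a standard-library stand-in rather than a dependency parse; the keyword sets, the naive sentence splitter, and the sample text are assumptions made for this example.

```python
import re

# Hypothetical keyword sets; adding a category means adding one entry here.
CATEGORY_KEYWORDS = {
    "commitment": {"will", "shall"},
    "cost": {"cost", "costs", "budget", "price"},
}

def categorize_sentences(text):
    """Group sentences by category; a sentence may land in several categories."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = {name: [] for name in CATEGORY_KEYWORDS}
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for name, keywords in CATEGORY_KEYWORDS.items():
            if words & keywords:
                result[name].append(sentence)
    return result

text = (
    "The vendor shall deliver the system by March. "
    "The total budget is 2 million dollars. "
    "We will review the cost every quarter."
)
for category, sentences in categorize_sentences(text).items():
    print(category, sentences)
```

Keyword rules miss paraphrases ("we commit to", "expenses"), which is where a trained text classifier becomes worthwhile.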

Improving parsing of unstructured text

I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script using regular expressions to do this, but over time new contingencies arise that I have to account for, which bars the regexp method from being a long-term solution. I have been reading up on NLTK and it seems there are two ways to use NLTK to solve my problem:
1) chunk the announcements using RegexpParser expressions - this might be a weak solution if two different fields I want to capture have the same sentence structure.
2) take n announcements, tokenize them and run them through the POS tagger, manually tag the parts of the announcements I want to capture using the IOB format, and then use those tagged announcements to train an NER model. A method discussed here
Before I go about manually tagging announcements, I want to gauge
whether 2 is a reasonable solution
whether there are existing tagged corpora that might be useful for training my model
knowing that accuracy improves with training data size, how many manually tagged announcements I should start with.
Here's an example of how I am building the training set. If there are any apparent flaws please let me know.

Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.
spaCy has a default English NER model that can recognize organizations; it may or may not work for you but it's worth a shot.
What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.
For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.
How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.
Hope that helps!
Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].
I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.
Given the regularity of language a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank tags; VP means "verb phrase", PP means "prepositional phrase", IN means "preposition".)
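To check how regular that first sentence really is before setting up a parser, a single template can serve as a quick first pass. This is a regex stand-in, not the parse-tree approach described above; the template, the function name, and the sample sentence are assumptions based only on the "XYZ Corp has been awarded $XXX for [description here]" shape.

```python
import re

# Assumed template: "<company> has been awarded [a] $<amount> [...] for <description>."
AWARD_PATTERN = re.compile(
    r"(?P<company>.+?) has been awarded (?:an? )?"
    r"(?P<amount>\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?) "
    r"(?:.*? )?for (?P<description>.+?)\.?$"
)

def parse_award(sentence):
    """Return company, amount and description, or None if the template does not fit."""
    m = AWARD_PATTERN.match(sentence.strip())
    return m.groupdict() if m else None

sentence = "XYZ Corp has been awarded $1,200,000 for bridge repair services."
print(parse_award(sentence))
```

Sentences that return None are the ones worth inspecting by hand, since they show which variations a parser or NER model would have to handle.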

Similarity score between two sets of tokens

I have a set of URLs retrieved for a person. I want to try to classify each URL as being about that person (his/her LinkedIn profile, a blog, or a news article mentioning the person) or not about that person.
I am trying a rudimentary approach where I tokenize each webpage and compare it to all the others to see how many words (excluding stop words) each pair of documents shares, and then take the most similar webpages to be positive matches.
I am wondering if there is a machine learning approach I can take to this which will make my task easier and more accurate. Essentially I want to compare webpage content (tokenized into words) between two webpages and determine a score for how similar they are based on their content.
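The word-overlap score described above corresponds to Jaccard similarity over the two pages' token sets. A minimal sketch, with a tiny made-up stopword list and invented page texts:

```python
import re

# Illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "for", "of"}

def token_set(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(text_a, text_b):
    """Shared tokens divided by all distinct tokens; 0.0 when both pages are empty."""
    a, b = token_set(text_a), token_set(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

page_a = "Jane Doe is a software engineer in Boston"
page_b = "Jane Doe, software engineer and blogger"
page_c = "Recipe for apple pie"

print(jaccard(page_a, page_b))
print(jaccard(page_a, page_c))
```

The score ranges from 0 (no shared words) to 1 (identical word sets), so a threshold on it gives a simple match rule.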

If you are familiar with Python, this NLTK classifier module should help you greatly:
http://www.nltk.org/api/nltk.classify.html#module-nltk.classify
For unsupervised clustering you can use this:
http://www.nltk.org/api/nltk.cluster.html#module-nltk.cluster
If you are simply looking for similarity scores then the metrics module should be useful:
http://www.nltk.org/api/nltk.metrics.html#module-nltk.metrics
NLTK has the answer; just browse through the modules to find what you want, and don't implement it by hand.
