I am new to using Python and NLTK for NLP operations. Starting with different sentences, I was wondering how I can extract certain dependency relations within a sentence.
For example:
Edward has a black jacket and white shoes with red laces
Using POS tagging I can extract certain parts of speech, but I want to specifically extract the fact that he has, for example, a black jacket, so that I can ultimately list the information like:
Name: Edward
Clothing: Black jacket
Shoes: White shoes with red laces
What you're looking for is NER (Named Entity Recognition). Since every sentence structure is different and the information required from each is different, you might need to build your own model; you can get a template or working example from here.
There are also huge corpora available which you can use.
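For a concrete starting point, here is a minimal sketch of the stock NLTK NER pipeline applied to your example sentence (assuming the usual NLTK data packages are installed). It should pick out 'Edward' as a named entity, but not the clothing details, which is why a custom model may be needed:

import nltk

# Assumes the standard NLTK data packages are available, e.g.
# nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('maxent_ne_chunker'), nltk.download('words')

sentence = "Edward has a black jacket and white shoes with red laces"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Named entities come back as labelled subtrees; "Edward" should appear
# as a PERSON, while the clothing attributes stay as plain tagged words.
for subtree in tree.subtrees():
    if subtree.label() != 'S':
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))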
You can consider the problem as extracting relation tuples, perhaps as binary relations. In that case you need to know about open IE (open information extraction). You could then extract relation tuples like (Edward, has, black jacket) or (Edward, has, white shoes with red laces). You can build your own relation extraction model if you have supervised data. Otherwise, extracting names, clothing, or other important information wouldn't be easy using other techniques like NER or POS tagging.
One alternative way can be dependency parsing, but I am not sure how to model it to suit your particular needs.
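NLTK itself does not ship a pretrained dependency parser, so here is a sketch with spaCy (an assumption on my part; the en_core_web_sm model must be downloaded separately) showing what the raw dependency relations look like, which you could then turn into extraction rules:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Edward has a black jacket and white shoes with red laces")

# Each token exposes its dependency label and syntactic head; rules such
# as "take the object of 'has' plus its adjective modifiers" build on this.
for token in doc:
    print(f"{token.text:<8} {token.dep_:<10} head={token.head.text}")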
I'm trying to train my NLTK model to recognize movie names (ex. "game of thrones")
I have a text file where each line is a movie name.
How do I train my NLTK model to recognize these movie names if it sees them in a sentence during tokenization?
I searched around but found no resources. Any help is appreciated.
It sounds like you are talking about training a named entity recognition (NER) model for movie names. To train an NER model in the traditional way, you'll need more than a list of movie names - you'll need a tagged corpus that might look something like the following (based on the 'dataset format' here):
I PRP O
like VBP O
the DT O
movie NN O
Game NN B-MOV
of IN I-MOV
Thrones NN I-MOV
. Punc O
but going on for a very long time (say, a minimum of 10,000 words to give enough examples of movie names in running text). Each word is followed by the part-of-speech (POS) tag, and then the NER tag. B-MOV indicates that 'Game' is the start of a movie name, and I-MOV indicates that 'of' and 'Thrones' are 'inside' a movie name. (By the way, isn't Game of Thrones a TV series as opposed to a movie? I'm just reusing your example anyway...)
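As a side note, data in this three-column CoNLL-style format can be read straight into NLTK. Here is a small sketch (assuming conllstr2tree accepts the custom MOV chunk type, which it should, since chunk types are configurable):

import nltk

conll_data = """I PRP O
like VBP O
the DT O
movie NN O
Game NN B-MOV
of IN I-MOV
Thrones NN I-MOV
. Punc O"""

# Parse the IOB-tagged text into a tree; 'Game of Thrones' becomes a MOV chunk.
tree = nltk.chunk.conllstr2tree(conll_data, chunk_types=('MOV',))
print(tree)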
How would you create this dataset? By annotating it by hand. It is a laborious process, but this is how state-of-the-art NER systems are trained, because whether or not something should be detected as a movie name depends on the context in which it appears. For example, there is a movie called 'Aliens', but the same word 'Aliens' is a movie title in the second sentence below but not the first.
Aliens are hypothetical beings from other planets.
I went to see Aliens last week.
Tools like doccano exist to aid the annotation process. The dataset to be annotated should be selected depending on the final use case. For example, if you want to be able to find movie names in news articles, use a corpus of news articles. If you want to be able to find movie names in emails, use emails. If you want to be able to find movie names in any type of text, use a corpus with a wide range of different types of texts.
This is a good place to get started if you decide to stick with training an NER model using NLTK, although some of the answers here suggest other libraries you might want to use, such as spaCy.
Alternatively, if the whole tagging process sounds like too much work and you just want to use your list of movie names, look at fuzzy string matching. In this case, I don't think NLTK is the library to use as I'm not aware of any fuzzy string matching features in NLTK. You might instead use fuzzysearch as per the answer here.
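A rough sketch of what that could look like with fuzzysearch (the movie list and sentence here are made up; max_l_dist controls how many edits are tolerated):

from fuzzysearch import find_near_matches

movie_names = ["game of thrones", "the godfather"]
sentence = "I stayed up all night watching Game of Throne again."

# Allow one edit so near-misses like 'Game of Throne' still match.
for name in movie_names:
    for match in find_near_matches(name, sentence.lower(), max_l_dist=1):
        print(name, "->", sentence[match.start:match.end])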
Is there any efficient way to extract sub-topic explanations of a review using Python and the NLTK library? As an example, a user review of a mobile phone could be "This phone's battery is good but display is a bullshit"
I want to extract the above two features, like
"Battery is good"
"display is a bullshit"
The purpose of the above is that I am going to develop a rating system for products with respect to the features of each product.
The polarity analysis part is done.
But extracting the features of a review is somewhat difficult for me. I did find a way to extract features using POS tag patterns with regular expressions, like
<NN.?><VB.?>?<JJ.?>
as a sub-topic pattern. But the problem is that there could be lots of possible patterns in a review, depending on how users describe things.
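(For illustration, the pattern above would be applied with NLTK's RegexpParser roughly like this; the exact output depends on the tags the tagger assigns.)

import nltk

# Assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
review = "This phone's battery is good but display is a bullshit"
tagged = nltk.pos_tag(nltk.word_tokenize(review))

# The POS-tag pattern from above, wrapped in an NLTK chunk grammar.
chunker = nltk.RegexpParser("FEATURE: {<NN.?><VB.?>?<JJ.?>}")
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "FEATURE"):
    print(" ".join(word for word, tag in subtree.leaves()))
# Likely finds "battery is good" but misses "display is a bullshit",
# which ends in <DT><NN> rather than <JJ> - exactly the coverage problem.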
Is there any way to solve my problem efficiently?
Thank you !!
The question you posed is multi-faceted and not straightforward to answer.
Conceptually, you may want to go through the following steps:
Identify the names of the features of phones (and maybe create an ontology based on these features).
Create lists of synonyms for the feature names (and similarly for evaluative phrases, e.g. nice, bad, sucks, etc.).
Use one of NLTK's taggers to parse the reviews.
Create rules for the extraction of features and their evaluations (the Information Extraction part); a rough sketch follows below. I am not sure if NLTK can directly support you with this.
Evaluate and refine the approach.
Or: create a larger annotated corpus and train a deep learning model on it using TensorFlow, Theano, or anything similar.
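A rough sketch of steps 2-4 (the feature ontology, synonym lists, and polarity words below are made up for illustration; a real system would need much richer rules):

import nltk

# Hypothetical ontology (steps 1-2): canonical feature name -> synonyms.
FEATURE_SYNONYMS = {
    "battery": {"battery", "charge", "battery-life"},
    "display": {"display", "screen", "resolution"},
}
POSITIVE = {"good", "great", "nice", "excellent"}
NEGATIVE = {"bad", "poor", "terrible", "bullshit"}

def extract_feature_opinions(review):
    # Step 3: tag the review with NLTK's default POS tagger.
    tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
    results, current_feature = [], None
    for word, tag in tagged:
        # Step 4a: map any synonym onto its canonical feature name.
        matched = next((f for f, syns in FEATURE_SYNONYMS.items() if word in syns), None)
        if matched:
            current_feature = matched
            continue
        # Step 4b: attach the next evaluative word to the last seen feature.
        if current_feature and (word in POSITIVE | NEGATIVE or tag.startswith("JJ")):
            polarity = "negative" if word in NEGATIVE else "positive"
            results.append((current_feature, word, polarity))
            current_feature = None
    return results

print(extract_feature_opinions(
    "This phone's battery is good but display is a bullshit"))
# Expected: [('battery', 'good', 'positive'), ('display', 'bullshit', 'negative')]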
I would like to create a model that is given a series of keywords extracted from the description about a company and classifies the 'type' of the company. Let me illustrate with an example.
"Snapchat is an image messaging and multimedia mobile application created by Evan Spiegel, Bobby Murphy, and Reggie Brown,[3] former students at Stanford University, and developed by Snap Inc., originally Snapchat Inc. "
Sample Extracted Keywords: "image messaging" ; "multimedia mobile application"
(from Wikipedia page on Snapchat)
Given this info, my model will need to infer 'IT' and 'SNS' from "image messaging" and "multimedia mobile application".
(In case you are asking why not go with the extracted keywords, I would like to categorize them into as few labels as possible for all companies, so 'IT' and 'SNS' are more general terms compared to 'image messaging' and such.)
Currently, my dataset is not too big. Of the few hundred data entries, roughly 80% contain information in the form that I want. Given this, I would like to process the keywords extracted from descriptions of each company and give them correct labels.
Any suggestions to aid me in this project would be great.
If you are targeting companies from specific domains, then even a small dataset may help you. One approach you could follow:
Use pre-trained word embeddings (e.g. from GloVe) for the extracted keywords and build an embedding for each company. It would be like constructing a phrase or sentence representation from word embeddings; let's call it a company embedding! Similar types of companies should have similar embeddings, so the ultimate idea is to get relationships like Google - Ford = Microsoft - Tesla, as we see with word embeddings. You could even think of other interesting arithmetic relations using embeddings, for example Google = search engine + youtube + android, where the right-hand-side terms are extracted keywords.
You will also need company-type labels for further classification, but that should be simple enough using any machine learning classifier. You could use a simple text classifier to accomplish your overall goal, but it is more interesting to achieve this using NLP techniques.
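A rough sketch of both steps (the GloVe file name, keyword lists, and labels below are illustrative, not real data):

import numpy as np
from sklearn.linear_model import LogisticRegression

def load_glove(path="glove.6B.100d.txt"):  # hypothetical local file
    # GloVe text files store one "word v1 v2 ... vN" entry per line.
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def company_embedding(keywords, vectors, dim=100):
    # Average the vectors of all keyword tokens -> a "company embedding".
    words = [w for kw in keywords for w in kw.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype="float32")
    return np.mean([vectors[w] for w in words], axis=0)

glove = load_glove()

# Toy labelled examples; in practice these come from your annotated entries.
train_keywords = [
    ["image messaging", "multimedia mobile application"],  # e.g. Snapchat
    ["search engine", "online advertising"],                # e.g. Google
    ["automobile manufacturer", "electric vehicles"],       # e.g. Tesla
]
train_labels = ["SNS", "IT", "Automotive"]

X = np.vstack([company_embedding(kw, glove) for kw in train_keywords])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

test = company_embedding(["photo sharing", "social network"], glove)
print(clf.predict(test.reshape(1, -1)))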
I'm working on a chat bot in Python using the NLTK library. I want to use the POS tagger to classify my sentences into categories. To start, I want to divide them into four categories: "IMPERATIVE", "INTERROGATIVE", "EXCLAMATORY", "DECLARATIVE". Eventually I'd like to add categories like QUESTION, SALUTATION and APOLOGY. I'm looking for some reference on how English sentence patterns are defined, something like a BNF for English sentences. Where can I find something like this?
Your task description doesn't sound like POS tagging but rather dialogue modelling: Essentially, you need to find a corpus of English sentences annotated according to their dialogue act type. One good annotation scheme I've worked with before is Allen and Core's Dialog Act Markup in Several Layers (DAMSL). You can also see their 1997 paper for more info on how this can be used, but unfortunately I don't know of any freely-available general-purpose corpora annotated with this data.
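For a sense of what dialogue-act classification looks like in code, here is a minimal sketch using NLTK's bundled NPS Chat corpus, which is annotated with a simpler set of act types than DAMSL (along the lines of the classifier example in the NLTK book):

import nltk

# Assumes nltk.download('nps_chat') and nltk.download('punkt').
posts = nltk.corpus.nps_chat.xml_posts()

def dialogue_act_features(text):
    # Simple bag-of-words features over the post's tokens.
    return {f"contains({w.lower()})": True for w in nltk.word_tokenize(text)}

labelled = [(dialogue_act_features(p.text), p.get("class")) for p in posts]
test_set, train_set = labelled[:1000], labelled[1000:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))

# Labels include acts such as 'whQuestion', 'Statement', 'Greet' and 'Bye'.
print(classifier.classify(dialogue_act_features("where can I find this?")))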
I have 300k+ HTML documents, which I want to extract postal addresses from. The data has different structures, so regex won't work.
I have done a heap of reading on NLP and NLTK for python, however I am still struggling on where to start with this.
Is this approach called Part-of-Speech Tagging or Chunking / Partial Parsing? I can't find any documentation on how to actually TAG a page so I can train a model on it, or even what I should be training.
So my questions:
What is this approach called?
How can I tag some documents to train from?
Qn: Which NLP task is closely related to this task?
Ans: The task of detecting postal addresses can be viewed as a Named-Entity Recognition (NER) task. But I suggest viewing the task simply as sequence labelling over the HTML (i.e. your input data) and then performing some standard machine learning classification.
Qn: How can I tag some documents to use as training data?
Ans: What you can do is:
Label each word or each line as Begin, Inside, or Outside (BIO)
Choose a supervised classification method
Decide what the features are (here's some hint: Feature selection)
Build the model (basically just run the classification software with the configured features)
Voila, the output should give you B, I, and O labels; then just delete all the instances labelled O and you will be left with the lines/words that are addresses
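A toy sketch of that pipeline using a plain per-token classifier (the two hand-labelled lines and the features are made up; a real system needs far more data and ideally a proper sequence model such as a CRF):

import nltk

# Hand-labelled toy data: each line as a list of (token, B/I/O tag) pairs.
labelled_lines = [
    [("123", "B"), ("Main", "I"), ("St", "I"),
     ("Springfield", "I"), ("IL", "I"), ("62704", "I")],
    [("Contact", "O"), ("us", "O"), ("for", "O"), ("details", "O")],
]

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

train = []
for line in labelled_lines:
    tokens = [w for w, _ in line]
    train += [(token_features(tokens, i), tag) for i, (_, tag) in enumerate(line)]

classifier = nltk.NaiveBayesClassifier.train(train)

new_tokens = "Call us : 456 Oak Ave Cleveland OH 44101".split()
tags = [classifier.classify(token_features(new_tokens, i)) for i in range(len(new_tokens))]
print(list(zip(new_tokens, tags)))
# Keep only tokens not labelled O to recover the candidate address span.
print(" ".join(w for w, t in zip(new_tokens, tags) if t != "O"))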
Apple calls their software that does this "Data Detectors" (be careful, it's patented -- they won an injunction against HTC Android phones over this). More generally, I think this application is called Information Extraction.
Strip the text out of the HTML page (unless there is a way from the HTML to identify the address text, such as a div with a particular class), then build a set of rules that match the address formats used.
If there are postal addresses in several countries then the formats may be markedly different but within a country, the format is the same (with some tweaks) or it is not valid.
Within the US, for example, addresses are 3 or 4 lines (including the person). There is usually a zip code (5 digits optionally followed by four more). Other countries have postal codes in different formats.
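For instance, a deliberately simplified sketch of one such US-format rule (it will both miss and over-match real-world addresses):

import re

# Very rough US-style rule: a street line followed by "City, ST 12345[-6789]".
US_ADDRESS = re.compile(
    r"\d{1,5}\s+[\w .]+\n"      # street number and name
    r"[\w .]+,\s*[A-Z]{2}\s+"   # city, two-letter state abbreviation
    r"\d{5}(?:-\d{4})?"         # ZIP or ZIP+4
)

text = """Visit us at:
123 Main St
Springfield, IL 62704-1234
for a free consultation."""

match = US_ADDRESS.search(text)
if match:
    print(match.group())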
Unless your goal is 100% accuracy on all addresses then you probably should aim for extracting as many addresses as you can within the budget for the task.
It doesn't seem like a task for NLP unless you want to use Named Entity identification to find cities, countries etc.
Your task is called information extraction, but that's a very, very broad concept. Luckily your task is more limited (street addresses), but you don't give a lot of information:
What countries are the addresses in? An address in Tokyo looks very different from one in Cleveland. Your odds of succeeding are much better if you're interested in addresses from a limited number of countries; you can develop a solution for each of them. If we're talking about a very limited number, you could code a recognizer manually.
What kind of webpages are we talking about? Are they a random collection, or can you group them into a limited number of websites and formats? Where do the addresses appear? (I.e., are there any contextual clues you can use to zero in on them?)
I'll take the worst-case scenario for question 2: the pages are completely disparate and the address could be anywhere. I'm not sure what the state of the art is, but I'd approach it as a chunking problem.
To get any kind of decent results, you need a training set. At a minimum, a large collection of addresses from the same locations and in the same style (informal, incomplete, complete) as the addresses you'll be extracting. Then you can try to coax decent performance out of a chunker (probably with post-processing).
PS. I wouldn't just discard the HTML mark-up. It contains information about document structure which could be useful. I'd add structural mark-up (paragraphs, emphasis, headings, lists, displays) before throwing out the HTML tags.
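A small sketch of that last point, using BeautifulSoup (an assumption on my part; any HTML parser would do) to keep block boundaries as markers instead of flattening everything into one string:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h2>Contact</h2>
  <p>Our office:</p>
  <div class="addr">123 Main St<br>Springfield, IL 62704</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Emit one record per block-level element so the document structure
# survives as boundaries that a chunker can use as features.
blocks = []
for element in soup.find_all(["h1", "h2", "h3", "p", "li", "div", "td"]):
    text = element.get_text(" ", strip=True)
    if text:
        blocks.append((element.name, text))

for tag, text in blocks:
    print(f"[{tag}] {text}")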