For a new project I need to extract information from web pages, more precisely imprint information. I use brat to label the documents and have started my first experiments with spaCy and NER. There are many videos and tutorials about this, but some basic questions remain.
Is it possible to include the context of an entity?
Example text:
Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5.
Well, spaCy is very good at extracting the phone numbers. According to my latest tests, the error rate is less than two percent. I achieved this after only 250 labeled documents; in the meantime I have labeled 450 documents, and my goal is about 5000.
Now to the actual point: only the phone numbers that appear in the context of the sentence "Responsible for the content" are relevant; the other phone numbers are not.
I could imagine training these introductory sentences as entities, because they are always somewhat similar. But how can I model the context? Are there perhaps existing NER-based models that do just that?
Maybe someone has already come across hints or material about this somewhere? As a beginner the hurdle is relatively high, because the material really goes deep (a little play on words).
Greetings from Germany!
If I understand your question and use-case correctly, I would advise the following approach:
Train/design some system that recognizes all phone numbers - it looks like you've already got that
Train a text classifier to recognize the "responsible for content" sentences.
Implement some heuristics (can probably be rule-based?) to determine whether or not any recognized phone number is connected to any of the predicted "responsible for content" sentences - probably using straightforward features such as number of sentences in between, taking the first phone number after the sentence, etc.
So basically I would advise solving each NLP challenge separately, and then connecting the information across the document.
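For instance, here is a minimal sketch of step 3, assuming your custom NER labels phone numbers as PHONE and using a crude keyword check as a stand-in for the step-2 text classifier (the pipeline name, the PHONE label, and is_responsible_sentence are assumptions, not parts of any existing model):

import spacy

# Assumption: a German pipeline extended with your custom NER that labels phone numbers as PHONE.
nlp = spacy.load("de_core_news_sm")

def is_responsible_sentence(sent) -> bool:
    # Placeholder for the trained text classifier from step 2; here just a keyword check.
    text = sent.text.lower()
    return "responsible for the content" in text or "verantwortlich" in text

def relevant_phone_numbers(doc, max_sentence_gap=2):
    # Keep only PHONE entities that occur in, or within a few sentences after,
    # a "responsible for the content" sentence.
    sentences = list(doc.sents)
    trigger_idxs = [i for i, s in enumerate(sentences) if is_responsible_sentence(s)]
    relevant = []
    for ent in doc.ents:
        if ent.label_ != "PHONE":
            continue
        sent_idx = next(i for i, s in enumerate(sentences) if s.start <= ent.start < s.end)
        if any(0 <= sent_idx - t <= max_sentence_gap for t in trigger_idxs):
            relevant.append(ent)
    return relevant

doc = nlp("Responsible for the content: The Good Company GmbH. You can contact us via +49 123 123 123.")
print(relevant_phone_numbers(doc))  # [] with the stock model; your trained NER would supply the PHONE spans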
Related
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability for analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo! There are state-of-the-art results for your problem.
It's called zero-shot learning: state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
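For a quick start, here is a minimal sketch of that approach using the Hugging Face transformers zero-shot pipeline (the feedback text and candidate labels are made up for illustration):

from transformers import pipeline

# Downloads a pretrained NLI model the first time it runs.
classifier = pipeline("zero-shot-classification")

feedback = "The detour added twenty minutes and the carpool lane was closed."
candidate_labels = ["routing", "traffic", "app usability", "pricing"]  # illustrative labels

result = classifier(feedback, candidate_labels=candidate_labels)
print(result["labels"][0], result["scores"][0])  # best-matching label and its score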
The VADER tool works well for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Pay close attention to your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: spaCy's rule-based match engines and components not only help you find the terms and sentences you are searching for, but, compared with regular expressions, also give you access to the tokens inside a text and their relationships.
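For example, here is a rough sketch of how spaCy's PhraseMatcher could respect a list of known n-grams by merging matched spans into single tokens (the n-gram list is just the examples from the question; spaCy v3 API assumed):

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")

known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(t) for t in known_ngrams])

doc = nlp("The HOV lane saved me a lot of detour time today.")
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([token.text for token in doc])  # "HOV lane" and "detour time" are now single tokens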
I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script using regular expressions to do this, but over time contingencies arise that I have to account for, which keeps the regexp method from being a long-term solution. I have been reading up on NLTK, and it seems there are two ways to go about using it to solve my problem:
chunk the announcements using RegexpParser expressions - this might be a weak solution if two different fields I want to capture have the same sentence structure.
take n announcements, tokenize them and run them through the POS tagger, manually tag the parts of the announcements I want to capture using the IOB format, and then use those tagged announcements to train an NER model - a method discussed here
Before I go about manually tagging announcements I want to gauge:
whether option 2 is a reasonable solution
whether there are existing tagged corpora that might be useful for training my model
knowing that accuracy improves with training data size, how many manually tagged announcements I should start with.
Here's an example of how I am building the training set. If there are any apparent flaws please let me know.
Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.
spaCy has a default English NER model that can recognize organizations; it may or may not work for you, but it's worth a shot.
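As a sketch, trying the stock model takes only a few lines (the example sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")  # stock English pipeline with a general-purpose NER

doc = nlp("Acme Widgets Inc. has been awarded $1,200,000 for runway repairs at Example Air Force Base.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # ORG, MONEY, etc., depending on the model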
What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.
For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.
How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.
Hope that helps!
Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].
I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.
Given the regularity of the language, a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank tags; VP means "verb phrase", PP means "prepositional phrase", and IN means "preposition".)
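As a rough sketch, if you save the parser's bracketed output, NLTK can read it back as a Tree and you can walk it to find the PP under the awarding VP (the bracketed string below is a simplified, hand-written parse, not actual Stanford Parser output):

from nltk import Tree

# Simplified, hand-written parse of "XYZ Corp has been awarded $500,000 for runway repairs."
parse = Tree.fromstring(
    "(S (NP (NNP XYZ) (NNP Corp)) (VP (VBZ has) (VP (VBN been) "
    "(VP (VBN awarded) (NP ($ $) (CD 500,000)) (PP (IN for) (NP (NN runway) (NNS repairs)))))))"
)

for vp in parse.subtrees(lambda t: t.label() == "VP"):
    # Only the VP whose own verb is a form of "award" is of interest.
    heads = [c for c in vp if isinstance(c, Tree) and c.label().startswith("VB")]
    if not any(c[0].lower().startswith("award") for c in heads):
        continue
    for pp in vp.subtrees(lambda t: t.label() == "PP"):
        if pp[0].label() == "IN" and pp[0][0].lower() == "for":
            print(" ".join(pp[1].leaves()))  # -> "runway repairs"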
I have a few files containing email conversations for job postings. I want to extract the job title, location, and duration from each subject line, but I'm having a hard time figuring out how I can do that.
Here are a few examples of subject lines:
Subject: Looking for software developer : Cranbury New Jersey - 12 MOnths Contract
Subject:Immediate requirement for Math teacher in Warsaw IN for Full Time.
Subject:AP FICO Consultant-----North Carolina
It is not possible to use regex to accurately filter the dataset into the categories you need if the dataset has no clear format, like the examples you posted.
You would need to dive deeper and figure out how to analyze those subject lines for the keywords you are looking for. You would need to cross-reference location names, job titles and filter out the fluff words and characters.
If you really wanted to get into this, you should look into Deep Machine Learning and Neural Networks to process those subject lines to extract the relevant information. Only when you are able to do this (or similar) will you be able to categorize your emails and highlight those keywords for sorting/organizing.
This is not an easy process, and if you pursue it, good luck!
I have 300k+ HTML documents from which I want to extract postal addresses. The data has different structures, so regex won't work.
I have done a heap of reading on NLP and NLTK for python, however I am still struggling on where to start with this.
Is this approach called Part-of-Speech Tagging or Chunking / Partial Parsing? I can't find any documentation on how to actually TAG a page so I can train a model on it, or even what I should be training.
So my questions;
What is this approach called?
How can I tag some documents to train from?
Qn: Which NLP task is closely related to this task?
Ans: The task of detecting postal addresses can be viewed as a Named-Entity Recognition (NER) task. But I suggest viewing the task simply as sequence labeling on the HTML (i.e. your input data) and then performing some standard machine learning classification.
Qn: How can I tag some documents to use as training data?
Ans: What you can do is:
Label each word or each line as Begin, Inside, or Outside (BIO)
Choose a supervised classification method
Decide what the features are (here's a hint: Feature selection)
Build the model (basically just run the classification software with the configured features)
Voila, the output should give you B, I, and O labels; then just delete all the instances labelled O and you will be left with the lines/words that are addresses.
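A very rough sketch of that recipe, simplified to labeling whole lines as address / not-address with a plain scikit-learn classifier (the hand-picked features, the tiny inline training set, and the binary labels instead of full BIO are all simplifying assumptions):

import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def line_features(line):
    # A few crude surface features per line; a real system would use many more.
    return {
        "n_digits": sum(ch.isdigit() for ch in line),
        "has_zip": bool(re.search(r"\b\d{5}(-\d{4})?\b", line)),
        "has_street_word": bool(re.search(r"\b(St|Street|Ave|Avenue|Rd|Road)\b", line)),
        "n_tokens": len(line.split()),
    }

# Tiny made-up training set: 1 = part of an address, 0 = not.
train_lines = [
    "123 Main Street", "Springfield, IL 62704", "Contact us for a quote",
    "Our services include consulting", "45 Oak Ave, Suite 200", "About our team",
]
train_labels = [1, 1, 0, 0, 1, 0]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(line_features(l) for l in train_lines)
model = LogisticRegression().fit(X, train_labels)

new_lines = ["789 Elm Road", "Read our latest blog post"]
X_new = vectorizer.transform(line_features(l) for l in new_lines)
print(dict(zip(new_lines, model.predict(X_new))))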
Apple calls their software that does this "Data Detectors" (be careful, it's patented -- they won an injunction against HTC Android phones over this). More generally, I think this application is called Information Extraction.
Strip the text out of the HTML page (unless there is a way from the HTML to identify the address text, such as a div with a particular class), then build a set of rules that match the address formats used.
If there are postal addresses in several countries then the formats may be markedly different but within a country, the format is the same (with some tweaks) or it is not valid.
Within the US, for example, addresses are 3 or 4 lines (including the person). There is usually a zip code (5 digits optionally followed by four more). Other countries have postal codes in different formats.
Unless your goal is 100% accuracy on all addresses then you probably should aim for extracting as many addresses as you can within the budget for the task.
It doesn't seem like a task for NLP unless you want to use Named Entity identification to find cities, countries etc.
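As a rough sketch of that rule-based route for US-style addresses (the pattern below is deliberately loose and only covers a "number street / city, state ZIP" shape; real data will need many more rules):

import re

# Loose two-line US address pattern: "742 Evergreen Terrace" followed by "Springfield, OR 97475".
ADDRESS_RE = re.compile(
    r"\d{1,6}\s+[A-Za-z0-9 .]+(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Drive|Dr|Terrace|Ter)\.?\s*\n"
    r"\s*[A-Za-z .]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?",
    re.IGNORECASE,
)

text = """Visit our office:
742 Evergreen Terrace
Springfield, OR 97475
Open Monday to Friday."""

for match in ADDRESS_RE.finditer(text):
    print(match.group(0))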
Your task is called information extraction, but that's a very, very broad concept. Luckily your task is more limited (street addresses), but you don't give a lot of information:
What countries are the addresses in? An address in Tokyo looks very different from one in Cleveland. Your odds of succeeding are much better if you're interested in addresses from a limited number of countries, since you can develop a solution for each of them. If we're talking about a very limited number, you could code a recognizer manually.
What kind of webpages are we talking about? Are they a random collection, or can you group them into a limited number of websites and formats? Where do the addresses appear? (I.e., are there any contextual clues you can use to zero in on them?)
I'll take a worst-case scenario for question 2: the pages are completely disparate and the address could be anywhere. I'm not sure what the state of the art is, but I'd approach it as a chunking problem.
To get any kind of decent results, you need a training set. At a minimum, a large collection of addresses from the same locations and in the same style (informal, incomplete, complete) as the addresses you'll be extracting. Then you can try to coax decent performance out of a chunker (probably with post-processing).
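As a toy illustration of the chunking idea over POS tags (the grammar below is only a starting point and will badly under- and over-match on real pages; it also needs the NLTK tokenizer and tagger models downloaded):

import nltk

# Toy chunk grammar: a house number, proper nouns, a comma, more proper nouns, then a ZIP-like number.
grammar = r"ADDRESS: {<CD><NNP>+<,><NNP>+<CD>}"
chunker = nltk.RegexpParser(grammar)

tokens = nltk.word_tokenize("Send mail to 1600 Pennsylvania Avenue, Washington DC 20500 today.")
tagged = nltk.pos_tag(tokens)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(lambda t: t.label() == "ADDRESS"):
    print(" ".join(word for word, tag in subtree.leaves()))  # e.g. "1600 Pennsylvania Avenue , Washington DC 20500"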
PS. I wouldn't just discard the HTML markup. It contains information about document structure which could be useful. I'd add structural markup (paragraphs, emphasis, headings, lists, displays) before you throw out the HTML tags.
I've been searching for an open-source solution to suggest a category given a question or text.
For example, "who is Lady Gaga?" would probably return 'Entertainment', 'Music', or 'Celebrity'.
"How many strike out there are for baseball?" would give me 'Baseball', or 'Sport'.
The categorization doesn't have to be perfect but should be somewhat close.
Also is there anywhere I can get a list of popular categories?
This is a document classification problem - your "document" is simply the query or text.
You'll first need to decide what the list of possible categories is. "Who is Lady Gaga?" could be Entertainment, Celebrity, Questions-In-English, Biography, People, etc. Next you'll apply a decision framework to assign a score for each category to the text. The highest score is its category - as long as it's above a noise threshold and there isn't a second-place category that's too close to differentiate. Decision frameworks can include approaches like a Bayesian network or a set of custom rules.
Some open source projects that implement classifiers include:
Classifier4J
Matlab Classification Toolbox
POPFile (for email)
OpenNLP Maximum Entropy Package
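If you end up working in Python, here is a bare-bones sketch of the same idea with scikit-learn (the categories and training snippets are made up; a real system needs far more labeled data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set mapping text to a category.
train_texts = [
    "who is lady gaga", "latest pop album reviews", "movie premiere tonight",
    "how many strikeouts in a baseball game", "world series schedule", "nba playoff scores",
]
train_labels = ["Entertainment", "Entertainment", "Entertainment", "Sport", "Sport", "Sport"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["who won the baseball game last night"]))  # hopefully ['Sport']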
Screen scrape Wolfram Alpha.
http://www.wolframalpha.com/input/?i=Who+is+lady+gaga
http://www.wolframalpha.com/input/?i=What+is+baseball
You can probably get a good list of categories from dmoz.
Not much of an answer, but perhaps this categorized dictionary would help:
http://www.provalisresearch.com/wordstat/WordNet.html
I imagine you could extract the uncommon words from the string, look them up in the categorized dictionary, and return the categories that get the most matches on your terms. It'll be tricky to deal with pop culture references like "Lady Gaga", though...maybe you could do a Google search and analyze the results of that.
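A sketch of that lookup idea, with a made-up stand-in for the categorized dictionary (word_categories is purely illustrative):

from collections import Counter

# Made-up stand-in for a categorized dictionary lookup.
word_categories = {
    "gaga": ["Music", "Celebrity"],
    "baseball": ["Sport", "Baseball"],
    "strike": ["Sport", "Labor"],
}

def guess_categories(text, top_n=2):
    counts = Counter()
    for word in text.lower().split():
        counts.update(word_categories.get(word.strip("?.,!"), []))
    return [cat for cat, _ in counts.most_common(top_n)]

print(guess_categories("Who is Lady Gaga?"))  # -> ['Music', 'Celebrity']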
Others have done quite a bit of work on your behalf, so I'd suggest just using something like the OpenCalais API. There's a python wrapper to the API at http://code.google.com/p/python-calais/.
"Who is Lady Gaga?" seems to be too short a piece of text for them to give a decent response. However, if you took the trouble to do a two step process and grab the first paragraph from wikipedia for Lady Gaga, and then supply that to the OpenCalais API you get very good results.
You can check it out quickly by just cut and pasting the first paragraph from wikipedia into the OpenCalais viewer. The result is a classification into the topic "Entertainment Culture" with a 100% confidence estimate.
Similarly, the baseball example returns "sports" as the topic with further social tags of "recreation", "baseball" etc.
Edit: Here's another thought prompted by Calais' use of social tags: sending the Wikipedia URL for Lady Gaga to the Delicious API with
curl -k https://user:password@api.del.icio.us/v1/posts/suggest?url=http://en.wikipedia.org/wiki/Lady_gaga
returns
<?xml version="1.0" encoding="UTF-8"?>
<suggest>
<recommended>music</recommended>
<recommended>wikipedia</recommended>
<recommended>wiki</recommended>
<recommended>people</recommended>
<recommended>bio</recommended>
<recommended>cool</recommended>
<recommended>facts</recommended>
<popular>music</popular>
<popular>gaga</popular>
<popular>ladygaga</popular>
<popular>wikipedia</popular>
<popular>lady</popular>
etc. Should be easy enough to ignore the wikipedia/wiki type entries.