I have 300k+ HTML documents from which I want to extract postal addresses. The data comes in different structures, so regex won't work.
I have done a heap of reading on NLP and NLTK for Python, but I am still struggling with where to start.
Is this approach called Part-of-Speech Tagging or Chunking / Partial Parsing? I can't find any document on how to actually TAG a page so I can train a model on it, or even what I should be training.
So my questions:
What is this approach called?
How can I tag some documents to train from?
Qn: Which NLP task is closely related to this task?
Ans: The task of detecting postal addresses can be viewed as a Named-Entity Recognition (NER) task. But I suggest viewing it simply as sequence labeling over the HTML (i.e. your input data) and then performing some standard machine-learning classification.
Qn: How can I tag some documents to use as training data?
Ans: What you can do is:
Label each word or each line as Begin, Inside, or Outside an address
Choose a supervised classification method
Decide what the features are (here's a hint: feature selection)
Build the model (basically, just run the classification software with the configured features)
Voila, the output should give you B, I, and O labels; then just delete all the instances labelled O and you will be left with the lines/words that are addresses
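A minimal sketch of that recipe, assuming word-level labeling, hand-crafted word-shape features, and scikit-learn's LogisticRegression as the supervised classifier (all illustrative choices, and the toy data is invented):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: each token is labelled B(egin), I(nside) or O(utside) an address.
train_tokens = ["Contact", "us", "at", "12", "Main", "St", "Springfield", "today"]
train_labels = ["O", "O", "O", "B", "I", "I", "I", "O"]

def features(tokens, i):
    """Hand-crafted features for token i -- purely illustrative."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "prev_is_digit": i > 0 and tokens[i - 1].isdigit(),
    }

vec = DictVectorizer()
X = vec.fit_transform(features(train_tokens, i) for i in range(len(train_tokens)))
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Predict B/I/O for new text, then keep only tokens not labelled O.
test = ["Visit", "34", "Oak", "Ave", "Boston", "now"]
pred = clf.predict(vec.transform(features(test, i) for i in range(len(test))))
address = [tok for tok, label in zip(test, pred) if label != "O"]
print(address)
```

Real data would need many labeled documents and richer features (surrounding HTML tags, gazetteers of street suffixes, etc.), and a sequence model such as a CRF usually beats a per-token classifier for BIO tagging.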
Apple calls their software that does this "Data Detectors" (be careful, it's patented -- they won an injunction against HTC Android phones over this). More generally, I think this application is called Information Extraction.
Strip the text out of the HTML page (unless there is a way to identify the address text from the HTML, such as a div with a particular class), then build a set of rules that match the address formats used.
If there are postal addresses in several countries then the formats may be markedly different but within a country, the format is the same (with some tweaks) or it is not valid.
Within the US, for example, addresses are 3 or 4 lines (including the person). There is usually a zip code (5 digits optionally followed by four more). Other countries have postal codes in different formats.
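As a small illustration of such a rule, a US ZIP code pattern (5 digits optionally followed by a hyphen and 4 more) can flag candidate address lines; the example lines are invented:

```python
import re

# 5 digits, optionally followed by "-" and 4 more (ZIP+4).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

lines = [
    "John Doe",
    "1600 Pennsylvania Ave NW",
    "Washington, DC 20500-0003",
]
zip_codes = [m.group(0) for m in (ZIP_RE.search(line) for line in lines) if m]
print(zip_codes)  # ['20500-0003']
```

A matching ZIP line then anchors the surrounding 3-4 lines as an address candidate; other countries would need their own postal-code patterns.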
Unless your goal is 100% accuracy on all addresses then you probably should aim for extracting as many addresses as you can within the budget for the task.
It doesn't seem like a task for NLP unless you want to use Named Entity identification to find cities, countries etc.
Your task is called information extraction, but that's a very, very broad concept. Luckily your task is more limited (street addresses), but you don't give a lot of information:
What countries are the addresses in? An address in Tokyo looks very different from one in Cleveland. Your odds of succeeding are much better if you're interested in addresses from a limited number of countries, since you can develop a solution for each of them. If we're talking about a very limited number, you could code a recognizer manually.
What kind of webpages are we talking about? Are they a random collection, or can you group them into a limited number of websites and formats? Where do the addresses appear? (I.e., are there any contextual clues you can use to zero in on them?)
I'll take a worst-case scenario for question 2: the pages are completely disparate and the address could be anywhere. Not sure what the state of the art is, but I'd approach it as a chunking problem.
To get any kind of decent results, you need a training set. At a minimum, a large collection of addresses from the same locations and in the same style (informal, incomplete, complete) as the addresses you'll be extracting. Then you can try to coax decent performance out of a chunker (probably with post-processing).
PS. I wouldn't just discard the html mark-up. It contains information about document structure which could be useful. I'd add structural mark-up (paragraphs, emphasis, headings, lists, displays) before you throw out the html tags.
Related
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with Spacy a bit, but it doesn't seem to have any capability to do analysis on the corpus level, only on the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
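For the "find common bi- and tri-grams" step, NLTK's collocation finders are one option I've looked at (gensim's Phrases is another); a sketch on a toy corpus standing in for the real feedback strings:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus standing in for the ~138k feedback strings.
docs = [
    "the hov lane saved me time",
    "stuck in the hov lane again",
    "detour time was way off",
    "the detour time estimate is wrong",
]
tokens = [w for doc in docs for w in doc.split()]

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams seen at least twice
top = finder.nbest(BigramAssocMeasures.pmi, 5)
print(top)
```

The surviving bigrams here include ('hov', 'lane') and ('detour', 'time'); TrigramCollocationFinder works the same way for tri-grams, and the found n-grams could then be merged into single tokens before topic analysis.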
Bingo, state-of-the-art results for your problem!
It's called Zero-Shot Learning: state-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
The VADER tool is a good fit for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense in these use cases.
Using spaCy would be a better decision, as spaCy's rule-based match engines and components not only help you find the terms and sentences you are searching for, but also give you access to the tokens inside a text and their relationships, unlike regular expressions.
For a new project I have a need to extract information from web pages, more precisely imprint information. I use brat to label the documents and have started first experiments with spacy and NER. There are many videos and tutorials about this, but still some basic questions remain.
Is it possible to include the context of an entity?
Example text:
Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5.
Well, spaCy is very good at extracting the phone numbers. According to my latest tests, the error rate is less than two percent. I was able to achieve this after only 250 labeled documents; in the meantime I have labeled 450 documents, and my goal is about 5000.
Now to the actual point. Relevant are only the phone numbers that are shown in the context of the sentence "Responsible for the content", the other phone numbers are not relevant.
I could now imagine training these introductory sentences as entities, because they are always somehow similar. But how can I create the context? Are there perhaps existing NER-based models that do just that?
Maybe someone has already read some hints about this somewhere? As a beginner the hurdle is relatively high, because the material is really deep (a little play on words).
Greetings from Germany!
If I understand your question and use-case correctly, I would advise the following approach:
Train/design some system that recognizes all phone numbers - it looks like you've already got that
Train a text classifier to recognize the "responsible for content" sentences.
Implement some heuristics (can probably be rule-based?) to determine whether or not any recognized phone number is connected to any of the predicted "responsible for content" sentences - probably using straightforward features such as number of sentences in between, taking the first phone number after the sentence, etc.
So basically I would advise solving each NLP challenge separately, and then connecting the information throughout the document.
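A sketch of step 3, using the example text from the question; the phone regex, trigger phrase, and window size are assumptions you would tune against real pages:

```python
import re

PHONE_RE = re.compile(r"\+49[\d ]{6,}")   # crude German phone-number pattern
TRIGGER = "responsible for the content"    # the introductory sentence

lines = [
    "Responsible for the content:",
    "The Good Company GmbH 0331 Berlin",
    "You can contact us via +49 123 123 123.",
    "This website was created by good design GmbH, contact +49 12314 453 5.",
]

def relevant_phone(lines, window=3):
    """Return the first phone number within `window` lines after a trigger sentence."""
    for i, line in enumerate(lines):
        if TRIGGER in line.lower():
            for later in lines[i + 1 : i + 1 + window]:
                match = PHONE_RE.search(later)
                if match:
                    return match.group(0).strip()
    return None

print(relevant_phone(lines))  # +49 123 123 123
```

The designer's phone number in the last line is correctly ignored because it falls outside the trigger's window; on real pages you would likely combine several such heuristics.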
I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script using regular expressions to do this but over time contingencies arise that I have to account for which bar the regexp method from being a long term solution. I have been reading up on NLTK and it seems there are two ways to go about using NLTK to solve my problem:
chunk the announcements using RegexpParser expressions - this might be a weak solution if two different fields I want to capture have the same sentence structure.
take n announcements, tokenize and run the n announcements through the pos tagger, manually tag the parts of the announcements I want to capture using the IOB format and then use those tagged announcements to train an NER model. A method discussed here
Before I go about manually tagging announcements I want to gauge:
whether 2 is a reasonable solution
whether there are existing tagged corpora that might be useful for training my model
knowing that accuracy improves with training-data size, how many manually tagged announcements I should start with
Here's an example of how I am building the training set. If there are any apparent flaws please let me know.
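For concreteness, IOB-tagged training data of the kind option 2 describes could look like this (a toy, hand-tagged announcement; the ORG/AMT labels are invented for illustration):

```python
from nltk.chunk import conlltags2tree

# (token, POS tag, IOB chunk tag) triples for one hand-tagged announcement.
training_example = [
    ("Acme",      "NNP", "B-ORG"),
    ("Corp.",     "NNP", "I-ORG"),
    ("was",       "VBD", "O"),
    ("awarded",   "VBN", "O"),
    ("$",         "$",   "B-AMT"),
    ("1,200,000", "CD",  "I-AMT"),
    ("for",       "IN",  "O"),
    ("runway",    "NN",  "O"),
    ("repairs",   "NNS", "O"),
]

# NLTK can turn such triples into a chunk tree for training or inspection.
tree = conlltags2tree(training_example)
print(tree)
```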
Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.
Spacy has a default English NER model that can recognize organizations; it may or may not work for you but it's worth a shot.
What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.
For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.
How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.
Hope that helps!
Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].
I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.
Given the regularity of language, a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank tags; VP means "verb phrase", PP means "prepositional phrase", and IN means "preposition".)
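That tree surgery can be sketched with NLTK's Tree class; the bracketed parse below is written by hand in the style of Stanford Parser output, so real parser output may differ:

```python
from nltk.tree import Tree

# Hand-written Penn Treebank-style parse of a toy award sentence.
parse = Tree.fromstring("""
(S
  (NP (NNP Acme) (NNP Corp.))
  (VP (VBD was)
    (VP (VBN awarded)
      (NP ($ $) (CD 1,200,000))
      (PP (IN for)
        (NP (NN runway) (NNS repairs))))))
""")

def awarded_description(tree):
    """Find a VP whose verb is 'awarded', then return the NP of its 'for' PP."""
    for vp in tree.subtrees(lambda t: t.label() == "VP"):
        verbs = [c[0] for c in vp if isinstance(c, Tree) and c.label().startswith("VB")]
        if "awarded" in verbs:
            for pp in vp.subtrees(lambda t: t.label() == "PP"):
                if pp[0].label() == "IN" and pp[0][0] == "for":
                    return " ".join(pp[1].leaves())
    return None

print(awarded_description(parse))  # runway repairs
```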
I have a list of domain names and want to determine whether a domain's name looks like a porn site or not. What is the best way to do this? The list of porn domains looks like http://dumpz.org/56957/ . These domains can be used to teach the system what porn domains look like. I also have another list - http://dumpz.org/56960/ - many domains on that list are also porn, and I want to identify them by name.
Use a Bayesian filter, e.g. SpamBayes or Divmod's Reverend. Train it with the lists you have, and it can then score how likely a given domain is to be porn.
For a short overview, look at this article.
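SpamBayes and Reverend work on free text, but the same Bayesian idea applies to character n-grams of the domain name itself; a sketch using scikit-learn's MultinomialNB (my substitution for the tools above, with invented toy domains):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training lists standing in for the real domain dumps.
porn = ["hotxxxvideos.com", "freexxxcams.net", "xxxtube.org"]
safe = ["cooking-recipes.com", "gardeningtips.net", "kids-games.org"]

# Character 3-grams capture substrings like "xxx" without a hand-made wordlist.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    MultinomialNB(),
)
model.fit(porn + safe, ["porn"] * len(porn) + ["safe"] * len(safe))

print(model.predict(["bestxxxsite.com", "recipe-blog.com"]))
```

With the real lists as training data, `predict_proba` gives a porn-likelihood score per domain rather than a hard label.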
You can't rely on the domain name for that, there are far too many porn domains with decent names and few others with porn-like names but with safe content.
It might depend on what your goals are. I'm guessing that you are mostly interested in minimizing false negatives (accidentally calling a domain a good domain if it isn't). This might be true if, for example, you want all porn links in a forum to be reviewed for spam before being posted. If some non-porn links get flagged for review, it's OK.
In this case, you could probably do something fairly simple. If you could come up with a list of porn'ish words, you could just mark all of the domains that contain any of those words as a substring. This would catch some safe domains though: expertsexchange.com could match "sex" or "sexchange", but "yahoo" wouldn't ever flag positive. Easy to implement, easy to understand, easy to tweak.
Lists of obscene words can be found using your favorite search engine. You could use your list of domains to extract common long substrings across the domains as words as well.
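The substring approach really is only a few lines, false positives included (the wordlist is deliberately tiny here):

```python
BAD_WORDS = ["sex", "porn", "xxx"]  # illustrative, not exhaustive

def flag(domain):
    """Flag a domain if it contains any bad word as a substring."""
    return any(word in domain for word in BAD_WORDS)

print(flag("expertsexchange.com"))  # True -- the false positive described above
print(flag("yahoo.com"))            # False
```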
If you want to really get the answers correct, though, you'll need to see what is on those domains. Site-About-Kitty-Porn.com could be a lolcats domain or illegal porn. Impossible to know unless you do some crawling. If you fetch the actual content and match it against your list, you'd be doing a little better.
You could also try each domain against some third party service, such as a child-safe internet filter, or even trying to test if the domain will appear for safe-search results in your favorite search engine. Of course, make sure you are following each service's TOS and all of that.
As someone already pointed out, you need some kind of classification to achieve what you are trying to do. But then the overall accuracy (precision and recall) depends on the training dataset you have. You could use classifiers like SVMs, decision trees, etc. for this purpose.
I would suggest going for a semi-supervised approach where you cluster the different URLs and check a few representative URLs from each cluster to see whether they are porn or not. The benefit is that you don't need any training, and you can find porn URLs that your training dataset probably does not cover. Common clustering techniques are k-means, hierarchical, DBSCAN, etc.
This will still not cover porn sites that do not have porn-like URLs. For those you have to grab the page and do similar training/clustering on the content of the webpage(s).
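A sketch of the clustering idea, vectorizing domains by character n-grams (the domains and the choice of k-means are illustrative; DBSCAN or hierarchical clustering would slot in the same way):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

domains = [
    "hotxxxvideos.com", "freexxxcams.net", "xxxtube.org",
    "cooking-recipes.com", "gardeningtips.net", "kids-games.org",
]

# Represent each domain by its character 3- and 4-grams.
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 4))
X = vec.fit_transform(domains)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Inspect one representative per cluster instead of labelling everything.
for cluster in sorted(set(labels)):
    rep = next(d for d, l in zip(domains, labels) if l == cluster)
    print(cluster, rep)
```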
Do you mean something like this?
scala> val pornList = List("porn1.com","porn2.com","porn3.com")
pornList: List[java.lang.String] = List(porn1.com, porn2.com, porn3.com)
scala> val sites = List("porn1.com","site1.com","porn3.com","site2.com","site3.com")
sites: List[java.lang.String] = List(porn1.com, site1.com, porn3.com, site2.com, site3.com)
scala> val result = sites filterNot { pornList contains _ }
result: List[java.lang.String] = List(site1.com, site2.com, site3.com)
Check out this blog post on classifying webpages by topic. Start with the list of bad sites as your positive examples and use any heuristic for finding good sites (basic web crawler seeded with some innocent Google searches) as negative examples. The post walks you through the process of extracting content through the pages and touches on Weka and how you might apply some of their basic learners.
Note that you may want to add additional data to your training set that is specific to the domain of your problem instead of just using page contents. For example the number of pictures or size of pictures on the page may be a factor that you may want to consider.
I was wondering how a semantic service like Open Calais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (#OpenCalais) or to email us at team@opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and coverage.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms together with some of the other entities in context, and then use the results of that search to determine how confident you are that the extracted term is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g. the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text. Given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation plays an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
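The compounding is easy to quantify: with five stages that are each, say, 95% accurate and whose errors are independent, overall accuracy drops to about 77% (the 95% figure is invented for illustration):

```python
# Error compounding across a five-stage pipeline.
stages = ["tokenize", "sentence detect", "POS tag", "WSD", "NER"]
per_stage_accuracy = 0.95
overall = per_stage_accuracy ** len(stages)
print(f"{overall:.3f}")  # 0.774
```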
Open Calais probably uses language-parsing technology and language statistics to guess which words or phrases are names, places, companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to acquire related results.
It certainly isn't easy.