I want to extract relations from unstructured text in the form of (SUBJECT,OBJECT,ACTION) relations,
for instance,
"The boy is sitting on the table eating the chicken"
would give me,
(boy,chicken,eat)
(boy,table,LOCATION)
etc.
A Python program using NLTK could process a simple sentence like the one above, but I'd like to know whether any of you have used tools or libraries, preferably open source, to extract relations from a much wider domain, such as a large collection of text documents or the web.
If your sentences do not get much more complicated than the example you have shown (for instance, with respect to anaphoras), the Stanford parser will give good results, based on a probabilistic context-free grammar, that you will easily be able to convert into the format you want. There is a demo available online. For your example, it will give something like
nsubj(sitting, boy)
prep_on(sitting, table)
etc.
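As a rough illustration of the conversion mentioned above, the sketch below turns dependency lines written in the notation shown into (SUBJECT, OBJECT, ACTION) tuples. The function names, the relation naming (for example `sitting_on`) and the handling of `dobj` are assumptions made for illustration, and the input is typed by hand rather than taken from actual parser output.

```python
import re
from collections import defaultdict

# Matches lines such as "nsubj(sitting, boy)" or "prep_on(sitting, table)".
DEP_RE = re.compile(r"^(\w+)\((\w+),\s*(\w+)\)$")


def parse_dependencies(lines):
    """Turn 'label(head, dependent)' strings into (label, head, dependent) tuples."""
    deps = []
    for line in lines:
        match = DEP_RE.match(line.strip())
        if match:
            deps.append(match.groups())
    return deps


def to_relations(deps):
    """Build (subject, object, action) tuples for every head that has a subject."""
    subjects = {}
    others = defaultdict(list)
    for label, head, dependent in deps:
        if label == "nsubj":
            subjects[head] = dependent
        else:
            others[head].append((label, dependent))

    relations = []
    for head, subject in subjects.items():
        for label, dependent in others[head]:
            if label == "dobj":
                # Direct object: the head verb itself is the action.
                relations.append((subject, dependent, head))
            elif label.startswith("prep_"):
                # Collapsed preposition: append it to the verb, e.g. "sitting_on".
                preposition = label[len("prep_"):]
                relations.append((subject, dependent, head + "_" + preposition))
    return relations


lines = ["nsubj(sitting, boy)", "prep_on(sitting, table)"]
print(to_relations(parse_dependencies(lines)))
```

Verbs that share a subject through another dependency (such as "eating" in the example sentence) would need an extra rule that copies the subject across, which this sketch leaves out.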
If your sentences do get more complicated, you might be interested in trying Boxer, which builds discourse representation structures from C&C parses, based on probabilistic combinatory categorial grammars. Those structures may prove more difficult to adapt to the format you want, but will allow you much more flexibility. There is, again, a demo available online. For your example, it will look something like
sit(x)
boy(y)
table(z)
agent(x,y)
on(x,z)
etc.
The Stanford parser is written in Java and is available under the GPL. C&C is written in C++ and Boxer in SWI Prolog. Those two are not released under a genuinely free licence, but you can obtain the source code, modify it, and use it for any non-commercial project.
Neither will give you a characterisation for the relation between "boy" and "table" in your example—you will need much more powerful semantic reasoning tools for this, and I am not sure whether something like this exists.
Edit
It has now become once more possible to obtain the source code for C&C and Boxer, along with a collection of models.
Related
I am parsing contract announcements into columns to capture the company, the amount awarded, the description of the project awarded, etc. A raw example can be found here.
I wrote a script using regular expressions to do this but over time contingencies arise that I have to account for which bar the regexp method from being a long term solution. I have been reading up on NLTK and it seems there are two ways to go about using NLTK to solve my problem:
chunk the announcements using RegexpParser expressions - this might be a weak solution if two different fields I want to capture have the same sentence structure.
take n announcements, tokenize and run the n announcements through the pos tagger, manually tag the parts of the announcements I want to capture using the IOB format and then use those tagged announcements to train an NER model. A method discussed here
Before I go about manually tagging announcements I want to gauge
that 2 is a reasonable solution
if there are existing tagged corpus that might be useful to train my model
knowing that accuracy improves with training data size, how many manually tagged announcements I should start with.
Here's an example of how I am building the training set. If there are any apparent flaws please let me know.
Trying to get company names and project descriptions using just POS tags will be a headache. Definitely go the NER route.
Spacy has a default English NER model that can recognize organizations; it may or may not work for you but it's worth a shot.
What sort of output do you expect for "the description of the project awarded"? Typically NER would find items several tokens long, but I could imagine a description being several sentences.
For tagging, note that you don't have to work with text files. Brat is an open-source tool for visually tagging text.
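For reference, IOB tagging marks the first token of each entity with `B-`, the following tokens of the same entity with `I-`, and everything else with `O`. A minimal sketch of producing those tags from token spans follows; the function name, the `ORG` label and the sample tokens are made up for illustration, and the spans are assumed to be (start, end, label) with an exclusive end index.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end, label) token spans into a list of (token, IOB tag) pairs."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return list(zip(tokens, tags))


tokens = ["XYZ", "Corp", "has", "been", "awarded", "a", "contract"]
for token, tag in spans_to_iob(tokens, [(0, 2, "ORG")]):
    print(token, tag)
```

A tool like Brat stores annotations as character offsets, so a real pipeline would first need to map those offsets onto token indices.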
How many examples you need depends on your input, but think of about a hundred as the absolute minimum and build up from there.
Hope that helps!
Regarding the project descriptions, thanks to your example I now have a better idea. It looks like the language in the first sentence of the grants is pretty regular in how it introduces the project description: XYZ Corp has been awarded $XXX for [description here].
I have never seen typical NER methods used for arbitrary phrases like that. If you've already got labels there's no harm in trying and seeing how prediction goes, but if you have issues there is another way.
Given the regularity of the language, a parser might be effective here. You can try out the Stanford Parser online here. Using the output of that (a "parse tree"), you can pull out the VP where the verb is "award", then pull out the PP under that where the IN is "for", and that should be what you're looking for. (The capital letters are Penn Treebank tags; VP means "verb phrase", PP means "prepositional phrase", IN means "preposition".)
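The VP/PP lookup described above could be sketched roughly as follows. The bracketed tree is written by hand in Penn Treebank style and is only an assumption about what a parser might return for such a sentence; the small bracket reader and all function names are illustrative and not part of any parser's API.

```python
import re


def parse_tree(text):
    """Read a bracketed tree into (label, children) tuples; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def subtrees(node):
    """Yield every non-leaf node in pre-order."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from subtrees(child)


def find_award_description(tree):
    """Find the VP headed by 'award...' and return the words of its 'for' PP."""
    for vp in subtrees(tree):
        if vp[0] != "VP":
            continue
        verbs = [c for c in vp[1] if not isinstance(c, str) and c[0].startswith("VB")]
        if not any(leaves(v)[0].lower().startswith("award") for v in verbs):
            continue
        for pp in vp[1]:
            if isinstance(pp, str) or pp[0] != "PP":
                continue
            first = pp[1][0]
            if not isinstance(first, str) and first[0] == "IN" and leaves(first) == ["for"]:
                return " ".join(leaves(pp)[1:])   # drop the preposition itself
    return None


# Hand-written example tree (an assumption, not real parser output).
EXAMPLE = """
(S (NP (NNP XYZ) (NNP Corp))
   (VP (VBZ has)
       (VP (VBN been)
           (VP (VBN awarded)
               (NP ($ $) (CD 5) (CD million))
               (PP (IN for) (NP (NN runway) (NN repair)))))))
"""

print(find_award_description(parse_tree(EXAMPLE)))  # runway repair
```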
I am asking a related question here but this question is more general. I have taken a large corpus and annotated some words with their named entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named entities. For example, the following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of 2 approaches.
First is pattern matching over words in sentence. Something like this (pseudocode, though it is similar to NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These 2 patterns can (roughly) catch 2 parts of your first sentence. This is a good choice if you do not have many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend or modify.
Another approach is to mine dependencies between the words in a sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the subject, object and predicate, which seems very similar to what you want: in your first sentence "robot" is the subject, "had" is the predicate and "glitch" is the object.
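A runnable approximation of the first pattern, using a plain regular expression over a `word/TAG` string, might look like the sketch below. The tags are written by hand here (in practice they would come from a POS tagger), the names are made up for illustration, and only the head noun is captured rather than the full phrase such as "technical glitch".

```python
import re

# noun, then have/has/had, then optional determiners/adjectives, then a noun
HAVE_PATTERN = re.compile(
    r"(?P<entity>[^\s/]+)/NNS? "
    r"(?:have|has|had)/VB[A-Z]? "
    r"(?:[^\s/]+/(?:DT|JJ) )*"
    r"(?P<incident>[^\s/]+)/NNS?(?=\s|$)"
)


def find_entity_incident(tagged):
    """Return (entity, incident) head-noun pairs found in a 'word/TAG' string."""
    return [(m.group("entity"), m.group("incident"))
            for m in HAVE_PATTERN.finditer(tagged)]


first = ("When/WRB the/DT robot/NN had/VBD a/DT technical/JJ glitch/NN ,/, "
         "the/DT object/NN was/VBD thrown/VBN")
second = "The/DT nanobot/NN had/VBD a/DT bug/NN and/CC so/RB it/PRP crashed/VBD"

print(find_entity_incident(first))   # [('robot', 'glitch')]
print(find_entity_incident(second))  # [('nanobot', 'bug')]
```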
You could try object role modeling (ORM) at http://www.ormfoundation.com/ which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert ORM models into XML and other languages and vice versa. See http://orm.sourceforge.net/
I want to parse a text and categorize the sentences according to their grammatical structure, but I have a very small understanding of NLP so I don't even know where to start.
As far as I have read, I need to parse the text and find out (or tag?) the part-of-speech of every word. Then I search for the verb clause or whatever other defining characteristic I want to use to categorize the sentences.
What I don't know is if there is already some method to do this more easily or if I need to define the grammar rules separately or what.
Any resources on NLP that discuss this would be great. Program examples are welcome as well. I have used NLTK before, but not extensively. Other parsers or languages are OK too!
The Python Natural Language Toolkit (NLTK) is a library well suited to this kind of work. As with any NLP library, you will have to download the training datasets separately; corpora (data) and training scripts are available too.
There are also example tutorials which will help you identify the parts of speech of words. By all means, I think nltk.org should be the place to go for what you are looking for.
Specific questions could be posted here again.
Maybe you simply need to define patterns like "noun verb noun" etc. for each type of grammatical structure and search for matches in the output sequence of a part-of-speech tagger.
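One way to sketch this idea: reduce each Penn Treebank tag to a single letter, join the letters into a signature string, and match regular expressions against that signature. The structure names, the letter mapping and the patterns below are assumptions chosen for illustration, and the input is assumed to be a list of (word, tag) pairs such as a POS tagger would produce.

```python
import re

# Map tag prefixes to one-letter symbols; anything else becomes "x".
COARSE = (("NN", "N"), ("VB", "V"), ("DT", "D"), ("JJ", "J"), ("IN", "P"))

# Ordered from most to least specific; the first match wins.
STRUCTURES = [
    ("noun verb preposition noun", re.compile(r"D?J*NV+PD?J*N")),
    ("noun verb noun", re.compile(r"D?J*NV+D?J*N")),
    ("noun verb", re.compile(r"D?J*NV+")),
]


def coarse(tag):
    """Collapse a fine-grained tag such as 'VBD' or 'NNS' to a single symbol."""
    for prefix, symbol in COARSE:
        if tag.startswith(prefix):
            return symbol
    return "x"


def classify(tagged_sentence):
    """Return the name of the first structure whose pattern occurs in the sentence."""
    signature = "".join(coarse(tag) for _, tag in tagged_sentence)
    for name, pattern in STRUCTURES:
        if pattern.search(signature):
            return name
    return "other"


sentence = [("The", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")]
print(classify(sentence))  # noun verb noun
```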
I am looking for references (tutorials, books, academic literature) concerning structuring unstructured text in a manner similar to the google calendar quick add button.
I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"
to: Brand: Levi, Size: 32, Category: Jeans, code: A0b293
I imagine it would be some combination of lexical parsing and machine learning techniques.
I am rather language agnostic but if pushed would prefer python, Matlab or C++ references
Thanks
You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...
Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.
Assuming a better scenario (your text is semi-structured) - a good free tool is pyparsing. There is a book, plenty of examples and the resulting code is extremely attractive.
I hope this helps...
Possibly look at "Collective Intelligence" by Toby Segaran. I seem to remember that addressing the basics of this in one chapter.
After some researching I have found that this problem is commonly referred to as Information Extraction and have amassed a few papers and stored them in a Mendeley Collection
http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/
Also, as Tai Weiss noted, NLTK for Python is a good starting point, and this chapter of the book looks specifically at information extraction.
If you are only dealing with cases like the example you cited, you are better off using a manual rule-based approach that is 100% predictable and covers 90% of the cases it might encounter in production.
You could enumerate lists of all possible brands and categories and detect which is which in an input string, because there is usually very little overlap between these two lists.
The other two could easily be detected and extracted using regular expressions (1-3 digit numbers are always sizes, etc.).
Your problem domain doesn't seem big enough to warrant a more heavy duty approach such as statistical learning.
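A minimal sketch of that rule-based approach, assuming tiny hand-made brand and category lists (both hypothetical) and treating any token that mixes letters and digits as the product code:

```python
import re

# Hypothetical lexicons; a real system would load complete lists.
BRANDS = {"levi", "wrangler", "diesel"}
CATEGORIES = {"jeans", "shirt", "jacket"}

SIZE_RE = re.compile(r"\d{1,3}")                                  # 1-3 digit numbers
CODE_RE = re.compile(r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]+")       # letters and digits mixed


def structure(text):
    """Assign each whitespace-separated token to a field using lexicons and regexes."""
    record = {}
    for token in text.split():
        lower = token.lower()
        if lower in BRANDS:
            record["Brand"] = token
        elif lower in CATEGORIES:
            record["Category"] = token.capitalize()
        elif SIZE_RE.fullmatch(token):
            record["Size"] = token
        elif CODE_RE.fullmatch(token):
            record["code"] = token
    return record


print(structure("Levi jeans size 32 A0b293"))
# {'Brand': 'Levi', 'Category': 'Jeans', 'Size': '32', 'code': 'A0b293'}
```

Tokens that match no rule (such as "size") are simply ignored, which keeps the behaviour predictable at the cost of silently dropping anything unexpected.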
I was wondering how a semantic service like Open Calais figures out the names of companies, or people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (#OpenCalais) or to email us at team#opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms with some of the other entities in context and then use the results of that search to determine how confident you are that the term extracted is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD). For example, the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text. Given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation can play an important part in language understanding.
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
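To illustrate why sentence detection cannot rely on punctuation alone, here is a deliberately naive splitter that skips a short, hand-picked abbreviation list. The list and the function name are assumptions for illustration; as noted above, statistical approaches tend to fare better than rules like this.

```python
# Hypothetical abbreviation list; real text needs a far larger one or a trained model.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "etc."}


def split_sentences(text):
    """Split on ., ! or ? at the end of a token unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:                       # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Mr. Jones was tall. He left early."))
# ['Mr. Jones was tall.', 'He left early.']
```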
For Python there is NLTK (http://nltk.sourceforge.net), a separate toolkit that covers much of the same ground as OpenNLP, but I don't have much experience with it yet. Most of my work has been with the Java and C# versions of OpenNLP, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
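To make the compounding concrete: if every stage were correct 95% of the time and the errors were independent, the chance that all five stages get an item right is the product of the per-stage accuracies. Both the 95% figure and the independence assumption are illustrative, not measured.

```python
def pipeline_accuracy(stage_accuracies):
    """Probability that every stage is correct, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# tokenize -> sentence detect -> POS tag -> WSD -> NER, each assumed 95% accurate
stages = [0.95] * 5
print(round(pipeline_accuracy(stages), 3))  # 0.774
```

Under these assumptions, five quite good components still yield an end-to-end accuracy of only about 77%.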
Open Calais probably uses language parsing technology and language statistics to guess which words or phrases are Names, Places, Companies, etc. Then, it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to acquire related results.
It certainly isn't easy.