Extract information from sentence - python

I'm creating a simple chatbot. I want to obtain the information from the user response. An example scenario:
Bot : Hi, what is your name?
User: My name is Edwin.
I wish to extract the name Edwin from the sentence. However, the user can respond in different ways, such as
User: Edwin is my name.
User: I am Edwin.
User: Edwin.
I tried relying on the dependency relations between words, but the results were not good.
Any idea on what technique I could use to tackle this problem?

First off, I think complete name detection is really heavy to set up. If you want your bot to detect a name in, say, 99% of cases, you've got some work ahead of you. And I suppose name detection is only the very beginning of your plans...
This said, here are the first ideas that came to my mind:
Names are, grammatically speaking, nouns. So if one can perform a grammatical analysis of the sentence, some candidates for the name can be found.
Names are supposed to begin with a capital letter, although in a chat this is likely not to be respected, so it might be of little use... Still, a word beginning with a capital is a reasonable candidate for someone's name (though it could also be a place name...).
The patterns you could reasonably think of when introducing yourself are not that numerous, so you could "hard-code" them, with of course a little tolerance towards typos (see the sketch after this list).
If you are expecting an actual name, you could use a database holding a huge number of names, but have fun with the Hawaiian or Chinese names. Still, this appears to be a viable solution in the case of European names.
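To illustrate the hard-coded pattern idea, here is a minimal sketch (the pattern list and the single-word fallback are my own illustrative assumptions, not an exhaustive set):

import re

# A few hand-written introduction patterns, checked in order.
# These are illustrative assumptions, not a complete list.
NAME_PATTERNS = [
    re.compile(r"my name is (\w+)", re.IGNORECASE),
    re.compile(r"(\w+) is my name", re.IGNORECASE),
    re.compile(r"i am (\w+)", re.IGNORECASE),
    re.compile(r"i'm (\w+)", re.IGNORECASE),
]

def extract_name(utterance):
    for pattern in NAME_PATTERNS:
        match = pattern.search(utterance)
        if match:
            return match.group(1)
    # Fallback: a one-word reply such as "Edwin." is probably just the name.
    words = re.findall(r"\w+", utterance)
    return words[0] if len(words) == 1 else None

print(extract_name("My name is Edwin."))  # Edwin
print(extract_name("I am Edwin."))        # Edwin
print(extract_name("Edwin."))             # Edwin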
However, I am no AI specialist, and I'm looking forward to seeing other proposals.

I'd suggest using Named Entity Recognition (NER):
You can play with it yourself: http://nlp.cogcomp.org/
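For example, here is a minimal sketch with spaCy's pretrained NER (assuming the small English model has been installed via python -m spacy download en_core_web_sm; the pretrained model can still miss casual phrasings, so a pattern-based fallback is worth keeping):

import spacy

# Load a pretrained English pipeline that includes an NER component.
nlp = spacy.load("en_core_web_sm")

for sentence in ["My name is Edwin.", "Edwin is my name.", "I am Edwin."]:
    doc = nlp(sentence)
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    print(sentence, "->", names)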

There are many alternatives, spread over two broad 'models':
Based on NLP training; uses HTTP for integration/delivery:
Microsoft LUIS
API.AI
IBM Watson
Based on pattern matching; uses an interpreter (needs a native implementation or a bridge from another implementation):
Rivescript - Python interpreter available
ChatScript - needs a C++ bridge/interop
AIML - Python interpreter available
This is not an exhaustive listing of current options.
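As an illustration of the pattern-matching model, here is a minimal RiveScript sketch using the rivescript package from PyPI (the two triggers are toy examples of mine, not a real brain; note that RiveScript normalizes input to lowercase, so the captured name comes back lowercased):

from rivescript import RiveScript

bot = RiveScript()
# Stream a tiny brain directly instead of loading .rive files from disk.
bot.stream("""
+ my name is *
- Nice to meet you, <star>!

+ *
- I did not catch your name.
""")
bot.sort_replies()

print(bot.reply("localuser", "My name is Edwin"))  # Nice to meet you, edwin!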

Detecting names can be complicated if you consider things like "My name is not important", "My name is very long", etc.
Here is a public-domain script in Self that attempts to parse a name; you may be able to adapt it to Python. It also does some crazy stuff, like looking the words up on Wiktionary to see whether they are classified as names:
https://www.botlibre.com/script?id=525804


Query using NLP [PYTHON]

I'm working on a project where I need to extract "inputs" and "query intent" from text.
For example "What is the status of asset X26TH?"
In this case the main issue is extracting the asset ID, which is X26TH, but how can I make my code understand that it's an ID?
The other thing is understanding the query intent, which here is asset status. I found a good library for this called quepy, but it's meant for Linux and I couldn't set it up on Windows.
Please help me with the techniques and libraries.
So you have two problems: ID extraction and intent detection.
ID Extraction
If your IDs follow a regular pattern and definitely don't look like English, you can catch them with a regex - if that's possible, that's great since it's very easy to do. If you have a fixed list of product IDs, just check to see if any of them are in the input. If neither of those work then you'll have to get more sophisticated.
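A minimal sketch of the regex route (the pattern here, one uppercase letter, two digits, two uppercase letters, is only a guess from the single example X26TH; adjust it to your real ID scheme):

import re

# Guessed asset-ID format: letter, two digits, two letters. Adjust as needed.
ASSET_ID = re.compile(r"\b[A-Z]\d{2}[A-Z]{2}\b")

query = "What is the status of asset X26TH?"
print(ASSET_ID.findall(query))  # ['X26TH']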
Can you get your users to remember a little syntax? If you can request that they write things with a prefix like id:X26TH or similar that would make your job easier. You may find the way the plumber in Plan9 works informative.
If you need to work with whatever the users throw at you, you should look into using a sequence labeller or Named Entity Recognition (NER) system to get IDs. CRFs are probably a good fit for this task; here's a good technical introduction, and the New York Times also used one with success. Besides being trickier to set up, a downside of this approach is that it requires training data, but there's really no way to avoid that.
Intent Detection
This is usually modelled as a text classification problem. You can find an overview of how to do that here. Here are some training examples from the article:
training_data = []
training_data.append({"class":"greeting", "sentence":"how are you?"})
training_data.append({"class":"greeting", "sentence":"how is your day?"})
training_data.append({"class":"greeting", "sentence":"good day"})
training_data.append({"class":"greeting", "sentence":"how is it going today?"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"see you later"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"talk to you soon"})
training_data.append({"class":"sandwich", "sentence":"make me a sandwich"})
training_data.append({"class":"sandwich", "sentence":"can you make a sandwich?"})
training_data.append({"class":"sandwich", "sentence":"having a sandwich today?"})
training_data.append({"class":"sandwich", "sentence":"what's for lunch?"})

Python - Deciphering Line from SiriServer Plugin

I'm currently learning how to program plugins for SiriServer, in the hope of creating a bit of home automation using my phone. I'm trying to figure out how to get the speech-to-text output to match and run the plugin.
I've learnt how to do short phrases, like this for example:
#register("en-US", ".*Start.*XBMC.*")
Though if I'm understanding it correctly, it's searching loosely for the two words. If I were to say "XBMC Start", it would probably work as well, but when I start working with WolframAlpha, I need to be a bit more specific.
For example, speech to text saying "What's the weather like in Toronto?", somehow connects to this:
#register("en-US", "(what( is|'s) the )?weather( like)? in (?P<location>[\w ]+?)$")
What do all the extra symbols in that line mean, and how do they connect these two together? I've tried messing around with a couple of ideas, but nothing seems to work the way I want it to. Any help is appreciated, thanks!
I will break down the example you provided, so hopefully that is a good start; searching for "python regex" will provide more thorough information.
The parentheses group the enclosed items so the rest of the expression treats them as a unit. The pipe means "or", a question mark means the preceding group may or may not be present, and (?P<location>[\w ]+?) is a named group that captures whatever the user said at this point in the input into the variable "location". The $ at the end anchors the pattern to the end of the sentence. .* means anything at this place in the input is acceptable but is otherwise ignored. Hopefully that helps.
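Here is a quick sketch of that pattern in plain Python, showing what the named group captures (punctuation is assumed to be stripped beforehand, since the $ anchor will not match past a trailing question mark):

import re

pattern = re.compile(r"(what( is|'s) the )?weather( like)? in (?P<location>[\w ]+?)$",
                     re.IGNORECASE)

for phrase in ["What's the weather like in Toronto", "weather in New York"]:
    match = pattern.search(phrase)
    if match:
        print(match.group("location"))  # Toronto / New York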

Fetching google's answers, and displaying them?

I'm starting a personal assistant project. I've noticed that if you type something like "How old is Obama" on Google, the first hit is a little thing that says "51 years (August 4, 1961)". This works for a lot of things, like, if you type "Who is Romney's wife" it returns "Ann Romney (m. 1969)". This is incredibly useful. How can I fetch this data and retrieve it?
Also, if nothing pops up, as with "How much money is Google worth", I'd scan each of the hits one by one and determine the answer from them. (I can do the determination part; I just need to know how to do the scanning.)
Can this be done using urllib2?
Have you considered WolframAlpha?
It's more suited to dynamic computations based on a vast collection of built-in data, algorithms, and methods.
Here is an example query: "How old is Obama".
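A minimal sketch using the wolframalpha package from PyPI (YOUR_APP_ID is a placeholder; you need to request a real App ID from the Wolfram|Alpha developer portal):

import wolframalpha

# Placeholder credential; get a real App ID from developer.wolframalpha.com.
client = wolframalpha.Client("YOUR_APP_ID")

res = client.query("How old is Obama")
# Print the text of the first result pod, if the query produced one.
answer = next(res.results, None)
print(answer.text if answer else "No direct answer found.")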

Selecting the most fluent text from a set of possibilities via grammar checking (Python)

Some background
I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.
I need to implement some form of grammatical normalization, so that the output doesn't come out as unconjugated/uninflected caveman-speak. About a month ago some friendly folks on SO gave me advice on how I might solve this issue by using an n-gram language modeller, basically -- but I'm looking for yet other solutions, as it seems that NLTK's NgramModeler is not fit for my needs. (The possibility of POS tagging was also mentioned, but my text may be too fragmentary and strange for such an implementation to come easily, given my amateurness.)
Perhaps I need something like AtD, but hopefully less complex
I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix, and I can't get it working on Windows for the life of me (I have tried everything). But I like that all it checks for is proper verb conjugation and number agreement.
On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the Python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require of my web host is already threatening to cause problems in getting this application hosted cheaply.)
Things I'd like to avoid
Building n-gram language models myself doesn't seem right for the task: my application throws out a lot of unknown vocabulary, skewing all the results. (Unless I use a corpus so large that it runs far too slowly for my application -- the application needs to be pretty snappy.)
Strict grammar checking isn't right for the task either: the grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish you can generate using n-grams. Even if it's gibberish, I just need to enforce verb conjugation and number agreement, and do things like remove extra articles.
In fact, I don't even need any kind of suggestions for corrections. I think all I need is something to tally up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by their scores and pick the one with the fewest grammatical issues.
A simple solution? Scoring fluency by detecting obvious errors
If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.
Let's say we have a tiny bit of text already laid out:
existing_text = "The old river"
Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors (a sketch follows the steps below):
1. Use the verb-conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb: ['bear', 'bears', 'bearing', 'bore', 'borne'].
2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc.). Tally the error count for each construction. In this case, the only construction to raise an error would seemingly be "The old river bear".
3. Wrapping up should be easy: of the possibilities with the lowest error count, select randomly.
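As a concrete sketch of steps 2 and 3, here is how the tallying could look with language_tool_python, a Python wrapper around the LanguageTool grammar checker (my substitution for AtD/Queequeg; the conjugation list is hard-coded here rather than generated by NodeBox Linguistics):

import random
import language_tool_python

# LanguageTool runs locally (it bundles a Java server under the hood).
tool = language_tool_python.LanguageTool("en-US")

existing_text = "The old river"
possibilities = ["bear", "bears", "bearing", "bore", "borne"]

# Step 2: tally grammar errors for each candidate continuation.
scores = {}
for word in possibilities:
    candidate = existing_text + " " + word
    scores[candidate] = len(tool.check(candidate))

# Step 3: choose randomly among the candidates with the fewest errors.
fewest = min(scores.values())
winners = [text for text, count in scores.items() if count == fewest]
print(random.choice(winners))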
Very cool project, first of all.
I found a Java grammar checker. I've never used it, but the docs claim it can run as a server. Both Java and listening on a port should be supported basically anywhere.
I'm just getting into NLP from a CS background, so I wouldn't mind going into more detail to help you integrate whatever you decide on using. Feel free to ask.
Another approach would be to use what is called an overgenerate-and-rank approach. In the first step, you have your poetry generator produce multiple candidate generations. Then you use a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, you extract features from the generated sentences (presumably using some form of syntactic parser) to train a model to rate or classify their quality. You could even throw in the heuristics listed above.
Michael Heilman uses this approach for question generation. For more details, read these papers:
Good Question! Statistical Ranking for Question Generation and
Rating Computer-Generated Questions with Mechanical Turk.
The pylinkgrammar link provided above is a bit out of date. It points to version 0.1.9, and the code samples for that version no longer work. If you go down this path, be sure to use the latest version, which can be found at:
https://pypi.python.org/pypi/pylinkgrammar

Suggest semantic tags for short snippets of text

I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) for a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it; I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically, and therefore consistently, and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool, so I am looking into writing my own.
My question is, first of all, whether there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais, and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries, and algorithms.
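For instance, here is a minimal sketch that pulls candidate noun-phrase n-grams out of a snippet as tag suggestions (the lookup against Freebase/Wikipedia is left as a stub, since that part depends on which entity service you pick):

import nltk

# One-time downloads for the tokenizer and POS tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def suggest_tags(snippet):
    tagged = nltk.pos_tag(nltk.word_tokenize(snippet))
    # Chunk runs of adjectives/nouns as candidate entity names.
    chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)
    candidates = [" ".join(word for word, tag in subtree.leaves())
                  for subtree in tree.subtrees()
                  if subtree.label() == "NP"]
    # Stub: look each candidate up in Freebase/Wikipedia here and keep
    # only those that resolve to known entities.
    return candidates

print(suggest_tags("The Large Hadron Collider restarted at CERN today."))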
