I'm working on a project where I need to extract "inputs" and "query intent" from text.
For example "What is the status of asset X26TH?"
In this case the main issue is extracting the asset ID, which is X26TH, but how can I make my code understand that it's an ID?
The other thing is to understand the query intent, which is asset status. I found a good library for this called quepy, but it's meant for Linux and I couldn't set it up on Windows.
Please help me with the techniques and libraries.
So you have two problems, ID extraction and intent detection.
ID Extraction
If your IDs follow a regular pattern and definitely don't look like English, you can catch them with a regex - if that's possible, that's great since it's very easy to do. If you have a fixed list of product IDs, just check to see if any of them are in the input. If neither of those work then you'll have to get more sophisticated.
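To make the two easy options concrete, here is a minimal sketch. The ID pattern (one uppercase letter, two digits, two uppercase letters) and the list of known IDs are assumptions made up for illustration; you would replace both with whatever your real asset IDs look like.

```python
import re

# Assumed ID shape, guessed from the single example X26TH:
# one uppercase letter, two digits, two uppercase letters.
ASSET_ID_PATTERN = re.compile(r"\b[A-Z]\d{2}[A-Z]{2}\b")

# Hypothetical fixed list of known IDs, for the lookup approach.
KNOWN_IDS = {"X26TH", "B07QK", "M11ZA"}


def find_ids_by_regex(text):
    """Return every substring that matches the assumed ID pattern."""
    return ASSET_ID_PATTERN.findall(text)


def find_ids_by_lookup(text, known_ids=KNOWN_IDS):
    """Return the tokens of the text that appear in the known-ID list."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [tok.upper() for tok in tokens if tok.upper() in known_ids]


question = "What is the status of asset X26TH?"
print(find_ids_by_regex(question))
print(find_ids_by_lookup(question))
```

The lookup version is case-insensitive, which helps if users type IDs in lowercase; the regex version is not, because lowercase IDs would start to look like ordinary words.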
Can you get your users to remember a little syntax? If you can request that they write things with a prefix like id:X26TH or similar that would make your job easier. You may find the way the plumber in Plan9 works informative.
If you need to work with whatever the users throw at you, you should look into using a sequence labeller or Named Entity Recognition (NER) system to get IDs. CRFs are probably a good fit for this task; here's a good technical introduction, and the New York Times also used one with success. Besides being trickier to set up, a downside of this is that it will require training data, but there's really no way to avoid that.
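To give an idea of what a sequence labeller works with, here is a rough sketch of per-token features and BIO-style labels for one sentence. The feature names and the "B-ID" label are my own choices for illustration, not the input format of any particular CRF library; a real setup would feed many such labelled sentences to the toolkit of your choice.

```python
import re


def word_shape(token):
    """Collapse a token to its shape, e.g. 'X26TH' becomes 'XddXX'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative features only)."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_upper": token.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# One hand-labelled training sentence: "O" means outside, "B-ID" marks an ID token.
tokens = "What is the status of asset X26TH ?".split()
labels = ["O", "O", "O", "O", "O", "O", "B-ID", "O"]

features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[6])
```

Features like the word shape and the previous word ("asset") are what let a model generalise to IDs it has never seen.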
Intent Detection
This is usually modelled as a text classification problem. You can find an overview of how to do that here. Here are some training examples from the article:
training_data = []  # list of labelled example sentences
training_data.append({"class":"greeting", "sentence":"how are you?"})
training_data.append({"class":"greeting", "sentence":"how is your day?"})
training_data.append({"class":"greeting", "sentence":"good day"})
training_data.append({"class":"greeting", "sentence":"how is it going today?"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"see you later"})
training_data.append({"class":"goodbye", "sentence":"have a nice day"})
training_data.append({"class":"goodbye", "sentence":"talk to you soon"})
training_data.append({"class":"sandwich", "sentence":"make me a sandwich"})
training_data.append({"class":"sandwich", "sentence":"can you make a sandwich?"})
training_data.append({"class":"sandwich", "sentence":"having a sandwich today?"})
training_data.append({"class":"sandwich", "sentence":"what's for lunch?"})
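For a feel of how such examples turn into a classifier, here is a small self-contained sketch using a multinomial naive Bayes model with add-one smoothing, written with the standard library only. Naive Bayes is my choice for brevity and is not necessarily the method the linked overview uses; the training set below is a subset of the examples above.

```python
import math
import re
from collections import Counter, defaultdict

training_data = [
    {"class": "greeting", "sentence": "how are you?"},
    {"class": "greeting", "sentence": "how is your day?"},
    {"class": "greeting", "sentence": "good day"},
    {"class": "greeting", "sentence": "how is it going today?"},
    {"class": "goodbye", "sentence": "have a nice day"},
    {"class": "goodbye", "sentence": "see you later"},
    {"class": "goodbye", "sentence": "talk to you soon"},
    {"class": "sandwich", "sentence": "make me a sandwich"},
    {"class": "sandwich", "sentence": "can you make a sandwich?"},
    {"class": "sandwich", "sentence": "what's for lunch?"},
]


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def train(examples):
    """Count words per class and sentences per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for example in examples:
        class_counts[example["class"]] += 1
        word_counts[example["class"]].update(tokenize(example["sentence"]))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def classify(sentence, model):
    """Return the class with the highest naive Bayes log-probability."""
    word_counts, class_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_examples)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(sentence):
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_class, best_score = label, score
    return best_class


model = train(training_data)
print(classify("make me a sandwich please", model))
print(classify("see you later", model))
```

With only a handful of sentences per class this is a toy; for the asset use case you would collect example questions per intent (status, location, and so on) and train on those.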
Related
I am doing some research related to fake news on Twitter. I have isolated timelines of specific fake news stories in Python, and I wanted to find out if there was a way to determine whether the text in each tweet was agreeing with the story or refuting it, thus giving me two separate categories that I could compare over time. I know I could look for keywords like 'fake' or 'false' to help classify, however I was hoping there was a more thorough way of doing it.
Thanks.
I'm creating a simple chatbot. I want to obtain the information from the user response. An example scenario:
Bot : Hi, what is your name?
User: My name is Edwin.
I wish to extract the name Edwin from the sentence. However, the user can respond in different ways, such as
User: Edwin is my name.
User: I am Edwin.
User: Edwin.
I tried to rely on the dependency relations between words, but the results were not good.
Any idea on what technique I could use to tackle this problem?
First off, I think a complete name detection is really heavy to set up. If you want your bot to be able to detect a name in like 99% of the cases, you've got some work. And I suppose the name detection is only the very beginning of your plans...
This said, here are the first ideas that came to my mind:
Names are, grammatically speaking, nouns. So if one can perform a grammatical analysis of the sentence, some candidates for the name can be found.
Names are supposed to begin with a capital letter, although in a chat this is likely not to be respected, so it might be of little use... However, if one comes across a word beginning with a capital letter, it is likely to be someone's name (though it could be a place name...).
The patterns you could reasonably think of when introducing yourself are not that numerous, so you could "hard-code" them, with of course a little tolerance towards typos.
If you are expecting an actual name, you could use a database holding a huge amount of names, but have fun with the Hawaiian or Chinese names. Still, this appears as a viable solution in the case of European names.
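The "hard-coded patterns" idea could look something like the sketch below. The pattern list is hypothetical and only covers the four replies from the question; it handles single-word names only and does nothing about typos, so treat it as a starting point.

```python
import re

# Hypothetical introduction patterns; the named group captures the candidate name.
NAME_PATTERNS = [
    re.compile(r"^my name is (?P<name>\w+)\W*$", re.IGNORECASE),
    re.compile(r"^(?P<name>\w+) is my name\W*$", re.IGNORECASE),
    re.compile(r"^i am (?P<name>\w+)\W*$", re.IGNORECASE),
    re.compile(r"^i'm (?P<name>\w+)\W*$", re.IGNORECASE),
    re.compile(r"^(?P<name>\w+)\W*$"),  # a bare one-word reply
]


def extract_name(reply):
    """Return the first pattern match as the name, or None if nothing fits."""
    text = reply.strip()
    for pattern in NAME_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("name")
    return None


for reply in ["My name is Edwin.", "Edwin is my name.", "I am Edwin.", "Edwin."]:
    print(reply, "->", extract_name(reply))
```

Because every pattern is anchored at both ends, a reply with extra words after the candidate name is rejected instead of producing a wrong guess, and the bot can then simply ask again.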
However, I am no AI specialist, and I'm looking forward to seeing other proposals.
I'd suggest using NER:
You can play with it yourself: http://nlp.cogcomp.org/
There are many alternatives, which fall into only two 'models':
Based on NLP training; uses HTTP for integration/delivery:
Microsoft LUIS
API.AI
IBM Watson
Based on pattern matching; uses an interpreter (needs a native implementation or a bridge from another implementation):
Rivescript - Python interpreter available
ChatScript - needs a C++ bridge/interop
AIML - Python interpreter available
This is not an extensive listing of current options.
Detecting names can be complicated if you consider things like "My name is not important", "My name is very long", etc.
Here is a public domain script in Self that attempts to parse a name; you may be able to adapt it to Python. It also does some crazy stuff, like looking up the words on Wiktionary to see if they are classified as names:
https://www.botlibre.com/script?id=525804
I am writing a user-app that takes input from the user as the current open wikipedia page. I have written a piece of code that takes this as input to my module and generates a list of keywords related to that particular article using webscraping and natural language processing.
I want to expand the functionality of the app by providing, in addition to the keywords that I have identified, a set of related topics that may be of interest to the user. Is there any API that Wikipedia provides that will do the trick? If there isn't, can anybody point me to what I should be looking into (in case I have to write code from scratch)? Also, I would appreciate any pointers to an algorithm that will train the machine to identify topic maps. I am not looking for a paper but rather a practical implementation of something basic.
So, to summarize:
I need a way to find topics related to the current article in Wikipedia (categories will also do).
I would also appreciate a sample algorithm for training a machine to identify topics that are usually related and clustered.
P.S. Please be specific, because I have already researched a number of the obvious possibilities.
Appreciate it, thank you.
You can scrape the categories if you want. If you're working with python, you can read the wikitext directly from their API, and use mwlib to parse the article and find the links.
A more interesting but harder to implement approach would be to create clusters of related terms, and given the list of terms extracted from an article, find the closest terms to them.
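As a very rough illustration of the "closest terms" idea, the sketch below represents each term by the terms it co-occurs with across a set of articles and compares terms by cosine similarity. The four keyword lists are made-up stand-ins for keywords extracted from real articles, and this is only one simple way to measure relatedness, not a full clustering algorithm.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(documents):
    """Map each term to a Counter of the terms it shares a document with."""
    vectors = defaultdict(Counter)
    for terms in documents:
        unique = set(terms)
        for term in unique:
            for other in unique:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def closest_terms(term, vectors, top_n=3):
    """Return up to top_n terms with a non-zero similarity to the given term."""
    target = vectors.get(term, Counter())
    scores = [
        (cosine(target, vector), other)
        for other, vector in vectors.items()
        if other != term
    ]
    ranked = sorted(scores, reverse=True)[:top_n]
    return [other for score, other in ranked if score > 0]


# Hypothetical keyword lists, one per article.
documents = [
    ["newspaper", "journalism", "editor"],
    ["newspaper", "journalism", "headline"],
    ["protein", "enzyme", "cell"],
    ["protein", "cell", "membrane"],
]
vectors = cooccurrence_vectors(documents)
print(closest_terms("newspaper", vectors, top_n=5))
```

With real data you would build the vectors from many articles, which is where the "harder to implement" part comes in: the quality depends almost entirely on how much text you process.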
"See also" is a section often present in Wikipedia pages.
It is structured like the example below, from [[Article (publishing)]]:
==See also==
* [[Article directory]]
* [[Electronic article]]
You should then parse the wikicode (you can take that via dumps or the Mediawiki API, as hinted in the previous answers), and use the articles mentioned.
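If you already have the wikicode as a string, pulling the "See also" links out can be done with a couple of regular expressions. This is a minimal sketch that assumes the section uses a level-2 heading exactly as in the example above; a proper wikitext parser would be more robust against templates and unusual markup.

```python
import re


def see_also_links(wikitext):
    """Return the link targets listed under the '==See also==' heading."""
    # Capture everything after the heading up to the next level-2 heading or the end.
    section = re.search(
        r"^==[ \t]*See also[ \t]*==[ \t]*$(.*?)(?=^==[^=]|\Z)",
        wikitext,
        flags=re.MULTILINE | re.DOTALL | re.IGNORECASE,
    )
    if section is None:
        return []
    # Links look like [[Target]] or [[Target|label]]; keep only the target.
    return [t.strip() for t in re.findall(r"\[\[([^\]|#]+)", section.group(1))]


sample = """Some introductory text with a [[Stray link]].

==See also==
* [[Article directory]]
* [[Electronic article]]

==References==
* [[Not a related topic]]
"""
print(see_also_links(sample))
```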
Another way is to use the Wikipedia categories directly; there are APIs for that.
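For the categories route, a sketch along these lines should work against the MediaWiki Action API (action=query with prop=categories). The helper names are mine, and the sample response is hand-written for illustration rather than copied from a real call, so check the actual output for your pages.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def categories_url(title):
    """Build the query URL that asks for the categories of one page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_categories(response_text):
    """Extract category names from the JSON text of a categories query."""
    data = json.loads(response_text)
    names = []
    for page in data.get("query", {}).get("pages", {}).values():
        for category in page.get("categories", []):
            # Titles come back as "Category:Name"; drop the namespace prefix.
            names.append(category["title"].split(":", 1)[-1])
    return names


# Hand-written sample in the shape the API returns (values are illustrative).
sample_response = json.dumps({
    "query": {"pages": {"123": {
        "pageid": 123, "ns": 0, "title": "Article (publishing)",
        "categories": [
            {"ns": 14, "title": "Category:Newspaper content"},
            {"ns": 14, "title": "Category:Publishing"},
        ],
    }}}
})

print(categories_url("Article (publishing)"))
print(parse_categories(sample_response))
```

To fetch for real you would pass the URL to urllib.request.urlopen (or any HTTP client) and hand the response body to parse_categories.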
Some background
I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.
I need to implement some form of grammatical normalization, so that the output doesn't come out as un-conjugated/inflected caveman-speak. About a month ago some friendly folks on SO gave me some advice on how I might solve this issue by using an ngram language modeller, basically -- but I'm looking for yet other solutions, as it seems that NLTK's NgramModeler is not fit for my needs. (The possibilities of POS tagging were also mentioned, but my text may be too fragmentary and strange for an implementation of such to come easy, given my amateur-ness.)
Perhaps I need something like AtD, but hopefully less complex
I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix and I can't get it working on Windows for the life of me (have tried everything). But I like that all it checks for is proper verb conjugation and number agreement.
On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require of my web host is already threatening to cause problems in getting this application hosted cheaply.)
Things I'd like to avoid
Building Ngram language models myself doesn't seem right for the task. My application throws a lot of unknown vocabulary, skewing all the results. (Unless I use a corpus that's so large that it runs way too slow for my application -- the application needs to be pretty snappy.)
Strictly checking grammar isn't right for the task either. The grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish that you can generate using ngrams. Even if it's gibberish, I just need to enforce verb conjugation, number agreement, and do things like remove extra articles.
In fact, I don't even need any kind of suggestions for corrections. I think all I need is for something to tally up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by their score and pick the one with the least grammatical issues.
A simple solution? Scoring fluency by detecting obvious errors
If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.
Let's say we have a tiny bit of text already laid out:
existing_text = "The old river"
Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:
1. Use the Verb Conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb; ['bear', 'bears', 'bearing', 'bore', 'borne'].
2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc). Tally the error count for each construction. In this case the only construction to raise an error, seemingly, would be "The old river bear".
3. Wrapping up should be easy... Of the possibilities with the lowest error count, select randomly.
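The routine above can be sketched as follows. The conjugation list is hard-coded here instead of coming from NodeBox Linguistics, and count_errors is a deliberately tiny, hypothetical stand-in for a real shallow grammar check: it only flags a singular noun followed directly by a base-form verb, using two made-up word lists.

```python
import random

# Conjugations from the example; in the real routine these would come from
# the verb conjugation step.
CONJUGATIONS = ["bear", "bears", "bearing", "bore", "borne"]

# Hypothetical word lists for the toy checker.
SINGULAR_SUBJECTS = {"river", "tree", "poet"}
BASE_FORMS = {"bear"}  # forms that do not agree with a third-person singular subject


def count_errors(sentence):
    """Toy check: count singular nouns immediately followed by a base-form verb."""
    words = sentence.lower().split()
    errors = 0
    for previous, current in zip(words, words[1:]):
        if previous in SINGULAR_SUBJECTS and current in BASE_FORMS:
            errors += 1
    return errors


def pick_continuation(existing_text, candidates, rng=random):
    """Score every candidate and choose randomly among the lowest-error ones."""
    scored = [(count_errors(existing_text + " " + c), c) for c in candidates]
    lowest = min(score for score, _ in scored)
    best = [candidate for score, candidate in scored if score == lowest]
    return rng.choice(best)


existing_text = "The old river"
print(pick_continuation(existing_text, CONJUGATIONS))
```

The point of the structure is that count_errors can be swapped for any checker that returns a number per sentence, so the tallying and selection code stays the same however the errors are detected.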
Very cool project, first of all.
I found a Java grammar checker. I've never used it, but the docs claim it can run as a server. Both Java and listening on a port should be supported basically anywhere.
I'm just getting into NLP with a CS background so I wouldn't mind going into more detail to help you integrate whatever you decide on using. Feel free to ask for more detail.
Another approach would be to use what is called an overgenerate-and-rank approach. In the first step, your poetry generator produces multiple candidate generations. Then you use a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, you extract features from the generated sentences (presumably using some form of syntactic parser) to train a model to rate or classify sentence quality. You could even throw in the heuristics listed above.
Michael Heilman uses this approach for question generation. For more details, read these papers:
Good Question! Statistical Ranking for Question Generation and
Rating Computer-Generated Questions with Mechanical Turk.
The pylinkgrammar link provided above is a bit out of date. It points to version 0.1.9, and the code samples for that version no longer work. If you go down this path, be sure to use the latest version which can be found at:
https://pypi.python.org/pypi/pylinkgrammar
I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) to a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it, I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically and therefore consistently and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool so I am looking in to writing my own.
My question is first of all, if there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
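The n-gram matching idea might start out like this sketch. The small dictionary is a hypothetical stand-in for a real entity index (for example, one built from Freebase or Wikipedia titles), and the matching is exact string lookup on lowercased n-grams, longest first.

```python
import re

# Hypothetical stand-in for an entity index: lowercased surface form -> entity.
ENTITY_INDEX = {
    "new york": "New York City",
    "new york times": "The New York Times",
    "python": "Python (programming language)",
}


def ngrams(tokens, max_n=3):
    """Yield all n-grams of the token list, longest first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def suggest_tags(text, index=ENTITY_INDEX, max_n=3):
    """Return the entities whose surface form appears as an n-gram of the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    suggestions = []
    for gram in ngrams(tokens, max_n):
        entity = index.get(gram)
        if entity and entity not in suggestions:
            suggestions.append(entity)
    return suggestions


print(suggest_tags("I read about Python in the New York Times"))
```

Note that overlapping matches are all kept on purpose: since the goal is to suggest tags rather than to tag automatically, the user gets to decide between the longer and the shorter match.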
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries and algorithms.