Identify domain related important keywords from a given text - python

I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot that will be used in the Banking domain, the Q would be like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.

So, this is where the work comes in.
Normally people start with a stop-word list. There are several; choose wisely. More than likely you'll experiment and/or start from a base list and then add more words to it.
Depending on the list, it will take out
"what, is, the, for, ?"
Since this is a pretty easy example, they'll all do that. But you'll notice that what is being done is just the opposite of what you asked for: you wanted the domain-specific words, but what actually happens is the removal of all the other cruft (as far as the library is concerned).
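For instance, a minimal sketch with NLTK's English stop-word list (assuming the stopwords corpus has been downloaded with nltk.download('stopwords')):
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
question = "What is the maturity date for TRADE:12345 ?"

# Keep only tokens that are neither stop words nor bare punctuation.
keywords = [t for t in question.split()
            if t.lower() not in stop_words and t not in {"?", ".", ","}]
print(keywords)  # ['maturity', 'date', 'TRADE:12345']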
From here it will depend on what you use. NLTK or spaCy are common choices. Regardless of what you pick, get a real understanding of the concepts or it can bite you (like pretty much anything in Data Science).
Expect to start thinking in terms of linguistic patterns so, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues, such as the ':' or the fact that TRADE is in all caps. But they might not always be there.
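As a rough illustration of reading those clues programmatically, here's a sketch using spaCy's part-of-speech tags (assuming the en_core_web_sm model is installed; NLTK's pos_tag would do the same job):
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
doc = nlp("What is the maturity date for TRADE:12345 ?")

for token in doc:
    # Surface clues: the part-of-speech tag, whether the token is all caps,
    # and whether it contains a ':'.
    print(token.text, token.pos_, token.is_upper, ":" in token.text)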
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you want to break a question like this into more than one question (assuming that you've done the research and determined the question hasn't already been asked -- repeatedly). NLTK and NLP questions are still reasonably fresh territory, but SQL queries are usually just a Google search away.

Related

How to get indices of words in a Spacy dependency parse?

I am trying to use Spacy to extract word relations/dependencies, but am a little unsure about how to use the information it gives me. I understand how to generate the visual dependency tree for debugging.
Specifically, I don’t see a way to map the list of children of a token to a specific token. There is no index—just a list of words.
Looking at the example here: https://spacy.io/usage/linguistic-features#dependency-parse
nlp("Autonomous cars shift insurance liability toward manufacturers")
Also, if the sentence were nlp("Autonomous cars shift insurance liability toward manufacturers of cars"), how would I disambiguate between the two instances of cars?
The only thing I can think of is that maybe these tokens are actually reference types that I can map to indices myself. Is that the case?
Basically, I am looking to start with getting the predicates and args to understand “who did what to whom and how/using what”.
Yeah, when you print a token it looks like a string. It's not. It's an object with tons of metadata, including token.i, which is the index you are looking for.
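For example, a quick sketch (assuming the small English model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers of cars")

for token in doc:
    # token.i is the token's index in the document, so the two "cars" tokens
    # are distinguishable even though their text is identical.
    child_indices = [child.i for child in token.children]
    print(token.i, token.text, token.dep_, token.head.i, child_indices)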
If you're just getting started with spaCy, the best use of your time is the course; it's quick and practical.

Identifying similar strings in a database in Python

I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand shorter terms in a csv file. What I want to do is identify terms in the database that are similar to the blacklisted terms in my csv file. Similarity in this case can be construed as mis-spellings of the blacklisted terms.
I am familiar with libraries in python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach it, and hopefully you'll be able to get some ideas from my ramblings. :-)
First.
Somewhat apart from your actual question, fuzzy string matching on its own might not be enough. People are going to use similar-looking characters and non-letter symbols to get around any text matching, to the point where there's nearly zero literal match between a blacklisted word and the actual text, and yet it's still readable for what it is. So you will probably need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X', etc. I believe there are libraries and/or conversion references for that purpose. Non-Latin symbols are also a distinct possibility and should be accounted for.
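As an illustration, a normalization pass could be as simple as a substitution table (the mappings below are made up; a real one would be much larger and based on an actual confusables list):
# Made-up substitution table; extend with whatever look-alikes you encounter.
SUBSTITUTIONS = {"0": "o", "1": "i", "3": "e", "@": "a", "$": "s"}

def normalize(text):
    text = text.lower()
    for src, dst in SUBSTITUTIONS.items():
        text = text.replace(src, dst)
    return text

print(normalize("B1G G0@T"))  # -> "big goat"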
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check whether what you found also matches some legal words. Which means you will need not only the blacklisted dictionary but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people writing acceptable text are probably going to write it in acceptable language without any of the tricks outlined above. Or you could normalize it anyway if you're feeling paranoid. :-)
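A rough sketch of that two-pass idea with fuzzywuzzy (the 85 threshold is arbitrary and would need tuning on real data):
from fuzzywuzzy import fuzz

blacklist = ["big butt"]   # terms to flag
whitelist = ["big but"]    # legitimate look-alikes
THRESHOLD = 85             # arbitrary cut-off

def check(term):
    black_score = max(fuzz.ratio(term, b) for b in blacklist)
    white_score = max(fuzz.ratio(term, w) for w in whitelist)
    if black_score < THRESHOLD:
        return "ok"
    if white_score >= black_score:
        return "needs review"  # matches a legal term at least as well
    return "flagged"

print(check("big  butt"), check("big but"), check("zebra videos"))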
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify which words are more (or less) frequently used in combination with them, and use those to check your specific example. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together, you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision to a human (screening comments/posts for a time, automatically putting them into a moderation queue, or whatever else your project dictates).
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)

Generate Random Sentence From Grammar or Ngrams?

I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I already know, generate all valid sentences from it, and then select one at random.
Load in corpora to train ngrams, which then can be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm getting it right and if the purpose is to test yourself on the vocabulary you already have learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
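For instance, a very simple filter over candidate sentences might look like this (known_words stands in for whatever vocabulary list you've built):
import re

known_words = {"the", "cat", "sat", "on", "a", "mat"}  # your learned vocabulary

def is_usable(sentence, max_unknown=0):
    words = re.findall(r"[a-zA-Z']+", sentence.lower())
    unknown = [w for w in words if w not in known_words]
    return len(unknown) <= max_unknown

print(is_usable("The cat sat on a mat."))    # True
print(is_usable("The cat chased a zebra."))  # False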
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
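A minimal sketch of the grammar route with NLTK's generate helper (the toy grammar is obviously far too small to be useful; it just shows the mechanics):
import random
from nltk import CFG
from nltk.parse.generate import generate

# A toy grammar restricted to words you already know.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'ball'
V -> 'sees' | 'chases'
""")

sentences = [" ".join(words) for words in generate(grammar)]
print(random.choice(sentences))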
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There are any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
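A sketch of the template idea (the word lists and templates are placeholders for whatever vocabulary you're practising):
import random

nouns = ["dog", "teacher", "car"]
verbs = ["see", "like", "follow"]  # base forms

templates = [
    "The {noun1} {verb}s the {noun2}.",        # crude 3rd-person -s; fine for these verbs
    "My {noun1} will never {verb} a {noun2}.",
]

def make_sentence():
    return random.choice(templates).format(
        noun1=random.choice(nouns),
        verb=random.choice(verbs),
        noun2=random.choice(nouns),
    )

print(make_sentence())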

How do I recognize what string from a list is closest to my input?

I am working on a school project and have a function that recognizes a comment, finds the information in the comment, and writes it down to a file. Now, how could I check an input string against a list of strings of information? Like if I have an input
input = "How many fingers do I have?"
How do I check which of these is closest to it?
fingers = "You have 10."
pigs = "yummy"
I want it to respond with fingers. I want to match it with the variable name and not the variable's value.
I suggest you read this chapter.
This is a chapter from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper.
Detecting patterns is a central part of Natural Language Processing.
Words ending in -ed tend to be past tense verbs (5). Frequent use of will is indicative of news text (3). These observable patterns — word structure and word frequency — happen to correlate with particular aspects of meaning, such as tense and topic. But how did we know where to start looking, which aspects of form to associate with which aspects of meaning?
The goal of this chapter is to answer the following questions:
How can we identify particular features of language data that are salient for classifying it?
How can we construct models of language that can be used to perform language processing tasks automatically?
What can we learn about language from these models?
It's all described in Python, and it's very efficient.
http://www.nltk.org/book/ch06.html
Also, processing the text by matching a keyword against a variable name is neither robust nor efficient; I wouldn't recommend it.
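In the spirit of that chapter, a toy classifier might look like this (the training examples are made up; real use needs far more data):
import nltk

# Made-up training data: each question is labelled with the answer key it maps to.
train = [
    ("How many fingers do I have?", "fingers"),
    ("How many fingers are on a hand?", "fingers"),
    ("Are pigs tasty?", "pigs"),
    ("What do pigs taste like?", "pigs"),
]

def features(sentence):
    # Simple bag-of-words features, as in the NLTK book's classification chapter.
    return {word: True for word in sentence.lower().split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)
print(classifier.classify(features("How many fingers do people have?")))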

How do I approach this named-entity classification task?

I am asking a related question here but this question is more general. I have taken a large corpus and annotated some words with their named-entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named-entities. For example, the following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named-entities? For instance, given a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of 2 approaches.
The first is pattern matching over the words in a sentence. Something like this (pseudocode, though it is similar to NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These 2 patterns can (roughly) catch the 2 parts of your first sentence. This is a good choice if you don't have very many kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend/modify.
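One way to make the first pattern concrete is NLTK's RegexpParser over a POS-tagged sentence (the tag pattern below is simplified and would need tuning; it also needs the punkt and averaged_perceptron_tagger NLTK data packages):
import nltk

sentence = "When the robot had a technical glitch, the object was thrown."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Rough chunk: a noun phrase, a finite verb, then another noun phrase,
# roughly like the first pseudo-pattern above.
chunker = nltk.RegexpParser(r"HADREL: {<DT>?<JJ>*<NN|NNS>+<VBD|VBZ|VBP><DT>?<JJ>*<NN|NNS>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "HADREL"):
    print(subtree)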
Another approach is to mine dependencies between words in the sentence, for example with the Stanford Dependency Parser. Among other things, it lets you extract the object, subject and predicate, which seems very similar to what you want: in your first sentence "robot" is the subject, "had" is the predicate and "glitch" is the object.
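spaCy's dependency parser gives you similar triples if you don't want to set up the Stanford parser; a rough sketch (assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When the robot had a technical glitch, the object was thrown.")

for token in doc:
    if token.pos_ in ("VERB", "AUX"):
        # Collect each verb's nominal subject and direct object, if any.
        subj = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        obj = [c.text for c in token.children if c.dep_ in ("dobj", "obj")]
        print(token.text, subj, obj)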
You could try object role modeling at http://www.ormfoundation.com/ which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert the ORM models into XML and other languages and vice versa. See http://orm.sourceforge.net/
