How to get indices of words in a Spacy dependency parse? - python

I am trying to use Spacy to extract word relations/dependencies, but am a little unsure about how to use the information it gives me. I understand how to generate the visual dependency tree for debugging.
Specifically, I don’t see a way to map the list of children of a token to a specific token. There is no index—just a list of words.
Looking at the example here: https://spacy.io/usage/linguistic-features#dependency-parse
nlp("Autonomous cars shift insurance liability toward manufacturers")
Also, if the sentence were nlp("Autonomous cars shift insurance liability toward manufacturers of cars"), how would I disambiguate between the two instances of cars?
The only thing I can think of is that maybe these tokens are actually reference types that I can map to indices myself. Is that the case?
Basically, I am looking to start with getting the predicates and args to understand “who did what to whom and how/using what”.

Yeah, when you print a token it looks like a string. It’s not. It’s an object with tons of metadata, including token.i which is the index you are looking for.
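For example, here is a minimal sketch (it assumes the small English model en_core_web_sm is installed) that prints each token's index, its dependency label, the index of its head, and the indices of its children; note that the two occurrences of "cars" keep distinct indices:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers of cars")
for token in doc:
    # token.i is the token's position in the Doc, so the two "cars" stay distinct
    child_indices = [child.i for child in token.children]
    print(token.i, token.text, token.dep_, token.head.i, child_indices)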
If you're just getting started with spaCy, the best use of your time is the course; it's quick and practical.

Related

How to get all the syllables of a word in Python?

I am looking to split the word into its syllables. I am trying to build a speech-to-text system but focused on transcribing medical terms.
Consider a doctor/pharmacist who instead of typing out the medicine dosage would just speak into the microphone and a digital prescription would be generated automatically.
I want to avoid ML/DL-based approaches since I want the system to work in real time, so I am tackling this problem via a dictionary-based approach. I have scraped rxlist.com to get all the possible medicine names.
Currently, I am using the webspeech API (https://www.google.com/intl/en/chrome/demos/speech.html). This works well but often messes up the medicine names.
"Panadol twice a day for three days" would become "panel twice a day for three days".
It works sometimes (super unstable). Also, it is important to consider that panadol is a relatively simple term. Consider Vicodin (changed to why couldn't), Abacavir Sulfate, etc.
Here is the approach I thought could perhaps work.
Maintain a dictionary of all medicines.
Once the detections are in (I append all the detections instead of using only the last output), compare the string distance against each medicine name (the list could be huge, so sorting matters here) and replace the word with the closest match (a rough sketch of this step appears after this list).
If nothing matches (maintain a threshold of error in step 2), check the syllables of prediction and that of medicine name and replace the one with the lowest error.
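Purely as an illustration of step 2 (the medicine list and the cutoff below are made up, and difflib is just one convenient way to measure string closeness):
import difflib
# Hypothetical, tiny stand-in for the dictionary scraped from rxlist.com
medicines = ["panadol", "vicodin", "abacavir sulfate"]
def best_match(word, names, cutoff=0.6):
    # Return the closest medicine name, or None if nothing clears the cutoff
    matches = difflib.get_close_matches(word.lower(), names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
print(best_match("panel", medicines))    # close enough to "panadol"
print(best_match("aspirin", medicines))  # nothing above the cutoff -> None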
So now that I have the list, I was hoping to find a library/dictionary API that could give me the syllables of medicine names. Typing "How to pronounce vicodin" into Google brings up the Learn to Pronounce panel, which shows: vai·kuh·dn. I want something similar; I could scrape it off Google, but I don't get results for all the medicine names.
Any help would be appreciated.
Thanks.
You can use a library called pyphen. It's pretty easy to use. To install it run the following command in your terminal:
pip install pyphen
After this, find out the syllables in a string:
import pyphen
# Load pyphen's English hyphenation dictionary
a = pyphen.Pyphen(lang='en')
# Print the word with hyphens inserted at its hyphenation points
print(a.inserted('vicodin'))
I hope you find this useful.

Identify domain related important keywords from a given text

I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot that will be used in the Banking domain, the Q would be like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.
So, this is where the work comes in.
Normally people start with a stop word list. There are several; choose wisely. More than likely you'll experiment and/or use a base list and then add more words to it.
Depending on the list, it will take out
"what, is, the, for, ?"
Since this is a pretty easy example, they'll all do that. But notice that what is being done is just the opposite of what you asked for: you wanted domain-specific words, but what is happening is the removal of everything the library considers cruft.
From here it will depend on what you use. NLTK and spaCy are common choices. Regardless of what you pick, get a real understanding of the concepts or it can bite you (like pretty much anything in data science).
Expect to start thinking in terms of linguistic patterns. So, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues, such as the ':' or the fact that TRADE is in all caps. But then again, there might not be.
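As a rough sketch of that first stop-word pass (using spaCy and assuming en_core_web_sm is installed; the exact tokenization of TRADE:12345 may vary):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("What is the maturity date for TRADE:12345 ?")
# Drop stop words and punctuation; what survives is a first rough cut at the keywords
keywords = [(tok.text, tok.pos_) for tok in doc if not tok.is_stop and not tok.is_punct]
print(keywords)  # roughly: maturity, date, and the TRADE:12345 pieces, with their POS tags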
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you want to break a question like this into more than one question (assuming that you've done the research and determined the question hasn't already been asked -- repeatedly). NLTK and NLP are fairly new ground, but SQL queries are usually just a Google search away.

Generate Random Sentence From Grammar or Ngrams?

I am writing a program that should spit out a random sentence of a complexity of my choosing. As a concrete example, I would like to aid my language learning by spitting out valid sentences of a grammar structure and using words that I have already learned. I would like to use python and nltk to do this, although I am open to other ideas.
It seems like there are a couple of approaches:
Define a grammar file that uses the grammar and lexicon I know about, generate all valid sentences from it, and then select one at random.
Load in corpora to train ngrams, which can then be used to construct a sentence.
Am I thinking about this correctly? Is one approach preferred over the other? Any tips are appreciated. Thanks!
If I'm understanding this right, and the purpose is to test yourself on the vocabulary you have already learned, then another approach could be taken:
Instead of going through the difficult labor of NLG (Natural Language Generation), you could create a search program that goes online, reads news feeds or even simply Wikipedia, and finds sentences with only the words you have defined.
In any case, for what you want, you will have to create lists of words that you have learned. You could then create search algorithms for sentences that contain only / nearly only these words.
That would have the major advantage of testing yourself on real sentences, as opposed to artificially-constructed ones (which are likely to sound not quite right in a number of cases).
An app like this would actually be a great help for learning a foreign language. If you did it nicely I'm sure a lot of people would benefit from using it.
If your purpose is really to make a language learning aid, you need to generate grammatical (i.e., correct) sentences. If so, do not use ngrams. They stick together words at random, and you just get intriguingly natural-looking nonsense.
You could use a grammar in principle, but it will have to be a very good and probably very large grammar.
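For a sense of what the grammar route involves, NLTK can enumerate the sentences of a context-free grammar out of the box; this is only a sketch with a made-up toy grammar, and a useful one would be far larger:
import random
from nltk import CFG
from nltk.parse.generate import generate
# Tiny grammar restricted to structures and words the learner already knows
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'ball' | 'teacher'
    V -> 'sees' | 'throws'
""")
# Enumerate every sentence the grammar licenses, then pick one at random
sentences = [' '.join(words) for words in generate(grammar, n=100)]
print(random.choice(sentences))  # e.g. "the teacher throws a ball"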
Another option you haven't considered is to use a template method. Get yourself a bunch of sentences, identify some word classes you are interested in, and generate variants by fitting, e.g., different nouns as the subject or object. This method is much more likely to give you usable results in a finite amount of time. There's any number of well-known bots that work on this principle, and it's also pretty much what language-teaching books do.
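A hedged sketch of that template idea (the templates and word lists below are invented purely for illustration):
import random
# Hypothetical word lists drawn from vocabulary already learned
nouns = ["dog", "teacher", "ball"]
verbs = ["sees", "throws", "likes"]
# Hand-written sentence templates with slots for each word class
templates = [
    "The {noun} {verb} the {noun2}.",
    "A {noun} never {verb} a {noun2}.",
]
template = random.choice(templates)
print(template.format(noun=random.choice(nouns),
                      verb=random.choice(verbs),
                      noun2=random.choice(nouns)))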

Natural Language Processing - Similar to ngram

I'm currently working on an NLP project that is trying to differentiate between synonyms (received from Python's NLTK with WordNet) in a context. I've looked into a good deal of NLP concepts trying to find exactly what I want, and the closest thing I've found is n-grams, but it's not quite a perfect fit.
Suppose I am trying to find the proper definition of the verb "box". "Box" could mean "fight" or "package"; however, somewhere else in the text, the word "ring" or "fighter" appears. As I understand it, an n-gram would be "box fighter" or "box ring", which is rather ridiculous as a phrase, and not likely to appear. But on a concept map, the "box" action might be linked with a "ring", since they are conceptually related.
Is n-gram what I want? Is there another name for this? Any help on where to look for retrieving such relational data?
All help is appreciated.
You might want to look into word sense disambiguation (WSD): the problem of determining which "sense" (meaning) of a word is activated by its use in a particular context, a process which appears to be largely unconscious in people.
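NLTK ships a simple WSD baseline, the Lesk algorithm, which picks the WordNet sense whose gloss overlaps most with the surrounding context. A minimal sketch (it assumes the wordnet and punkt NLTK data have been downloaded; Lesk is only a baseline, so don't expect high accuracy):
from nltk import word_tokenize
from nltk.wsd import lesk
# Context with "fighter" and "ring" cues for the verb "box"
context = word_tokenize("The fighter stepped into the ring to box against the champion")
sense = lesk(context, "box", pos="v")
print(sense, "-", sense.definition() if sense else "no sense found")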

How do I approach this named-entity classification task?

I am asking a related question here, but this question is more general. I have taken a large corpus and annotated some words with their named-entities. In my case, they are domain-specific and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named-entities. For example, the following is one sentence:
When the robot had a technical glitch, the object was thrown but was later caught by another robot.
is tagged as:
When the (robot)/Entity had a (technical glitch)/Incident, the
(object)/Entity was (thrown)/Action but was later (caught)/Action by
(another robot)/Entity.
Given examples like this, is there any way I can train a classifier to recognize new named-entities? For instance, a sentence like this:
The nanobot had a bug and so it crashed into the wall.
should be tagged somewhat like this:
The (nanobot)/Entity had a (bug)/Incident and so it (crashed)/Action into the (wall)/Entity.
Of course, I am aware that 100% accuracy is not possible but I would be interested in knowing any formal approaches to do this. Any suggestions?
This is not named-entity recognition at all, since none of the labeled parts are names, so the feature sets for NER systems won't help you (English NER systems tend to rely on capitalization quite strongly and will prefer nouns). This is a kind of information extraction/semantic interpretation. I suspect this is going to be quite hard in a machine learning setting because your annotation is really inconsistent:
When the (robot)/Entity had a (technical glitch)/Incident, the (object)/Entity was (thrown)/Action but was later (caught)/Action by another robot.
Why is "another robot" not annotated?
If you want to solve this kind of problem, you'd better start out with some regular expressions, maybe matched against POS-tagged versions of the string.
I can think of two approaches.
The first is pattern matching over the words in a sentence. Something like this (pseudocode, though it is similar to the NLTK chunk parser syntax):
<some_word>+ (<NN|NNS>) <have|has|had> (<NN|NNS>)
<NN|NNS> (<VB>|was <VB>) (<and|but> (<VB>|was <VB>))* <into|onto|by> (<NN|NNS>)
These two patterns can (roughly) catch the two parts of your first sentence. This is a good choice if you don't have many different kinds of sentences. I believe it is possible to get up to 90% accuracy with well-chosen patterns. The drawback is that this model is hard to extend/modify.
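If you want running code rather than pseudocode, spaCy's Matcher can express roughly the same word-plus-POS patterns; a sketch only (it assumes en_core_web_sm is installed, and the pattern corresponds loosely to the first rule above):
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# A noun, a form of "have", an optional determiner, then a noun
matcher.add("HAD_INCIDENT", [[
    {"POS": "NOUN"},
    {"LEMMA": "have"},
    {"POS": "DET", "OP": "?"},
    {"POS": "NOUN"},
]])
doc = nlp("The nanobot had a bug and so it crashed into the wall.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "nanobot had a bug"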
Another approach is to mine dependencies between words in a sentence, for example with the Stanford Dependency Parser. Among other things, it allows you to mine the object, subject and predicate, which seems very similar to what you want: in your first sentence "robot" is the subject, "had" is the predicate and "glitch" is the object.
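The same subject/predicate/object information is also exposed by spaCy's dependency labels, if setting up the Stanford parser is too heavy; a small sketch (assuming en_core_web_sm):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("When the robot had a technical glitch, the object was thrown.")
# Print (dependency label, word, head) for subjects and objects
for token in doc:
    if token.dep_ in ("nsubj", "nsubjpass", "dobj"):
        print(token.dep_, token.text, "->", token.head.text)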
You could try object role modeling at http://www.ormfoundation.com/ which looks at the semantics (facts) between one or more entities or names and their relationships with other objects. There are also tools to convert the ORM models into XML and other languages and vice versa. See http://orm.sourceforge.net/
