How to identify the subject of a sentence? - python

Can Python + NLTK be used to identify the subject of a sentence? What I have learned so far is that a sentence can be broken into a head and its dependents. For example, in "I shot an elephant", "I" and "elephant" are dependents of "shot". But how do I discern that the subject of this sentence is "I"?

You can use spaCy.
Code:
import spacy

nlp = spacy.load('en')  # on newer spaCy versions, load a model name such as 'en_core_web_sm'
sent = "I shot an elephant"
doc = nlp(sent)
# nominal subjects carry the dependency label "nsubj"
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)  # [I]

As the NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at the tree example: indeed, "I" is the noun phrase that is the child of S and the sibling of VP, while "elephant" is not.
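As a minimal sketch of that definition, you can parse the sentence with a toy grammar (my own illustration, not the NLTK book's) and pick out the NP that is a direct child of S next to the VP:
import nltk

# Toy grammar just large enough to parse the example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PRP | DT NN
VP -> VBD NP
PRP -> 'I'
VBD -> 'shot'
DT -> 'an'
NN -> 'elephant'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I shot an elephant".split()):
    # The subject is the NP that is a child of S and a sibling of VP.
    if tree.label() == "S" and any(child.label() == "VP" for child in tree):
        for child in tree:
            if child.label() == "NP":
                print("subject:", " ".join(child.leaves()))  # subject: I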

The English language has two voices: active and passive. Let's take the more common one, the active voice.
It follows the subject-verb-object model. To mark the subject, write a rule set over POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. As you can see, the first noun is the subject, then comes the verb, and then the object.
If you want to make it more complicated, take the sentence "I shot an elephant with a gun". Here prepositions or subordinating conjunctions like "with", "at", "in" can be given roles. The sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN], and you can easily say that the phrase introduced by "with" gets an instrumental role. You can build a rule-based system to get the role of every word in the sentence (a rough sketch follows below).
Also look at the patterns in the passive voice and write rules for those.
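For the simple active-voice case above, a rough sketch with nltk might look like this (the naive_subject function and the "first noun/pronoun before the verb" rule are my own simplification, not a general solution):
import nltk

# Needs: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('universal_tagset')
def naive_subject(sentence):
    # Take the first noun/pronoun that appears before the main verb (active voice only).
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence), tagset='universal')
    for word, tag in tagged:
        if tag in ('NOUN', 'PRON'):
            return word
        if tag == 'VERB':
            break
    return None

print(naive_subject("I shot an elephant"))  # I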

rake_nltk (pip install rake_nltk) is a Python library that wraps nltk and apparently uses the RAKE algorithm.
from rake_nltk import Rake

rake = Rake()
# extract_keywords_from_text() stores its result on the Rake object rather than returning it
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from nltk is used. You can provide your own stopword collection (a list or set of words) and punctuation characters by passing them to the constructor:
rake = Rake(stopwords=['a', 'an', 'the', 'of'], punctuations=',;:!#$%^*/\\')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword, which can be any language supported by nltk.
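For example, something like this (assuming the corresponding nltk stopword list is available for that language):
rake = Rake(language='spanish')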

The Stanford CoreNLP tool can also be used to extract subject-relation-object information from a sentence (its OpenIE annotator produces such triples).
Attaching a screenshot of the same (image not included here).
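A minimal sketch of calling CoreNLP's OpenIE annotator from Python via the stanza client (this assumes a local CoreNLP installation with CORENLP_HOME set; the field names are as I recall them from the stanza documentation):
from stanza.server import CoreNLPClient

with CoreNLPClient(annotators=['openie'], be_quiet=True) as client:
    ann = client.annotate("I shot an elephant")
    for sentence in ann.sentence:
        # each OpenIE triple carries subject, relation and object strings
        for triple in sentence.openieTriple:
            print(triple.subject, triple.relation, triple.object)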

Code using spaCy:
Here doc is the sentence, and dep is 'nsubj' for the subject or 'dobj' for the object.
import spacy

nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            # return the full phrase spanned by the token's subtree
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
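For example (the exact spans depend on the model's parse, but for this simple sentence):
print(get_subject_object_phrase("I shot an elephant", "nsubj"))  # I
print(get_subject_object_phrase("I shot an elephant", "dobj"))   # an elephant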

If you hit errors because you are passing byte strings rather than unicode to nlp(), you can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in the future.
Credits: https://github.com/explosion/spaCy/issues/380

Related

Is Spacy lemmatization not working properly or does it not lemmatize all words ending with "-ing"?

When I run the spacy lemmatizer, it does not lemmatize the word "consulting" and therefore I suspect it is failing.
Here is my code:
import spacy

nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
lemmatizer = nlp.get_pipe('lemmatizer')
doc = nlp('consulting')
print([token.lemma_ for token in doc])
And my output:
['consulting']
The spaCy lemmatizer is not failing; it's performing as expected. Lemmatization depends heavily on the part-of-speech (PoS) tag assigned to the token, and PoS tagger models are trained on sentences/documents, not on single tokens (words). For example, parts-of-speech.info, which is based on the Stanford PoS tagger, does not allow you to enter single words.
In your case, the single word "consulting" is being tagged as a noun, and the spaCy model you are using deems "consulting" to be the appropriate lemma for that case. You'll see that if you change your string to "consulting tomorrow", spaCy will lemmatize "consulting" to "consult", as it is then tagged as a verb (see the output from the code below). In short, I recommend not trying to perform lemmatization on single tokens; instead, use the model on sentences/documents as it was intended.
As a side note: make sure you understand the difference between a lemma and a stem. Read the relevant section of the Wikipedia Lemma (morphology) page if you are unsure.
import spacy
nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
doc = nlp('consulting')
print([[token.pos_, token.lemma_] for token in doc])
# Output: [['NOUN', 'consulting']]
doc_verb = nlp('Consulting tomorrow')
print([[token.pos_, token.lemma_] for token in doc_verb])
# Output: [['VERB', 'consult'], ['NOUN', 'tomorrow']]
If you really need to lemmatize single words, the second approach on this GeeksforGeeks Python lemmatization tutorial produces the lemma "consult". I've created a condensed version of it here for future reference in case the link becomes invalid. I haven't tested it on other single tokens (words) so it may not work for all cases.
# Condensed version of approach #2 given in the GeeksforGeeks lemmatizer tutorial:
# https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    # Map Penn Treebank tags to WordNet POS constants
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

lemmatizer = WordNetLemmatizer()
sentence = 'consulting'
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmatized_sentence = []
for word, tag in pos_tagged:
    wn_tag = pos_tagger(tag)
    # Fall back to the word itself when the tag has no WordNet equivalent
    lemmatized_sentence.append(word if wn_tag is None else lemmatizer.lemmatize(word, wn_tag))
print(lemmatized_sentence)
# Output: ['consult']
spaCy's lemmatizer behaves differently depending on the part of speech. In particular, for nouns, the "-ing" form is considered to be the lemma already, and is not changed.
Here's an example that illustrates the difference:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "While consulting, I sometimes tell people about the consulting business."
for tok in nlp(text):
    print(tok, tok.pos_, tok.lemma_, sep="\t")
Output:
While SCONJ while
consulting VERB consult
, PUNCT ,
I PRON I
sometimes ADV sometimes
tell VERB tell
people NOUN people
about ADP about
the DET the
consulting NOUN consulting
business NOUN business
See how the verb has "consult" as a lemma, while the noun does not.

Autodetect and translate two or more languages in a sentence using python

I have the following example sentence
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see there exist two languages in the sentence (i.e, English and Chinese).
I want to translate it. The result that I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
Code that I follow:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to send only the words that aren't English, then translate them in place. You can do this by looping over all words, checking if they are English, and if not, then translating them. You can then merge all the words together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
# For each word, check if it is English
if not word.isascii():
new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
else:
new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)

POS Tagger for a Virtual Assistant

I'm trying to build a POS tagger for a voice assistant. However, nltk's POS tagger nltk.pos_tag doesn't work well for me. For example:
sent = 'open Youtube'
tokens = nltk.word_tokenize(sent)
nltk.pos_tag(tokens, tagset='universal')
>>[('open', 'ADJ'), ('Youtube', 'NOUN')]
In the above case I'd want the word open to be a verb and not an adjective. Similarly, it tags the word 'close' as an adverb and not a verb.
I have also tried using an n-gram tagger:
from nltk.corpus import brown

# example setup (not shown in the original snippet): train on the Brown corpus with universal tags
brown_tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
default_tagger = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
trigram_tagger = nltk.TrigramTagger(train_sents, backoff=bigram_tagger)
I have used the Brown corpus from nltk, but it still gives the same result.
So I'd like to know:
Is there a better tagged corpus to train a tagger for making a voice/virtual assistant?
Is there a higher-order n-gram tagger than the trigram tagger, i.e. one that looks at 4 or more words together, the way the trigram and bigram taggers look at 3 and 2 words respectively? Will it improve the performance?
How can I fix this?
Concerning question #3:
I think this is not a general solution, but it works at least for the "do this/that" kind of context you mention: if you put a "to" at the beginning, the tagger will tend to "understand" a verb instead of an adjective, noun or adverb. A small illustration with nltk is sketched below.
I took a screenshot using the Freeling demo just to compare the interpretations (image not included here).
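For instance, with nltk's default tagger (Freeling is not required for this; the exact tags depend on the tagger model, so treat the output as indicative):
import nltk

print(nltk.pos_tag(nltk.word_tokenize('open Youtube'), tagset='universal'))
# e.g. [('open', 'ADJ'), ('Youtube', 'NOUN')]
print(nltk.pos_tag(nltk.word_tokenize('to open Youtube'), tagset='universal'))
# e.g. [('to', 'PRT'), ('open', 'VERB'), ('Youtube', 'NOUN')]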
Specifically, if you want to use Freeling, there are Java/Python APIs available, or you can call it from the command line.
Regarding question #2, I think including more context works better for complete sentences or long texts; that may not be the case for commands to a basic virtual assistant.
Good luck!

recognize "it" subject in spaCY

Hi guys, I recently discovered spaCy as an interesting way to recognize grammar in sentences. I tried something easy and it works, but when I try to make it recognize the "it" subject in a short sentence it doesn't work very well. Is there a way to improve the accuracy? The sentence I'm talking about is "do you like it?", where "it" is, in this case, the real subject. When I run the program, spaCy recognizes "you" as the subject instead of "it".
What is a good way to avoid this kind of "error"? Here's the simple code:
import spacy

sentence = input('insert sentence: \n\n')
nlp = spacy.load('en')
sent = sentence
doc = nlp(sent)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)
print()
This is in fact not a Spacy problem but a grammar problem. In the sentence
Do you like it?
The subject - as Spacy is telling you - is the word "you". The word "it" is the object of the verb "like". You may want to skim the Wiki page for subject and the Wiki page for object.
If you are looking for a sentence where "it" is the subject, Spacy can help you with that.
sent = nlp("it is very good")
for token in sent:
print(token, token.dep_)
>> it nsubj
>> is ROOT
>> very advmod
>> good acomp
In this case, Spacy correctly reports that "it" is the nominal subject, and token.dep is equal to 'nsubj'. Conversely, if what you really want is the direct object, then as you can see from this output:
sent = nlp("do you like it")
for token in sent:
print(token, token.dep_)
>> do aux
>> you nsubj
>> like ROOT
>> it dobj
You should be looking for tokens where token.dep_ == 'dobj'. If you want indirect objects as well, you can also check for 'iobj'. You can read more about the roles of these dependencies here.

Using NLTK and WordNet; how do I convert simple tense verb into its present, past or past participle form?

Using NLTK and WordNet, how do I convert a verb from its simple (base) form into its present, past or past participle form?
For example:
I want to write a function which would give me verb in expected form as follows.
v = 'go'
present = present_tense(v)
print present # prints "going"
past = past_tense(v)
print past # prints "went"
This can also be done with the help of NLTK. It can give the base form of the verb, but not the exact tense; still, it can be useful. Try the following code.
from nltk.stem.wordnet import WordNetLemmatizer

words = ['gave', 'went', 'going', 'dating']
for word in words:
    print(word + "-->" + WordNetLemmatizer().lemmatize(word, 'v'))
The output is:
gave-->give
went-->go
going-->go
dating-->date
Have a look at Stack Overflow question NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?.
I think what you're looking for is the NodeBox::Linguistics library. It does exactly that:
print en.verb.present("gave")
>>> give
For Python3:
pip install pattern
then
from pattern.en import conjugate, lemma, lexeme, PRESENT, SG
print (lemma('gave'))
print (lexeme('gave'))
print (conjugate(verb='give',tense=PRESENT,number=SG)) # he / she / it
yields
give
['give', 'gives', 'giving', 'gave', 'given']
gives
Thanks to @Agargara for pointing this out and to the authors of Pattern for their beautiful work. Go support them ;-)
PS. To use most of pattern's functionality in Python 3.7+, you might want to use the trick described here.
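As far as I remember, that trick amounts to making one throwaway call and swallowing the error that pattern raises on its first use under Python 3.7+ (this is an assumption about that workaround, not an official fix):
from pattern.en import lemma

# The first call can fail with a StopIteration/RuntimeError on Python 3.7+ (assumed behaviour);
# swallow it once and subsequent calls should work.
try:
    lemma('gave')
except (RuntimeError, StopIteration):
    pass

print(lemma('gave'))  # give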
JWI (the WordNet library by MIT) also has a stemmer (WordNetStemmer) which converts different morphological forms of a word (like "written", "writes", "wrote") to their base form. It seems to work only for nouns (like plurals) and verbs, though.
Word Stemming in Java with WordNet and JWNL also shows how to do this kind of stemming using JWNL, another Java-based WordNet library.
