recognize "it" subject in spaCY - python

Hi guys, I recently discovered spaCy as an interesting way to recognize grammar in sentences. I tried it with something easy and it works, but when I try to get it to recognize the "it" subject in a short sentence it doesn't work very well. Is there a way to improve the accuracy? The sentence I'm talking about is "do you like it?", where "it" is in this case the real subject. When I run the program, spaCy recognizes "you" as the subject instead of "it".
What is a good way to avoid this kind of error? Here's the simple code:
import spacy

nlp = spacy.load('en')
sentence = input('insert sentence: \n\n')
doc = nlp(sentence)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)
print()

This is in fact not a spaCy problem but a grammar problem. In the sentence
Do you like it?
the subject, as spaCy is telling you, is the word "you". The word "it" is the object of the verb "like". You may want to skim the Wikipedia pages for subject and object.
If you are looking for a sentence where "it" is the subject, Spacy can help you with that.
sent = nlp("it is very good")
for token in sent:
print(token, token.dep_)
>> it nsubj
>> is ROOT
>> very advmod
>> good acomp
In this case, spaCy correctly reports that "it" is the nominal subject, and token.dep_ is equal to 'nsubj'. Conversely, if what you really want is the direct object, then, as you can see from this output:
sent = nlp("do you like it")
for token in sent:
print(token, token.dep_)
>> do aux
>> you nsubj
>> like ROOT
>> it dobj
you should be looking for tokens where token.dep_ == 'dobj'. If you want indirect objects as well, you can also check for 'iobj'. You can read more about the roles of these dependencies here.
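For example, a minimal sketch (same setup as above) that collects subjects and objects in one pass:
import spacy

nlp = spacy.load('en')
doc = nlp("do you like it")
# nominal subjects vs. direct/indirect objects
subjects = [tok for tok in doc if tok.dep_ == "nsubj"]
objects = [tok for tok in doc if tok.dep_ in ("dobj", "iobj")]
print(subjects)  # [you]
print(objects)   # [it]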

Related

How to evaluate English words as something or someone

I'm writing Python code to recognize idioms/phrases in any given English sentence.
Here is an example of a phrase:
phrase = 'land someone in something'
Here are two sentence examples:
sentence1 = 'His criminal activity finally landed him in jail.'
sentence2 = 'You really landed yourself in a fine mess!'
Is there a way to use nltk (or any other Python NLP package) to return 'someone' if the input was 'him' or 'yourself', and return 'something' if the input was 'jail' or 'fine mess'?
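No answer is shown here, but one rough sketch of an approach (using spaCy rather than nltk; the classify_slot helper is hypothetical and results will vary by model) is to map person pronouns and PERSON entities to 'someone' and everything else to 'something':
import spacy

nlp = spacy.load('en_core_web_sm')

def classify_slot(phrase):
    # hypothetical helper: 'someone' for person-like phrases, 'something' otherwise
    doc = nlp(phrase)
    for tok in doc:
        if tok.pos_ == 'PRON' or tok.ent_type_ == 'PERSON':
            return 'someone'
    return 'something'

print(classify_slot('him'))        # someone
print(classify_slot('yourself'))   # someone
print(classify_slot('jail'))       # something
print(classify_slot('fine mess'))  # something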

Autodetect and translate two or more languages in a sentence using python

I have the following example sentence:
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see, there are two languages in the sentence (i.e., English and Chinese).
I want to translate it. The result that I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
Here is the code I use:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to send only the words that aren't English, then translate them in place. You can do this by looping over all words, checking if they are English, and if not, then translating them. You can then merge all the words together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
# For each word, check if it is English
if not word.isascii():
new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
else:
new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)
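One caveat of this word-by-word approach: splitting on spaces breaks up multi-word foreign phrases and makes one API call per word. A variation (a sketch under the same assumptions; the regex and helper name are my additions) translates each contiguous run of non-ASCII characters instead:
import re

def translate_non_ascii_runs(text, translator):
    # replace each contiguous run of non-ASCII characters with its translation
    return re.sub(r"[^\x00-\x7F]+",
                  lambda m: translator.translate_text(m.group(0), target_lang="EN-US").text,
                  text)

print(translate_non_ascii_runs(text_to_translate1, translator))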

spaCy's PhraseMatcher with reflexive pronouns in French

First, you don't have to know French to help me, as I will explain the grammar rules that I need to apply with spaCy in Python. I have a file (test.txt) with multiple phrases in French (about 5000), each one different from another, and a mail (textstr) which is different each time (a mail that our client sends us). For each mail I have to see if one of the phrases in the file is in the mail. I thought of using spaCy's PhraseMatcher, but I have one problem: in each mail the sentences are conjugated, so I cannot use the default behavior of the PhraseMatcher (as it matches the verbatim token text and does not take the conjugation of verbs into account). So I first thought of using spaCy's phrase matching with lemmas to solve my problem, as all conjugated forms of a verb share the same lemma:
import spacy
from spacy.matcher import PhraseMatcher

def treatemail(emailcontent):
    nlp = spacy.load("fr_core_news_sm")
    with open('test.txt', 'r', encoding="utf-8") as f:
        phrases_list = f.readlines()
    phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
    patterns = [nlp(phrase.strip()) for phrase in phrases_list]
    phrase_matcher.add('phrases', None, *patterns)
    mail = nlp(emailcontent)
    matched_phrases = phrase_matcher(mail)
    for match_id, start, end in matched_phrases:
        span = mail[start:end]
        print(span.text)
This works fine for 85% of the phrases in the file, but for the remaining 15% it does not, as some verbs in French have reflexive pronouns (pronouns that come before a verb): me, te, se, nous, vous, se + verb, and the equivalent m', t' and s' + verb if the verb starts with a vowel. (They essentially always agree with the subject they refer to.)
In the text file the phrases are written in the infinitive form, so if there is a reflexive pronoun in the phrase, it is written in its infinitive form (either se + verb, or s' + verb starting with a vowel), e.g. "s'amuser" (to have fun), "se promener" (to take a walk). In the mail the verb is conjugated together with its reflexive pronoun ("Je me promène" (I take a walk)).
What I have to do is essentially make the PhraseMatcher take the reflexive pronouns into account. So here's my question: how can I do that? Should I make a custom component which checks if there's a reflexive pronoun in the email and changes the text to its infinitive form, or is there some other way?
Thank you very much!
You can use dependency relations for this.
Pasting some example reflexive verb sentences into the displaCy demo, you can see that the reflexive pronouns for these verbs always have an expl:comp relation. A very simple way to find these verbs is to just iterate over tokens and check for that relation. (I am not 100% sure this is the only way it's used, so you should check that, but it seems likely.)
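For example, a minimal sketch of that token loop (assuming the fr_core_news_sm model; the sample sentence means "I take a walk every day"), which should print something like me -> promener:
import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Je me promène tous les jours.")
for token in doc:
    if token.dep_ == "expl:comp":
        # token.head is the verb the reflexive pronoun attaches to
        print(token.text, "->", token.head.lemma_)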
I don't know French, so I'm not sure whether these verbs have strict ordering, or whether words can come between the pronoun and the verb. If the latter (which seems likely), you can't use the normal Matcher or PhraseMatcher, because they rely on contiguous sequences of words. But you can use the DependencyMatcher. Something like this:
from spacy.matcher import DependencyMatcher

VERBS = [ ... verbs in your file ... ]
pattern = [
    # anchor token: verb
    {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"LEMMA": {"IN": VERBS}}
    },
    # has a reflexive pronoun
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "reflexive-pronoun",
        "RIGHT_ATTRS": {"DEP": "expl:comp"}
    }
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("REFLEXIVE", [pattern])
matches = matcher(doc)
This assumes that you only care about verb lemmas. If you care about the specific verb/pronoun combination, you can just make a bunch of DependencyMatcher rules, one per combination.
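Each match is a (match_id, token_ids) pair, with the token indices in the same order as the pattern entries, so (assuming doc is the parsed mail, as above) you can recover the verb and its pronoun like this:
for match_id, token_ids in matches:
    verb = doc[token_ids[0]]     # matched the "verb" entry
    pronoun = doc[token_ids[1]]  # matched the "reflexive-pronoun" entry
    print(verb.lemma_, pronoun.text)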

Using a regular expression to find all noun phrases in a paragraph following the occurrence of a specific phrase

I have a data frame of paragraphs, which I have split (or can split) into word tokens and sentence tokens, and I am looking to find all the noun phrases following any instance where the phrase "contribute to" or "donate to" occurs.
Or really some form of that, so:
"Contributions are welcome to be made to the charity of your choice."
---> would return: "the charity of your choice"
and
"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
---> would return: "ABC Foundation"
I've created a regular expression work-around that captures the correct phrase about 90% of the time... see below:
import nltk
from nltk.text import TokenSearcher

# `x` is the paragraph string; this snippet runs inside a function, hence the return
text = nltk.Text(nltk.word_tokenize(x))
donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
donation = [' '.join(tokens) for tokens in donation]
return donation
I'd like to clean up that regular expression to get rid of the "{,15}" requirement, because it's missing some of the values that I need. However, I'm not too polished with "greedy" expressions and can't get it to work correctly.
So this phrase:
While she lived a full life , had many achievements and made many
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName
is returning: "visit brother FirstName Lastname" due to the previous mentioning of contributions even though the word "to" comes well after 15 words later.
Looks like you're struggling with how to restrict your search to a single sentence. So just use NLTK to break your text into sentences (which it does far better than just looking at periods), and your problem disappears.
import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
For further work, I also recommend that you use a better tool than Text, which is intended for simple interactive explorations. If you want to do more with your texts, NLTK's PlaintextCorpusReader is your friend.
(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)
Example
If it works and does what you need, let me know and I will explain my regexp.
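For reference, a quick way to try that pattern on the question's second example (re.IGNORECASE is my addition, so it also matches capitalized "Contributions"):
import re

pattern = r"(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)"
text = "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
m = re.search(pattern, text, re.IGNORECASE)
if m:
    print(m.group(1))  # ABC Foundation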

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned till now, a sentence can be broken into a head and its dependents. For example, in "I shot an elephant", "I" and "elephant" are dependents of "shot". But how do I discern that the subject in this sentence is "I"?
You can use spaCy.
Code
import spacy

nlp = spacy.load('en')
sent = "I shot an elephant"
doc = nlp(sent)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)
As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at the tree example: indeed, "I" is the noun phrase that is the child of S and the sibling of VP, while "elephant" is not.
The English language has two voices: active voice and passive voice. Let's take the most used voice: active voice.
It follows the subject-verb-object model. To mark the subject, write a rule set with POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. The first noun is the subject, then there is a verb, and then there is an object.
If you want to make it more complicated, take the sentence "I shot an elephant with a gun". Here prepositions or subordinating conjunctions like "with", "at", "in" can be given roles. The sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN]. You can easily say that the word "with" gets an instrumental role. You can build a rule-based system to get the role of every word in the sentence.
Also look at the patterns in passive voice and write rules for them.
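As a rough sketch of that rule-set idea (assuming NLTK with the punkt and averaged_perceptron_tagger data downloaded), you can POS-tag the sentence and take the first noun or pronoun as the active-voice subject:
import nltk

tokens = nltk.word_tokenize("I shot an elephant with a gun")
tagged = nltk.pos_tag(tokens)
print(tagged)
# the first noun/pronoun is the subject under the subject-verb-object rule
subject = next(word for word, tag in tagged if tag.startswith(('NN', 'PRP')))
print(subject)  # I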
rake_nltk (pip install rake_nltk) is a Python library that wraps NLTK and apparently uses the RAKE algorithm.
from rake_nltk import Rake

rake = Rake()
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from nltk is used. You can provide your custom stopword list and punctuation chars by passing them in the constructor:
rake = Rake(stopwords='mystopwords.txt', punctuations=',;:!##$%^*/\\')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword which can be any language supported by nltk.
The Stanford CoreNLP tool can also be used to extract subject-relation-object information from a sentence.
Code using spaCy:
Here doc is the sentence, and dep is 'nsubj' for the subject and 'dobj' for the object.
import spacy

nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
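Hypothetical usage (the function returns the whole subtree, so objects come back with their determiners):
print(get_subject_object_phrase("I shot an elephant", "nsubj"))  # I
print(get_subject_object_phrase("I shot an elephant", "dobj"))   # an elephant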
You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in the future.
Credits: https://github.com/explosion/spaCy/issues/380
