spaCy's PhraseMatcher with reflexive pronouns in French - Python

First, you don't have to know French to help me, as I will explain the grammar rules I need to apply with spaCy in Python. I have a file (test.txt) with multiple phrases in French (about 5000), each different from the others, and a mail (textstr) which is different each time (a mail that our clients send us). For each mail I have to check whether one of the phrases in the file appears in it. I thought of using spaCy's PhraseMatcher, but I have one problem: in each mail the sentences are conjugated, so I cannot use the default attribute of the PhraseMatcher (it matches on the verbatim token text and does not take the conjugation of verbs into account). So I first thought of using spaCy's phrase matching on lemmas to resolve my problem, as all conjugated forms of a verb share the same lemma:
import spacy
from spacy.matcher import PhraseMatcher

def treatemail(emailcontent):
    nlp = spacy.load("fr_core_news_sm")
    with open('test.txt', 'r', encoding="utf-8") as f:
        phrases_list = f.readlines()
    phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
    patterns = [nlp(phrase.strip()) for phrase in phrases_list]
    phrase_matcher.add('phrases', patterns)
    mail = nlp(emailcontent)
    matched_phrases = phrase_matcher(mail)
    for match_id, start, end in matched_phrases:
        span = mail[start:end]  # was: sentence[start:end], which is undefined here
        print(span.text)
This works fine for 85% of the phrases in the file, but for the remaining 15% it does not work, as some verbs in French take reflexive pronouns (pronouns that come before the verb): me, te, se, nous, vous, se + verb, and the equivalent m', t' and s' + verb if the verb starts with a vowel. (They essentially always agree with the subject they refer to.)
In the text file the phrases are written in the infinitive, so if there is a reflexive pronoun in the phrase it appears in its infinitive form (either se + verb, or s' + verb starting with a vowel), e.g. "S'amuser" (to have fun), "se promener" (to take a walk). In the mail the verb is conjugated together with its reflexive pronoun: "Je me promène" (I take a walk).
What I essentially have to do is let the PhraseMatcher take the reflexive pronouns into account. So here's my question: how can I do that? Should I make a custom component which checks whether there's a reflexive pronoun in the email and changes the text to its infinitive form, or is there some other way?
Thank you very much!

You can use dependency relations for this.
Pasting some example reflexive verb sentences into the displaCy demo, you can see that the reflexive pronouns for these verbs always have an expl:comp relation. A very simple way to find these verbs is to just iterate over tokens and check for that relation. (I am not 100% sure this is the only way it's used, so you should check that, but it seems likely.)
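For example, a quick check along those lines (a minimal sketch, assuming fr_core_news_sm parses your mails the same way the displaCy demo does):

import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Je me promène tous les jours.")

# A token with the expl:comp relation points at its verb via .head
reflexive_verbs = [tok.head for tok in doc if tok.dep_ == "expl:comp"]
print([verb.lemma_ for verb in reflexive_verbs])  # e.g. ['promener']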
I don't know French so I'm not sure if these verbs have strict ordering, or if words can come between the pronoun and the verb. If the latter (which seems likely), you can't use the normal Matcher or PhraseMatcher because they rely on contiguous sequences of words. But you can use the DependencyMatcher. Something like this:
from spacy.matcher import DependencyMatcher

VERBS = [ ... verbs in your file ... ]

pattern = [
    # anchor token: verb
    {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"LEMMA": {"IN": VERBS}}
    },
    # has a reflexive pronoun
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "reflexive-pronoun",
        "RIGHT_ATTRS": {"DEP": "expl:comp"}
    }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("REFLEXIVE", [pattern])
matches = matcher(doc)
This assumes that you only care about verb lemmas. If you care about specific verb/pronoun combinations, you can just make a bunch of DependencyMatcher rules, one per combination.
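To build the VERBS list from your test.txt, one option is to strip the reflexive marker off each infinitive phrase and keep the bare verb. A sketch (load_reflexive_verbs is a hypothetical helper, assuming one infinitive phrase per line):

def load_reflexive_verbs(path="test.txt"):
    verbs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            phrase = line.strip().lower()
            # "se promener" -> "promener", "s'amuser" -> "amuser"
            if phrase.startswith("se "):
                verbs.append(phrase[3:].split()[0])
            elif phrase.startswith("s'"):
                verbs.append(phrase[2:].split()[0])
    return verbs

VERBS = load_reflexive_verbs()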

Related

Spacy Regex Phrase Matcher in Python

In a large corpus of text, I am interested in extracting every sentence which contains one of a specific list of (Verb, Noun) or (Adjective, Noun) pairs. I have a long list, but here is a sample. In my MWE I am trying to extract sentences with "write/wrote/writing/writes" and "book/s". I have around 30 such pairs of words.
Here is what I have tried but it's not catching most of the sentences:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
doc = nlp('Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old. \
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

pattern = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
matcher.add("testy", [pattern])

for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
Unfortunately, I am only getting one match:
"While writing this book, he had to fend off aliens and dinosaurs."
whereas I expect to get the "He wrote his first book" sentence as well. (The other write-book sentences have "writer" as a noun, so it's good that they're not matching.)
The issue is that in the Matcher, by default each dictionary in the pattern corresponds to exactly one token. So your regex doesn't match any number of characters; it matches any one token, which isn't what you want.
To get what you want, you can use the OP value to specify that you want to match any number of tokens. See the operators/quantifiers section in the docs.
However, given your problem, you probably want to use the DependencyMatcher instead, so I rewrote your code to use that as well. Try this:
import spacy
from spacy.matcher import Matcher, DependencyMatcher

nlp = spacy.load('en_core_web_sm')

doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him.
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"}, {"OP": "*"}, {"LEMMA": "book"}]
matcher.add("testy", [pattern])

print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)

print("----- Using Dependency Matcher -----")
deppattern = [
    {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
    {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book",
     "RIGHT_ATTRS": {"LEMMA": "book"}}
]
dmatcher = DependencyMatcher(nlp.vocab)
dmatcher.add("BOOK", [deppattern])

for _, (start, end) in dmatcher(doc):
    print(doc[start].sent)
One other, less important thing: the way you were calling the matcher was a bit odd. You can pass the matcher Docs or Spans, but they should definitely be natural text; calling .lemma_ on the sentence and creating a fresh doc from that happened to work in your case, but in general it should be avoided.

How can I detect if there is a negation with a verb in a sentence using spaCy (or another library)?

Hello everybody! I'm working on an NLP project, and I want to detect whether there is a negation applying to a given verb in a sentence.
For example, the function "Is_there_negation" should return True with the following parameters:
text: "I don't want to eat right now"
verb: "eat"
How can I complete this function? (I'm a real beginner in NLP.)
import spacy

nlp = spacy.load("en_core_web_sm")

def Is_there_negation(doc, verb):
    for token in doc:
        if token.dep_ == "neg" and token.head.text == verb:
            return True
        elif token.text == verb:
            for tk in token.subtree:
                ...
    return False
Thanks in advance
This is quite tricky. From the syntactic point of view, eat is not negated in the sentence; it is the governing verb want that is negated, as you can see in a dependency parse of the sentence.
What might do the job is:
1. Find a negative particle (with dependency label neg).
2. Look at its dependency parent. If it is the verb you are looking for, return True.
3. If not, list the descendants of that negated verb and check whether the verb you are looking for is in its subtree (i.e., has the correct lemma and is tagged as a VERB). You might however want to limit the search to some subtree depth, to ignore weird cases with overly complicated phrases.
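A minimal sketch of those steps in spaCy (without the subtree-depth limit, which is left as a refinement):

import spacy

nlp = spacy.load("en_core_web_sm")

def is_there_negation(doc, verb):
    for token in doc:
        if token.dep_ == "neg":
            negated = token.head
            if negated.text == verb:
                return True
            # The verb may sit below the negated verb, e.g. "eat" under "want"
            for descendant in negated.subtree:
                if descendant.text == verb and descendant.pos_ == "VERB":
                    return True
    return False

print(is_there_negation(nlp("I don't want to eat right now"), "eat"))  # True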

Determining tense of a sentence Python

Following several other posts [e.g. Detect English verb tenses using NLTK, Identifying verb tenses in python, Python NLTK figure out tense], I wrote the following code to determine the tense of a sentence in Python using POS tagging:
from nltk import word_tokenize, pos_tag

def determine_tense_input(sentence):
    text = word_tokenize(sentence)
    tagged = pos_tag(text)
    tense = {}
    tense["future"] = len([word for word in tagged if word[1] == "MD"])
    tense["present"] = len([word for word in tagged if word[1] in ["VBP", "VBZ", "VBG"]])
    tense["past"] = len([word for word in tagged if word[1] in ["VBD", "VBN"]])
    return tense
This returns a count of past/present/future verb forms, and I typically take the tense with the max count as the tense of the sentence. The accuracy is moderately decent, but I am wondering if there is a better way of doing this.
For example, is there now by chance a package which is more dedicated to extracting the tense of a sentence? [Note: two of the three Stack Overflow posts are four years old, so things may have changed.] Or alternatively, should I be using a different parser from within NLTK to increase accuracy? If not, I hope the above code may help someone else!
You can strengthen your approach in various ways. You could think more about the grammar of English and add some more rules based on whatever you observe; or you could push the statistical approach, extract some more (relevant) features and throw the whole lot at a classifier. The NLTK gives you plenty of classifiers to play with, and they're well documented in the NLTK book.
You can have the best of both worlds: Hand-written rules can be in the form of features that are fed to the classifier, which will decide when it can rely on them.
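For instance, here is a minimal sketch of that hybrid idea using NLTK's NaiveBayesClassifier, where the rule-based tag counts become features (the three training sentences are toy placeholders; you would supply your own labelled data):

import nltk
from nltk import word_tokenize, pos_tag

def tense_features(sentence):
    # Hand-written rules expressed as features for the classifier
    tags = [tag for _, tag in pos_tag(word_tokenize(sentence))]
    return {
        "has_modal": "MD" in tags,
        "has_present": any(t in ("VBP", "VBZ", "VBG") for t in tags),
        "has_past": any(t in ("VBD", "VBN") for t in tags),
    }

train_data = [
    ("I will go home", "future"),
    ("I went home", "past"),
    ("I am going home", "present"),
]
classifier = nltk.NaiveBayesClassifier.train(
    [(tense_features(s), label) for s, label in train_data])
print(classifier.classify(tense_features("She has taken the test")))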
According to http://dev.lexalytics.com/wiki/pmwiki.php?n=Main.POSTags, the tags mean:
MD Modal verb (can, could, may, must)
VB Base verb (take)
VBC Future tense, conditional
VBD Past tense (took)
VBF Future tense
VBG Gerund, present participle (taking)
VBN Past participle (taken)
VBP Present tense (take)
VBZ Present 3rd person singular (takes)
so that your code would become
tense["future"] = len([word for word in tagged if word[1] in ["VBC", "VBF"]])
You could use the Stanford Parser to get a dependency parse of the sentence. The root of the dependency parse will be the 'primary' verb that defines the sentence (I'm not too sure what the specific linguistic term is). You can then use the POS tag on this verb to find its tense, and use that.
This worked for me:

from nltk import RegexpParser, word_tokenize, pos_tag

text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)

grammar = r"""
Future_Perfect_Continuous:  {<MD><VB><VBN><VBG>}
Future_Continuous:          {<MD><VB><VBG>}
Future_Perfect:             {<MD><VB><VBN>}
Past_Perfect_Continuous:    {<VBD><VBN><VBG>}
Present_Perfect_Continuous: {<VBP|VBZ><VBN><VBG>}
Future_Indefinite:          {<MD><VB>}
Past_Continuous:            {<VBD><VBG>}
Past_Perfect:               {<VBD><VBN>}
Present_Continuous:         {<VBZ|VBP><VBG>}
Present_Perfect:            {<VBZ|VBP><VBN>}
Past_Indefinite:            {<VBD>}
Present_Indefinite:         {<VBZ>|<VBP>}
"""

# Chunk the tagged sentence with the tense grammar
cp = RegexpParser(grammar)
print(cp.parse(tagged))
The only thing is that you have to deal with modal verbs, because "could" or "may", for example, are treated like "will" in this case and end up in the future group.
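One crude workaround is to re-tag all modals other than "will"/"shall"/"'ll" before parsing, so they never match the future rules. A sketch building on the tagged list above ("MDX" is an invented placeholder tag, not part of the Penn tagset):

# Keep only genuine future markers under MD; shunt other modals to a dummy tag
tagged = [(word, "MDX") if tag == "MD" and word.lower() not in ("will", "shall", "'ll")
          else (word, tag)
          for word, tag in tagged]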
No, of course not. This is what I got so far (you might want to read the NLTK book's section on grammar parsing, too):
I left only verb tags to simplify the task a little, then used nltk's RegexpParser.
import nltk

def tense_detect(tagged_sentence):
    verb_tags = ['MD', 'MDF',
                 'BE', 'BEG', 'BEN', 'BED', 'BEDZ', 'BEZ', 'BEM', 'BER',
                 'DO', 'DOD', 'DOZ',
                 'HV', 'HVG', 'HVN', 'HVD', 'HVZ',
                 'VB', 'VBG', 'VBN', 'VBD', 'VBZ',
                 'SH',
                 'TO',
                 'JJ'  # maybe?
                 ]
    verb_phrase = []
    for item in tagged_sentence:
        if item[1] in verb_tags:
            verb_phrase.append(item)
    grammar = r'''
        future perfect continuous passive:      {<MDF><HV><BEN><BEG><VBN|VBD>+}
        conditional perfect continuous passive: {<MD><HV><BEN><BEG><VBN|VBD>+}
        future continuous passive:              {<MDF><BE><BEG><VBN|VBD>+}
        conditional continuous passive:         {<MD><BE><BEG><VBN|VBD>+}
        future perfect continuous:              {<MDF><HV><BEN><VBG|HVG|BEG>+}
        conditional perfect continuous:         {<MD><HV><BEN><VBG|HVG|BEG>+}
        past perfect continuous passive:        {<HVD><BEN><BEG><VBN|VBD>+}
        present perfect continuous passive:     {<HV|HVZ><BEN><BEG><VBN|VBD>+}
        future perfect passive:                 {<MDF><HV><BEN><VBN|VBD>+}
        conditional perfect passive:            {<MD><HV><BEN><VBN|VBD>+}
        future continuous:                      {<MDF><BE><VBG|HVG|BEG>+}
        conditional continuous:                 {<MD><BE><VBG|HVG|BEG>+}
        future indefinite passive:              {<MDF><BE><VBN|VBD>+}
        conditional indefinite passive:         {<MD><BE><VBN|VBD>+}
        future perfect:                         {<MDF><HV><HVN|BEN|VBN|VBD>+}
        conditional perfect:                    {<MD><HV><HVN|BEN|VBN|VBD>+}
        past continuous passive:                {<BED|BEDZ><BEG><VBN|VBD>+}
        past perfect continuous:                {<HVD><BEN><HVG|BEG|VBG>+}
        past perfect passive:                   {<HVD><BEN><VBN|VBD>+}
        present continuous passive:             {<BEM|BER|BEZ><BEG><VBN|VBD>+}
        present perfect continuous:             {<HV|HVZ><BEN><VBG|BEG|HVG>+}
        present perfect passive:                {<HV|HVZ><BEN><VBN|VBD>+}
        future indefinite:                      {<MDF><BE|DO|VB|HV>+}
        conditional indefinite:                 {<MD><BE|DO|VB|HV>+}
        past continuous:                        {<BED|BEDZ><VBG|HVG|BEG>+}
        past perfect:                           {<HVD><BEN|VBN|HVD|HVN>+}
        past indefinite passive:                {<BED|BEDZ><VBN|VBD>+}
        present indefinite passive:             {<BEM|BER|BEZ><VBN|VBD>+}
        present continuous:                     {<BEM|BER|BEZ><BEG|VBG|HVG>+}
        present perfect:                        {<HV|HVZ><BEN|HVD|VBN|VBD>+}
        past indefinite:                        {<DOD><VB|HV|DO>|<BEDZ|BED|HVD|VBN|VBD>+}
        infinitive:                             {<TO><BE|HV|VB>+}
        present indefinite:                     {<DO|DOZ><DO|HV|VB>+|<DO|HV|VB|BEZ|DOZ|BER|HVZ|BEM|VBZ>+}
    '''
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(verb_phrase)
    display(result)  # display() assumes a Jupyter notebook; use print(result) elsewhere
    tenses_set = set()
    for node in result:
        if type(node) is nltk.tree.Tree:
            tenses_set.add(node.label())
    return result, tenses_set
This works just OK, even with odd complex sentences. The big problem is causatives, like "I have my car washed every day": removing everything but the verbs leaves "have washed", which gives Present Perfect.
You'll have to tweak it anyway.
I've just fixed my computer and don't have nltk installed yet to show the outcome; I will try to do it tomorrow.

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned so far, a sentence can be broken into a head and its dependents. E.g., in "I shot an elephant", I and elephant are dependents of shot. But how do I discern that the subject of this sentence is I?
You can use Spacy.
Code
import spacy

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3
sent = "I shot an elephant"
doc = nlp(sent)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)  # [I]
As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at tree example: indeed, "I" is the noun phrase that is the child of S that is the sibling of VP, while "elephant" is not.
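That definition can be turned into a few lines of tree traversal. A toy illustration (the bracketed parse is hand-written here; in practice it would come from a parser):

from nltk.tree import Tree

# Hand-written constituency parse of "I shot an elephant"
tree = Tree.fromstring("(S (NP (PRP I)) (VP (VBD shot) (NP (DT an) (NN elephant))))")

def find_subject(sentence_tree):
    # Subject = the NP that is a child of S and a sibling of VP
    if sentence_tree.label() == "S":
        labels = [child.label() for child in sentence_tree]
        if "NP" in labels and "VP" in labels:
            return sentence_tree[labels.index("NP")]
    return None

print(find_subject(tree))  # (NP (PRP I))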
English has two voices: active and passive. Let's take the most used one: the active voice.
It follows the subject-verb-object model. To mark the subject, write a rule set over POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. The first noun is the subject, then comes the verb, and then the object.
If you want to make it more complicated, take the sentence "I shot an elephant with a gun". Here prepositions or subordinating conjunctions like with, at, in can be given roles. The sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN], and you can easily say that the noun introduced by with gets the instrumental role. You can build a rule-based system to get the role of every word in the sentence; see the sketch below.
Also look at the patterns in the passive voice and write rules for those.
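A rough sketch of such a rule-based role labeller over NLTK POS tags (the role names and the simplistic rules here are purely illustrative):

from nltk import word_tokenize, pos_tag

def label_roles(sentence):
    """Very rough active-voice role labelling from POS tags alone."""
    tagged = pos_tag(word_tokenize(sentence))
    roles, seen_verb, seen_prep = {}, False, False
    for word, tag in tagged:
        if tag.startswith("VB"):
            seen_verb = True
        elif tag == "IN":
            seen_prep = True
        elif tag.startswith("NN") or tag == "PRP":
            if not seen_verb:
                roles.setdefault("subject", word)
            elif seen_prep:
                roles.setdefault("instrument", word)
            else:
                roles.setdefault("object", word)
    return roles

print(label_roles("I shot an elephant with a gun"))
# likely: {'subject': 'I', 'object': 'elephant', 'instrument': 'gun'}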
rake_nltk (pip install rake_nltk) is a Python library that wraps NLTK and apparently uses the RAKE algorithm.

from rake_nltk import Rake

rake = Rake()
# extract_keywords_from_text() populates internal state and returns None
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# outputs the keywords ordered by rank
# ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from nltk is used. You can provide your custom stopword list and punctuation chars by passing them in the constructor:
rake = Rake(stopwords='mystopwords.txt', punctuations=''',;:!##$%^*/\''')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword which can be any language supported by nltk.
The Stanford CoreNLP tool can also be used to extract subject-relation-object triples from a sentence.
Code using spaCy. Here doc is the sentence, and dep is 'nsubj' for the subject and 'dobj' for the object:

import spacy

nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
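For example (the exact spans depend on the model's parse, so treat the outputs as plausible rather than guaranteed):

print(get_subject_object_phrase("I shot an elephant with a gun", "nsubj"))  # "I"
print(get_subject_object_phrase("I shot an elephant with a gun", "dobj"))   # "an elephant"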
Credits: https://github.com/explosion/spaCy/issues/380

Detect whether the word you is a subject or object pronoun based on sentence context.

Ideally using regex, in Python. I'm making a simple chatbot, and it's currently having problems responding to phrases like "I love you" correctly (the grammar handler throws back "You love I" when it should give back "You love me").
In addition, if you can think of good phrases to throw at this grammar handler, that'd be great. I'd love some testing data.
If there's a good list of transitive verbs out there (something like a "top 100 used"), it may be acceptable to use that and special-case the "transitive verb + you" pattern.
Well, what you're trying to implement is definitely interesting, but also very difficult to get right.
Logic
As a starter, I would look a bit into the Grammar rules first.
Basic sentence structure :
SUBJECT + TRANSITIVE VERB + OBJECT
SUBJECT + INTRANSITIVE VERB
(Of course, we could also talk about "Subject+Verb+Indirect Object+Direct Object" formats, etc (e.g. I give you the ball) but this would get too complicated for now...)
Obviously, this scheme is VERY simplistic, but let's stick to that for now.
Then assume (another over-simplistic assumption) that each part is a single word.
so basically you have the following Sentence Scheme :
WORD WORD WORD
which could be generally matched using a regex like :
([\w]+)\s+([\w]+)(?:\s+([\w]+))?
Explanation:
([\w]+)         # first word (= subject)
\s+             # one or more spaces
([\w]+)         # second word (= verb)
(?:\s+([\w]+))? # (optional) spaces plus a third word (= object, if the verb is transitive)
Now, obviously, to formulate sentences like "You love me" and not "You love I", your algorithm should also "understand" that:
The third part of the sentence has the role of the object.
Since "I" is a personal pronoun used only in the nominative case (as a subject), we should use its accusative form (as an object); for this purpose, you may also need a personal pronoun table like:
I - my - me
You - your - you
He - his - him
etc...
Just a few ideas... (purely out of my enthusiasm for linguistics :-))
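Putting the regex and the pronoun idea together, a toy sketch of the perspective flip (the two swap maps are illustrative and deliberately incomplete):

import re

# Role-specific person swaps: one map for the subject slot, one for the object slot
SUBJ_SWAP = {"i": "you", "you": "I"}
OBJ_SWAP = {"you": "me", "me": "you"}

def flip_perspective(sentence):
    # The WORD WORD WORD scheme from above, with the object made properly optional
    m = re.match(r"(\w+)\s+(\w+)(?:\s+(\w+))?$", sentence.strip())
    if not m:
        return sentence
    subj, verb, obj = m.groups()
    new_subj = SUBJ_SWAP.get(subj.lower(), subj)
    new_obj = OBJ_SWAP.get(obj.lower(), obj) if obj else ""
    return f"{new_subj} {verb} {new_obj}".strip().capitalize()

print(flip_perspective("I love you"))  # "You love me"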
Data
As for the wordlists you are interested in, just a few samples :
330 Most Common English Verbs (most, if not all of them, are transitive)
Personal Pronouns Chart
What you want is a syntactic analyser (aka parser). This can be done with a rule-based system as described by @Dr.Kameleon, or statistically. There are many implementations out there, one being the Stanford parser. These will generally tell you the syntactic role of a word (e.g. subject in "You are here", or object in "She likes you"). How you use that information to turn statements into questions is a whole different can of worms. For English, you can get a fairly simple rule-based system to work OK.
