Spacy Regex Phrase Matcher in Python

In a large corpus of text, I am interested in extracting every sentence which has a specific list of (Verb-Noun) or (Adjective-Noun) somewhere in the sentence. I have a long list but here is a sample. In my MWE I am trying to extract sentences with "write/wrote/writing/writes" and "book/s". I have around 30 such pairs of words.
Here is what I have tried but it's not catching most of the sentences:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
matcher.add("testy", None, pattern)
for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
Unfortunately, I am only getting one match:
"While writing this book, he had to fend off aliens and dinosaurs."
Whereas I expect to get the "He wrote his first book" sentence as well. The other write-book sentences have "writer" as a noun, so it's good that those are not matching.

The issue is that in the Matcher, by default each dictionary in the pattern corresponds to exactly one token. So your regex doesn't match any number of characters, it matches any one token, which isn't what you want.
To get what you want, you can use the OP key to specify that a pattern entry may match any number of tokens. See the operators and quantifiers section in the docs.
However, given your problem, you probably want to actually use the Dependency Matcher instead, so I rewrote your code to use that as well. Try this:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him.
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"OP": "*"},{"LEMMA": "book"}]
matcher.add("testy", [pattern])
print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)
print("----- Using Dependency Matcher -----")
deppattern = [
    {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
    {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book",
     "RIGHT_ATTRS": {"LEMMA": "book"}}
]
from spacy.matcher import DependencyMatcher
dmatcher = DependencyMatcher(nlp.vocab)
dmatcher.add("BOOK", [deppattern])
for _, (start, end) in dmatcher(doc):
    print(doc[start].sent)
One other, less important thing - the way you were calling the matcher was kind of weird. You can pass the matcher Docs or Spans, but they should definitely be natural text, so calling .lemma_ on the sentence and creating a fresh doc from that worked in your case, but in general should be avoided.

Related

Spacy's PhraseMatcher with reflexive pronouns in French

First, you don't have to know French to help me, as I will explain the grammar rules I need to apply with spaCy in Python. I have a file (test.txt) with about 5000 phrases in French, each one different from the others, and a mail (textstr) which is different each time (a mail that our client sends us). For each mail I have to check whether one of the phrases in the file appears in the mail. I thought of using spaCy's PhraseMatcher, but I have one problem: in each mail the sentences are conjugated, so I cannot use the default behaviour of the PhraseMatcher (it matches on the verbatim token text and does not take the conjugation of verbs into account). So I first thought of using spaCy's phrase matching with lemmas to solve my problem, since all conjugated forms of a verb share the same lemma:
import spacy
from spacy.matcher import PhraseMatcher

def treatemail(emailcontent):
    nlp = spacy.load("fr_core_news_sm")
    with open('test.txt', 'r', encoding="utf-8") as f:
        phrases_list = f.readlines()
    phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
    patterns = [nlp(phrase.strip()) for phrase in phrases_list]
    phrase_matcher.add('phrases', None, *patterns)
    mail = nlp(emailcontent)
    matched_phrases = phrase_matcher(mail)
    for match_id, start, end in matched_phrases:
        span = mail[start:end]
        print(span.text)
Which is fine for 85% of the phrases from the file, but it does not work for the remaining 15%, because some verbs in French have reflexive pronouns (pronouns that come before the verb): me, te, se, nous, vous, se + verb, and the equivalents m', t' and s' + verb if the verb starts with a vowel. (They essentially always agree with the subject they refer to.)
In the text file the phrases are written in the infinitive form, so if there is a reflexive pronoun in the phrase, it is written in its infinitive form (either se + verb, or s' + verb starting with a vowel, e.g. "s'amuser" (to have fun), "se promener" (to take a walk)). In the mail the verb is conjugated together with its reflexive pronoun ("Je me promène" (I take a walk)).
What I have to do is essentially make the PhraseMatcher take the reflexive pronouns into account. So here's my question: how can I do that? Should I make a custom component which checks if there's a reflexive pronoun in the email and changes the text to its infinitive form, or is there some other way?
Thank you very much!
You can use dependency relations for this.
Pasting some example reflexive verb sentences into the displaCy demo, you can see that the reflexive pronouns for these verbs always have an expl:comp relation. A very simple way to find these verbs is to just iterate over tokens and check for that relation. (I am not 100% sure this is the only way it's used, so you should check that, but it seems likely.)
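In code, that simple check could look something like this (a minimal sketch; doc is assumed to be a mail already parsed with the French pipeline):
# flag verbs that govern a reflexive pronoun via the expl:comp relation
for token in doc:
    if token.dep_ == "expl:comp" and token.head.pos_ == "VERB":
        print(token.head.lemma_, "<-", token.text)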
I don't know French so I'm not sure if these verbs have strict ordering, or if words can come between the pronoun and the verb. If the latter (which seems likely), you can't use the normal Matcher or PhraseMatcher because they rely on contiguous sequences of words. But you can use the DependencyMatcher. Something like this:
from spacy.matcher import DependencyMatcher
VERBS = [ ... verbs in your file ... ]
pattern = [
    # anchor token: verb
    {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"LEMMA": {"IN": VERBS}}
    },
    # has a reflexive pronoun
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "reflexive-pronoun",
        "RIGHT_ATTRS": {"DEP": "expl:comp"}
    }
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("REFLEXIVE", [pattern])
matches = matcher(doc)
This assumes that you only care about verb lemmas. If you care about the verb/pronoun combination you can just make a bunch of depmatcher rules or something.
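If you do go the one-rule-per-verb route, a rough sketch (hypothetical, reusing the VERBS list and matcher from above) could be:
# hypothetical: one DependencyMatcher rule per verb lemma, so the match name
# tells you which verb fired
for verb in VERBS:
    pattern = [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": verb}},
        {"LEFT_ID": "verb", "REL_OP": ">",
         "RIGHT_ID": "reflexive-pronoun", "RIGHT_ATTRS": {"DEP": "expl:comp"}},
    ]
    matcher.add(f"REFLEXIVE_{verb}", [pattern])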

How to get all noun phrases in Spacy (Python)

I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:
doc2 = nlp("what is the capital of Bangladesh?")
for chunk in doc2.noun_chunks:
    print(chunk)
Output:
1. what
2. the capital
3. bangladesh
Expected:
the capital of Bangladesh
I have tried answers from the spaCy docs and StackOverflow. Nothing worked. It seems only cTAKES and Stanford CoreNLP can give such complex NPs.
Any help is appreciated.
Spacy clearly defines a noun chunk as:
A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses." (https://spacy.io/api/doc#noun_chunks)
If you process the dependency parse differently, allowing prepositional modifiers and nested phrases/chunks, then you can end up with what you're looking for.
I bet you could modify the existing spacy code fairly easily to do what you want:
https://github.com/explosion/spaCy/blob/06c6dc6fbcb8fbb78a61a2e42c1b782974bd43bd/spacy/lang/en/syntax_iterators.py
For those who are still looking for this answer
noun_phrases = set()
for nc in doc.noun_chunks:
    for np in [nc, doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]]:
        noun_phrases.add(np)
This is how I get all the complex noun phrases.

How to get better lemmas from Spacy

While "PM" can mean "pm(time)" it can also mean "Prime Minister".
I want to capture the latter. I want lemma of "PM" to return "Prime Minister". How can I do this using spacy?
Example returning unexpected lemma:
>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
... print(word.text, word.lemma_)
...
PM pm
means mean
prime prime
minister minister
As per the docs at https://spacy.io/api/annotation, spaCy uses WordNet for lemmas:
A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet.
When I tried inputting "pm" in Wordnet, it shows "Prime Minister" as one of the lemmas.
What am I missing here?
I think it would help answer your question by clarifying some common NLP tasks.
Lemmatization is the process of finding the canonical word given different inflections of the word. For example, run, runs, ran and running are forms of the same lexeme: run. If you were to lemmatize run, runs, and ran the output would be run. In your example sentence, note how it lemmatizes means to mean.
Given that, it doesn't sound like the task you want to perform is lemmatization. It may help to solidify this idea with a silly counterexample: what are the different inflections of a hypothetical lemma "pm": pming, pmed, pms? None of those are actual words.
It sounds like your task may be closer to Named Entity Recognition (NER), which you could also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute, as follows:
>>> for ent in doc.ents:
... print(ent, ent.label_)
With the sentence you've given, spacy (v. 2.0.5) doesn't detect any entities. If you replace "PM" with "P.M." it will detect that as an entity, but as a GPE.
The best thing to do depends on your task, but if you want your desired classification of the "PM" entity, I'd look at setting entity annotations. If you want to pull out every mention of "PM" from a big corpus of documents, use the matcher in a pipeline.
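For the "pull out every mention" route, a minimal sketch with the Matcher (pattern and label names here are just illustrative, and this uses the list-of-patterns add signature from recent spaCy versions):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
# one single-token pattern for the literal text "PM"; add more variants as extra patterns
matcher.add("PRIME_MINISTER", [[{"TEXT": "PM"}]])

doc = nlp(u'PM means prime minister')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, start, end)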
When I look up the lemmas of prime_minister through NLTK's WordNet interface (which uses the same data), I get:
>>> [str(lemma.name()) for lemma in wn.synset('prime_minister.n.01').lemmas()]
['Prime_Minister', 'PM', 'premier']
It keeps acronyms the same, so maybe you want to check word.lemma(), which would give you a different ID according to the context?
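For reference, the same lookup from the acronym side looks something like this (a small sketch; it just prints every synset "PM" belongs to together with its lemma names):
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# inspect every synset the string "PM" can belong to, with its lemma names
for synset in wn.synsets('PM'):
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])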

Cannot define rule priority in grako grammar for handling special tokens

I am trying to analyze some documents with a grammar generated via Grako that should parse simple sentences for further analysis, but I face some difficulties with some special tokens.
The (Grako-style) EBNF looks like:
abbr::str = "etc." | "feat.";
word::str = /[^.]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
I used the upper grammar on following content:
This is a sentence. This is a sentence feat. an abbrevation. I don't
know feat. etc. feat. know English.
The result using a simple NodeWalker:
[
'This is a sentence.',
'This is a sentence feat.',
'an abbrevation.',
"I don't know feat.",
'etc. feat. know English.'
]
My expectation:
[
'This is a sentence.',
'This is a sentence feat. an abbrevation.',
"I don't know feat. etc. feat. know English."
]
I have no clue why this happens, especially in the last sentence, where the abbreviations are kept as part of the sentence while they are not in the prior sentences. To be clear, I want the abbr rule in the sentence definition to have a higher priority than the word rule, but I don't know how to achieve this. I played around with negative and positive lookahead without success. I know how to achieve my expected results with regular expressions, but a context-free grammar is required for the further analysis, so I want to put everything in one grammar for the sake of readability. It has been a while since I last used grammars this way, but I don't remember running into this kind of problem. I searched a while via Google without success, so maybe the community can share some insight.
Thanks in advance.
Code I used for testing, if required:
from grako.model import NodeWalker, ModelBuilderSemantics
from parser import MyParser

class MyWalker(NodeWalker):
    def walk_Page(self, node):
        content = [self.walk(c) for c in node.content]
        print(content)

    def walk_Sentence(self, node):
        return ' '.join(node.content) + "."

    def walk_str(self, node):
        return node

def main(filename: str):
    parser = MyParser(semantics=ModelBuilderSemantics())
    with open(filename, 'r', encoding='utf-8') as src:
        result = parser.parse(src.read(), 'page')
    walker = MyWalker()
    walker.walk(result)
Packages used:
Python 3.5.2
Grako 3.16.5
The problem is with the regular expression you're using for the word rule. Regular expressions will parse over whatever you tell them to, and that regexp is eating over whitespace.
This modified grammar does what you want:
@@grammar :: Pages
abbr::str = "etc." | "feat.";
word::str = /[^.\s]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
start = page ;
A --trace run revealed the problem right away.

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned so far, a sentence can be broken into a head and its dependents. For example, in "I shot an elephant", I and elephant are dependents of shot. But how do I discern that the subject in this sentence is I?
You can use Spacy.
Code
import spacy
nlp = spacy.load('en')
sent = "I shot an elephant"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "nsubj") ]
print(sub_toks)
As NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at tree example: indeed, "I" is the noun phrase that is the child of S that is the sibling of VP, while "elephant" is not.
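To make that concrete, here is a small sketch with a hand-built NLTK tree for the example sentence (the bracketed parse is written out by hand purely to illustrate the rule):
import nltk

# hand-built constituency parse of "I shot an elephant"
tree = nltk.Tree.fromstring(
    "(S (NP (PRP I)) (VP (VBD shot) (NP (DT an) (NN elephant))))")
for subtree in tree:
    if subtree.label() == "NP":  # the NP child of S, sibling of the VP
        print(subtree.leaves())  # -> ['I']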
The English language has two voices: active voice and passive voice. Let's take the most used voice: active voice.
It follows the subject-verb-object model. To mark the subject, write a rule set using POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. The first noun is the subject, then comes the verb, and then the object.
If you want to make it more complicated, take the sentence "I shot an elephant with a gun." Here prepositions or subordinating conjunctions like with, at, in can be given roles. The sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN]. You can easily say that the word with gets the instrumentative role. You can build a rule-based system to get the role of every word in the sentence; a rough sketch follows below.
Also look at the patterns in passive voice and write rules for those.
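A naive version of that rule set with NLTK (just a sketch: it POS-tags the sentence and returns the first noun or pronoun seen before the first verb, which will obviously misfire on anything more complicated):
import nltk  # requires the punkt and averaged_perceptron_tagger data

def naive_subject(sentence):
    # tag the sentence and return the first noun/pronoun that appears before the first verb
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for word, tag in tagged:
        if tag.startswith('VB'):
            return None
        if tag.startswith('NN') or tag == 'PRP':
            return word
    return None

print(naive_subject("I shot an elephant"))  # -> 'I'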
rake_nltk (pip install rake_nltk) is a Python library that wraps NLTK and apparently uses the RAKE algorithm.
from rake_nltk import Rake
rake = Rake()
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# outputs the keywords ordered by rank
>>> ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from NLTK is used. You can provide your own stopword list and punctuation characters by passing them to the constructor:
my_stopwords = open('mystopwords.txt').read().split()
rake = Rake(stopwords=my_stopwords, punctuations=',;:!#$%^*/\\')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword which can be any language supported by nltk.
Stanford CoreNLP can also be used to extract subject-relation-object information from a sentence.
Code using spaCy:
Here doc is the sentence, and dep is 'nsubj' for the subject and 'dobj' for the object.
import spacy
nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in future.
Credits: https://github.com/explosion/spaCy/issues/380
