spaCy rule-based matching outputs an undesired phrase fragment - python

I was reproducing a spaCy rule-matching example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc = nlp("Good morning, I'm here. I'll say good evening!!")
# good morning/evening with one pattern with the help of IN as follows
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}, {"IS_PUNCT": True}]
matcher.add("greetings", [pattern])
matches = matcher(doc)
for mid, start, end in matches:
    print(start, end, doc[start:end])
which is supposed to match
Good morning, good evening!
But the above code also matches "I" on both occasions:
0 3 Good morning,
3 4 I
7 8 I
10 13 good evening!
I just want to remove the "I" from the matches.
Thank you

When I run your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4 with both the en_core_web_sm and en_core_web_trf pipelines), it produces a NameError because matcher is not defined. After defining matcher as an instance of the Matcher class, in accordance with the spaCy Matcher documentation, I get the following (desired) output with both pipelines:
0 3 Good morning,
10 13 good evening!
The full working code is shown below. I'd suggest restarting your IDE and/or computer if you're still seeing your unexpected results.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, I'm here. I'll say good evening!!")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}, {"IS_PUNCT": True}]
matcher.add("greetings", [pattern]) # good morning/evening with one pattern with the help of IN as follows
matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end])
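As an aside (my addition, not part of the original answer): the pattern only uses lexical attributes (LOWER, IS_PUNCT), so a blank tokenizer-only pipeline reproduces the same matches without downloading a trained model:

```python
import spacy
from spacy.matcher import Matcher

# Same pattern as above, but with a blank English pipeline: lexical
# attributes like LOWER and IS_PUNCT don't require a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"},
           {"LOWER": {"IN": ["morning", "evening"]}},
           {"IS_PUNCT": True}]
matcher.add("greetings", [pattern])

doc = nlp("Good morning, I'm here. I'll say good evening!!")
for match_id, start, end in matcher(doc):
    print(start, end, doc[start:end])
```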

Related

How to match repeating patterns in spacy?

I have a similar question as the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case compared to the linked post is that my pattern is defined by POS and dependency tags. As a consequence I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
The solution / issue isn't fundamentally different from the linked question: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns that capture what you want.
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
matcher.add("AdjList", patterns)
Alternately you could do something with the dependency matcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
As a separate note, you are not handling sentences with a serial comma (e.g. "big, black, and strong", where a comma precedes the conjunction).

Patterns in spaCy

I'm creating a simple program using spaCy to learn how to use it. I have created a pattern to recognize when the user puts "1 day" or "3 weeks", like this:
[{"IS_DIGIT": True}, {"TEXT": "days"}]
[{"IS_DIGIT": True}, {"TEXT": "day"}]
However, I also want it to recognize when the user puts "4 days" instead. How can I achieve this?
You may achieve that with:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
txt = "This will take 1 day. That will take 3 days. It may take up to 3 weeks."
doc = nlp(txt)
matcher = Matcher(nlp.vocab)
pattern = [{"IS_DIGIT":True},{"LEMMA":{"REGEX":"day|week|month|year"}}]
matcher.add("HERE_IS_YOUR_MATCH", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end])
HERE_IS_YOUR_MATCH 1 day
HERE_IS_YOUR_MATCH 3 days
HERE_IS_YOUR_MATCH 3 weeks
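A lexical alternative (my own variation, not from the answer above): list the surface forms explicitly with the IN operator instead of matching on lemmas, which also works with a blank pipeline that has no lemmatizer:

```python
import spacy
from spacy.matcher import Matcher

# IN lists the allowed surface forms, so no lemmatizer is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"IS_DIGIT": True},
           {"LOWER": {"IN": ["day", "days", "week", "weeks",
                             "month", "months", "year", "years"]}}]
matcher.add("DURATION", [pattern])

doc = nlp("This will take 1 day. That will take 3 days. It may take up to 3 weeks.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

The trade-off: the LEMMA/REGEX version generalizes to inflections you didn't anticipate, while IN only matches the forms you list.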

Accessing out of range word in spaCy doc : why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the IndexError I was expecting, because there's nothing before the match. I didn't expect it to work, but it did, and it returned 'dog', which is after the match... and now I'm confused.
It looks like spaCy uses a circular list (or a deque, I think)?
This needs a language model to run; you can install it with the following command if you'd like to reproduce this:
python -m spacy download en_core_web_md
And this is the code
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding the match pattern
matcher.add("DOG", [[{"LOWER": "white"}, {"LOWER": "shepherd"}]])  # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_)  # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on ?
Thanks !
You are looking for the token at index 0 - 1, which evaluates to -1, and in Python a negative index wraps around, so doc[-1] is the last token.
I recommend using the Token.nbor method to look for the token before the span, and if no previous token exists, falling back to None or an empty string.
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding the match pattern
matcher.add("DOG", [[{"LOWER": "white"}, {"LOWER": "shepherd"}]])  # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")
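An equivalent guard without the try/except (my own variant): check the match's start index before looking left. A blank pipeline is enough here since the pattern is purely lexical (pos_ is dropped, as a blank pipeline has no tagger):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("DOG", [[{"LOWER": "white"}, {"LOWER": "shepherd"}]])

doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Guard on the index instead of catching IndexError from nbor(-1).
    prev_tok = doc[start - 1] if start > 0 else None
    print("Matched span:", span.text)
    print("Previous token:", prev_tok.text if prev_tok is not None else None)
```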

python spacy looking for two (or more) words in a window

I am trying to identify concepts in texts. Oftentimes I consider that a concept appears in a text when two or more words appear relatively close to each other.
For instance a concept would be any of the words
forest, tree, nature
at a distance of less than 4 words from
fire, burn, overheat
I am learning spacy and so far I can use the matcher like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}],
])
That would match "hello world" and "hello, world" (or "tree firing" for the above-mentioned example).
I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.
I had a look into:
https://spacy.io/usage/rule-based-matching
and the operators there described, but I am not able to put this word-window approach in "spacy" syntax.
Furthermore, I am not able to generalize that to more words as well.
Some ideas?
Thanks
For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. Wildcard means "any token", and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.
Thus, you can write your matcher as
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [[{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]])
which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for
doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
it will print you
15578876784678163569 HelloWorld 0 4 Hello brave new world
And if you want to match the other order (world ? ? ? hello) as well, you need to add the second, symmetric pattern into your matcher.
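The symmetric version can be sketched like this (my own illustration; the sample sentence is made up, and a blank pipeline suffices because only lexical attributes are used):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Up to 3 arbitrary tokens between the two words, in either order.
gap = [{"OP": "?"}, {"OP": "?"}, {"OP": "?"}]
matcher.add("HelloWorld", [
    [{"LOWER": "hello"}] + gap + [{"LOWER": "world"}],
    [{"LOWER": "world"}] + gap + [{"LOWER": "hello"}],
])

doc = nlp("world, oh brave hello")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```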
I'm relatively new to spaCy but I think the following pattern should work for any number of tokens between 'hello' and 'world' that are comprised of ASCII characters:
[{"LOWER": "hello"}, {'IS_ASCII': True, 'OP': '*'}, {"LOWER": "world"}]
I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").

Negate a word inside a pattern Python & spaCy

I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word "not" does not exist between "will" and "be" in my text. Here is my code:
pattern = [{'LOWER':'purchase'},{'IS_SPACE':True, 'OP':'*'},{'LOWER':'order'},{'IS_SPACE':True, 'OP':'*'},{"IS_ASCII": True, "OP": "*"},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing a statement that actually performs the match, so I added matcher.add(), which also reports a match by calling the self-defined function on_match.
But, more importantly, I had to change your pattern by leaving out the space part {'IS_SPACE':True, 'OP':'*'} to get a match.
Here's my working code that gives me a match:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def on_match(matcher, doc, id, matches):  # Added!
    print("match")
# Changing your pattern for example to:
pattern = [{'LOWER': 'purchase'}, {'LOWER': 'order'}, {'LOWER': 'expenditures'}, {'LOWER': 'not', 'OP': '!'}, {'LEMMA': 'be'}, {'LEMMA': 'freeze'}]
matcher.add("ID_A1", [pattern], on_match=on_match)  # Added!
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern - maybe too much. But I hope I could still help a bit.
Check this:
"TEXT": {"NOT_IN": ["not"]}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764
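A runnable sketch of the NOT_IN approach (my own simplification of the question's sentence and pattern; LOWER is used instead of TEXT so the check is case-insensitive):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Between "will" and "frozen", allow any number of tokens except "not".
pattern = [{"LOWER": "will"},
           {"LOWER": {"NOT_IN": ["not"]}, "OP": "*"},
           {"LOWER": "frozen"}]
matcher.add("NO_NOT", [pattern])

for text in ("expenditures will be frozen.",
             "expenditures will not be frozen."):
    doc = nlp(text)
    print(text, "->", len(matcher(doc)), "match(es)")
```

Unlike OP '!' (which requires exactly one non-matching token), the NOT_IN wildcard with OP '*' tolerates any number of intervening tokens as long as none of them is "not".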
