I'm creating a simple program using spaCy to learn how to use it. I have created a pattern to recognize when the user types "1 day" or "3 weeks", like this:
[{"IS_DIGIT": True},{"TEXT":"days"}],
[{"IS_DIGIT": True},{"TEXT":"day"}])
However, I also want it to recognize when the user types "4 days" as well. How can I achieve this?
You may achieve that with:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
txt = "This will take 1 day. That will take 3 days. It may take up to 3 weeks."
doc = nlp(txt)
matcher = Matcher(nlp.vocab)
pattern = [{"IS_DIGIT": True},
           {"LEMMA": {"REGEX": "^(day|week|month|year)$"}}]  # anchored, so e.g. the lemma "monday" is not matched
matcher.add("HERE_IS_YOUR_MATCH", [pattern])  # spaCy v3 API: patterns are passed as a list
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end])
HERE_IS_YOUR_MATCH 1 day
HERE_IS_YOUR_MATCH 3 days
HERE_IS_YOUR_MATCH 3 weeks
I was reproducing a Spacy rule-matching example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc = nlp("Good morning, I'm here. I'll say good evening!!")
pattern = [{"LOWER": "good"},{"LOWER": {"IN": ["morning", "evening"]}},{"IS_PUNCT": True}]
matcher.add("greetings", [pattern]) # good morning/evening with one pattern with the help of IN as follows
matches = matcher(doc)
for mid, start, end in matches:
    print(start, end, doc[start:end])
which is supposed to match
Good morning good evening!
But the above code also matches "I" on both occasions:
0 3 Good morning,
3 4 I
7 8 I
10 13 good evening!
I just want to remove the "I" tokens from the matches.
Thank you
When I run your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4 with both the en_core_web_sm and en_core_web_trf pipelines), it produces a NameError because matcher is not defined. After defining matcher as an instance of the Matcher class, in accordance with the spaCy Matcher documentation, I get the following (desired) output with both pipelines:
0 3 Good morning,
10 13 good evening!
The full working code is shown below. I'd suggest restarting your IDE and/or computer if you're still seeing your unexpected results.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, I'm here. I'll say good evening!!")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}, {"IS_PUNCT": True}]
matcher.add("greetings", [pattern]) # good morning/evening with one pattern with the help of IN as follows
matches = matcher(doc)
for match_id, start, end in matches:
    print(start, end, doc[start:end])
I have a string which is essentially a page's worth of text.
A sample would be: "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes".
What I want to extract is the first occurrence of a '2-digit number + minutes' AFTER the substring "Reading Time". My string is MUCH larger and has numbers scattered around everywhere, so I want to use regex to do this, but I'm not sure how to proceed from here.
Example:
Input: "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes"
Output: "15 minutes"
You can do it with plain string searching:
i = s.find("Reading Time")
j = i + s[i:].find("minutes")
print(s[j - 3 : j + 7])  # "minutes" is 7 characters; the 3 preceding characters capture "15 "
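Since the question explicitly asks for regex, here is a minimal sketch using a single re.search call. It assumes "Reading Time" appears verbatim (case differences are handled by the IGNORECASE flag); the \d{2} quantifier encodes the two-digit requirement, and the non-greedy .*? takes the first such number after the substring:

```python
import re

s = "Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes"

# Non-greedy .*? finds the FIRST "2 digits + minutes" after "Reading Time"
m = re.search(r"Reading Time.*?(\d{2}\s*minutes)", s, flags=re.IGNORECASE)
if m:
    print(m.group(1))  # -> 15 minutes
```

Note that the "12" before "Reading Time" is never considered, because the search for the captured group only starts after "Reading Time" has matched.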
This is a bit of a departure from regex, but why not leverage a more powerful natural language processing Python library to achieve this?
Here's an example with spaCy's Matcher (https://spacy.io/usage/rule-based-matching), which should be more flexible and easier to use than regex, if you accept the additional dependency:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "reading"},  # we require 'reading time' to be in the pattern
           {"LOWER": "time"},
           {"OP": "*"},           # there may be some stuff in between (optional)
           {"LIKE_NUM": True},    # then we look for a number and 'minutes'
           {"LOWER": "minutes"}]
matcher.add("duration", [pattern])
# some tests; only the first two should produce any output
tests = ["Ultimately, biscuits earwax 12 as well as Reading Time: up to 15 minutes",
         "I wonder if this will take a reading time of more than 15 or 17 minutes in the end",
         "Will it take us more than 50 minutes?",
         "I don't have anything like 'reading time'",
         "spaCy rocks!"]
# print results for each example
for test in tests:
    doc = nlp(test)
    matches = matcher(doc)
    for match_id, start, end in matches:
        print(doc[end-2:end])  # just get the final two tokens
By tweaking the pattern you should be all set to match sentences according to your needs.
After using an Amazon review scraper to build this data frame, I called nlp in order to tokenize and create a new column containing the processed reviews as 'docs'.
However, now I am trying to create a pattern in order to analyze the reviews in the doc column, but I keep getting no matches, which makes me think I'm missing one more pre-processing step, or perhaps not pointing the matcher in the right direction.
While the following code executes without any errors, I receive an empty matches list, even though I know the word exists in the doc column. The docs for spaCy are still a tad slim, and I'm not too sure the matcher.add is correct, as the one specified in the tutorial
matcher.add("Name_of_List", None, pattern)
returns an error saying that only 2 arguments are required for this class.
source -- https://course.spacy.io/en/chapter1
Question: What do I need to change to accurately analyze the df doc column for the pattern created?
Thanks!
Full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_md')
df = pd.read_csv('paper_towel_US.csv')
#calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Sum the number of tokens in each Doc
df['num_tokens'] = [len(token) for token in df.doc]
#calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
df['doc'].apply(find_matches)
df sample for reproduction via df.iloc[596:600, :].to_clipboard(sep=',')
,product,title,rating,body,doc,num_tokens
596,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Awesome!,5,Great towels!,Great towels!,3
597,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Good buy!,5,Love these,Love these,2
598,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Meh,3,"Does not clean countertop messes well. Towels leave a large residue. They are durable, though","Does not clean countertop messes well. Towels leave a large residue. They are durable, though",18
599,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Exactly as Described. Packaged Well and Mailed Promptly,4,Exactly as Described. Packaged Well and Mailed Promptly,Exactly as Described. Packaged Well and Mailed Promptly,9
You are trying to get the matches from the literal string "df.doc" with doc = nlp("df.doc"). You need to extract matches from the df['doc'] column instead.
One example solution is to remove doc = nlp("df.doc"), load nlp = spacy.load('en_core_web_sm'), and apply a matching function to the column:
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
>>> df['doc'].apply(find_matches)
0 None
1 (0, 2, Love these)
2 None
3 None
Name: doc, dtype: object
Full code snippet:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
df = pd.read_csv(r'C:\Users\admin\Desktop\s.txt')
#calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Sum the number of tokens in each Doc
df['num_tokens'] = [len(token) for token in df.doc]
#calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])
#doc = nlp("df.doc")
#matches = matcher(doc)
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
print(df['doc'].apply(find_matches))
I have a following algorithm:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
CAT = [nlp.make_doc(text) for text in ['pension', 'underwriter', 'health', 'client']]
phrase_matcher.add("CATEGORY 1",None, *CAT)
text = 'The client works as a marine assistant underwriter. He has recently opted to stop paying into his pension. '
doc = nlp(text)
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the string ID, e.g. 'CATEGORY 1'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)
# Output
CATEGORY 1 client
CATEGORY 1 underwriter
CATEGORY 1 pension
Can I ask it to return results only when all of the words can be found in the sentence? I expect to see nothing here, as 'health' is not part of the sentence.
Can I do this type of matching with PhraseMatcher, or do I need to change to another type of rule-based matching? Thank you
I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word 'not' does not exist between 'will' and 'be' in my text. Here is my code:
pattern = [{'LOWER': 'purchase'},
           {'IS_SPACE': True, 'OP': '*'},
           {'LOWER': 'order'},
           {'IS_SPACE': True, 'OP': '*'},
           {'IS_ASCII': True, 'OP': '*'},
           {'LOWER': 'not', 'OP': '!'},
           {'LEMMA': 'be'},
           {'LEMMA': 'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing a statement that actually performs the match, so I added a call to matcher.add() that also reports a match by calling the self-defined function on_match.
But more importantly, I had to change your pattern by leaving out the space part {'IS_SPACE':True, 'OP':'*'} to get a match.
Here's my working code that gives me a match:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def on_match(matcher, doc, id, matches):  # Added!
    print("match")
# Changing your pattern for example to:
pattern = [{'LOWER':'purchase'},{'LOWER':'order'},{'LOWER':'expenditures'},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
matcher.add("ID_A1", [pattern], on_match=on_match)  # Added! (spaCy v3 API)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern - maybe too much. But I hope I could still help a bit.
Check this:
"TEXT": {"NOT_IN": ["not"]}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764