spaCy matcher unable to identify the pattern besides the first - python

I can't work out where my pattern goes wrong.
The sentence I want to find: "#1 – January 31, 2015", and any date that follows this format.
The pattern: pattern1=[{'ORTH':'#'},{'is_digital':True},{'is_space':True},{'ORTH':'-'},{'is_space':True},{'is_alpha':True},{'is_space':True},{'is_digital':True},{'is_punct':True},{'is_space':True},{'is_digital':True}]
The print code: print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
The result: ['#', '#', '#']
Expected result: ['#1 – January 31, 2015','#5 – March 15, 2017','#177 – November 22, 2019']

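spaCy's Matcher operates over tokens, and single spaces in the sentence do not yield tokens, so the {'is_space':True} entries have nothing to match. There is also no is_digital attribute; the digit flag is IS_DIGIT.
Also, several different characters resemble hyphens (dashes, minus signs, etc.), so one has to be careful: the sentence contains an en dash (–), while the pattern looks for a hyphen (-).
The following code works: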
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
# One entry per token: '#', number, en dash, month, day, comma, year
pattern1=[{'ORTH':'#'},{'IS_DIGIT':True},{'ORTH':'–'},{'IS_ALPHA':True},{'IS_DIGIT':True},{'IS_PUNCT':True},{'IS_DIGIT':True}]
doc = nlp("#1 – January 31, 2015")
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 use matcher.add("p1", [pattern1])
matcher.add("p1", None, pattern1)
matches1 = matcher(doc)
print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
# Matches1: ['#1 – January 31, 2015']
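The point about look-alike dash characters can be checked with the standard library alone. The sketch below is only an illustration and not part of the original answer: it uses unicodedata to name the character found in the sentence and the one used in the original pattern. The variable names are made up for the example.

```python
import unicodedata

sentence_dash = "\u2013"  # the dash in the sentence "#1 – January 31, 2015"
pattern_dash = "-"        # the character in the original pattern's {'ORTH':'-'}

for ch in (sentence_dash, pattern_dash):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+2013 EN DASH
# U+002D HYPHEN-MINUS

# ORTH compares the exact token text, so these two never match each other.
print(sentence_dash == pattern_dash)  # False
```

Printing the code points this way is a quick check whenever a pattern that looks right on screen still fails to match.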


Extract data if between substrings else full string

I have string pattern like these:
Beginning through June 18, 2022 at Noon standard time
Jan 20, 2022
Beginning through April 26, 2022 at 12:01 a.m. standard time
I want to extract the data part present after the word "through" and before the word "at" using Python regex.
June 18, 2022
Jan 20, 2022
April 26, 2022
I can extract for the long text using re group.
s ="Beginning through June 18, 2022 at Noon standard time"
re.search(r'(.*through)(.*) (at.*)', s).group(2)
However it will not work for
s ="June 18, 2022"
Can anyone help me with that?
You may use this regex with a capture group:
(?:.* through |^)(.+?)(?: at |$)
RegEx Details:
(?:.* through |^): Match anything followed by " through " or start position
(.+?): Match 1+ of any character and capture it in group #1
(?: at |$): Match " at " or end of string
Code:
import re
arr = ['Beginning through June 18, 2022 at Noon standard time',
'Jan 20, 2022',
'Beginning through April 26, 2022 at 12:01 a.m. standard time']
for i in arr:
    print(re.findall(r'(?:.* through |^)(.+?)(?: at |$)', i))
Output:
['June 18, 2022']
['Jan 20, 2022']
['April 26, 2022']
How about playing with optional groups and backtracking?
^(?:.*?through )?(.*?)(?: at.*)?$
See this demo at regex101 or a Python demo at tio.run
Note that if just one of the substrings is present, it will match either from the first substring to the end of the string, or from the start of the string to the latter substring. If neither is present, it will match the full string.
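The behaviour described in the note can be tried out with a short sketch. The pattern is the one given above; the extract helper and the last two sample strings are assumptions added for illustration, covering the cases where only one delimiter is present.

```python
import re

# Pattern from the answer above: both delimiters are optional.
PATTERN = re.compile(r'^(?:.*?through )?(.*?)(?: at.*)?$')

def extract(text):
    """Return the part after 'through ' and before ' at', when present."""
    match = PATTERN.match(text)
    return match.group(1) if match else None

samples = [
    'Beginning through June 18, 2022 at Noon standard time',  # both present
    'Jan 20, 2022',                                           # neither present
    'Beginning through June 18, 2022',                        # only "through"
    'June 18, 2022 at Noon standard time',                    # only "at"
]
for s in samples:
    print(extract(s))
# June 18, 2022
# Jan 20, 2022
# June 18, 2022
# June 18, 2022
```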
Another idea could be to use PyPI regex which supports branch reset groups.
^(?|.*?through (.+?) at|(.+))
This one extracts the part in between if both are present, else the full string. As far as I know, the regex module is largely compatible with Python's re functions; just use import regex as re instead.
Demo at regex101 or Python demo at tio.run

Matcher is returning some duplicate entries

I want the output to be ["good customer service","great ambience"], but I am getting ["good customer","good customer service","great ambience"] because the pattern also matches "good customer", a phrase that doesn't make any sense here. How can I remove these kinds of duplicates?
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: adjective followed by one or more noun
pattern = [{"POS": 'ADJ'},{"POS": 'NOUN', "OP": '+'}]
matcher.add("ADJ_NOUN_PATTERN", None,pattern)
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
You may post-process the matches by grouping the tuples against the start index and only keeping the one with the largest end index:
from itertools import groupby
#...
matches = matcher(doc)
results = [max(list(group),key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']
groupby(matches, lambda prop: prop[1]) groups the matches by start index, here yielding the groups [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and [(5488211386492616699, 4, 6)]. max(list(group),key=lambda x: x[2]) then grabs the item whose end index (the third value) is the largest. Note that groupby only groups consecutive items, so this relies on the matches being ordered by start index.
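The grouping step can be run without spaCy by feeding it the match tuples quoted above. This is a minimal sketch: the tuples are copied from the explanation, and the explicit sort is an added safeguard, not part of the original answer.

```python
from itertools import groupby

# Match tuples (match_id, start, end) as listed in the explanation above.
matches = [
    (5488211386492616699, 0, 2),  # "good customer"
    (5488211386492616699, 0, 3),  # "good customer service"
    (5488211386492616699, 4, 6),  # "great ambience"
]

# groupby only merges consecutive items, so sort by start index first.
matches.sort(key=lambda m: m[1])

results = [
    max(group, key=lambda m: m[2])  # longest match within each group
    for _, group in groupby(matches, key=lambda m: m[1])
]
print(results)
# [(5488211386492616699, 0, 3), (5488211386492616699, 4, 6)]
```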
spaCy has a built-in function to do just that: spacy.util.filter_spans.
The documentation says:
When spans overlap, the (first) longest span is preferred over shorter spans.
Example:
from spacy.util import filter_spans
doc = nlp("This is a sentence.")
spans = [doc[0:2], doc[0:2], doc[0:4]]
filtered = filter_spans(spans)
# keeps only doc[0:4], the longest of the overlapping spans
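To make the "longest span is preferred" rule concrete, here is a rough pure-Python approximation that works on (start, end) tuples. It is a sketch of the documented behaviour under my own assumptions, not spaCy's actual implementation, and filter_longest is a hypothetical name.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans."""
    # Longest first; on equal length, the earlier start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        # Skip any span that shares a token index with a span already kept.
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

print(filter_longest([(0, 2), (0, 2), (0, 4)]))          # [(0, 4)]
print(filter_longest([(0, 2), (0, 3), (1, 3), (4, 6)]))  # [(0, 3), (4, 6)]
```

The second call shows a case that grouping by start index alone would not handle: (1, 3) overlaps (0, 3) but starts at a different token.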

Negate a word inside a pattern Python & spaCy

I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word 'not' does not appear between 'will' and 'be' in my text. Here is my code:
pattern = [{'LOWER':'purchase'},{'IS_SPACE':True, 'OP':'*'},{'LOWER':'order'},{'IS_SPACE':True, 'OP':'*'},{"IS_ASCII": True, "OP": "*"},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing the statement that actually registers the pattern. So I added a call to matcher.add(), which also reports a match by calling the self-defined function on_match.
More importantly, I had to change your pattern by leaving out the space part {'IS_SPACE':True, 'OP':'*'} to get a match.
Here's my working code, which gives me a match:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def on_match(matcher, doc, id, matches): # Added!
    print("match")
# Changing your pattern for example to:
pattern = [{'LOWER':'purchase'},{'LOWER':'order'},{'LOWER':'expenditures'},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
matcher.add("ID_A1", on_match, pattern) # Added!
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern, maybe too much, but I hope this still helps a bit.
Check this:
"TEXT": {"NOT_IN": ["not"]}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764

Python Regex - Different Results in findall and sub

I am trying to replace occurrences of the word 'brunch' with 'BRUNCH'. I am using a regex which correctly identifies the occurrence, but when I use re.sub it replaces more text than re.findall reports. The regex that I am using is:
re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
The string is
str = 'Valid only for dine-in January 2 - March 31, 2015. Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.'
I want it to produce:
'Valid only for dine-in January 2 - March 31, 2015. Excludes BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
The steps:
>>> reg.findall(str)
['brunch']
>>> reg.sub('BRUNCH',str)
'Valid only for dine-in January 2 - March 31, 2015BRUNCH, happy hour, holidays, and February 13 - 15, 2015.'
Edit:
The final solution that I used was:
reg = re.compile(r'((?:^|\.))(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',re.IGNORECASE)
reg.sub(r'\g<1>\g<2>BRUNCH',str)
For re.sub use
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)
Replace by \1\2BRUNCH. See demo:
https://regex101.com/r/eZ0yP4/16
Through regex:
(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)brunch
Replace the matched characters by \1\2BRUNCH
Why does it match more than brunch?
Because your regex actually does match more than brunch.
See the link for how the regex matches.
Why doesn't it show in findall?
Because you have wrapped only brunch in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',re.IGNORECASE)
>>> reg.findall(str)
['brunch']
After wrapping the entire ([^.]*brunch) in parentheses:
>>> reg = re.compile(r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*brunch)',re.IGNORECASE)
>>> reg.findall(str)
[' Excludes brunch']
re.findall ignores the parts of the match that are not captured.
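The difference between the full match and the captured group can be seen directly on a match object. The sketch below reuses the question's regex and string; the names text, m, fixed and result are chosen for the example, and the replacement pattern is the one suggested in the answers.

```python
import re

text = ('Valid only for dine-in January 2 - March 31, 2015. '
        'Excludes brunch, happy hour, holidays, and February 13 - 15, 2015.')

reg = re.compile(
    r'(?:^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)[^.]*(brunch)',
    re.IGNORECASE)

m = reg.search(text)
print(repr(m.group(0)))  # '. Excludes brunch'  <- the span re.sub replaces
print(repr(m.group(1)))  # 'brunch'             <- what re.findall reports

# Capture the prefix as well and put it back in the replacement.
fixed = re.compile(
    r'(^|\.)(?![^.]*saturday)(?![^.]*sunday)(?![^.]*weekend)([^.]*)(brunch)',
    re.IGNORECASE)
result = fixed.sub(r'\1\2BRUNCH', text)
print(result)
```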

One Way to Grab both text and percentages regex python distinguish numbers from letters within parentheses

I have this:
Bbc World News (57%); DANONE SA (FRANCE) (52%), Mn-Public-Radio-Intl; SIC123 Industry (52%)
I'd like to get:
[BBC World News, 57], [DANONE SA (FRANCE), 52], [Mn-Public-Radio Intl, 0], [SIC123 Industry, 52]
With the following, helpfully suggested by Martijn Pieters, I can get everything besides DANONE SA (FRANCE). I'm not sure how to distinguish between (FRANCE) and (52%).
pat = r'(?(\b[\w\s\d!////&,:.%##$-]+\b)(?:\s+\((\d+)%\))?'
[(name, int(perc) if perc else np.nan)
 for name, perc in re.findall(pat, inputtext)]
You can include the ( and ) characters in the character class, but then it will also match the first characters of the percentage text (so "(57" in the case of "Bbc World News (57%)"). To still make this work, you need a look-ahead that matches on the trailing , or ; or the end of the string:
re.findall(r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)', inputtext)
The (?=...) is a look-ahead match; that section is now anchored to any location that is followed either by a character matching the [,;] class or the end of a line. That makes the part before it, matching an optional (..%) percentage amount, only work before a comma or semicolon or the end of the text, and that then limits what the part before can match.
Demo:
>>> import re
>>> import numpy as np
>>> inputtext = 'Bbc World News (57%); DANONE SA (FRANCE) (52%), Mn-Public-Radio-Intl; SIC123 Industry (52%)'
>>> [(name, int(perc) if perc else np.nan)
... for name, perc in re.findall(r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)', inputtext)]
[('Bbc World News', 57), ('DANONE SA (FRANCE)', 52), ('Mn-Public-Radio-Intl', nan), ('SIC123 Industry', 52)]
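The question asked for 0 rather than nan when no percentage is present. A variant of the demo that does this without numpy might look as follows; treating 0 as the default and returning lists instead of tuples are assumptions based on the desired output in the question.

```python
import re

inputtext = ('Bbc World News (57%); DANONE SA (FRANCE) (52%), '
             'Mn-Public-Radio-Intl; SIC123 Industry (52%)')

# Same pattern as in the demo above.
pat = r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)'

# perc is an empty string when the optional percentage group did not match.
pairs = [[name, int(perc) if perc else 0]
         for name, perc in re.findall(pat, inputtext)]
print(pairs)
# [['Bbc World News', 57], ['DANONE SA (FRANCE)', 52], ['Mn-Public-Radio-Intl', 0], ['SIC123 Industry', 52]]
```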
