Problem analyzing a doc column in a df with spaCy nlp - python

After using an Amazon review scraper to build this data frame, I called nlp on each review in order to tokenize it and create a new column containing the processed reviews as 'docs'.
However, now I am trying to create a pattern in order to analyze the reviews in the doc column, but I keep getting no matches, which makes me think I'm missing one more pre-processing step, or perhaps not pointing the matcher in the right direction.
While the following code executes without any errors, I receive a matches list with 0 entries, even though I know the word exists in the doc column. The docs for spaCy are still a tad slim, and I'm not too sure my matcher.add call is correct, as the one specified in the tutorial
matcher.add("Name_of_List", None, pattern)
returns an error saying that only 2 arguments are required for this class.
source -- https://course.spacy.io/en/chapter1
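As far as I can tell, that tutorial targets spaCy v2, where the second argument to matcher.add was an on_match callback. In spaCy v3, patterns are passed as a list and the callback became a keyword argument, so the two-argument form would be:
# spaCy v2 (as in the course):
# matcher.add("Name_of_List", None, pattern)
# spaCy v3 equivalent:
matcher.add("Name_of_List", [pattern])
# or, with a callback (my_callback is a placeholder for your own function):
matcher.add("Name_of_List", [pattern], on_match=my_callback)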
Question: What do I need to change to accurately analyze the df doc column for the pattern created?
Thanks!
Full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_md')
df = pd.read_csv('paper_towel_US.csv')
#calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df.doc]
#calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
df['doc'].apply(find_matches)
df sample for reproduction, generated via df.iloc[596:600, :].to_clipboard(sep=','):
,product,title,rating,body,doc,num_tokens
596,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Awesome!,5,Great towels!,Great towels!,3
597,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Good buy!,5,Love these,Love these,2
598,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Meh,3,"Does not clean countertop messes well. Towels leave a large residue. They are durable, though","Does not clean countertop messes well. Towels leave a large residue. They are durable, though",18
599,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Exactly as Described. Packaged Well and Mailed Promptly,4,Exactly as Described. Packaged Well and Mailed Promptly,Exactly as Described. Packaged Well and Mailed Promptly,9

You are trying to get the matches from the literal string "df.doc" with doc = nlp("df.doc"). You need to extract matches from the df['doc'] column instead.
An example solution is to remove doc = nlp("df.doc") and apply find_matches to each Doc in the column (here with nlp = spacy.load('en_core_web_sm')):
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
>>> df['doc'].apply(find_matches)
0                  None
1    (0, 2, Love these)
2                  None
3                  None
Name: doc, dtype: object
Full code snippet:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
df = pd.read_csv(r'C:\Users\admin\Desktop\s.txt')
#calling on NLP to return processed doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df.doc]
#calling matcher to create pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])
#doc = nlp("df.doc")
#matches = matcher(doc)
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)
print(df['doc'].apply(find_matches))
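Note that find_matches returns only the first filtered span per review, because the return sits inside the loop. If you want every match per review, a small variant (the name find_all_matches is just illustrative) could collect them into a list:
def find_all_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [(span.start, span.end, span.text)
            for span in spacy.util.filter_spans(spans)]

df['matches'] = df['doc'].apply(find_all_matches)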

Related

filter custom spans overlaps in spacy doc

I have a bunch of regexes like this:
(for simplicity the regex patterns here are very simple; in the real case they are very long and barely comprehensible, since they are generated automatically by another tool)
I want to create spans in a doc based on those regexes.
This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re
rx1 = ["blue","blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {'blue':["blue","blue print"],
'red': ["red", "infra red"] }
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)
for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # char_span returns a Span, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is not None:
                span_to_add = Span(doc, span.start, span.end, label=name + str(i))
                doc.spans[name].append(span_to_add)
                print("Found match:", name, start, end, span.text)
It works.
Now I want to filter the spans so that when a series of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept.
I saw this:
How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a Matcher, and I cannot make it work in my case, since I want to eliminate the shorter Span from the document.
Any idea?
spacy.util.filter_spans will do this. The answer is the same as in the linked question, where matcher results are converted to spans in order to filter them with this function.
doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
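As a minimal sketch of what filter_spans does (the token indices assume a blank English tokenizer on this sentence):
import spacy

nlp = spacy.blank("en")
doc = nlp("The light is red and the heat is in the infra red.")
# Two overlapping candidates: "infra red" (tokens 10-12) contains "red" (token 11)
overlapping = [doc[10:12], doc[11:12]]
print(spacy.util.filter_spans(overlapping))  # [infra red] - only the longest span is kept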

How to write code to merge punctuations and phrases using spaCy

What I would like to do
I would like to parse and run dependency analysis using spaCy, one of the open-source libraries for natural language processing.
In particular, I would like to know how to write code for the option to merge punctuation and phrases in Python.
Problem
There are buttons to merge punctuation and phrases on the displaCy Dependency Visualizer web app.
However, I cannot find the way to set these options when writing code in the local environment.
The current code returns the following unmerged version.
The sample sentence is from your dictionary.
Current Code
It is from the sample code on the spaCy official website.
Please let me know how to modify it to set the punctuation and phrase merge options.
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(sentence)
displacy.render(doc, style="dep")
What I tried to do
There was one example of a merge implementation.
However, it didn't work when I applied it to my sentence.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Example Code
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.
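For instance, a minimal sketch assuming spaCy v3, where built-in components are added to the pipeline by their string name:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_noun_chunks")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
# Noun chunks such as "Christmas Eve" and "the fire" now come out as single tokens
print([token.text for token in doc])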
You can just use the code from the displaCy Dependency Visualizer:
import spacy
nlp = spacy.load("en_core_web_sm")
def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

def merge_punct(doc):
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc

text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens
doc = merge_punct(doc)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Patterns in Spacy

I'm creating a simple program using spaCy to learn how to use it. I have created a pattern to recognize when the user enters "1 day" or "3 weeks", like this:
[{"IS_DIGIT": True},{"TEXT":"days"}],
[{"IS_DIGIT": True},{"TEXT":"day"}])
However, I also want it to recognize when the user enters "4 days" instead. How can I achieve this?
You may achieve that with:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
txt = "This will take 1 day. That will take 3 days. It may take up to 3 weeks."
doc = nlp(txt)
matcher = Matcher(nlp.vocab)
pattern = [{"IS_DIGIT":True},{"LEMMA":{"REGEX":"day|week|month|year"}}]
matcher.add("HERE_IS_YOUR_MATCH",None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end])
HERE_IS_YOUR_MATCH 1 day
HERE_IS_YOUR_MATCH 3 days
HERE_IS_YOUR_MATCH 3 weeks
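This works because "day", "days" and "weeks" all lemmatize to "day" and "week", so a single REGEX over the LEMMA attribute covers both singular and plural forms without listing every surface variant, unlike matching on TEXT.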

Negate a word inside a pattern Python & spaCy

I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word 'not' does not exist between will and be inside my text. Here is my code:
pattern = [{'LOWER':'purchase'},{'IS_SPACE':True, 'OP':'*'},{'LOWER':'order'},{'IS_SPACE':True, 'OP':'*'},{"IS_ASCII": True, "OP": "*"},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing a statement that actually performs the match, so I added the method matcher.add(), which also reports a match by calling the self-defined function on_match.
But more importantly, I had to change your pattern by leaving out the space part {'IS_SPACE': True, 'OP': '*'} to get a match.
Here's my working code that gives me a match:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def on_match(matcher, doc, id, matches):  # Added!
    print("match")

# Changing your pattern for example to:
pattern = [{'LOWER': 'purchase'}, {'LOWER': 'order'}, {'LOWER': 'expenditures'},
           {'LOWER': 'not', 'OP': '!'}, {'LEMMA': 'be'}, {'LEMMA': 'freeze'}]
matcher.add("ID_A1", [pattern], on_match=on_match)  # Added! (spaCy v3 signature; in v2: matcher.add("ID_A1", on_match, pattern))
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern - maybe too much. But I hope I could still help a bit.
Check this:
"TEXT": {"NOT_IN": ["not"]}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764

New named entity class in Spacy

I need to train spaCy NER to be able to recognize 2 new classes for named entity recognition; all I have are files with lists of items that are supposed to be in the new classes.
For example: Rolling Stones, Muse, Arctic Monkeys - artists
Any idea how this can be done?
This seems like a perfect use case for Matcher or PhraseMatcher (if you care about performance).
import spacy
nlp = spacy.load('en')
def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches) - 1:
        return None
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='1', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Rolling'}, {spacy.attrs.ORTH: 'Stones'}]], on_match=merge_phrases)
matcher.add(entity_key='2', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Muse'}]], on_match=merge_phrases)
matcher.add(entity_key='3', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Arctic'}, {spacy.attrs.ORTH: 'Monkeys'}]], on_match=merge_phrases)
doc = nlp(u'The Rolling Stones are an English rock band formed in London in 1962. The first settled line-up consisted of Brian Jones, Ian Stewart, Mick Jagger, Keith Richards, Bill Wyman and Charlie Watts')
matcher(doc)
for ent in doc.ents:
    print(ent)
See the documentation for more details. From my experience, with 400k entities in a Matcher it would take almost a second to match each document.
PhraseMatcher is much, much faster, but a bit trickier to use. Note that it is a "strict" matcher: it won't match any entities it hasn't seen before.
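The code above uses the old spaCy 1.x Matcher API and won't run on recent versions. As a rough sketch of the same idea with the current (v3) PhraseMatcher, assuming the en_core_web_sm model is installed:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
artists = ["Rolling Stones", "Muse", "Arctic Monkeys"]
# make_doc only tokenizes, which keeps building the patterns fast
matcher.add("ARTIST", [nlp.make_doc(name) for name in artists])

doc = nlp("The Rolling Stones are an English rock band formed in London in 1962.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)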
