I have documents where the character \u0080 is used as a euro sign. I want to add this and other characters to the currency symbol list so that the MONEY entity gets picked up by the spaCy NER. What is the best way to deal with this?
Additionally, I have cases where money is represented as CAD 5,000, and these are not picked up by the NER as MONEY. What is the best way to deal with this situation: training the NER or adding CAD as a currency symbol?
1. The '\u0080' problem
First things first: the interpretation of the '\u0080' character seems to depend on the platform you are using; it doesn't print on a Windows 7 machine but it works on a Linux machine...
For completeness, I have assumed that you get your text from an HTML document containing the &#128; escape sequence (which should print as € in a browser), the '\u0080' character and some other arbitrary symbols we identify as currencies.
Before passing the text content to spaCy, we can call html.unescape, which will take care of translating &#128; to €, which in turn is recognized by the default configuration as a currency.
text_html = ("I just found out that CAD 1,000 is about 641.3 €. "
"Some people call it 641.3 \u0080. "
"Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")
Second, if there are symbols that are not recognized as currencies, like 🎅 and 🌻, then we can change the Defaults of the language we use to qualify them as currencies.
This consists of replacing the lex_attr_getters[IS_CURRENCY] function with a custom one that holds a list of symbols describing a currency.
def is_currency_custom(text):
    # Strip punctuation
    table = str.maketrans({key: None for key in string.punctuation})
    text = text.translate(table)
    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if text in all_currencies:
        return True
    return is_currency_original(text)
# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom
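As a quick sanity check (a sketch only, and assuming the reassignment above happens before the pipeline is loaded, as in the full code further down), a newly created lexeme should pick up the custom getter:
import spacy

nlp = spacy.load('en')  # load *after* patching EnglishDefaults
# Looking up a string creates a lexeme using the patched lexical attribute
# getters, so the emoji should now be flagged as a currency symbol.
print(nlp.vocab["\U0001F385"].is_currency)  # should print True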
2. The CAD 5,000 problem
For this one, a simple solution would be to define a special case: we tell the tokenizer that, whenever it meets CAD, it should handle it as instructed by us. Amongst other things, we can set the IS_CURRENCY flag.
from spacy.symbols import ORTH, TAG, IS_CURRENCY

special_case = [{
    ORTH: u'CAD',
    TAG: u'$',
    IS_CURRENCY: True}]
nlp.tokenizer.add_special_case(u'CAD', special_case)
Note that this is not perfect, as you may get false positives. Imagine a document from a Canadian company selling CAD drawing services... So this is good but not great.
If we want to be more precise, we can create a Matcher object that will look for patterns like CURRENCY[SPACE]NUMBER or NUMBER[SPACE]CURRENCY and associate the MONEY entity with it.
matcher = Matcher(nlp.vocab)
MONEY = nlp.vocab.strings['MONEY']
# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)
You apply it to your doc object with matcher(doc). The 'OP': '?' key makes the space token optional, allowing it to match 0 or 1 times.
3. The full code
import spacy
from spacy.symbols import IS_CURRENCY
from spacy.lang.en import EnglishDefaults
from spacy.matcher import Matcher
from spacy import displacy
import html
import string
def is_currency_custom(text):
    # Strip punctuation
    table = str.maketrans({key: None for key in string.punctuation})
    text = text.translate(table)
    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if text in all_currencies:
        return True
    return is_currency_original(text)
# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
MONEY = nlp.vocab.strings['MONEY']
# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)
text_html = ("I just found out that CAD 1,000 is about 641.3 €. "
"Some people call it 641.3 \u0080. "
"Fantastic! But in the U.K. I'd rather pay 344🎅 or \U0001F33B56.")
text = html.unescape(text_html)
doc = nlp(text)
matcher(doc)
displacy.serve(doc, style='ent')
This gives the expected result, with the amounts highlighted as MONEY entities in the displaCy visualization.
I have a bunch of regexes set up this way (for simplicity the regex patterns here are very easy; in the real case they are very long and barely comprehensible, since they are created automatically by another tool). I want to create spans in a doc based on those regexes.
This is the code:
import spacy
from spacy.tokens import Doc, Span, Token
import re

rx1 = ["blue", "blue print"]
text = " this is blue but there is a blue print. The light is red and the heat is in the infra red."
my_regexes = {'blue': ["blue", "blue print"],
              'red': ["red", "infra red"]}

nlp = spacy.blank("en")
doc = nlp(text)
print(doc.text)

for name, rxs in my_regexes.items():
    doc.spans[name] = []
    for rx in rxs:
        for i, match in enumerate(re.finditer(rx, doc.text)):
            start, end = match.span()
            # This is a Span object, or None if the match doesn't map to a valid token sequence
            span = doc.char_span(start, end, alignment_mode="expand")
            if span is not None:
                span_to_add = Span(doc, span.start, span.end, label=name + str(i))
                doc.spans[name].append(span_to_add)
                print("Found match:", name, start, end, span.text)
It works.
Now I want to filter the spans so that, when a series of tokens (for instance "infra red") contains another span ("red"), only the longest one is kept.
I saw this:
How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?
but that looks to be for a Matcher, and I cannot make it work in my case, since I would like to eliminate the shorter Span from the document.
Any idea?
spacy.util.filter_spans will do this. The answer is the same as in the linked question, where matcher results are converted to spans in order to filter them with this function.
doc.spans[name] = spacy.util.filter_spans(doc.spans[name])
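For example, a minimal sketch of applying it to the span groups built in the question's loop (assuming doc and my_regexes are defined as above):
from spacy.util import filter_spans

# Keep only the longest span among overlapping ones in each group,
# e.g. "infra red" wins over "red".
for name in my_regexes:
    doc.spans[name] = filter_spans(doc.spans[name])
    for span in doc.spans[name]:
        print(name, span.start_char, span.end_char, span.text)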
I've been trying to solve a problem with the spacy Tokenizer for a while, without any success. Also, I'm not sure if it's a problem with the tokenizer or some other part of the pipeline.
Description
I have an application that, for reasons beside the point, creates a spaCy Doc from the spaCy vocab and a list of tokens from a string (see code below). Note that while this is not the simplest and most common way to do this, according to the spaCy docs it can be done.
However, when I create a Doc for a text that contains compound words or dates with hyphen as a separator, the behavior I am getting is not what I expected.
import spacy
from spacy.tokens import Doc

# nlp is a loaded pipeline, e.g. nlp = spacy.load("en_core_web_sm")

# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well-defined list of tokens for a certain string

# Standard way
doc = nlp("My text...")
For example, with the following text, if I create the Doc using the standard procedure, the spaCy Tokenizer recognizes the "-" characters as separate tokens, but the Doc text is the same as the input text; in addition, the spaCy NER model correctly recognizes the DATE entity.
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with an NER component

doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)
tokens = [str(token) for token in doc]
print(tokens)

# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022-12-24
On the other hand, if I create the Doc from the model's vocab and the previously calculated tokens, the result is different. Note that, for the sake of simplicity, I am using the tokens from doc, so I'm sure there are no differences in the tokens. Also note that I am manually running each pipeline component in the correct order on the doc, so at the end of this process I should theoretically get the same results.
However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different: there are blank spaces between the digits and the date separators.
doc2 = Doc(nlp.vocab, words=tokens)

# Run each model in the pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)
Output:
what time will sunset be on 2022 - 12 - 24 ?
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022 - 12 - 24
I know it must be something silly that I'm missing, but I can't see it.
Could someone please explain to me what I'm doing wrong and point me in the right direction?
Thanks a lot in advance!
EDIT
Following Talha Tayyab's suggestion, I have to create an array of booleans with the same length as my list of tokens, indicating for each one whether the token is followed by a space. Then pass this array to the Doc construction as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).
To compute this list of boolean values based on my original text string and list of tokens, I implemented the following vanilla function:
from typing import List

def get_spaces(text: str, tokens: List[str]) -> List[bool]:
    # Spaces
    spaces = []
    # Copy text so it is easy to operate on
    t = text.lower()
    # Iterate over tokens
    for token in tokens:
        if t.startswith(token.lower()):
            t = t[len(token):]  # Remove token
        # If, after removing the token, there is a leading space
        if len(t) > 0 and t[0] == " ":
            spaces.append(True)
            t = t[1:]  # Remove space
        else:
            spaces.append(False)
    return spaces
With these two improvements in my code, the result obtained is as expected. However, now I have the following question:
Is there a more spacy-like way to compute whitespace, instead of using my vanilla implementation?
Please try this:
from spacy.tokens import Doc

doc2 = Doc(nlp.vocab, words=tokens, spaces=[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Run each model in the pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)

# You can also replace 0 with False and 1 with True
This is the complete syntax:
doc = Doc(nlp.vocab, words=words, spaces=spaces)
spaces is a list of boolean values indicating whether each word is followed by a space. It must have the same length as words, if specified. Defaults to a sequence of True.
So you can choose which tokens are followed by a space and which are not.
Reference: https://spacy.io/api/doc
Late to this, but as you've retrieved the tokens from a document to begin with, I think you can just use the whitespace_ attribute of the token for this. Then your get_spaces function looks like:
def get_spaces(tokens):
    return [1 if token.whitespace_ else 0 for token in tokens]
Note that this won't work nicely if there are multiple spaces or non-space whitespace (e.g. tabs), but then you probably need to update the tokenizer, or use your existing solution and update this part:
if len(t) > 0 and t[0] == " ":
    spaces.append(True)
    t = t[1:]  # Remove space
to check for generic whitespace and remove more than just a leading space.
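For instance, a rough sketch of that generalisation of the original get_spaces (just one way to do it; it still assumes every token appears at the start of the remaining text):
def get_spaces(text, tokens):
    # Variant of the question's function: treat any whitespace run
    # (spaces, tabs, newlines) as "token is followed by a space".
    spaces = []
    t = text
    for token in tokens:
        if t.lower().startswith(token.lower()):
            t = t[len(token):]  # Remove the token from the remaining text
        if t and t[0].isspace():
            spaces.append(True)
            t = t.lstrip()  # Remove the whole whitespace run, not just one space
        else:
            spaces.append(False)
    return spaces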
I'm trying to set up a matcher that finds the word 'iPhone X'.
The sample code says I should do it as below.
import spacy
# Import the Matcher
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach, writing the pattern like below.
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why is the second approach not working? I assumed that if I put the two words 'iPhone' and 'X' together, it might work the same way, because it would regard the words with a space in the middle as one long unique word. But it didn't.
The possible reason I could think of is that a matcher condition should be a single word without an empty space.
Am I right? Or is there another reason the second approach is not working?
Thank you.
The answer is in how Spacy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you see, the iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
Thus, you cannot use them both in one token definition.
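If you would rather match on the surface string instead of describing each token, one alternative (a sketch only, using the same add() call style as the question's spaCy version) is the PhraseMatcher, whose patterns are Doc objects, so the multi-token "iPhone X" is handled for you:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# PhraseMatcher patterns are Docs, so "iPhone X" is tokenized the same
# way as the text it is matched against.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("IPHONE_X_PATTERN", None, nlp("iPhone X"))

matches = phrase_matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])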
I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word 'not' does not exist between 'will' and 'be' inside my text. Here is my code:
pattern = [{'LOWER':'purchase'},{'IS_SPACE':True, 'OP':'*'},{'LOWER':'order'},{'IS_SPACE':True, 'OP':'*'},{"IS_ASCII": True, "OP": "*"},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing a statement that actually performs the match, so I added a call to matcher.add() that also reports a match by calling the self-defined function on_match.
But more importantly, I had to change your pattern by leaving out the space part {'IS_SPACE':True, 'OP':'*'} to get a match.
Here's my working code that gives me a match:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, id, matches):  # Added!
    print("match")

# Changing your pattern, for example, to:
pattern = [{'LOWER': 'purchase'}, {'LOWER': 'order'}, {'LOWER': 'expenditures'},
           {'LOWER': 'not', 'OP': '!'}, {'LEMMA': 'be'}, {'LEMMA': 'freeze'}]

matcher.add("ID_A1", on_match, pattern)  # Added!

doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern - maybe too much. But I hope I could still help a bit.
Check this:
"TEXT": {"NOT_IN": ["not"]}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764
How do I match string items in a list to locations in a larger string, particularly if those string items were derived from the larger string?
I currently receive my output from AlchemyAPI in this format.
each list = [Name of Entity, Count of times the Entity appears in the text, Entity Type, Entity Sentiment]
[['Social Entrepreneurship', u'25', u'Organization', u'0.854779'],
['Skoll Centre for Social Entrepreneurship',
u'6',
u'Organization',
u'0.552907'],
However, in order to evaluate the accuracy of this NER output, I'd like to map the entity type from my AlchemyAPI output onto the text I already have. So for instance, if my text is the following (this is also the text I used to get my AlchemyAPI output):
If
Social
Entrepreneurship
acts
like
This
Social
Entrepreneurship
I'd like to have the fact that Social Entrepreneurship is mentioned 25 times as an ORG applied to my text. So, this would be a snippet of those 25 times.
If
Social ORG
Entrepreneurship ORG
acts
like
This
Social ORG
Entrepreneurship ORG
I would go about this using a tokenizer on both the text that you're sending to the API and the returned entities to find matches. NLTK provides that functionality out of the box with its comprehensive "word_tokenize" method (http://www.nltk.org/book/ch03.html), though any tokenizer will work as long as it tokenizes the entities the same way as the text (e.g. raw.split()).
# Generic tokenizer (if you don't use NLTK's)
def word_tokenize(raw):
    return raw.split()
With that, you iterate over each word (token) in the document, checking whether it matches the first token of any of the returned entities.
for word in word_tokenize(raw):
    for entity in entity_results:
        if word.upper() in (e.upper() for e in word_tokenize(entity[0])):
            print(" ".join([word] + entity[1:]))
        else:
            print(word)
You may want to expand on this to get an exact match for the full entity, testing for the length of the token list, and testing each element by index instead.
words = word_tokenize(raw)
ents = [[e for e in word_tokenize(entity[0])] for entity in entity_results]

for word_idx in range(len(words)):
    for ent in ents:
        # Check the word against the first word in the entity
        if words[word_idx].upper() == ent[0].upper():
            match = True
            # Check all words in the entity
            for ent_idx in range(len(ent)):
                if ent[ent_idx] != words[word_idx + ent_idx]:
                    match = False
                    break
            if match:
                print(" ".join([words[word_idx]] + ent))
            else:
                print(words[word_idx])
        else:
            print(words[word_idx])
You may notice, though, that this prints out the full entity if it matches, that it only gets you matches on the first word, and that it doesn't handle IndexError problems if the line "ent[ent_idx] != words[word_idx + ent_idx]" references an invalid index. Some work is needed, depending on what you want to do with the output.
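For example, a rough sketch of an exact, bounds-checked version (reusing word_tokenize, raw and entity_results from above; the entity type is taken from the third field of each result):
words = word_tokenize(raw)
ents = [word_tokenize(entity[0]) for entity in entity_results]
labels = [entity[2] for entity in entity_results]

i = 0
while i < len(words):
    matched = False
    # Try longer entities first so "Skoll Centre for Social Entrepreneurship"
    # wins over the shorter "Social Entrepreneurship" at the same position.
    for ent, label in sorted(zip(ents, labels), key=lambda pair: -len(pair[0])):
        # Exact, bounds-checked comparison of the full entity token sequence.
        if [w.upper() for w in words[i:i + len(ent)]] == [e.upper() for e in ent]:
            for w in words[i:i + len(ent)]:
                print(w, label)
            i += len(ent)
            matched = True
            break
    if not matched:
        print(words[i])
        i += 1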
Finally, this all assumes that AlchemyAPI isn't including co-references in their final count. Co-reference is when you refer to an entity using "he", "she", "it", "they", etc. That's something you'll have to test on your own.