I have seen some ways to create a custom tokenizer, but I am a little confused. What I am doing is using the Phrase Matcher to match patterns. However, it would match a 4-digit number pattern, say 1234, in 111-111-1234, since it splits on the dash.
All I want to do is modify the current tokenizer (from nlp = English()) and add a rule that it should not split on some characters but only for numeric patterns.
To do this you will need to overwrite spaCy's default infix tokenization scheme with your own. You can do this by modifying the infix tokenization scheme used by spaCy found here.
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])
# modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\*^](?=[0-9-])", # Remove the hyphen
r"(?<=[{al}{q}])\.?(?=[{au}{q}])".format( # Make the dot optional
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
)
,
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])
Output
With default tokenizer:
['111', '-', '222', '-', '1234', 'for', 'abcDE']
With custom tokenizer:
['111-222-1234', 'for', 'abc', 'DE']
Related
I've been trying to solve a problem with the spacy Tokenizer for a while, without any success. Also, I'm not sure if it's a problem with the tokenizer or some other part of the pipeline.
Description
I have an application that for reasons besides the point, creates a spacy Doc from the spacy vocab and the list of tokens from a string (see code below). Note that while this is not the simplest and most common way to do this, according to spacy doc this can be done.
However, when I create a Doc for a text that contains compound words or dates with hyphen as a separator, the behavior I am getting is not what I expected.
import spacy
from spacy.language import Doc
# My current way
doc = Doc(nlp.vocab, words=tokens) # Tokens is a well defined list of tokens for a certein string
# Standard way
doc = nlp("My text...")
For example, with the following text, if I create the Doc using the standard procedure, the spacy Tokenizer recognizes the "-" as tokens but the Doc text is the same as the input text, in addition the spacy NER model correctly recognizes the DATE entity.
import spacy
doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)
tokens = [str(token) for token in doc]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022-12-24
On the other hand, if I create the Doc from the model's vocab and the previously calculated tokens, the result obtained is different. Note that for the sake of simplicity I am using the tokens from doc, so I'm sure there are no differences in tokens. Also note that I am manually running each pipeline model in the correct order with the doc, so at the end of this process I would theoretically get the same results.
However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different, there were blank spaces between the digits and the date separators.
doc2 = Doc(nlp.vocab, words=tokens)
# Run each model in pipeline
for model_name in nlp.pipe_names:
pipe = nlp.get_pipe(model_name)
doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
what time will sunset be on 2022 - 12 - 24 ?
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022 - 12 - 24
I know it must be something silly that I'm missing but I don't realize it.
Could someone please explain to me what I'm doing wrong and point me in the right direction?
Thanks a lot in advance!
EDIT
Following the Talha Tayyab suggestion, I have to create an array of booleans with the same length that my list of tokens to indicate for each one, if the token is followed by an empty space. Then pass this array in doc construction as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).
To compute this list of boolean values based on my original text string and list of tokens, I implemented the following vanilla function:
def get_spaces(self, text: str, tokens: List[str]) -> List[bool]:
# Spaces
spaces = []
# Copy text to easy operate
t = text.lower()
# Iterate over tokens
for token in tokens:
if t.startswith(token.lower()):
t = t[len(token):] # Remove token
# If after removing token we have an empty space
if len(t) > 0 and t[0] == " ":
spaces.append(True)
t = t[1:] # Remove space
else:
spaces.append(False)
return spaces
With these two improvements in my code, the result obtained is as expected. However, now I have the following question:
Is there a more spacy-like way to compute whitespace, instead of using my vanilla implementation?
Please try this:
from spacy.language import Doc
doc2 = Doc(nlp.vocab, words=tokens,spaces=[1,1,1,1,1,1,0,0,0,0,0,0])
# Run each model in pipeline
for model_name in nlp.pipe_names:
pipe = nlp.get_pipe(model_name)
doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
# You can also replace 0 with False and 1 with True
This is the complete syntax:
doc = Doc(nlp.vocab, words=words, spaces=spaces)
spaces are a list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True.
So you can choose which ones you gonna have space and which ones you do not need.
Reference: https://spacy.io/api/doc
Late to this, but as you've retrieved the tokens from a document to begin with I think you can just use the whitespace_ attribute of the token for this. Then your 'get_spaces` function looks like:
def get_spaces(tokens):
return [1 if token.whitespace_ else 0 for token in tokens]
Note that this won't work nicely if there multiple spaces or non-space whitespace (e.g. tabs), but then you probably need to update the tokenizer or use your existing solution and update this part:
if len(t) > 0 and t[0] == " ":
spaces.append(True)
t = t[1:] # Remove space
to check for generic whitespace and remove more than just a leading space.
I'm trying to set a matcher finding word 'iPhone X'.
The sample code says I should follow below.
import spacy
# Import the Matcher
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach by putting like below.
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why is the second approach not working? I assumed if I put the two word 'iPhone' and 'X' together, it might work as the same way cause it regard the word with space in the middle as a long unique word. But it didn't.
The possible reason I could think of is,
matcher condition should be a single word without empty space.
Am I right? or is there another reason the second approach not working?
Thank you.
The answer is in how Spacy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you see, the iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
Thus, you cannot use them both in one token definition.
When using spacy to tokenize a sentence, I want it to not split into tokens on /
Example:
import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
print(i)
Output:
Get
10ct
/
liter
off
when
using
our
App
I want it to be like Get , 10ct/liter, off, when ....
I was able to find how to add more ways to split into tokens for spacy, but not how to avoid specific splitting techniques.
I suggest using a custom tokenizer, see Modifying existing rule sets:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
#r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## => ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']
Note the commented #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), line, I simply took out the / char from the [:<>=/] character class. This rule split at / that is between a letter/digit and a letter.
If you need to still split '12/ct' into three tokens, you will need to add another line below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:
r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
I would like to know if the spacy tokenizer could tokenize words only using the "space" rule.
For example:
sentence= "(c/o Oxford University )"
Normally, using the following configuration of spacy:
nlp = spacy.load("en_core_news_sm")
doc = nlp(sentence)
for token in doc:
print(token)
the result would be:
(
c
/
o
Oxford
University
)
Instead, I would like an output like the following (using spacy):
(c/o
Oxford
University
)
Is it possible to obtain a result like this using spacy?
Let's change nlp.tokenizer with a custom Tokenizer with token_match regex:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]
You can further adjust Tokenizer by adding custom suffix, prefix, and infix rules.
An alternative, more fine grained way would be to find out why it's token is split like it is with nlp.tokenizer.explain():
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)
You'll find out that split is due to SPECIAL rules:
[('TOKEN', 'This'),
('TOKEN', 'is'),
('SPECIAL-1', 'it'),
('SPECIAL-2', "'s"),
('SUFFIX', '.'),
('SPECIAL-1', 'I'),
('SPECIAL-2', "'m"),
('TOKEN', 'fine')]
that could be updated to remove "it's" from exceptions like:
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]
or remove split on apostrophe altogether:
filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]
Note the dot attached to the token, which is due to the suffix rules not specified.
You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a function that takes a string text and returns a Doc object, and then assign that callable function to nlp.tokenizer:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
def __init__(self, vocab):
self.vocab = vocab
def __call__(self, text):
words = text.split(' ')
# All tokens 'own' a subsequent space character in this tokenizer
spaces = [True] * len(words)
return Doc(self.vocab, words=words, spaces=spaces)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as a single token.
Here's my code:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en', vectors=False, parser=False, entity=False)
nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
The output is:
['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']
If I put spaces around the tags, like this:
doc = nlp('Hello, <i> world </i> !')
The output is as I want it:
['Hello', ',', '<i>', 'world', '</i>', '!']
but I'd like avoiding complicated pre-processing to the HTML.
Any idea how can I approach this?
You need to create a custom Tokenizer.
Your custom Tokenizer will be exactly as spaCy's tokenizer but it will have '<' and '>' symbols removed from prefixes and suffixes and also it will add one new prefix and one new suffix rule.
Code:
import spacy
from spacy.tokens import Token
Token.set_extension('tag', default=False)
def create_custom_tokenizer(nlp):
from spacy import util
from spacy.tokenizer import Tokenizer
from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
prefixes = nlp.Defaults.prefixes + ('^<i>',)
suffixes = nlp.Defaults.suffixes + ('</i>$',)
# remove the tag symbols from prefixes and suffixes
prefixes = list(prefixes)
prefixes.remove('<')
prefixes = tuple(prefixes)
suffixes = list(suffixes)
suffixes.remove('>')
suffixes = tuple(suffixes)
infixes = nlp.Defaults.infixes
rules = nlp.Defaults.tokenizer_exceptions
token_match = TOKEN_MATCH
prefix_search = (util.compile_prefix_regex(prefixes).search)
suffix_search = (util.compile_suffix_regex(suffixes).search)
infix_finditer = (util.compile_infix_regex(infixes).finditer)
return Tokenizer(nlp.vocab, rules=rules,
prefix_search=prefix_search,
suffix_search=suffix_search,
infix_finditer=infix_finditer,
token_match=token_match)
nlp = spacy.load('en_core_web_sm')
tokenizer = create_custom_tokenizer(nlp)
nlp.tokenizer = tokenizer
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
For the record, it might be that this has become easier: With the current version of Spacy, you don't have to create a custom tokenizer anymore. It suffices to 1. extend the infixes (to ensure tags are separated from words), and 2. add the tags as special cases:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_trf")
infixes = nlp.Defaults.infixes + [r'(<)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case(f"<i>", [{ORTH: f"<i>"}])
nlp.tokenizer.add_special_case(f"</i>", [{ORTH: f"</i>"}])
text = """Hello, <i>world</i> !"""
doc = nlp(text)
print([e.text for e in doc])
Prints:
['Hello', ',', '<i>', 'world', '</i>', '!']
(This is more or less a condensed version of https://stackoverflow.com/a/66268015/1016514)