When using spaCy to tokenize a sentence, I want it not to split tokens on "/".
Example:
import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)
Output:
Get
10ct
/
liter
off
when
using
our
App
I want the output to be like: Get, 10ct/liter, off, when, ...
I was able to find out how to add more ways of splitting text into tokens in spaCy, but not how to prevent specific kinds of splits.
I suggest using a custom tokenizer, see Modifying existing rule sets:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        # r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## => ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']
Note the commented-out # r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), line: I simply took the / char out of the [:<>=/] character class. That rule used to split on a / located between a letter/digit and a letter.
If you still need '12/ct' to be split into three tokens, add another line right below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:
r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
Related
I've been trying to solve a problem with the spaCy Tokenizer for a while, without any success. I'm also not sure whether it's a problem with the tokenizer or some other part of the pipeline.
Description
I have an application that, for reasons beside the point, creates a spaCy Doc from the spaCy vocab and the list of tokens of a string (see code below). Note that while this is not the simplest and most common way to do this, according to the spaCy docs it can be done.
However, when I create a Doc for a text that contains compound words or dates with a hyphen as a separator, the behavior I get is not what I expected.
import spacy
from spacy.tokens import Doc

# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well-defined list of tokens for a certain string

# Standard way
doc = nlp("My text...")
For example, with the following text, if I create the Doc using the standard procedure, the tokenizer recognizes the "-" characters as separate tokens, yet the Doc's text is identical to the input text, and the spaCy NER model correctly recognizes the DATE entity.
import spacy

# nlp is assumed to be a loaded English pipeline, e.g. spacy.load("en_core_web_sm")
doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)
tokens = [str(token) for token in doc]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022-12-24
On the other hand, if I create the Doc from the model's vocab and the previously computed tokens, the result is different. Note that, for simplicity, I am using the tokens from doc, so I'm sure there are no differences between the tokens. Also note that I am manually running each pipeline component on the doc in the correct order, so at the end of this process I should theoretically get the same results.
However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different: there are blank spaces between the digits and the date separators.
doc2 = Doc(nlp.vocab, words=tokens)
# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)
Output:
what time will sunset be on 2022 - 12 - 24 ?
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022 - 12 - 24
I know it must be something silly that I'm missing, but I can't figure out what.
Could someone please explain to me what I'm doing wrong and point me in the right direction?
Thanks a lot in advance!
EDIT
Following Talha Tayyab's suggestion, I had to create an array of booleans with the same length as my list of tokens, indicating for each token whether it is followed by a space. Then I pass this array to the Doc constructor as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).
To compute this list of boolean values from my original text string and list of tokens, I implemented the following vanilla function:
from typing import List

def get_spaces(self, text: str, tokens: List[str]) -> List[bool]:
    # Spaces
    spaces = []
    # Copy text so it is easy to operate on
    t = text.lower()
    # Iterate over tokens
    for token in tokens:
        if t.startswith(token.lower()):
            t = t[len(token):]  # Remove token
            # If after removing the token we have a leading space
            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space
            else:
                spaces.append(False)
    return spaces
With these two improvements in my code, the result is as expected. However, now I have the following question:
Is there a more spaCy-like way to compute the whitespace, instead of using my vanilla implementation?
Please try this:
from spacy.tokens import Doc
doc2 = Doc(nlp.vocab, words=tokens, spaces=[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)
# You can also replace 0 with False and 1 with True
This is the complete syntax:
doc = Doc(nlp.vocab, words=words, spaces=spaces)
spaces is a list of boolean values indicating whether each word is followed by a space. It must have the same length as words, if specified. It defaults to a sequence of True values.
So you can choose which tokens should be followed by a space and which should not.
Reference: https://spacy.io/api/doc
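To make the spaces argument concrete, here is a minimal, self-contained sketch; the blank English pipeline is only used to get a vocab, and the words/spaces values mirror the question's example:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # any pipeline's vocab would do here

words = ["What", "time", "will", "sunset", "be", "on",
         "2022", "-", "12", "-", "24", "?"]
spaces = [True, True, True, True, True, True,
          False, False, False, False, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
# What time will sunset be on 2022-12-24?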
Late to this, but as you've retrieved the tokens from a document to begin with, I think you can just use the whitespace_ attribute of the token for this. Then your get_spaces function looks like:
def get_spaces(tokens):
    return [1 if token.whitespace_ else 0 for token in tokens]
Note that this won't work nicely if there are multiple spaces or non-space whitespace (e.g. tabs); in that case you probably need to update the tokenizer, or use your existing solution and update this part:
if len(t) > 0 and t[0] == " ":
    spaces.append(True)
    t = t[1:]  # Remove space
to check for generic whitespace and remove more than just a leading space.
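For completeness, if you only have plain-string tokens and the text may contain tabs or runs of spaces, a slightly more defensive variant of the original helper could walk the text by character offsets. This is just a sketch; get_spaces_robust is a made-up name, not a spaCy API:
from typing import List

def get_spaces_robust(text: str, tokens: List[str]) -> List[bool]:
    # Locate each token in order in the original text and flag whether
    # any whitespace character (space, tab, newline, ...) follows it.
    spaces = []
    pos = 0
    for token in tokens:
        start = text.find(token, pos)
        if start == -1:
            # Token not found verbatim; keep lengths aligned anyway.
            spaces.append(False)
            continue
        pos = start + len(token)
        spaces.append(pos < len(text) and text[pos].isspace())
    return spaces

print(get_spaces_robust("What time\twill sunset", ["What", "time", "will", "sunset"]))
# [True, True, True, False]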
Question
I have a data frame with more than 90,000 rows and a column ['text'] that contains the text of some news articles.
The texts average about 3,000 words, and applying word_tokenize makes it very slow. What would be a more efficient way to do this?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize keeps some punctuation and other characters that I don't want, so I created a function to filter them out using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if (~token.is_punct or ~token.is_stop
                or ~(token.text.lower() in spanish_stopwords)
                or ~(token.text.lower() in otherCharacters)
                or ~(token.text.lower() in STOP_WORDS)):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Is there any other, better method to do this?
Thanks for reading my maybe noob👨🏽💻 question😀, have a nice day🌻.
Clarifications
nlp is defined earlier:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
I'm using spaCy to tokenize, but I'm also using the NLTK stop words for Spanish.
If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")
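For example, a rough sketch of tokenizing the whole column with a blank pipeline and nlp.pipe (the DataFrame and the 'texto' column name are taken from the question; the batch size is arbitrary):
import spacy

nlp = spacy.blank("es")  # tokenizer only, no tagger/parser/NER

# Tokenize all rows in batches instead of calling nlp() row by row.
df["tokenized_text"] = [
    [token.text for token in doc]
    for doc in nlp.pipe(df["texto"].astype(str), batch_size=1000)
]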
To make spaCy faster when you only want to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy gives you a full language pipeline that doesn't merely tokenize your sentence but also does parsing, POS tagging and NER. Most of the computation time is actually spent on those other tasks (parse tree, POS, NER) and not on tokenization, which is a much 'lighter' task computationally.
But, as you can see, spaCy lets you use only what you actually need, and by doing that it saves you some time.
Another thing: you can make your function more efficient by lowercasing each token only once and by adding the stop words to spaCy (and even if you don't want to do that, the fact that otherCharacters is a list and not a set is not very efficient).
I would also add this:
for w in stopwords.words('spanish'):
nlp.vocab[w].is_stop = True
for w in otherCharacters:
nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
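Putting the pieces together, here is a hedged sketch of how the trimmed-down pipeline, the registered stop words and the simplified filter might be applied to the whole column with nlp.pipe (it reuses the question's df['texto']; adjust the names to your data):
import spacy
from nltk.corpus import stopwords
from spacy.lang.es.stop_words import STOP_WORDS

nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

# Register all extra stop words once, on the vocab (a set is faster than a list).
otherCharacters = {'`', '�', ' ', '\xa0'}
for w in set(stopwords.words('spanish')) | set(STOP_WORDS) | otherCharacters:
    nlp.vocab[w].is_stop = True

def tokenize(doc):
    # Lowercase each kept token exactly once.
    return [t.text.lower() for t in doc if not t.is_punct and not t.is_stop]

df["tokenized_text"] = [
    tokenize(doc) for doc in nlp.pipe(df["texto"].astype(str), batch_size=500)
]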
I have seen some ways to create a custom tokenizer, but I am a little confused. I am using the PhraseMatcher to match patterns. However, it would match a 4-digit number pattern, say 1234, inside 111-111-1234, since the tokenizer splits on the dash.
All I want to do is modify the current tokenizer (from nlp = English()) and add a rule that it should not split on certain characters, but only for numeric patterns.
To do this you will need to overwrite spaCy's default infix tokenization scheme with your own, by modifying the infix patterns that spaCy uses by default.
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])
# modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\*^](?=[0-9-])",  # Remove the hyphen
        r"(?<=[{al}{q}])\.?(?=[{au}{q}])".format(  # Make the dot optional
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("111-222-1234 for abcDE")
print([t.text for t in doc])
Output
With default tokenizer:
['111', '-', '222', '-', '1234', 'for', 'abcDE']
With custom tokenizer:
['111-222-1234', 'for', 'abc', 'DE']
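Since the question mentions the PhraseMatcher, here is a short hedged sketch (spaCy v3 matcher API, my own example sentence) showing the effect: with the custom infixes above, a 4-digit pattern can no longer match inside the now-intact phone-number token:
from spacy.matcher import PhraseMatcher

# Reuses the nlp object with the custom infix patterns from above.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("FOUR_DIGITS", [nlp.make_doc("1234")])

doc = nlp("Call 111-222-1234 or extension 1234")
print([doc[start:end].text for _, start, end in matcher(doc)])
# Expected: ['1234']  (only the standalone number matches)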
I'm new to Python. I have a big data set from Twitter and I want to tokenize it.
But I don't know how to tokenize phrasal verbs like "look for", "take off", "grow up", etc., and this is important to me.
My code is:
>>> from nltk.tokenize import word_tokenize
>>> s = "I'm looking for the answer"
>>> word_tokenize(s)
['I', "'m", 'looking', 'for', 'the', 'answer']
My data set is big, so I can't use the code from this page:
Find multi-word terms in a tokenized text in Python
So, how can I solve my problem?
You need to use part-of-speech tags for that, or actually dependency parsing would be more accurate. I haven't tried it with NLTK, but with spaCy you can do it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
def chunk_phrasal_verbs(lemmatized_sentence):
    ph_verbs = []
    for word in nlp(lemmatized_sentence):
        if word.dep_ == 'prep' and word.head.pos_ == 'VERB':
            ph_verb = word.head.text + ' ' + word.text
            ph_verbs.append(ph_verb)
    return ph_verbs
I also suggest first lemmatizing the sentence to get rid of conjugations. Also, if you need noun phrases, you can use the compound relation in a similar way.
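A quick usage sketch; the exact output depends on the model version and on how the parser attaches the preposition, but for a sentence like the one in the question it would look roughly like this:
print(chunk_phrasal_verbs("I look for the answer"))
# e.g. ['look for']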
I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as single tokens.
Here's my code:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en', vectors=False, parser=False, entity=False)
nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
The output is:
['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']
If I put spaces around the tags, like this:
doc = nlp('Hello, <i> world </i> !')
The output is as I want it:
['Hello', ',', '<i>', 'world', '</i>', '!']
but I'd like to avoid complicated pre-processing of the HTML.
Any idea how I can approach this?
You need to create a custom Tokenizer.
Your custom Tokenizer will be exactly like spaCy's tokenizer, but with the '<' and '>' symbols removed from the prefixes and suffixes, and with one new prefix and one new suffix rule added.
Code:
import spacy
from spacy.tokens import Token
Token.set_extension('tag', default=False)
def create_custom_tokenizer(nlp):
    from spacy import util
    from spacy.tokenizer import Tokenizer
    from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
    prefixes = nlp.Defaults.prefixes + ('^<i>',)
    suffixes = nlp.Defaults.suffixes + ('</i>$',)
    # remove the tag symbols from prefixes and suffixes
    prefixes = list(prefixes)
    prefixes.remove('<')
    prefixes = tuple(prefixes)
    suffixes = list(suffixes)
    suffixes.remove('>')
    suffixes = tuple(suffixes)
    infixes = nlp.Defaults.infixes
    rules = nlp.Defaults.tokenizer_exceptions
    token_match = TOKEN_MATCH
    prefix_search = util.compile_prefix_regex(prefixes).search
    suffix_search = util.compile_suffix_regex(suffixes).search
    infix_finditer = util.compile_infix_regex(infixes).finditer
    return Tokenizer(nlp.vocab, rules=rules,
                     prefix_search=prefix_search,
                     suffix_search=suffix_search,
                     infix_finditer=infix_finditer,
                     token_match=token_match)
nlp = spacy.load('en_core_web_sm')
tokenizer = create_custom_tokenizer(nlp)
nlp.tokenizer = tokenizer
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
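Assuming the prefix/suffix removal succeeds with your pipeline's defaults, the printed tokens should match the output the question asks for, i.e. ['Hello', ',', '<i>', 'world', '</i>', '!'].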
For the record, it might be that this has become easier: with current versions of spaCy, you don't have to create a custom tokenizer anymore. It suffices to 1. extend the infixes (to ensure the tags are separated from adjacent words), and 2. add the tags as special cases:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_trf")
infixes = nlp.Defaults.infixes + [r'(<)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case("<i>", [{ORTH: "<i>"}])
nlp.tokenizer.add_special_case("</i>", [{ORTH: "</i>"}])
text = """Hello, <i>world</i> !"""
doc = nlp(text)
print([e.text for e in doc])
Prints:
['Hello', ',', '<i>', 'world', '</i>', '!']
(This is more or less a condensed version of https://stackoverflow.com/a/66268015/1016514)