Spacy tokenizer with only "Whitespace" rule - python

I would like to know whether the spaCy tokenizer can tokenize words using only the "space" rule.
For example:
sentence= "(c/o Oxford University )"
Normally, using the following configuration of spacy:
nlp = spacy.load("en_core_news_sm")
doc = nlp(sentence)
for token in doc:
    print(token)
the result would be:
(
c
/
o
Oxford
University
)
Instead, I would like an output like the following (using spacy):
(c/o
Oxford
University
)
Is it possible to obtain a result like this using spacy?

Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]
You can further adjust the Tokenizer by adding custom suffix, prefix, and infix rules.
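For example, here is a rough sketch of compiling your own rules and handing them to the Tokenizer (the patterns below are purely illustrative, not spaCy's defaults):
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex
nlp = spacy.load('en_core_web_sm')
# illustrative rule sets: split an opening parenthesis as a prefix,
# a closing parenthesis as a suffix, and a hyphen between letters as an infix
prefixes = [r'\(']
suffixes = [r'\)']
infixes = [r'(?<=[a-z])-(?=[a-z])']
nlp.tokenizer = Tokenizer(nlp.vocab,
                          rules=nlp.Defaults.tokenizer_exceptions,
                          prefix_search=compile_prefix_regex(prefixes).search,
                          suffix_search=compile_suffix_regex(suffixes).search,
                          infix_finditer=compile_infix_regex(infixes).finditer)
print([tok.text for tok in nlp("(c/o Oxford University )")])
# roughly: ['(', 'c/o', 'Oxford', 'University', ')']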
An alternative, more fine-grained way would be to find out why the "it's" token is split the way it is, using nlp.tokenizer.explain():
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)
You'll find out that the split is due to SPECIAL rules:
[('TOKEN', 'This'),
('TOKEN', 'is'),
('SPECIAL-1', 'it'),
('SPECIAL-2', "'s"),
('SUFFIX', '.'),
('SPECIAL-1', 'I'),
('SPECIAL-2', "'m"),
('TOKEN', 'fine')]
These exceptions can be updated to remove "it's":
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]
or remove split on apostrophe altogether:
filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]
Note the dot attached to the "it's." token, which is because no suffix rules were specified.
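If you want the trailing punctuation split off again, one option (a sketch, reusing the default suffix patterns) is to pass a suffix_search alongside the filtered exceptions:
from spacy.util import compile_suffix_regex
suffix_search = compile_suffix_regex(nlp.Defaults.suffixes).search
nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions, suffix_search=suffix_search)
[tok for tok in nlp(text)]
# roughly: [This, is, it's, ., I'm, fine]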

You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a function that takes a string text and returns a Doc object, and then assign that callable function to nlp.tokenizer:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
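Because the split is purely on spaces, punctuation stays attached to the words, so this should print something like:
["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.']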

Related

Is it possible to merge a list of spacy tokens into a doc

I have a document which I've tokenized using the spaCy tokenizer.
I want to apply NER on a sequence of tokens (a section of this document).
Currently I'm creating a doc first and then applying NER:
nlp = spacy.load("en_core_web_sm")
# tokens_list is a list of Spacy tokens
words = [tok.text for tok in tokens_list]
spaces = [True if tok.whitespace_ else False for tok in tokens_list]
doc = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)
But this is not ideal because I lose their original ids within the document, which is important.
Is there a way to merge tokens into a doc and still maintain their ids (including other future extensions)?
To merge a list of tokens back into a Doc you may wish to try:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)
words = [tok.text for tok in doc]
spaces = [True if tok.whitespace_ else False for tok in doc]
doc2 = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
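If you then want entities (or other annotations) on the reconstructed Doc, a minimal sketch, assuming a standard loaded pipeline, is to push it through the pipeline components yourself; the tokenizer is skipped because the Doc already exists:
for name, proc in nlp.pipeline:
    doc2 = proc(doc2)
print([(ent.text, ent.label_) for ent in doc2.ents])
Note that token indices in doc2 start from zero again, so if you need positions relative to the original document you'd have to keep a separate mapping.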

Expected str instance, spacy.tokens.token.Token found

I am executing a data extraction use-case. To preprocess and tokenize my data, I am using both the spaCy English and German tokenizers, because the sentences are in both languages. Here's my code:
import spacy
from spacy.lang.de import German
from spacy.lang.en import English
from spacy.lang.de import STOP_WORDS as stp_wrds_de
from spacy.lang.en.stop_words import STOP_WORDS as stp_wrds_en
import string
punctuations = string.punctuation
# German Parser
parser_de = German()
# English Parser
parser_en = English()
def spacy_tokenizer_de(document):
    # Token object for splitting text into 'units'
    tokens = parser_de(document)
    # Lemmatization: Grammatical conversion of words
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    # Remove punctuations
    tokens = [word for word in tokens if word not in punctuations]
    tokens_de_str = converttostr(tokens, ' ')
    tokens_en = spacy_tokenizer_en(tokens_de_str)
    print("Tokens EN: {}".format(tokens_en))
    tokens_en = converttostr(tokens_en, ' ')
    return tokens_en

def converttostr(input_seq, separator):
    # Join all the strings in list
    final_str = separator.join(input_seq)
    return final_str

def spacy_tokenizer_en(document):
    tokens = parser_en(document)
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    return tokens
Here's a further elucidation of the above code:
1. spacy_tokenizer_de(): Method to parse and tokenize document in German
2. spacy_tokenizer_en(): Method to parse and tokenize document in English
3. converttostr(): Converts list of tokens to a string, so that the English spacy tokenizer can read the input (only accepts document/string format) and tokenize the data.
However, some sentences, when parsed, lead to the error in the title: expected str instance, spacy.tokens.token.Token found.
Why is a spaCy Token object coming up in these scenarios, while some of the sentences are processed successfully? Can anyone please help here?
token.lemma_.strip() if token.lemma_ != '-PRON-' else token.text for token in tokens
You're supposed to get a list of strings here, right? Instead, you sometimes return a string (when the lemma isn't '-PRON-'), but other times a Token object, not a string.
You may get a string from token.text.
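So a minimal fix is to fall back to the token's text, keeping the list homogeneous:
tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word.text for word in tokens]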

Spacy custom sentence segmentation on line break

I'm trying to split this document into paragraphs. Specifically, I would like to split the text whenever there is a line break (<br>).
This is the code I'm using, but it is not producing the results I hoped for:
nlp = spacy.load("en_core_web_lg")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print([sent.text for sent in doc.sents])
A similar solution could be achieved by using NLTK's TextTilingTokenizer, but I wanted to check whether there is anything similar within spaCy.
You're almost there, but the problem is that the default Tokenizer splits on '<' and '>', hence the condition token.text == "<br>" is never true. I'd add a space before and after <br>. E.g.:
import spacy
from spacy.symbols import ORTH
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp = spacy.load("en_core_web_sm")
text = "the quick brown fox<br>jumps over the lazy dog"
text = text.replace('<br>', ' <br> ')
special_case = [{ORTH: "<br>"}]
nlp.tokenizer.add_special_case("<br>", special_case)
nlp.add_pipe(set_custom_boundaries, first=True)
doc = nlp(text)
print([sent.text for sent in doc.sents])
Also take a look at this PR; after it's merged to master, it'll no longer be necessary to wrap <br> in spaces:
https://github.com/explosion/spaCy/pull/4259
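Note that add_pipe with a bare function is the spaCy v2 API; if you're on v3, a roughly equivalent sketch registers the component by name first:
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # mark the token after each <br> as a sentence start
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", first=True)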

CV Parser name matching

I am using NLP with Python to find names in a string. I am able to find a name if I have a full name (first name and last name), but if the string has only a first name my code is not able to recognize it as a Person. Below is my code.
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
string = """
Sriram is working as a python developer
"""
def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    #print(sentences)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            #print("Out Side ",chunk)
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    print("In Side ",chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    names = extract_names(string)
    print(names)
My advice is to use StanfordNLP or spaCy for NER; using NLTK's ne_chunk is a little janky. StanfordNLP is more commonly used by researchers, but spaCy is easier to work with. Here is an example using spaCy to print the name of each named entity and its type:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
Sriram ORG
>>>
Note that it classifies Sriram as an organization, which may be because it is not a common English name and spaCy is trained on English corpora. Good luck!
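If you need particular names to be recognized regardless of what the statistical model has learned, one workaround (a sketch, assuming spaCy v3, with a hard-coded pattern just for illustration) is to put an EntityRuler in front of the NER component:
import spacy
nlp = spacy.load('en_core_web_sm')
# the rule-based patterns win because the ruler runs before the statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PERSON", "pattern": "Sriram"}])
doc = nlp('Sriram is working as a python developer')
print([(ent.text, ent.label_) for ent in doc.ents])
# expected: [('Sriram', 'PERSON')]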

Tokenizing an HTML document

I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as a single token.
Here's my code:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en', vectors=False, parser=False, entity=False)
nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
The output is:
['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']
If I put spaces around the tags, like this:
doc = nlp('Hello, <i> world </i> !')
The output is as I want it:
['Hello', ',', '<i>', 'world', '</i>', '!']
but I'd like to avoid complicated pre-processing of the HTML.
Any idea how can I approach this?
You need to create a custom Tokenizer.
Your custom Tokenizer will be exactly like spaCy's tokenizer, but with the '<' and '>' symbols removed from the prefixes and suffixes, and with one new prefix and one new suffix rule added.
Code:
import spacy
from spacy.tokens import Token
Token.set_extension('tag', default=False)
def create_custom_tokenizer(nlp):
    from spacy import util
    from spacy.tokenizer import Tokenizer
    from spacy.lang.tokenizer_exceptions import TOKEN_MATCH
    prefixes = nlp.Defaults.prefixes + ('^<i>',)
    suffixes = nlp.Defaults.suffixes + ('</i>$',)
    # remove the tag symbols from prefixes and suffixes
    prefixes = list(prefixes)
    prefixes.remove('<')
    prefixes = tuple(prefixes)
    suffixes = list(suffixes)
    suffixes.remove('>')
    suffixes = tuple(suffixes)
    infixes = nlp.Defaults.infixes
    rules = nlp.Defaults.tokenizer_exceptions
    token_match = TOKEN_MATCH
    prefix_search = util.compile_prefix_regex(prefixes).search
    suffix_search = util.compile_suffix_regex(suffixes).search
    infix_finditer = util.compile_infix_regex(infixes).finditer
    return Tokenizer(nlp.vocab, rules=rules,
                     prefix_search=prefix_search,
                     suffix_search=suffix_search,
                     infix_finditer=infix_finditer,
                     token_match=token_match)
nlp = spacy.load('en_core_web_sm')
tokenizer = create_custom_tokenizer(nlp)
nlp.tokenizer = tokenizer
doc = nlp('Hello, <i>world</i> !')
print([e.text for e in doc])
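If the adjusted rules behave as intended, this should print the tags as single tokens:
['Hello', ',', '<i>', 'world', '</i>', '!']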
For the record, this may have become easier: with the current version of spaCy, you don't have to create a custom tokenizer anymore. It suffices to 1. extend the infixes (to ensure tags are separated from words), and 2. add the tags as special cases:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_trf")
infixes = nlp.Defaults.infixes + [r'(<)']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.tokenizer.add_special_case(f"<i>", [{ORTH: f"<i>"}])
nlp.tokenizer.add_special_case(f"</i>", [{ORTH: f"</i>"}])
text = """Hello, <i>world</i> !"""
doc = nlp(text)
print([e.text for e in doc])
Prints:
['Hello', ',', '<i>', 'world', '</i>', '!']
(This is more or less a condensed version of https://stackoverflow.com/a/66268015/1016514)
