Use spacy on pretokenized text - python

I want to use spaCy to process text that is already pre-tokenized. Passing a list of tokens to spaCy does not work:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp(["This", "is", "a", "sentence"])
This gives a TypeError (which makes sense):
TypeError: Argument 'string' has incorrect type (expected str, got list)
I could replace the tokenizer with a custom one, but I feel like that would overcomplicate things and is not the preferred way.
Thank you for your help :D

You can use this method:
tokens = ["This", "is", "a", "sentence"]
sentence = nlp.tokenizer.tokens_from_list(tokens)
print(sentence)
This is a sentence

As of spaCy 3.0+, nlp.tokenizer.tokens_from_list() has been deprecated. Use the Doc object instead.
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
sent = ["This", "is", "a", "sentence"]
doc = Doc(nlp.vocab, sent)
for token in nlp(doc):
    print(token.text, token.pos_)

If you use:
sentence = nlp.tokenizer.tokens_from_list(tokens)
together with spacy.matcher.Matcher, you'll get an error:
Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of list(nlp.tokenizer.pipe()).
The way I solved it: I iterate over each item in a for loop:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'LEMMA': 'sentence', 'POS': 'NOUN'}]
matcher.add('Searched Word', None, pattern)
X = ["Sentence one", "Sentence two", "Sentence three", "sentence last !"]
for i in X.index:
doc = nlp(X[i])
matches = matcher(doc)
for match_id, start, end in matches:
matched_span = doc[start:end]
print(matched_span.text)
A better way of doing this is to use nlp.pipe:
for doc in nlp.pipe(X):
    print([token.text for token in doc])
nlp.pipe is also faster and more efficient when processing many texts.
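For pre-tokenized input, as in the original question, a minimal sketch (assuming spaCy 3 and its matcher.add() signature; the "SearchedWord" key and example sentences are only illustrations) is to build the Doc objects yourself, run them through the pipeline so the POS and LEMMA attributes the pattern needs are set, and only then match:

import spacy
from spacy.tokens import Doc
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("SearchedWord", [[{'LEMMA': 'sentence', 'POS': 'NOUN'}]])  # spaCy 3 add() signature

pretokenized = [["This", "is", "a", "sentence"], ["Another", "sentence", "here"]]
for words in pretokenized:
    # Build a Doc from the token list, then run the pipeline on it (skips the tokenizer)
    doc = nlp(Doc(nlp.vocab, words=words))
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)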
Hope this helps. Thank you.

Related

How to remove stop words and lemmatize at the same time when using spaCy?

When I use spaCy for cleaning data, I run the following line:
df['text'] = df.sentence.progress_apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha))
Which lemmatizes each word in the text row if the word is not a stop-word. The problem is that token.lemma_ is applied to the token after the token is checked for being a stop-word or not. Therefore, if the stop-word is not in its lemmatized form, it will not be considered a stop word. For example, if I add "friend" to the list of stop words, the output will still contain "friend" if the original token was "friends". The easy solution is to run this line twice. But that sounds silly. Can anyone suggest a solution to remove the stop words that are not in lemmatized form in the first run?
Thanks!
You can simply check whether token.lemma_ is present in nlp.Defaults.stop_words:
if token.lemma_.lower() not in nlp.Defaults.stop_words
For example:
df['text'] = df.sentence.progress_apply(
    lambda text: " ".join(
        token.lemma_ for token in nlp(text)
        if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha
    )
)
See a quick test:
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.Defaults.stop_words.add("friend") # Adding "friend" to stopword list
>>> text = "I have a lot of friends"
>>> " ".join(token.lemma_ for token in nlp(text) if not token.is_stop and token.is_alpha)
'lot friend'
>>> " ".join(token.lemma_ for token in nlp(text) if token.lemma_.lower() not in nlp.Defaults.stop_words and token.is_alpha)
'lot'
If you add words in uppercase to the stopword list, you will need to use if token.lemma_.lower() not in map(str.lower, nlp.Defaults.stop_words).
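Since map(str.lower, ...) is re-evaluated for every token, a minimal alternative sketch (my own variant, not part of the original answer) is to build a lowercased stopword set once up front:

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("FRIEND")  # uppercase entry, just for illustration

# Lowercase the stopword set once instead of once per token
stop_words_lower = {w.lower() for w in nlp.Defaults.stop_words}

text = "I have a lot of friends"
print(" ".join(
    token.lemma_ for token in nlp(text)
    if token.lemma_.lower() not in stop_words_lower and token.is_alpha
))
# 'friend' is filtered out even though the stopword was added as 'FRIEND'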

spaCy matcher returns the right answer only when the two words are set as separate 'TEXT' conditions. Why is that?

I'm trying to set a matcher finding word 'iPhone X'.
The sample code says I should do the following:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach, like below:
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why is the second approach not working? I assumed that if I put the two words 'iPhone' and 'X' together, it might work the same way, because it would regard the words with a space in the middle as one long unique word. But it didn't.
The only possible reason I could think of is that a matcher condition should be a single word without any whitespace.
Am I right? Or is there another reason the second approach is not working?
Thank you.
The answer is in how spaCy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you can see, iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
Thus, you cannot use them both in one token definition.
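If you do want to write the pattern as a single phrase string, PhraseMatcher matches on Doc patterns instead of per-token dictionaries; a minimal sketch (using the spaCy 3 add() signature):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
# The pattern string is tokenized too, so "iPhone X" becomes the token sequence ["iPhone", "X"]
matcher.add("IPHONE_X_PATTERN", [nlp.make_doc("iPhone X")])

doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
print("Matches:", [doc[start:end].text for match_id, start, end in matcher(doc)])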

Spacy tokenizer with only "Whitespace" rule

I would like to know whether the spaCy tokenizer can tokenize words using only the "space" rule.
For example:
sentence= "(c/o Oxford University )"
Normally, using the following configuration of spacy:
nlp = spacy.load("en_core_news_sm")
doc = nlp(sentence)
for token in doc:
    print(token)
the result would be:
(
c
/
o
Oxford
University
)
Instead, I would like an output like the following (using spacy):
(c/o
Oxford
University
)
Is it possible to obtain a result like this using spacy?
Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]
You can further adjust Tokenizer by adding custom suffix, prefix, and infix rules.
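For example, a minimal sketch (assuming the compile helpers in spacy.util) that builds a Tokenizer from the model's default prefix, suffix and infix rules, which you can then trim or extend:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Start from the defaults and customise these lists before compiling
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)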
An alternative, more fine-grained way would be to find out why the it's token is split the way it is, using nlp.tokenizer.explain():
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)
You'll find out that the split is due to SPECIAL rules:
[('TOKEN', 'This'),
('TOKEN', 'is'),
('SPECIAL-1', 'it'),
('SPECIAL-2', "'s"),
('SUFFIX', '.'),
('SPECIAL-1', 'I'),
('SPECIAL-2', "'m"),
('TOKEN', 'fine')]
These exceptions could be updated to remove "it's", like:
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]
or to remove splitting on apostrophes altogether:
filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]
Note the dot attached to the token it's., which is because no suffix rules were specified for the new Tokenizer.
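If you want that trailing dot split off as well, a minimal sketch (my addition, continuing the snippet above) is to pass the default suffix rules back in alongside the filtered exceptions:

from spacy.util import compile_suffix_regex

# Keep the filtered exceptions, but restore the default suffix splitting (e.g. trailing ".")
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions, suffix_search=suffix_re.search)
print([tok for tok in nlp(text)])
# expected roughly: [This, is, it's, ., I'm, fine]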
You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a function that takes a string text and returns a Doc object, and then assign that callable function to nlp.tokenizer:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])

Accessing out of range word in spaCy doc : why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the index out of range exception I'm expecting to get because there's nothing before the match. I didn't expect it to work, but it did and is returning 'dog', which is after the match... and now I'm confused.
It looks like spaCy uses a circular list (or deque I think) ?
This needs a language model to run; if you'd like to reproduce it, you can install the model with the following command line:
python -m spacy download en_core_web_md
And this is the code
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding statistical predictions
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_)  # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on ?
Thanks !
You are looking up the token at index 0 - 1, which evaluates to -1, and in Python an index of -1 refers to the last element, so you get the last token of the Doc.
I recommend using the Token.nbor method to look for the first token before the span, and if no previous token exists, making it None or an empty string.
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding statistical predictions
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")

Negate a word inside a pattern Python & spaCy

I have this sentence:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
All I want is to make sure the word 'not' does not occur between 'will' and 'be' in my text. Here is my code:
pattern = [{'LOWER':'purchase'},{'IS_SPACE':True, 'OP':'*'},{'LOWER':'order'},{'IS_SPACE':True, 'OP':'*'},{"IS_ASCII": True, "OP": "*"},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
I am using this:
{'LOWER':'not', 'OP':'!'}
Any idea why it is not working?
Your code example seems to be missing a statement that actually performs the match, so I added a matcher.add() call that also reports a match by calling the self-defined function on_match.
But more importantly, I had to change your pattern by leaving out the space part {'IS_SPACE':True, 'OP':'*'} to get a match.
Here's my working code that gives me a match:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
def on_match(matcher, doc, id, matches):  # Added!
    print("match")
# Changing your pattern for example to:
pattern = [{'LOWER':'purchase'},{'LOWER':'order'},{'LOWER':'expenditures'},{'LOWER':'not', 'OP':'!'},{'LEMMA':'be'},{'LEMMA':'freeze'}]
matcher.add("ID_A1", on_match, pattern) # Added!
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
matches = matcher(doc)
print(matches)
If I replace:
doc = nlp(u'Non-revenue-generating purchase order expenditures will be frozen.')
with:
doc = nlp(u'Non-revenue-generating purchase order expenditures will not be frozen.')
I don't get a match anymore!
I reduced the complexity of your pattern - maybe too much. But I hope I could still help a bit.
Check this:
{"TEXT": {"NOT_IN": ["not"]}}
See https://support.prodi.gy/t/negative-pattern-matching-regex/1764
