I want to implement lemmatization with the spaCy package.
Here is my code:
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# regex that keeps only tokens of two or more word characters
regexp = re.compile('(?u)\\b\\w\\w+\\b')

en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
# replace spaCy's tokenizer so only the regex matches become tokens
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')
But this error message occurred when I ran the code:
C:\Users\yu\Anaconda3\lib\runpy.py:193: DeprecationWarning: Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[...])
"__main__", mod_spec)
How can I solve this problem?
To customize spaCy's tokenizer, you pass it the word that needs custom tokenization together with a list of dictionaries describing the orths (and optionally lemmas) it should be split into. Here's the example code from the docs:
from spacy.attrs import ORTH, LEMMA
case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
nlp.tokenizer.add_special_case("don't", case)
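As a quick check (a sketch, not from the original answer; it reuses the case list from above), you can see how the special case tokenizes and lemmatizes:
import spacy
nlp = spacy.load('en')
nlp.tokenizer.add_special_case("don't", case)
# the "n't" token should now carry the lemma "not"
print([(t.text, t.lemma_) for t in nlp("I don't know")])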
If you're doing all this because you want to make a custom lemmatizer, you might be better off just creating a custom lemma list directly. You'd have to modify spaCy's own language data, but the format is pretty simple:
"dustiest": ("dusty",),
"earlier": ("early",),
"earliest": ("early",),
"earthier": ("earthy",),
...
Those files live in spaCy's English language data (under spacy/lang/en in the source tree).
I think that your code runs fine, you are just getting a DeprecationWarning, which is not really an error.
Following the advice given by the warning, I think you can modify your code by substituting
from spacy.tokens import Doc
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))
and that should run fine with no warnings (it does today on my machine).
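For completeness, a minimal sketch of the whole snippet with that substitution (same names as in the question):
import re
import spacy
from spacy.tokens import Doc
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en')
# build a Doc directly from the regex matches instead of the deprecated tokens_from_list
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')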
I am trying to split sentences into clauses using spaCy, for classification with an ML lib. I have narrowed it down to one of two solutions that I consider the best way to approach this, but haven't had much luck with either.
Option 1: Use the tokens in the doc, i.e. token.pos_, find those that match SCONJ, and split there as a new sentence.
Option 2: Create a list from whatever dictionary of values spaCy uses to identify SCONJ tokens.
The issue with option 1 is that I only have .text and .i, and no .pos_, because (as far as I am aware) the custom boundaries component needs to run before the parser.
The issue with option 2 is that I can't seem to find that dictionary. It is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

# attempted option 1: mark a sentence break at SCONJ tokens (not added to the pipeline)
def set_sentence_breaks(doc):
    for token in doc:
        if token == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            # close the current clause just before the SCONJ token
            out.append(doc[last:tok.i])
            last = tok.i + 1
    # append whatever is left of the sentence (sent.end is one past its last token)
    out.append(doc[last:sent.end])
Alternately, if that's not good enough, you can identify subsentences using the dependency parse to find verbs in subsentences (by their relation to SCONJ, for example), saving the subsentences, and then adding another sentence based on the root.
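A rough sketch of that dependency-based variant (not from the original answer; it assumes the SCONJ token's head is the verb of the subordinate clause):
# pull out clauses via the dependency parse instead of is_sent_start
clauses = []
for sent in doc.sents:
    for tok in sent:
        if tok.pos_ == "SCONJ":
            verb = tok.head                    # verb the conjunction attaches to
            subtree = list(verb.subtree)       # tokens of that subordinate clause
            clauses.append(doc[subtree[0].i:subtree[-1].i + 1])
    # keep the full sentence as well, so the main clause is never lost
    clauses.append(sent)
print([c.text for c in clauses])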
I have a list of words and I'm trying to turn plural words into their singular form in Python, then I remove the duplicates. This is how I do it:
import spacy

nlp = spacy.load('fr_core_news_md')
words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = []

for word in words:
    doc = nlp(word)
    for token in doc:
        clean_words.append(token.lemma_)

clean_words = list(set(clean_words))
This is the output :
['animal', 'janvier', 'poule', 'adresse']
It works well, but my problem is that 'fr_core_news_md' takes a little too long to load, so I was wondering if there was another way to do this?
The task you are trying to do is called lemmatization, and it does more than just converting plural to singular: it strips inflections and returns the canonical form of a word, for example the infinitive form of a verb.
If you want to use spaCy, you can make it load quicker by using the disable parameter.
For example: spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger']).
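As a quick sketch (the pipe names above are the usual defaults and may differ depending on the model version), the original snippet with the lighter load:
import spacy

# disable pipes that are not needed for lemmatization here
nlp = spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger'])

words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = list({token.lemma_ for word in words for token in nlp(word)})
print(clean_words)  # e.g. ['animal', 'janvier', 'poule', 'adresse'] (order may differ)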
Alternatively, you can use TreeTagger, which is kind of hard to install but works great.
Or the FrenchLefffLemmatizer.
I trained a NER model with spaCy 3. I would like to add a custom component (add_regex_match) to the pipeline for the NER task, with the aim of improving the existing NER results.
This is the code I want to implement:
import spacy
from spacy.language import Language
from spacy.tokens import Span
import re

nlp = spacy.load(r"\src\Spacy3\ner_spacy3_hortisem\training\ml_rule_model")

@Language.component("add_regex_match")
def add_regex_entities(doc):
    new_ents = []

    label_z = "Zeit"
    regex_expression_z = r"^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Januar|März|Mai|Juli|August|Oktober|Dezember)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Januar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Februar))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Januar|Februar|März|April|Mai|Juni|Juli|August|September))|(?:1[0-2]|(?:Oktober|November|Dezember)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    for match in re.finditer(regex_expression_z, doc.text):  # find match in text
        start, end = match.span()  # character offsets of the match
        entity = doc.char_span(start, end, label=label_z)  # map char offsets to a token Span
        if entity is not None:
            new_ents.append(entity)

    label_b = "BBCH_Stadium"
    regex_expression_b = r"BBCH(\s?\d+)\s?(\/|\-|(bis)?)\s?(\d+)?"
    for match in re.finditer(regex_expression_b, doc.text):  # find match in text
        start, end = match.span()  # character offsets of the match
        entity = doc.char_span(start, end, label=label_b)
        if entity is not None:
            new_ents.append(entity)

    doc.ents = new_ents
    return doc

nlp.add_pipe("add_regex_match", after="ner")
nlp.to_disk("./training/ml_rule_regex_model")

doc = nlp("20/03/2021 8 März 2021 BBCH 15, Fliegen, Flugbrand . Brandenburg, in Berlin, Schnecken, BBCH 13-48, BBCH 3 bis 34")
print([(ent.text, ent.label_) for ent in doc.ents])
When I want to evaluate the saved model ml_rule_regex_model using the command line python -m spacy project run evaluate, I get the error:
'ValueError: [E002] Can't find factory for 'add_regex_match' for language German (de). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).'
How should I do this? Has anyone had experience with it? Thank you very much for your tips.
When I want to evaluate the saved model ml_rule_regex_model using the command line python -m spacy project run evaluate, I get the error...
You haven't included the project.yml of your spaCy project, where the evaluate command is defined. I will assume it calls spacy evaluate? If so, that command has a --code or -c flag for providing a path to a Python file with additional code, such as registered functions. If you provide this file and point it to the definition of your new add_regex_match component, spaCy will be able to parse the configuration file and use the model.
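For example (a sketch; the file name functions.py and the paths are placeholders, and the exact arguments depend on your project.yml):
# functions.py -- keeps the component registration importable for the CLI
from spacy.language import Language

@Language.component("add_regex_match")
def add_regex_entities(doc):
    # ... same regex-matching body as in the snippet above ...
    return doc

# then evaluate with the extra code loaded, e.g.:
#   python -m spacy evaluate ./training/ml_rule_regex_model ./corpus/test.spacy --code functions.py
# or add --code to the evaluate command inside project.yml so that
# `python -m spacy project run evaluate` picks it up.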
Is it possible to leave the token text true-cased, but force the lemmas to be lowercased? I am interested in this because I want to use the PhraseMatcher, where I run an input text through the pipeline and then search for matching phrases in that text, where each search query can be case sensitive or not. In the case that I search by LEMMA, I'd like the search to be case insensitive by default.
e.g.
doc = nlp(text)
for query in queries:
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(self.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)
In case 1, I expect matching to be case insensitive, and if there were something in the spaCy library to enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc, and forcing one to have all lowercased characters.
This part of spaCy changes from version to version; the last time I looked at the lemmatization was a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:
# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline (spaCy v2 style: the function is passed directly)
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging info, so I am not sure at exactly what point it is called, but placing your pipe after tagger is safe: all the lemmas should be figured out by then.
Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method, but this is likely to be quite invasive, as you will need to figure out how (and where) to plug in your own lemmatizer.
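If you are on spaCy v3, the same idea needs the component registered by name first (a sketch; it assumes the pipeline has a pipe called "lemmatizer", as the stock pipelines do):
from spacy.language import Language

@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# in v3 the lemmatizer is its own pipe, so add this right after it
nlp.add_pipe("lower_case_lemmas", after="lemmatizer")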
I would like to lemmatize some textual data in Hungarian and have encountered a strange feature in spaCy. The token.lemma_ attribute works well in terms of lemmatization; however, it returns some of the sentences without first-letter capitalization. This is quite annoying, as my next function, unnest_stences (in R), requires capitalized first letters in order to identify and break the text down into individual sentences.
First I thought the problem was that I used the latest version of spaCy, since I had gotten a warning that
UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1
and is incompatible with the current spaCy version (2.3.2). This may
lead to unexpected results or runtime errors. To resolve this,
download a newer compatible model or retrain your custom model with
the current spaCy version.
So I went ahead and installed spacy 2.1, but the problem still persists.
The source of my data are some email messages I cannot share here, but here is a small, artificial example:
# pip install -U spacy==2.1     # takes 9 mins
# pip install hu_core_ud_lg     # takes 50 mins

import spacy
from spacy.lemmatizer import Lemmatizer
import hu_core_ud_lg
import pandas as pd

nlp = hu_core_ud_lg.load()

a = "Tisztelt levélíró!"
b = "Köszönettel vettük megkeresését."
df = pd.DataFrame({'text': [a, b]})

output_lemma = []
for i in df.text:
    mondat = ""
    doc = nlp(i)
    for token in doc:
        mondat = mondat + " " + token.lemma_
    output_lemma.append(mondat)

output_lemma
which yields
[' tisztelt levélíró !', ' köszönet vesz megkeresés .']
but I would expect
[' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']
When I pass my original data to the function, it returns some sentences with uppercase first letters and others with lowercase ones. For some strange reason I couldn't reproduce that pattern above, but I guess the main point is visible: the function does not work as expected.
Any ideas how I could fix this?
I'm using Jupyter Notebook, Python 2.7, Win 7 and a Toshiba laptop (Portégé Z830-10R i3-2367M).
Lowercasing is the expected behavior of spaCy's lemmatizer for non-proper-noun tokens.
One workaround is to check whether each token is title-cased and restore the original casing after lemmatizing (this only applies to the first character).
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'This is a test sentence.'
doc = nlp(text)
newtext = ' '.join([tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc])
print(newtext)
# This be a test sentence .
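If you want this applied to every document automatically, here is a sketch of the same idea wrapped in a pipeline component (spaCy v2 style, to match the hu_core_ud_lg / spaCy 2.1 setup from the question):
# restore title case on lemmas right after they are assigned
def restore_title_case(doc):
    for token in doc:
        if token.is_title:
            token.lemma_ = token.lemma_.title()
    return doc

nlp.add_pipe(restore_title_case, last=True)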