I would like to lemmatize some textual data in Hungarian and have run into a strange behaviour in spaCy. The token.lemma_ attribute lemmatizes well, but it returns some sentences without their first letter capitalized. This is quite annoying, because my next function, unnest_sentences (in R), relies on initial capital letters to identify sentence boundaries and break the text down into individual sentences.
First I thought the problem was that I used the latest version of spaCy since I had gotten a warning that
UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1
and is incompatible with the current spaCy version (2.3.2). This may
lead to unexpected results or runtime errors. To resolve this,
download a newer compatible model or retrain your custom model with
the current spaCy version.
So I went ahead and installed spaCy 2.1, but the problem persists.
My data comes from some email messages I cannot share here, but here is a small, artificial example:
# pip install -U spacy==2.1   # takes 9 mins
# pip install hu_core_ud_lg   # takes 50 mins

import spacy
from spacy.lemmatizer import Lemmatizer
import hu_core_ud_lg
import pandas as pd

nlp = hu_core_ud_lg.load()

a = "Tisztelt levélíró!"
b = "Köszönettel vettük megkeresését."
df = pd.DataFrame({'text': [a, b]})

output_lemma = []
for i in df.text:
    mondat = ""
    doc = nlp(i)
    for token in doc:
        mondat = mondat + " " + token.lemma_
    output_lemma.append(mondat)

output_lemma
which yields
[' tisztelt levélíró !', ' köszönet vesz megkeresés .']
but I would expect
[' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']
When I pass my original data to the function, it returns some sentences with uppercase first letters and others all in lowercase. For some strange reason I couldn't reproduce that pattern above, but I think the main point is visible: the function does not behave as expected.
Any ideas how I could fix this?
I'm using Jupyter Notebook, Python 2.7, Win 7 and a Toshiba laptop (Portégé Z830-10R i3-2367M).
Lowercasing is the expected behavior of spaCy's lemmatizer for tokens that aren't proper nouns.
One workaround is to check whether each token is titlecased and restore the original casing after lemmatizing (this only affects the first character).
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'This is a test sentence.'
doc = nlp(text)
newtext = ' '.join([tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc])
print(newtext)
# This be a test sentence .
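If you want to keep the original casing in your Hungarian loop, here is a minimal sketch of the same workaround applied to the code from the question (with nlp being the hu_core_ud_lg pipeline and df the DataFrame defined there):

output_lemma = []
for i in df.text:
    doc = nlp(i)
    # restore title case on tokens that were titlecased in the original text
    mondat = " ".join(tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc)
    output_lemma.append(mondat)

output_lemma
# expected, based on the lemmas shown above: ['Tisztelt levélíró !', 'Köszönet vesz megkeresés .']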
Related
I am trying to split sentences into clauses using spaCy, for classification with MLlib. I have come up with two possible approaches that I consider the best ways to tackle this, but I haven't had much luck with either.
Option 1: use the tokens in the doc, i.e. those whose token.pos_ matches SCONJ, and split there as a sentence boundary.
Option 2: create a list from whatever spaCy has as a dictionary of values it identifies as SCONJ.
The issue with option 1 is that inside the custom boundary component I only have .text and .i, and no .pos_, since (as far as I am aware) it needs to run before the parser.
The issue with option 2 is that I can't seem to find that dictionary, and it is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

# attempted approach from option 1; not registered or added to the pipeline below
def set_sentence_breaks(doc):
    for token in doc:
        if token == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)

    # for token in doc:
    #     print(token.pos_)

    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
            'we need to consider speed and reliability',
            'so the use of a relational DB would be better than using SQLite.',
            'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            out.append(doc[last:tok.i])
            last = tok.i + 1
    # include the remainder of the sentence (the +1 keeps the final token)
    out.append(doc[last:sent[-1].i + 1])
Alternatively, if that's not good enough, you can use the dependency parse to identify subsentences (for example, by finding verbs related to an SCONJ), save those subsentences, and then add another sentence based on the root.
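A rough sketch of that idea, assuming doc is the parsed example from the question (how SCONJ tokens attach varies by sentence, so treat this as a starting point rather than a drop-in solution):

subsents = []
for sent in doc.sents:
    covered = set()
    for tok in sent:
        # a verb with a subordinating conjunction among its children heads a subsentence
        if tok.pos_ in ("VERB", "AUX") and any(c.pos_ == "SCONJ" for c in tok.children):
            span = doc[tok.left_edge.i : tok.right_edge.i + 1]
            subsents.append(span)
            covered.update(t.i for t in span)
    # whatever is left belongs with the sentence root; taking one span from the first
    # to the last remaining token is good enough when the subordinate clause sits at
    # one end of the sentence
    rest = [t for t in sent if t.i not in covered]
    if rest:
        subsents.append(doc[rest[0].i : rest[-1].i + 1])

print([s.text for s in subsents])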
I have a list of words and I'm trying to turn plural words into their singular form in Python, then remove the duplicates. This is how I do it:
import spacy

nlp = spacy.load('fr_core_news_md')
words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = []

for word in words:
    doc = nlp(word)
    for token in doc:
        clean_words.append(token.lemma_)

clean_words = list(set(clean_words))
This is the output:
['animal', 'janvier', 'poule', 'adresse']
It works well, but my problem is that 'fr_core_news_md' takes a little too long to load, so I was wondering if there is another way to do this?
The task you are trying to do is called lemmatization, and it does more than just convert plural to singular: it removes inflections and returns the canonical form of a word (the infinitive of a verb, for example).
If you want to use spaCy, you can make it load quicker by using the disable parameter.
For example spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger']).
Alternatively, you can use TreeTagger, which is somewhat hard to install but works great.
Or the FrenchLefffLemmatizer.
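For illustration, a minimal sketch of the spaCy route with unneeded components disabled (here only 'parser' and 'ner', a trimmed-down variant of the list above so the lemmatizer still gets any POS information it may need) and nlp.pipe so the model is not called once per word:

import spacy

# keep only what is needed for lemmas; dropping parser and ner speeds up loading and processing
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])

words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']

clean_words = set()
for doc in nlp.pipe(words):
    for token in doc:
        clean_words.add(token.lemma_)

print(list(clean_words))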
I trained an NER model with spaCy 3. I would like to add a custom component (add_regex_match) to the pipeline for the NER task, with the aim of improving the existing NER results.
This is the code I want to implement:
import spacy
from spacy.language import Language
from spacy.tokens import Span
import re

nlp = spacy.load(r"\src\Spacy3\ner_spacy3_hortisem\training\ml_rule_model")

@Language.component("add_regex_match")
def add_regex_entities(doc):
    new_ents = []

    label_z = "Zeit"
    regex_expression_z = r"^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Januar|März|Mai|Juli|August|Oktober|Dezember)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Januar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Februar))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Januar|Februar|März|April|Mai|Juni|Juli|August|September))|(?:1[0-2]|(?:Oktober|November|Dezember)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    for match in re.finditer(regex_expression_z, doc.text):  # find matches in the text
        start, end = match.span()                            # character offsets of the match
        entity = Span(doc, start, end, label=label_z)
        new_ents.append(entity)

    label_b = "BBCH_Stadium"
    regex_expression_b = r"BBCH(\s?\d+)\s?(\/|\-|(bis)?)\s?(\d+)?"
    for match in re.finditer(regex_expression_b, doc.text):  # find matches in the text
        start, end = match.span()                            # character offsets of the match
        entity = Span(doc, start, end, label=label_b)
        new_ents.append(entity)

    doc.ents = new_ents
    return doc

nlp.add_pipe("add_regex_match", after="ner")
nlp.to_disk("./training/ml_rule_regex_model")

doc = nlp("20/03/2021 8 März 2021 BBCH 15, Fliegen, Flugbrand . Brandenburg, in Berlin, Schnecken, BBCH 13-48, BBCH 3 bis 34")
print([(ent.text, ent.label_) for ent in doc.ents])
When I want to evaluate the saved model ml_rule_regex_model using the command line python -m spacy project run evaluate, I get the error:
'ValueError: [E002] Can't find factory for 'add_regex_match' for language German (de). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).'
How should I do this? Has anyone had experience with this? Thank you very much for your tips.
When I want to evaluate the saved model ml_rule_regex_model using the command line python -m spacy project run evaluate, I get the error...
You haven't included the project.yml of your spacy project, where the evaluate command is defined. I will assume it calls spacy evaluate? If so, that command has a --code or -c flag to provide a path to a Python file with additional code, such as registered functions. By providing this file and pointing it to the definition of your new add_regex_match component, spaCy will be able to parse the configuration file and use the model.
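For illustration, assuming the component is defined in a file such as functions.py (a hypothetical name and path, since your project layout isn't shown), that file only needs to register the component so the saved pipeline can be loaded:

# functions.py (hypothetical): registers the custom component so the saved pipeline can be loaded
from spacy.language import Language

@Language.component("add_regex_match")
def add_regex_entities(doc):
    # ... same regex-matching body as in the question (it also needs re and spacy.tokens.Span) ...
    return doc

The evaluate command in your project.yml would then pass that file, e.g. python -m spacy evaluate ./training/ml_rule_regex_model ./corpus/dev.spacy --code ./functions.py (the dev data path here is a placeholder).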
I would like to use textacy for key term extraction, but the function I am using, keyterms.key_terms.pagerank(doc), just returns an empty list.
I have tried related functions, including the longer keyterms.key_terms_from_semantic_network(doc), with no success. I have also tried using longer pieces of text than shown below, but it still does not find any key terms. Other functions in textacy do seem to work, so it looks like a problem with just the keyterms module.
import spacy
import textacy
test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string)
textacy.keyterms.textrank(doc)
I am getting an empty list rather than a list of tuples with terms and ranking scores as expected.
This works for me.
Note the following additions:
I explicitly imported keyterms.
I passed the spaCy English model to make_spacy_doc.
import spacy
import textacy
from textacy import keyterms

test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string, lang='en_core_web_sm')
textacy.keyterms.textrank(doc)
Here are the results I got from your example sentence:
[('term', 0.24594541923542018),
('textacy', 0.24594541923542018),
('extraction', 0.2390545807645797),
('key', 0.13452729038228986),
('spacy', 0.13452729038228986)]
Here is an example working with the newest version (June 2021):
import spacy
import textacy
from textacy.extract import keyterms as kt

test_string = "Textacy key term extraction is not working properly. Textacy is built on top of SpaCy."
doc = textacy.make_spacy_doc(test_string, lang='en_core_web_sm')
kt.textrank(doc)
I want to implement lemmatization with the spaCy package.
Here is my code:
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')

en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')
But this error message occurred when I ran that code:
C:\Users\yu\Anaconda3\lib\runpy.py:193: DeprecationWarning: Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[...])
"__main__", mod_spec)
How can I solve this problem?
To customize spaCy's tokenizer, you add special cases: you pass it the word that needs custom tokenization together with a list of dictionaries specifying the orths it should be split into. Here's an example along the lines of the docs:
from spacy.attrs import ORTH, LEMMA

case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
nlp.tokenizer.add_special_case("don't", case)
If you're doing all this because you want to make a custom lemmatizer, you might be better off just creating a custom lemma list directly. You'd have to modify spaCy's own language data, but the format is pretty simple:
"dustiest": ("dusty",),
"earlier": ("early",),
"earliest": ("early",),
"earthier": ("earthy",),
...
Those files live here for English.
I think your code runs fine; you are just getting a DeprecationWarning, which is not really an error.
Following the advice given by the warning, I think you can modify your code by substituting the tokenizer assignment with

from spacy.tokens import Doc

en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))
and that should run fine with no warnings (it does today on my machine).
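Putting it together, a minimal sketch of the question's code with the deprecated call replaced (assuming scikit-learn's TfidfVectorizer and a spaCy 2.x English model, as in the question):

import re
import spacy
from spacy.tokens import Doc
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en')

# build a Doc directly from the regex matches instead of the deprecated tokens_from_list
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words=regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')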