Plural to singular of French words in Python

I have a list of French words and I'm trying to turn the plural words into singular, then remove the duplicates. This is how I do it:
import spacy
nlp = spacy.load('fr_core_news_md')
words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = []
for word in words:
    doc = nlp(word)
    for token in doc:
        clean_words.append(token.lemma_)
clean_words = list(set(clean_words))
This is the output:
['animal', 'janvier', 'poule', 'adresse']
It works well, but my problem is that 'fr_core_news_md' takes a little too long to load, so I was wondering if there is another way to do this?

The task you are trying to do is called lemmatization, and it does more than just convert plural to singular: it removes inflections and returns the canonical form of a word, for example the infinitive of a verb.
If you want to keep using spaCy, you can make it load quicker by using the disable parameter.
For example: spacy.load('fr_core_news_md', disable=['parser', 'textcat', 'ner', 'tagger']).
Alternatively, you can use TreeTagger, which is somewhat hard to install but works well.
Or the FrenchLefffLemmatizer.
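Coming back to the spaCy route, here is a minimal sketch of the faster-loading approach applied to the question's word list. It assumes that disabling only the parser and NER leaves whatever component the French lemmatizer depends on intact; the exact set of components you can safely disable varies between model versions, so check that the lemmas still come out right:
import spacy

# Load the French model without the heavier components; put a component back
# into the pipeline if the lemmas come back unchanged.
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])

words = ['animaux', 'poule', 'adresse', 'animal', 'janvier', 'poules']
clean_words = list({token.lemma_ for word in words for token in nlp(word)})
print(clean_words)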

Related

Using POS and PUNCT tokens in custom sentence boundaries in spaCy

I am trying to split sentences into clauses using spaCy for classification with MLlib. I have narrowed it down to one of two solutions that I consider the best approach, but haven't had much luck with either.
Option 1: Use the tokens in the doc, i.e. token.pos_ values that match SCONJ, and split there as a sentence boundary.
Option 2: Create a list from whatever dictionary of values spaCy identifies as SCONJ.
The issue with option 1 is that I only have .text and .i, and no .pos_, since custom boundaries (as far as I am aware) need to be set before the parser runs.
The issue with option 2 is that I can't seem to find that dictionary. It is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

def set_sentence_breaks(doc):
    for token in doc:
        if token.pos_ == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start: while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            out.append(doc[last:tok.i])
            last = tok.i + 1
    out.append(doc[last:sent[-1].i])
Alternatively, if that's not good enough, you can identify subsentences using the dependency parse: find the verbs of subsentences (by their relation to a SCONJ, for example), save those subsentences, and then add another sentence based on the root.
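To make that dependency-parse alternative a bit more concrete, here is a minimal, hedged sketch. It assumes a subordinating conjunction attaches to the verb of its clause with the mark relation, so that verb's subtree (via left_edge and right_edge) approximates the subsentence; the labels you actually get depend on the parser and the text:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We need to consider speed and reliability, "
          "so use of a relational DB would be better than using SQLite.")

for tok in doc:
    # A SCONJ that marks a clause usually attaches to that clause's verb.
    if tok.pos_ == "SCONJ" and tok.dep_ == "mark":
        head = tok.head
        clause = doc[head.left_edge.i : head.right_edge.i + 1]
        print("subsentence:", clause.text)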

Force spaCy lemmas to be lowercase

Is it possible to leave the token text true-cased, but force the lemmas to be lowercased? I am interested in this because I want to use the PhraseMatcher, where I run an input text through the pipeline and then search for matching phrases in that text, where each search query can be case sensitive or not. In the case where I search by lemma, I'd like the search to be case insensitive by default.
e.g.
doc = nlp(text)
for query in queries:
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(self.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)
In case 1, I expect matching to be case insensitive, and if there were something in the spaCy library to enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc, and forcing one to have all lowercased characters.
This part of spaCy changes from version to version; the last time I looked at lemmatization was a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:
# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
You will need to figure out where in the pipeline to add it. The documentation mentions that the Lemmatizer uses POS tagging info, so I am not sure at exactly what point it is called; placing your pipe after the tagger is safe, as all the lemmas should have been figured out by then.
Another option I can think of is to derive a custom lemmatizer from Lemmatizer class and override its __call__ method, but this is likely to be quite invasive as you will need to figure out how (and where) to plug in your own lemmatizer.
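For completeness, a rough usage sketch (not part of the answer above) of how the question's LEMMA-based matching could then behave case-insensitively; the query text is illustrative, and note that PhraseMatcher.add passes pattern docs differently across spaCy versions:
from spacy.matcher import PhraseMatcher

# With lower_case_lemmas in the pipeline, both the pattern doc and the input
# doc carry lowercased lemmas, so LEMMA matching ignores casing.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
phrase_matcher.add("QUERY", None, nlp("relational database"))  # newer spaCy takes a list of docs instead
matches = phrase_matcher(nlp("Relational Databases are widely used."))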

Can a token be removed from a spaCy document during pipeline processing?

I am using spaCy (a great Python NLP library) to process a number of very large documents; however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token
def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
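A hedged sketch of that idea, assuming a spaCy version where PhraseMatcher.add takes a list of pattern docs, and with purely illustrative phrases; here the matched token indices are simply collected into a set rather than written back onto the tokens:
from spacy.matcher import PhraseMatcher

# Treat every token inside a matched phrase as excluded.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("EXCLUDE", [nlp("for example"), nlp("in other words")])

doc = nlp("In other words, keep these tokens, for example.")
excluded = set()
for match_id, start, end in matcher(doc):
    excluded.update(range(start, end))

print([token.text for token in doc if token.i not in excluded])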
Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
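For instance, a minimal sketch of rebuilding a Doc without the excluded tokens, reusing the is_excluded extension from above (attribute values such as part-of-speech tags would additionally need the Doc.from_array step described here, which this sketch skips):
from spacy.tokens import Doc

doc = nlp("Test that tokens are excluded")
kept = [token for token in doc if not token._.is_excluded]

# Rebuild a new Doc from the surviving tokens' texts and trailing-space flags.
new_doc = Doc(doc.vocab, words=[t.text for t in kept],
              spaces=[bool(t.whitespace_) for t in kept])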

How to analyze nouns in a list

I would like to know if there is a way to analyze the nouns in a list. For example, is there an algorithm that discerns different categories, i.e. whether a noun is part of the category "animal", "plants", "nature" and so on?
I thought it was possible to achieve this with WordNet, but, if I am not wrong, all the nouns in WordNet end up categorized as "entity". Here is a script of my WordNet analysis:
from nltk.corpus import wordnet as wn

lemmas = ['dog', 'cat', 'garden', 'ocean', 'death', 'joy']
hypernyms = []
for i in lemmas:
    dog = wn.synsets(i)[0]
    temp_list = []
    hypernyms_list = [lemma.name() for synset in dog.root_hypernyms() for lemma in synset.lemmas()]
    temp_list.append(hypernyms_list)
    flat = list(set([item for sublist in temp_list for item in sublist]))
    hypernyms.append(flat)
hypernyms
And the result is: [['entity'], ['entity'], ['entity'], ['entity'], ['entity'], ['entity']].
Can anybody suggest some techniques to retrieve the categories the nouns belong to, if anything like this is available?
Thanks in advance.
One approach I can suggest is using Google's NLP API. This API has a feature for identifying parts of speech as part of its syntax analysis. Please refer to the documentation here:
Google's NLP API - Syntax Analysis
Another option is Stanford's NLP API. Here are reference docs - Stanford's NLP API
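As a side note on the question's own WordNet code: root_hypernyms() always climbs to the very top of the noun hierarchy, which is why every word maps to 'entity'. A hedged sketch, separate from the answer above, of looking at the immediate hypernyms instead, which sit one level up and are much more informative:
from nltk.corpus import wordnet as wn

for word in ['dog', 'cat', 'garden', 'ocean', 'death', 'joy']:
    synset = wn.synsets(word)[0]
    # Immediate parents in the hierarchy, e.g. 'dog' -> 'canine', 'domestic_animal'.
    print(word, [h.name() for h in synset.hypernyms()])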

Overriding the tokenizer of a scikit-learn vectorizer with spaCy

I want to implement lemmatization with the spaCy package.
Here is my code:
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

regexp = re.compile('(?u)\\b\\w\\w+\\b')
en_nlp = spacy.load('en')
old_tokenizer = en_nlp.tokenizer
en_nlp.tokenizer = lambda string: old_tokenizer.tokens_from_list(regexp.findall(string))

def custom_tokenizer(document):
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

lemma_tfidfvect = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')
But this error message occurred when I ran that code.
C:\Users\yu\Anaconda3\lib\runpy.py:193: DeprecationWarning: Tokenizer.from_list is now deprecated. Create a new Doc object instead and pass in the strings as the `words` keyword argument, for example:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[...])
"__main__", mod_spec)
How can I solve this problem?
To customize spaCy's tokenizer with a special case, you pass it the string that needs custom tokenization together with a list of dictionaries describing the orths it should be split into. Here's the example code from the docs:
from spacy.attrs import ORTH, LEMMA

case = [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
tokenizer.add_special_case("don't", case)
If you're doing all this because you want to make a custom lemmatizer, you might be better off just creating a custom lemma list directly. You'd have to modify the language data of spaCy itself, but the format is pretty simple:
"dustiest": ("dusty",),
"earlier": ("early",),
"earliest": ("early",),
"earthier": ("earthy",),
...
Those files live here for English.
I think your code runs fine; you are just getting a DeprecationWarning, which is not really an error.
Following the advice given by the warning, I think you can modify your code by substituting the tokenizer assignment with
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab, words = regexp.findall(string))
and that should run fine with no warnings (it does today on my machine).
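As a quick usage check, a small sketch that reuses the question's objects with that substitution applied; the corpus text is purely illustrative:
# Illustrative corpus, just to exercise the lemmatizing tokenizer end to end.
corpus = ["The cats are sitting on the mats", "A cat sat on a mat"]
X = lemma_tfidfvect.fit_transform(corpus)
print(sorted(lemma_tfidfvect.vocabulary_))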
