Question
I have a data frame with over 90,000 rows and a column ['text'] that contains the text of some news articles.
The texts are about 3,000 words long on average, and when I apply word_tokenize it is very slow. What would be a more efficient way to do this?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize does not remove some punctuation and other characters that I don't want, so I created a function to filter them out, using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Any other better method to do it?
Thanks for reading my maybe noob👨🏽💻 question😀, have a nice day🌻.
Appreciations
nlp is defined earlier:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
I'm using spaCy to tokenize, but also using the NLTK stop words for Spanish.
If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")
Alternatively, in order to make spaCy faster when you only wish to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy gives you a full language pipeline that doesn't merely tokenize your sentence but also does parsing, POS tagging, and NER. Most of the computation time is actually spent on those other tasks (parse tree, POS, NER) and not on tokenization, which is a much 'lighter' task computation-wise.
But, as you can see, spaCy lets you use only what you actually need and thereby saves you some time.
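To see the difference, you can inspect nlp.pipe_names; a minimal sketch (the exact component names depend on the spaCy and model versions):
import spacy

nlp_full = spacy.load("es_core_news_sm")
print(nlp_full.pipe_names)   # several components besides the tokenizer, e.g. tagger/parser/ner

nlp_blank = spacy.blank("es")
print(nlp_blank.pipe_names)  # [] -- a blank model runs only the tokenizer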
Another thing: you can make your function more efficient by lowercasing each token only once and by adding your stop words to spaCy itself (even if you didn't want to do that, the fact that otherCharacters is a list and not a set is not very efficient).
I would also add this:
for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
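Putting the pieces together, a minimal sketch of the whole revised setup (assuming the imports, spanish_stopwords, otherCharacters, STOP_WORDS and the df['texto'] column from the question):
nlp = spacy.blank("es")  # tokenizer-only pipeline

# register all custom stop words once, up front
for w in list(spanish_stopwords) + list(STOP_WORDS) + otherCharacters:
    nlp.vocab[w].is_stop = True

def tokenize(phrase):
    # lowercase each token only once and let spaCy's lexeme flags do the filtering
    return [token.lower_ for token in nlp(phrase)
            if not token.is_punct and not token.is_stop]

df['tokenized_text'] = df['texto'].apply(tokenize)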
I have a list of sentences. Each sentence has to be converted to a JSON, and there is a unique 'name' for each sentence that is also specified in that JSON. The problem is that the number of sentences is large, so it's monotonous to assign a name manually. The name should reflect the meaning of the sentence, e.g., if the sentence is "do you like cake?" then the name should be something like "likeCake". I want to automate the creation of a name for each sentence. I googled text summarization, but the results were about paragraph summarization rather than sentence summarization. How do I go about this?
This sort of task falls under natural language processing. You can get a result close to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit to deal with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()
stop_words = set(stopwords.words('english'))
for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words
    words = [w for w in stripped if w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
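Since the question asks for names like "likeCake", a small follow-up sketch that turns the filtered words into a camelCase identifier (join_camel_case is just an illustrative helper, not a library function):
def join_camel_case(words):
    # ['like', 'cake'] -> 'likeCake'
    if not words:
        return ""
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

print(join_camel_case(['like', 'cake']))  # likeCake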
What is the correct way to use gensim's Phrases and preprocess_string together? I am doing it this way, but it feels a little contrived.
from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_numeric
import re
from gensim import utils
# removed "_" from regular expression
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~"""
RE_PUNCT = re.compile(r'([%s])+' % re.escape(punctuation), re.UNICODE)
def strip_punctuation(s):
    """Replace punctuation characters with spaces in `s` using :const:`~gensim.parsing.preprocessing.RE_PUNCT`.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without punctuation characters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_punctuation
    >>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
    u'A semicolon is a stronger break than a comma but not as much as a full stop '

    """
    s = utils.to_unicode(s)
    return RE_PUNCT.sub(" ", s)
my_filter = [
lambda x: x.lower(), strip_tags, strip_punctuation,
strip_multiple_whitespaces, strip_numeric,
remove_stopwords, strip_short, stem_text
]
documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2)
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
test = " ".join(bigram[sent])
print(preprocess_string(test))
print(preprocess_string(test, filters=my_filter))
The result is:
['mayor', 'new', 'york']
['mayor', 'new_york'] #correct
Part of the code was taken from: How to extract phrases from corpus using gensim
I would recommend using gensim.utils.tokenize() instead of gensim.parsing.preprocessing.preprocess_string() for your example.
In many cases tokenize() does a very good job as it will only return sequences of alphabetic characters (no digits). This saves you the extra cleaning steps for punctuation etc.
However, tokenize() does not include removal of stopwords, short tokens, or stemming. That has to be customized anyway if you are dealing with languages other than English.
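For instance, a quick illustration of that behavior (a minimal sketch; output may vary slightly by gensim version):
import gensim

print(list(gensim.utils.tokenize("3 apples and 2 pears!", lower=True)))
# expected: ['apples', 'and', 'pears'] -- digits and punctuation are dropped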
Here is some code for your (already clean) example documents which gives you the desired bigrams.
documents = ["the mayor of new york was there",
"machine learning can be useful sometimes",
"new york mayor was present"]
import gensim, pprint
# tokenize documents with gensim's tokenize() function
tokens = [list(gensim.utils.tokenize(doc, lower=True)) for doc in documents]
# build bigram model
bigram_mdl = gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2)
# do more pre-processing on tokens (remove stopwords, stemming etc.)
# NOTE: this can be done better
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text
CUSTOM_FILTERS = [remove_stopwords, stem_text]
tokens = [preprocess_string(" ".join(doc), CUSTOM_FILTERS) for doc in tokens]
# apply bigram model on tokens
bigrams = bigram_mdl[tokens]
pprint.pprint(list(bigrams))
Output:
[['mayor', 'new_york'],
['machin', 'learn', 'us'],
['new_york', 'mayor', 'present']]
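Once trained, the same Phrases model can be applied to any new token list; a small sketch reusing bigram_mdl and gensim.utils.tokenize() from above:
new_doc = "the new york mayor went to a machine learning conference"
new_tokens = list(gensim.utils.tokenize(new_doc, lower=True))
print(bigram_mdl[new_tokens])  # 'new' and 'york' should come out merged as 'new_york'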
I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the sequence of filtered tokens, but I am interested in having the actual Doc structure.
text = u"This document is only an example. " \
"I would like to create a custom pipeline that will remove specific tokesn from the final document."
doc = nlp(text)
def keep_token(tok):
    # This is only an example rule
    return tok.pos_ not in {'PUNCT', 'NUM', 'SYM'}
final_tokens = list(filter(keep_token, doc))
# How to get a spacy.Doc from final_tokens?
I tried to reconstruct a new spaCy Doc from the token list, but it is not clear from the API how to do it.
I am pretty sure that you have found your solution by now, but since it is not posted here I thought it might be useful to add it.
You can remove tokens by converting the doc to a numpy array, removing the unwanted rows from the numpy array, and then converting back to a doc.
Code:
import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy
def remove_tokens_on_match(doc):
    indexes = []
    for index, token in enumerate(doc):
        if token.pos_ in ('PUNCT', 'NUM', 'SYM'):
            indexes.append(index)
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = numpy.delete(np_array, indexes, axis=0)
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2
# load english model
nlp = spacy.load('en')
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))
You can look to a similar question that I answered here.
Depending on what you want to do there are several approaches.
1. Get the original Document
Tokens in SpaCy have references to their document, so you can do this:
original_doc = final_tokens[0].doc
This way you can still get PoS, parse data etc. from the original sentence.
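For example (a small sketch using the final_tokens from the question), each kept token still exposes the annotations computed on the original doc:
for tok in final_tokens:
    print(tok.text, tok.pos_, tok.dep_)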
2. Construct a new document without the removed tokens
You can append the strings of all the tokens with whitespace and create a new document. See the token docs for information on text_with_ws.
doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))
This is probably not going to give you what you want though - PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
If neither of those was what you had in mind, let me know and maybe I can help.
I'm building an NLP chat application in Python using the Doc2Vec technique from its gensim package. I have already done tokenizing and stemming. I want to remove the stop words (to test whether it works better) from both the training set and the question that the user asks.
Here is my code.
import gensim
import nltk
from gensim import models
from gensim import utils
from gensim import corpora
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sentence0 = models.doc2vec.LabeledSentence(words=[u'sampl',u'what',u'is'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'sampl',u'tell',u'me',u'about'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'elig',u'what',u'is',u'my'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'limit', u'what',u'is',u'my'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'claim',u'how',u'much',u'can',u'I'],tags=["SENT_4"])
sentence5 = models.doc2vec.LabeledSentence(words=[u'retir',u'i',u'am',u'how',u'much',u'can',u'elig',u'claim'],tags=["SENT_5"])
sentence6 = models.doc2vec.LabeledSentence(words=[u'resign',u'i',u'have',u'how',u'much',u'can',u'i',u'claim',u'elig'],tags=["SENT_6"])
sentence7 = models.doc2vec.LabeledSentence(words=[u'promot',u'what',u'is',u'my',u'elig',u'post',u'my'],tags=["SENT_7"])
sentence8 = models.doc2vec.LabeledSentence(words=[u'claim',u'can,',u'i',u'for'],tags=["SENT_8"])
sentence9 = models.doc2vec.LabeledSentence(words=[u'product',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_9"])
sentence10 = models.doc2vec.LabeledSentence(words=[u'hotel',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_10"])
sentence11 = models.doc2vec.LabeledSentence(words=[u'onlin',u'product',u'can',u'i',u'for',u'bought',u'through',u'claim',u'sampl'],tags=["SENT_11"])
sentence12 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'guidelin',u'where',u'do',u'i',u'apply',u'form',u'sampl'],tags=["SENT_12"])
sentence13 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'procedur',u'rule',u'and',u'regul',u'what',u'is',u'the',u'for'],tags=["SENT_13"])
sentence14 = models.doc2vec.LabeledSentence(words=[u'can',u'i',u'submit',u'expenditur',u'on',u'behalf',u'of',u'my',u'friend',u'and',u'famili',u'claim',u'and',u'reimburs'],tags=["SENT_14"])
sentence15 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'procedur',u'can',u'i',u'submit',u'from',u'shopper stop',u'claim'],tags=["SENT_15"])
sentence16 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'can',u'i',u'submit',u'from',u'pantaloon',u'claim'],tags=["SENT_16"])
sentence17 = models.doc2vec.LabeledSentence(words=[u'invoic',u'procedur',u'can',u'i',u'submit',u'invoic',u'from',u'spencer',u'claim'],tags=["SENT_17"])
# User asks a question.
document = input("Ask a question:")
tokenized_document = list(gensim.utils.tokenize(document, lowercase = True, deacc = True))
#print(type(tokenized_document))
stemmed_document = []
for w in tokenized_document:
    stemmed_document.append(ps.stem(w))
sentence19 = models.doc2vec.LabeledSentence(words= stemmed_document, tags=["SENT_19"])
# Building vocab.
sentences = [sentence0,sentence1,sentence2,sentence3, sentence4, sentence5,sentence6, sentence7, sentence8, sentence9, sentence10, sentence11, sentence12, sentence13, sentence14, sentence15, sentence16, sentence17, sentence19]
#I tried to remove the stop words but it didn't work out as LabeledSentence object has no attribute lower.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in sentences]
..
Is there a way I can remove stop words from the sentences directly and get a new vocabulary without stop words?
Your sentences object is already a list of LabeledSentence objects. You construct these above; each includes a list-of-strings in words and a list-of-strings in tags.
So each item in that list (document in your list-comprehension) can't have a string method like .lower() applied to it. (Nor would it need to be .split(), as its words are already separate tokens.)
The cleanest approach would be to remove stop-words from the lists-of-words before they're used to construct LabeledSentence objects. For example, you could make a function without_stopwords(), defined at the top. Then your lines creating LabeledSentence objects could instead be like:
sentence0 = LabeledSentence(words=without_stopwords([u'sampl', u'what', u'is']),
                            tags=["SENT_0"])
Alternatively, you could mutate the existing LabeledSentence objects so that each of their words attributes now lack stop-words. This would replace your last line with something more like:
for doc in sentences:
    doc.words = [word for word in doc.words if word not in stoplist]
texts = sentences
Separately, things you didn't ask but should know:
TaggedDocument is now the preferred example-class name for Doc2Vec text objects (a short sketch follows at the end of this answer) – but in fact any object that has the two required properties words and tags will work fine.
Doc2Vec doesn't show many of the desired properties on tiny, toy-sized datasets – don't be surprised if a model built on dozens of sentences does not do anything useful, or misleads about what preprocessing/meta-parameter options are best. (Tens of thousands of texts, and texts at least tens-of-words long, are much better for meaningful results.)
Much Word2Vec/Doc2Vec work doesn't bother with stemming or stop-word removal, but it may sometimes be helpful.
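To illustrate the TaggedDocument point above, switching is a drop-in replacement (a minimal sketch):
from gensim.models.doc2vec import TaggedDocument

sentence0 = TaggedDocument(words=[u'sampl', u'what', u'is'], tags=["SENT_0"])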