Fast text processing in Python on a dataframe

I am working with e-commerce data in Python. I have loaded the data and converted it into a pandas dataframe. Now I want to perform text processing on it, like removing unwanted characters, stopwords, stemming, etc. The code I currently have works fine, but it takes a lot of time. I have around 2 million rows of data to process, and it takes forever. I tried the code on 10,000 rows and it took around 240 seconds. I am working on this kind of project for the first time. Any help to reduce the time would be very helpful.
Thanks in advance.
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re

def textprocessing(text):
    stemmer = PorterStemmer()
    # Remove unwanted characters
    re_sp = re.sub(r'\s*(?:([^a-zA-Z0-9._\s "])|\b(?:[a-z])\b)', " ", text.lower())
    # Remove single characters
    no_char = ' '.join([w for w in re_sp.split() if len(w) > 1]).strip()
    # Remove stopwords
    filtered_sp = [w for w in no_char.split(" ") if w not in stopwords.words('english')]
    # Perform stemming
    stemmed_sp = [stemmer.stem(item) for item in filtered_sp]
    # Convert back to a single string
    stemmed_sp = ' '.join(stemmed_sp)
    return stemmed_sp
I am calling this method on that dataframe:
files['description'] = files.loc[:,'description'].apply(lambda x: textprocessing(str(x)))
You can use any data for testing; due to policy, I am not able to share my data.

You could try to do everything in one pass and avoid recreating the stemmer and stop-word list on every call (a set also makes the membership test much cheaper):
STEMMER = PorterStemmer()
STOP_WORD = set(stopwords.words('english'))

def textprocessing(text):
    return ' '.join(
        STEMMER.stem(token)
        for token in re.sub(r'\s*(?:([^a-zA-Z0-9._\s "])|\b(?:[a-z])\b)', " ", text.lower()).split()
        if token not in STOP_WORD and len(token) > 1
    )
You could also use nltk's RegexpTokenizer to remove the unwanted characters:
from nltk.tokenize import RegexpTokenizer

STEMMER = PorterStemmer()
STOP_WORD = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')

def textprocessing(text):
    return ' '.join(
        STEMMER.stem(token)
        for token in TOKENIZER.tokenize(text.lower())
        if token not in STOP_WORD and len(token) > 1
    )
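A minimal usage sketch, assuming the dataframe from the question is called files (the worker count is just a placeholder): the main savings come from creating the stemmer and the stop-word set once, and, if that is still too slow, from spreading the rows over several processes with the standard multiprocessing module.
import multiprocessing as mp

# Straightforward application of the optimized function.
files['description'] = files['description'].astype(str).apply(textprocessing)

# Optional: split the work across processes. textprocessing and its globals
# must live in an importable module, and on Windows this needs to run under
# an `if __name__ == '__main__':` guard.
with mp.Pool(processes=4) as pool:
    files['description'] = pool.map(textprocessing, files['description'].astype(str))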

Related

Text Clustering and Visualize Similarity in Python

I'm analyzing a book of 1167 pages (a txt file). So far I've done preprocessing of the data (cleaning, punctuation removal, stop word removal, tokenization).
Now, how can I cluster the text and visualize the similarity (plot the clusters)?
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided the whole book into chapters: text1 is chapter one, text2 is chapter two, and so on. Now I want to compare the chapters like this, e.g. chapter 1 against chapter 2.
Example code that I used for data pre-processing:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
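A minimal sketch of one way to approach this, assuming each preprocessed chapter is joined back into a single string (here just text1 and text2 from the example; the cluster count is a placeholder): build TF-IDF vectors, plot pairwise cosine similarity as a heatmap, and cluster with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

chapters = [text1, text2]  # one preprocessed string per chapter

# One TF-IDF vector per chapter.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chapters)

# Pairwise similarity between chapters, drawn as a heatmap.
sim = cosine_similarity(X)
plt.imshow(sim, cmap='viridis')
plt.colorbar(label='cosine similarity')
plt.xlabel('chapter')
plt.ylabel('chapter')
plt.show()

# Cluster the chapters; n_clusters here is just a guess.
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)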

Tokenize text - Very slow when doing it

Question
I have a dataframe with 90,000+ rows and a column ['text'] that contains the text of some news articles.
The texts are around 3,000 words long on average, and passing them through word_tokenize is very slow. What would be a more efficient way to do it?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize doesn't remove some punctuation and other characters that I don't want, so I created a function to filter them out using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Is there any other, better method to do it?
Thanks for reading my maybe noob👨🏽‍💻 question😀, have a nice day🌻.
Appreciations
nlp is defined earlier as:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
I'm using spaCy to tokenize, but I am also using the nltk stop words for Spanish.
If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")
To make spaCy faster when you only wish to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy loads a full language model that does not merely tokenize your sentence but also does parsing plus POS and NER tagging. Most of the computation time is actually spent on those other tasks (parse tree, POS, NER), not on tokenization, which is a much 'lighter' task computation-wise. As you can see, spaCy lets you run only what you actually need and thereby saves you some time.
Another thing: you can make your function more efficient by lowercasing each token only once and by adding the stop words to spaCy itself (and even if you don't want to do that, note that otherCharacters being a list rather than a set is not very efficient).
I would also add this:
for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
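Putting both suggestions together, the tokenizer could look roughly like this (a sketch, assuming you only need the tokens and reusing the stop-word sets from the question):
import spacy
from nltk.corpus import stopwords
from spacy.lang.es.stop_words import STOP_WORDS

# Blank Spanish pipeline: tokenizer only, no tagger/parser/NER.
nlp = spacy.blank("es")

otherCharacters = {'`', '�', ' ', '\xa0'}  # a set, not a list
for w in set(stopwords.words('spanish')) | set(STOP_WORDS) | otherCharacters:
    nlp.vocab[w].is_stop = True

def tokenize(phrase):
    # Lowercase each token once; drop punctuation and stop words.
    return [token.text.lower() for token in nlp(phrase)
            if not token.is_punct and not token.is_stop]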

create variable name based on text in the sentence

I have a list of sentences. Each sentence has to be converted to a json. There is a unique 'name' for each sentence that is also specified in that json. The problem is that the number of sentences is large so it's monotonous to manually give a name. The name should be similar to the meaning of the sentence e.g., if the sentence is "do you like cake?" then the name should be like "likeCake". I want to automate the process of creation of name for each sentence. I googled text summarization but the results were not for sentence summarization but paragraph summarization. How to go about this?
This sort of task is a natural language processing problem. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit to deal with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data
file = open('yourFileWithSentences.txt', 'rt')
lines = file.readlines()
file.close()

stop_words = set(stopwords.words('english'))

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words (and the empty strings left by pure-punctuation tokens)
    words = [w for w in stripped if w and w not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
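If you want the camelCase style from the question ("do you like cake?" → likeCake), you could additionally lowercase the first remaining word and capitalize the rest; a small sketch (the helper name is made up):
def to_var_name(words):
    # e.g. ['like', 'cake'] -> 'likeCake'
    if not words:
        return ''
    words = [w.lower() for w in words]
    return words[0] + ''.join(w.capitalize() for w in words[1:])

print(to_var_name(['like', 'cake']))  # likeCake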

How to add more stopwords in nltk list?

I have the following code. I have to add more words to the nltk stopword list. After I run this, it does not add the words to the list.
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r")
new_stopwords = stop.union(new_word)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in emails_body_text]
Don't do things blindly. Read in your new list of stopwords, inspect it to see that it's right, then add it to the other stopword list. Start with the code suggested by @greg_data, but you'll need to strip newlines and maybe do other things -- who knows what your stopwords file looks like?
This might do it, for example:
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
PS. Don't keep splitting and joining your document; tokenize once and work with the list of tokens.
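A sketch of what that tokenize-once version could look like, assuming stopwords_en.txt has one word per line and reusing the names from the question:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Tokenize once, then filter and lemmatize the token list.
    tokens = word_tokenize(doc.lower())
    return [lemma.lemmatize(t) for t in tokens
            if t not in new_stopwords and t not in exclude]

doc_clean = [clean(doc) for doc in emails_body_text]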

Stemming words with NLTK (python)

I am new to Python text processing. I am trying to stem words in a text document, which has around 5000 rows.
I have written the script below:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords  # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    # 4. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    # 5. Stem words
    words = ([stemmer.stem(w) for w in words])
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)

clean_Description = Description_to_words(train["Description"][15])
But when I test it, the words are not stemmed. Can anyone help me figure out what the issue is? I must be doing something wrong in the "Description_to_words" function.
When I execute the stem command separately, like below, it works:
from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
... print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Here's each step of your function, fixed.
Remove HTML.
Description_text = BeautifulSoup(raw_Description).get_text()
Remove non-letters, but don't remove whitespaces just yet. You can also simplify your regex a bit.
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
Convert to lower case, split into individual words: I recommend using word_tokenize again, here.
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words.
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if not w in stops]
Stem words. Here is another issue. Stem meaningful_words, not words.
return ' '.join(stemmer.stem(w) for w in meaningful_words)
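Put together, the corrected function might look like this (a sketch that simply combines the steps above):
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
import re

stemmer = SnowballStemmer('english')
stops = set(stopwords.words("english"))

def Description_to_words(raw_Description):
    # Remove HTML, keep word characters and whitespace, tokenize,
    # drop stop words, then stem what remains.
    Description_text = BeautifulSoup(raw_Description, "html.parser").get_text()
    letters_only = re.sub(r"[^\w\s]", " ", Description_text)
    words = word_tokenize(letters_only.lower())
    meaningful_words = [w for w in words if w not in stops]
    return ' '.join(stemmer.stem(w) for w in meaningful_words)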
