Quick way to get document vector using GloVe - python

Problem
I am trying to use GloVe to represent an entire document. However, GloVe was originally designed to produce word embeddings. One way to get a document embedding is to average the embeddings of all words in the document.
I am following the solution posted here to load the GloVe look-up table. However, when I try to compute the document embeddings, the runtime is extremely slow (about 1 s per document, for more than 1 million documents).
I am wondering if there is any way I could accelerate this process.
The GloVe look-up table can be downloaded here, and the following is the code I use to get the document embeddings. The data is stored in a pd.DataFrame with a review column.
Note that some words in text_processed_list may not be present in the look-up table, which is why the try...except block is there.
import numpy as np
import pandas as pd
import string
import csv
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

remove_list = stopwords.words('english') + list(string.punctuation)

# data is the pd.DataFrame with the "review" column; dataset_size = len(data)
X = np.zeros((dataset_size, 300))
glove_model = pd.read_table("glove.42B.300d.txt", sep=" ", index_col=0,
                            header=None, quoting=csv.QUOTE_NONE)

for iter in range(dataset_size):
    text = data.loc[iter, "review"]
    text_processed_list = [word for word in word_tokenize(text.lower())
                           if word not in remove_list]
    for word in text_processed_list:
        try:
            X[iter] += glove_model.loc[word].values
        except KeyError:
            pass
    X[iter] /= len(text_processed_list)
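For reference, a minimal sketch of one common speed-up, assuming the same glove.42B.300d.txt format: build a plain Python dict of NumPy vectors once, so that each word lookup is a hash lookup instead of a DataFrame .loc call.

import numpy as np

# Sketch (not benchmarked on this data): word -> vector dict built once from the same file
embeddings = {}
with open("glove.42B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def document_vector(words, dim=300):
    # Average the vectors of the words that are present in the table
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)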

Related

More efficient way to preprocess large amount of text in a pandas df?

I have a series of text preprocessing steps organized in the text_preprocessing() function below. (There is more to it, such as converting emojis and removing punctuation; I dropped those steps for clarity.)
import re
import spacy

nlp_model = spacy.load('en_core_web_lg')
nlp_model.add_pipe("merge_entities")

def text_preprocessing(text, lemmatizer):
    text = text.lower()
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split()])
    text = [w if not re.search(r'[^\x00-\x7F]', w) else "<FOREIGN>" for w in text.split()]
    text = [w.text if (not w.ent_type_ or w.ent_type_ == 'PERSON' or w.ent_type_ == 'ORG')
            else f"<{w.ent_type_}>" for w in nlp_model(" ".join(text))]
    text = " ".join(text)
    text = re.sub(r"<\s([A-Z]+?)", r"<\1", text)
    text = re.sub(r"([A-Z]+?)\s>", r"\1>", text)
    text = re.sub(r"\s(<[A-Z]+?)\s", r" \1> ", text)
    text = re.sub(r"\s([A-Z]+?>)\s", r" <\1 ", text)
    text = " ".join([w.upper() if ("<" in w and ">" in w) else w for w in text.split()])
    return text
At the moment, I have a working solution which is as follows:
from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
df['Preprocessed'] = df['Text'].apply(lambda x: text_preprocessing(x, lmtzr))
I already moved the instantiation of WordNetLemmatizer outside of text_preprocessing() and pass the instance as an argument. Now I am thinking of further optimizing this code, as the database of messages to run on has grown considerably and is nearing 30,000 rows (30,000 texts to preprocess, with more added every day). Preprocessing the texts one by one already takes several hours. I tried multiprocessing.Process earlier, but it didn't make much of an impact. I have read about vectorization, but I'm unsure how it could be applied to my situation. I'm also aware of external packages that make it easier to set up multiprocessing for df.apply(), such as the swifter module, but I am hoping for more than a 2-4x speed-up, since I already have quite a lot of data and there will be even more in the future.
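For reference, a minimal sketch of the swifter route just mentioned; importing swifter registers the .swifter accessor, so the existing apply call stays essentially the same (whether the gain would be enough here is exactly my doubt):

import swifter  # registers the .swifter accessor on pandas Series/DataFrames

df['Preprocessed'] = df['Text'].swifter.apply(lambda x: text_preprocessing(x, lmtzr))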
Example data can be created with the following function:
import pandas as pd
import numpy as np
import random
def generate_example_data(rows=100):
    df = pd.DataFrame(np.random.randint(0, 100, size=(rows, 4)), columns=list('ABCD'))
    df['Text'] = pd.Series(["".join([random.choice("aábcčdeëfghijklmnoópqrsştuvwxyz ")
                                     for i in range(random.randint(25, 400))])
                            for j in range(rows)])
    return df
I think your reading about vectorized calculation is the key here, and I would go that way before considering multithreading or multiprocessing.
Some operations can already be vectorized on your DataFrame. You shouldn't worry too much about adding columns to your frame.
For example, your first operation
text = text.lower()
can be replaced with
df['some_col'].str.lower()
Furthermore, you can substitute some of your regex operations using this thread. I also find this a good source. In addition, try to make use of the numpy library as much as you can (though I am not too sure your case is a good fit).
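For instance, a rough sketch (column names assumed) of how the lower-casing and the non-ASCII token replacement from your function could be done column-wise with pandas string methods:

df['Clean'] = df['Text'].str.lower()
# Replace whole tokens that contain any non-ASCII character with the placeholder
df['Clean'] = df['Clean'].str.replace(r'\S*[^\x00-\x7F]\S*', '<FOREIGN>', regex=True)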
Good luck.

Removing all punctuation from string in dataframe

This is officially doing my head in. I am web scraping a collection of tweets for text analysis. The tweets have been scraped and put into a dataframe, where each row is a string containing the entire tweet. I can't for the life of me remove quotation marks or apostrophes, although removing all other punctuation works fine.
What I am trying to do is extract just the verbs, nouns and adjectives from each of the scraped tweets, which I have done, but anything in quotation marks is excluded.
The code I have been using so far is below, but I can't get it to also strip quotation marks or apostrophes. I have also tried every other method I can find on this site, but it either does nothing or produces errors.
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
The entire code base up until this point is:
import GetOldTweets3 as got
import pandas as pd
import re
from wordcloud import WordCloud  # Join the different processed titles together.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore", DeprecationWarning)  # Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
import os
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import gensim
from gensim import corpora, models, similarities
import logging
import tempfile
from nltk.corpus import stopwords
from string import punctuation
from collections import OrderedDict
import pyLDAvis.gensim
import tempfile

%matplotlib inline
init_notebook_mode(connected=True)  # do not miss this line
warnings.filterwarnings("ignore")

# Function that pulls tweets based on a general search query and turns them into a csv file
# Parameters: (text query you want to search), (max number of most recent tweets to pull from)
def text_query_to_csv(text_query, count):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                                               .setMaxTweets(count)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.date, tweet.text] for tweet in tweets]
    # Creation of dataframe from tweets
    tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text'])
    # Converting tweets dataframe to csv file
    tweets_df.to_csv('{}-{}-tweets.csv'.format(text_query, int(count)), sep=',')

############################################
# Search word and number of tweets to scrape
############################################
text_query = '#barackobama'
count = 5

# Calling function to query X amount of relevant tweets and create a CSV file
text_query_to_csv(text_query, count)

filename = '#barackobama-5-tweets.csv'
tweets = pd.read_csv(filename)

# Convert tweets to strings and lower case
tweets['Text'] = tweets['Text'].astype(str)
tweets['Text'] = tweets['Text'].map(lambda x: x.lower())
tweets
This is the offending bit of code below...
# remove punctuation
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
tweets['Text_processed'].head()
#####################################
# Extract nouns, verbs and adjectives
#####################################
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from IPython.display import display

lemmatizer = nltk.WordNetLemmatizer()

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()

def get_word_postag(word):
    if pos_tag([word])[0][1].startswith('J'):
        return wordnet.ADJ
    if pos_tag([word])[0][1].startswith('V'):
        return wordnet.VERB
    if pos_tag([word])[0][1].startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.NOUN

def normalise(word):
    """Normalises a word to lowercase, then stems and lemmatizes it."""
    word = word.lower()
    postag = get_word_postag(word)
    word = lemmatizer.lemmatize(word, postag)
    return word

def get_terms(tree):
    for leaf in leaves(tree):
        terms = [normalise(w) for w, t in leaf]
        yield terms

tidied_tweets = []
for t in tweets['Text']:
    # word tokenizing and part-of-speech tagging
    document = t
    tokens = [nltk.word_tokenize(sent) for sent in [document]]
    postag = [nltk.pos_tag(sent) for sent in tokens][0]

    # Rule for NP chunk and VB chunk
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
            {<RB.?>*<VB.?>*<JJ>*<VB.?>+<VB>?}  # Verbs and Verb Phrases
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    # Chunking
    cp = nltk.RegexpParser(grammar)
    # the result is a tree
    tree = cp.parse(postag)

    terms = get_terms(tree)

    features = []
    for term in terms:
        _term = ''
        for word in term:
            _term += ' ' + word
        features.append(_term.strip())
    tidied_tweets.append(features)

tidied_tweets
The code base I have after this works OK, but the inability to remove quoted text or anything with an apostrophe is causing real problems.
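For reference, a minimal sketch of a character-deletion approach with str.maketrans (instead of the regex), which also strips straight and curly quotes and apostrophes; string.punctuation already contains both " and ':

import string

# string.punctuation includes " and ' ; the curly variants are added explicitly
to_strip = string.punctuation + '\u2018\u2019\u201c\u201d'
table = str.maketrans('', '', to_strip)
tweets['Text_processed'] = tweets['Text'].map(lambda x: x.translate(table))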
EDITED TO ADD
I've managed to solve the problem, but in doing so I created another. The latest bit of code to extract the words sans the punctuation is:
tweet_list = []
ind_tweet = []
for tweets in tidied_tweets:
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        ind_tweet.append(a)
    tweet_list.append(ind_tweet)
re.findall(r"[\w']+", words) does the job of extracting the words, but I can't recreate the structured list I started with. What I want is for tweet_list to act as the parent list and ind_tweet to act as a succession of nested child lists. When I print out the result of the code above, I don't get the nested list I am looking for: ind_tweet produces the output but as a single flat list with no nesting, and tweet_list just duplicates ind_tweet. It probably isn't helping that it's 2:30am on a Saturday, but this should be much easier than I am making it...
And the answer is...
tweet_list = [[] for i in range(len(tidied_tweets))]
for tweets, t in zip(tidied_tweets, range(len(tidied_tweets))):
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        tweet_list[t].append(a)
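An equivalent, slightly more idiomatic form of the same loop (continuing from the same variables) uses enumerate instead of zipping with a range:

tweet_list = [[] for _ in tidied_tweets]
for t, tweet_words in enumerate(tidied_tweets):
    for words in tweet_words:
        tweet_list[t].append(re.findall(r"[\w']+", words))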

On what basis my source is vectorizing & clustering data?

I am taking input from a text file and want to build a semantic vocabulary; however, without a vocabulary I am just passing a tokenized list of words. I am not able to figure out on what basis the vectorization and clustering happen when no vocabulary is set. The documentation says: "If not given, a vocabulary is determined from the input documents." However, I am only using one txt file as my input.
I have tried to create a vocabulary out of WordNet synonym sets, but I have not been able to get anywhere.
import string
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from nltk.corpus import wordnet

src = open('Sample.txt', 'r')
pageData = src.read().splitlines()

# preprocessing
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokenize = re.split("\W+", text)  # tokenizing based on words
    return text

filter_data = clean_text(pageData)

# Feature Extraction
Tfidf_vectorizer = TfidfVectorizer(tokenizer=clean_text, analyzer='char',
                                   use_idf=True, stop_words=stopwords)
Tfidf_matrix = Tfidf_vectorizer.fit_transform(filter_data)  # checking the words in filter_data to find relevance
terms = Tfidf_vectorizer.get_feature_names()

# Clustering
km = KMeans(n_clusters=5, n_jobs=-1)
labels = km.fit_transform(Tfidf_matrix)
clusters = km.labels_.tolist()
X = Tfidf_matrix.todense()
The vocabulary here is a mapping of words to columns.
If you don't predefine a vocabulary (which is necessary when processing multiple sources so that they end up with the same columns), it will simply be built by adding a new column whenever a new word is seen.
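A small sketch of what that mapping looks like in practice (toy documents, not your data): fitting a vectorizer and printing vocabulary_ shows each word paired with the column index it was assigned.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat down"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # shape: (2 documents, 5 distinct words)
print(vec.vocabulary_)        # e.g. {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}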

How to filter out non-English data from csv using pandas

I'm currently writing code to extract frequently used words from my csv file, and it works just fine until I get a bar plot of strange words. I don't know why, probably because there are some foreign words involved, but I don't know how to fix this.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252', na_values=" NaN")
data.shape
data['text'] = data.fillna({'text': 'none'})

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuation with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

# Apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)

# Removing stopwords -- extract the stopwords
# extracting the stopwords from the nltk library
sw = stopwords.words('english')
# displaying the stopwords
np.array(sw)

# function to remove stopwords
def stopwords(text):
    '''a function for removing stopwords'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(text)

# Apply the function to each example
data['text'] = data['text'].apply(stopwords)
data.head(10)

# Top words before stemming
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()

# store the vocab and counts in a pandas Series
vocab = []
count = []
# iterate through each vocab item and count, appending the values to the designated lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# store the counts in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
# sort the series
vocab_bef_stem = vocab_bef_stem.sort_values(ascending=False)

# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind='barh', figsize=(5, 10), xlim=(1000, 5000))
I want a list of frequent words ordered in a bar plot, but for now it just gives me non-English words, all with the same frequency. Please help me out.
The problem is that you are not sorting your vocabulary by counts but by the unique IDs created by the count vectorizer.
count_vectorizer.vocabulary_.items()
This does not contain the count of each feature; CountVectorizer does not store per-feature counts.
Hence you end up seeing the rarest/misspelled words from your corpus in the plot (since these tend to get larger values, i.e. larger unique IDs). The way to get the word counts is to apply transform on your text data and sum the counts of each word over all documents.
By default, the vectorizer removes punctuation, and you can also feed it a list of stop words to remove. Your code can be reduced as follows.
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]

sw = stopwords.words('english')

count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)

# sum the counts of each word over all documents
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5, 5), xlim=(0, 7))
Instead of corpus, plug in your text data column. The output of the above snippet is a bar plot of the word counts.

computing cosine-similarity between all texts in a corpus

I have a set of documents stored in a JSON file. I retrieve them using the following code, so that they are stored in the variable data:
import json

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
Combining the texts to form the corpus is done by:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
So far, these are pretty straightforward manipulations. To build the tf-idf matrix I use the following lines of code, which remove stop words and punctuation, stem each term, and tokenize the data.
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Function that, using the first function, converts all words to lower case and removes punctuation (via the map specified previously)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## Lastly, a vectorizer that combines all the previous steps plus stop-word removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
I then try to apply this vectorizer to the corpus like so:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
But nothing happens. Any idea how to proceed?
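For reference, a minimal sketch of computing the full pairwise matrix with scikit-learn's cosine_similarity, assuming tfidf has been computed as above (this should match tfidf * tfidf.T, since TfidfVectorizer L2-normalizes rows by default):

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tfidf)   # (n_docs, n_docs) pairwise matrix
print(similarities.shape)
print(similarities[0, 1])                 # similarity between the first two documents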
Kind regards
