On what basis is my code vectorizing & clustering data? - python

I am reading input from a text file and want to build a semantic vocabulary; for now, without a vocabulary, I am just passing a token list of words. But I cannot figure out on what basis the vectorization and clustering happen when no vocabulary is set. The documentation says: "If not given, a vocabulary is determined from the input documents." However, I am only using one txt file as input.
I have tried to create a vocabulary out of the WordNet synonym sets but have not gotten anywhere.
import string
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from nltk.corpus import wordnet
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

src = open('Sample.txt', 'r')
pageData = src.read().splitlines()

# preprocessing
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)  # tokenizing based on words
    return tokens

filter_data = clean_text(pageData)

# Feature Extraction
# checking the words in filter_data to find relevance
Tfidf_vectorizer = TfidfVectorizer(tokenizer=clean_text, analyzer='char',
                                   use_idf=True, stop_words=stop_words)
Tfidf_matrix = Tfidf_vectorizer.fit_transform(filter_data)
terms = Tfidf_vectorizer.get_feature_names()

# Clustering
km = KMeans(n_clusters=5, n_jobs=-1)
distances = km.fit_transform(Tfidf_matrix)  # fit_transform returns distances to the cluster centres
clusters = km.labels_.tolist()
X = Tfidf_matrix.todense()

The vocabulary here is a mapping of words to columns.
If you don't predefine a vocabulary (which is necessary when you process multiple sources and need the same columns), it is simply built on the fly: a new column is added whenever a new word is seen.
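For illustration, here is a minimal sketch (toy documents, not the asker's Sample.txt) of both behaviours: letting TfidfVectorizer learn vocabulary_ from the input versus fixing the columns with vocabulary=. Depending on your scikit-learn version, get_feature_names() may need to be get_feature_names_out().
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# No vocabulary given: it is learned from the documents themselves
vec_auto = TfidfVectorizer()
vec_auto.fit(docs)
print(vec_auto.vocabulary_)            # word -> column index, e.g. {'cat': 0, 'dog': 1, ...}

# Predefined vocabulary: only these words become columns, in this fixed order
vec_fixed = TfidfVectorizer(vocabulary=["cat", "dog", "mat"])
X = vec_fixed.fit_transform(docs)
print(vec_fixed.get_feature_names())   # ['cat', 'dog', 'mat']
print(X.shape)                         # (2, 3): the same three columns for any corpus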

Multilabel text classification/clustering in python with tf-idf

I have five plain-text documents in a directory that are already clustered by content and named cluster1.txt, cluster2.txt and so on; they serve as my corpus. They have no other labels, just those file names.
My task is to cluster a new text document with new sentences, but not the document as a whole: I should assign each sentence to one of these 5 clusters/classes, and also produce a confusion matrix with recall and precision scores to show how similar the sentences are to the clusters.
I first tried kNN and then k-means, but I think my logic is flawed, since this is not really a clustering problem but a classification problem, right?
At least I preprocessed the text (stop-word removal, lemmatizing, lowercasing, tokenizing) and then computed the term frequencies with a CountVectorizer, and from that the tf-idf.
I'm having trouble with the overall logic of this problem.
Anyway, this is what I have tried so far, but now I'm stuck; any help is appreciated.
import glob
import os
import re
import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

file_list = glob.glob(os.path.join(os.getcwd(), 'C:/Users/ds191033/FH/Praktikum/Testdaten/Clusters', "*.txt"))

corpus = []
for file_path in file_list:
    with open(file_path, encoding="utf8") as f_input:
        corpus.append(f_input.read())

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('german')
lem = WordNetLemmatizer()

def normalize_document(doc):
    # lower case and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # lemmatize
    lemmatized_tokens = [lem.lemmatize(t) for t in filtered_tokens]
    # re-create document from filtered tokens
    doc = ' '.join(lemmatized_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
norm_corpus

cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

# get all unique words in the corpus
vocab = cv.get_feature_names()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
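One possible way to proceed, as a rough sketch rather than a definitive solution (it reuses normalize_document and norm_corpus from above; nltk.sent_tokenize and the file name new_doc.txt are placeholders): fit the TfidfVectorizer on the five normalized cluster documents, transform each new sentence with the same fitted vectorizer, and assign the sentence to the cluster whose document vector is most cosine-similar.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk import sent_tokenize

# tf-idf vectors of the five normalized cluster documents (one row per cluster)
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
cluster_matrix = tv.fit_transform(norm_corpus)

# split the new document into sentences and normalize them the same way
with open('new_doc.txt', encoding='utf8') as f:   # assumed file name
    sentences = sent_tokenize(f.read(), language='german')
norm_sentences = [normalize_document(s) for s in sentences]

# transform (do NOT refit) and pick the most similar cluster for each sentence
sentence_matrix = tv.transform(norm_sentences)
similarities = cosine_similarity(sentence_matrix, cluster_matrix)
predicted = similarities.argmax(axis=1)           # 0..4 -> cluster1.txt..cluster5.txt

for sentence, label in zip(sentences, predicted):
    print("cluster%d: %s" % (label + 1, sentence[:60]))
A confusion matrix with precision and recall (e.g. sklearn.metrics.confusion_matrix and classification_report) only becomes meaningful once you have true labels for the new sentences to compare these predictions against.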

Quick way to get document vector using GloVe

Problem
I am trying to use GloVe to represent an entire document. However, GloVe was designed to produce word embeddings. One way to get a document embedding is to average all the word embeddings in the document.
I am following the solution posted here to load the GloVe look-up table. However, when I try to get the document embeddings, the runtime is extremely slow (about 1 s per document, for more than 1 million documents).
I am wondering if there is any way to accelerate this process.
The GloVe look-up table can be downloaded here, and the following is the code I use to get the document embeddings. The data is stored in a pd.DataFrame with a review column.
Note that some words in text_processed_list may not be present in the look-up table, which is why the try/except comes into play.
import numpy as np
import pandas as pd
import string
import csv
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

remove_list = stopwords.words('english') + list(string.punctuation)

X = np.zeros((dataset_size, 300))
glove_model = pd.read_table("glove.42B.300d.txt", sep=" ", index_col=0,
                            header=None, quoting=csv.QUOTE_NONE)

for iter in range(dataset_size):
    text = data.loc[iter, "review"]
    text_processed_list = [word for word in word_tokenize(text.lower())
                           if word not in remove_list]
    for word in text_processed_list:
        try:
            X[iter] += glove_model.loc[word].values
        except KeyError:
            pass
    X[iter] /= len(text_processed_list)
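One likely speed-up, sketched below under the assumption that the bottleneck is the repeated glove_model.loc[word] lookups: convert the DataFrame once into a plain dict of numpy vectors and use a set for the stop-word filter (dataset_size, data and remove_list are carried over from the code above).
# Build a word -> vector dict once (a one-time cost); per-word dict lookups are
# far cheaper than repeated DataFrame .loc indexing inside the loop.
embeddings = dict(zip(glove_model.index, glove_model.values))
remove_set = set(remove_list)          # set membership test instead of a list scan

X = np.zeros((dataset_size, 300))
for i in range(dataset_size):
    text = data.loc[i, "review"]
    tokens = [w for w in word_tokenize(text.lower()) if w not in remove_set]
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    if vectors:                        # skip empty reviews to avoid division by zero
        X[i] = np.mean(vectors, axis=0)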

sklearn TfidfVectorizer: How to make a few words only be part of bigrams in the features

I want TfidfVectorizer's featurization to use some predefined words, such as "script" and "rule", only as part of bigrams.
If I have the text "Script include is a script that has rule which has a business rule"
and for the above text I use
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
I should get
['script include', 'business rule', 'include', 'business']
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Given a vocabulary, returns a filtered vocab which
# contains only n-grams with a token in include_list and
# no stop words
def filter_vocab(full_vocab, include_list):
    b_list = list()
    for x in full_vocab:
        add = False
        for t in x.split():
            if t in text.ENGLISH_STOP_WORDS:
                add = False
                break
            if t in include_list:
                add = True
        if add:
            b_list.append(x)
    return b_list

# Get all the ngrams (one can also use nltk.util.ngrams)
ngrams = TfidfVectorizer(ngram_range=(1, 2), norm=None, smooth_idf=False, use_idf=False)
X = ngrams.fit_transform(["Script include is a script that has rule which has a business rule"])
full_vocab = ngrams.get_feature_names()

# filter the full ngram-based vocab
filtered_v = filter_vocab(full_vocab, ["include", "business"])

# Get tf-idf using the new filtered vocab
vectorizer = TfidfVectorizer(ngram_range=(1, 2), vocabulary=filtered_v)
X = vectorizer.fit_transform(["Script include is a script that has rule which has a business rule"])
v = vectorizer.get_feature_names()
print(v)
The code is commented to explain what it is doing.
Basically, you are looking to customize the n-gram creation based on your special words (I call them interested_words in the function). I have customized the default n-gram creation function for your purpose.
def custom_word_ngrams(tokens, stop_words=None, interested_words=None):
    """Turn tokens into a sequence of n-grams after stop-word filtering"""
    original_tokens = tokens
    stop_wrds_inds = np.where(np.isin(tokens, stop_words))[0]

    # unigrams: drop stop words and the "interested" words
    tokens = [w for w in tokens if w not in stop_words + interested_words]
    n_original_tokens = len(original_tokens)

    # bind method outside of loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join

    # bigrams: keep every adjacent pair that contains no stop word
    for i in range(n_original_tokens - 1):
        if not any(np.isin(stop_wrds_inds, [i, i + 1])):
            tokens_append(space_join(original_tokens[i: i + 2]))

    return tokens
Now we can plug this function into the usual analyzer of TfidfVectorizer, as follows:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import text

def analyzer():
    base_vect = CountVectorizer()
    stop_words = list(text.ENGLISH_STOP_WORDS)
    preprocess = base_vect.build_preprocessor()
    tokenize = base_vect.build_tokenizer()

    # feed your special-words list here
    return lambda doc: custom_word_ngrams(
        tokenize(preprocess(base_vect.decode(doc))), stop_words, ['script', 'rule'])

vectorizer = TfidfVectorizer(analyzer=analyzer())
vectorizer.fit(["Script include is a script that has rule which has a business rule"])
vectorizer.get_feature_names()
['business', 'business rule', 'include', 'script include']
TfidfVectorizer allows you to provide your own tokenizer, so you can do something like the following. But you will lose the information about the other words in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Script include is a script that has rule which has a business rule"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                             tokenizer=lambda doc: ["script", "rule"],
                             stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

How to filter out non-English data from csv using pandas

I'm currently writing code to extract frequently used words from my csv file, and it works just fine until I get a bar plot of strange words. I don't know why; probably because there are some foreign words involved. However, I don't know how to fix this.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252', na_values=" NaN")
data.shape
data['text'] = data['text'].fillna('none')

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuation with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

# Apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)

# Removing stopwords -- extract the stopwords
# extracting the stopwords from the nltk library
sw = stopwords.words('english')
# displaying the stopwords
np.array(sw)

# function to remove stopwords
def stopwords(text):
    '''a function for removing stopwords'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(text)

# Apply the function to each example
data['text'] = data['text'].apply(stopwords)
data.head(10)

# Top words before stemming
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()

# store the vocab and counts in designated lists
vocab = []
count = []
# iterate through each vocab item and append key and value to the lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)

# store the count in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
# sort the series
vocab_bef_stem = vocab_bef_stem.sort_values(ascending=False)

# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind='barh', figsize=(5, 10), xlim=(1000, 5000))
I want a bar plot of the most frequent words, ordered by frequency, but for now it just shows non-English words, all with the same frequency. Please help me out.
The problem is that you are not sorting your vocabulary by counts but by the unique IDs that the count vectorizer assigns.
count_vectorizer.vocabulary_.items()
does not contain the count of each feature; CountVectorizer does not store counts there, only a word-to-column-index mapping.
That is why the plot shows rare or misspelled words from your corpus: they simply happen to receive the largest IDs. To get the actual word counts, apply transform to your text data and sum the counts of each word over all documents.
By default the vectorizer's tokenizer already strips punctuation, and you can pass a list of stop words for it to remove, so your code can be reduced as follows.
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]

sw = stopwords.words('english')
count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)

# sum the counts of each word over all documents, indexed by the feature names
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5, 5), xlim=(0, 7))
Plug your text data column in place of corpus. The output of the above snippet is a bar chart of the words sorted by their corpus-wide counts.

computing cosine-similarity between all texts in a corpus

I have a set of documents stored in a JSON file. I read them with the following code, so they end up in a list called data:
import json

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
Combining the texts to form the corpus is done by:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
So far, these are pretty straightforward manipulations. To build the tf-idf I use the following lines of code, which remove stop words and punctuation, stem each term, and tokenize the data.
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()

# removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Function that, building on the first one, converts all words to lower case and removes punctuation (via the map defined above)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## Lastly, a vectorizer that combines the previous functions and also removes stop words
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
I then try to apply this to the corpus like so:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
But nothing happens. Any idea how to proceed?
Kind regards
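As a hedged sketch of what usually comes next (using scikit-learn's cosine_similarity on the tf-idf matrix built by the vectorizer above; since tf-idf rows are L2-normalised by default, this is equivalent to the tfidf * tfidf.T product in the question):
from sklearn.metrics.pairwise import cosine_similarity

tfidf = vectorizer.fit_transform(corpus)

# full pairwise similarity matrix: entry [i, j] is the cosine similarity
# between document i and document j (the diagonal is 1.0)
sim_matrix = cosine_similarity(tfidf)
print(sim_matrix.shape)     # (len(corpus), len(corpus))
print(sim_matrix[0, 1])     # similarity between the first two documents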
