TF-IDF Weighting after NLTK pre-processing - python

I am doing some textual preprocessing prior to machine learning. I have two features (pandas Series) - abstract and title - and use the following function to preprocess the data (giving a numpy array, where each row contains the features for one training example):
def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(xlate))
        filtered = [word for word in tokens if word not in stopwords]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    print(Counter(sum([list(x) for x in preprocessed], [])))
    return np.array(preprocessed)
I now need to use TF-IDF to weight the features - how can I do this?

From what I see, you have a list of filtered words in the preprocessed variable. One way to do the TF-IDF transformation is to use scikit-learn's TfidfVectorizer. However, the class tokenizes on whitespace for you, i.e. you should provide a list of processed documents where each one is a string. So you have to edit your code to:
preprocessed.append(' '.join([stemmer.stem(item) for item in filtered]))
Then you can transform the list of documents as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_model = TfidfVectorizer() # specify parameters here
X_tfidf = tfidf_model.fit_transform(preprocessed)
The output will be a matrix in compressed sparse row (CSR) format, which you can convert to a numpy array later on.
tfidf_model.vocabulary_ will contain a dictionary mapping the stemmed words to their column ids.
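For example, a minimal sketch of converting the sparse output to a dense array and looking up one stemmed term's column (the key 'studi' is just an illustrative stem, not taken from your data):
X_dense = X_tfidf.toarray()              # dense numpy array, shape (n_documents, n_terms)

# vocabulary_ maps each stemmed term to its column index in X_dense;
# 'studi' is a hypothetical stem - use any key that exists in your vocabulary
col = tfidf_model.vocabulary_.get('studi')
if col is not None:
    print(X_dense[:, col])               # TF-IDF weight of that stem in every document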

Related

Get Bert Embeddings for every Token in a Sentence

I have a dataframe in Python in which I have a column of textual data. I need to run a loop where I take each row in that textual column and get the BERT embedding for every token in that particular row. I then need to append those vector embeddings and try them out for some purpose.
e.g " My name is Obama"
get 768 vector embedding for 'My'
get 768 vector embedding for 'name'
get 768 vector embedding for 'is'
get 768 vector embedding for 'Obama'
final output: vector embedding of size 768*4 = 3072
assume every row has exactly the same number of words present
I believe that you are trying to bring context-based embeddings for the individual words of a sentence into the picture, instead of fixed vectors like those of GloVe.
Your approach should be:
Tokenize your paragraphs into individual sentences (look at sentence tokenizers or SBD (sentence boundary detection) methods if applicable).
Now, for each of the sentences that constitute a paragraph, get the embeddings for the words.
Average these across the paragraph so that you get vectors of consistent shape across multiple paragraphs (in your case a dataframe cell, which is essentially a paragraph).
pip install sentence-transformers
Once installed:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Look at the embedding vectors and at aggregation techniques around embeddings.
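As a minimal sketch of the averaging step above (one fixed-size vector per paragraph; the 768 dimension is what the model above produces):
import numpy as np

# embeddings from model.encode(...) has shape (n_sentences, embedding_dim);
# averaging over sentences gives one vector per paragraph, regardless of sentence count
paragraph_embedding = np.mean(embeddings, axis=0)
print(paragraph_embedding.shape)    # e.g. (768,) for the model above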

How to apply tf-idf to rows of text

I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df["punc_blurb"] = df["blurb"].apply(remove_punctuations)
df = pd.DataFrame(df["punc_blurb"])

vectoriser = TfidfVectorizer()
x = vectoriser.fit_transform(df["punc_blurb"])
df["blurb_Vect"] = list(x.toarray())

df_vectoriser = pd.DataFrame(x.toarray(),
                             columns=vectoriser.get_feature_names())
print(df_vectoriser)
All I get is a massive list of numbers, and I am not even sure anymore whether it's the TF or the TF-IDF it is giving me, as the frequent words (the, and, etc.) all have a score of more than 0.
The goal is to see the weights in the tf-idf column shown below and I am unsure if I am doing this in the most efficient way:
Goal Output table
You don't need the punctuation remover if you use TfidfVectorizer. It will take care of punctuation automatically, by virtue of the default token_pattern parameter:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"blurb": ["this is a sentence", "this is, well, another one"]})

vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"].values.astype("U")).toarray())

vocab = sorted(vectorizer.vocabulary_.keys())
df["tf_idf_dic"] = df["tf_idf"].apply(lambda x: {k: v for k, v in dict(zip(vocab, x)).items() if v != 0})

Calculate TF-IDF in Python 2.7 (with three lines of code). Does this code work?

I'm trying to calculate the tfidf value on a corpus of about 7000 documents.
Searching on the internet, I found a lot of examples (many of them locked up when I tried to create the unique-words matrix for each document). The only one that seems to work is the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(corpus)
df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf.get_feature_names())
print(df_tfidf)
Assuming the following corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']
It produced a DataFrame of tf-idf weights as output (not reproduced here).
This code also works with my case, and in fact it produces a matrix with 7180 rows and 10390 columns. But I'm not sure if it's correct. In your opinion, is this a valid solution for calculating tf-idf for a set of documents?
P.S.: can I insert the link to the guide that I followed?
Yes, this is the correct approach for calculating the tf-idf matrix.
You are using
x = tfidf.fit_transform(corpus)
which first fits your TfidfVectorizer to your corpus and then transforms the corpus accordingly, so that you get your tf-idf matrix as x.
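As a quick sanity check (a minimal sketch reusing x and tfidf from the snippet above), the matrix shape is (number of documents, number of unique terms), which is consistent with the 7180 x 10390 you observed:
print(x.shape)                 # (4, 9) for the toy corpus: 4 documents, 9 unique terms
print(len(tfidf.vocabulary_))  # 9, i.e. one column per vocabulary term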

Is there a way to get only the IDF values of words using scikit or any other python package?

I have a text column in my dataset, and using that column I want to have the IDF calculated for all the words that are present. TF-IDF implementations in scikit-learn, like TfidfVectorizer, give me TF-IDF values directly rather than just the word IDFs. Is there a way to get word IDFs given a set of documents?
You can just use TfidfVectorizer with use_idf=True (the default value) and then extract the IDF values from the idf_ attribute.
from sklearn.feature_extraction.text import TfidfVectorizer
my_data = ["hello how are you", "hello who are you", "i am not you"]
tf = TfidfVectorizer(use_idf=True)
tf.fit_transform(my_data)
idf = tf.idf_
[BONUS] if you want to get the idf value for a particular word:
# If you want to get the idf value for a particular word, here "hello"
tf.idf_[tf.vocabulary_["hello"]]
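If you want the IDF of every word at once, a small follow-up sketch builds a word-to-IDF dictionary from vocabulary_, which maps each word to its index in idf_:
# Map every vocabulary word to its IDF value
idf_by_word = {word: tf.idf_[idx] for word, idx in tf.vocabulary_.items()}
print(idf_by_word["hello"])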

How to give more weight to Proper Nouns in scikit TfidfVectorizer

I am using scikit-learn's TfidfVectorizer to extract keywords from a list of scientific articles. There is an argument for stop_words, but I was wondering if I could give more weight/score to proper nouns such as "Bohr" or "Japan".
Will I have to implement my own custom tfidf vectorizer or can I still use this built in one?
tf = TfidfVectorizer(strip_accents='ascii',
                     analyzer='word',
                     ngram_range=(1, 1),
                     min_df=0,
                     stop_words=stopwords,
                     lowercase=True)
You can do your own postprocessing on the TF-IDF matrix for this.
First, look through the word indexes to find the indexes of all the proper nouns; then go through the matrix and increase the weights at those indexes.
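A minimal sketch of that postprocessing idea; the proper-noun set and boost factor below are illustrative assumptions (in practice you might get the proper nouns from a POS tagger), and the terms are lowercased to match lowercase=True:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Bohr proposed a model of the atom", "Japan hosted the conference"]
proper_nouns = {"bohr", "japan"}     # hypothetical list of proper nouns
boost = 2.0                          # hypothetical boost factor

tf = TfidfVectorizer(lowercase=True)
X = tf.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index; scale up the proper-noun columns
for word, col in tf.vocabulary_.items():
    if word in proper_nouns:
        X[:, col] *= boost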
