How to calculate per-document probabilities under respective topics with BERTopic? - python

I am trying to use BERTopic to analyze the topic distribution of documents. After BERTopic is fitted, I would like to calculate the probability of each topic per document. How should I do it?
# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10)
# train model
headline_topics, _ = model.fit_transform(df1.review_processed3)
# examine one of the topics
freq = model.get_topic_info()    # frequency table of the extracted topics
a_topic = freq.iloc[0]["Topic"]  # select the 1st topic
model.get_topic(a_topic)         # show the words and their c-TF-IDF scores
Below are the words and their c-TF-IDF scores for one of the topics:
[image 1: words and their c-TF-IDF scores for the selected topic]
How should I turn this result into a topic distribution like the one below, so that I can calculate the topic distribution scores and also identify the main topic of each document?
[image 2: desired per-document topic distribution]

First, to compute probabilities, you have to add calculate_probabilities=True to your model definition (this can slow down the extraction of topics if you have many documents, e.g. more than 100,000).
# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10,
                 calculate_probabilities=True)
Then, when calling fit_transform, save the probabilities:
headline_topics, probs = model.fit_transform(df1.review_processed3)
Now you can create a pandas DataFrame that shows the probability of each topic per document.
import pandas as pd

probs_df = pd.DataFrame(probs)                      # one row per document, one column per topic
probs_df['main percentage'] = probs_df.max(axis=1)  # probability of the dominant topic
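If you also want to identify the main topic of each document, you can take the column with the highest probability in each row (the column indices of probs correspond to the topic numbers). A minimal sketch on top of probs_df:
# topic index with the highest probability for each document
probs_df['main topic'] = probs_df.drop(columns=['main percentage']).idxmax(axis=1)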

Related

How to apply Target Encoding in test dataset?

I am working on a project, where I had to apply target encoding for 3 categorical variables:
merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform(np.mean)
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform(np.mean)
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform(np.mean)
I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).
How can I transfer my encoding from the training sample to the test set? I would greatly appreciate any help.
P.S. I hope it makes sense.
You need to save the mapping between each feature value and its mean target value if you want to apply it to the test dataset.
Here is a possible solution:
species_encoding = merged_data.groupby(['Species'])['WnvPresent'].mean().to_dict()
block_encoding = merged_data.groupby(['Block'])['WnvPresent'].mean().to_dict()
trap_encoding = merged_data.groupby(['Trap'])['WnvPresent'].mean().to_dict()
merged_data['SpeciesEncoded'] = merged_data['Species'].map(species_encoding)
merged_data['BlockEncoded'] = merged_data['Block'].map(block_encoding)
merged_data['TrapEncoded'] = merged_data['Trap'].map(trap_encoding)
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding)
test_data['BlockEncoded'] = test_data['Block'].map(block_encoding)
test_data['TrapEncoded'] = test_data['Trap'].map(trap_encoding)
This answers your question, but I want to add that this approach can be improved: directly using the mean of the target can make the model overfit the data.
There are many ways to improve target encoding; one of them is smoothing. Here is a link to an example: https://maxhalford.github.io/blog/target-encoding/
Here is an example:
m = 10  # weight of the prior; higher m means stronger smoothing
mean = merged_data['WnvPresent'].mean()  # global mean of the target
# Compute the number of values and the mean of each group
agg = merged_data.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
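The smoothed encoding can then be applied with .map just like before; for categories that appear only in the test set, you can fall back to the global mean. A small sketch reusing the names above:
merged_data['SpeciesEncoded'] = merged_data['Species'].map(species_encoding)
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding).fillna(mean)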
There are 2 open source Python libraries that offer this functionality off-the-shelf: Feature-engine and Category encoders.
Assuming that we have a training and a test set...
With Feature-engine it would work as follows:
from feature_engine.encoding import MeanEncoder
# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
We find the replacement values in the encoding_dict_ attribute as follows:
encoder.encoding_dict_
With category encoders it works as follows:
from category_encoders.target_encoder import TargetEncoder
# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
The replacement values can be found in the attribute mapping:
encoder.mapping
More details in the respective documentation:
MeanEncoder
TargetEncoder
Category Encoders' TargetEncoder also offers, out of the box, the smoothing suggested by @andrey-lukyanenko above.
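For illustration, the smoothing strength can be set directly when constructing the encoder (the value below is only an example, not a recommendation):
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'], smoothing=10)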

Partial fit or incremental learning for autoregressive model

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly have this functionality. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it, specifically the update method. Per their documentation: "Update the model fit with additional observed endog/exog values."
See the example below, adapted to your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# standard single fit on the first chunk of data
model = ARIMA(order=(12, 0, 0))
model.fit(data_1)
# update the fitted model with the newly observed values
model.update(data_2)
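After the update, forecasting works as usual; a brief usage sketch:
# forecast the next 7 observations with the updated model
forecast = model.predict(n_periods=7)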
I don't know how to achieve that with AutoReg, but I think it can be done somehow; you would need to manually evaluate the results or somehow add the data yourself.
In ARIMA and SARIMAX, however, it is already implemented and it's simple.
For incremental learning there are three related functions, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can optionally refit. I don't know the exact difference between them, though.
Here is my example, which is different but similar...
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
# apply the fitted parameters to completely new data
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction)      # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction)  # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.
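If the two series really are consecutive chunks of the same process, append is probably closer to what the question asks for. A rough sketch under that assumption (refit=True re-estimates the parameters on the combined data; without it the old parameters are reused):
res = ARIMA(data, order=order).fit()
res_appended = res.append(new_data, refit=True)  # new_data is treated as observations following `data`
print(res_appended.forecast(7))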

Sentiment Analysis Feature Selection based on word to label correlation

In my sentiment analysis on a dataset of 194k review texts with labels (classes 1-5), I am trying to reduce the features (words) based on a word-to-label correlation, on which a classifier can then be trained.
Using sklearn.feature_extraction.text.CountVectorizer with its default parameters, I get 86.7k features. When performing fit_transform, I get a CSR sparse matrix, which I tried to put into a data frame using toarray().
Unfortunately, an array of size (194439, 86719) causes a MemoryError. I think I need it to be in a data frame in order to calculate the correlations with df.corr(). Below is my code:
corpus = data['reviewText']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
content = X.toarray()  # here comes the MemoryError
vocab = vectorizer.get_feature_names()
df = pd.DataFrame(data=X.toarray(), columns=vocab)
corr = pd.Series(df.corrwith(df['overall']) > 0.6)
new_vocab = df[corr[corr == True].index]  # should return the features that we want to use
Is there a way to filter by correlation without having to convert the sparse matrix into a data frame?
Most posts going in this direction of computing correlations on a data frame do not have to handle such a large amount of data.
I figured out that there are other ways to implement feature selection based on correlation: with SelectKBest and the scoring function f_regression, which work directly on the sparse matrix.
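A minimal sketch of that approach, assuming the label column is data['overall'] (as in the corrwith call above) and reusing X and vocab from the code above; k=2000 is only an example value:
from sklearn.feature_selection import SelectKBest, f_regression

y = data['overall']                           # the 1-5 label
selector = SelectKBest(f_regression, k=2000)  # keep the 2000 highest-scoring words
X_reduced = selector.fit_transform(X, y)      # works on the sparse matrix, no toarray() needed
selected_words = [w for w, keep in zip(vocab, selector.get_support()) if keep]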

Displaying topics associated with a document/query in Gensim

Gensim has a tutorial showing how, given a document/query string, to find which other documents are most similar to it, in descending order:
http://radimrehurek.com/gensim/tut3.html
It is also possible to display the topics associated with a model as a whole:
How to print the LDA topics models from gensim? Python
But how do you find what topics are associated with a given document/query string? Ideally with some numeric similarity metric for each topic? I haven't been able to find anything on that.
If you want to find the topic distribution of unseen documents, you need to convert the document of interest into a bag-of-words representation:
from gensim import utils, models
from gensim.corpora import Dictionary

lda = models.LdaModel.load('saved_lda.model')          # load saved model
dictionary = Dictionary.load('saved_dictionary.dict')  # load saved dict

text = ' '
with open('document', 'r') as inp:  # convert file to string
    for line in inp:
        text += line + ' '

tkn_doc = utils.simple_preprocess(text)  # filter & tokenize words
doc_bow = dictionary.doc2bow(tkn_doc)    # use dictionary to create bow
doc_vec = lda[doc_bow]  # topic probability distribution for the document of interest
From this code you get a sparse vector whose indices represent the topics 0..n and whose 'weights' are the probabilities that the words in the document belong to each of those topics in the model.
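Note that lda[doc_bow] drops topics whose probability falls below a minimum threshold; if you want the full distribution over all topics (which makes the bar graph below easier to build), you can ask for it explicitly:
doc_vec = lda.get_document_topics(doc_bow, minimum_probability=0)  # (topic_id, probability) for every topic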
You can visualize the distribution by creating a bar graph using matplotlib.
import numpy as np
import matplotlib.pyplot as plt

y_axis = []
x_axis = []
for topic_id, prob in doc_vec:  # doc_vec is a list of (topic_id, probability) pairs
    x_axis.append(topic_id + 1)
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, len(x_axis), 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)  # output_path: where to save the figure
plt.close()
If you want to see the top-n terms in each topic, you can print them as shown in the sketch below. Referencing the graph, you can then look up the printed words and see how the document was interpreted by the model.
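A brief sketch of printing the top words per topic (topn is adjustable):
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=10))  # list of (word, probability) pairs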
You can also compute distances between two document probability distributions using measures such as the Hellinger distance, the Euclidean distance, the Jensen-Shannon divergence, etc.
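For example, gensim ships a Hellinger implementation; a small sketch, where doc_bow_1 and doc_bow_2 stand for the bag-of-words vectors of two documents:
from gensim.matutils import hellinger

distance = hellinger(lda[doc_bow_1], lda[doc_bow_2])  # 0 means identical topic distributions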

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best features for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

train = pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per selected word per document; however, I have no idea which words were chosen, and methods like get_feature_names() are unavailable for the SelectPercentile class.
This is necessary because I need to add these features to a bunch of numeric features and only then do my training and predictions.
selector.get_support() gets you a boolean array of the columns that fall within the percentile range you specified.
train.columns.values gets you the complete list of column names of the original dataframe.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it is hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)  # "y" is the target column here
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]  # names of the selected columns
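Applied to your TF-IDF setup specifically, the feature names come from the vectorizer rather than from DataFrame columns; roughly (get_feature_names() is named get_feature_names_out() in newer scikit-learn releases):
word_names = np.asarray(vectorizer.get_feature_names())  # vocabulary of the TfidfVectorizer
selected_words = word_names[selector.get_support()]      # words kept by SelectPercentile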
Reference:
about get_support
