Gensim has a tutorial that shows how, given a document/query string, to find the other documents most similar to it, in descending order of similarity:
http://radimrehurek.com/gensim/tut3.html
It can also display the topics associated with an entire model:
How to print the LDA topics models from gensim? Python
But how do you find what topics are associated with a given document/query string? Ideally with some numeric similarity metric for each topic? I haven't been able to find anything on that.
If you want to find the topic distribution of unseen documents, you need to convert the document of interest into a bag-of-words representation:
from gensim import utils, models
from gensim.corpora import Dictionary
lda = models.LdaModel.load('saved_lda.model') # load saved model
dictionary = Dictionary.load('saved_dictionary.dict') # load saved dict
text = ' '
with open('document', 'r') as inp: # convert file to string
    for line in inp:
        text += line + ' '
tkn_doc = utils.simple_preprocess(text) # filter & tokenize words
doc_bow = dictionary.doc2bow(tkn_doc) # use dictionary to create bow
doc_vec = lda[doc_bow] # this is the topic probability distribution for the document of interest
From this code you get a sparse vector: a list of (topic_id, probability) pairs in which the topic ids run from 0 to n-1 and each 'weight' is the probability that the words in the document belong to that topic in the model.
You can visualize the distribution by creating a bar graph using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
y_axis = []
x_axis = []
for topic_id, prob in doc_vec: # doc_vec holds (topic_id, probability) pairs
    x_axis.append(topic_id + 1)
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, len(x_axis), 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)
plt.close()
If you want to see the topn terms in each topic you can print them like this. Referencing the graph, you can look up the topn words you printed and determine how the document was interpreted by the model.
You can also find the distance between two document probability distribution vectors by using vector calculations such as Hellinger distance, Euclidean distance, Jensen-Shannon divergence, etc.
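As a rough sketch of what such a comparison could look like, the following computes the Hellinger distance and the Jensen-Shannon divergence in plain Python. The helper names (to_dense, hellinger, jensen_shannon) and the two example vectors are hypothetical; the input format mirrors the (topic_id, probability) pairs that lda[doc_bow] returns.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand gensim-style [(topic_id, prob), ...] into a dense list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic vectors for two documents in a 4-topic model
doc_a = to_dense([(0, 0.7), (2, 0.3)], num_topics=4)
doc_b = to_dense([(0, 0.6), (1, 0.1), (2, 0.3)], num_topics=4)

print(hellinger(doc_a, doc_b))
print(jensen_shannon(doc_a, doc_b))
```

gensim also ships a ready-made gensim.matutils.hellinger, which may be more convenient than a hand-written version.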
I am trying to use BERTopic to analyze the topic distribution of documents. After BERTopic has been run, I would like to calculate the per-document probabilities for the respective topics. How should I do it?
# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10)
# train model
headline_topics, _ = model.fit_transform(df1.review_processed3)
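# examine one of the topics
freq = model.get_topic_info() # assumption: `freq` is the topic overview table; it is not defined in the original snippet
a_topic = freq.iloc[0]["Topic"] # Select the 1st topic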
model.get_topic(a_topic) # Show the words and their c-TF-IDF scores
Below are the words and their c-TF-IDF scores for one of the topics:
image 1
How should I change the result into Topic Distribution as below in order to calculate the topic distribution score and also identify the main topic?
image 2
First, to compute probabilities, you have to add to your model definition calculate_probabilities=True (this could slow down the extraction of topics if you have many documents, > 100000).
# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10,
                 calculate_probabilities=True)
Then, when calling fit_transform, you should save the probabilities:
headline_topics, probs = model.fit_transform(df1.review_processed3)
Now, you can create a pandas dataframe which shows probabilities under respective topics per document.
import pandas as pd
probs_df=pd.DataFrame(probs)
probs_df['main percentage'] = probs_df.max(axis=1) # highest topic probability per document
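To also identify which topic is the main one for each document, the row-wise argmax can be taken alongside the maximum. This is only a sketch: the small probs array below is made-up data standing in for the array returned by fit_transform, and it assumes that column i corresponds to topic i.

```python
import numpy as np

# Hypothetical stand-in for the `probs` array returned by fit_transform:
# one row per document, one column per topic.
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.30, 0.30, 0.40],
])

main_topic = probs.argmax(axis=1)   # column index of the most probable topic
main_prob = probs.max(axis=1)       # probability of that topic

for doc_id, (topic, p) in enumerate(zip(main_topic, main_prob)):
    print(f"doc {doc_id}: main topic {topic} (p={p:.2f})")
```

The same two calls work on the real probs array, and the results can be added as extra columns to probs_df.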
I have a data set that contains comments from bird watchers. I used TF-IDF vectorizer to convert the text comments into vector features, and then ran K-means clustering to separate my data into clusters. I have a set of clear clusters. However, I have been trying to find a way to find out which words made it into which clusters. I am aware of how to get the feature labels/names, but I want to see the actual data points under each feature, and then convert them back to the original words. I am using Python and Scikit-Learn's K-means algorithm.
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def final_k_model(X, finalk):
    final_k_mod = KMeans(n_clusters=finalk, init='random', n_init=10, max_iter=300, tol=1e-04, random_state=0)
    final_k_mod.fit(X)
    # plot the results:
    centroids = final_k_mod.cluster_centers_
    tsne_init = 'pca'
    tsne_perplexity = 20.0
    tsne_early_exaggeration = 4.0
    tsne_learning_rate = 1000
    random_state = 1
    tsnemodel = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
                     early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
    transformed_centroids = tsnemodel.fit_transform(centroids)
    plt.figure(1)
    plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
    plt.savefig('plots\\cluster.png')
    plt.show()
    return final_k_mod
I included some code, but not sure if it helps as I don't have an error. I am just trying to figure out if this is even possible, I've been googling and looking at tutorials but haven't found it.
Assuming you calculated the X in your code by the following method,
#corpus = list of all documents
#vocab = list of all words in corpus
tdf_idf = TfidfVectorizer(vocabulary=vocab)
X = tdf_idf.fit_transform(corpus)
is the following what you are looking for?
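for centroid in centroids: # centroids = final_k_mod.cluster_centers_
    score_this_centroid = {}
    for word in tdf_idf.vocabulary_.keys():
        score_this_centroid[word] = centroid[tdf_idf.vocabulary_[word]]
    # score_this_centroid now maps every word to its weight in this centroid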
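Once each word has a score per centroid, the words that characterise a cluster are simply the highest-scoring ones. A minimal sketch of that ranking step follows; the tiny vocabulary and centroids are made-up stand-ins for tdf_idf.vocabulary_ and final_k_mod.cluster_centers_, and top_terms is a hypothetical helper name.

```python
# Hypothetical stand-ins: with the code above, `vocabulary` would be
# tdf_idf.vocabulary_ and `centroids` would be final_k_mod.cluster_centers_.
vocabulary = {"owl": 0, "hawk": 1, "nest": 2, "feeder": 3}
centroids = [
    [0.60, 0.30, 0.05, 0.00],
    [0.00, 0.10, 0.50, 0.45],
]

def top_terms(centroid, vocabulary, n=2):
    """Return the n (word, weight) pairs with the largest weight in one centroid."""
    index_to_word = {idx: word for word, idx in vocabulary.items()}
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [(index_to_word[i], centroid[i]) for i in ranked[:n]]

for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_terms(centroid, vocabulary))
```

Note that this ranks vocabulary terms by their TF-IDF weight in the cluster centre; it does not recover the original comments themselves, which would instead be looked up through the cluster labels of the fitted model.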
How do I measure or find the Zipf distribution? For example, I have a corpus of English words. How do I find its Zipf distribution? I need to find the Zipf distribution and then plot a graph of it, but I am stuck at the first step, which is to find the Zipf distribution.
Edit: From the frequency count of each word, it is clear that it obeys Zipf's law. But my aim is to plot a Zipf distribution graph, and I have no idea how to calculate the data for it.
I don't pretend to understand statistics. However, based upon reading from scipy site, here is a naive attempt in python.
Build Data
First we get our data. For example we download data from National Library of Medicine MeSH (Medical Subject Heading) ASCII file d2016.bin (28 MB).
Next, we open the file and read it into a string.
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
Next we locate individual words in the file and separate out words.
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
Finally we prepare a dict with unique words as key and word count as values.
frequency = {}
for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
Build zipf distribution data
For speed purpose we limit data to 1000 words.
n = 1000
frequency = {key:value for key,value in list(frequency.items())[0:n]}
After that we take the frequency values, convert them to a numpy array and compare their histogram with the Zipf probability mass function, following the example in the numpy.random.zipf documentation.
We use the distribution parameter a = 2. as an example; it needs to be greater than 1.
For visibility, we limit the data to 50 sample points.
a = 2. # distribution parameter
s = np.array(list(frequency.values()))
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zeta(a)
And finally plot the data.
Putting All Together
import re
from operator import itemgetter
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
#Get our corpus of medical words
frequency = {}
with open('d2016.bin', 'r') as open_file:
    file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
#build dict of words based on frequency
for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
#limit words to 1000
n = 1000
frequency = {key:value for key,value in list(frequency.items())[0:n]}
#convert value of frequency to numpy array
s = np.array(list(frequency.values()))
#Calculate zipf and plot the data
a = 2. # distribution parameter
count, bins, ignored = plt.hist(s[s<50], 50, density=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zeta(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
Plot
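A complementary way to get "the data for the distribution graph" is the classic rank-frequency view: sort the word counts in descending order and pair each count with its rank. The sketch below uses only the standard library; rank_frequency and zipf_slope are hypothetical helper names, and the sample sentence is made up.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Tiny made-up sample; in practice `words` would come from the corpus
words = "the cat and the dog and the bird saw the cat".split()
pairs = rank_frequency(words)
print(pairs)
print(zipf_slope(pairs))
```

Plotting the pairs on log-log axes (for example with plt.loglog) should give a roughly straight line for text that follows Zipf's law, with a slope near -1.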
I have been trying to create an image classifier in Python with OpenCV 3.2.0 using keypoints and the bag of words technique. After some reading I found that I could perform this as follows:
1. Extract image descriptors using AKAZE
2. Perform k-means clustering on the descriptors to generate the dictionary
3. Generate histograms of images based on dictionary
4. Train SVM using histograms
I managed to do steps 1 and 2 but have gotten stuck on steps 3 and 4.
I generated the histograms by using the labels returned by k-means clustering successfully (I think). However, when I wanted to use new test data that was not used to generate the dictionary, I had some unexpected results. I tried to use a FLANN matcher like in this tutorial, but the results I get from generating the histograms from the label data do not match the data returned from the FLANN matching.
I load up the images:
dictionary_size = 512
# Loading images
imgs_data = []
# imreads returns a list of all images in that directory
imgs = imreads(imgs_path)
for i in range(len(imgs)):
    # create a numpy array to hold the histogram for each image
    imgs_data.insert(i, np.zeros((dictionary_size, 1)))
I then create an array of descriptors (desc):
def get_descriptors(img, detector):
    # returns descriptors of an image
    return detector.detectAndCompute(img, None)[1]
# Extracting descriptors
detector = cv2.AKAZE_create()
desc = np.array([])
# desc_src_img is a list which says which image a descriptor belongs to
desc_src_img = []
for i in range(len(imgs)):
    img = imgs[i]
    descriptors = get_descriptors(img, detector)
    if len(desc) == 0:
        desc = np.array(descriptors)
    else:
        desc = np.vstack((desc, descriptors))
    # Keep track of which image a descriptor belongs to
    for j in range(len(descriptors)):
        desc_src_img.append(i)
# important, cv2.kmeans only accepts type32 descriptors
desc = np.float32(desc)
The descriptors are then clustered using k-means:
# Clustering
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 0.01)
flags = cv2.KMEANS_PP_CENTERS
# desc is a type32 numpy array of vstacked descriptors
compactness, labels, dictionary = cv2.kmeans(desc, dictionary_size, None, criteria, 1, flags)
Then I create histograms for each image using the labels returned from k-means:
# Getting histograms from labels
size = labels.shape[0] * labels.shape[1]
for i in range(size):
    label = labels[i]
    # Get this descriptors image id
    img_id = desc_src_img[i]
    # imgs_data is a list of the same size as the number of images
    data = imgs_data[img_id]
    # data is a numpy array of size (dictionary_size, 1) filled with zeros
    data[label] += 1
ax = plt.subplot(311)
ax.set_title("Histogram from labels")
ax.set_xlabel("Visual words")
ax.set_ylabel("Frequency")
ax.plot(imgs_data[0].ravel())
This outputs a histogram like this which is very evenly distributed and what I expect.
I then attempt to do the same thing on the same image but using FLANN:
matcher = cv2.FlannBasedMatcher_create()
matcher.add(dictionary)
matcher.train()
descriptors = get_descriptors(imgs[0], detector)
result = np.zeros((dictionary_size, 1), np.float32)
# flan matcher needs descriptors to be type32
matches = matcher.match(np.float32(descriptors))
for match in matches:
    visual_word = match.trainIdx
    result[visual_word] += 1
ax = plt.subplot(313)
ax.set_title("Histogram from FLANN")
ax.set_xlabel("Visual words")
ax.set_ylabel("Frequency")
ax.plot(result.ravel())
This outputs a histogram like this which is very unevenly distributed and does not match up with the first histogram.
You can view the full code and images on GitHub. Change "imgs_path" (line 20) to a directory with images before running it.
Where am I going wrong? Why are the histograms so different? How do I generate the histograms for new data using the dictionary?
As a side note I tried using the OpenCV BOW implementation but found another issue where it gave the error: "_queryDescriptors.type() == trainDescType in function cv::BFMatcher::knnMatchImpl" and that's why I am trying to implement it myself. If someone could provide a working example using Python OpenCV BOW and AKAZE then that would be just as good.
It seems that you cannot train a FlannBasedMatcher using a dictionary beforehand, as shown below:
matcher = cv2.FlannBasedMatcher_create()
matcher.add(dictionary)
matcher.train()
However you can pass the dictionary in when matching like this:
matcher = cv2.FlannBasedMatcher_create()
...
matches = matcher.match(np.float32(descriptors), dictionary)
I am not entirely sure why this is. Perhaps it's that the train method is only meant to be used by the match method, as hinted in this post.
Also according to the opencv docs the parameters for match are:
queryDescriptors – Query set of descriptors.
trainDescriptors – Train set of descriptors. This set is not added to the train descriptors collection stored in the class object.
matches – Matches. If a query descriptor is masked out in mask , no match is added for this descriptor. So, matches size may be smaller than the query descriptors count.
So I guess you are just supposed to pass the dictionary in as trainDescriptors because that is what it is.
If anyone could shed more light on this it would be appreciated.
Here are the results after using the above method:
You can see the full updated code here.
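For comparison, the histogram for new data can also be built without a matcher by assigning every descriptor to its nearest dictionary entry by brute force, which is the same assignment rule k-means uses. The sketch below is not the code from the linked repository; bow_histogram is a hypothetical helper and the 2-D arrays are made-up stand-ins for float32 AKAZE descriptors and the k-means centres.

```python
import numpy as np

def bow_histogram(descriptors, dictionary):
    """Count, for each visual word, how many descriptors are closest to it."""
    descriptors = np.asarray(descriptors, dtype=np.float32)
    dictionary = np.asarray(dictionary, dtype=np.float32)
    # Squared Euclidean distance from every descriptor to every cluster centre
    diffs = descriptors[:, None, :] - dictionary[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)          # index of the closest visual word
    return np.bincount(nearest, minlength=len(dictionary))

# Made-up 2-D data standing in for AKAZE descriptors and k-means centres
dictionary = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 0.5], [9.0, 9.5], [0.2, 0.1], [0.5, 9.0]])
print(bow_histogram(descriptors, dictionary))
```

The broadcasting step builds an array of shape (descriptors, words, dimensions), so for large images or dictionaries it may be necessary to process the descriptors in chunks.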
I've generated a 100D word2vec model using my domain text corpus, merging common phrases, for example (good bye => good_bye). Then I've extracted 1000 vectors of desired words.
So I have a 1000 numpy.array like so:
[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
[-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
...
...[1000 Vectors]
]
And words array like so:
["hello","hi","bye","good_bye"...1000]
I have run K-Means on my data, and the results I got made sense:
X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx,l in enumerate(kmeans.labels_):
    print(l,words[idx])
--- Output ---
0 hello
0 hi
1 bye
1 good_bye
0 = greeting 1 = farewell
However, some words made me think that hierarchical clustering is more suitable for the task. I've tried using AgglomerativeClustering. Unfortunately ... for this Python newbie, things got complicated and I got lost.
How can I cluster my vectors, so the output would be a dendrogram, more or less, like the one found on this wiki page?
I had the same problem until now!
I kept finding your post whenever I searched for this online (keyword = hierarchy clustering on word2vec),
so I wanted to give you a possibly valid solution.
sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]
import gensim
model = gensim.models.Word2Vec(sentences_split, min_count=2)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# model.wv.vectors / index_to_key are the gensim 4 names; older releases used syn0 / index2word
l = linkage(model.wv.vectors, method='complete', metric='seuclidean')
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')
dendrogram(
    l,
    leaf_rotation=90., # rotates the leaf labels
    leaf_font_size=16., # font size for the leaf labels
    orientation='left',
    leaf_label_func=lambda v: str(model.wv.index_to_key[v])
)
plt.show()
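If the vectors and the word list already exist as arrays, as in the question, the same idea can be applied to them directly without training a new model. This is only a sketch under assumptions: the 3-D vectors below are made up to stand in for the 100-D word2vec vectors, and average linkage with cosine distance is one reasonable choice rather than the only one.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Made-up 3-D vectors standing in for the 100-D word2vec vectors
words = ["hello", "hi", "bye", "good_bye"]
X = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [-1.0, 0.1, 0.9],
    [-0.9, 0.0, 1.0],
])

# Average-linkage clustering on cosine distance
Z = linkage(X, method="average", metric="cosine")

# no_plot=True returns the layout without drawing; drop it to draw with matplotlib
tree = dendrogram(Z, labels=words, orientation="left", no_plot=True)
print(tree["ivl"])                      # leaf labels in dendrogram order

# Cut the tree into two flat clusters
flat = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(words, flat)))
```

Passing labels=words puts the words on the leaves, and fcluster gives flat cluster ids comparable to the K-Means labels in the question.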