Find nearest neighbors change algorithm - python

I am in the process of creating a recommender system that suggests the 20 most suitable songs to a user. I've trained my model and I'm ready to recommend songs for a given playlist! However, one issue I've encountered is that I need the embedding of that new playlist in order to find the closest relevant playlists in the embedding space using k-means.
To recommend songs, I first cluster the learned embeddings for all of the training playlists, and then select "neighbor" playlists for my given test playlist as all of the other playlists in that same cluster. I then take all of the tracks from these playlists and feed the test playlist embedding and these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to occur next in the given test playlist.
desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')

# grab the user embedding layer's weights from the trained model
mlp_user_embedding_weights = next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights()

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id, :]
one_user_vector = np.reshape(one_user_vector, (1, 32))

print('\nPerforming kmeans to find the nearest users/playlists...')
# cluster all users into 100 clusters to find similar users
kmeans = KMeans(n_clusters=100, random_state=0, verbose=0).fit(user_latent_matrix)
desired_user_label = kmeans.predict(one_user_vector)[0]  # predict() returns an array; take the single label
user_labels = kmeans.labels_

# collect all users that share the desired user's cluster
neighbors = []
for user_id, user_label in enumerate(user_labels):
    if user_label == desired_user_label:
        neighbors.append(user_id)
print('Found {0} neighbor users/playlists.'.format(len(neighbors)))

tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])
print('Found {0} neighbor tracks from these users.'.format(len(tracks)))

users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')

# and predict tracks for my user
results = model.predict([users, items], batch_size=100, verbose=0)
results = results.tolist()
print('Ranked the tracks!')

results_df = pd.DataFrame(np.nan, index=range(len(results)), columns=['probability', 'track_name', 'artist_name'])
print(results_df.shape)

# loop through and get the probability (of being in the playlist according to my model), the track, and the track's artist
for i, prob in enumerate(results):
    # results[i] corresponds to tracks[i], so look the track up by its index, not by the loop counter
    row = df[df['trackindex'] == tracks[i]].iloc[0]
    results_df.loc[i] = [prob[0], row['track_name'], row['artist_name']]
results_df = results_df.sort_values(by=['probability'], ascending=False)
results_df.head(20)
Instead of the code above, I would like to use this: https://www.tensorflow.org/recommenders/examples/basic_retrieval#building_a_candidate_ann_index or the official GitHub repository from Spotify: https://github.com/spotify/annoy.
Unfortunately, I don't know exactly how to use these so that the new program gives me the 20 most suitable tracks for a user.
What do I have to change?
Edit:
What I tried:
from annoy import AnnoyIndex
import random

desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights()

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id, :]
one_user_vector = np.reshape(one_user_vector, (1, 32))

t = AnnoyIndex(desired_user_id, one_user_vector)  # length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees
t.save('test.ann')

u = AnnoyIndex(desired_user_id, one_user_vector)
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
# Now how do I get the probability and the values?

You're almost there!
In your code starting with desired_user_id = 123, you have 4 main steps:
1. Retrieve the matrix of user embeddings (user_latent_matrix) from your saved model.
2. Find the user's cluster label (desired_user_label) with kmeans and list the other users in that cluster (neighbors). Users in the same cluster should listen to songs similar to yours.
3. List the songs that other users in the cluster like (tracks). Music that you like will be similar to music listened to by other people in your cluster. Steps 2 and 3 are just there to filter out 99% of all music, so you only have to run your model on the last 1%, to save time and money. Removing steps 2 and 3 and adding every song to tracks would still work (but take 100x longer).
4. Use the saved model to predict whether the songs liked by other users in the cluster are suitable for you (results_df).
Annoy is a drop-in replacement for finding the similar users (step 2). Instead of using kmeans to find the user's cluster and then finding the other users in that cluster, it uses a k-nearest-neighbors-style algorithm to find close users directly.
After you've found one_user_vector, replace step 2 (the kmeans block) with something like:
from annoy import AnnoyIndex

user_embedding_length = 32  # dimensionality of the user embeddings (matches the (1, 32) reshape above)

t = AnnoyIndex(user_embedding_length, 'angular')
# add the user embeddings to annoy (your annoy user ids will be the row indexes)
for user_id, user_embedding in enumerate(user_latent_matrix):
    t.add_item(user_id, user_embedding)
# build the forest
t.build(10)  # 10 trees
# save the forest for later if you're using this again and don't want to rebuild the trees every time
t.save('test.ann')
# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)
If you want to run your code again but don't want to rebuild the trees, and you have already built them once, replace step 2 with just:

from annoy import AnnoyIndex

user_embedding_length = 32  # must match the length used when the index was built

t = AnnoyIndex(user_embedding_length, 'angular')
# load the trees
t.load('test.ann')
# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)
After replacing the code in step 2, just run steps 3 and 4 as normal.

In the existing code you really have two prediction steps: one for finding the 100 nearest neighbor users, and another for ranking all the tracks of those users in relation to the current user. In general, you should decide which of these steps (or both) you want to replace with the Annoy algorithm.
Looking at the example code from GitHub, you do not need the t = AnnoyIndex ... part here; it just creates some sample data to demonstrate the usage.
u = AnnoyIndex(f, metric) needs the number of dimensions as the input parameter f and one of the values "angular", "euclidean", "manhattan", "hamming", or "dot" as the metric.
From your question I cannot tell what the number of dimensions is in your case, and you will probably have to experiment with the metric yourself to find out what yields the best results.
After that, you have to import your data into the AnnoyIndex object, which will probably have to be derived from user_latent_matrix and/or users/items.
Finally, you should be able to retrieve the 20 nearest neighbors by running u.get_nns_by_item(i, 20), where i is the id of the user or track in question. Setting include_distances=True will also give you the corresponding distances (note these are distances, not probabilities as in your approach).
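Put together, a minimal sketch of that flow (assuming the 32-dimensional user_latent_matrix from the question; 'angular' is just one metric choice to experiment with):

from annoy import AnnoyIndex

f = 32  # number of embedding dimensions in user_latent_matrix
u = AnnoyIndex(f, 'angular')
for user_id, vector in enumerate(user_latent_matrix):
    u.add_item(user_id, vector)
u.build(10)  # 10 trees

# ids and distances of the 20 nearest users (smaller distance = more similar, not a probability)
neighbor_ids, distances = u.get_nns_by_item(desired_user_id, 20, include_distances=True)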
Hope this gives you some hints for getting ahead.

Related

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. I have 3 different datasets: one with 15k documents, another with 50k documents, and a last one with 2m documents. I am trying to test models with different numbers of topics (k), ranging from 5 to 100 with a step size of 5, in order to check which k results in the best model, assessed initially by the highest coherence score. For each k, I also build 3 different models with chunksizes 10, 100 and 1000.
So now, moving on to the problem I am having: obviously my own machine is too slow and does not have enough cores for this kind of computation, hence I am using my university's server. The problem is that my program seems to consume too much memory, and I am unsure of the reason. I already made some adjustments so that the corpus is not loaded entirely into memory (or at least I think I did). The dataset with 50k entries had already consumed the allotted 100 GB of memory at iteration k=50 (so halfway), which seems excessive.
I would appreciate any pointers in the right direction, and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. The comments in the file are a bit outdated, sorry about that.
class MyCorpus:
    texts: list
    dictionary: dict

    def __init__(self, descriptions, dictionary):
        self.texts = descriptions
        self.dictionary = dictionary

    def __iter__(self):
        for line in self.texts:
            try:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line)
            except StopIteration:
                pass
# Function that, given a dataframe, creates a dictionary and a corpus.
# These are used to create an LDA model. Here we automatically use the Description column
# from each dataframe.
def create_dict_and_corpus(df):
    text_descriptions = remove_characters_and_create_list(df, 'Description')
    # print(text_descriptions)
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    corpus = MyCorpus(text_descriptions, dictionary)
    return text_descriptions, dictionary, corpus
# Given a dataframe and a column name in the dataframe, extract all words and return a list.
# Also removes all characters that are not alphanumeric or spaces.
def remove_characters_and_create_list(df, column_name, split=True):
    df[column_name] = df[column_name].astype(str)
    texts = []
    for x in range(df[column_name].size):
        current_string = df[column_name][x]
        filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
        if split:
            texts.append(filtered_string.split())
        else:
            texts.append(filtered_string)
    return texts
# This function, given the parameters, creates an LDA model for each number of topics between
# the start limit and the end limit. After this, the coherence and perplexity are calculated
# for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
    results = []
    print("============Starting topic modelling============")
    for k in range(start_limit, end_limit + 1, 5):
        for p in range(1, 4):
            chunk = pow(10, p)
            t0 = time.time()
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                                num_topics=k,
                                                                id2word=dictionary,
                                                                passes=p,
                                                                chunksize=chunk)
            # To calculate the goodness of the model
            perplexity = lda_model.bound(corpus)
            coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            t1 = time.time()
            print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
            results.append((k, chunk, coherence_lda, perplexity))
    # Storing the results in a csv file, except the actual lda model (this would not make sense)
    path = make_dir_if_not_exists(path)
    list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
    return results
# Function to plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path.
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"{path}/visualization.html")

# Given the results produced by test_lda_models, loop through the models and save the
# topic words of each model and the visualization of the topics in the given path.
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
    list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
    single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)

# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from the start to the end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
    # Extracting the necessary data required for LDA model computation
    text_descriptions, dictionary, corpus = create_dict_and_corpus(df)
    results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
    # Sorting the results based on the coherence value (index 2 of each tuple)
    results_lda.sort(key=lambda x: x[2], reverse=True)
    # Getting the top 5 results to pass to the save_lda_result function
    results = results_lda[:5]
    corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
    texts = remove_characters_and_create_list(df, 'Description', split=False)
    # Perform application-to-topic modelling for the best lda models based on the
    # coherence score (TODO maybe test with other lda models?)
    print("getting descriptions for csv")
    for k, c, _, _ in results:
        dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
        p = int(math.log10(c))
        lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                            num_topics=k,
                                                            id2word=dictionary,
                                                            passes=p,
                                                            chunksize=c)
        print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
        save_lda_result(k, c, lda_model, corpus_for_saving, dictionary, dir_path)
        application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)
# Performs the whole topic modelling pipeline, taking different genre data sets
# and the entire dataset as a whole.
def perform_topic_modelling_pipeline(path_ex):
    # entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
    entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
    print("size of df")
    print(entire_df.shape)
    # For the entire df, go from the start limit to nGenres to find the best LDA model
    nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
    nGenres_rounded = math.ceil(nGenres / 5) * 5
    print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
    path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
    perform_topic_modelling_single_df(entire_df, 5, 100, path)

Collaborative Filtering Item-Based Recommender System Accuracy

I'm trying to find a way to measure the accuracy of my recommender system. The method I used was to create a KNN model based on a user × movie matrix (where the contents are the ratings that a given user gave to a given movie). Based on that model, I have a function where I can input a movie title and it returns the K most similar movies to the one I used as input. Having that, I don't know how to measure whether my model is accurate and whether the movies shown are really similar to the input movie. Any ideas?
Here is a sample of the dataset I'm using
def create_sparse_matrix(df):
    sparse_matrix = sparse.csr_matrix((df["rating"], (df["userId"], df["movieId"])))
    return sparse_matrix

# getting the transpose - data_cf is the dataFrame name that I'm using
user_movie_matrix = create_sparse_matrix(data_cf).transpose()

knn_cf = NearestNeighbors(n_neighbors=N_NEIGHBORS, algorithm='auto', metric='cosine')
knn_cf.fit(user_movie_matrix)

# Creating a function to get movie recommendations based on a movie input.
def get_recommendations_cf(movie_name, model):
    # Getting the ID of the movie based on its title
    movieId = data_cf.loc[data_cf["title"] == movie_name]["movieId"].values[0]
    distances, suggestions = model.kneighbors(user_movie_matrix.getrow(movieId).todense().tolist(), n_neighbors=10)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recomendações para {0}: \n'.format(movie_name))
        else:
            print('{0}: {1}, com distância de {2}:'.format(i, data_cf.loc[data_cf["movieId"] == suggestions.flatten()[i]]["title"].values[0], distances.flatten()[i]))
    return distances, suggestions
Calling the recommender function and showing the "distance" of each movie recommended
Translating:
"Recomendações para Spider-Man 2: " = "Recommendations for Spider-Man 2: "
"1: Spider-Man, com distância de 0.30051949781903664" = "1: Spider-Man, with distance of 0.30051949781903664"
...
"9: Finding Nemo, com distância de 0.4844064554284505:" = "9: Finding Nemo, with distance of 0.4844064554284505:"
When it comes to recommender systems, measuring performance is never a straightforward task, because there are many desirable characteristics we look for in a recommendation: accuracy, diversity, novelty, and so on, all of which can be measured in some way or another. There are many helpful articles on the web that cover the topic. I will link a few references that deal with precision specifically:
https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54
https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
Bear in mind that to do any sort of evaluation, you need to split your data into a train and a test set. In the case of recommender systems, since all users and all items must be represented in both the train and the test sets, you must use a stratified approach: set aside a percentage of the movie reviews of each user, instead of simply sampling lines of your dataset, as in the sketch below.
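A minimal sketch of such a per-user split with pandas (the DataFrame and column names here are illustrative, not from the question):

import pandas as pd

def stratified_split(ratings: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    # hold out a fraction of each user's ratings so every user appears in both sets
    test = (ratings.groupby('userId', group_keys=False)
                   .apply(lambda g: g.sample(frac=test_frac, random_state=seed)))
    train = ratings.drop(test.index)
    return train, test

Note this only guarantees that every user is in both sets; very rarely rated items may still end up in only one of them.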

Given an item, how do I get recommendation of users who have not rated this item?

My use case:
Given an item, I would like to get recommendations of users who have not rated this item.
I found this amazing Python library that can answer my use case:
python-recsys https://github.com/ocelma/python-recsys
The example is given as below.
Which users should see Toy Story? (e.g. which users -that have not rated Toy Story- would give it a high rating?)
svd.recommend(ITEMID)
# Returns: <USERID, Predicted Rating>
[(283, 5.716264440514446),
(3604, 5.6471765418323141),
(5056, 5.6218800339214496),
(446, 5.5707524860615738),
(3902, 5.5494529168484652),
(4634, 5.51643364021289),
(3324, 5.5138903299082802),
(4801, 5.4947999354188548),
(1131, 5.4941438045650068),
(2339, 5.4916048051511659)]
This implementation uses SVD to predict the ratings users would give, and returns the ids of the users with the highest predicted ratings among those who have not yet rated the movie.
Unfortunately, this library is written in Python 2.7, which is not compatible with my project.
I also found the Scikit Surprise library which has a similar example.
https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-k-nearest-neighbors-of-a-user-or-item
import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """
    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
Prints
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
How do I change the code to get an outcome like the python-recsys example above?
Thanks in advance.
This is just an implementation of the k-nearest neighbors algorithm. Take a look at how it works before you continue.
What's happening is that the second chunk of code you posted just classifies movies based on some metrics. The first bit is (probably) taking the already-seen movies and matching them up against all the existing classes. From there, it computes a similarity score and returns the highest.
So, take Beauty and the Beast. That's been classified as a children's cartoon. You compare the watched movies of your users to the full set of movies, and take the x users with the highest scores, indicating a high similarity between the set of movies that Beauty and the Beast falls into and the user's previously watched movies, but also where Beauty and the Beast is unwatched.
This is the math behind the algorithm: https://youtu.be/4ObVzTuFivY
I'm not sure if it's too late to answer, but I wanted to try the same thing and got this workaround using the Surprise package. I'm not sure if this is the right approach, though:
movieid = 1

# get the list of all the user ids
unique_ids = ratingSA['userID'].unique()
# get the list of the ids of users who have already rated the movie
iids1001 = ratingSA.loc[ratingSA['item'] == movieid, 'userID']
# remove the users who have already rated it from the candidates
users_to_predict = np.setdiff1d(unique_ids, iids1001)

# predicting for movie 1
algo = KNNBaseline(n_epochs=training_parameters['n_epochs'], lr_all=training_parameters['lr_all'], reg_all=training_parameters['reg_all'])
algo.fit(trainset)

my_recs = []
for uid in users_to_predict:
    # predict the rating this user would give to the movie
    my_recs.append((uid, algo.predict(uid=uid, iid=movieid).est))

recomend = pd.DataFrame(my_recs, columns=['userId', 'predictions']).sort_values('predictions', ascending=False).head(5)
recomend

In Surprise package for recommender systems, how to print out the recommended movies for a given user?

For many algorithms, for example SVD, the ready-made built-in functions are:
predictions = algo.fit(trainset).test(testset)
-- which prints the predicted rating scores for the test set (so for movies that users have already rated), and
predictions = algo.predict(uid, iid)
-- which predicts the rating score of item iid by user uid.
But how can I print the top N recommended movies for a user (including movies this user has not yet seen/rated)? I have tried algo.fit(trainset).test(data), but it gives me an error.
I have also tried using KNN in Surprise to print the k nearest neighbors of a user.
In the Surprise package example, there is the u.item file, but if I want to use my own data (one table that has uid, iid, and rating), how can I compute the "raw id" of a user and an item?
This code snippet, shared from the Surprise documentation FAQ, may be helpful.
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
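As for using your own uid/iid/rating table: Surprise can load it straight from a pandas DataFrame, and the raw ids are then simply the values in your uid and iid columns. A minimal sketch (the DataFrame contents and column names here are illustrative):

import pandas as pd
from surprise import Dataset, Reader, SVD

# your own ratings table with columns uid, iid, rating
df = pd.DataFrame({'uid': [1, 1, 2], 'iid': [10, 11, 10], 'rating': [4.0, 3.5, 5.0]})

reader = Reader(rating_scale=(1, 5))  # adjust to your rating scale
data = Dataset.load_from_df(df[['uid', 'iid', 'rating']], reader)
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)
# raw ids are just your original uid/iid values
print(algo.predict(uid=1, iid=11).est)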

How to get the words of clusters

How can I get the words of each cluster? I divided the documents into groups as follows:
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

all_content_train = []
j = 0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em, [j]))
    j += 1
print('Number of texts processed: ', j)

d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500, workers=7, dm=1, alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016)

kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels = kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get each text and its cluster, but how can I find the words that mainly formed those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster frequency is the largest multiple of the full-corpus frequency, as in the sketch below. Would this give helpful results on your data, for your needs? You'd have to try it.
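A rough sketch of that counting approach (the function and variable names are my own, assuming each document is a list of tokens and cluster_docs holds the documents of one cluster):

from collections import Counter

def top_overrepresented_words(cluster_docs, all_docs, n=10, min_count=5):
    # word frequencies over the whole corpus
    corpus_counts = Counter(w for doc in all_docs for w in doc)
    corpus_total = sum(corpus_counts.values())
    # word frequencies inside this cluster
    cluster_counts = Counter(w for doc in cluster_docs for w in doc)
    cluster_total = sum(cluster_counts.values())
    # ratio of in-cluster frequency to full-corpus frequency
    ratios = {w: (c / cluster_total) / (corpus_counts[w] / corpus_total)
              for w, c in cluster_counts.items()
              if corpus_counts[w] >= min_count}  # skip very rare words
    return sorted(ratios, key=ratios.get, reverse=True)[:n]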
Separately, regarding your use of Doc2Vec:
there's no good reason to alias the existing class TaggedDocument to a strange name like LabeledSentence1. Just use TaggedDocument directly.
if you supply your corpus, all_content_train, to the object initialization – as your code does – then you don't need to also call train(). Training will already have happened automatically. If you want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization.
the learning-rate values you've supplied to train() – start_alpha=0.002, end_alpha=-0.016 – are nonsensical and destructive. Few users should need to tinker with these alpha values at all; in particular, they should never increase from the beginning to the end of a training cycle, as these values do.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
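A minimal sketch of the pattern those points describe (reusing all_content_train from the question; the epochs value is just an example):

from gensim.models.doc2vec import Doc2Vec

# initializing with the corpus trains the model in one step:
# no separate train() call and no manual alpha scheduling needed
d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10,
                    min_count=500, workers=7, dm=1, epochs=20)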
